Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 22 Aug 2011

Pipelines (are lazy functional composition with recombination)

Unix pipelines are often held up as a paragon of compositional virtue. It's amazing how much they can do, how efficiently they do it, and how (relatively) simple they are to use.

Programming languages people especially like this sort of thing, and functional programming clearly has some synergy with pipelines, which are, after all, a kind of functional composition. Microsoft's F# language even uses the pipe symbol in its forward-pipe operator |>, which feeds a value through a function, shell-style.

But pipes are not just compositions of functions mapped over lists. They have two other properties that are critical. The first is well-known, and the second less so.

The first is laziness. Since each process in the pipeline pulls its input on demand, pipelines can be used on infinite streams (or prefixes thereof) just the same as on finite data.
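
To make the analogy concrete, here is a minimal Haskell sketch (the function name is mine, not any library's): the composed stages run over an infinite stream, and laziness means only the demanded prefix is ever computed, much as yes hello | head -n 3 terminates in the shell.

    -- A lazily-composed "pipeline" over an infinite stream: take demands
    -- only three elements, so map and repeat do only that much work.
    -- (Roughly the shell's `yes hello | head -n 3`.)
    firstThree :: [String]
    firstThree = take 3 (map (++ "!") (repeat "hello"))
    -- ["hello!","hello!","hello!"]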

The second is what I call recombination. Each stage in the pipeline can have radically different ideas about the structure of the data. For example, pipelines typically combine characterwise (tr), linewise (sed) and whole-file (sort) operations in one go.
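
A Haskell analogue (a sketch only, with String standing in for the Unix byte stream) shows why this works: because every stage shares that one representation, the three granularities compose directly.

    import Data.Char (toLower)
    import Data.List (sort)

    -- Mixing three granularities over one shared representation:
    -- character-wise (toLower, like tr), line-wise (filter, like sed's
    -- line deletions) and whole-stream (sort), all in one composition.
    pipeline :: String -> String
    pipeline = unlines . sort . filter (not . null) . lines . map toLower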

This is much harder to achieve in a functional setting, because you don't have access to a shared underlying representation (cf. character streams on Unix): if your stage of the pipeline understands the data differently from the previous one, you have to map one abstraction onto the other by writing a function.
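
For instance, in a hypothetical sketch: if one stage produces typed records and the next expects delimited lines, somebody has to write the adapter by hand.

    -- Hypothetical types for two adjacent stages that disagree about the
    -- structure of the data: one record-shaped, the other line-shaped.
    data Person = Person { name :: String, age :: Int }

    -- The adapter you are forced to write: render each record as a
    -- colon-separated line for the downstream stage to consume.
    toLine :: Person -> String
    toLine p = name p ++ ":" ++ show (age p)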

Meanwhile, the shell programmer has a lower-level set of tools: familiar recipes for idioms such as line-breaking and field separation, which recur across many different kinds of abstract data but can be handled with the same tricks in each case.
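
Those recipes have direct functional counterparts too; here is a sketch of field selection in the style of awk '{print $2}' (names mine).

    -- Line-breaking and field separation as reusable tricks: split into
    -- lines, split each line into whitespace-separated fields, pick one.
    secondFields :: String -> [String]
    secondFields = map pick . lines
      where pick l = case drop 1 (words l) of
              (f:_) -> f
              []    -> ""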

The upside of the abstract, functional approach is that you preserve the meaning of your data, even in the face of some heavyweight transformations. Shell programmers, on the other hand, are often driven to frustration when their separator-munging stops working properly once things get complex.

The downside is that it's more work, and more repeated work, because in the worst case you face the quadratic problem of mapping each of n abstractions onto the other n-1. By contrast, simple pipelines can often be hacked together quickly using a few standard idioms. They can even be written to be robust to changes in the underlying abstraction (such as the number of fields), although this takes some extra care.
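
One such careful trick, again sketched in Haskell: select the last field rather than a numbered one, so that adding fields doesn't break the stage (compare awk '{print $NF}').

    -- Robust to changes in the number of fields: take the last
    -- whitespace-separated field on each line.
    lastFields :: String -> [String]
    lastFields = map pick . lines
      where pick l = case words l of
              [] -> ""
              fs -> last fs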

A middle ground might be to characterise those idioms and formalise them as a functional abstraction, into which many higher-level abstractions could be serialised and deserialised, but without going all the way down to bytes. Perhaps such a system already exists... it sounds a bit like a reflective metamodel of functional records or some such. Hmm... perhaps Lispers have been doing this for decades?
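
For what it's worth, one hypothetical shape for that abstraction, sketched in Haskell (the class and function names are mine): types that serialise to and from a common line-and-fields representation, over which generic stages can be recombined.

    import Data.Maybe (mapMaybe)

    -- A hypothetical middle ground: abstractions that round-trip through
    -- a shared fields representation, stopping short of raw bytes.
    class Tabular a where
      toFields   :: a -> [String]
      fromFields :: [String] -> Maybe a

    -- Generic stages work on [[String]] and recombine freely; each end
    -- of the pipeline keeps its own typed abstraction.
    viaFields :: (Tabular a, Tabular b)
              => ([[String]] -> [[String]]) -> [a] -> [b]
    viaFields stage = mapMaybe fromFields . stage . map toFields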
