Rambles around computer science

Diverting trains of thought, wasting precious time

Sun, 05 Apr 2009

Algebraic data types and typecasing

Yes, that's typecasing (sic) in the title.

People often say that functional programming languages are great for writing compilers. (Hardly a coincidence, I always remark, since the people who design them tend to be compiler guys of one sort or another.) One of their strengths is that they provide algebraic data types, and have neat syntax for pattern-matching. An ADT is just a possibly-recursive data structure featuring one or more unions (discriminated, naturally).

Relative to functional languages, pattern-matching in conventional languages is tedious---we have to use switch, which in C-like languages can only dispatch on integers, and we can only test one value at a time. In Java and the like it's even worse, because we don't have explicit unions---we only have subclassing. It's not hard to see how we can use subclassing to get the effect of discriminated unions. But then suppose we have some ADT Expr, in the form of a set of subclasses of class Expr, and want to pattern-match against the constructor (subclass) that our value is an instance of. How do we do this test? Well, as my wording gave away, we use instanceof. But wait a minute---isn't that bad form? Weren't we taught that we should be using virtual functions instead of a series of instanceof tests (which is what I mean by typecasing)? How can those functional programming guys get away with saying that algebraic data types and pattern matching are so great, if they amount to typecasing?
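To make that concrete, here's a toy sketch of what I mean (all the class and method names are my own invention): an Expr ADT emulated with Java subclasses, then "pattern-matched" with a chain of instanceof tests.

```java
public class AdtDemo {
    // The "union": an abstract supertype with one subclass per arm.
    static abstract class Expr {}
    static final class Lit extends Expr {
        final int value;
        Lit(int v) { value = v; }
    }
    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr l, Expr r) { left = l; right = r; }
    }

    // Typecasing: one function per algorithm, one instanceof test per arm.
    static int eval(Expr e) {
        if (e instanceof Lit) {
            return ((Lit) e).value;
        }
        if (e instanceof Add) {
            Add a = (Add) e;
            return eval(a.left) + eval(a.right);
        }
        throw new IllegalArgumentException("unknown Expr arm: " + e);
    }

    public static void main(String[] args) {
        // (2 + (3 + 4)) evaluates to 9
        System.out.println(eval(new Add(new Lit(2), new Add(new Lit(3), new Lit(4)))));
    }
}
```

Note that if someone adds a new subclass of Expr, eval silently falls through to the exception---which is precisely the "closed system" assumption discussed below.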

The answer is that typecasing isn't always bad---it depends on whether you're building a system that is open or closed. ADTs are “closed” in the sense that implicitly, adding another arm (constructor) to the union is very likely going to mean revisiting all the functions that know about your type, and probably adding another pattern-handler. Of course, the default case (the “_” pattern) might do the right thing, but in most cases it won't. If your functions are inherently tightly coupled to the complete ADT definition, then you have a closed system, and typecasing is okay. This is common where you want to perform some complex algorithm over a well-defined input domain---which is exactly the case for compilers, the classic application of functional languages.

Meanwhile, in OO-land, we usually push the functions into the definition of each arm. That way, our set of arms is “open”---we can add a new one, together with its own definitions of the various functions it provides (as overrides), and any algorithm that walks the structure and invokes the functions will work without change. On the other hand, we've moved the coupling to a different place: we've closed the set of functions defined on the data structure. Our walker sees a common interface, the supertype of the ADT, and can only invoke the functions that are defined by it. This is common if you want to perform simple algorithms over some heterogeneous, evolving input domain---which is exactly the case for most business applications (domain modelling, process automation, resource management, and so on), the classic application of imperative languages.
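For contrast, here's the same toy Expr reworked in the OO style (again, names are my own): eval becomes a virtual method on each arm, so a new arm slots in without touching existing code, but the walker can only ever call what the Expr supertype declares.

```java
public class OoDemo {
    // The set of functions is now fixed by the supertype...
    static abstract class Expr { abstract int eval(); }

    // ...while the set of arms stays open.
    static final class Lit extends Expr {
        final int value;
        Lit(int v) { value = v; }
        int eval() { return value; }
    }
    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr l, Expr r) { left = l; right = r; }
        int eval() { return left.eval() + right.eval(); }
    }
    // A brand-new arm: no change needed to Lit, Add, or any walker.
    static final class Neg extends Expr {
        final Expr inner;
        Neg(Expr e) { inner = e; }
        int eval() { return -inner.eval(); }
    }

    public static void main(String[] args) {
        // 2 + (-5) evaluates to -3
        System.out.println(new Add(new Lit(2), new Neg(new Lit(5))).eval());
    }
}
```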

The conclusion, if there is one, is simply that different problem domains are best tackled using different approaches. There's no reason why both functional and OO languages can't support both approaches equally well, but owing to their originating culture and target application classes, their language features have evolved to offer stronger support for one or the other. Imperative languages could definitely do better in their support for rich expression of algorithms, and languages like Python are leading the way---although much as I like Python, I'd be more of a fan if it provided type-checking and compilation to fast code. Meanwhile, functional languages make a lot of “closed”-style assumptions---the very idea that programs are functions, or Turing machines, is only true of a closed system (since a chunk of a Turing machine, communicating with an unspecified outside world, is a computationally different object, sometimes called an interaction machine; the debate rages on). I've yet to really “grok” (as they say) what's good and bad about functional languages in this regard, and particularly their handling of I/O---the monadic approach seems to be just a way of shoehorning explicit sequencing into a Haskell-like language semantics, which is hardly revolutionary, but maybe there's something else to it.

As I've been finding in my recent experiments with antlr, a tool deriving very much from the imperative and, by extension, OO schools, a lot could be done (but isn't) to generate OO syntax trees that are amenable to typecasing. It'd be a start if it actually generated type definitions (I mean classes) for each kind of node, so that I could use names rather than indices to access the child nodes. The best ANTLR does, as far as I've found, is something called “heterogeneous nodes”, which lets you specify, in the grammar definition, the type used for each node of the AST on a per-rule basis. You can optionally specify constructor arguments too. But this only seems to work for grammar rules using tree construction operators (rather than those falling back on the default behaviour of passing through their children in a list), which necessitates some extra grammar tweaking and boilerplate (usually adding more “imaginary tokens”). Even worse, the killer is that you have to define the types yourself! It wouldn't be hard to generate them automatically, especially if the grammar author was forced to provide a name for every top-level alternative in a rule. Tree rewrite rules already look so much like constructor invocations that we really shouldn't have to specify this separately.
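Just to illustrate what names-rather-than-indices would buy, here's a hand-written sketch of the sort of node class I'd like to see generated (IfNode and its field names are invented by me, not anything ANTLR produces; ANTLR's generic trees instead hand you children by position, via getChild).

```java
public class NodeDemo {
    static abstract class AstNode {}

    static final class Ident extends AstNode {
        final String name;
        Ident(String n) { name = n; }
    }

    // One class per node kind: children are named, typed fields
    // rather than positions in a generic child list.
    static final class IfNode extends AstNode {
        final AstNode condition, thenBranch, elseBranch;
        IfNode(AstNode c, AstNode t, AstNode e) {
            condition = c; thenBranch = t; elseBranch = e;
        }
    }

    public static void main(String[] args) {
        IfNode n = new IfNode(new Ident("p"), new Ident("x"), new Ident("y"));
        // With a generic tree this access would be by index, something like
        // tree.getChild(0) -- and nothing stops you asking for child 3.
        System.out.println(((Ident) n.condition).name); // prints p
    }
}
```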

Somewhat separately, there's an interesting debate about how tree-walking should be done. Visitors are one well-understood option, and despite being limited (because when the action is fired, you don't get any contextual information about the node being visited), they're still handy for certain tasks, so it'd be nice if antlr generated them. Meanwhile, Terence Parr says that tree grammars are the best approach. You get to match patterns on your AST and fire actions accordingly; since these patterns can span more than one node they're more precise than visitors, and can give you exactly the contextual information you need. On the other hand, Andy Tripp argues that manual tree walkers are better because they don't entail yet another language, and yet another set of snippets of your source language embedded into a grammar. (The latter is a practice I too find distasteful, and I've already gone to some length to avoid it both in my use of antlr and my rejection of other parser generators, as I ranted about in an earlier post.) Anyway, in this particular debate, as usual, I can see both arguments and it really depends what you're trying to do. Parr seems to be talking about the specific case of writing a source-to-source translator, whereas Tripp is talking about a more general case.
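For what it's worth, here's roughly what a generated visitor over my earlier toy Expr type could look like (all names mine). Notice how each visit method sees exactly one node and nothing of its surroundings---the contextual limitation I mentioned above.

```java
public class VisitorDemo {
    interface Visitor<R> {
        R visitLit(Lit l);
        R visitAdd(Add a);
    }

    static abstract class Expr { abstract <R> R accept(Visitor<R> v); }
    static final class Lit extends Expr {
        final int value;
        Lit(int v) { value = v; }
        <R> R accept(Visitor<R> v) { return v.visitLit(this); }
    }
    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr l, Expr r) { left = l; right = r; }
        <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
    }

    // One algorithm packaged as a visitor: pretty-printing.
    static final class Printer implements Visitor<String> {
        public String visitLit(Lit l) { return Integer.toString(l.value); }
        public String visitAdd(Add a) {
            return "(" + a.left.accept(this) + " + " + a.right.accept(this) + ")";
        }
    }

    public static void main(String[] args) {
        Expr e = new Add(new Lit(1), new Add(new Lit(2), new Lit(3)));
        System.out.println(e.accept(new Printer())); // prints (1 + (2 + 3))
    }
}
```

A tree-grammar rule, by contrast, could match the whole Add-of-Add shape in one pattern---which is Parr's point about precision.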

Now that I've just wasted half an hour writing this post, it's back to some real work. I mean lunch.
