Rambles around computer science

Diverting trains of thought, wasting precious time

Fri, 13 Mar 2009

Parser generators again---a rant

It's a sad but inescapable fact that people who design programming tools don't always get it right. You'd think that people who design programming tools for people who build programming tools would be among the better such people, and would therefore be more likely to get their tools right more of the time. So I'm wondering why, months after my first foray into this area, I still haven't found a good parser generator.

Parsers are simple to understand conceptually. They turn a token stream---a list-structured thing---into a syntax tree, which is unsurprisingly a tree-structured thing. They are undoubtedly tricky to implement efficiently, in the sense of generating efficient code. These days that's little excuse for the extreme leakiness of abstraction they provide---I'm talking about things like left-factoring, different techniques for operator precedence and associativity, poor debugging messages, and so on, which are well-known weaknesses of parser generators. On the other hand, these leaks are still manageable in number, and there can only be so many tricks one needs to master in order to be able to describe a given grammar in a way that's acceptable to a given parser generator.

It took me a while to realise that traditional tools like yacc are not actually parser generators. Rather, they are recogniser generators: they generate code which recognises an input language. If you want to build a tree out of the input, you have to do it yourself in the semantic actions tacked on to the grammar. This is why yacc is coupled to the C language---if you want to do anything with the input, even just building a tree out of it, you have to embed snippets of C into your grammar. Similar tools have similar problems---ocamlyacc for example. Even more infuriatingly, the classical teaching example of building a calculator directly in yacc is exactly what you don't want---how often in practice is your interpreter going to be so simple that you want to embed it into the semantic actions of your parser? Please, just show how to make the parser build the tree, then walk the tree in a separate bit of code!

(The use of semantic actions has the obvious downside that you can't use the tool unless you know the language it embeds. There's also a more taken-for-granted downside to this practice of language-embedding---the expectation that the generated code will only be conveniently accessible from clients written in that same language, so if you want to generate a parser for use in your Java program, say, you need to use a parser generator that's “for Java”. This needn't actually be so---I look forward to the day when I can use ocamlyacc to generate a parser for use within a Java program, but I wouldn't like to attempt that right now. On the other hand, perhaps in the .NET world F♯ and C♯ can do this already....)

Now, it's quite reasonable to want to describe your language in such a fashion as can be used to generate parsers in a variety of host languages. Since most formal languages are context-free, there's nothing about turning token streams into syntax trees that requires any Turing-complete language. In other words we needn't be polluting our grammars with snippets of C or any other language. The domain of context-free parsers is already very constrained, and moreover, every context-free grammar already describes one way of turning a conforming token stream into a tree. That tree is the parse tree, and contains one parent node for each production invocation that was made during the recognition process, and one leaf node for every token that was matched.

Usually the parse tree contains more detail than the language implementor needs. Many of the twists and turns made during recognition are artifacts of the grammar's internal structure---recall that there are infinitely many CFGs for a given context-free language. We can therefore throw away a lot of detail in the parse tree, ending up with a more concise tree, that is, the abstract syntax tree. By controlling what we throw away to get this tree, and exploiting what we know about the language's semantics while we do so, we can end up with a much simpler tree. For example, if we have a list separator, say the comma, then we can represent sequences of comma-separated elements as sibling nodes under a variadic parent. By contrast, the parse tree would most likely have represented the list as a descending structure of binary nodes, given the comma would most likely be defined by a recursive rule capturing one concrete token (the leaf) and, recursively, a shorter sublist (the subtree). (Of course there would also be a termination case, of the empty or singleton list, and the recursive rule could be either left- or right-recursive.) Anyway, in our grammar we'd like a way of specifying the transformation from this more concrete, complex representation to the simpler abstract one.

Since these kinds of transformations are inherently dependent on the semantics of the language being described, one could be fooled into believing that we must fall back on semantic actions for describing them---and indeed that's the approach of yacc-like parser generators. But there's absolutely no reason why these actions need to be expressed in a Turing-complete language, when a very simple language of tree transformations would suffice, and again avoid coupling our grammar to a particular host programming language. ANTLR provides such a language, and for me this is one of its main attractions---together with the its multiple back-end host languages. It's not perfect---in practice, certain semantic actions are still necessary, for example in the lexer to skip whitespace tokens. More annoyingly, it's customary to specify the host language in the header of the grammar file rather than separating it out into, say, a command-line argument. ANTLR is also annoying in generating only recursive-descent parsers and therefore requiring ugly expressions of left-recursive syntaxes and operator precedence. However, it's getting closer to a sensible design... I just have no idea why the truly sensible ways of doing this haven't been standard practice for decades.

[/research] permanent link contact

validate this page