Ambiguities in Xtext grammars – part 1
In this blog in two instalments, I’ll discuss a few common sources of ambiguity in Xtext grammars in the hopes that it will allow the reader to recognise and fix these situations when they arise. This instalment constitutes the theoretical bit, while the next one will discuss a concrete example.
By default (at least: the default for the Itemis distro) Xtext relies on ANTLR to produce a so-called LL(k) parser, where LL(k) stands for “Left-to-right Leftmost-derivation with lookahead k“, where k is a positive integer or * = ∞ – see the Wikipedia article for more information (with a definite “academic” feel, so be warned). This means that grammars which are not LL(k) yield ANTLR errors during the Xtext generation stating that certain alternatives of a decision have been switched off. These are actual errors: the generated parser quite probably does not accept the full language as implied by the grammar (disregarding the fact that it’s not actually LL(k)) because the parser doesn’t follow certain decision paths to try and parse the input. We say that this grammar is ambiguous.
Xtext and “LL(1) conflicts”
Remember that first of all that the parser tokenizes its input into a linear stream of tokens (by default: keywords, IDs, STRINGs and all kinds of white space and comments) before it parses it into an EMF model (in the case of Xtext, but into an AST for general parsers). The problem is that an Xtext grammar specifies more than just the tokenizing and parsing behavior: it also specifies cross-references whose syntax often introduce an ambiguity on the parsing level by consuming the same token (ID, by default). The second parsing phase (still completely generated by ANTLR from an ANTLR grammar) only uses information on the token type, but doesn’t use additional information, such as a symbol table it may have built up. Such a strategy wouldn’t work with forward references anyway as the parser is essentially one-pass: references are resolved lazily only after parsing. This means that we have to beware especially of language constructs which start off by consuming the same token types (such as IDs) left-to-right but whose following syntax are totally different. This is a typical example of the FIRST/FIRST conflict type for a grammar; see also the Wikipedia article. The other non-recursive conflict type is FIRST/FOLLOW and is a tad more subtle, but it can be dealt with in the same way as the FIRST/FIRST conflict: by left-factorization.
A grammar is left-recursive if (and only if) it contains a parser rule which can recursively call itself without first consuming a token. A left-recursive grammar is incompatible with LL(k) tech and Xtext or ANTLR will warn you about your grammar being left-recursive: Xtext detects left-recursion at the parser rule level while ANTLR detects left-recursion at the token level. Expression languages provide excellent examples of left-recursive grammars when trying to implement them in Xtext naively. For expression languages there’s a special pattern (ultimately also based on left-factorization) to deal both with the left-recursion, the precedence levels and creating a useable expression tree at the same time: see Sven’s oft-referenced blog and two of my blogs.
There are several ways to deal with ambiguities, one of which is enabling backtracking in the ANTLR parser generator. To understand what backtracking does, have a look at the documentation: essentially, it introduces a recovery strategy to try other alternatives from a parser rule, even if the input already matched with one alternative based on the specified lookahead. (See also this blog for some examples of the backtracking semantics and the difference with the grammar being LL(*).) I’m not enamored of backtracking because ANTLR analysis doesn’t report any errors anymore during analysis, so while it may resolve some ambiguities, it will not warn you about other/new ones. (It also tends to cause a bit of a performance hit, unless memoization is switched on using the memoize option.) In case you do really need backtracking, you should have a good language testing strategy in place with both positive and negative tests to check whether the parser accepts the intended language.
It’s my experience that very few DSLs actually require backtracking. In fact, if your DSL does really need it, chances are that you’re actually implementing something of a GPL which you should think about twice anyway. A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics, e.g. expression grouping and type casting in most of the C-derived languages. Using different delimiters is an obvious strategy, but you might as well think hard about why you actually need to push something as unsafe as type casting on your DSL users.
Configuring the lookahead
To manually configure the lookahead used in the ANTLR grammar generated Xtext fragment (instead of relying on ANTLR’s defaults), you’ll have to do a bit of hacking: you have to create a suitable custom implementation of such a fragment, because MWE2 doesn’t have syntax for integer literals or revert to using MWE(1) to configure that. I’ll present a good illustration of this in the worked example in the next instalment which contains nested type specifications (or “path expressions”, as I called them in an earlier blog) which can have an arbitrary nesting depth. Using left-factorization we can rewrite the grammar to be LL(1), at the cost of some extra indirection structure in the meta model and some extra effort in implementing scoping and validation.
The Dangling Else-problem
A(nother) common source of ambiguity is known as The Dangling Else-problem (see the article for a definition) which is a “true” ambiguity in the sense that it doesn’t fall in one of the LL(1) conflicts categories described above. The only way to deal with that type of ambiguity in Xtext 1.0.x is to have a language (unit) test to check whether the dangling else ends up in the correct place – “usually”, that’s as else-part for the innermost if. Note that Xtext 2.0 has (some) support for syntactic predicates which allow you to deal with this declaratively in the grammar.
Next time, a concrete, worked example!