Using syntactic predicates in Xtext, part 1
Xtext 2.x comes with the possibility to define syntactic predicates in the grammar. But what exactly are these syntactic predicates and how can they be used to avoid or resolve ambiguities in your grammar? The reference documentation is characteristically succinct on the subject. This might mean that it’s either very simple or very complicated 😉
In short: syntactic predicates provide a way to force the parser to make certain choices by annotating the grammar using a ‘=>‘.
Fortunately, it’s actually quite simple but you have to dive a little deeper into the parsing technology used by Xtext to really understand it. Xtext uses ANTLR* ‘under the hood’ to generate the lexer and recursive-descent parser. To leverage ANTLR, Xtext generates an** ANTLR grammar from an Xtext one. As such, it is ANTLR that does most of the heavy lifting while the Xtext runtime sort-of piggybacks on the ‘stuff’ ANTLR generates to build a full model from the parsed text and provide the functionality that ANTLR doesn’t.
During the generation of lexer and parser, ANTLR performs a thorough analysis of the grammar generated by Xtext to check for non-LL(*) behavior (i.e., left-recursion) and nondeterminism (i.e., ambiguities) in the grammar. The former it deals with by reporting an error “[fatal] rule xxx has non-LL(*) decision due to recursive rule invocations reachable from alts n, m, …. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.” for every left-recursive rule and quitting the process, leaving you with a broken Xtext project. Left-recursion usually originates from trying to implement an expression language along the lines of
Expression: Literal | '(' Expression ')' | left=Expression op=('+'|'-'|'*'|'/') right=Expression
There’s a string of material (see here, here and here) detailing the ‘right’ patterns for writing such languages in a non-left-recursive manner in Xtext which also takes care of precedence and associativity. Since those patterns don’t use syntactic predicates (well, they can but it’s not essential), I won’t talk about these any more here.
Switching on backtracking should really be the very last option you try, as it doesn’t guarantee to solve the problem your grammar has but it does guarantee to obscure any problem, simply by not reporting any, even the ones that are easy to fix. Furthermore, backtracking ‘merely’ tries all the possible options, picking the first one that works: in essence it’s a ‘precognitive’ syntactic predicate, but at the expense of time and memory. If we can tweak our grammar with syntactic predicates so that no backtracking is required, we get a parser that performs better and more predictable if only because we’ve documented part of its behavior in the grammar.
The perfunctory example: the dangling else-problem
The most well-known application of syntactic predicates is also the simplest. Consider this grammar (header stuff omitted):
Model: statement+=IfStatement*; IfStatement: 'if' condition=Expression 'then' then=Expression ('else' else=Expression)?; Expression: IfStatement | {ValueReference} name=ID;
When having Xtext generate the language infrastructure for this grammar, you’ll get a warning from ANTLR saying “Decision can match input such as “‘else'” using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input“. This means that there are is an ambiguity in the grammar. ANTLR detects this and makes a choice for you, because otherwise it would have to return a forest of parse trees instead of just one per parse, or roll a dice to cope with the nondeterminism. We’ll see in a minute that a syntactic predicate allows you to choose yourself, instead of having to rely on ANTLR to pick the right choice – with the chance of your luck running out.
Of course, we already were expecting this behavior, so let’s fire up ANTLRWorks on the InternalMyDsl.g file in the principal/non-UI Xtext project (easily findable using the Ctrl/-Shift-R shortcut) to see how we might use that in general. First, ask ANTLRWorks to perform the same analysis ANTLR itself does during parser generation through Ctrl/-R. Then, click the ruleIfStatement (conveniently marked in red) to see the Syntax Diagram for it. This will look like this:
Since ANTLR already reported to only use alternative 1, this is the way that the if-statement will be parsed: the optional else-part will be matched as part of the current invocation of the IfStatement rule. For the canonical example input “if a then if b then c else d”, it means that the parse will be equivalent to “if a then (if b then c else d)”, i.e. the else-part belongs to the second, inner if-statement and not the first, outer if-statement. This result is what we usually would want since it complies with most existing languages and also because the else-part is visually closer to the inner if so it’s more natural that it binds to that instead of the outer if.
By unchecking alternative 1 and checking alternative 2, we get the following:
Now, these ‘faulty’ diagrams in ANTLRWorks are usually a bit funky to interpret because the arrows don’t really seem to start/end in logical places. In this case, we should read this as: the optional else-part can also be matched as part of the invocation of the IfStatement rule invoking the IfStatement rule for a second time – it’s probably convenient to think of the outer, respectively, inner invocation. For our ubiquitous example input, it would mean that the parse is equivalent to “if a then (if b then c) else d” – with the else-part belonging to the first, outer if-statement and not the inner if-statement.
Note that it’s a bit hard to implement a recursive-descent parser with this behavior, since the execution of the inner IfStatement rule should somehow decide to leave the matching and consumption of the following ‘else‘ keyword to the execution of an (not necessarily the direct caller rule!) outer IfStatement rule. ANTLR tends to favor direct matching and consuming tokens as soon as possible, by the currently-called parser rule, over a more circuitous approach.
You can influence the alternatives-picking behavior by placing syntactic predicates in the Xtext grammar. One advantage is that make the choice explicit in your grammar, which both serves to document it as well to eradicate the corresponding warning(s). Another advantage might be is that you make a different choice from the one ANTLR would make: in fact, you can ‘trigger’ a syntactic predicate not only from a single token but also from a series of tokens – more on that in a next blog. Note that syntactic predicates favor the quickest match as well – by design.
Syntactic predicates in an Xtext grammar consist of a ‘=>‘ keyword in front of a keyword, rule call, assignment (i.e., an assigned rule call) or a grouped parse expression (including any cardinality). So, in our case the IfStatement rule becomes:
IfStatement: 'if' condition=Expression 'then' then=Expression (=>'else' else=Expression)?;
The ‘=>‘ now forces ANTLR to not consider the second alternative path at all and always match and directly consume an ‘else‘ and an ensuing Expression, which happens to match the situation without a syntactic predicate – but now this behavior is clearly intentional and not a happenstance.
Since this blog already runs to some length, I’m deferring some more examples, insights and hints & tips to a next blog. One of the examples will revolve around some common GPL-like language features which can be difficult to implement without syntactic predicates but are blissfully uncomplicated with them.
*) Not entirely by default, but it’s thoroughly recommended: see this explanation for more details on that matter.
**) Actually, Xtext generates two ANTLR grammars: one for full parsing, and one which extracts just enough information to provide the content assist functionality with. They’re essentially the same as far as the pure ANTLR part is concerned.
Excellent. Looking forward to part 2.
Looking forward to the second part as well! 🙂
speaking of ambiguities: reading “For the canonical example input “if a then if b then c else d”, it means that the parse will be equivalent to “if a then (if b then c else d)”, i.e. the else-part belongs to the second, inner if-statement and not the first, outer if-statement – which is usually what we want.” I had parsed it as “we usually want the else part to belong to the outer if-statement” which kind of confused me 😉
Fixed! Thanks for pointing it out 🙂
I didn’t mean that it required modification, but I admit that as you modified it sounds really great! 🙂