Archive
Ambiguities in Xtext grammars – part 2
In this continuation of the previous instalment, we’re going to take an ambiguous grammar and resolve its ambiguity.
As an example, consider the situation where we have an (arguably slightly stupid) language involving expressions and statements, two of which are variable declaration and assignment. (Let's assume that all other statements start off by consuming an appropriate keyword token.) So, the following is valid, Java-like syntax (SomeClass is the identifier of a class thingy defined elsewhere):
SomeClass.SomeInnerClass localVar := ...
localVar.intField := 42
Now, let’s implement a “naive” Xtext grammar fragment for this:
Variable: name=ID;

Statement: VariableDeclaration | Assignment;

VariableDeclaration: typeRef=ClassRef variable=Variable (':=' value=Expression)?;

ClassRef: type=[Class] tail=FeatureRefTail?;

Assignment: lhs=AssignableSite ':=' value=Expression;

AssignableSite: var=[VariableDeclaration] tail=FeatureRefTail?;

FeatureRefTail: '.' feature=[Feature] tail=FeatureRefTail?;
Here, Class and Feature are quite standard types that both have String-valued 'name' features and have corresponding syntax elsewhere. Expression references an expression sub-language which is at least able to do integer literals. Note that a Variable is contained by a VariableDeclaration, so you can refer to a variable without needing to refer to its declaration. (You can find this grammar on GitHub.)
Now, let’s run this through the Xtext generator:
error(211): ../nl.dslmeinte.xtext.ambiguity/src-gen/nl/dslmeinte/xtext/ambiguity/parser/antlr/internal/InternalMyPL.g:415:1: [fatal] rule ruleStatement has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(211): ../nl.dslmeinte.xtext.ambiguity.ui/src-gen/nl/dslmeinte/xtext/ambiguity/ui/contentassist/antlr/internal/InternalMyPL.g:472:1: [fatal] rule rule__Statement__Alternatives has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
Even though Xtext itself doesn't warn us about any problem upfront, ANTLR spits out two errors back at us and flat-out refuses to generate a parser, after which the Xtext generation process crashes completely. The problem is best illustrated with the example DSL proza: its first line corresponds to a token stream ID-Keyword('.')-ID-ID-Keyword(':=')-… while the second line corresponds to a stream ID-Keyword('.')-ID-Keyword(':=')-INT(42). (Note that whitespace is usually irrelevant and therefore typically hidden, which is Xtext's default anyway.) Both lines start with consuming an ID token and, because of the k=1 lookahead, the parser doesn't stand a chance of distinguishing the variable declaration parser rule from the assignment one: only the fourth token reveals the distinction, ID vs. Keyword(':='). Note that since the nesting can be arbitrarily deep, any finite lookahead wouldn't suffice, meaning that we'd have to switch on backtracking – one could think of this as setting k=∞.
To recap the situation with the token streams in comments:
SomeClass.SomeInnerClass localVar := ...   // ID-Keyword('.')-ID-[WS]-ID-[WS]-Keyword(':=')-[WS]-...
localVar.intField := 42                    // ID-Keyword('.')-ID-[WS]-Keyword(':=')-INT(42)
So, how do we deal with this ambiguity? One answer is to left-factor(ize) the grammar – as is already suggested by the ANTLR output. The trade-off is that our grammar becomes more complicated and we might have to do some heavy lifting outside of the grammar. But that is only to be expected since the grammar deals first and foremost with the syntax – what Xtext provides extra has everything to do with inference of the Ecore meta model (to which the EMF models conform) and only marginally so with semantics, by means of the default behavior for lazily-resolved cross-references.
Analogous to the left-factorized pattern for expression grammars, we're going to implement the lookahead manually and rewrite nodes in the parse tree to have the appropriate type. First note that our statements always begin with an ID token, which equals either a variable name or a class name. After that, any number of Keyword('.')-ID sequences follow (we don't care about whitespace, comments and such for now) until we encounter either an ID-Keyword(':=') sequence or a Keyword(':=') token, in both cases followed by an expression of sorts.
So, the idea is to first parse the ID-(Keyword(‘.’)-ID)* token sequence (which we’ll call the head) and then rewrite the tree according to whether we encounter an ID or the Keyword(‘:=’) token first. In Xtext, there’s a distinction between parser and type rules but only type rules give us code completion through scoping out-of-the-box, so we would like to use a type rule for the head. The head starts with either a reference to a Class or to a VariableDeclaration. Unfortunately, we can’t distinguish between these two at parse level so we have to have a common super type:
HeadTarget: Class | Variable;
However, due to the way that Xtext tries to "lift" or automatically refactor identical features (having the same name, type, etc.), we need to introduce an additional type (that's used nowhere) to suppress the corresponding errors:
Named: Class | Variable | Attribute;
Now we can make the Head grammar rule, reusing the FeatureRefTail rule we already had:
Head: target=[HeadTarget] tail=FeatureRefTail?;
And finally, the new grammar rule to handle both Assignment and VariableDeclaration:
AssignmentOrVariableDeclaration:
    Head (
          ({VariableDeclaration.typeRef=current} variable=Variable (':=' value=Expression)?)
        | ({Assignment.lhs=current} ':=' value=Expression)
    );
This works as follows:
- Try to parse and construct a Head model element without actually creating a model element containing that Head;
- When the first step is successful, determine whether we’re in a variable declaration or an assignment by looking at the next tokens;
- Create a model element of the corresponding type and assign the Head instance to the right feature.
This is commonly referred to as "tree rewriting", but in the case of Xtext that's actually slightly misleading, as no trees are rewritten. (In fact, Xtext produces models, which are only trees as long as there are no unresolved references.)
To complete the example, we have to implement the scoping (which can also be found on GitHub). I’ve already covered that (with slightly different type names) in a previous blog post, but I will rephrase that here. Essentially, scoping separates into two parts:
- Determining the features of the type of a variable. This type is specified by the typeRef feature (of type Head) of a VariableDeclaration. This is actually a type system computation, as the Head instance in the VariableDeclaration should already be completely resolved.
- Determining the features of the previous element of a Head instance as possible values of the current FeatureRefTail.feature. For this we only want the “direct features” since we’re actively computing a scope.
(The scoping implementation uses a type SpecElement which is defined as a super type of Head and FeatureRefTail, but this is merely for convenience and type-safety of said implementation.)
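To give an impression of what the second part looks like, here is a minimal Xtend sketch of such a declarative scope method – the class name and the helper methods previousElement and directFeatures are hypothetical stand-ins; the actual implementation lives in the GitHub repository:

import org.eclipse.emf.ecore.EReference
import org.eclipse.xtext.scoping.IScope
import org.eclipse.xtext.scoping.Scopes
import org.eclipse.xtext.scoping.impl.AbstractDeclarativeScopeProvider

// FeatureRefTail is the EMF type generated from the grammar above
class MyPLScopeProvider extends AbstractDeclarativeScopeProvider {

    def IScope scope_FeatureRefTail_feature(FeatureRefTail tail, EReference ref) {
        // only the "direct features" of whatever the previous path element
        // resolves to are valid candidates for this cross-reference
        Scopes::scopeFor(tail.previousElement.directFeatures)
    }

}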
In conclusion, we've rewritten an ambiguous grammar as an unambiguous one, so we didn't need to use backtracking with all its associated disadvantages: worse performance, no more warnings from ANTLR about unreachable alternatives, "magic", etc. We also found that this didn't really complicate the grammar: it expresses intent and mechanism quite clearly and doesn't feel like a kludge.
Ambiguities in Xtext grammars – part 1
In this blog in two instalments, I’ll discuss a few common sources of ambiguity in Xtext grammars in the hopes that it will allow the reader to recognise and fix these situations when they arise. This instalment constitutes the theoretical bit, while the next one will discuss a concrete example.
By default (at least, the default for the Itemis distro) Xtext relies on ANTLR to produce a so-called LL(k) parser, where LL(k) stands for "Left-to-right, Leftmost-derivation with lookahead k", k being a positive integer or * = ∞ – see the Wikipedia article for more information (with a definite "academic" feel, so be warned). This means that grammars which are not LL(k) yield ANTLR errors during the Xtext generation stating that certain alternatives of a decision have been switched off. These are actual errors: the generated parser quite probably does not accept the full language implied by the grammar (disregarding the fact that it's not actually LL(k)) because the parser doesn't follow certain decision paths when trying to parse the input. We say that such a grammar is ambiguous.
Xtext and “LL(1) conflicts”
Remember first of all that the parser tokenizes its input into a linear stream of tokens (by default: keywords, IDs, STRINGs and all kinds of whitespace and comments) before parsing it into an EMF model (in the case of Xtext; into an AST for general parsers). The problem is that an Xtext grammar specifies more than just the tokenizing and parsing behavior: it also specifies cross-references, whose syntax often introduces an ambiguity on the parsing level by consuming the same token type (ID, by default). The second, parsing phase (still completely generated by ANTLR from an ANTLR grammar) only uses information on the token type, but doesn't use additional information, such as a symbol table it may have built up. Such a strategy wouldn't work with forward references anyway, as the parser is essentially one-pass: references are resolved lazily, only after parsing. This means that we especially have to beware of language constructs which start off by consuming the same token types (such as IDs) left-to-right but whose following syntax is totally different. This is a typical example of the FIRST/FIRST conflict type for a grammar; see also the Wikipedia article. The other non-recursive conflict type is FIRST/FOLLOW and is a tad more subtle, but it can be dealt with in the same way as the FIRST/FIRST conflict: by left-factorization.
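To make the FIRST/FIRST case concrete, here is a small made-up grammar fragment (all names hypothetical) in which both alternatives start by consuming an ID token for a cross-reference:

Statement:
    VariableDeclaration | Assignment;

VariableDeclaration:
    type=[Type] name=ID;                               // starts with an ID token (the reference to a Type)

Assignment:
    variable=[VariableDeclaration] ':=' value=INT;     // also starts with an ID token

With a single token of lookahead the parser cannot tell these alternatives apart; only what follows the first ID reveals which one applies.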
Left-recursion
A grammar is left-recursive if (and only if) it contains a parser rule which can recursively call itself without first consuming a token. A left-recursive grammar is incompatible with LL(k) technology, and Xtext or ANTLR will warn you about your grammar being left-recursive: Xtext detects left-recursion at the parser rule level, while ANTLR detects it at the token level. Expression languages provide excellent examples of left-recursive grammars when you try to implement them naively in Xtext. For expression languages there's a special pattern (ultimately also based on left-factorization) that deals with the left-recursion, the precedence levels and creating a usable expression tree all at the same time: see Sven's oft-referenced blog and two of my blogs.
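A minimal made-up example of such a rule – Expression can invoke itself without having consumed a single token first:

Expression:
      left=Expression op=('+'|'-') right=Expression
    | value=INT;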
Backtracking
There are several ways to deal with ambiguities, one of which is enabling backtracking in the ANTLR parser generator. To understand what backtracking does, have a look at the documentation: essentially, it introduces a recovery strategy to try other alternatives of a parser rule, even if the input already matched one alternative based on the specified lookahead. (See also this blog for some examples of the backtracking semantics and the difference with the grammar being LL(*).) I'm not enamored of backtracking because with it ANTLR no longer reports any errors during grammar analysis, so while it may resolve some ambiguities, it will not warn you about other/new ones. (It also tends to cause a bit of a performance hit, unless memoization is switched on using the memoize option.) In case you really do need backtracking, you should have a good language testing strategy in place, with both positive and negative tests, to check whether the parser accepts the intended language.
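For reference, enabling it happens in the MWE2 workflow; with the standard ANTLR generator fragment the configuration looks roughly like this (a sketch – verify the fragment name against your own workflow):

fragment = parser.antlr.XtextAntlrGeneratorFragment {
    options = {
        backtrack = true
        memoize = true
    }
}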
It’s my experience that very few DSLs actually require backtracking. In fact, if your DSL does really need it, chances are that you’re actually implementing something of a GPL which you should think about twice anyway. A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics, e.g. expression grouping and type casting in most of the C-derived languages. Using different delimiters is an obvious strategy, but you might as well think hard about why you actually need to push something as unsafe as type casting on your DSL users.
Configuring the lookahead
To manually configure the lookahead used in the generated ANTLR grammar (instead of relying on ANTLR's defaults), you'll have to do a bit of hacking: either create a suitable custom implementation of the corresponding Xtext generator fragment (because MWE2 doesn't have syntax for integer literals) or revert to using MWE(1) to configure it. I'll present a good illustration of this in the worked example in the next instalment, which contains nested type specifications (or "path expressions", as I called them in an earlier blog) that can have an arbitrary nesting depth. Using left-factorization we can rewrite the grammar to be LL(1), at the cost of some extra indirection structure in the meta model and some extra effort in implementing scoping and validation.
The Dangling Else-problem
A(nother) common source of ambiguity is known as The Dangling Else-problem (see the article for a definition), which is a "true" ambiguity in the sense that it doesn't fall into one of the LL(1) conflict categories described above. The only way to deal with this type of ambiguity in Xtext 1.0.x is to have a language (unit) test that checks whether the dangling else ends up in the correct place – usually, that's as the else-part of the innermost if. Note that Xtext 2.0 has (some) support for syntactic predicates, which allow you to deal with this declaratively in the grammar.
Next time, a concrete, worked example!
A (slightly) better switch statement in JavaScript
The switch statement in JavaScript suffers from the usual problem associated with C-style switch statements: fall-through. This means that each case guard needs to be expressly closed with a break statement to avoid falling through to the first executable code after it – no matter which case that code belongs to. Fall-through has been the source of very many bugs. Unfortunately, the static code analysis tools for JavaScript (JSLint, JSHint and Google's Closure Compiler) do not check for potential fall-through (yet?).
Today I thought I could improve the switch statement slightly with the following code pattern:
var result = (function(it) {
    switch (it) {
        case 'x': return 1;
        case 'y': return 2;
        /* ... */
        default: return 0;
    }
})(my_it);
The advantage of using return statements is two-fold:
- it exits the switch statement immediately,
- it usually comes right after the case guard, making visual inspection and verification much easier than hunting for a break either in- or outside of a nice pair of curly braces.
This approach also has a definite functional programming flavor, as we’ve effectively turned the switch statement into an expression, since the switch statement is executed as part of a function invocation.
Postscript
Yes, I do write JavaScript from time to time. I usually don’t like the experience very much, mostly because of inadequate tool support and the lack of static typing (and the combination thereof: e.g. the JS plug-ins for Eclipse often have a hard time making sense of the code at all). But we do what we can to get by 😉
Polymorphic dispatch in Xtend
Polymorphic dispatch (or multiple dispatch or multimethods, as it's also called – see http://en.wikipedia.org/wiki/Multiple_dispatch) is a programming language construct which chooses a code path based on runtime types instead of types that are inferred at compile time. The poor man's method of achieving such behavior would be to litter your code with passages like this:
if( x instanceof TypeA ) {
    (x as TypeA).exprA
} else if( x instanceof TypeB ) {
    (x as TypeB).exprB
} else ...
Xtend 2.x offers two much better constructs to do polymorphic dispatching:
- Through the use of the dispatch modifier for function defs – this construct is Xtend’s “official” polymorphic dispatch.
- Through the use of the switch statement and referring to types instead of cases.
These are better than the poor man's method because they are declarative, i.e. they express intent much more clearly and succinctly. Though both constructs have a lot in common, there are some marked differences and best practices for safeguarding type safety, which I'll discuss in this blog.
The example
Consider the following Xtend code – note that the syntax coloring is lacking a bit, since WordPress doesn't fully understand Xtend… yet. Also note that since Xtend 2.3 you can have more than one Xtend class in one file.
class CommonSuperType { ... }
class TypeA extends CommonSuperType { ... }
class TypeB extends CommonSuperType { ... }
class TypeC extends CommonSuperType { ... }
class UnrelatedType { ... }

class Handler {

    def dispatch foo(TypeA it) { it.exprA }
    def dispatch foo(TypeB it) { it.exprB }

    def bar(CommonSuperType it) {
        switch it {
            TypeA: it.exprA
            TypeB: it.exprB
        }
    }

}
For simplicity's sake, let's assume that exprA and exprB both return an int. The Xtend compiler generates two public methods in the Java class Handler – one for foo, one for bar. Both of these have the same signature: int f(CommonSuperType), where f = foo or bar. In addition, for each foo dispatch function, Xtend generates a public method with signature int _foo(t), where t = TypeA or TypeB – note the prefixed underscore. The actual polymorphic dispatch then happens in the "combined" foo(CommonSuperType) method, actually through the previously demonstrated poor man's method.
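To make this concrete, the "combined" method generated for foo looks roughly like the following hand-written approximation (not the literal generated source):

// approximation of the dispatcher generated into the Java class Handler
public int foo(final CommonSuperType it) {
    if (it instanceof TypeA) {
        return _foo((TypeA) it);    // generated from 'def dispatch foo(TypeA it)'
    } else if (it instanceof TypeB) {
        return _foo((TypeB) it);    // generated from 'def dispatch foo(TypeB it)'
    } else {
        throw new IllegalArgumentException("Unhandled parameter types: " + java.util.Arrays.<Object>asList(it));
    }
}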
By the way: a user-friendly way to inspect the “combined” method is the Outline which will group the dispatch functions belonging together under the combined signature.
Note that the foo and bar methods end up with CommonSuperType as the type of their parameter. This is because CommonSuperType is the most specific common super type of TypeA and TypeB – deftly implied by the name – and Xtend infers that as the parameter's type for the "combined" method. In general, Xtend will compute the most specific common super type across all dispatch functions, on a per-argument basis. In case of the bar method we declared the parameter type ourselves.
As demonstration, add the following code to the Handler class and see what happens:
def dispatch foo(TypeC it) { it.exprC }
(Assume that exprC again returns an int.)
The generated foo and bar methods are functionally nearly identical, the difference being that foo explicitly throws an IllegalArgumentException mentioning the unhandled parameter type(s) in its message, in case you call it with something that is a CommonSuperType but neither a TypeA nor a TypeB. The bar method does no such thing and simply falls through the switch, returning the appropriate default value: typically null, but 0 in our int case. To remedy that, you'll have to add a default case which throws a similar exception, like so:
def bar(CommonSuperType it) {
    switch it {
        TypeA: it.exprA
        TypeB: it.exprB
        default:
            throw new IllegalArgumentException("don't know how to handle sub type " + it.^class.simpleName)
    }
}
In case you already have a sensible default case, you're basically out of luck.
Potential mistakes
Both approaches have their respective (dis-)advantages which I’ll list comprehensively below. In both cases, though, it’s relatively easy to make programmers’ mistakes. The most common and obvious ones are:
- The parameter type of the "combined" foo method is inferred, so if you add a dispatch function having a parameter type which does not extend CommonSuperType, then the foo method will wind up with a more general parameter type – potentially Object. This means that the foo method will accept a lot more types than usually intended, failing miserably (through a thrown IllegalArgumentException) on most of them. This is especially dangerous for public (which happens to be the default visibility!) function defs. (See the sketch after this list.)
- Xtend will not warn you at editor/compile time about the “missing” case TypeC: it’s a sub type of CommonSuperType but not of TypeA nor of TypeB. At runtime, the bar method will simply fall through and return 0.
- The return type of the combined method is also inferred as the most specific common super type of the various return types – again, potentially Object. This is usually much less of a problem, because that inferred type is checked wherever clients use the combined method's result.
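To make the first of these pitfalls concrete: adding a hypothetical dispatch function for UnrelatedType (which does not extend CommonSuperType) silently widens the inferred parameter type of the combined foo method to Object:

def dispatch foo(UnrelatedType it) { it.someExpr }   // the combined foo(...) now accepts any Object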
This shows that these constructs require us to do a little extra to safe guard the type safety we so appreciate in Xtend.
Advantages and disadvantages of both constructs
We list some advantages and disadvantages of both constructs. Advantages of the dispatch construct:
- Provides more visual code space. This is useful if the handling of the separate types typically needs more than 1 line of code.
- Explicit handling of unhandled cases at runtime.
Disadvantages of the dispatch construct:
- Automagically infers the parameter types of the "combined" method as the most specific common super types. In case of a programmer error, these may be (much) too wide.
- Takes up more visual code space/more syntactic noise.
Advantages of the switch construct:
- Takes up less visual code space. This is useful if the handling of the separate types doesn’t need more than one line of code.
- It’s a single expression, so you can use it as such inside the function it’s living in. Also, you can precompute “stuff” that’s useful for more than one case.
Disadvantages of the switch construct:
- Fall-through of unhandled cases at runtime, resulting in an (often) non-sensical return value. You have to add an explicit default case to detect fall-through.
Mixing polymorphic and ordinary dispatch
Since Xtend version 2.3, you are warned about dispatch functions having a signature compatible with that of a non-dispatch function, and vice versa. As an example, consider the following addition to the Handler class:
def foo(TypeC it) { it.exprC }
def foo(UnrelatedType it) { it.someExpr }
Here, exprC again returns an int, but someExpr may return anything. Note that both functions are not of the dispatch persuasion.
The first line is flagged with the warning "Dispatch method has same name and number of parameters as non-dispatch method", which is a justified warning in my book. However, this warning is also given for the second line, as well as for the first two foo functions. (Note that the warnings are also given with only one of these extra functions present.) In that case, it's not always a helpful warning, but it does riddle your code file with warnings.
To get rid of the warnings, I frequently make use of the following technique:
- "Hide" all dispatch functions by giving them (and only these) an alternate name. My personal preference is to postfix the name with an underscore, since the extra _ is visually inconspicuous enough not to dilute the intended meaning. Also, give them private visibility to prevent prying eyes.
- Create an additional function with the same signature as the “combined” method for the dispatch functions, calling those.
The net result is that you get rid of the warnings, because there's no longer a mixture of dispatch and non-dispatch functions with compatible signatures. Another upshot is that the signature of the "combined" method is now explicitly checked by the additional function calling it – more type safety, yeah! Of course, a disadvantage is that you need an extra function, but that is typically only one line of code.
In the context of our example, the original two foo functions are replaced by the following code:
def foo(CommonSuperType it) { foo_ }

def private dispatch foo_(TypeA it) { it.exprA }
def private dispatch foo_(TypeB it) { it.exprB }
Using syntactic predicates in Xtext, part 2
This blog is a continuation of the previous one about how to use syntactic predicates in Xtext. As promised, I’ll provide a few more examples, most of which come from the realm of GPL-like languages.
But first, a little summary is in order. As stated in the previous blog, a syntactic predicate is an annotation in an Xtext grammar which indicates to the ANTLR parser generator how a (potential) ambiguity should be resolved: by picking the (first) alternative that is decorated with '=>'. The annotation can be applied to:
- a(n individual) keyword (such as ‘else‘),
- a rule call (unassigned or as part of an assignment) and
- a grouped parse expression, i.e. a parse expression between parentheses.
One thing to keep in mind – not only for syntactic predicates but in general – is that an Xtext grammar has at least three and often four responsibilities:
- defining the lexing behavior through definition and inclusion of terminals;
- defining the parsing behavior through parser rules which determine how tokens are matched and consumed;
- defining how the model is populated;
- (when not using an existing Ecore model) defining the meta model.
Syntactic predicates influence the second of these, but not the others. It is, after all, a syntactic predicate, not a semantic one – which Xtext doesn't have in any case. With or without syntactic predicates, parsing behavior is not influenced by how the model is populated: it is governed solely by the types of the tokens the parser receives from the lexer. This is easily forgotten when you're trying to write grammars with cross-references like this:
SomeParserRule:
    Alternative1 | Alternative2;

Alternative1:
    ref1=[ReferencedType1|ID];

Alternative2:
    ref2=[ReferencedType2|ID];
In this case, the parser will always consume the ID token as part of Alternative1, even if its value is the (qualified) name of something of ReferencedType2. In fact, ANTLR will issue a warning about alternative 2 being unreachable, so it is disabled. For a workaround for this problem, see this older blog: it uses a slightly different use case as motivation, but the details are the same. The only thing a syntactic predicate can do here is to explicitly favor one alternative over the other.
Some examples from Xbase
The Xtend and the Xbase languages that Xtext ships with both use plenty of syntactic predicates to avoid ambiguities in their grammars and to avoid having to use backtracking altogether. This already indicates that syntactic predicates are a necessary tool, especially when creating GPL-like or otherwise quite expressive DSLs. Note again that syntactic predicates are typically found near/inside optional parts of grammar rules since optionality automatically implies an alternative parsing route.
A good example can be found in the Xbase grammar in the form of the XReturnExpression rule: see GitHub. It uses a syntactic predicate on an assignment to force the optional XExpression following the ‘return‘ keyword to be parsed as part of the XReturnExpression rather than being an XExpression all on its own – which would have totally different semantics, but could be a viable interpretation considering Xtend doesn’t require separating/ending semi-colons.
The Xbase grammar also shows that syntactic predicates are an effective way to disambiguate the use of pairs of parentheses for denoting a list of arguments to a function call from that for grouping inside an expression: once again, see GitHub – here, the syntactic predicate applies to a grouped parse expression, i.e. everything between the parentheses pair starting just before the ‘=>‘.
Unforeseen consequences
Even if you don’t (have to) use syntactic predicates yourself, it’s important to know of their existence. As an example, the other day I was prototyping a DSL which used the JvmTypeReference type rule from Xbase followed by an angled bracket pair (‘<‘, ‘>’) which held ID tokens functioning as cross-references. I was momentarily surprised to see parse errors arise in my example along the lines of “Couldn't resolve reference to JvmType 'administrator'.” The stuff between the angled brackets was being interpreted as a generic type parameter!
It turns out that the JvmTypeReference parser rule uses a syntactic predicate on an angled bracket pair surrounding generic type parameters. This explains both the behavior and the lack of warnings by ANTLR about grammar ambiguities. You’d probably have a hard time figuring out this behavior before finding an innocuous ‘=>‘ here. In the end, I changed “my” angled brackets to square brackets to resolve this. This shows that syntactic predicates, just like backtracking, can be a double-edged sword: it can solve some of your problems but you have to really know how it works to be able to understand what’s going on.
I hope that this was useful for you: please let me know whether it is! I’m not planning on a third installment but you never know: a particular enticing use case might just do the trick.
Using syntactic predicates in Xtext, part 1
Xtext 2.x comes with the possibility to define syntactic predicates in the grammar. But what exactly are these syntactic predicates and how can they be used to avoid or resolve ambiguities in your grammar? The reference documentation is characteristically succinct on the subject. This might mean that it’s either very simple or very complicated 😉
In short: syntactic predicates provide a way to force the parser to make certain choices by annotating the grammar using a ‘=>‘.
Fortunately, it’s actually quite simple but you have to dive a little deeper into the parsing technology used by Xtext to really understand it. Xtext uses ANTLR* ‘under the hood’ to generate the lexer and recursive-descent parser. To leverage ANTLR, Xtext generates an** ANTLR grammar from an Xtext one. As such, it is ANTLR that does most of the heavy lifting while the Xtext runtime sort-of piggybacks on the ‘stuff’ ANTLR generates to build a full model from the parsed text and provide the functionality that ANTLR doesn’t.
During the generation of lexer and parser, ANTLR performs a thorough analysis of the grammar generated by Xtext to check for non-LL(*) behavior (i.e., left-recursion) and nondeterminism (i.e., ambiguities) in the grammar. The former it deals with by reporting an error “[fatal] rule xxx has non-LL(*) decision due to recursive rule invocations reachable from alts n, m, …. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.” for every left-recursive rule and quitting the process, leaving you with a broken Xtext project. Left-recursion usually originates from trying to implement an expression language along the lines of
Expression:
      Literal
    | '(' Expression ')'
    | left=Expression op=('+'|'-'|'*'|'/') right=Expression
;
There's a string of material (see here, here and here) detailing the "right" patterns for writing such languages in a non-left-recursive manner in Xtext, which also take care of precedence and associativity. Since those patterns don't use syntactic predicates (well, they can, but it's not essential), I won't talk about them any more here.
Switching on backtracking should really be the very last option you try: it doesn't guarantee to solve the problem your grammar has, but it does guarantee to obscure any problem, simply by not reporting any – even the ones that are easy to fix. Furthermore, backtracking "merely" tries all the possible options, picking the first one that works: in essence it's a "precognitive" syntactic predicate, but at the expense of time and memory. If we can tweak our grammar with syntactic predicates so that no backtracking is required, we get a parser that performs better and more predictably, if only because we've documented part of its behavior in the grammar.
The perfunctory example: the dangling else-problem
The most well-known application of syntactic predicates is also the simplest. Consider this grammar (header stuff omitted):
Model:
    statement+=IfStatement*;

IfStatement:
    'if' condition=Expression 'then' then=Expression
    ('else' else=Expression)?;

Expression:
    IfStatement | {ValueReference} name=ID;
When having Xtext generate the language infrastructure for this grammar, you'll get a warning from ANTLR saying "Decision can match input such as "'else'" using multiple alternatives: 1, 2. As a result, alternative(s) 2 were disabled for that input". This means that there is an ambiguity in the grammar. ANTLR detects this and makes a choice for you, because otherwise it would have to return a forest of parse trees instead of just one per parse, or roll a die to cope with the nondeterminism. We'll see in a minute that a syntactic predicate allows you to make this choice yourself, instead of having to rely on ANTLR to pick the right one – with the chance of your luck running out.
Of course, we were already expecting this behavior, so let's fire up ANTLRWorks on the InternalMyDsl.g file in the principal/non-UI Xtext project (easily findable using the Ctrl/⌘-Shift-R shortcut) to see how we might use that in general. First, ask ANTLRWorks to perform the same analysis ANTLR itself does during parser generation, through Ctrl/⌘-R. Then, click the ruleIfStatement rule (conveniently marked in red) to see the Syntax Diagram for it. This will look like this:
Since ANTLR already reported that it will only use alternative 1, this is the way the if-statement will be parsed: the optional else-part will be matched as part of the current invocation of the IfStatement rule. For the canonical example input "if a then if b then c else d", it means that the parse will be equivalent to "if a then (if b then c else d)", i.e. the else-part belongs to the second, inner if-statement and not to the first, outer if-statement. This result is what we usually want, since it complies with most existing languages and also because the else-part is visually closer to the inner if, so it's more natural that it binds to that instead of to the outer if.
By unchecking alternative 1 and checking alternative 2, we get the following:
Now, these "faulty" diagrams in ANTLRWorks are usually a bit funky to interpret, because the arrows don't really seem to start/end in logical places. In this case, we should read it as: the optional else-part can also be matched as part of the invocation of the IfStatement rule that invokes the IfStatement rule a second time – it's probably convenient to think of an outer and an inner invocation, respectively. For our ubiquitous example input, it would mean that the parse is equivalent to "if a then (if b then c) else d" – with the else-part belonging to the first, outer if-statement and not to the inner if-statement.
Note that it's a bit hard to implement a recursive-descent parser with this behavior, since the execution of the inner IfStatement rule would somehow have to decide to leave the matching and consumption of the following 'else' keyword to an outer invocation of the IfStatement rule (not necessarily its direct caller!). ANTLR tends to favor matching and consuming tokens as soon as possible, by the currently-called parser rule, over a more circuitous approach.
You can influence the alternative-picking behavior by placing syntactic predicates in the Xtext grammar. One advantage is that you make the choice explicit in your grammar, which both serves to document it and eradicates the corresponding warning(s). Another advantage is that you can make a different choice from the one ANTLR would make: in fact, you can "trigger" a syntactic predicate not only on a single token but also on a series of tokens – more on that in a next blog. Note that syntactic predicates favor the quickest match as well – by design.
Syntactic predicates in an Xtext grammar consist of a '=>' in front of a keyword, a rule call, an assignment (i.e., an assigned rule call) or a grouped parse expression (including any cardinality). So, in our case the IfStatement rule becomes:
IfStatement:
    'if' condition=Expression 'then' then=Expression
    (=>'else' else=Expression)?;
The '=>' now forces ANTLR to not consider the second alternative path at all and to always match and directly consume an 'else' and the ensuing Expression, which happens to match the situation without a syntactic predicate – but now this behavior is clearly intentional and not happenstance.
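Since the predicate may also be placed in front of a grouped parse expression, an equivalent formulation (a sketch – both forms should behave identically here) would be to predicate the whole optional group:

IfStatement:
    'if' condition=Expression 'then' then=Expression
    =>('else' else=Expression)?;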
Since this blog already runs to some length, I’m deferring some more examples, insights and hints & tips to a next blog. One of the examples will revolve around some common GPL-like language features which can be difficult to implement without syntactic predicates but are blissfully uncomplicated with them.
*) Not entirely by default, but it’s thoroughly recommended: see this explanation for more details on that matter.
**) Actually, Xtext generates two ANTLR grammars: one for full parsing, and one which extracts just enough information to provide the content assist functionality with. They’re essentially the same as far as the pure ANTLR part is concerned.
Implementing existing DSLs with Xtext – a case study, part 1
Unlike the previous installment, which held no technical details whatsoever, this one’s going to get a tad greasier, but not much.
Obtaining the language specs
First item on the agenda is obtaining the specifications of the language, if they exist. If they don't, you want to get hold of as many people as you can who might have the specification in their heads. In both cases, you also want to get hold of 'proza', i.e. actual content written using the language. It's important to verify whether that proza is complete and valid, and if not, what error messages are expected. Also, pre-existing tooling can and should be used to validate your own efforts by comparing the output of both on the same input.
In the case of CSS it’s relatively easy: all specifications can be found on the W3C web site. However, here it’s already clear that it’s not going to be a walk in the park: the specification is divided into numerous modules which have different status (recommendation vs. official) and varying applicability (through profiles/media: HTML, printing, hearing assistance, etc.).
For Less it’s slightly more difficult: the web site provides a lot of examples but it doesn’t constitute a spec. In fact, careful inspection of the less.js parser sources and unit tests shows that the spec is a bit wider than the examples with respect to selectors of mixins.
This showcases an important non-technical aspect of endeavors such as these:
Customer expectation management
Since the DSL already exists, it's quite probable that it will have domain users which have been using this DSL and have certain expectations towards this newer implementation. Also, management will have its reasons to greenlight something that already exists in some shape or form and for which they already paid.
It’s important to get social with these customers, gauge their expectations and to set goals for your implementation accordingly. The golden rule here is to provide something that’s good enough to make it worth changing from the existing tooling to the newer tooling. By providing this value as early as possible, you will have an army of testers to battle-harden your implementation and iron out any bugs or incompatibilities.
The silver rule is not to overdo it: pick the initial scope responsibly, get a green light on the scope from users and management, deliver the value and harvest feedback.
In the case of an open-source endeavor like Less/CSS, this is quite a bit trickier: there are plenty of users (certainly of CSS), but probably none of them share a manager with you, so you have to rely on social media to get people to know your stuff, try it out and provide you with feedback. But even here you have to manage expectations. Note that you can also get some feedback by measuring download numbers and page traffic on the GitHub repository.
For Less/CSS, I’ve set the following initial scope after careful consideration:
Be able to completely parse all CSS and Less unit tests in the GitHub repo, giving errors only where the content really is invalid.
For CSS, this means that I’m ‘vaguely’ going to support CSS3. Obviously, this leaves quite a number of aspects out to dry. To name just a few:
- Compliance to a documented sub set of the CSS specs.
- Knowledge about actual CSS properties.
- Validation of Less expressions (type safety) – this requires implementing an evaluation engine.
- More sensible highlighting than the defaults provide.
- Formatting.
- Generation of CSS from Less.
However, I feel that basic compliance to the CSS and Less syntax, with the niceties that Xtext provides out-of-the-box based on the grammar and a minimal implementation of scoping, reflects the minimum that would make the Less editor useful, in the sense that it improves the development experience over eliciting a response from the less.js parser.
Of course, a good CSS editor already exists in the Eclipse WTP project, so I only intend to support CSS to the point that it makes the Less editor more useful. The CSS editor is therefore currently nothing more than a runtime dependency of the Less editor. In fact, it’s more of a development convenience than anything else, at the moment. I might roll the CSS language into the Less language somewhere along the line – probably when I start generating the CSS-intrinsic parts of the grammar.
Future ‘releases’ (i.e., the occasional new download) will gradually provide more functionality, depending on the amount of time I allocate for this project which will also depend on the amount of feedback I get. So, if you like it, be sure to drop me a line saying that and also what you would like to see in the plug-ins.
Finally, a word on:
Harvesting pre-existing implementations
It's rare that pre-existing tooling can actually be used or even harvested to create an Xtext version, for numerous reasons. For example, the parsing technology used might be sufficiently different from Xtext's LL(*) attribute grammar language to make a pre-existing grammar useless: e.g., it may be LALR(k), it may not be an attribute grammar (meaning that parsing and subsequent model creation are separate phases), or it may use things like semantic predicates and symbol tables to resolve parsing issues. Even an ANTLR grammar is usually quite insufficient to transform into an Xtext grammar.
So, whenever someone (probably a manager) suggests this approach, just point him/her to this paragraph 😉 On the other hand, having the source code of a pre-existing implementation really is very useful, especially as supplement or even substitute of a real language spec.
Please leave a comment if you are finding this useful. In case you’re not able to duplicate my efforts for your own DSL(s) or lack the time to do so yourself: you can hire me as a consultant.
Path expressions in entity models, revisited
Some 15 months ago, I wrote a blog detailing how to implement path-like expressions in Xtext DSLs using a custom scope provider. A lot has changed in the Xtext 'verse since then, and I was triggered to update that blog by a comment on it. Also, the blog seems to be one of my more popular ones, and I can refer to it every now and then on the Eclipse Xtext forum.
Most of what I wrote then which wasn’t particular to the ‘Xtext Domain-Model Example’ is still true: the scope provider API hasn’t changed (apart from IGlobalScopeProvider) and the way that scoping is triggered from the generated parser is fundamentally the same. However, the domain model example itself has changed a lot: it now serves to show (off) a lot of features that were introduced with Xtext 2 and Xtend(2). In particular, it relies heavily on Xbase by extending the grammar from Xtype (a sub language of Xbase) and extending the custom scope provider (in Xtend) and validator implementations (in Java) from their Xbase versions, meaning the DSL gets a lot of functionality essentially for free.
This also means that the grammar has changed enough from the original to make it rather delicate to adapt for my original intentions. In particular, the notions of Attribute and Reference have been clobbered together into the notion of Property which directly references a JVM type. To adapt the example I’d have to rely on classes like JvmFeatureDescriptionProvider in order not to re-invent the wheel, but I fear that bringing all that extra machinery is getting in the way of the idea I was trying to get across.
So, instead of that, I’ve opted for the easy way by starting from the original grammar and custom scope provider implementation, porting those over to Xtext 2.1(.0 or .1) and adapting those. In the process, I’ve tried to make the most of the latest, newest features of Xtext 2.x and especially Xtend: all custom code is done in Xtend, rather than Java.
For your convenience, I packed everything in ZIPs and uploaded them to my GitHub repository: complete Eclipse projects and only the essential files – I’m assuming you’ll find out how to interpret the latter in an Xtext context. For even more convenience, I exposed three of the essential files through GitHub’s gist feature: see below, in the running text.
It should be clearly discernible what I’ve added to the grammar to implement the path expressions. I’ve also added a PathElement convenience super type, just to make life a little easier (read: more statically-typed) on the Java/Xtend side of things.
I rewrote the scope provider implementation completely in Xtend. To be able to replace DomainModelScopeProvider.java completely with DomainModelScopeProvider.xtend, you'll have to add "generateStub = false" to the configuration of the scoping generator fragment in the MWE2 workflow file – otherwise, you'll get two generated DomainModelScopeProvider classes after running the workflow. Note that the implementations of scope_Reference_opposite and scope_PathTail_feature are much more succinct and readable than they originally were, because of Xtend's collection extensions and built-in polymorphic dispatch feature.
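For reference, the relevant bit of the workflow then looks something like this (treat the fragment name as an assumption – it depends on which scoping setup the example project uses):

fragment = scoping.ImportNamespacesScopingFragment {
    generateStub = false
}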
Also note that I made use of a separate Xtend file DomainModelExtensions implementing derived properties of model elements in a domain model. Xtend’s extension mechanism allows me to use these extensions transparently with no boilerplate at all apart from the extension declaration.
I also re-implemented the validator in Xtend. Because the Java validator generator fragment doesn't have a generateStub parameter like the scoping fragment has, we have to use a bit of boilerplate in the form of a DomainModelJavaValidator Java class extending the DomainModelXtendValidator Xtend class.
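That boilerplate amounts to little more than an empty Java class (class names as mentioned above):

public class DomainModelJavaValidator extends DomainModelXtendValidator {
    // intentionally empty: all checks live in the Xtend super class
}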
Lastly, I’ve implemented a couple of simple unit tests in Xtend using Xtext 2’s new JUnit4-ready unit testing capabilities. Note that it’s quite simple to do parameterized tests on a fairly functional level: see the ScopingTest Xtend class (in the .tests project) for details.