
Ambiguity in Xtext grammars – part 2

September 11, 2014

In this continuation of the previous instalment, we’re going to take an ambiguous grammar and resolve its ambiguity.

As an example, consider an (arguably slightly stupid) language involving expressions and statements, two of which are variable declaration and assignment. (Let’s assume that all other statements start off by consuming an appropriate keyword token.) So, the following is valid, Java-like syntax (SomeClass is the identifier of a class thingy defined elsewhere):

SomeClass.SomeInnerClass localVar := ...
localVar.intField := 42

Now, let’s implement a “naive” Xtext grammar fragment for this:

Variable: name=ID;

Statement: VariableDeclaration | Assignment;

VariableDeclaration: typeRef=ClassRef variable=Variable (':=' value=Expression)?;
ClassRef:            type=[Class] tail=FeatureRefTail?;

Assignment:     lhs=AssignableSite ':=' value=Expression;
AssignableSite: var=[VariableDeclaration] tail=FeatureRefTail?;

FeatureRefTail: '.' feature=[Feature] tail=FeatureRefTail?;

Here, Class and Feature are quite standard types that both have a String-valued ‘name’ feature and have corresponding syntax elsewhere. Expression references an expression sublanguage which is at least able to do integer literals. Note that a Variable is contained by a VariableDeclaration, so you can refer to a variable without needing to refer to its declaration. (You can find this grammar on GitHub.)

Now, let’s run this through the Xtext generator:

error(211): ../nl.dslmeinte.xtext.ambiguity/src-gen/nl/dslmeinte/xtext/ambiguity/parser/antlr/internal/InternalMyPL.g:415:1: [fatal] rule ruleStatement has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(211): ../nl.dslmeinte.xtext.ambiguity.ui/src-gen/nl/dslmeinte/xtext/ambiguity/ui/contentassist/antlr/internal/InternalMyPL.g:472:1: [fatal] rule rule__Statement__Alternatives has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

Even though Xtext itself doesn’t warn us about any problem upfront, ANTLR spits two errors back at us and flat-out refuses to generate a parser, after which the Xtext generation process crashes completely. The problem is best illustrated with the example DSL prose: its first line corresponds to a token stream ID-Keyword(‘.’)-ID-ID-Keyword(‘:=’)-… while the second line corresponds to a stream ID-Keyword(‘.’)-ID-Keyword(‘:=’)-INT(42). (Note that whitespace is usually irrelevant and therefore typically hidden, which is Xtext’s default anyway.) Both lines start by consuming an ID token and, because of the k=1 lookahead, the parser doesn’t stand a chance of distinguishing the variable declaration parser rule from the assignment one: only the fourth token reveals the distinction, ID vs. Keyword(‘:=’). Note that since the nesting can be arbitrarily deep, no finite lookahead would suffice, meaning that we’d have to switch on backtracking; one could think of this as setting k=∞.

To recap the situation with the token streams in comments:

SomeClass.SomeInnerClass localVar := ...  // ID-Keyword('.')-ID-[WS]-ID-[WS]-Keyword(':=')-[WS]-...
localVar.intField := 42                   // ID-Keyword('.')-ID-[WS]-Keyword(':=')-INT(42)

So, how do we deal with this ambiguity? One answer is to left-factor(ize) the grammar – as is already suggested by the ANTLR output. The trade-off is that our grammar becomes more complicated and we might have to do some heavy lifting outside of the grammar. But that is only to be expected since the grammar deals first and foremost with the syntax – what Xtext provides extra has everything to do with inference of the Ecore meta model (to which the EMF models conform) and only marginally so with semantics, by means of the default behavior for lazily-resolved cross-references.

Analogous to the left-factorized pattern for expression grammars, we’re going to implement the lookahead manually and rewrite nodes in the parse tree to have the appropriate type. First note that our statements always begin with an ID token which equals either a variable name or a class name. After that, any number of Keyword(‘.’)-ID sequences follow (we don’t care about whitespace, comments and such for now) until we encounter either an ID-Keyword(‘:=’) sequence or a Keyword(‘:=’) token, in both cases followed by an expression of sorts.
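
For readers who haven’t seen it, the expression pattern alluded to above typically looks like the following sketch. The rule and feature names (Addition, Primary, IntLiteral, left, right) are illustrative assumptions and not part of this post’s grammar, and INT is assumed to come from Xtext’s common terminals.

```
// Illustrative only: none of these rules occur in the grammar of this post.
Addition returns Expression:
    // parse the common prefix first, without deciding yet what it will become
    Primary
    // only on seeing '+' wrap what was parsed so far in a new Addition
    ({Addition.left=current} '+' right=Primary)*;

Primary returns Expression:
    {IntLiteral} value=INT;
```

The point to take away is the shape: an unassigned rule call for the shared prefix, followed by an action that decides the type of the enclosing model element once the distinguishing token has been seen.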

So, the idea is to first parse the ID-(Keyword(‘.’)-ID)* token sequence (which we’ll call the head) and then rewrite the tree according to whether we encounter an ID or the Keyword(‘:=’) token first. In Xtext, there’s a distinction between parser and type rules but only type rules give us code completion through scoping out-of-the-box, so we would like to use a type rule for the head. The head starts with a reference to either a Class or a Variable. Unfortunately, we can’t distinguish between these two at parse level, so we have to have a common super type:

HeadTarget: Class | Variable;

However, due to the way that Xtext tries to “lift” or automatically refactor identical features (having the same name, type, etc.), we need to introduce an additional type (that’s used nowhere) to suppress the corresponding errors:

Named: Class | Variable | Attribute;

Now we can make the Head grammar rule, reusing the FeatureRefTail rule we already had:

Head: target=[HeadTarget] tail=FeatureRefTail?;

And finally, the new grammar rule to handle both Assignment and VariableDeclaration:

  Head (
    ({VariableDeclaration.assignableSite=current} name=ID ':=' (value=Expression)?) |
    ({Assignment.lhs=current} ':=' value=Expression)
  )

This works as follows:

  1. Try to parse and construct a Head model element without actually creating a model element containing that Head;
  2. When the first step is successful, determine whether we’re in a variable declaration or an assignment by looking at the next tokens;
  3. Create a model element of the corresponding type and assign the Head instance to the right feature.
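
The three steps map onto the fragment as annotated in the sketch below. The rule header (the name AssignmentOrVariableDeclaration and the ‘returns Statement’ clause) is an assumption made only to get a complete rule; the body is the fragment shown earlier, unchanged.

```
// Rule name and return type are hypothetical; the body is the fragment from the post.
AssignmentOrVariableDeclaration returns Statement:
    // step 1: unassigned rule call, the Head becomes 'current' but has no container yet
    Head (
        // steps 2 and 3: an ID follows, so create a VariableDeclaration and store the Head in it
        ({VariableDeclaration.assignableSite=current} name=ID ':=' (value=Expression)?)
        // steps 2 and 3: ':=' follows directly, so create an Assignment with the Head as its lhs
      | ({Assignment.lhs=current} ':=' value=Expression)
    );
```

By the time the parser has to choose between the two alternatives, the arbitrarily long head has already been consumed, so a single token of lookahead (ID vs. ':=') is enough.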

This is commonly referred to as “tree rewriting” but in the case of Xtext that’s actually slightly misleading, as no trees are rewritten. (In fact, Xtext produces models which are only trees as long as there are no unresolved references.)

To complete the example, we have to implement the scoping (which can also be found on GitHub). I’ve already covered that (with slightly different type names) in a previous blog post, but I will rephrase that here. Essentially, scoping separates into two parts:

  1. Determining the features of the type of a variable. This type is specified by the typeRef feature (of type Head) of a VariableDeclaration. This is actually a type system computation, as the Head instance in the VariableDeclaration should already be completely resolved.
  2. Determining the features of the previous element of a Head instance as possible values of the current FeatureRefTail.feature. For this we only want the “direct features” since we’re actively computing a scope.

(The scoping implementation uses a type SpecElement which is defined as a super type of Head and FeatureRefTail, but this is merely for convenience and type-safety of said implementation.)
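
A minimal sketch of how such a super type could be declared in the grammar, following the same approach as HeadTarget above. This exact rule is an assumption; the version on GitHub may well differ.

```
// Assumed declaration: a rule that is never called, only there to introduce the super type.
SpecElement: Head | FeatureRefTail;
```

Since Head and FeatureRefTail both have a tail feature of type FeatureRefTail, Xtext’s lifting of identical features should make that feature available on SpecElement as well, which is what makes walking the chain convenient in the scoping code.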

In conclusion, we’ve rewritten an ambiguous grammar as an unambiguous one, so we didn’t need to use backtracking with all its associated disadvantages: worse performance, ANTLR no longer reporting warnings about unreachable alternatives, “magic”, etc. We also found that this didn’t really complicate the grammar: it expresses intent and mechanism quite clearly and doesn’t feel like a kludge.