Parsing Expression Grammars, part 4
This article could probably be called “Left Recursion Considered Harmful“. PEG parsers are unambiguous and relatively easy to reason about. A little reasoning about left-recursive PEGs shows that they don’t make sense. The motivation to use left recursion seems to be driven by the desire to build left-associative parse-trees for arithmetic operators. However, parse-tree generation is a separate (though related) problem from language recognition. Equivalent languages can be expressed in different ways. We will explore how iteration can recognize grammars for left-associative operators and semantic actions can build the desired parse-tree.
Our examples will all be arithmetic expressions of integers, so we will assume a common definition of an integer as part of each of our examples. An integer is a one-or-more digits, converted into an integer value.
# int = [0-9]+ $# CREATE int_ptrn WITH and_ptrn_beh( NEW many_ptrn_beh( NEW set_ptrn_beh((48, 57), NIL) ), NEW xform_ptrn_beh(\x.tuple_to_number(10, x)) )
Subtraction Grammar
Let’s begin with the simple case of a subtraction expression. A subtraction expression consists of a sequence of integers separated by minus signs ‘-
‘. A traditional left-recursive definition for a subtraction expression would be:
sub = sub '-' int | int
The grammar is defined this way to express the idea that we should collect subtraction sub-expressions as deeply-nested as possible on the left side of the minus sign. This corresponds to the desired left-associativity of the subtraction operation. For example, evaluation of “1-2-3
” should result in “-4
“, as shown:
"1-2-3" -> (((1)-2)-3) -> ((-1)-3) -> (-4)
However, when we consider this grammar from a PEG perspective, it doesn’t make sense. In fact it is clearly an error, equivalent to defining an infinitely-recursive function. The problem is that sub is defined in terms of itself, without consuming any input. A clear case of infinite recursion.
Right-Recursion
We need to find another way to express this grammar. How about making the grammar right-recursive (or tail-recursive) instead. This gives us:
sub = int '-' sub | int
Notice that the right-recursive grammar matches all the same patterns as the left-recursive version. A subtraction expression is still a sequence of integers separated by minus signs. And this grammar makes a sensible PEG. If we add variable bindings and expression evaluation to the grammar, it could be implemented like this:
# sub = int:n '-' sub:m $sub(n,m) | int CREATE sub_ptrn WITH scope_ptrn_beh( NEW or_ptrn_beh( NEW seq_ptrn_beh( int_ptrn, NEW bind_ptrn_beh(#n), NEW one_ptrn_beh(45), sub_ptrn, NEW bind_ptrn_beh(#m), NEW expr_ptrn_beh(\(acc, env).( sub( env_lookup(env, #n), env_lookup(env, #m) ), env )) ), int_ptrn ) ) # TEST: "1-2-3" SEND NEW tuple_stream_beh(49, 45, 50, 45, 51) TO NEW start_beh((ok, fail), sub_ptrn)
If you try this example in the Humus simulator (along with the pre-requisite definitions from previous articles) you might be surprised to get this result:
#ok, <Actor: id=@stream_end>, 2, NIL
What happened? Why did we get “2
” instead of “-4
“? Let’s walk through the matching process and see. Recall that we are matching the input “1-2-3
” with this grammar:
sub = int:n '-' sub:m $sub(n,m) | int
First, the number “1
” is matched by the int pattern. The resulting integer value (1) is bound to the variable n. Then, the minus sign is matched. Next, we make a recursive call to the sub pattern. Within the recursion, “2
” is matched (by int) and bound to n. This binding does not interfere with the previous binding of n to 1
because it occurs in a nested scope. The next minus sign is matched, followed by another recursive call to sub. This time, the recursion matches “3
“, but fails to match the minus sign because we’ve reached the end of the input stream. Falling back to the alternative clause, the int pattern matches “3
“, and the integer result (3) becomes the value of this recursion. Up one level, this value (3) is bound to m. Then the expression sub(n,m)
is evaluated (where n=2
and m=3
), and the result (-1) becomes the value of this recursion. Back at the top level now, this value (-1) is bound to m. Finally, the expression sub(n,m)
is evaluated (where n=1
and m=-1
) producing 2
as the final result.
In effect, evaluation of “1-2-3
” proceeded as follows:
"1-2-3" -> (1-(2-(3))) -> (1-(-1)) -> (2)
Converting the grammar to be right-recursive resulted in treating the minus operator as right-associative when it should have been left-associative. As you can imagine, the same kind of problem would arise if we were generating parse-trees rather than directly evaluating expressions. The parse-trees would group the operations right-to-left when we want them grouped left-to-right.
Suffix Iteration
Let’s look at this problem in a different way. A subtraction expression is a sequence of integers separated by minus signs. Left-recursion produces infinite recursion. Right-recusion produces right-associativity. What if we define this grammar with an iterative structure, like this:
sub = int ('-' int)*
Again, this grammar matches all the same patterns as the recursive versions. But, with this structure we can iteratively modify an accumulated value, applying operations from left-to-right. If we add variable bindings and expression evaluation to the grammar, it would look like this:
sub = int:n ('-' int:m $sub(n,m):n)* $n
Consider how this would match the input “1-2-3
“. First, the number “1
” is matched by the int pattern. The resulting integer value (1) is bound to the variable n, which serves as our accumulator. Next we try a zero-or-more repeating sequence starting with matching the minus sign. Then, the number “2
” is matched by the int pattern, and the resulting integer value (2) is bound to the variable m. The expression sub(n,m)
is evaluated (where n=1
and m=2
), and the result (-1) is bound to n. The repeating sequence is tried again, this time matching the minus, followed by the number “3
“, which is bound to m. The expression sub(n,m)
is evaluated (where n=-1
and m=3
), and the result (-4) is bound to n. The repeating sequence is tried again, but fails this time, so the final expression retrieves the accumulated value of n producing -4
as the final result.
The iterative grammar produced the result we expected. Sub-expressions were evaluated from left-to-right. The accumulated value (stored in the variable n) was initialized by the first int pattern and updated throughout the iteration.
Repetition
When we first defined the Repeat Pattern in part 2, we wanted to accumulate all the matches into a single sequence, so we used this grammar:
rep(grammar) = grammar:h rep(grammar):t $<h,t> |
This is a right-recursive definition that builds a sequence (tuple) from pairs, where the tail of each pair is defined recursively, and the empty (zero) pattern produces the NIL
termination. When repetition is used to provide iteration, we’re not interested in building this sequence. Instead our accumulated value is explicitly specified through variable bindings and expressions. This means we can (and should) use this simple grammar instead:
star(grammar) = grammar star(grammar) |
This is a straight-forward definition of the Kleene Star operation. It is an alternative of, either a grammar followed by a recursive reference; or the empty pattern. Our PEG processing defines the value of this grammar to be the last pattern matched. Since the empty pattern will always end the recursion, the value will always be NIL
. That’s fine for iterative usage, because the value we’re interested in has already been safely stashed in a variable. A Star Pattern can be defined by expansion, like this:
# star(ptrn) = ptrn star(ptrn) | zero LET star_ptrn_beh(ptrn) = \msg.[ BECOME or_ptrn_beh( NEW and_ptrn_beh( ptrn, SELF ), zero_ptrn ) SEND msg TO SELF ]
The implementation can be further optimized, as we did with the Repeat Pattern in part 3, by expanding the Star Pattern definition and refactoring, like this:
LET loop_ptrn_beh(ptrn) = \((ok, _), ctx).[ LET repeat = $SELF SEND ((ok', fail'), ctx) TO ptrn CREATE ok' WITH \ctx'.[ SEND ((ok, fail'), ctx') TO repeat ] CREATE fail' WITH \_.[ SEND ctx TO ok ] ]
We will use this optimized Loop Pattern to define iteration. For example, our subtraction grammar can now be implemented like this:
# sub = int:n ('-' int:m $sub(n,m):n)* $n CREATE sub_ptrn WITH scope_ptrn_beh( NEW seq_ptrn_beh( int_ptrn, NEW bind_ptrn_beh(#n), NEW loop_ptrn_beh( NEW seq_ptrn_beh( NEW one_ptrn_beh(45), int_ptrn, NEW bind_ptrn_beh(#m), NEW expr_ptrn_beh(\(acc, env).( sub( env_lookup(env, #n), env_lookup(env, #m) ), env )), NEW bind_ptrn_beh(#n) ) ), NEW expr_ptrn_beh(lookup_expr(#n)) ) ) # TEST: "1-2-3" SEND NEW tuple_stream_beh(49, 45, 50, 45, 51) TO NEW start_beh((ok, fail), sub_ptrn)
As expected, executing this test case produces the result:
#ok, <Actor: id=@stream_end>, -4, NIL
Parse Tree
With a few slight modifications, we can build a parse-tree instead of directly evaluating the subtraction expression. Our parse-tree will consist of tuples of the form (left, right, operator). Where left is the left sub-expression, right is the right sub-expression and operator is #SUB
.
# sub = int:n ('-' int:m $<n, m, #SUB>:n)* $n LET tuple_expr(x) = ( CASE x OF (h, t) : pair_expr(h, tuple_expr(t)) _ : x END ) LET const_expr(value) = \(acc, env).(value, env) CREATE sub_ptrn WITH scope_ptrn_beh( NEW seq_ptrn_beh( int_ptrn, NEW bind_ptrn_beh(#n), NEW loop_ptrn_beh( NEW seq_ptrn_beh( NEW one_ptrn_beh(45), int_ptrn, NEW bind_ptrn_beh(#m), NEW expr_ptrn_beh(tuple_expr( lookup_expr(#n), lookup_expr(#m), const_expr(#SUB) )), NEW bind_ptrn_beh(#n) ) ), NEW expr_ptrn_beh(lookup_expr(#n)) ) ) # TEST: "1-2-3" # => <<1, 2, #SUB>, 3, #SUB> SEND NEW tuple_stream_beh(49, 45, 50, 45, 51) TO NEW start_beh((ok, fail), sub_ptrn)
For convenience, we added the tuple_expr function to make it easier to express n-tuples composed of pairs, and the const_expr function to produce a constant value. Output from our test case confirms the creation of a left-associative parse-tree.
#ok, <Actor: id=@stream_end>, ((1, 2, #SUB), 3, #SUB), NIL
Infix Operator
Of course, we are not limited to just subtraction. Let’s expand our grammar to include both ‘+
‘ and ‘-
‘ as left-associative additive operators. To enhance readability, we will replace the plus-sign and minus-sign with the symbolic constants #ADD
and #SUB
. Since we now have a choice of operators, we capture the operator in the variable op for inclusion in the value tuple.
# addop = '+' $#ADD # | '-' $#SUB # expr = int:x (addop:op int:y $<x, y, op>:x)* $x CREATE addop_ptrn WITH or_ptrn_beh( NEW and_ptrn_beh( NEW one_ptrn_beh(43), NEW expr_ptrn_beh(const_expr(#ADD)) ), NEW and_ptrn_beh( NEW one_ptrn_beh(45), NEW expr_ptrn_beh(const_expr(#SUB)) ) ) CREATE expr_ptrn WITH scope_ptrn_beh( NEW seq_ptrn_beh( int_ptrn, NEW bind_ptrn_beh(#x), NEW loop_ptrn_beh( NEW seq_ptrn_beh( addop_ptrn, NEW bind_ptrn_beh(#op), int_ptrn, NEW bind_ptrn_beh(#y), NEW expr_ptrn_beh(tuple_expr( lookup_expr(#x), lookup_expr(#y), lookup_expr(#op) )), NEW bind_ptrn_beh(#x) ) ), NEW expr_ptrn_beh(lookup_expr(#x)) ) ) # TEST: "1+2-3+4" # => <<<1, 2, #ADD>, 3, #SUB>, 4, #ADD> SEND NEW tuple_stream_beh(49, 43, 50, 45, 51, 43, 52) TO NEW start_beh((ok, fail), expr_ptrn)
The test input stream “1+2-3+4
“, should produce the following output:
#ok, <Actor: id=@stream_end>, (((1, 2, #ADD), 3, #SUB), 4, #ADD), NIL
Infix Pattern
We can generalize the additive-operator grammar to any left-binding infix operator. By abstracting out the operand and operator patterns, we can define a function that creates infix grammars.
# infix(opnd,oper) = opnd:x (oper:op opnd:y $<x, y, op>:x)* $x LET infix_ptrn_beh(opnd_ptrn, oper_ptrn) = scope_ptrn_beh( NEW seq_ptrn_beh( opnd_ptrn, NEW bind_ptrn_beh(#x), NEW loop_ptrn_beh( NEW seq_ptrn_beh( oper_ptrn, NEW bind_ptrn_beh(#op), opnd_ptrn, NEW bind_ptrn_beh(#y), NEW expr_ptrn_beh(tuple_expr( lookup_expr(#x), lookup_expr(#y), lookup_expr(#op) )), NEW bind_ptrn_beh(#x) ) ), NEW expr_ptrn_beh(lookup_expr(#x)) ) )
Applying this to our additive-operator grammar, our previous test case would look like this:
# expr = infix(int, addop) CREATE expr_ptrn WITH infix_ptrn_beh(int_ptrn, addop_ptrn) # TEST: "1+2-3+4" # => <<<1, 2, #ADD>, 3, #SUB>, 4, #ADD> SEND NEW tuple_stream_beh(49, 43, 50, 45, 51, 43, 52) TO NEW start_beh((ok, fail), expr_ptrn)
Expression Grammar
With this foundation in place, we’re ready to tackle a full-blown expression grammar, including recursion and operator precedence. We’ll start by considering a traditional left-recursive definition for an expression grammar.
expr = expr [-+] term | term term = term [*/] factor | factor factor = '(' expr ')' | [0-9]+
This is an appropriate grammar for bottom-up parsers, but it is not appropriate for top-down parsers, like PEGs. Instead, we re-express this grammar using iteration, which is appropriate for top-down parsers.
expr = term ([-+] term)* term = factor ([*/] factor)* factor = '(' expr ')' | [0-9]+
Notice that both of these grammars describe the same language. The trick is arranging to build the right parse-tree. We begin with patterns to recognize operators and convert them to symbolic constants.
# mulop = '*' $#MUL # | '/' $#DIV # addop = '+' $#ADD # | '-' $#SUB CREATE mulop_ptrn WITH or_ptrn_beh( NEW and_ptrn_beh( NEW one_ptrn_beh(42), NEW expr_ptrn_beh(const_expr(#MUL)) ), NEW and_ptrn_beh( NEW one_ptrn_beh(47), NEW expr_ptrn_beh(const_expr(#DIV)) ) ) CREATE addop_ptrn WITH or_ptrn_beh( NEW and_ptrn_beh( NEW one_ptrn_beh(43), NEW expr_ptrn_beh(const_expr(#ADD)) ), NEW and_ptrn_beh( NEW one_ptrn_beh(45), NEW expr_ptrn_beh(const_expr(#SUB)) ) )
A factor is either an expression wrapped in parentheses, or a simple integer value.
# factor = '(' expr:x ')' $x | int CREATE factor_ptrn WITH \msg.[ BECOME or_ptrn_beh( NEW seq_ptrn_beh( NEW one_ptrn_beh(40), expr_ptrn, NEW bind_ptrn_beh(#x), NEW one_ptrn_beh(41), NEW expr_ptrn_beh(lookup_expr(#x)) ), int_ptrn ) SEND msg TO SELF ]
A term is one-or-more factors connected by infix multiplicative operators. And an expression is one-or-more terms connected by infix additive operators. Both are (left-associative) infix operator patterns.
# expr = infix(term, addop) # -> term:x (addop:op term:y $<x, y, op>:x)* $x # term = infix(factor, mulop) # -> factor:x (mulop:op factor:y $<x, y, op>:x)* $x CREATE term_ptrn WITH infix_ptrn_beh(factor_ptrn, mulop_ptrn) CREATE expr_ptrn WITH infix_ptrn_beh(term_ptrn, addop_ptrn) # TEST: "1+2*3-4" # -> "(1+(3*4))-4" # => <<1, <2, 3, #MUL>, #ADD>, 4, #SUB> SEND NEW tuple_stream_beh(49, 43, 50, 42, 51, 45, 52) TO NEW start_beh((ok, fail), expr_ptrn)
The test input stream “1+2*3-4
“, should produce the following output:
#ok, <Actor: id=@stream_end>, ((1, (2, 3, #MUL), #ADD), 4, #SUB), NIL
Note that the multiplication is grouped first, since it has higher precedence. Evaluation of this parse-tree would produce “3
“, as we would expect based on operator precedences and left-associativity.
Summary
In top-down parsers, like PEGs, left-recursive grammars are an error that seems to be motivated by a desire to build left-leaning parse-trees. We have shown how equivalent grammars can be defined using either iteration or tail (right) recursion. Tail-recursion lends itself to building parse trees with right-to-left grouping. Iteration lends itself to building parse trees with left-to-right grouping. Both techniques can be safely combined and composed to express, for example, operator precedence. There is no need to support left recursion in PEGs.
Tags: actor, composition, grammar, iteration, pattern-matching, PEG, precedence, recursion
Nice article, in particular where you state, “Notice that both of these grammars describe the same language.” Certainly the wikipedia article isn’t very helpful.
I recently encountered this problem while trying to parse statements of the form “a[x].b[y] = c” where left to right associativity is required and the operators have equal precedence.
I used this as a reference for left to right associativity for my Treetop PEG grammar. It looks something like this https://gist.github.com/3385007 and works great.