Parsing Expression Grammars, part 2
It’s usually not enough to simply recognize patterns in an input stream. Soon we will want to take action based on what we recognize. In order to facilitate this, we will begin creating semantic values from the input tokens and trigger semantic actions when certain patterns are recognized. In part 1 of this series we focused simply on recognition. Now we will begin to take action.
Semantic Value
The pattern matching process in part 1 only kept track of the current position in the input stream. From this, it is possible to identify which segment of the input stream matched a given pattern. However, it is more useful to support accumulation and transformation of arbitrary values. We do this by extending the matching protocol to include a semantic value of the form (acc, env) where acc is the accumulated value and env is an environment for binding variable names to values. Our boot-strap start_beh is modified to include an initial semantic value (?, NIL)
indicating that the initial accumulator is undefined and the initial environment is empty.
LET init_env = NIL LET start_beh((ok, fail), ptrn) = \stream.[ SEND (k_read, #read) TO stream CREATE k_read WITH \input.[ SEND ((ok, fail), input, (?, init_env)) TO ptrn ] ]
By defining the initial environment as init_env, we have the option to provide a set of pre-defined bindings, if that should become useful in the future. For now, it is sufficient to start with an empty environment, represented by NIL
.
Zero Pattern
Now we to begin to modify previously-defined grammar patterns to support the additional semantic value. For the zero_ptrn, which always succeeds, matching nothing, we set the accumulator to NIL
and pass on the current environment.
LET zero_ptrn_beh = \((ok, fail), in, (acc, env)).[ SEND (in, (NIL, env)) TO ok ] CREATE zero_ptrn WITH zero_ptrn_beh
Remember that zero_ptrn is used in the definitions of the repetition patterns (optional, repeat and many). It represents the base-case, zero occurrences of a pattern. We reflect this with a semantic value of NIL
, to represent an empty collection.
Fail Pattern
For fail_ptrn, which always fails, matching nothing, we simply pass on the current semantic value unchanged.
LET fail_ptrn_beh = \((ok, fail), in, value).[ SEND (in, value) TO fail ] CREATE fail_ptrn WITH fail_ptrn_beh
In general, the result of a match failure is the original input in and semantic value, including the state of both the accumulator and the environment.
Any Pattern
The any_ptrn, which matches if there is a token remaining in the input stream, captures the matched token in the accumulator.
LET any_ptrn_beh = \((ok, fail), in, value).[ CASE in OF (token, next) : [ SEND (k_read, #read) TO next CREATE k_read WITH \in'.[ LET (acc, env) = $value SEND (in', (token, env)) TO ok ] ] _ : [ SEND (in, value) TO fail ] END ] CREATE any_ptrn WITH any_ptrn_beh
When a token is available, a #read
message is sent to next, to advance the input stream. The customer for this message k_read receives the updated input position in’ and decomposes the semantic value into the accumulator acc and environment env. The successful match result, consisting of the updated in’, the matched token and the unchanged env, is sent to the success customer ok. Match failure sends the original in and value to the failure customer fail.
One Pattern
The one_ptrn_beh matches a single specific input token.
LET one_ptrn_beh(match) = \((ok, fail), in, value).[ CASE in OF (token, next) : [ CASE token OF $match : [ SEND (k_read, #read) TO next CREATE k_read WITH \in'.[ LET (acc, env) = $value SEND (in', (token, env)) TO ok ] ] _ : [ SEND (in, value) TO fail ] END ] _ : [ SEND (in, value) TO fail ] END ]
The logic is nearly identical to any_ptrn, with an additional check that the input token is the same as the match value.
Predicate Pattern
The pred_ptrn_beh matches a single input token, if it meets the condition specified by a semantic predicate. This is a generalization of one_ptrn_beh where a general condition is tested in place of a simple match with a pre-defined token value. This is a very powerful extension. The semantic predicate can take into account information beyond the scope of a context-free grammar.
LET pred_ptrn_beh(pred) = \((ok, fail), in, value).[ CASE in OF (token, next) : [ CASE pred(token) OF TRUE : [ SEND (k_read, #read) TO next CREATE k_read WITH \in'.[ LET (acc, env) = $value SEND (in', (token, env)) TO ok ] ] _ : [ SEND (in, value) TO fail ] END ] _ : [ SEND (in, value) TO fail ] END ]
The logic is nearly identical to one_ptrn_beh, with a general semantic predicate pred replacing the token value match. If the predicate value is TRUE
, the token matches.
Set Pattern
The set_ptrn_beh matches a single input token, if it is part of a specified set. The set is defined by a list of individual values and inclusive ranges. The semantic predicate in_range_set evaluates to TRUE
if the token either matches one of the individual values or is included in one of the specified ranges.
LET in_range(min, n, max) = and( greater_equal(n, min), less_equal(n, max) ) LET in_range_set(range_set, token) = ( CASE range_set OF NIL : FALSE (first, rest) : ( CASE first OF $token : TRUE (min, max) : ( CASE in_range(min, token, max) OF TRUE : TRUE _ : in_range_set(rest, token) END ) _ : in_range_set(rest, token) END ) _ : ? END ) LET set_ptrn_beh(range_set) = pred_ptrn_beh( \token.(in_range_set(range_set, token)) )
The semantic predicate in_range_set is defined by a recursive function. The elements of range_set are considered in order. Each element is either a single value or an inclusive range. Single first values are matched against the value of token. Ranges from min to max are tested with the in_range function.
The set_ptrn_beh is defined by expansion, much like a macro definition. Calls to set_ptrn_beh in turn call pred_ptrn_beh with a semantic predicate defined in terms of in_range_set. As a test, we will define a set consisting of a space (32) and the range from tab (9) to carriage-return (13). We use the test fixtures from part 1 to provide an example input stream and default customers. As mentioned above, the boot-strap start_beh is redefined to support a semantic value. Sending the following message:# start = [ \t-\r] SEND src TO NEW start_beh((ok, fail), NEW set_ptrn_beh(32, (9, 13), NIL) )
should produce this output:
#ok, (32, <Actor: id=@next>), 9, NIL
The pattern matched the tab (9) character, left the input at the space (32) character, and produced a semantic value with an accumulator of 9
(the character matched) and an empty environment.
Look-Ahead Patterns
The look-ahead patterns defined by not_ptrn_beh and peek_ptrn_beh only require the addition of the semantic value to their results.
LET not_ptrn_beh(ptrn) = \((ok, fail), in, value).[ SEND ((ok', ok), in, value) TO ptrn CREATE ok' WITH \(in', value').[ SEND (in, value) TO fail ] ] # peek(grammar) = not(not(grammar)) LET peek_ptrn_beh(ptrn) = not_ptrn_beh(NEW not_ptrn_beh(ptrn))
The definition of peek_ptrn_beh is even more simple than before due to our use of an expansion-style definition.
Or Pattern
The or_ptrn_beh implements prioritized choice. The only change here is keeping the semantic value in sync with the input position.
LET or_ptrn_beh(first, rest) = \((ok, fail), in, value).[ SEND ((ok, fail'), in, value) TO first CREATE fail' WITH \(in', value').[ SEND ((ok, fail), in, value) TO rest ] ]
If the first pattern does not match, the rest is tried using the original in and value.
Alternative Pattern
The alt_ptrn_beh implements prioritized choice among a list of patterns. It is defined by expansion into instances of or_ptrn_beh and possibly fail_ptrn_beh.
LET alt_ptrn_beh(alt) = ( CASE alt OF NIL : fail_ptrn_beh (first, rest) : or_ptrn_beh(first, NEW alt_ptrn_beh(rest)) last : \msg.[ SEND msg TO last ] END )
And Pattern
The and_ptrn_beh implements sequential matching. The only change here is keeping the semantic value in sync with the input position.
LET and_ptrn_beh(first, rest) = \((ok, fail), in, value).[ SEND ((ok', fail), in, value) TO first CREATE ok' WITH \(in', value').[ SEND ((ok, fail'), in', value') TO rest CREATE fail' WITH \(in', value').[ SEND (in, value) TO fail ] ] ]
If the first pattern matches, the rest is tried using the updated in’ and value’. If the rest fails to match, the original in and value are sent to the fail customer.
Sequence Pattern
The seq_ptrn_beh implements sequential matching against a list of patterns. It is defined by expansion into instances of and_ptrn_beh and possibly zero_ptrn_beh.
LET seq_ptrn_beh(seq) = ( CASE seq OF NIL : zero_ptrn_beh (first, rest) : and_ptrn_beh(first, NEW seq_ptrn_beh(rest)) last : \msg.[ SEND msg TO last ] END )
Sending the following message:
# start = '\t' ' ' '0' '1' SEND src TO NEW start_beh((ok, fail), NEW seq_ptrn_beh( NEW one_ptrn_beh(9), NEW one_ptrn_beh(32), NEW one_ptrn_beh(48), NEW one_ptrn_beh(49) ) )
should produce this output:
#ok, (88, <Actor: id=@next>), 49, NIL
The pattern matched the tab (9), space (32), digit zero (48) and digit one (49) characters, leaving the input at the period (88) character, and produced a semantic value with an accumulator of 49
(the last match) and an empty environment.
Optional Pattern
The opt_ptrn_beh either matches a pattern (advancing the input) or matches nothing (notice that the final ‘|’ alternative is nothing). The definition is a simple expansion.
# opt(grammar) = grammar # | LET opt_ptrn_beh(ptrn) = or_ptrn_beh(ptrn, zero_ptrn)
Semantic Action
Semantic actions are executed when they are “matched” against any input. Like zero_ptrn, they always match successfully and do not consume any input. Their effect is defined as a transformation function on the semantic value, which has the form (acc, env) where acc is the accumulated value and env is an environment for binding variable names to values.
Environment
The environment provides bindings from names to values. We define two operations on environments, binding and lookup.
LET env_bind(env, name, value) = ((name, value), env) LET env_lookup(env, name) = ( CASE env OF (($name, value), _) : value ((_, _), rest) : env_lookup(rest, name) ($name, value) : value _ : ? END )
The env_bind function extends an environment env with a new binding from name to value. The extended environment is the result of evaluating env_bind.
The env_lookup function searches an environment env for the first binding associated with name. Since bindings are always added to the beginning of the environment, they can hide previous bindings. If no binding is found for name the result is?
, the undefined value.
Expression Pattern
The expr_ptrn_beh is the primitive semantic action. It simply applies a semantic action function to the current semantic value to produce a new semantic value.
LET expr_ptrn_beh(func) = \((ok, fail), in, value).[ SEND (in, func(value)) TO ok ]
When expr_ptrn_beh receives a match request, the semantic action function func is applied to the semantic value. The current input value in and the new semantic value are sent to the success customer ok.
Bind Expression
As noted in the test for seq_ptrn_beh, the accumulated value of a sequence is the value of its final match. We can capture an earlier value by binding it to a name in the environment.
LET bind_expr(name) = \(acc, env).(acc, env_bind(env, name, acc)) LET bind_ptrn_beh(name) = expr_ptrn_beh(bind_expr(name))
The bind_expr is a semantic action function that binds name to the current value of the accumulator acc. The new semantic value consists of the unchanged accumulator and the environment extended with the new binding. We define bind_ptrn_beh as an abbreviation for an expr_ptrn_beh with a bind_expr that binds name, using a now-familiar expansion definition.
Sending the following message:# start = '\t' ' ' :sp '0' '1' SEND src TO NEW start_beh((ok, fail), NEW seq_ptrn_beh( NEW one_ptrn_beh(9), NEW one_ptrn_beh(32), NEW bind_ptrn_beh(#sp), NEW one_ptrn_beh(48), NEW one_ptrn_beh(49) ) )
should produce this output:
#ok, (88, <Actor: id=@next>), 49, (#sp, 32), NIL
The pattern matched the tab (9) and space (32) characters, bound #sp
to 32
, matched digit zero (48) and digit one (49) characters, leaving the input at the period (88) character, and produced a semantic value with an accumulator of 49
(the last match) and an environment where #sp
is bound to 32
.
Lookup Expression
Once we’ve used bind_expr to add a binding to the environment, we can retrieve it with a lookup_expr.
LET lookup_expr(name) = \(acc, env).(env_lookup(env, name), env)
The lookup_expr is a semantic action function that searches the environment for a value bound to name. The new accumulator is the result of the lookup (which could be ?
undefined) and the unchanged environment.
# start = '\t' ' ' :sp '0' '1' $sp SEND src TO NEW start_beh((ok, fail), NEW seq_ptrn_beh( NEW one_ptrn_beh(9), NEW one_ptrn_beh(32), NEW bind_ptrn_beh(#sp), NEW one_ptrn_beh(48), NEW one_ptrn_beh(49), NEW expr_ptrn_beh(lookup_expr(#sp)) ) )
should produce this output:
#ok, (88, <Actor: id=@next>), 32, (#sp, 32), NIL
The pattern matched the tab (9) and space (32) characters, bound #sp
to 32
, matched digit zero (48) and digit one (49) characters, looked up the value bound to #sp
, leaving the input at the period (88) character, and produced a semantic value with an accumulator of 32
(the value of the lookup) and an environment where #sp
is bound to 32
.
Pair Expression
More complex values can be constructed by composing tuples from pairs. The pair_expr is a semantic action function that applies two semantics actions and then builds a pair from their respective accumulator values.
LET pair_expr(h_expr, t_expr) = \(acc, env).( LET (acc', env') = $h_expr(acc, env) IN LET (acc'', env'') = $t_expr(acc', env') IN ((acc', acc''), env'') )
The pair_expr applies the first semantic action h_expr to the original semantic value (acc, env) and deconstructs the result into (acc’, env’). Then the second semantic action t_expr is applied to (acc’, env’) and the result deconstructed into (acc”, env”). The final semantic value is the pair of accumulated values (acc’, acc”) with the final environment env”.
Sending the following message:# start = '\t' ' ' '0' :x '1' :y $<x,y> SEND src TO NEW start_beh((ok, fail), NEW seq_ptrn_beh( NEW one_ptrn_beh(9), NEW one_ptrn_beh(32), NEW one_ptrn_beh(48), NEW bind_ptrn_beh(#x), NEW one_ptrn_beh(49), NEW bind_ptrn_beh(#y), NEW expr_ptrn_beh(pair_expr( lookup_expr(#x), lookup_expr(#y) )) ) )
should produce this output:
#ok, (88, <Actor: id=@next>), (48, 49), (#y, 49), (#x, 48), NIL
The pattern matched the tab (9), space (32) and digit zero (48) characters, bound #x
to 48
, matched the digit one (49) character, bound #y
to 49
, created a pair from the bound values of #x
and #y
, leaving the input at the period (88) character, and produced a semantic value with an accumulator of (48, 49)
(the value of the pair expression) and an environment where #y
is bound to 49
and #x
is bound to 48
.
Scope Pattern
In order to prevent interference between different uses of variable names, we must control the scope of our environments. The scope_ptrn_beh matches a pattern using a new (default/empty) semantic value, and restores the old environment after the match or failure is determined.
LET scope_ptrn_beh(ptrn) = \((ok, fail), in, (acc, env)).[ SEND ((ok', fail'), in, (?, init_env)) TO ptrn CREATE ok' WITH \(in', (acc', env')).[ SEND (in', (acc', env)) TO ok # restore environment ] CREATE fail' WITH \(in', (acc', env')).[ SEND (in, (acc, env)) TO fail # restore entire context ] ]
When scope_ptrn_beh receives a match request, is sends a match request to ptrn with an undefined accumulator and initial (empty) environment. The success customer for the match ok’ passes on the updated input value in’ and accumulator acc’, but restores the original environment env. The failure customer for the match fail’ reverts to the original values of the input in and the semantic value (acc, env), maintaining the failure invariant described in fail_ptrn.
Repeat Pattern
The rep_ptrn_beh matches zero or more occurrences of a pattern. It is defined in terms of or_ptrn_beh, seq_ptrn_beh, and zero_ptrn. Recall that the semantic value produced by seq_ptrn_beh is the semantic value of the last pattern in the sequence. We would prefer to accumulate the results of repeated matching into a tuple. We use bind_ptrn_beh to capture the matched values in the environment and expr_ptrn_beh with lookup_expr and pair_expr to build up a tuple from the values matched. Since the accumulator produced by zero_ptrn is NIL
, our tuple will be NIL
-terminated (or just NIL
if there are zero matches). The whole pattern is wrapped in a scope_ptrn_beh to create a local scope in which #h
and #t
are bound.
# rep(grammar) = grammar :h rep(grammar) :t $<h,t> # | LET rep_ptrn_beh(ptrn) = \msg.[ BECOME scope_ptrn_beh( NEW or_ptrn_beh( NEW seq_ptrn_beh( ptrn, NEW bind_ptrn_beh(#h), SELF, NEW bind_ptrn_beh(#t), NEW expr_ptrn_beh(pair_expr( lookup_expr(#h), lookup_expr(#t) )) ), zero_ptrn ) ) SEND msg TO SELF ]
We can’t use an expansion definition for rep_ptrn_beh because we need the SELF
reference to establish the circular reference implied by our recursive definition. Instead we use BECOME
to generate the appropriate self-referential structure lazily, on receipt of the first message.
Many Pattern
The many_ptrn_beh matches one or more occurrences of a pattern. It is defined in terms of seq_ptrn_beh, and rep_ptrn_beh. We use bind_ptrn_beh to capture the matched values in the environment and expr_ptrn_beh with lookup_expr and pair_expr to build up a tuple from the values matched. Since the accumulator produced by rep_ptrn_beh is NIL
-terminated, our tuple will be NIL
-terminated. The whole pattern is wrapped in a scope_ptrn_beh to create a local scope in which #h
and #t
are bound.
# many(grammar) = grammar :h rep(grammar) :t $<h,t> LET many_ptrn_beh(ptrn) = scope_ptrn_beh( NEW seq_ptrn_beh( ptrn, NEW bind_ptrn_beh(#h), NEW rep_ptrn_beh(ptrn), NEW bind_ptrn_beh(#t), NEW expr_ptrn_beh(pair_expr( lookup_expr(#h), lookup_expr(#t) )) ) )
We can use an expansion definition for many_ptrn_beh, even with two references to the base pattern ptrn. The whole pattern is wrapped in a scope_ptrn_beh to create a local scope in which #h
and #t
are bound.
Modular Grammar
We would like to build up more complex grammars by creating grammar patterns that may be referenced by name. In some cases we would like these grammars to reference themselves recursively, or each other in mutually recursion. We can support this by taking advantage of lazy initialization. If we wait to generate the grammar until we receive the first match request, we can refer to other grammars defined (later) in the same top-level scope.
# start = sp* digit+ # sp = [ \t-\r] # digit = [0-9] CREATE start_ptrn WITH \msg.[ BECOME scope_ptrn_beh( NEW seq_ptrn_beh( NEW rep_ptrn_beh(sp_ptrn), NEW many_ptrn_beh(digit_ptrn) ) ) SEND msg TO SELF ] CREATE sp_ptrn WITH scope_ptrn_beh( NEW set_ptrn_beh(32, (9, 13), NIL) ) CREATE digit_ptrn WITH scope_ptrn_beh( NEW set_ptrn_beh((48, 57), NIL) )
The sp_ptrn and digit_ptrn grammars are referenced from start_ptrn, but are defined afterward. Each “named” pattern is wrapped in a scope_ptrn_beh to create a local scope for variable bindings. Sending the following message:
SEND src TO NEW start_beh((ok, fail), start_ptrn)
should produce this output:
#ok, (88, <Actor: id=@next>), (48, 49, NIL), NIL
The sp*
pattern matched the tab (9) and space (32) characters. The digit+
pattern matched the digit zero (48) and digit one (49) characters, leaving the input at the period (88) character, and produced a semantic value with an accumulator of (48, 49, NIL)
(a NIL
-terminated tuple of the digits matched) and an empty environment.
Summary
We have extended our pattern matching grammars from part 1 to not only recognize, but also capture tokens from an input stream. We’ve generalized and extended previous matching primitives to implement semantic predicates, match sets, multi-way alternatives and sequences.
We’ve provided a semantic value which consists of an accumulated value and an environment for variable bindings. Expressions perform semantic actions on the semantic value, with full access to the power of the underlying programming language. Most commonly, these actions are used to build an abstract syntax tree, or some other convenient representation of the matched structure. By limiting our transformations to pure functional values, the parser automatically manages the application and undoing of effects on the semantic value. By constructing modular grammars from these matchers, we can build powerful and maintainable grammars which correspond naturally to the languages we want to process.Tags: actor, composition, grammar, laziness, pattern-matching, PEG, protocol, sequence, streams, value
[…] Parsing Expression Grammars, part 2 […]
[…] we first defined the Repeat Pattern in part 2, we wanted to accumulate all the matches into a single sequence, so we used this […]