Parsing Expression Grammars, part 1

Parsing Expression Grammars (PEGs) define a powerful class of matchers for recognizing strings of a formal language [1]. They’re well suited to parsing the syntax of computer languages. PEG-based tools like OMeta [2] have been successfully applied to a wide variety of transformation problems [3] [4].

The fundamental elements of PEGs can be described compactly and elegantly using actors. They provide a perfect opportunity to illustrate use of multiple “continuation” customers, and lazy initialization based on an executable specification. As usual with actors, the implementation focuses on events, in the form of asynchronous messages. Event transitions drive the process of parsing an input stream.

Tuple Streams

Parser/tokenizers process sequences of input symbols. We will use a simple stream protocol to provide symbols as needed from a tuple (usually a literal). We will find it convenient to ignore NIL at the end of a tuple, treating it as end-of-stream. This makes it easier to build variable length input sequences. As each symbol is extracted from the tuple, we cache the value so it can be returned immediately on subsequent requests. In this case, the cache doesn’t buy us much in the way of performance, but it demonstrates lazy initialization and caching techniques in a simple example.

CREATE stream_end WITH \(cust, _).[ SEND stream_end TO cust ]

LET tuple_stream_beh(tuple) = \(cust, #read).[
	CASE tuple OF
	NIL : [
		BECOME \(cust, #read).[
			SEND stream_end TO cust
		]
	]
	(first, rest) : [
		CREATE next WITH tuple_stream_beh(rest)
		BECOME \(cust, #read).[
			SEND (first, next) TO cust
		]
	]
	last : [
		BECOME \(cust, #read).[
			SEND (last, stream_end) TO cust
		]
	]
	END
	SEND (cust, #read) TO SELF
]

When tuple_stream_beh receives a #read request, it BECOMEs one of three distinct behaviors, based on what kind of tuple was provided. If tuple is NIL, the new behavior sends stream_end to the customer cust. If tuple is a pair, the next stream is created from rest and the new behavior sends the first token and the next stream to the customer. If tuple is any other value, the new behavior sends this last value and stream_end to the customer. Once the appropriate behavior has been selected, the original message is sent to the modified SELF and processed accordingly. In general, the stream responds with either stream_end, or a pair consisting of a token value and the stream actor to read from next.

Test Fixtures

We need a little more infrastructure in order to test our parsers. We will use a couple of simple clients, ok for success, and fail for failure. All of our examples will re-use a simple pre-defined input stream src. This stream consists of the Unicode code-points for a tab (9), a space (32), the digit zero (48), the digit one (49), a period (88), carriage-return (13) and finally line-feed (10).

LET label_beh(cust, label) = \msg.[ SEND (label, msg) TO cust ]

CREATE ok WITH label_beh(println, #ok)
CREATE fail WITH label_beh(println, #fail)

CREATE src WITH tuple_stream_beh(9, 32, 48, 49, 88, 13, 10, NIL)

Matching is initiated by passing an input value to a grammar. The input value is usually a pair consisting of the current input symbol (a character, at this point) and an actor from which we may #read the next symbol. As described previously, the input value is stream_end when there is no more input available. Often this value is assumed if the input value is not a pair.

We boot-strap the matching process by reading the first input value from the stream and passing it to the initial grammar pattern.

LET start_beh((ok, fail), grammar) = \stream.[
	SEND (k_read, #read) TO stream
	CREATE k_read WITH \input.[
		SEND ((ok, fail), input) TO grammar
	]
]

An actor with start_beh behavior is created, bound to a pair of customers, ok for success and fail for failure, and an initial grammar. This actor receives an input stream to start the matching process. A #read request is sent to the stream. The customer for this request k_read receives an initial input value, which it sends to the grammar along with the ok and fail customers.

Basic Pattern Matching

Patterns are matched by sending a match request to a grammar pattern. The match request has the form ((ok, fail), in), where ok is the success customer, fail is the failure customer, and in is the input value. The input value is either stream_end, or a pair (token, next). The value sent to ok or fail is the input value representing the position in the stream after the match attempt.

Zero and Fail Patterns

The most basic grammar patterns are those that match or fail without consuming any input. These grammars are so simple that they don’t require any parameters, thus shared singleton instances of these actors can be used to compose more complex grammars.

LET zero_ptrn_beh = \((ok, fail), in).[ SEND in TO ok ]
CREATE zero_ptrn WITH zero_ptrn_beh

LET fail_ptrn_beh = \((ok, fail), in).[ SEND in TO fail ]
CREATE fail_ptrn WITH fail_ptrn_beh

The zero_ptrn always succeeds, but consumes no input. This is often called the “empty” grammar. The fail_ptrn always fails, and consumes no input. Success or failure is indicated by which customer receives a message. In both cases, the customer receives the next input value to be processed. Since neither of these patterns consumes any input, this is simply the same as the original input value in.

We can test these patterns interactively in the Humus simulator, using the example customers, stream, and boot-strap behaviors defined above. Sending the following message:

SEND src TO NEW start_beh((ok, fail), zero_ptrn)

should produce this output:

#ok, 9, <Actor: id=@next>

This indicates that zero_ptrn (the empty grammar) succeeded and the input is positioned at the tab (9) character (the first position in the stream).

Sending this message:

SEND src TO NEW start_beh((ok, fail), fail_ptrn)

should produce this output:

#fail, 9, <Actor: id=@next>

This indicates that fail_ptrn failed to match and the input is positioned at the tab (9) character (the first position in the stream). When a pattern fails, we expect the input value to be the same as the one it was given to match. With actors, the flow of control does not necessarily return to the actor that made the request, so it’s convenient to provide the original input position to the actor that is handling the failure case.

Any Pattern

Now let’s examine a grammar which does consume input. In fact, this is a kind of wildcard grammar. It matches the next symbol from the input, regardless of what that symbol is. The only way it can fail is if there is no more input—that is, we are at stream_end.

LET any_ptrn_beh = \((ok, fail), in).[
	CASE in OF
	(token, next) : [ SEND (ok, #read) TO next ]
	_ : [ SEND in TO fail	]
	END
]
CREATE any_ptrn WITH any_ptrn_beh

Like zero_ptrn and fail_ptrn, we create a shared singleton any_ptrn since this grammar requires no parameters. When any_ptrn receives a match request, it examines the input value in. If the input has a token and next then we matched successfully, and send a #read request to next to advance the input position. The new input value will be sent to the ok customer, indicating the successful match. If there is no more input, the original input value (which should be stream_end) is sent to the fail customer, indicating that the pattern failed to match.

Our sample input stream should match this pattern. Sending this message:

SEND src TO NEW start_beh((ok, fail), any_ptrn)

should produce this output:

#ok, 32, <Actor: id=@next>

This indicates that any_ptrn matched and the input is positioned at the space (32) character. The tab (9) at the beginning of the stream was consumed as the symbol matched.

Primitive Grammar Patterns

Now let’s explore some primitive grammars. These grammars are created with parameters that specify various aspects of the patterns they match.

One Pattern

Our first parametrized grammar is one that matches a single specific symbol. If the symbol matches, it is consumed. Otherwise the pattern fails and no input is consumed.

LET one_ptrn_beh(match) = \((ok, fail), in).[
	CASE in OF
	(token, next) : [
		CASE token OF
		$match : [ SEND (ok, #read) TO next ]
		_ : [ SEND in TO fail ]
		END
	]
	_ : [ SEND in TO fail ]
	END
]

The logic is nearly the same as any_ptrn_beh, except for checking to see if the token has the same value as match. If the token does not match, or if there is no more input, the original input value is sent to the fail customer.

As our grammars are becoming more complex, we find it useful to separate the “language” grammar from the test fixture. Here we create a language that matches a single tab (9) character. Testing it against our sample input stream, like this:

CREATE language WITH one_ptrn_beh(9)
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#ok, 32, <Actor: id=@next>

This language matched the tab (9), leaving the input at the space (32).

If we create a language that matches a single space (32) character, testing it like this:

CREATE language WITH one_ptrn_beh(32)
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#fail, 9, <Actor: id=@next>

This language failed to match, leaving the input positioned at the starting tab (9) character.

Not Pattern

This pattern is the “negative look-ahead” primitive. It ensures that a specific pattern does not match at the current position in the input stream. In other words, it matches if the specified pattern fails, and fails if the specified pattern matches. No input is consumed by this pattern.

LET not_ptrn_beh(ptrn) = \((ok, fail), in).[
	SEND ((ok', ok), in) TO ptrn
	CREATE ok' WITH \in'.[
		SEND in TO fail
	]
]

The not_ptrn_beh works by changing the operation of the success and failure customers. The original ok customer is passed to ptrn in the failure position. A new ok’ customer is created which sends the original input value in to the original fail customer. The updated input value in’ resulting from a successful match of ptrn is ignored.

If we create a language that fails on a tab (9) character, testing it like this:

CREATE language WITH not_ptrn_beh(NEW one_ptrn_beh(9))
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#fail, 9, <Actor: id=@next>

This language expected not to see a tab (9) character, but it did—so the match failed.

If we create a language that fails on a space (32) character, testing it like this:

CREATE language WITH not_ptrn_beh(NEW one_ptrn_beh(32))
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#ok, 9, <Actor: id=@next>

This language expected not to see a space (32) character. Since the first character in the stream is not a space, the match succeeded. Notice that no input is consumed by the look-ahead. The input is still positioned at the starting tab (9) character.

Peek Pattern

This pattern is the “positive look-ahead” primitive. It ensures that a specific pattern matches at the current position in the input stream, but does not consume any input. It is also our first example of a lazy-initialized executable specification. A positive look-ahead grammar is equivalent to two nested negative look-ahead grammars. We take advantage of that equivalence to dynamically construct this grammar, but only if and when a request message is actually received.

# peek(grammar) = not(not(grammar))
LET peek_ptrn_beh(ptrn) = \msg.[
	BECOME not_ptrn_beh(NEW not_ptrn_beh(ptrn))
	SEND msg TO SELF
]

When peek_ptrn_beh receives any message, it redefines its own behavior (with BECOME) to be two nested instances of not_ptrn_beh. Then it re-sends the original request msg to itself, where it will be processed by the redefined behavior.

If we create a language that looks ahead for a tab (9) character, testing it like this:

CREATE language WITH peek_ptrn_beh(NEW one_ptrn_beh(9))
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#ok, 9, <Actor: id=@next>

This language expected to see a tab (9) character, and it did—so the match succeeded, but the character was not consumed.

If we create a language that looks ahead for a space (32) character, testing it like this:

CREATE language WITH peek_ptrn_beh(NEW one_ptrn_beh(32))
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#fail, 9, <Actor: id=@next>

This language expected to see a space (32) character. Since the first character in the stream is not a space, the match failed. Again, no input is consumed by the look-ahead. The input is still positioned at the starting tab (9) character.

Composite Grammar Patterns

Now let’s explore some composite grammars. These grammars are created by combining other grammar patterns in interesting ways.

Or Pattern

This pattern is the distinguishing feature of PEG parsers. It implements “prioritized choice” among alternatives. The first pattern to match is considered to be the result, and subsequent alternatives will be ignored. Earlier patterns are given “priority”, which removes ambiguity. Each pattern is tried at the same input position.

LET or_ptrn_beh(first, rest) = \((ok, fail), in).[
	SEND ((ok, fail'), in) TO first
	CREATE fail' WITH \in'.[
		SEND ((ok, fail), in) TO rest
	]
]

When or_ptrn_beh receives a match request, the first pattern is tried at the current input value in. If it succeeds, the original ok customer is notified. Otherwise the fail’ customer gets control, and tries to match rest with the original ((ok, fail), in) values.

If we create a language that matches either a space (32) or a tab (9) character, testing it like this:

CREATE language WITH or_ptrn_beh(
	NEW one_ptrn_beh(32),
	NEW one_ptrn_beh(9)
)
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#ok, 32, <Actor: id=@next>

This language matched the tab (9) at the beginning of the stream, leaving the input positioned at the space (32) character.

And Pattern

This pattern composes grammars into a sequence, attempting to match each one in turn. Each pattern is matched from the input position resulting from the previous successful match. If any pattern fails, the input position is reset to the original value.

LET and_ptrn_beh(first, rest) = \((ok, fail), in).[
	SEND ((ok', fail), in) TO first
	CREATE ok' WITH \in'.[
		SEND ((ok, fail'), in') TO rest
		CREATE fail' WITH \in'.[
			SEND in TO fail
		]
	]
]

When and_ptrn_beh receives a match request, the first pattern is tried at the current input value in. If it fails, the original fail customer is notified. Otherwise the ok’ customer gets control, and tries to match rest at the updated input value in’. If that is successful, the original ok customer is notified. Otherwise the fail’ customer gets control, and sends the original in value to the original fail customer.

If we create a language that matches a tab (9) followed by a space (32), testing it like this:

CREATE language WITH and_ptrn_beh(
	NEW one_ptrn_beh(9),
	NEW one_ptrn_beh(32)
)
SEND src TO NEW start_beh((ok, fail), language)

should produce this output:

#ok, 48, <Actor: id=@next>

This language matched the tab (9) at the beginning of the stream, followed by the space (32), leaving the input positioned at the zero (48) character.

Optional Pattern

This pattern matches a pattern optionally. It succeeds if the pattern matches, consuming the matched input, or succeeds without consuming any input. This definition lends itself to an executable specification.

# opt(grammar) = or(grammar, empty)
LET opt_ptrn_beh(ptrn) = \msg.[
	BECOME or_ptrn_beh(ptrn, zero_ptrn)
	SEND msg TO SELF
]

The behavior of opt_ptrn_beh is defined in terms of or_ptrn_beh and zero_ptrn. Note that the order of the alternatives in or_ptrn_beh is important. If we tried to match zero_ptrn first, it would always succeed (without consuming any input) and the original grammar ptrn would never get a chance to match. As implemented, zero_ptrn is matched only if ptrn fails to match.

Repeat Pattern

This pattern matches zero or more occurrences of a pattern. It is defined with an executable specification.

# rep(grammar) = or(and(grammar, rep(grammar)), empty)
LET rep_ptrn_beh(ptrn) = \msg.[
	BECOME or_ptrn_beh(NEW and_ptrn_beh(ptrn, SELF), zero_ptrn)
	SEND msg TO SELF
]

The behavior of rep_ptrn_beh is defined in terms of or_ptrn_beh, and_ptrn_beh, and zero_ptrn. Note the SELF reference, which creates the circular reference implied by the recursive definition. Take a moment to convince yourself that this will match any number of occurrences of a pattern (including zero).

Generated Structure of a Repeat Pattern

Figure 1 – Generated Structure of a Repeat Pattern

Many Pattern

This pattern matches one or more occurrences of a pattern. It is defined with an executable specification.

# many(grammar) = and(grammar, rep(grammar))
LET many_ptrn_beh(ptrn) = \msg.[
	BECOME and_ptrn_beh(ptrn, NEW rep_ptrn_beh(ptrn))
	SEND msg TO SELF
]

The behavior of many_ptrn_beh is defined in terms of and_ptrn_beh, and rep_ptrn_beh. Although the generated structure of this grammar is complex, its specification is simple; one occurrence of ptrn followed by zero or more additional occurrences.

Generated Structure of a Many Pattern

Figure 2 – Generated Structure of a Many Pattern

Summary

PEG parsers are powerful, flexible and easy to understand. They can be implemented compactly with actors, using a protocol with separate customers for success and failure. The resulting parsers can be easily combined and composed to build more complex grammars. Composite parsers can be defined using a kind of lazy-initialized executable specification, which makes them easier to understand and implement correctly. With the grammar patterns defined here, and our ubiquitous test fixture, we can build a fairly complex language grammar.

# L = (' ' | '\t')* ('0' | '1')+ '.'? (('\r' '\n') | '\n' | '\r')
CREATE language WITH and_ptrn_beh(
	NEW rep_ptrn_beh(
		NEW or_ptrn_beh(
			NEW one_ptrn_beh(32),
			NEW one_ptrn_beh(9)
		)
	),
	NEW and_ptrn_beh(
		NEW many_ptrn_beh(
			NEW or_ptrn_beh(
				NEW one_ptrn_beh(48),
				NEW one_ptrn_beh(49)
			)
		),
		NEW and_ptrn_beh(
			NEW opt_ptrn_beh(
				NEW one_ptrn_beh(88)
			),
			NEW or_ptrn_beh(
				NEW and_ptrn_beh(
					NEW one_ptrn_beh(13),
					NEW one_ptrn_beh(10)
				),
				NEW or_ptrn_beh(
					NEW one_ptrn_beh(10),
					NEW one_ptrn_beh(13)
				)
			)
		)
	)
)
SEND src TO NEW start_beh((ok, fail), language)

Which should produce this output:

#ok, <Actor: id=@stream_end>

This language successfully matched our entire example input stream, as indicated by stream_end for the final input value.

References

[1]
B. Ford. Parsing Expression Grammars: A Recognition Based Syntactic Foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 2004.
[2]
A. Warth and I. Piumarta. OMeta: an Object-Oriented Language for Pattern Matching. TR-2007-003, http://www.vpri.org/pdf/tr2007003_ometa.pdf, Viewpoints Research Institute, 2007.
[3]
A. Warth. Experimenting with Programming Languages. TR-2008-003, http://www.vpri.org/pdf/tr2008003_experimenting.pdf, Viewpoints Research Institute, 2008.
[4]
I. Piumarta. PEG-based Transformer Provides Front-, Middle- and Back- end Stages in a Simple Compiler, Workshop on Self-Sustaining Systems (S3). Tokyo, Japan, Sept 2010. ACM Digital Library


Tags: , , , , ,
This entry was posted on Monday, February 14th, 2011 at 8:36 am and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

2 Responses to “Parsing Expression Grammars, part 1”

  1. Parsing Expression Grammars, part 2

    […] Parsing Expression Grammars, part 1 […]

  2. Parsing Expression Grammars, part 3

    […] the current input in contains a token, then that token is used to produce a new tuple_stream_beh and start_beh initiates matching the pattern ptrn. If the pattern matches, we ensure that there is […]

Leave a Reply

Your comment