Perl programmers spend a lot of time munging data: reading it in,
transforming it, and writing out the results. Perl's great at this ad hoc text
transformation, with all sorts of string manipulation tools, including regular
expressions.
Regular expressions will only get you so far: witness the repeated advice
that you cannot parse HTML or XML with regular expressions alone. When
Perl's builtin text processing tools aren't enough, you have to turn to
something more powerful.
That something is parsing.
An Overview of Lexing and Parsing
For a more formal discussion of what exactly lexing and parsing are, start
with Wikipedia's definitions: Lexing and Parsing.
Unfortunately, when people use the word parsing, they sometimes include the
idea of lexing. Other times they don't. This can cause confusion, but I'll
try to keep the two ideas distinct. Such situations arise with other words, and
our minds usually resolve the specific meaning intended by analysing the context
in which the word is used. So, keep your mind in mind.
The lex phase and the parse phase can be combined into a single process, but
I advocate always keeping them separate. Trust me for a moment; I'll explain
shortly. If you're having trouble keeping the ideas separate, note that the
phases very conveniently run in alphabetical order: first we lex, and then we
parse.
A History Lesson - In Absentia
At this point, an article such as this would normally provide a summary of
historical developments in this field, to explain how the world ended up where
it is. I won't do that, especially as I first encountered parsing many years
ago, when the only tools (lex, bison, yacc) were so complex to operate I took
the pledge to abstain. Nevertheless, it's good to know such tools are still
available, so here are a few references:
Flex is a successor to lex, and Bison is a successor to yacc. These are
well-established (old) tools to keep you from having to build a lexer or parser
by hand. This article explains why I (still) don't use any of these.
But Why Study Lexing and Parsing?
There are many situations where the only path to a solution requires a lexer
and a parser:
Running a program
This is trivial to understand, but not to implement. In order to run a
program we need to set up a range of pre-conditions:
- Define the language, perhaps called Perl
- Write a compiler (combined lexer and parser) for that language's grammar
- Write a program in that language
Lex and parse the source code
After all, it must be syntactically correct before we run it. If not, we
display syntax errors. The real point of this step is to determine the
programmer's intention, that is, the reason for writing the code. We
don't run the code in this step, but we do get output. How do we do
that?
Run the code
Then we can gaze at the output which, hopefully, is correct. Otherwise,
perhaps, we must find and fix logic errors.
Rendering a web page of HTML + content
The steps are identical to those of the first example, with HTML replacing
Perl, although I can't bring myself to call writing HTML writing a program.
This time, we're asking: what is the web page designer's intention?
What would they like to render, and how? Of course, syntax checking is far
looser than with a programming language, but it must still be undertaken. For
instance, here's an example of clearly-corrupt HTML which can be parsed by Marpa:
<title>Short</title><p>Text</head><head>
See Marpa::HTML for
more details. So far, I have used Marpa::R2 in all my work,
which does not involve HTML.
Rendering an image, perhaps in SVG
Consider this file, written in the DOT language, as
used by the Graphviz graph visualizer
(teamwork.dot):
digraph Perl
{
graph [ rankdir="LR" ]
node [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
edge [ color="grey" ]
"Teamwork" [ fillcolor="yellow" ]
"Victory" [ fillcolor="red" ]
"Teamwork" -> "Victory" [ label="is the key to" ]
}
Given this "program", a renderer give effects to the author's
intention by rendering an image:
What's required to do that? As above, lex, parse,
render. Using Graphviz's dot command to carry out these
tasks, we would run:
shell> dot -Tsvg teamwork.dot > teamwork.svg
Note: Files used in these examples can be downloaded from http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.tgz.
The link to the DOT language points to a definition of DOT's syntax, written
in a somewhat casual version of BNF: Backus-Naur
Form. This is significant, as it's usually straight-forward to
translate a BNF description of a language into code within a lexer and
parser.
Rendering that same image, using a different language in the input file
Suppose that you decide that the Graphviz language is too complex, and hence
you write a wrapper around it, so end users can code in a simplified version of
that language. This actually happened, with the original effort available in
the now-obsolete Perl module Graph::Easy. Tels, the
author, devised his own very clever little language,
which he called Graph::Easy.
When I took over maintenance of Graph::Easy, I found the
code too complex to read, let alone work on, so I wrote another implementation
of the lexer and parser, released as Graph::Easy::Marpa.
I'll have much more to say about that in another article. For now, let's
discuss the previous graph rewritten in Graph::Easy
(teamwork.easy):
graph {rankdir: LR}
node {fontsize: 12pt; shape: rectangle; style: filled, solid}
edge {color: grey}
[Teamwork]{fillcolor: yellow}
-> {label: is the key to}
[Victory]{fillcolor: red}
That's simpler for sure, but how does Graph::Easy::Marpa work? Easy: lex, parse, render. More samples of Graph::Easy::Marpa's work can be found among my Graph::Easy::Marpa examples.
It should be clear by now that lexing and parsing are in fact widespread,
although they often operate out of sight, with only their rendered output
visible to the average programmer, user, or web surfer.
All of these problems have in common a complex but well-structured source
text format, with a bit of hand-waving over the tacky details available to
authors of documents in HTML. In each case, it is the responsibility of the
programmer writing the lexer and parser to honour the intention of the original
text's author. We can only do that by recognizing each token in the input as a
discrete unit of meaning (where a word such as print
means to output something of the author's choosing), and by bringing
that meaning to fruition (for print, to make the output visible on
a device).
With all that I can safely claim that the ubiquity and success of lexing and
parsing justify their recognition as vital constituents in the world of
software engineering. Why study them indeed!
Good Solutions and Home-grown Solutions
There's another, more significant, reason to discuss lexing and
parsing: to train programmers without expertise in such matters to resist the
understandable urge to reach for the tools they already know, with
regexps being the obvious choice.
Sure, regexps suit many simple cases, and the old standbys of flex and bison
are always available, but now there's a new kid on the block: Marpa. Marpa draws heavily
from theoretical work done over many decades, and comes in various forms:
- libmarpa
Hand-crafted in C.
- Marpa::XS
The Perl and C-based interface to the previous version of libmarpa.
- Marpa::R2
The Perl and C-based interface to the most recent version of libmarpa.
This is the version I use.
- Marpa::R2::Advanced::Thin
The newest and thinnest interface to libmarpa, which documents how to
make Marpa accessible to non-Perl languages.
The problem, of course, is whether or not any of these are a good, or even
excellent, choice. Good news! Marpa's advantages are huge. It's well tested,
which alone is of great significance. It has a Perl interface, so that I can
specify my task in Perl and let Marpa handle the details. It has its own Marpa Google
Group. It's already used by various modules on the CPAN (see a search for Marpa on the
CPAN), and because those modules are Open Source, you can see exactly how other
people use it.
Even better, Marpa has a very simple syntax, once you get used to it, of
course! If you're having trouble, just post on the Google Group. (If you've
ever worked with flex and bison, you'll be astonished at how simple it is to
drive Marpa.) Marpa is also very fast, with libmarpa written in C. Its speed is
a bit surprising, because new technology usually needs some time to surpass
established technology while delivering the all-important stability.
Finally, Marpa is being improved all the time. For instance, recently the
author eliminated the dependency on Glib, to improve portability. His work
continues, so that users can expect a series of incremental improvements for
some time to come.
I myself use Marpa in Graph::Easy::Marpa
and GraphViz2::Marpa,
but this is not an article about Marpa specifically.
The Lexer's Job Description
As I mentioned earlier, the stages conveniently run in English alphabetical
order. First you lex. Then you parse.
Here, I use lexing to mean the comparatively simple (compared to
parsing) process of tokenising a stream of text, which means chopping that
input stream into discrete tokens and identifying the type of each. The output
is a new stream, this time of stand-alone tokens.
Lexing does nothing more than identify tokens. Questions about the meanings
of those tokens or their acceptable order or anything else are matters for the
parser. The lexer will say: I have found another token and have identified it
as being of some type T. Hence, for each recognized token, the lexer will
produce two items:
- The type of the token
- The value of the token
Because the lexing process happens repeatedly, the lexer will produce an
output of an array of token elements, with each element needing at least these
two components: type and value.
In practice, I prefer to represent these elements as a hashref:
{
count => $integer, # 1 .. N.
name => '', # Unused.
type => $string, # The type of the token.
value => $value, # The value from the input stream.
}
... with the array managed by an object of type Set::Tiny. The latter
module has many nice methods, making it very suitable for such work. Up until
recently I used Set::Array,
which I did not write but which I do now maintain. However, insights from a
recent report of mine, Set-handling
modules, comparing a range of similar modules, have convinced me to
switch to Set::Tiny. For an
application which might best store its output in a tree, the Perl module Tree::DAG_Node is
superb.
The count field, apparently redundant, is sometimes useful in
the clean-up phase of the lexer, which may need to combine tokens unnecessarily
split by the regexps used in lexing. Also, it is available to the parser if
needed, so I always include it in the hashref.
The name field really is unused, but gives people who fork or
sub-class my code a place to work with their own requirements, without worrying
that their edits will affect the fundamental code.
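To make that layout concrete, here is a toy lexer loop which fills such an array of hashrefs. The input string, the token types ('word' and 'number') and the regexp are invented purely for this sketch; a real lexer would use the DFA machinery discussed later.

#!/usr/bin/env perl
use strict;
use warnings;

# A toy lexer loop, illustrating the hashref layout described above.
# The input, the regexp and the token types are invented for this sketch.

my($input) = 'Teamwork 42 Victory';
my($count) = 0;
my(@tokens);

while ($input =~ /\G\s*([A-Za-z_][A-Za-z_0-9]*|\d+)/gc)
{
    my($value) = $1;
    my($type)  = ($value =~ /^\d+$/) ? 'number' : 'word';

    push @tokens,
    {
        count => ++$count,  # 1 .. N.
        name  => '',        # Unused.
        type  => $type,     # The type of the token.
        value => $value,    # The value from the input stream.
    };
}

printf "%d: %s => %s\n", @{$_}{qw/count type value/} for @tokens;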
The Parser's Job Description
The parser concerns itself with the context in which each token appears,
which is a way of saying it cares about whether or not the sequence and
combination of tokens actually detected fits the expected grammar.
Ideally, the grammar is provided in BNF. This makes it easy to
translate into the form acceptable to Marpa. If you have a grammar in another
form, your work will probably be more difficult, simply because someone else
has not done the hard work of formalizing the grammar.
That's a parser. What's a grammar?
Grammars and Sub-grammars
I showed an example grammar earlier, for the DOT format. How
does a normal person understand a block of text written in BNF? Training helps.
Besides that, I've gleaned a few things from practical experience. To us
beginners eventually comes the realization that grammars, no matter how
formally defined, contain within them two sub-grammars:
Sub-grammar #1
One sub-grammar specifies what a token looks like, meaning the range of
forms it can assume in the input stream. If the lexer detects an
incomprehensible candidate, the lexer can generate an error, or it can activate
a strategy called Ruby
Slippers (no relation to the Ruby programming language). This technique
was named by Jeffrey Kegler, the author of Marpa.
In simple terms, the Ruby Slippers strategy fiddles the current token (or an
even larger section of the input stream) in a way that satisfies the grammar
and restarts processing at the new synthesized token. Marpa is arguably unique
in being able to do this.
Sub-grammar #2
The other sub-grammar specifies the allowable ways in which these tokens may
combine, meaning if they don't conform to the grammar, the code generates a
syntax error of some sort.
Easy enough?
I split the grammar into two sub-grammars because it helps me express my
Golden Rule of Lexing and Parsing: encode the first sub-grammar into the
lexer and the second into the parser.
If you know what tokens look like, you can tokenize the input stream by
splitting it into separate tokens using a lexer. Then you give those tokens to
the parser for (more) syntax checking, and for interpretation of what the user
presumably intended with that specific input stream (combination of
tokens).
That separation between lexing and parsing gives a clear plan-of-attack for
any new project.
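As a toy illustration of that plan-of-attack, the sketch below invents a tiny language in which node names and arrows must alternate, as in a -> b -> c. The %token_regexp hash plays the role of sub-grammar #1 and the %may_follow hash plays the role of sub-grammar #2; both hashes, and the little language itself, are made up for this example.

#!/usr/bin/env perl
use strict;
use warnings;

# Sub-grammar #1: what each token looks like (the lexer's job).
my(%token_regexp) =
(
    node  => qr/[A-Za-z_]\w*/,
    arrow => qr/->/,
);

# Sub-grammar #2: which token type may follow which (the parser's job).
my(%may_follow) =
(
    ''    => 'node',    # The stream must start with a node.
    node  => 'arrow',
    arrow => 'node',
);

my($input)    = 'a -> b -> c';
my($previous) = '';

while ($input =~ /\G\s*($token_regexp{arrow}|$token_regexp{node})/gc)
{
    my($value) = $1;
    my($type)  = ($value eq '->') ? 'arrow' : 'node';

    die "Syntax error near '$value'\n" if ($may_follow{$previous} ne $type);

    $previous = $type;
}

die 'Lexing error near: ' . substr($input, pos($input) || 0) . "\n"
    if ((pos($input) || 0) < length $input);

die "Syntax error: input may not end with an arrow\n"
    if ($previous eq 'arrow');

print "Input accepted\n";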
In case you think this is going to be complex, truly it only sounds
complicated. Yes, I've introduced a few new concepts (and will introduce a few
more), but don't despair. It's not really that difficult.
For any given grammar, you must somehow and somewhere manage the complexity
of the question "Is this a valid document?" Recognizing a token with a regex is
easy. (That's probably why so many people stop at the point of using regexes to
pick at documents instead of moving to parsing.) Keeping track of the context
in which that token appeared, and the context in which a grammar allows that
token, is hard.
The complexity of setting up and managing a formal grammar and its
implementation seems like a lot of work, but it's a specified and well
understood mechanism you don't have to reinvent every time. The lexer and
parser approach limits the code you have to write to two things: a set of rules
for how to construct tokens within a grammar and a set of rules for what
happens when we construct a valid combination of tokens. This limit allows you
to focus on the important part of the application—determining what a
document which conforms to the grammar means (the author's
intention)—and less on the mechanics of verifying that a
document matches the grammar.
In other words, you can focus on what you want to do with a
document more than how to do something with it.
Coding the Lexer
The lexer's job is to recognise tokens. Sub-grammar #1 specifies what those
tokens look like. Any lexer will have to examine the input stream, possibly one
character at a time, to see if the current input, appended to the immediately
preceding input, fits the definition of a token.
A programmer can write a lexer in many ways. I do so by combining regexps
with a DFA (Deterministic Finite Automaton) module. The blog entry More
Marpa Madness discusses using Marpa in the lexer (as well as in the
parser, which is where I use it).
What is a DFA? Abusing any reasonable definition, let me describe it
thusly. The Deterministic part means that given the same input at the
same stage, you'll always get the same result. The Finite part means
the input stream only contains a limited number of different tokens, which
simplifies the code. The Automaton is, essentially, a software
machine, that is, a program. DFAs are also often called STTs (State Transition
Tables).
How do you make this all work in Perl? MetaCPAN is your friend! In particular, I
like to use Set::FA::Element to
drive the process. For candidate alternatives I assembled a list of Perl
modules with relevance in the area, while cleaning up the docs for
Set::FA. See Alternatives to
Set::FA. I did not write Set::FA, nor Set::FA::Element,
but I now maintain them.
Transforming a grammar from BNF (or whatever form you use) into a DFA
provides:
Insight into the problem
To cast BNF into regexps means you must understand exactly what the
grammar definition is saying.
Clarity of formulation
You end up with a spreadsheet which simply and clearly encodes your
understanding of tokens.
Spreadsheet? Yes, I store the derived regexps, along with other
information, in a spreadsheet. I even incorporate this spreadsheet into the
source code.
Back to the Finite Automaton
In practice, building a lexer is a process of reading and rereading, many
times, the definition of the BNF (here the DOT language) to
build up the corresponding set of regexps to handle each case. This is
laborious work, no doubt about it.
For example, by using a regexp like /[a-zA-Z_][a-zA-Z_0-9]*/,
you can get Perl's regexp engine to intelligently gobble up characters as long
as they fit the definition. In plain English, this regexp says: start with a
letter, upper- or lower-case, or an underline, followed by 0 or more letters,
digits or underlines. Look familiar? It's very close to the Perl definition of
\w, but it disallows leading digits. Actually, DOT disallows them
(in certain circumstances), but DOT does allow pure numbers in certain
circumstances.
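A quick way to convince yourself of what that regexp accepts and rejects is to run it over a few candidates. The candidate strings below are invented, and the ^ and $ anchors are added so each candidate must match in full; inside a lexer, \G would anchor the match at the current position in the input instead.

#!/usr/bin/env perl
use strict;
use warnings;

# Demonstrate the identifier regexp discussed above.

my($id_regexp) = qr/^[a-zA-Z_][a-zA-Z_0-9]*$/;

for my $candidate ('Teamwork', '_temp', 'node_1', '1node', '12')
{
    printf "%-10s => %s\n", $candidate,
        ($candidate =~ $id_regexp) ? 'identifier' : 'not an identifier';
}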
What is the result of all of these hand-crafted regexps? They're
data fed into the DFA, along with the input stream. The output of the
DFA is a flag that signifies Yes or No, the input stream matches/doesn't match
the token definitions specified by the given regexps. Along the way, the DFA
calls a callback function each time it recognizes a token, stockpiling the tokens.
At the end of the run, you can output them as a stream of tokens, each with its
identifying type, as per The Lexer's Job Description I described earlier.
A note about callbacks: Sometimes it's easier to design a regexp to capture
more than seems appropriate, and to use code in the callback to chop up what's
been captured, outputting several token elements as a consequence.
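Here's a hypothetical example of that technique. Suppose the lexer's regexp has captured a whole Graph::Easy-style attribute list, such as color: grey; label: is the key to, as a single match. A callback along these lines can then split the capture into several token elements. The function name, and the way the matched text reaches it, are assumptions here, since both depend on the DFA module you use.

#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical callback: split one captured attribute list into several
# token elements, using the hashref layout described earlier.

sub split_attribute_list
{
    my($match, $count_ref) = @_;
    my(@tokens);

    for my $pair (split /\s*;\s*/, $match)
    {
        my($name, $value) = split /\s*:\s*/, $pair, 2;

        push @tokens,
            {count => ++$$count_ref, name => '', type => 'attribute_name',  value => $name},
            {count => ++$$count_ref, name => '', type => 'attribute_value', value => $value};
    }

    return @tokens;
}

my($count)  = 0;
my(@output) = split_attribute_list('color: grey; label: is the key to', \$count);

printf "%d: %s => %s\n", @{$_}{qw/count type value/} for @output;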
Because developing the state transition table is such an iterative process,
I recommend creating various test files with all sorts of example programs, as
well as scripts with very short names to run the tests (short names because
you're going to be running these scripts an unbelievable number of
times...).
States
What are states and why do you care about them?
At any moment, the STT (automaton, software machine) is in precisely
one state. Perhaps it has not yet received even one token (so that
it's in the start state), or perhaps it has just finished processing the
previous one. Whatever the case, the code maintains information so as to know
exactly what state it is in, and this leads to knowing exactly what set of
tokens is now acceptable. That is, it has a set of tokens, any of which will be
legal in its current state.
The implication is this: you must associate each regexp with a specific
state and vice versa. Furthermore, the machine will remain in its current state
as long as each new input character matches a regexp belonging to the current
state. It will jump (make a transition) to a new state when that character does
not match.
Sample Lexer Code
Consider this simplistic code from the synopsis of Set::FA::Element:
my($dfa) = Set::FA::Element -> new
(
accepting => ['baz'],
start => 'foo',
transitions =>
[
['foo', 'b', 'bar'],
['foo', '.', 'foo'],
['bar', 'a', 'foo'],
['bar', 'b', 'bar'],
['bar', 'c', 'baz'],
['baz', '.', 'baz'],
],
);
In the transitions parameter the first line says: "foo" is a
state's name, and "b" is a regexp. Jump to state "bar" if the next input char
matches that regexp. Other lines are similar.
To use Set::FA::Element,
you must prepare a transitions parameter matching this format. Now you see
the need for states and regexps.
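Before looking at real code, here's a sketch of driving that toy machine. The accept() method comes from Set::FA::Element's documentation as I read it: it runs the machine over the input and returns a true value only if the machine finishes in an accepting state. Treat that reading as an assumption and confirm it against the docs.

# Sketch only: drive the toy $dfa built above over one input string.
my($input) = 'abc';

print "Input '$input' was ", ($dfa -> accept($input) ? 'accepted' : 'rejected'), ".\n";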
This is code I've used, taken directly from GraphViz2::Marpa::Lexer::DFA:
Set::FA::Element -> new
(
accepting => \@accept,
actions => \%actions,
die_on_loop => 1,
logger => $self -> logger,
start => $self -> start,
transitions => \@transitions,
verbose => $self -> verbose,
);
Let's discuss these parameters.
- accepting
This is an arrayref of state names. After processing the entire input
stream, if the machine ends up in one of these states, it has accepted that
input stream. All that means is that every input token matched an appropriate
regexp, where "appropriate" means every char matched the regexp belonging to
the current state, whatever the state was at the instant that char was
input.
- actions
This is a hashref of function names, so that the machine can optionally call
a function upon entering or leaving any state. That's how the stockpile of
recognized tokens works.
Because I wrote these functions myself and wrote the rules to attach each to
a particular combination of state and regexp, I encoded into each function the
knowledge of what type of token the DFA has matched. That's how the stockpile
ends up with (token, type) pairs to output at the end of the run.
- die_on_loop
This flag, if true, tells the DFA to stop if none of the regexps belonging to the current state match the current input char. Rather than looping forever, stop. Throw an exception.
You might wonder why stopping automatically is not the default, or even
mandatory. The default behavior allows you to try to recover from this bad
state, or at least give a reasonable error message, before dying.
- logger
This is an (optional) logger object.
- start
This is the name of the state in which the STT starts, so the code knows
which regexp(s) to try upon receiving the very first character of
input.
- transitions
This is a potentially large arrayref which lists, separately for each
state, all the regexps which may possibly match the current input
char.
- verbose
Specifies how much to report if the logger object is not defined.
With all of that configured, the next problem is how to prepare the grammar
in such a way as to fit into this parameter list.
Coding the Lexer - Revisited
The coder thus needs to develop regexps, etc., which can be fed directly into
the chosen DFA, here Set::FA::Element, or
which can be transformed somehow into a format acceptable to that module. So
far I haven't actually said how I do that, but now it's time to be
explicit.
I use a spreadsheet with nine columns:
- Start
This contains one word, "Yes", against the name of the state which is
the start state.
- Accept
This contains the word "Yes" against the name of any state which will be
an accepting state (the machine has matched an input stream).
- State
This is the name of the state.
- Event
This is a regexp. The event will fire when the current input char matches
this regexp.
Because the regexp belongs to a given state, we know the DFA will only
process regexps associated with the current state, of which there will
usually be one, or at most a few.
When there are multiple regexps per state, I leave all other columns
empty.
- Next
The name of the "next" state to which the STT will jump if the current
char matches the regexp given on the same line of the spreadsheet (in the
current state of course).
- Entry
The optional name of the function the DFA is to call upon (just before)
entry to the (new) state.
- Exit
The optional name of the function the DFA is to call upon exiting from
the current state.
- Regexp
This is a working column, in which I put formulas so that I can refer to
them in various places in the Event column. It is not passed to the DFA in the
transitions parameter.
- Interpretation
Comments to myself.
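To make those nine columns concrete, here is the toy machine from the Set::FA::Element synopsis above, recast as spreadsheet rows and exported as CSV. This fragment is purely illustrative; the real STT, linked below, is much larger.

Start,Accept,State,Event,Next,Entry,Exit,Regexp,Interpretation
Yes,,foo,b,bar,,,,Jump to bar on a 'b'
,,foo,.,foo,,,,Stay in foo on anything else
,,bar,a,foo,,,,
,,bar,b,bar,,,,
,,bar,c,baz,,,,Jump to the accepting state
,Yes,baz,.,baz,,,,Stay in baz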
I've put the STT for GraphViz2::Marpa online.
This spreadsheet has various advantages:
Legibility. It is very easy to read and to work with. Don't forget,
to start with you'll be basically switching back and forth between the grammar
definition document (hopefully in BNF) and this spreadsheet. I don't do much
(any) coding at this stage.
Exportability. Because I have no code yet, there are several
possibilities. I could read the spreadsheet directly. The two problems with
this approach are the complexity of the code (in the external module which does
the reading of course), and the slowness of loading and running this code.
Because I use LibreOffice, reading the spreadsheet directly would mean either
forcing end-users to install OpenOffice::OODoc,
or exporting the spreadsheet as an Excel file so they could avail themselves of
that option. I have chosen not to support reading the .ods file
directly in the modules (Graph::Easy::Marpa
and GraphViz2::Marpa)
I ship.
I could alternatively export the spreadsheet to a CSV file first. This way,
the code can read the CSV file into the DFA fairly quickly, without loading the
module which reads spreadsheets.
Be careful here with LibreOffice, because it forces you to use Unicode for
the spreadsheet but exports odd character sequences, such as double-quotes as
the three byte sequence 0xe2, 0x80, 0x9c. When used in a regexp, this sequence
will never match a real double-quote in your input stream. Sigh. Do No
Evil. If only.
I could also incorporate the spreadsheet directly into my code. This is my
favorite approach. I do this in two stages. I export my data to a CSV file,
then append that file to the end of the source code of the module, after the
__DATA__ token.
Such in-line data can be accessed effortlessly by the very neat and very
fast module Data::Section::Simple.
Because Perl has already loaded the module—and is executing
it—there is essentially no overhead whatsoever in reading data from
within it. Don't you just love Perl! And MetaCPAN of course. And a community
which contributes such wondrous code.
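Here's a minimal sketch of that favourite approach. The section name (stt), the three-column layout and the naive split on commas are all simplifications for this example; production code should parse the CSV with a module such as Text::CSV.

#!/usr/bin/env perl
use strict;
use warnings;

use Data::Section::Simple 'get_data_section';

# A simplified sketch of reading an STT shipped after __DATA__.

my($csv) = get_data_section('stt');
my(@transitions);

for my $line (split /\n/, $csv)
{
    next if ($line =~ /^State/);    # Skip the header row.

    my($state, $event, $next) = split /,/, $line;

    push @transitions, [$state, $event, $next];
}

print 'Read ', scalar @transitions, " transitions\n";

__DATA__
@@ stt
State,Event,Next
foo,b,bar
foo,.,foo
bar,a,foo
bar,b,bar
bar,c,baz
baz,.,baz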
An advantage of this alternative is that it allows end users to edit the
shipped .csv or .ods files, after which they can use a
command line option on scripts to read their own file, overriding the built-in
STT.
After all this, it's just a matter of code to read and validate the
structure of the STT's data, then to reformat it into what Set::FA::Element
demands.
Coding the Parser
At this point, you know how to incorporate the first sub-grammar into the
design and code of the lexer. You also know that the second sub-grammar must be
encoded into the parser, for that's how the parser performs syntax
checking.
How you do this depends intimately on which pre-existing module, if any, you
choose to use to aid the development of the parser. Because I choose Marpa
(currently Marpa::R2), I am
orienting this article to that module. However, only in the next article will I
deal in depth with Marpa.
Whichever tool you choose, think of the parsing process like this: your
input stream is a set of pre-defined tokens (probably, but not necessarily, output
from the lexer). You must now specify all possible legal combinations of those
tokens. This is the syntax of the language (or, more accurately, the
remainder of the syntax, because the first sub-grammar has already
handled all of the definitions of legal tokens). At this point, assume all
incoming tokens are legal. In other words, the parser will not try to parse and
run a program containing token-based syntax errors, although it may contain
logic errors (even if written in Perl :-).
A combination of tokens which does not match any of the given legal
combinations can be immediately rejected as a syntax error. Keep in mind that
the friendliest compilers find as many syntax errors as possible per parse.
Because this check takes place on a token-by-token basis, you (ought to)
know precisely which token triggered the error, which means that you can emit a
nice error message, identifying the culprit and its context.
Sample Parser Code
Here's a sample of a Marpa::R2 grammar (adapted from its
synopsis):
my($grammar) = Marpa::R2::Grammar -> new
({
actions => 'My_Actions',
start => 'Expression',
rules =>
[
{ lhs => 'Expression', rhs => [qw/Term/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term', rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor', rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
default_action => 'do_something',
});
Despite the differences between this and the calls to Set::FA::Element
-> new() in the lexer example, these two snippets are basically
the same:
- actions
This is the name of a Perl package in which Marpa will look for actions
such as do_add() and do_multiply(). (Okay, the lexer
has no such option, as it defaults to the current package.)
- start
This is the lhs name of the rule to start with, as with the
lexer.
- rules
This is an arrayref of rule descriptors defining the syntax of
the grammar. It corresponds to the lexer's transitions parameter.
- default_action
Use this (optional) callback as the action for any rule element which does not explicitly specify its own action.
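That call only defines the grammar. Here's a sketch of how tokens are then fed in, adapted from Marpa::R2's synopsis and continuing on from the $grammar object above. In real code the read() calls would be driven by the lexer's output stream; here they are hard-coded to represent 42 * 1 + 7.

# Continues from the $grammar defined above (adapted from Marpa::R2's synopsis).

sub My_Actions::do_add       { my(undef, $t1, undef, $t2) = @_; return $t1 + $t2 }
sub My_Actions::do_multiply  { my(undef, $t1, undef, $t2) = @_; return $t1 * $t2 }
sub My_Actions::do_something { my(undef, @children)       = @_; return $children[0] }

$grammar -> precompute;

my($recce) = Marpa::R2::Recognizer -> new({grammar => $grammar});

$recce -> read('Number', 42);
$recce -> read('Multiply');
$recce -> read('Number', 1);
$recce -> read('Add');
$recce -> read('Number', 7);

my($value_ref) = $recce -> value;

print 'Result: ', ($value_ref ? ${$value_ref} : 'No parse'), "\n"; # Prints 'Result: 49'.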
The real problem is recasting the syntax from BNF, or whatever, into a set
of rule descriptors. How do you think about this problem? I suggest you
contrast and compare real code with what the grammar says it must be.
Here's the teamwork.dot file I explained earlier.
digraph Perl
{
graph [ rankdir="LR" ]
node [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
edge [ color="grey" ]
"Teamwork" [ fillcolor="yellow" ]
"Victory" [ fillcolor="red" ]
"Teamwork" -> "Victory" [ label="is the key to" ]
}
In general, a valid Graphviz (DOT)
graph must start with one of:
strict digraph $id {...} # Case 1. $id is a variable.
strict digraph {...}
strict graph $id {...} # Case 3
strict graph {...}
digraph $id {...} # Case 5
digraph {...}
graph $id {...} # Case 7
graph {...}
... as indeed this real code does. The graph's id is Perl, which is
case 5. If you've ever noticed that you can write a BNF as a tree (right?), you
can guess what comes next. I like to write my rule descriptors from
the root down.
Drawing this as a tree gives:
DOT's Grammar
|
V
---------------------
| |
strict |
| |
---------------------
|
V
---------------------
| |
digraph or graph
| |
---------------------
|
V
---------------------
| |
$id |
| |
---------------------
|
V
{...}
Connecting the Parser back to the Lexer
Wait, what's this? Didn't I say that strict is optional? It's not
optional, not in the parser. It is optional in the DOT language, but I designed
the lexer, and therein I ensured it would always output strict =>
no when the author of the graph omitted the strict.
By the time the parser runs, strict is no longer optional. I did
this to make life easier for consumers of the lexer's output stream, such
as authors of parsers. (Making the parser work less is often good.)
Likewise, for digraph 'v' graph, I designed the lexer to
output digraph => 'yes' in one case and digraph =>
'no' in the other. What does that mean? For teamwork.dot, the
lexer will output (in some convenient format) the equivalent of:
strict => no
digraph => yes
graph_id => Perl
...
I chose graph_id because the DOT language allows other types of
ids, such as for nodes, edges, ports, and compass points.
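Expressed in the hashref layout from the lexer section above, the start of that output stream might look like this (a sketch only; the real GraphViz2::Marpa lexer's output differs in detail):

# A sketch of the lexer's output for teamwork.dot, using the hashref
# layout from earlier.

my(@lexed) =
(
    {count => 1, name => '', type => 'strict',   value => 'no'},
    {count => 2, name => '', type => 'digraph',  value => 'yes'},
    {count => 3, name => '', type => 'graph_id', value => 'Perl'},
    # ... and so on, for the body of the graph.
);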
All of this produces the first six Marpa-friendly rules:
[
{ # Root-level stuff.
lhs => 'graph_grammar',
rhs => [qw/prolog_and_graph/],
},
{
lhs => 'prolog_and_graph',
rhs => [qw/prolog_definition graph_sequence_definition/],
},
{ # Prolog stuff.
lhs => 'prolog_definition',
rhs => [qw/strict_definition digraph_definition graph_id_definition/],
},
{
lhs => 'strict_definition',
rhs => [qw/strict/],
action => 'strict', # <== Callback.
},
{
lhs => 'digraph_definition',
rhs => [qw/digraph/],
action => 'digraph', # <== Callback.
},
{
lhs => 'graph_id_definition',
rhs => [qw/graph_id/],
action => 'graph_id', # <== Callback.
},
...
]
In English, all of this asserts that the graph as a whole consists of a
prolog thingy, then a graph sequence thingy. (Remember, I made up the names
prolog_and_graph, etc.)
Next, a prolog consists of a strict thingy, which is now not optional, and
then a digraph thingy, which will turn out to match the lexer input of
/^(di|)graph$/, and the lexer output of digraph =>
/^(yes|no)$/, and then a graph_id, which is optional, and then some
other stuff which will be the precise definition of real live graphs,
represented by {...} in the list of the eight possible formats for
the prolog.
Whew.
Something Fascinating about Rule Descriptors
Take another look at those rule descriptors. They say nothing about
the values of the tokens! For instance, in graph_id => Perl, what
happens to ids such as Perl? Nothing. They are ignored. That's just
how these grammars work.
Recall: it's the job of the lexer to identify valid graph ids based
on the first sub-grammar. By the time the data hits the parser, we know we have
a valid graph id, and as long as it plugs in to the structure of the
grammar in the right place, we are prepared to accept any valid graph
id. Hence Marpa::R2 does not
even look at the graph id, which is a way of saying this one grammar works with
every valid graph id.
This point also raises the tricky discussion of whether a specific
implementation of lexer/parser code can or must keep the two phases separate,
or whether in fact you can roll them into one without falling into the
premature optimisation trap. I'll just draw a veil over that discussion, as
I've already declared my stance: my implementation uses two separate
modules.
Chains and Trees
If these rules have to be chained into a tree, how do you handle the root?
Consider this call to Marpa::R2's
new() method:
my($grammar) = Marpa::R2::Grammar -> new(... start => 'graph_grammar', ...);
graph_grammar is precisely the lhs in the first rule
descriptor.
After that, every rule's rhs, including the root's, must be defined
later in the list of rule descriptors. These definitions form the links in the
chain. If you draw this, you'll see the end result is a tree.
Here's the full Marpa::R2
grammar for DOT (as used in the GraphViz2::Marpa
module) as an image: http://savage.net.au/Ron/html/graphviz2.marpa/Marpa.Grammar.svg.
I created this image with (you guessed it!) Graphviz via GraphViz2. I've added
numbers to node names in the tree, otherwise Graphviz would regard any two
identical numberless names as one and the same node.
Less Coding, More Design
Here I'll stop building the tree of the grammar (see the next article), and
turn to some design issues.
My Rules-of-Thumb for Writing Lexers/Parsers
The remainder of this document is to help beginners orient their thinking
when tackling a problem they don't yet have much experience in. Of course, if
you're an expert in lexing and parsing, feel free to ignore everything I say,
and if you think I've misused lexing/parsing terminology here, please let me
know.
Eschew Premature Optimisation
Yep, this old one again. It has various connotations:
The lexer and the parser
Don't aim to combine the lexer and parser, even though that might
eventually happen. Do wait until the design of each is clear and finalized,
before trying to jam them into a single module (or program).
The lexer and the tokens
Do make the lexer identify the existence of tokens, but not identify
their ultimate role or meaning.
The lexer and context
Don't make the lexer do context analysis. Do make the parser
disambiguate tokens with multiple meanings, by using the context. Let the
lexer do the hard work of identifying tokens. (And context analysis in the
business sense is probably not what you want either.)
The lexer and syntax
Don't make the lexer do syntax checking. This is effectively the same as
the last point.
The lexer and its output
Don't minimize the lexer's output stream. For instance, don't force the
code which reads the lexer's output to guess whether or not a
variable-length set of tokens has ended. Output a specific token as a set
terminator. The point of this token is to tell the parser exactly what's
going on. Without such a token, the next token has to do double-duty:
Firstly it tells the parser the variable-length part has finished and
secondly, it represents itself. Such overloading is unnecessary.
The State Transition Table
In the STT, don't try to minimize the number of states, at least not
until the code has stabilized (that is, it's no longer under [rapid]
development).
I develop my STTs in a spreadsheet program, which means a formula
(regexp) stored in one cell can be referred to by any number of other
cells. This is very convenient.
Divide and Conquer
Hmmm, another ancient aphorism. Naturally, these
persist precisely because they're telling us something important. Here, it
means study the problem carefully, and deal with each part (lexer, parser) of
it separately. Enough said.
Don't Reinvent the Wheel
Yes, I know you'd never do that.
The CPAN has plenty of Perl modules to help with things like the STT, such
as Set::FA::Element.
Check its See Also (in Set::FA, actually) for other
STT helpers.
Be Patient with the STT
Developing the STT takes many iterations:
The test cases
For each iteration, prepare a separate test case.
The tiny script
Have a tiny script which runs a single test. Giving it a
short (perhaps temporary) name makes each test just that little
bit easier to run. You can give it a meaningful name later, when including
it in the distro.
The wrapper script
Have a script which runs all tests.
I keep the test data files in the data/ dir, and the scripts in the scripts/ dir. Then the tests in the t/ dir can perhaps make use of these two sets of helpers.
Because I've only used Marpa::R2 for graphical
work, the output of the wrapper is a web page, which makes viewing the
results simple. I like to include (short) input or output text files on
such a page, beside the SVG images. That way I can see at a glance what the
input was and hence I can tell what the output should be without switching
to the editor's window.
There's a little bit of effort initially, but after that it's
so easy to check the output of the latest test.
I've made available sample output from my wrapper scripts:
- GraphViz2 (non-Marpa)
- GraphViz2::Marpa
- Graph::Easy::Marpa
Be Patient with the Grammar
As with the STT, creating a grammar is, at least for me, very much a
trial-and-error process. I offer a few tips:
Paper, not code
A good idea is not to start by coding with your editor, but to draw the
grammar as a tree, on paper.
Watch out for alternatives
This refers to when one of several tokens can appear in the input
stream. Learn exactly how to draw that without trying to minimize the
number of branches in the tree.
Of course, you will still need to learn how to code such a construct.
Here's a bit of code from Graph::Easy::Marpa
which deals with this (note: we're back to the Graph::Easy
language from here on!):
{ # Graph stuff.
lhs => 'graph_definition',
rhs => [qw/graph_statement/],
},
{
lhs => 'graph_statement', # 1 of 3.
rhs => [qw/group_definition/],
},
{
lhs => 'graph_statement', # 2 of 3.
rhs => [qw/node_definition/],
},
{
lhs => 'graph_statement', # 3 of 3.
rhs => [qw/edge_definition/],
},
This is telling you that a graph thingy can be any one of a group, node, or
edge. It's Marpa::R2's job
to try these alternatives in order to see which (if any) matches the input
stream. This ruleset represents a point in the input stream where one of
several alternatives can appear.
The tree looks like:
graph_definition
|
V
graph_statement
|
V
---------------------------------------
| | |
V V V
group_definition node_definition edge_definition
My comment 3 of 3 says an edge can stand alone.
Watch out for sequences
... but consider the node_definition:
{ # Node stuff.
lhs => 'node_definition',
rhs => [qw/node_sequence/],
min => 0,
},
{
lhs => 'node_sequence', # 1 of 4.
rhs => [qw/node_statement/],
},
{
lhs => 'node_sequence', # 2 of 4.
rhs => [qw/node_statement daisy_chain_node/],
},
{
lhs => 'node_sequence', # 3 of 4.
rhs => [qw/node_statement edge_definition/],
},
{
lhs => 'node_sequence', # 4 of 4.
rhs => [qw/node_statement group_definition/],
},
Here 3 of 4 tells you that nodes can be followed by
edges.
A realistic sample is [node_1] -> [node_2], where
[x] is a node and -> is an edge: the node can be followed by an
edge (3 of 4 above), and that edge can in turn be followed by a node (3 of 4 in
the edge_definition below).
This full example represents a point in the input stream where one of
several specific sequences of tokens is allowed/expected. Here's
the edge_definition:
{ # Edge stuff.
lhs => 'edge_definition',
rhs => [qw/edge_sequence/],
min => 0,
},
{
lhs => 'edge_sequence', # 1 of 4.
rhs => [qw/edge_statement/],
},
{
lhs => 'edge_sequence', # 2 of 4.
rhs => [qw/edge_statement daisy_chain_edge/],
},
{
lhs => 'edge_sequence', # 3 of 4.
rhs => [qw/edge_statement node_definition/],
},
{
lhs => 'edge_sequence', # 4 of 4.
rhs => [qw/edge_statement group_definition/],
},
{
lhs => 'edge_statement',
rhs => [qw/edge_name attribute_definition/],
},
{
lhs => 'edge_name',
rhs => [qw/edge_id/],
action => 'edge_id',
},
But, I have to stop somewhere, so...
Wrapping Up and Winding Down
I hope I've clarified what can be a complex and daunting part of
programming, and I also hope I've convinced you that working in Perl, with the
help of a spreadsheet, is the modern (or "only") path to lexer and parser
bliss.
Ron Savage is a
longtime Perl programmer and prolific CPAN contributor.