Keywords: Parsers, XPath, YAPPS, SPARK, BisonGen
While little languages have been discussed in the Python community for quite some time [Ayc98], little progress has been made towards including a parser generator in the standard library. This may be partly because contributions have so far come only from the authors of parser toolkits, not from their users.
This paper is a report on using Python parser generators in an application project, namely in an XPath parser for PyXML. XPath [W3C99] is a query language for XML. An XPath expression can describe a subset of the nodes in an XML document. An XPath implementation then allows evaluating a given expression for a given document.
Even though the major part of developing an XPath implementation is in the logic for evaluating XPath expressions, parsing them turns out to be a challenge on its own. In the 4Suite [Fou01] package, the XPath parser is built using the BisonGen [Fou00] generator. Incorporating this implementation into PyXML proved to be quite challenging. A detailed description of BisonGen is given later; however, the following obstacles should serve as a motivation for developing yet another XPath parser:
- Building the parser from source requires BisonGen, PyXML, bison, flex, SWIG, and a C compiler. The PyXML prerequisite in particular causes a cyclic dependency, which could be resolved by including the generated C files with the distribution.
- flex does not support Unicode; there are patches to support wchar_t strings, but it is not clear whether these patches would allow implementing full Unicode support for an XPath scanner.
It should be noted that some of these problems have been resolved with the 0.11.1 release of 4Suite.
When analysing parser generators, the following absolute requirements were defined:
In addition, the following non-critical criteria were used for ranking the parsers:
In the remainder of this document, I'll assume that the reader is familiar with the principles of compiler construction. As introductory material, the Dragon Book [ASU86] serves as the standard reference. Familiarity with XPath is not assumed, as this language only serves as an example.
Most scanner generators use regular expressions to describe tokenization. Even though all the parser generators described in this paper do support tokenization by means of regular expressions, it quickly turned out that this tokenization process was not useful for XPath. In the XPath specification, special tokenization rules are given as plain English text:
If there is a preceding token and the preceding token is not one of @, ::, (, [, , or an Operator, then a * must be recognized as a MultiplyOperator and an NCName must be recognized as an OperatorName. ...

If the character following an NCName (possibly after intervening ExprWhitespace) is (, then the token must be recognized as a NodeType or a FunctionName.
These disambiguation rules specify that in */para, the asterisk has its wildcard meaning, whereas in 3 * 5, it indicates multiplication. Likewise, in para[last()], last is a function call, whereas in last[1] it is an element name. The disambiguation rules are necessary for XPath to avoid the definition of keywords, since all identifiers (NCNames) should be allowed as XML element and attribute names.
Since the XPath disambiguation operates on lexical context, it does not easily fit into scanner generators. The lex scanner states might allow an implementation of the disambiguation rules; however, we found that they are easy enough to implement using program logic. Thus, we implemented a custom scanner for XPath, using the re module for regular expressions.
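As an illustration, the first disambiguation rule can be expressed as program logic along the following lines. This is a minimal sketch; the token type names are assumptions, and the set of operator tokens is abbreviated (the full rule also covers tokens such as '+', '=', or '//'):

OPERATOR_NAMES = ('and', 'or', 'mod', 'div')

# Token types after which '*' keeps its wildcard meaning and an NCName
# keeps its name meaning; abbreviated to the tokens quoted above.
NO_OPERAND_BEFORE = ('AT', 'COLONCOLON', 'LPAREN', 'LBRACKET', 'COMMA',
                     'MultiplyOperator', 'OperatorName')

def disambiguate(prev_type, tok_type, tok_value):
    # If there is a preceding token and it is not one of '@', '::',
    # '(', '[', ',' or an operator, the special readings apply.
    if prev_type is not None and prev_type not in NO_OPERAND_BEFORE:
        if tok_type == 'STAR':
            return 'MultiplyOperator'
        if tok_type == 'NCName' and tok_value in OPERATOR_NAMES:
            return 'OperatorName'
    return tok_type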
A scanner based on regular expressions is usually implemented as a single alternation of all token definitions. For XPath, a fragment of this expression looks like this:

(?P<Number>\\d+(\\.\\d*)?|\\.\\d+)|
(?P<VariableReference>\\$""" + QName + """)|
(?P<NCName>"""+NCName+""")|
(?P<QName>"""+QName+""")|
(?P<LPAREN>\\()|
(?P<RPAREN>\\))|
(?P<STAR>\\*)|
...
(?P<ExprWhiteSpace>[ \t\n\r]+)
Here, each alternative in the regular expression defines a named group. Scanning proceeds by repeatedly matching the combined expression at the current position, using the name of the matching group as the token type, applying the disambiguation rules, and advancing past the matched text.
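The following is a minimal, self-contained sketch of this process. The simplified token set and NCName pattern are assumptions (the real scanner uses the full QName/NCName productions), and disambiguate() is the function sketched above:

import re

TOKEN_RE = re.compile(r"""
    (?P<Number>\d+(?:\.\d*)?|\.\d+)|
    (?P<LPAREN>\()|
    (?P<RPAREN>\))|
    (?P<STAR>\*)|
    (?P<NCName>[A-Za-z_][A-Za-z0-9_.-]*)|
    (?P<ExprWhiteSpace>[ \t\n\r]+)
""", re.VERBOSE)

def tokenize(text):
    tokens, pos, prev_type = [], 0, None
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('lexical error at position %d' % pos)
        if m.lastgroup != 'ExprWhiteSpace':  # whitespace only separates tokens
            tok_type = disambiguate(prev_type, m.lastgroup, m.group())
            tokens.append((m.start(), m.end(), tok_type, m.group()))
            prev_type = tok_type
        pos = m.end()
    return tokens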
It should be noted that this scanning process requires the complete input. For XPath, this causes no problems, since the input is typically small, and completely provided in the Python source code of the application, or in the XSLT style sheet.
When experimenting with various parsers, adjusting the grammar definition to each parser and fitting in the semantic actions was somewhat time-consuming. Thus, a single hand-written scanner would simplify adjustment to various parser generators. On the other hand, it introduced another requirement on parsers being evaluated: They need to interoperate with a hand-written scanner.
Since the ultimate objective of this project was to incorporate 4XPath into PyXML, the existing XPath evaluation library of 4XPath had to be re-used, if possible as-is. That turned out to be quite feasible: The 4XPath BisonGen parser would create class instances in its semantic actions, which represent an abstract syntax of the expression.
For the parser generators studied, it was not difficult to generate the same abstract syntax, in particular since the concrete syntax of all parsers was similar (it could not be identical for all parsers for reasons shown below).
These days, parser generators often provide their own framework for abstract syntaxes and parse trees. We find such frameworks to be more a source of problems than a solution. It may be easier to generate the abstract syntax using the parser framework (in particular if the parser does not support custom actions), but it is more difficult to process the resulting syntax tree. In the specific application, we would need to apply a transformation from the parser-generated tree to the instances of the classes already defined in 4XPath.
As the first parser generator, we looked at kwParsing, by Aaron Watters [Cho98]. It quickly became apparent that this toolkit currently does not support Unicode strings, due to its reliance on the regex module. We have not studied it further, although either a port to re or the integration of a custom tokenizer may have been feasible.
Following the links on the String SIG page [Kuc00], we found "Yet Another Python Parser System", YAPPS, by Amit Patel. Even though the generated parsers only support LL(1) languages due to the recursive-descent strategy, it turned out that the XPath grammar could easily be rewritten into an acceptable form using standard techniques. For example, rule 18 of the XPath specification reads
UnionExpr ::= PathExpr | UnionExpr '|' PathExpr
in its original form. In YAPPS, the left recursion can be eliminated as follows:

rule UnionExpr:  PathExpr UnionExprs
rule UnionExprs:                           # empty
               | BAR PathExpr UnionExprs
YAPPS operates as a parser generator: the grammar specification is given to a command line tool (which is itself written using YAPPS), and a Python file is produced. The generated parser makes use of a library file; the library and the generated parser module together are all that is needed for anybody using the parser.
YAPPS is implemented using recursive descent. This makes it possible to understand the generated code easily. For example, the UnionExpr rule translates into a function definition:

def UnionExpr(self):
    PathExpr = self.PathExpr()
    UnionExprs = self.UnionExprs(PathExpr)
    return UnionExprs
It may not be obvious how the return statement comes about; this was generated from semantic actions. The full definition of the rule in the XPath parser reads:

rule UnionExpr: PathExpr UnionExprs<<PathExpr>>
                {{return UnionExprs}}
rule UnionExprs<<v>>:
                  {{return v}}
                | BAR PathExpr UnionExprs<<self.nop(self.UNION,v,PathExpr)>>
                  {{return UnionExprs}}
Each rule can accept arbitrary parameters, which are declared in the rule definition and passed inside '<<...>>' brackets. Likewise, semantic actions are inserted inside '{{...}}' braces.
YAPPS will generate a class that contains a method for each rule. This makes it possible to create a new parser state for each parse process. It also gives easy access to additional functions: in the XPath parser, I have defined a subclass of the generated parser class:
class MyXPath(XPath):
    ...
    UNION = pyxpath.UNION_OPERATOR
    ...
    def nop(self, operator, left, right):
        return self.factory.createNumericOperator(operator, left, right)
This class can go into the grammar definition file. Like YACC, YAPPS supports custom code before and after the grammar definition.
YAPPS parsers integrate with the scanner through a Scanner object, which is passed to the parser as a constructor argument. Even though YAPPS supports the definition of tokens in the grammar, we have not used this capability in XPath, since we provide our own scanner class. The YAPPS parser only requires a token() method from the scanner object, which must return a four-tuple (start, end, token type, token value).
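A minimal sketch of a scanner object satisfying this interface, reusing the tokenize() function sketched earlier, might look as follows; handling of YAPPS' context restriction argument and of end-of-input is omitted:

class XPathScanner:
    def __init__(self, text):
        self.tokens = tokenize(text)   # list of (start, end, type, value)

    def token(self, i, restrict=None):
        # return the i-th four-tuple; a full scanner would honour the
        # restriction on acceptable token types and signal end of input
        return self.tokens[i]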
In case of a syntax error, the generated parser raises a yappsrt.SyntaxError exception, which carries the current position and an error message.
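A hypothetical driver tying these pieces together could then read as follows; the start rule name Expr is an assumption:

import sys
import yappsrt

def parse_xpath(text):
    parser = MyXPath(XPathScanner(text))
    try:
        return parser.Expr()              # assumed start rule
    except yappsrt.SyntaxError as e:      # carries position and message
        sys.stderr.write('XPath syntax error: %s\n' % e)
        return None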
Entering the XPath grammar into the parser was straightforward, as shown above. Debugging the parser turned out to be easy as well, since it was possible to trace the generated Python code in a debugger, or to enrich it with debugging statements. The YAPPS distribution comes with a manual and a number of example grammars.
The SPARK parser was first presented in John Aycock's paper on little languages [Ayc98]. It is a pure-Python framework which does not require a parser generator. Instead, the grammar is defined by means of __doc__ strings, which are collected using introspection in the SPARK library module.
In SPARK, the UnionExpr production is formulated as follows:
def p_UnionExpr(self, args):
    """
    UnionExpr ::= PathExpr
    UnionExpr ::= UnionExpr | PathExpr
    """
    return args
Since the method name starts with a p_ prefix, it is treated as a production method. SPARK will analyse its doc string and associate the production with the method. Every time one of the given production rules matches, it invokes this method.
The semantic action in this code is trivial. It would be easy to define a meaningful action: args is the list of all right-hand side non-terminals. For this rule, it is thus either one or two elements long.
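For illustration, a hypothetical variant of this method that builds the same abstract syntax as the YAPPS parser above might read like this; the factory attribute and the operator constant are assumptions carried over from the MyXPath class:

def p_UnionExpr(self, args):
    """
    UnionExpr ::= PathExpr
    UnionExpr ::= UnionExpr | PathExpr
    """
    if len(args) == 1:
        return args[0]
    # args[0] is the UnionExpr value, args[-1] the PathExpr value
    return self.factory.createNumericOperator(
        pyxpath.UNION_OPERATOR, args[0], args[-1])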
Integration with a custom tokenizer is also straightforward. All tokens are specified as t_ methods in the GenericScanner subclass. The tokenize() method of this subclass then needs to take care of the XPath disambiguation.
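A minimal sketch of such a subclass, patterned on the examples in the SPARK distribution, might look like this; the token representation and the disambiguation step are simplified:

from spark import GenericScanner

class XPathGenericScanner(GenericScanner):
    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        # the XPath disambiguation rules would post-process self.rv here
        return self.rv

    def t_whitespace(self, s):
        r'[ \t\n\r]+'
        pass                     # ExprWhitespace only separates tokens

    def t_number(self, s):
        r'\d+(\.\d*)?|\.\d+'
        self.rv.append(('Number', s))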
On error, SPARK will invoke an error() method on the parser, which by default will exit Python.
Since SPARK uses the Earley algorithm, it can parse almost any grammar as-is; this makes entering the grammar straightforward. Unfortunately, this parsing method (and the fact that the parser is generated dynamically) makes it difficult to debug the parser. Since the generated parsers silently accept ambiguous grammars, it might be difficult to find the cause of a syntax error. The documentation primarily consists of the IPC7 paper and a few examples.
Fourthought Inc. has developed the BisonGen framework to implement parsers for their 4Suite package [Fou01]. The parser is defined using an XML syntax. Until recently, the build process of a BisonGen parser was as follows:
Recently, this build procedure was completely restructured. Today, BisonGen implements the LALR(1) algorithm itself, no longer relying on bison. Therefore, the build procedure is simplified to:
To give an impression of how BisonGen code looks, the UnionExpr production is again presented, as it appears in the 0.11 release of 4Suite.
<RULE_SET NAME="18">
  <NON_TERMINAL>unionExpr</NON_TERMINAL>
  <RULE>
    <SYMBOL TYPE="s">pathExpr</SYMBOL>
    <CODE>
    </CODE>
  </RULE>
  <RULE>
    <SYMBOL TYPE="s">unionExpr</SYMBOL>
    <SYMBOL TYPE="s">'|'</SYMBOL>
    <SYMBOL TYPE="s">pathExpr</SYMBOL>
    <CODE>
      <VARIABLE TYPE="PyObject*" NAME="right"></VARIABLE>
      <VARIABLE TYPE="PyObject*" NAME="left"></VARIABLE>
      <VARIABLE TYPE="PyObject*" NAME="expr"></VARIABLE>
      <CODE_SNIPPET>
        right = stack_pop();
        left = stack_pop();
        expr = PyObject_CallMethod(ParsedExpr, "ParsedUnionExpr",
                                   "OO", left, right);
        decref(right);
        decref(left);
        stack_push(expr);
      </CODE_SNIPPET>
    </CODE>
  </RULE>
</RULE_SET>
Since the document type for BisonGen varies between releases, we will not explain all tags used here in detail. In this fragment, two rules are defined, which are both alternatives of the unionExpr non-terminal. The right-hand side of each rule consists of a sequence of symbols; it is just pathExpr for the first rule, and unionExpr '|' pathExpr for the second.
In the second rule, a specific semantic action is defined, which is a call to ParsedExpr.ParsedUnionExpr, which in turn is a class of the abstract syntax. BisonGen will declare three variables in the Bison file, and insert the specified code snippet into the semantic action.
Even though the build process of BisonGen applications has been dramatically simplified recently, the grammar specifications still look quite verbose.
Integration with the lexical analysis follows the usual YACC convention: a yylex function is invoked to return the next token. Token numbers identify tokens. In addition, the yylval variable carries the semantic value.
Older versions of BisonGen generate flex files from token definitions given in XML; the recent versions generate re-style regular expressions from similar XML specifications.
Error handling also follows the YACC tradition: a yyerror function is invoked. Since unwinding out of a bison parser run is not easy, this function normally only sets a global variable, which is then checked when the parser returns.
The BisonGen distribution comes with a short overview of the grammar input language, and a few examples as part of the test suite.
In this test, the three parsers were compared for parsing speed. The test ran 32 example expressions from the XPath specification, parsing each one 100 times. Tests were executed using Python 2.1 on a 600 MHz UltraSparc machine. The list of tests being executed, and the grammars that have been used for the parser generators, are part of the PyXPath distribution [vLo00]. For convenience, the test cases are reproduced below.
exprs = [
    "child::para",
    "child::*",
    "child::text()",
    "child::node()",
    "attribute::name",
    "attribute::*",
    "descendant::para",
    "ancestor::div",
    "ancestor-or-self::div",
    "descendant-or-self::para",
    "self::para",
    "child::chapter/descendant::para",
    "child::*/child::para",
    "/",
    "/descendant::para",
    "/descendant::olist/child::item",
    "child::para[position()=1]",
    "child::para[position()=last()]",
    "child::para[position()=last()-1]",
    "child::para[position()>1]",
    "following-sibling::chapter[position()=1]",
    "preceding-sibling::chapter[position()=1]",
    "/descendant::figure[position()=42]",
    "/child::doc/child::chapter[position()=5]/child::section[position()=2]",
    'child::para[attribute::type="warning"]',
    "child::para[attribute::type='warning'][position()=5]",
    'child::para[position()=5][attribute::type="warning"]',
    "child::chapter[child::title='Introduction']",
    "child::chapter[child::title]",
    "child::*[self::chapter or self::appendix]",
    "child::*[self::chapter or self::appendix][position()=last()]",
    "//element[descendant::y[.='z']][1]"
]
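The timing loop itself is straightforward; a hypothetical reconstruction, where parse stands for the entry point of the parser under test:

import time

def benchmark(parse, exprs, repetitions=100):
    start = time.time()
    for expr in exprs:
        for i in range(repetitions):
            parse(expr)
    return time.time() - start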
The SPARK parser differed from the other two parsers in that it did not perform semantic actions. To allow an estimation of the overhead spent in the semantic actions, the YAPPS parser was run a second time with all semantic actions removed. For the tests, SPARK version 0.6.1 was used.
The BisonGen test case was taken from 4Suite 0.11, which only supports operation in C mode (i.e. without a pure-Python mode).
In addition, the YAPPS parser was run a third time, where re.py was modified to use pre as the regular expression engine, to get an estimate of the relative performance of sre.
| Test Case                        | Time (s) |
|----------------------------------|----------|
| BisonGen (C)                     | 2.1      |
| YAPPS (sre, no semantic actions) | 9.5      |
| YAPPS (sre)                      | 11       |
| YAPPS (pre)                      | 17       |
| SPARK                            | 84       |
There are a number of other parser generators and parser embedding strategies available for Python. Some of those are not easily applicable to XPath parsing; for others, we could not find a simple porting strategy to avoid writing the grammar from scratch every time.
A number of authors have embedded Bison into Python, usually using hand-written parsers and interfacing code. The CORBA IDL compilers of both Fnorb [DSTC99] and OmniORBpy [ATT01] use this strategy. It is very similar to BisonGen in its properties.
PyLR [Cot97] provides a single C module to speed up LR parsing. According to the author, a number of enhancements have to be made before it can be considered complete; unfortunately, it has been that way since 1997. Even though it appears that PyLR would support a custom tokenizer and could deal with the original XPath grammar, it is not clear how exactly to use the tool, given the scarce documentation.
Furthermore, the author reports that the LR engine is quite slow; it is not clear whether PyLR helps in constructing an LALR(1) grammar.
Plex [Ewi01] only supports scanner generation. Due to the special lexical structure of XPath, and the lack of parser generation in Plex, no attempt was made to use this tool.
Trap [Ern99] supports the specification of intermediate representations (IR) as part of the grammar definitions, and subsequently generates parsers that parse into the specified IR, including support for unparsing. It currently uses kwParsing as its underlying engine, so no new insights were expected from using this package.
The YAPPS and SPARK XPath parsers used for this paper are available from http://www.informatik.hu-berlin.de/~loewis/xml. The BisonGen parser is distributed as part of 4Suite [Fou01].
I'd like to thank Andrew Kuchling for collecting Python parser information over a long time as part of the String SIG, and Jeremy Kloth and Uche Ogbuji for their comments on PyXPath.
Of all the parser generators tested, the YAPPS parser is the one that will be used in PyXML, perhaps side-by-side with the BisonGen parser provided by Fourthought.
Even though the LL(1) grammar specification has deficiencies, the entire parser generator is so small and easy to understand that it would make a valuable addition to the standard Python library.
[ASU86] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[ATT01] AT&T. omniORBpy. http://www.uk.research.att.com/omniORB/omniORBpy/
[Ayc98] John Aycock, Compiling Little Languages, IPC 7, Houston, 1998.
[Cho98] Chordate Systems. Gadfly 1.0: An SQL Database in Python. http://www.chordate.com/kwParsing/index.html
[Cot97] S. Cotton. PyLR. http://starship.python.net/crew/scott/PyLR.html
[DSTC99] DSTC. Fnorb. http://www.fnorb.com/
[Ern99] T. Ernst. TRAP. http://www.first.gmd.de/smile/trap/
[Ewi01] Greg Ewing. Plex. http://www.cosc.canterbury.ac.nz/~greg/python/Plex/
[Fou00] Fourthought Inc. BisonGen. ftp://ftp.fourthought.com/pub/BisonGen/
[Fou01] Fourthought Inc. 4Suite. http://4suite.org/download.html
[Kuc00] Andrew Kuchling. Python String-SIG Activities. http://www.amk.ca/python/string.html
[vLo00] Martin v. Löwis. PyXPath. http://www.informatik.hu-berlin.de/~loewis/xml/
[W3C99] W3 Consortium. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, W3C, 1999.
[W3C99a] W3 Consortium. XSL Transformations (XSLT) Version 1.0. http://www.w3.org/TR/xslt, W3C, 1999.