Towards a Standard Parser Generator

Martin v. Löwis
Humboldt-Universität zu Berlin
Institut für Informatik
loewis@informatik.hu-berlin.de

Abstract

Developing parsers for "little" languages is a common task for many software developers. People have frequently requested inclusion of a specific parser generator framework into the Python library. In this paper, we compare several Python parser generators, using the XPath language as an application. Criteria for comparison are ease of use, performance, and ease of deployment. Using these criteria, we give a recommendation for including parser generators into the standard library..

Keywords: Parsers, XPath, YAPPS, Spark, BisonGen

Introduction

While little languages have been discussed in the Python community for quite some time [Ayc98], little progress has been made with including a parser generator in the standard library. This may be partially due to the fact that so far contributions have been made only by authors of the parser toolkits, but not by the users.

This paper is a report on using Python parser generators in an application project, namely in an XPath parser for PyXML. XPath [W3C99] is a query language for XML. An XPath expression can describe a subset of the nodes in an XML document. An XPath implementation then allows evaluating a given expression for a given document.

Even though the major part of developing an XPath implementation is in the logic for evaluating XPath expressions, parsing them turns out to be a challenge on its own. In the 4Suite [Fou01] package, the XPath parser is built using the BisonGen [Fou00] generator. Incorporating this implementation into PyXML proved to be quite challenging. A detailed description of BisonGen is given later; however, the following obstacles should serve as a motivation for developing yet another XPath parser:

It should be noted that some of these problems have been resolved with the 0.11.1 release of 4Suite.

When analysing parser generators, the following absolute requirements where defined:

  1. The generated parsers should allow semantic actions in Python. This requirement is obvious, since it would not be a Python parser generator otherwise. It was no requirement that the generator itself is a Python program.
  2. The generated parsers need to support parsing of Unicode strings. Although this requirement is specific to XPath, Unicode support is desirable whenever the input may come in different encodings.

In addition, the following non-critical criteria are used for ranking the parsers:

  1. Since the XSL Transformations [W3C99a] are a primary application for XPath expressions, and since many short expressions occur in a typical XSLT style sheet, it is desirable that the parser is efficient. It does not need to be super fast, though, since parsing of the style sheet is only a small portion of the complete transformation process.
  2. Ideally, no external C modules are required for the parser, to simplify installation of a generated parser. If C modules are required, then parsers whose C modules are independent from the input language rank better than parsers which generate C code dependent on the input language. If a parser generator uses a generic C module, than this module might get included in the Python core one day, and thus be available for all generated parsers.
  3. A parser generator for Python should be easy to use. This means that the it should be easy to enter the grammar, but it also means that it should be easy to debug the resulting parser. Furthermore, documentation should allow newcomers to get familiar with the generator quickly.

In the remainder of this document, I'll assume that the reader is familiar with the principles of compiler construction. As introductory material, the Dragon Book [ASU86] always serves as a reference. Familiarity with XPath is not assumed, as this language only serves as an example.

Lexical Analysis

Most scanner generators use regular expressions to describe tokenization. Even though all the parser generators described in this paper do support tokenization by means of regular expressions, it quickly turned out that this tokenization process was not useful for XPath. In the XPath specification, special tokenization rules are given as plain English text:

If there is a preceding token and the preceding token is not one of @, ::, (, [, , or an Operator, then a * must be recognized as a MultiplyOperator and an NCName must be recognized as an OperatorName.

...

If the character following an NCName (possibly after intervening ExprWhitespace) is (, then the token must be recognized as a NodeType or a FunctionName.

These disambiguation rules specify that in */para, the asterisk has its wildcard meaning, whereas in 3 * 5, it indicates multiplication. Likewise, in para[last()], last is a function call, whereas in last[1] it is an element name. The disambiguation rules are necessary for XPath to avoid the definition of keywords, since all identifiers (NCNames) should be allowed as XML element and attribute names.

Since the XPath disambiguation operates on lexical context, it does not easily fit into scanner generators. The lex scanner states may allow an implementation of the disambiguation rules, however, we found that they are easy enough to implement using program logic. Thus, we implemented a custom scanner for XPath, using the re module for regular expressions.

A scanner based on regular expressions is usually implemented as an alternative of all token definitions. For XPath, a fragment of this expressions looks like this:



  (?P<Number>\\d+(\\.\\d*)?|\\.\\d+)|


  (?P<VariableReference>\\$""" + QName + """)|


  (?P<NCName>"""+NCName+""")|


  (?P<QName>"""+QName+""")|


  (?P<LPAREN>\\()|


  (?P<RPAREN>\\))|


  (?P<STAR>\\*)|


  ...


  (?P<ExprWhiteSpace>[ \t\n\r]+)


Here, each alternative in the regular expression defines a named group. Scanning proceeds in the following steps:

  1. Given the complete input, match the regular expression with the beginning of the input.
  2. Find out which alternative matched.
  3. If the token is white space, ignore it. Otherwise, append it to the token list.
  4. Continue with step 1, this time starting at the end of the previous token, until all input is consumed.
  5. Given the complete token list, perform the disambiguation, modifying the recognized tokens as specified.
  6. Pass the token list to the parser.

It should be noted that this scanning process requires the complete input. For XPath, this causes no problems, since the input is typically small, and completely provided in the Python source code of the application, or in the XSLT style sheet.

When experimenting with various parsers, adjusting the grammar definition to each parser and fitting in the semantic actions was somewhat time-consuming. Thus, a single hand-written scanner would simplify adjustment to various parser generators. On the other hand, it introduced another requirement on parsers being evaluated: They need to interoperate with a hand-written scanner.

Semantic Actions

Since the ultimate objective of this project was to incorporate 4XPath into PyXML, the existing XPath evaluation library of 4XPath had to be re-used, if possible as-is. That turned out to be quite feasible: The 4XPath BisonGen parser would create class instances in its semantic actions, which represent an abstract syntax of the expression.

For the parser generators studied, it was not difficult to generate the same abstract syntax, in particular since the concrete syntax of all parsers was similar (it could not be identical for all parsers for reasons shown below).

These days, parser generators often provide their own framework for abstract syntaxes and parse trees. We find such frameworks to be rather the source of problems than a solution. It may be easier to generate the abstract syntax using the parser framework (in particular if the parser does not support custom actions), but it is more difficult to process the resulting syntax tree. In the specific application, we would need to apply a transformation from the parser-generated tree to the instances of the classes already defined in 4XPath.

kwParsing

As first parser generator, we looked at kwParsing, by Aaron Watters [Cho98]. It became quickly apparent that this toolkit currently does not support Unicode strings, due to its reliance on the regex module. We have not studied it further, although either a port to re, or an integration of a custom tokenizer may have been feasible.

YAPPS

Following the links on the String SIG page [Kuc00], we found "Yet Another Python Parser System", YAPPS, by Amit Patel. Even though the parser only supports LL(1) languages due to its recursive-descent strategy, it turned out that the XPath grammar could easily be rewritten to be acceptable using standard techniques. For example, the rule 18 of the XPath specification reads



UnionExpr ::=   PathExpr


                | UnionExpr '|' PathExpr


in its original form. In YAPPS, the left recursion can be replaced as



    rule UnionExpr:


        PathExpr UnionExprs


    rule UnionExprs:


          # empty


        | BAR PathExpr UnionExprs


YAPPS operates as a parser generator. The grammar specification is given to a command line tool (which is also written using YAPPS), and a Python file is produced. The generated parser makes use of a library file; the library and the generated parser module together are all that is needed for anybody using the parser

YAPPS is implemented using recursive descent. This makes it possible to understand the generated code easily. For example, the UnionExpr rule translates into a function definition



    def UnionExpr(self):


        PathExpr = self.PathExpr()


        UnionExprs = self.UnionExprs(PathExpr)


        return UnionExprs


It may not be obvious how the return statement comes about; this was generated from semantic actions. The full definition of the rule in the XPath parser reads



    rule UnionExpr:


        PathExpr UnionExprs<<PathExpr>>   


           {{return UnionExprs}}


    rule UnionExprs<<v>>:


           {{return v}}


        | BAR PathExpr UnionExprs<<self.nop(self.UNION,v,PathExpr)>>    


           {{return UnionExprs}}


Each rule can accept arbitrary parameters, which are declared in the rule definition, and passed as parameters using the '<<...>>' braces. Likewise, semantic actions are inserted through '{{...}}' braces.

YAPPS will generate a class that contains a method for each rule. This allows to create a new parser state for each parse process. It also gives easy access to additional functions: In the XPath parser, I have defined a subclass of the generated parser class:



class MyXPath(XPath):


    ...


    UNION = pyxpath.UNION_OPERATOR


    ...


    def nop(self, operator, left, right):


        return self.factory.createNumericOperator(operator, left, right)


This class can go into the grammar definition file. Like YACC, YAPPS supports custom code before and after the grammar definition.

YAPPS parsers integrate with the scanner through a Scanner object, which is passed to the parser as a constructor argument. Even though YAPPS supports definition of tokens in the grammar, we have not used this capability in XPath, since we have provided my own scanner class. The YAPPS parser will only require a token() method from the scanner object, which must return a four-tuple (start, end, token type, token value).

In case of a syntax error, the generated parser will raise a yappsrt.SyntaxError exception. This exception will take the current position and an error message.

Entering the XPath grammar into the parser was straight-forward, as shown above. Debugging the parser turned out to be easy as well, since it was possible to trace the generated Python code in a debugger, or enrich it with debugging statements. The YAPPS distribution comes with manual and a number of example grammars.

SPARK

The SPARK parser was first presented in John Aycock's paper on little languages [Ayc98]. It is a pure-Python framework, which does not require a parser generator. Instead, the grammar is defined by means of __doc__ strings, which are collected using introspection in the SPARK library module.

In SPARK, the UnionExpr production is formulated as follows:



    def p_UnionExpr(self,args):


        """


        UnionExpr ::= PathExpr


        UnionExpr ::= UnionExpr | PathExpr


        """


        return args


Since the method name starts with a p_ prefix, it is treated as a production method. SPARK will analyse its doc string, and associate the production with the method. Every time one of the given production rules matches, it invokes this method.

The semantic action in this code is trivial. It would be easy to define a meaningful action: args is the list of all right-hand side non-terminals. For this rule, it is thus either one or two elements long.

Integration with a custom tokenizer is also straight-forward. All tokens are specified as t_ methods in the GenericScanner subclass. The tokenize() method of this subclass then needs to take care of the XPath disambiguation.

On error, SPARK will invoke an error() method on the parser, which, by default will exit Python.

Since SPARK uses the Earley algorithm, it can parse almost any grammar as-is; this makes entering the grammar straight forward. Unfortunately, this parsing method (and the fact that the parser is generated dynamically) makes it difficult to debug the parser. Since the parsers silently support ambiguous grammars, it might be difficult to find the cause of a syntax error. The documentation primarily consists of the IPC7 paper, and a few examples.

BisonGen

Fourthought Inc has developed the BisonGen framework to implement parsers for their 4Suite package [Fou01]. The parser is defined using an XML syntax. Until recently, the build process of a BisonGen parser was as follows:

  1. BisonGen parses the XML file using PyXML.
  2. It generates a number of files, including
  3. flex, bison, and SWIG are invoked to generate C code,
  4. The C code is compiled to form an extension module.

Recently, this build procedure was completely restructured. Today, BisonGen implements the LALR(1) algorithm itself, not relying on bison anymore. Therefore, the build procedure is simplified to:

  1. BisonGen parses the XML specification of the grammar.
  2. It generates a C file, containing a C-implemented parser, and a Python module that implements the same parsing algorithm in pure Python.
  3. If desired, the C code is compiled for improved performance.

To give an impression of how BisonGen code looks, the UnionExpr production is again presented, as it appears in the 0.11 release of 4Suite.



<RULE_SET NAME="18">


  <NON_TERMINAL>unionExpr</NON_TERMINAL>


  <RULE>


    <SYMBOL TYPE="s">pathExpr</SYMBOL>


    <CODE>


    </CODE>


  </RULE>


  <RULE>


    <SYMBOL TYPE="s">unionExpr</SYMBOL>


    <SYMBOL TYPE="s">'|'</SYMBOL>


    <SYMBOL TYPE="s">pathExpr</SYMBOL>


    <CODE>


      <VARIABLE TYPE="PyObject*" NAME="right"></VARIABLE>


      <VARIABLE TYPE="PyObject*" NAME="left"></VARIABLE>


      <VARIABLE TYPE="PyObject*" NAME="expr"></VARIABLE>


      <CODE_SNIPPET>


        right = stack_pop();


        left = stack_pop();


        expr = PyObject_CallMethod(ParsedExpr, "ParsedUnionExpr", "OO", left, right);


        decref(right);


        decref(left);


        stack_push(expr);


      </CODE_SNIPPET>


    </CODE>


  </RULE>


</RULE_SET>


Since the document type for BisonGen varies between releases, we will not explain all tags used here in detail. In this fragment, two rules are defined, which are both alternatives of the unionExpr non-terminal. The right-hand-side of each rule consists of a sequence of symbols; it is just pathExpr for the first rule, and unionExpr '|' pathExpr for the second.

In the second rule, a specific semantic action is defined, which is a call to ParsedExpr.ParsedUnionExpr, which in turn is a class of the abstract syntax. BisonGen will declare three variables in the Bison file, and insert the specified code snippet into the semantic action.

Even though the build process of BisonGen applications has been dramatically simplified recently, the grammar specifications still look quite verbose.

Integration with the lexical analysis follows the usual YACC convention: an yylex function is invoked to return the next token. Token numbers identify tokens. In addition, the yylval variable carries the semantic value.

Older versions of BisonGen generate flex files from token definitions given XML; the recent versions generate re-style regular expressions from similar XML specifications.

Error handling also follows the YACC tradition: an yyerror function is invoked. Since unwinding out of a bison parser run is not easy, this function normally only sets a global variable, which is then checked when the parser returns.

The BisonGen distribution comes with a short overview of the grammar input language, and a few examples as part of the test suite.

Performance Comparison

In this test, the three parsers where compared for parsing speed. The test ran 32 example expressions from the XPath specification, parsing each one 100 times. Tests where executed using Python 2.1 on a 600 MHz UltraSparc machine. The list of tests being executed, and the grammars that have been used for the parser generators are part of the PyXPath distribution, in [vLo00]. For convenience, the test cases are reproduced below.



exprs = [


    "child::para",


    "child::*",


    "child::text()",


    "child::node()",


    "attribute::name",


    "attribute::*",


    "descendant::para",


    "ancestor::div",


    "ancestor-or-self::div",


    "descendant-or-self::para",


    "self::para",


    "child::chapter/descendant::para",


    "child::*/child::para",


    "/",


    "/descendant::para",


    "/descendant::olist/child::item",


    "child::para[position()=1]",


    "child::para[position()=last()]",


    "child::para[position()=last()-1]",


    "child::para[position()>1]",


    "following-sibling::chapter[position()=1]",


    "preceding-sibling::chapter[position()=1]",


    "/descendant::figure[position()=42]",


    "/child::doc/child::chapter[position()=5]/child::section[position()=2]",


    'child::para[attribute::type="warning"]',


    "child::para[attribute::type='warning'][position()=5]",


    'child::para[position()=5][attribute::type="warning"]',


    "child::chapter[child::title='Introduction']",


    "child::chapter[child::title]",


    "child::*[self::chapter or self::appendix]",


    "child::*[self::chapter or self::appendix][position()=last()]",


    "//element[descendant::y[.='z']][1]"


   ]


The Spark parser differed from the other two parsers in that it did not perform semantic actions. To allow an estimation of the overhead spend in the semantic actions, the YAPPS parser was run a second time with all semantic actions removed. For the tests, Spark version 0.6.1 was used.

The BisonGen test case was taken from 4Suite 0.11, which only supports operation in C mode (i.e. without a pure-Python mode).

In addition, the YAPPS parser was run a third time, where re.py was modified to use pre as the regular expression engine, to get an estimate of the relative performance of sre.

Parser Performance Measurement
Test CaseTime (s)
BisonGen (C)2.1
YAPPS (sre, no semantic actions)9.5
YAPPS (sre)11
YAPPS (pre)17
SPARK84
From these measurements, we draw the following conclusions:

Other Parsers

There are a number of other parser generators and parser embedding strategies available for Python. Some of those are not easily applicable to XPath parsing; for others, we could not find a simple porting strategy to avoid writing the grammar from scratch every time.

Embedding Bison

A number of authors have embedded Bison into Python, usually using hand-written parsers and interfacing code. The CORBA IDL compilers of both Fnorb [DSTC99] and OmniORBpy [ATT01] use this strategy. It is very similar to BisonGen in its properties.

PyLR

PyLR [Cot97] provides a single C module to speed-up LR parsing. According to the author, a number of enhancements have to be made before it can be considered complete. Unfortunately, it has been that way since 1997. Even though it looks like PyLR would support a custom tokenizer and that PyLR was capable of dealing with the original XPath grammar, it is not clear how exactly to use that tool, given the scarce documentation.

Furthermore, the author reports that the LR engine is quite slow; it is not clear whether PyLR helps in constructing an LALR(1) grammar.

Plex

Plex [Ewi01] only supports scanner generation. Due to the special lexical structure of XPath, and the lack of parser generation in Plex, no attempt was made to use this tool.

Trap

Trap [Ern99] supports the specification of intermediate representations (IR) as part of the grammar definitions, and subsequently generates parsers that parse into the specified IR, including support for unparsing. It currently uses kwParsing as its underlying engine, so no new insights were expected from using this package.

Availability

The YAPPS and SPARK XPath parsers used for this paper are available from http://www.informatik.hu-berlin.de/~loewis/xml. The BisonGen parser is distributed as part of 4Suite [Fou01].

Acknowledgements

I'd like to thank Andrew Kuchling for collecting Python parser information over a long time as part of the String SIG, and Jeremy Kloth and Uche Ogbuji for their comments on PyXPath.

Conclusion

Of all the parser generators tested, the YAPPS parser is the one that will be used in PyXML, perhaps side-by-side with the BisonGen parser provided by Fourthought.

Even though the LL(1) grammar specification has deficiencies, the entire parser generator is so small and easy to understand that it would make a valuable addition to the standard Python library.

Literature

[ASU86] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[ATT01] AT&T. omniORBpy. http://www.uk.research.att.com/omniORB/omniORBpy/

[Ayc98] John Aycock, Compiling Little Languages, IPC 7, Houston, 1998.

[Cho98] Chordate Systems. Gadfly 1.0: An SQL Database in Python. http://www.chordate.com/kwParsing/index.html

[Cot97] S. Cotton. PyLR. http://starship.python.net/crew/scott/PyLR.html

[DSTC99] DSTC. Fnorb. http://www.fnorb.com/

[Ern99] T. Ernst. TRAP. http://www.first.gmd.de/smile/trap/

[Ewi01] Greg Ewing. Plex. http://www.cosc.canterbury.ac.nz/~greg/python/Plex/

[Fou00] Fourthough Inc. BisonGen. ftp://ftp.fourthought.com/pub/BisonGen/

[Fou01] Fourthought Inc. 4Suite. http://4suite.org/download.html

[Kuc00] Andrew Kuchling. Python String-SIG Activities. http://www.amk.ca/python/string.html

[vLo00] Martin v. Löwis. PyXPath. http://www.informatik.hu-berlin.de/~loewis/xml/

[W3C99] W3 Consortium. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, W3C, 1999.

[W3C99a] W3 Consortium. XSL Transformations (XSLT) Version 1.0. http://www.w3.org/TR/xslt, W3C, 1999.