Satine: an XML Data Binding Technology for Python

Francesco Garelli
Department of Information Engineering
University of Padua - Italy
garelli@dei.unipd.it

 

Abstract

XML is quickly becoming an important format for exchanging information and services over wide-area networks. Unfortunately, the popular solutions for handling XML [4] documents are not well suited to the task. Technologies such as DOM [6] and SAX [7] have proven effective for short and simple documents, but they become cumbersome when documents are large and varied. XML data binding is a newer approach that looks very promising. Through data binding, an XML document is translated into a native data structure of the programming language in which the application is written. As a result, the developer does not handle XML directly, but works with a corresponding representation in the environment chosen for the application.

This paper describes an XML data binding technology for Python, named Satine. The solution we propose takes advantage of Python's dynamic typing and is considerably simpler than similar approaches for the Java [9] or .NET [10] environments.

Introduction

Emerging distributed middleware aims to provide broad interoperability among applications spread over wide-area networks, in particular over the Internet. The candidate standard protocols the industry is relying on are based on XML dialects, such as SOAP [12] and WSDL [14].
At the same time, many recent applications choose XML as the data format for their documents and configuration parameters. This growing interest calls for easy and effective ways to handle XML.
As XML has been available for some years, many techniques for managing documents have been developed. The most widely used are probably the SAX [7] and DOM [6] interfaces, which are uniform across development environments and programming languages. This consistency is desirable because it reduces the effort required when a developer works on several platforms. Unfortunately, these well-known solutions impose a complex interface that, although tolerable when documents are few and small, is time-consuming to use when an application stores most of its data in XML.
Among the new approaches to this problem, a very interesting one is XML data binding. Data binding is a technique that translates a document into a simple data structure in the application's programming language. The translation is largely automatic and needs minimal support.
As a result, the programmer does not have to deal with the callback parsing of SAX or with the complex structure of a DOM tree, but only with objects and methods of the chosen language. An appropriate library takes care of the conversion: the library depends on the XML dialect and can often be generated automatically from an XML schema.
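To make the contrast concrete, the following sketch reads one attribute and one text value through both standard interfaces, using only the Python standard-library modules xml.dom.minidom and xml.sax. The sample document and the names DOC, read_with_dom, GreetingHandler and read_with_sax are illustrative assumptions and are not part of Satine.

```python
import xml.dom.minidom
import xml.sax

# Illustrative document (an assumption, not taken from the paper).
DOC = '<greeting language="english">hello world</greeting>'


def read_with_dom(text):
    """Walk the DOM tree by hand to reach the attribute and the text."""
    dom = xml.dom.minidom.parseString(text)
    root = dom.documentElement
    language = root.getAttribute("language")
    content = "".join(
        node.data for node in root.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    return language, content


class GreetingHandler(xml.sax.ContentHandler):
    """Collect the same information through SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.language = None
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "greeting":
            self.language = attrs.get("language")

    def characters(self, content):
        # SAX may deliver the text in several pieces.
        self.chunks.append(content)


def read_with_sax(text):
    handler = GreetingHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.language, "".join(handler.chunks)
```

Both functions return the same pair of values, but each needs interface-specific bookkeeping (node types in one case, handler state in the other) that a data binding layer is meant to hide.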
XML data binding techniques for Java [9] and .NET [10] are under development. We propose an analogous technique for Python. In our experience, the Python environment has proved to be ideal: the data binding is easy and does not necessarily depend on an XML schema.
To test our approach, we developed a very simple web server that accepts both SOAP and ordinary HTTP requests.

Data binding

XML documents are composed of XML elements. An XML element is a piece of information that is difficult to describe with the usual types of a programming language. An XML element has named attributes, like an object, but it also has items, like a list. Moreover, an XML element has a type identifier (the tag) and a namespace. No native Python type has all of these properties, which suggests introducing a new type with the required features. In our project we defined such a type and named it xlist. An xlist inherits from the native type list the ability to contain items, together with methods such as append, remove and index.
Unlike its parent, an xlist supports named attributes. An example might be as follows:

1: from satine import xlist    #import the xlist type
2: l = xlist()                 #define the xlist
3:
4: l.append("hello world")     #insert an item
5: l.language = "english"      #add an attribute

 
Figure 1

Note that at line 4 the object l behaves like a Python list and accepts a new item. At line 5, l accepts a new attribute, as an ordinary Python object does. Another relevant difference between xlists and lists is their string representation. The representation of an xlist shows both the items, which come from its base type list, and its attributes. Figure 2 shows the representation of the object l defined in the previous example.

<satine:xlist language="english">
hello world
</satine:xlist>

 
Figure 2
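The behaviour shown in Figures 1 and 2 can be approximated in pure Python. The sketch below is only an illustration under stated assumptions: the class name XListSketch and its fixed prefix and tag fields are hypothetical, and Satine's real xlist is implemented in a C extension whose internals may differ.

```python
from xml.sax.saxutils import escape, quoteattr


class XListSketch(list):
    """List items plus instance attributes, rendered as one XML element."""

    prefix = "satine"   # assumed fixed prefix, as in Figure 2
    tag = "xlist"       # assumed fixed tag, as in Figure 2

    def __repr__(self):
        name = f"{self.prefix}:{self.tag}"
        # Instance attributes live in __dict__, as for any Python object.
        attrs = "".join(
            f" {key}={quoteattr(str(value))}"
            for key, value in vars(self).items()
        )
        if not self:
            return f"<{name}{attrs}/>"
        body = "\n".join(
            repr(item) if isinstance(item, XListSketch) else escape(str(item))
            for item in self
        )
        return f"<{name}{attrs}>\n{body}\n</{name}>"


l = XListSketch()
l.append("hello world")     # list behaviour
l.language = "english"      # object behaviour
```

Because a subclass of list still owns an instance dictionary, no extra machinery is needed to store the attributes; only the string representation has to be redefined.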

This representation suggests a natural binding between an XML element and an xlist: element attributes correspond to xlist attributes, and nested items correspond to xlist items. Unfortunately, this binding is not complete, because it is valid only when the prefix and the tag are satine and xlist respectively. Ordinary XML documents, of course, contain elements from any namespace and with any tag.

A trivial solution is to define some special attributes (tag, prefix and uri) that complete the binding for each xlist instance. We chose a different approach instead. Elements that have the same tag in the same XML namespace describe the same concept; for example, in XHTML [13], any element with the tag table always introduces a table. Of course, a particular element often depends on other properties besides the tag: in XHTML, the content and appearance of a table depend on the attributes and items of the corresponding element.
This observation shows that a tag groups different elements into a single class. In object-oriented programming languages, the same relation holds between objects and their classes. Hence we consider the name of a Python class to be the most appropriate binding for a tag. An element with the tag Envelope can be defined as follows:

class Envelope(xlist): pass

e = Envelope()

 
Figure 3

Its representation is:

<satine:Envelope/>

 
Figure 4

A similar argument holds for the uri and prefix properties. In this case, elements are grouped by namespace. In the data binding, a Python module, which groups classes with a common purpose, plays a similar role. This suggests that the uri and the prefix could correspond to a module name. Unfortunately, a uri is rarely a valid module name. Satine addresses this problem with a special function, xspace, which binds a uri to the module in which the function is called. For instance, in a module that contains the expression

xspace(soap="http://schemas.xmlsoap.org/soap/envelope/")

any object of class Envelope would be represented as <soap:Envelope/>.

Validation

XML schemas are documents that define the structure and the elements of an XML namespace. A technology for XML data binding can take real advantage of XML schemas: while translating XML to objects, the converter can check whether the document complies with its schemas. In case of a violation, the converter stops and reports the error. The process of checking whether a document complies with its schemas is called XML validation.

Satine provides a very flexible validation that covers many of the features defined in the XMLSchema specification. When defining a new class that extends the type xlist, the developer can specify which attributes are valid, together with their types and default values, and can also decide which types are valid as items. These constraints are set using two optional class fields.
The field __attrs__ is an XML string that defines the accepted attributes. This field has a fixed syntax: for each attribute to be defined, an XML element gives the XML datatype, in accordance with the XMLSchema datatypes [8], and the string that follows it is the name of the attribute. Figure 5 shows a simplified class definition for the SOAP [12] element Envelope; line 2 defines an attribute encodingStyle, and the element <xsd:string> imposes the type string from the XMLSchema namespace.

In a similar fashion, the field __items__ defines which items are valid. Again the field is an XML string with a special syntax, similar to that of regular expressions, which we call XML Regular Expressions (XRE). Unlike in regular expressions, the atomic item is not a character but an XML element. Special characters such as '?' and '*' have the same meaning as in regular expressions, that is, they define the valid repetitions of the preceding element. In Figure 5, line 3 states that any Envelope object may have an optional Header element and must have exactly one Body element.

1: class Envelope(xlist):
2:   __attrs__ = "<xsd:string>encodingStyle"
3:   __items__ = "<soap:Header>?<soap:Body>"

 
Figure 5

We chose this special syntax instead of the XMLSchema syntax both for performance and to keep the code compact. In any case, we provide a tool that generates Python classes from XMLSchema documents.
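The two constraint strings can be interpreted with very little code if an XRE is mapped onto an ordinary regular expression. The sketch below shows one such mapping using Python's re module; it is an assumption about how the syntax could be processed, not Satine's validator, and the helper names parse_attrs, xre_to_regex and items_are_valid are hypothetical.

```python
import re

_ELEMENT = re.compile(r"<[^<>]+>")
_ATTR = re.compile(r"<([^<>]+)>\s*([\w.-]+)")


def parse_attrs(spec):
    """Map each attribute name to its declared datatype."""
    return {name: datatype for datatype, name in _ATTR.findall(spec)}


def xre_to_regex(xre):
    """Turn e.g. '<soap:Header>?<soap:Body>' into a compiled regex.

    Each element token becomes one non-capturing group, so that a
    following '?', '*' or '+' applies to the whole element.
    """
    pattern = _ELEMENT.sub(lambda m: f"(?:{re.escape(m.group(0))})", xre)
    return re.compile(pattern)


def items_are_valid(xre, child_tags):
    """Check a sequence of child tags against the XRE."""
    text = "".join(f"<{tag}>" for tag in child_tags)
    return xre_to_regex(xre).fullmatch(text) is not None


ENVELOPE_ITEMS = "<soap:Header>?<soap:Body>"
```

With this mapping, a sequence containing only a Body, or a Header followed by a Body, is accepted, while a missing or repeated Body is rejected.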

Implementation details

At present we have developed a basic implementation of Satine that supports all the major features. This implementation is a Python library written in both Python and C. The C extension module provides the fundamental classes and functions. In particular, the module satine defines the type xlist described above. It also defines the two functions xml2py and py2xml, which convert XML to xlists and xlists to XML respectively.
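The round trip performed by such a pair of functions can be sketched with the standard-library module xml.etree.ElementTree. The code below is only an illustration: Node, xml_to_py and py_to_xml are hypothetical names, the tag and attributes are stored as plain fields rather than through the class-based binding described earlier, and surrounding whitespace in text is discarded for simplicity.

```python
import xml.etree.ElementTree as ET


class Node(list):
    """Hypothetical stand-in for xlist: items plus a tag and attributes."""

    def __init__(self, tag, attrs=None, items=()):
        super().__init__(items)
        self.tag = tag
        self.attrs = dict(attrs or {})


def xml_to_py(text):
    """Parse XML text into nested Node objects (text becomes str items)."""
    def convert(elem):
        node = Node(elem.tag, elem.attrib)
        if elem.text and elem.text.strip():
            node.append(elem.text.strip())
        for child in elem:
            node.append(convert(child))
            if child.tail and child.tail.strip():
                node.append(child.tail.strip())
        return node
    return convert(ET.fromstring(text))


def py_to_xml(node):
    """Serialize nested Node objects back to XML text."""
    def convert(node):
        elem = ET.Element(node.tag, node.attrs)
        last = None  # most recent child element, which owns trailing text
        for item in node:
            if isinstance(item, Node):
                last = convert(item)
                elem.append(last)
            elif last is None:
                elem.text = (elem.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        return elem
    return ET.tostring(convert(node), encoding="unicode")
```

After conversion, the document is navigated with ordinary list indexing and attribute access, and serializing the result reproduces the input text for simple documents.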
Satine allows queries on the information stored in an xlist through the method query. This method has the signature:

<xlist>.query(<pattern>[,<style>])

The second parameter sets the query language. The default style is XRE. When this style is used, the parameter pattern is defined as follows:

<pattern> := """<XML regular expression>|<id>[,<id>]*"""

The part before the pipe is the comparison pattern and the part after the pipe is the extraction pattern. The former is compared with the items of the xlist. For each match, the method stores all the attributes whose identifiers appear in the extraction pattern. If the extraction pattern is empty, the method stores the matching items themselves.
An example query might be:

soap_message.query("<soap:Envelope><soap:Header>?mustUnderstand")

 
Figure 6

This query extracts the attribute mustUnderstand from the XML element Header inside a SOAP message. The other supported styles are 'tag' and 'pyfun': the former retrieves all items with a particular tag, and the latter uses a programmer-defined function for the comparison.
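The split between comparison pattern and extraction pattern can be illustrated with a deliberately reduced sketch. It follows the pattern grammar given above, assumes that the comparison part contains no pipe character and consists of a single element token, and represents items as (tag, attributes) pairs; the names split_query_pattern and query_sketch are hypothetical and do not reproduce Satine's query method.

```python
def split_query_pattern(pattern):
    """Split '<XRE>|id1,id2' into (comparison, [identifiers])."""
    comparison, _sep, extraction = pattern.partition("|")
    ids = [name.strip() for name in extraction.split(",") if name.strip()]
    return comparison, ids


def query_sketch(items, pattern):
    """Match items against a single-element pattern and extract results.

    items is a list of (tag, attrs) pairs standing in for xlist items.
    """
    comparison, ids = split_query_pattern(pattern)
    wanted_tag = comparison.strip()[1:-1]   # drop the angle brackets
    results = []
    for tag, attrs in items:
        if tag != wanted_tag:
            continue
        if ids:
            # Store the requested attributes of the matching item.
            results.extend(attrs[name] for name in ids if name in attrs)
        else:
            # Empty extraction pattern: store the matching item itself.
            results.append((tag, attrs))
    return results


message_items = [
    ("soap:Header", {"mustUnderstand": "1"}),
    ("soap:Body", {}),
]
```

Calling query_sketch(message_items, "<soap:Header>|mustUnderstand") collects the attribute value, while query_sketch(message_items, "<soap:Body>") collects the Body item itself.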
Similar features are provided by the method visit. This method traverses an xlist and executes a callback function each time a match is found. Through the method visit, Satine offers a callback parser similar to SAX. However, while SAX deals only with the document structure (start tags, end tags and text), visit operates on the document content.
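A traversal of this kind amounts to a depth-first walk that tests each item and invokes the callback on matches. The following sketch assumes a predicate function as the matching criterion; Tagged and visit_sketch are hypothetical names, and the real method's signature is not given here.

```python
class Tagged(list):
    """Hypothetical stand-in for an xlist subclass with a tag."""

    def __init__(self, tag, items=()):
        super().__init__(items)
        self.tag = tag


def visit_sketch(node, matches, callback):
    """Depth-first walk calling callback(item) for every matching item."""
    for item in node:
        if matches(item):
            callback(item)
        if isinstance(item, list):
            visit_sketch(item, matches, callback)


page = Tagged("html", [
    Tagged("body", [
        Tagged("p", ["one"]),
        Tagged("p", ["two"]),
    ]),
])

paragraphs = []
visit_sketch(
    page,
    lambda item: getattr(item, "tag", None) == "p",
    lambda item: paragraphs.append(item[0]),
)
```

The callback receives whole items, already converted, rather than separate start-tag, text and end-tag events.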

The other modules offer further features. The module satine.dt contains bindings to validate any XML datatype [8]. The module satine.stream supports data binding from streams, such as files. Notably, Satine can convert fragments of a file at arbitrary positions: a developer is not required to convert an entire document, which could be very large, but only the parts that are in use. Other modules and features are described in [1] and [2].
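The benefit of converting only part of a large document can be illustrated with the standard-library function xml.etree.ElementTree.iterparse. This is a different technique from the one in satine.stream: it reads the stream sequentially and does not jump to arbitrary positions. The function name iter_fragments and the sample data are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(stream, wanted_tag):
    """Yield each wanted element as soon as it is complete, then free it."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == wanted_tag:
            yield elem.tag, dict(elem.attrib), (elem.text or "")
            elem.clear()   # release the subtree that was just consumed


# Illustrative in-memory stream; a real application would open a file.
sample = io.BytesIO(
    b'<book><chapter n="1">Mowgli</chapter><chapter n="2">Kaa</chapter></book>'
)
chapters = list(iter_fragments(sample, "chapter"))
```

Only the elements of interest are materialized as Python data, so memory use stays bounded by the size of one fragment rather than the whole document.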

The current implementation has proved to be very efficient. We compared the data binding performance of Satine, xml_objectify [11] and Java DOM [5]. Satine is much faster than xml_objectify, and it is often comparable with the DOM implementation provided by Xerces 2.3 for Java. The measurements were obtained by translating an XML version of the novel "The Jungle Book" by Rudyard Kipling on a Pentium 4 at 1600 MHz.

Satine WS

After a first prototype was available, we tried to develop a typical application. Our aim was to test the library interface in a realistic environment and to understand how Satine could enhance productivity.
The result of this effort is a toy framework, named Satine WS, that makes the development of web services quite easy. Satine WS contains an HTTP server that processes document-style SOAP requests. Each SOAP request is converted into a Python object and processed using Satine queries. The server also accepts requests from Internet browsers and converts them into XML using predefined templates. As a result, developers can access a web service through a friendly web interface, both for testing and for administration.
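The core step of handling a document-style SOAP request, namely locating the payload inside the Body and handing it to application code, can be sketched with the standard library alone. The example below does not use Satine or Satine WS; dispatch_soap, the handlers table and the listReviews operation are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def dispatch_soap(request_text, handlers):
    """Find the first element in the SOAP Body and call its handler."""
    envelope = ET.fromstring(request_text)
    if envelope.tag != f"{{{SOAP_NS}}}Envelope":
        raise ValueError("not a SOAP 1.1 envelope")
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP Body is missing or empty")
    payload = body[0]
    # Strip the '{uri}' part that ElementTree prepends to qualified tags.
    local_name = payload.tag.rpartition("}")[2]
    return handlers[local_name](payload)


request = (
    f'<soap:Envelope xmlns:soap="{SOAP_NS}">'
    '<soap:Body><listReviews cuisine="thai"/></soap:Body>'
    '</soap:Envelope>'
)
handlers = {"listReviews": lambda element: element.get("cuisine")}
```

In a data binding setting, the handler would receive a bound Python object instead of a raw element, and the lookup of the payload would be expressed as a query.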
Figures 7 and 8 show a very simple application that manages online reviews of restaurants offering different cuisines. The web service has been implemented in about 60 lines of Python code. The application supports both HTTP/HTML access (Figure 7) and SOAP access (Figure 8). For SOAP requests, the system is able to validate the request before it is processed by the web service.
A comprehensive description of Satine WS is beyond the scope of this paper. Further information is available at [3].



 
Figure 7

Figure 8

Conclusions

XML data binding makes the management of XML easy and efficient. Satine is a data binding technology for Python that offers a simple interface and good performance. At present we are improving the XMLSchema support and working on storage in relational databases. We are also working on a possible integration between Software Architectures and Python.

Availability

The library and its documentation are available at http://satine.sourceforge.net. The project is released under the GNU LGPL.

References

[1] Francesco Garelli, Carlo Ferrari: A Dynamic Model for Mapping XML Elements in a Object-Oriented Fashion. CoopIS/DOA/ODBASE 2002: 1255-1272.
[2] Francesco Garelli. Satine Cookbook. Technical Report at the Department of Information Engineering, University of Padua. December 2001
[3] Francesco Garelli. Satine WS: a Web Services application server for Python. December 2002. http://satine.sourceforge.net/ws.html
[4] Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., eds. Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/2000/REC-xml-20001006
[5] JDom.org. JDOM. http://www.jdom.org/
[6] Le Hors, A., ed. Document Object Model (DOM) Level 3 Core Specification. http://www.w3.org/TR/2001/WDDOM-Level-3-Core-20010126/
[7] Megginson Technologies. SAX 2.0: The Simple API for XML. http://www.megginson.com/SAX/
[8] Biron, P. and Malhotra, A., eds. XML Schema Part 2: Datatypes. http://www.w3.org/TR/xmlschema-2/
[9] Sun Microsystems. The Java™ Architecture for XML Binding. User's Guide. May 2001. http://java.sun.com/xml/jaxb/jaxb-docs.pdf
[10] Microsoft. XSD Compiler, .NET Development. http://msdn.microsoft.com
[11] David Mertz, Data Masseur. On the Pythonic Treatment of XML Documents As Objects. http://gnosis.cx/publish/programming/xml_matters_1.txt
[12] Don Box, David Ehnebuske, eds. Simple Object Access Protocol (SOAP) 1.1. http://www.w3.org/TR/SOAP/
[13] Steven Pemberton, Daniel Austin, eds. XHTML™ 1.0 The Extensible HyperText Markup Language. http://www.w3.org/TR/xhtml1
[14] Erik Christensen, Francisco Curbera, eds. Web Services Description Language (WSDL) 1.1. http://www.w3.org/TR/wsdl