Francesco Garelli Department of Information Engineering
University of Padua - Italy
garelli@dei.unipd.it
XML is quickly becoming an important protocol for exchanging information and services over wide-area networks. Unfortunately, the popular solutions for managing XML [4] documents are poorly suited to this role. Technologies such as DOM [6] and SAX [7] have proven effective for short and simple documents, but they become complex to use when the size and variety of the documents grow. XML data binding is a new and promising approach. Through data binding, any XML document is translated into a native data structure of the programming language in which the application is written. As a result, the developer does not handle XML directly, but a corresponding representation in the environment chosen for the application.
This paper describes an XML data binding technology for Python, namely Satine XML data binding. The solution we propose takes advantage of Python's dynamic typing and is considerably simpler than similar approaches in the Java [9] or .NET [10] environments.
Emerging distributed middleware aims to provide broad interoperability
among applications spread over wide-area networks, in particular over
the Internet. The candidate standard protocols the industry is relying
on are based on XML dialects, such as SOAP [12] or WSDL [14].
At the same time, most recent applications choose XML as the
data format for their documents and configuration parameters. This
growing interest calls for easy and effective ways to handle XML.
XML has been available for some years, and many techniques for
managing documents have been developed. The most widely used are
probably the SAX [7] and DOM [6] interfaces, which are uniform
across development environments and programming languages.
This consistency is desirable because it reduces the effort when a
developer deals with different platforms. Unfortunately, these
well-known solutions impose a complex interface that, although
tolerable when documents are few and small, is time-consuming to use
when applications store most of their data in XML.
Among the new approaches to this problem, a very interesting
one is XML data binding. Data binding is a technique that
translates a document into a simple data structure in the
application's programming language. The translation is largely
automatic and needs minimal support. As a result, the programmer
does not have to deal with SAX callback parsing or with
the complex structure of a DOM tree, but only with objects and methods
of his favorite language. An appropriate library takes care of the
conversion: the library depends on the XML dialect and can often be
generated from an XML schema without any effort.
XML data binding techniques for Java [9] and .NET [10] are
under development. We propose an analogous technique for
Python. In our experience, the Python environment has proved to be
ideal: the data binding is easy and does not necessarily depend on
an XML schema.
In order to test our approach, we developed a very simple web server
that accepts both SOAP and ordinary HTTP requests.
XML documents are composed of XML elements. An XML element is a piece of
information that is difficult to describe in ordinary programming languages.
An XML element has named attributes, like an object, but it also has items,
like a list. Moreover, an XML element has a type identifier, the tag, and
a namespace. No native Python type has all of these properties. This observation
suggests introducing a new type with the required features. In our project
we defined a new type named xlist. An xlist inherits
from the native type list the capability of containing items, together with
methods such as append, remove and index.
Unlike its parent, an xlist supports named attributes. An example
definition might be as follows:
from satine import xlist    # import the xlist type from the extension
l = xlist()                 # define the xlist
l.append("hello world")     # insert an item
l.language = "english"      # add an attribute
Figure 1
Interestingly, the call to append shows that the object l behaves like a Python list and accepts a new item. The assignment to l.language shows that l also accepts a new attribute, as a normal Python object does. Another relevant difference between xlists and lists is their string representation. The representation of an xlist shows both the items, which come from its base type list, and its attributes. Figure 2 shows the representation of the object l defined in the previous example.
<satine:xlist language="english">
hello world
</satine:xlist>
Figure 2
Evidently, this representation suggests a natural binding between an XML element and an xlist: element attributes correspond to xlist attributes, and nested items correspond to xlist items. Unfortunately, this binding is incomplete, because it is valid only when the prefix and the tag are satine and xlist respectively. Ordinary XML documents, of course, allow elements from any namespace and with any tag.
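The behaviour described so far can be approximated in plain Python. The following is a minimal sketch, not Satine's implementation (which is provided by a C extension): the class name XListSketch and its method to_xml are hypothetical, and the output is simplified to a single line with the fixed name satine:xlist.

```python
from xml.sax.saxutils import escape, quoteattr


class XListSketch(list):
    """Hypothetical stand-in for xlist: a list that also carries named attributes."""

    def to_xml(self, prefix="satine", tag="xlist"):
        # Instance attributes become XML attributes.
        attrs = "".join(
            f" {name}={quoteattr(str(value))}"
            for name, value in sorted(vars(self).items())
        )
        name = f"{prefix}:{tag}"
        if not self:
            return f"<{name}{attrs}/>"
        # List items become nested content; nested sketches are rendered recursively.
        body = "".join(
            item.to_xml() if isinstance(item, XListSketch) else escape(str(item))
            for item in self
        )
        return f"<{name}{attrs}>{body}</{name}>"


l = XListSketch()
l.append("hello world")   # list behaviour
l.language = "english"    # object behaviour
print(l.to_xml())
```

Because the prefix and the tag are fixed defaults here, the sketch has the same limitation as the binding discussed above.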
A trivial solution is to define some special attributes,
tag, prefix and uri, that complete the binding
for each xlist instance. We chose a different approach instead.
Elements that have the same tag in the same XML namespace describe
an identical concept; for example, in XHTML [13], any element with the tag
table always introduces a table. Of course, the particular element usually
depends on other properties besides the tag: in XHTML, the content and
appearance of a table depend on the attributes and items of the corresponding
element. This consideration shows that a tag groups different elements into a
single class. In object-oriented programming languages, the same relation exists
between objects and their classes. Hence we consider the name of a Python class
to be the most appropriate binding for a tag. An example of how to
define an element with tag Envelope is:
class Envelope(xlist): pass
e = Envelope()
Figure 3
Its representation is:
<satine:Envelope/>
Figure 4
A similar argument holds for the uri and prefix properties. In this case, elements are grouped according to a namespace. In the data binding, a Python module, which groups classes with a common purpose, plays a similar role. This suggests that the uri and prefix might correspond to a module name. Unfortunately, a uri is rarely a valid module name. Satine addresses this problem with a special function, xspace, that binds a uri to the module in which the function is called. For instance, in a module that contains the expression
xspace(soap="http://schemas.xmlsoap.org/soap/envelope/")
any object of class Envelope would be represented as <soap:Envelope/>.
XML schemas are documents that define the structure and the elements of an XML namespace. A technology for XML data binding can take real advantage of XML schemas. While translating XML to objects, the converter can check whether the document complies with its schemas. In case of a violation, the converter stops and reports the error. The process of checking whether a document complies with its schemas is called XML validation.
Satine provides a very flexible validation that covers many of the features
defined in the XMLSchema specification. When defining a new class that
extends the type xlist, the developer can specify which attributes are valid,
their types and their default values. Moreover, he can decide which types
are valid as items. These constraints are set using two optional class
fields.
The field __attrs__ is an XML string defining the accepted
attributes. This field has a fixed syntax: for each attribute to be defined,
an XML element gives the XML datatype, in accordance with the XMLSchema
datatypes [8], and the string that follows it is the name
of the attribute. Figure 5 shows a simplified class definition for
the SOAP [12] element Envelope; line 2
defines an attribute encodingStyle:
the element <xsd:string> imposes the type
string from the XMLSchema namespace.
In a similar fashion, the field __items__ defines
which items are valid. Again the field is an XML string with a special
syntax, similar to regular expression syntax, which we call XML Regular
Expressions (XRE). Unlike in regular expressions, the atomic item is
not a character but an XML element. Special characters,
such as '?' and '*', have the same meaning as in regular expressions,
that is, they define the valid repetitions of the preceding element. In
Figure 5, line 3 states that any Envelope object may have an optional
Header element and a single mandatory Body element.
class Envelope(xlist):
    ...
Figure 5
We decided to use this special syntax instead of the XMLSchema syntax both for performance and to keep the code compact. In any case, we provide a tool that generates Python classes from XMLSchema documents.
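The idea of a regular expression whose atoms are elements rather than characters can be illustrated with the standard re module. The sketch below is not Satine's validator, and the concrete pattern syntax is an assumption (each atom written as <prefix:tag/>, optionally followed by '?', '*' or '+'); compile_xre_sketch and items_are_valid are hypothetical names.

```python
import re

# One atom: an element name in angle brackets plus an optional repetition mark.
_TOKEN = re.compile(r"<([\w:.-]+)/?>([?*+]?)")


def compile_xre_sketch(pattern):
    """Turn an element-level pattern into an ordinary regex over a tag sequence."""
    parts = []
    pos = 0
    for match in _TOKEN.finditer(pattern):
        stray = pattern[pos:match.start()].strip()
        if stray:
            raise ValueError(f"unexpected text in pattern: {stray!r}")
        tag, repeat = match.groups()
        # Each tag is encoded as "tag;" and wrapped in a group, so that the
        # character-level engine repeats whole elements, not single characters.
        parts.append(f"(?:{re.escape(tag)};){repeat}")
        pos = match.end()
    stray = pattern[pos:].strip()
    if stray:
        raise ValueError(f"unexpected text in pattern: {stray!r}")
    return re.compile("".join(parts))


def items_are_valid(compiled, tags):
    """Check a sequence of item tags against a compiled pattern."""
    return compiled.fullmatch("".join(f"{tag};" for tag in tags)) is not None


# An optional Header followed by exactly one Body, as described for Envelope.
envelope_items = compile_xre_sketch("<soap:Header/>?<soap:Body/>")
print(items_are_valid(envelope_items, ["soap:Header", "soap:Body"]))
print(items_are_valid(envelope_items, ["soap:Header"]))
```

Reusing the character-level engine keeps the sketch short; a dedicated matcher over element objects would avoid the string encoding step.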
At the moment we have developed a basic implementation of Satine that
supports all major features. This implementation is a Python library
written in both Python and C. The C extension module provides the fundamental
classes and functions. In particular, the module satine defines
the type xlist described above. It also defines the
two functions xml2py and py2xml, which convert
XML to xlists and xlists to XML respectively.
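A two-way conversion of this kind can be sketched with the standard library. The code below is only an approximation of the idea, assuming a hypothetical registry CLASSES that maps tags to classes; xml2py_sketch and py2xml_sketch are illustrative names, namespaces are ignored, and surrounding whitespace in text is discarded.

```python
import xml.etree.ElementTree as ET


class XListSketch(list):
    """Hypothetical stand-in for xlist."""


class Review(XListSketch):
    pass


class Name(XListSketch):
    pass


# Hypothetical registry; unknown tags fall back to the base class.
CLASSES = {"Review": Review, "Name": Name}


def xml2py_sketch(element):
    """Convert an ElementTree element into nested list-like objects."""
    obj = CLASSES.get(element.tag, XListSketch)()
    for name, value in element.attrib.items():
        setattr(obj, name, value)          # XML attributes -> Python attributes
    if element.text and element.text.strip():
        obj.append(element.text.strip())   # text -> string item
    for child in element:
        obj.append(xml2py_sketch(child))   # nested element -> nested object
        if child.tail and child.tail.strip():
            obj.append(child.tail.strip())
    return obj


def py2xml_sketch(obj):
    """Convert the objects back; the tag is the class name."""
    element = ET.Element(type(obj).__name__, {k: str(v) for k, v in vars(obj).items()})
    last = None
    for item in obj:
        if isinstance(item, XListSketch):
            last = py2xml_sketch(item)
            element.append(last)
        elif last is None:
            element.text = (element.text or "") + str(item)
        else:
            last.tail = (last.tail or "") + str(item)
    return element


SOURCE = '<Review cuisine="thai"><Name>Bangkok</Name></Review>'
review = xml2py_sketch(ET.fromstring(SOURCE))
round_trip = ET.tostring(py2xml_sketch(review), encoding="unicode")
print(review.cuisine, review[0][0], round_trip)
```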
Satine allows queries on the information stored in an xlist through
the method query. This method has the signature:
<xlist>.query(<pattern>[,<style>])
The second parameter sets the language for the query. The default style is XRE. When this style is used, the parameter pattern is defined as follows:
<pattern> := """<XML regular expression>|<id>[,<id>]*"""
The statement before the pipe is the comparison pattern and
the statement after the pipe is the extraction pattern. The
former is compared with the items of the xlist. For each match,
the function stores all the attributes whose identifiers are in the
extraction pattern.
If the extraction pattern is empty, the function stores the matching
items.
An example query might be:
soap_message.query("<soap:Envelope><soap:Header>?mustUnderstand")
Figure 6
This query extracts the attribute mustUnderstand from an XML
element Header inside a SOAP message. Other supported styles are 'tag' and
'pyfun'; the former retrieves all items with a particular
tag, the latter uses a programmer-defined function for the comparison.
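The split between a comparison pattern and an extraction pattern can be illustrated on top of ElementTree. This is a deliberately reduced sketch, not the Satine query method: query_sketch is a hypothetical function, the comparison part is limited to a single element name instead of a full XRE, and the sample message is invented for the example.

```python
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"
# Invented sample message, used only to exercise the sketch.
MESSAGE = f"""
<soap:Envelope xmlns:soap="{SOAP}">
  <soap:Header mustUnderstand="1"/>
  <soap:Body/>
</soap:Envelope>
"""


def query_sketch(items, pattern, namespaces):
    """Match items against 'comparison|extraction' and collect the results."""
    comparison, _, extraction = pattern.partition("|")
    tag = comparison.strip().strip("<>/")
    prefix, _, local = tag.rpartition(":")
    qualified = f"{{{namespaces[prefix]}}}{local}" if prefix else local
    ids = [name.strip() for name in extraction.split(",") if name.strip()]
    results = []
    for item in items:
        if item.tag != qualified:
            continue
        if ids:
            # Non-empty extraction pattern: store the named attributes.
            results.append(tuple(item.get(name) for name in ids))
        else:
            # Empty extraction pattern: store the matching item itself.
            results.append(item)
    return results


envelope = ET.fromstring(MESSAGE)
ns = {"soap": SOAP}
flags = query_sketch(envelope, "<soap:Header>|mustUnderstand", ns)
bodies = query_sketch(envelope, "<soap:Body>|", ns)
print(flags, [body.tag for body in bodies])
```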
Similar features are provided by the method visit. This method
visits an xlist and executes a callback function each time a match
is found. Through the method visit, Satine offers a callback
parser similar to SAX. However, while SAX deals only with the document
structure (start tags, end tags and text), the method visit operates
on the document content.
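The difference between structural callbacks and content callbacks can be seen with the standard library alone. In the sketch below, the SAX handler receives start, end and text events, while visit_sketch (a hypothetical function standing in for the idea of visit, with a programmer-defined predicate) calls back only on elements whose content matches. The sample document is invented.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = (
    '<reviews>'
    '<review stars="4">Good pasta</review>'
    '<review stars="2">Slow service</review>'
    '</reviews>'
)


class EventLog(xml.sax.ContentHandler):
    """SAX reports structure: start tags, end tags and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


log = EventLog()
xml.sax.parseString(DOC.encode("utf-8"), log)


def visit_sketch(element, matches, callback):
    """Walk the tree and run the callback on every matching element."""
    if matches(element):
        callback(element)
    for child in element:
        visit_sketch(child, matches, callback)


good = []
visit_sketch(
    ET.fromstring(DOC),
    matches=lambda e: e.tag == "review" and int(e.get("stars", "0")) >= 3,
    callback=lambda e: good.append(e.text),
)
print(len(log.events), good)
```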
The other modules offer further features. The module satine.dt contains the bindings needed to validate any XML datatype [8]. The module satine.stream supports data binding from streams, such as files. Notably, Satine can convert fragments of a file at random positions: a developer is not required to convert an entire document, which could be very large, but only the parts that are in use. Other modules and features are described in [1] and [2].
The current implementation has proved to be very efficient. We compared the data binding performance of Satine, xml_objectify [11] and Java DOM [5]. Satine is much faster than xml_objectify, and it is often comparable with the DOM implementation provided in Xerces 2.3 for Java. The measurements were obtained by translating the novel "The Jungle Book" by Rudyard Kipling, in XML format, on a Pentium 4 at 1600 MHz.
Once a first prototype was available, we tried to develop a typical
application. Our aim was to test the library interface in a realistic
environment and to understand how Satine could enhance productivity.
The result of this effort is a toy framework, named Satine WS, that makes
the development of web services quite easy. Satine WS includes an HTTP
server that processes document-style SOAP requests. The SOAP request
is converted into a Python object and processed using Satine queries.
The server also accepts requests from Internet browsers and converts them
into XML using predefined templates. As a result, developers can
access a web service through a friendly web interface, both for testing
and for administration.
Figures 7 and 8 show a very simple application that manages online
reviews of restaurants offering different cuisines. The web service has
been implemented in about 60 lines of Python code. The application supports
both HTTP/HTML access (Figure 7) and SOAP access (Figure 8). In the case
of SOAP requests, the system is able to validate the request before
it is processed by the web service.
A comprehensive description of Satine WS is beyond the scope of
this paper. Further information is available in [3].
Figure 7 (HTTP/HTML access)
Figure 8 (SOAP access)
XML data binding makes the management of XML easy and efficient. Satine is a data binding technology for Python that offers a simple interface with good performance. At the moment we are improving the XMLSchema support and working on storage in relational databases. Finally, we are working on a possible integration between Software Architectures and Python.
The library and its documentation are available at http://satine.sourceforge.net. The project is under the GNU LGPL license.
[1] Francesco Garelli, Carlo Ferrari. A Dynamic Model for Mapping
XML Elements in an Object-Oriented Fashion. CoopIS/DOA/ODBASE 2002: 1255-1272.
[2] Francesco Garelli. Satine Cookbook. Technical Report at the
Department of Information Engineering, University of Padua. December 2001
[3] Francesco Garelli. Satine WS: a Web Services application server
for Python. December 2002. http://satine.sourceforge.net/ws.html
[4] Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E. Extensible
Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/2000/REC-xml-20001006
[5] JDom.org. JDOM. http://www.jdom.org/
[6] Le Hors, A., ed. Document Object Model (DOM) Level 3 Core Specification.
http://www.w3.org/TR/2001/WD-DOM-Level-3-Core-20010126/
[7] Megginson Technologies. SAX 2.0: The Simple API for XML. http://www.megginson.com/SAX/.
[8] Biron, P. V. and Malhotra, A. XML Schema Part 2: Datatypes. http://www.w3.org/TR/xmlschema-2/
[9] Sun Microsystems. The Java Architecture for XML Binding. User's
Guide. May 2001. http://java.sun.com/xml/jaxb/jaxb-docs.pdf
[10] Microsoft. XSD Compiler, .NET Development. http://msdn.microsoft.com
[11] David Mertz, Data Masseur. On the Pythonic Treatment of XML
Documents As Objects. http://gnosis.cx/publish/programming/xml_matters_1.txt
[12] Don Box, David Ehnebuske, ed. Simple Object Access Protocol
(SOAP) 1.1. http://www.w3.org/TR/SOAP/
[13] Steven Pemberton, Daniel Austin, ed. XHTML 1.0 The Extensible
HyperText Markup Language. http://www.w3.org/TR/xhtml1
[14] Erik Christensen, Francisco Curbera, ed. Web Services Description
Language (WSDL) 1.1. http://www.w3.org/TR/wsdl