Paper for PyCon 2005 30mn talk about PyLucene ============================================= Title : Pulling Java Lucene into Python: PyLucene Author: Andi Vajda Abstract -------- As OSAF needed an open source text search engine library for its Python based project, Chandler, we made the following bet: what if we pulled together Java Lucene, GNU's gcj Java compiler and SWIG to build a Python extension ? This paper examines the issues of pulling an open source Java library into Python, matching the differences in memory management, thread support and type systems. It also describes a technique for extending a Java class from Python by providing a python implementation to a Java extension proxy. Introduction ------------ OSAF's flagship project, Chandler (http://www.osafoundation.org), is a personal information manager. As such, it needs the ability to run unstructured full text queries over arbitrarily large repositories of text. There are not that many open source text search engines available. Lucene is considered among the better ones and it is licensed under the Apache license, both of which make it a very attractive solution. But Lucene is written in Java. Why PyLucene ? -------------- Java presents quite a challenge since it is a very insular environment. It doesn't play well, both culturally and technically, with other non-Java environments. The Java Native Interface (JNI) is clunky, the Java Runtime Environment is something we didn't want to be emcumbered with either. Really, for Lucene to be an acceptable solution for Chandler, it had to be able to be compiled down to a simple Python extension that is loaded and managed just like any other Python extension written in C or C++. Enter GNU's Java Compiler, gcj. GCJ compiles a set of Java classes together into a shared library and exports these classes as if they were implemented in C++ via the Common Native Interface (CNI) which makes them very easily accessible to other C/C++ programs. GCJ does not depend on the Apple or Sun Java Runtime Environments. It comes with its own, 100% open source, clean room implementation of the Java runtime classes. Enter the interface compiler, SWIG. SWIG is an interface compiler that connects programs written in C and C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. (quoted from http://www.swig.org/exec.html) Using SWIG and gcj allows to keep the resulting Python extension very close to the actual Java Lucene library which is under active development. This approach, while involving many thousands of lines of boilerplate code, almost all generated by SWIG, ultimately yields a much closer and more up to date library than a handcrafted port. Alternatives to the gcj/SWIG approach include a manual port of Java Lucene to Python such as Lupy, which is ten times slower than Java Lucene and way behind the Java Lucene version; a manual port to C++ integratable into Python such as CLucene, which is four times faster than Java Lucene but comes with its own set of C++ bugs and is also way behind the original; or finally, a JPipe-based solution, which is very clever but relies on the JNI and the JRE. PyLucene Architecture --------------------- PyLucene is compiled as a shared library - a python extension - loaded into the Python process by an 'import' statement. On Mac OS X and Linux, gcj's libgcj.so and libstdc++.so need to be shipped along with _PyLucene.pyd. On Windows however, all these libraries are linked together into a single shared library of a few megabytes in size. ----------- -------------- ------------------- | Java | --- gcj ---> | lucene.o |->------<-| PyLucene_wrap.o | | Lucene | --- gcjh -------------- | ------------------- ----------- | v ^ v ----------------- | --------------- | _PyLucene.pyd | g++ | C++ headers | -->| ----------------- | --------------- | | |--- SWIG ---> --------------------- -------------- | | PyLucene_wrap.cxx | | PyLucene.i | ----------------->| --------------------- -------------- | PyLucene.py | --------------------- Preparing the Java Lucene library for compilation by gcj -------------------------------------------------------- The GNU java compiler, gcj, has been under very active development for a number of years now and is quite usable. Still, it comes with a number of bugs and kinks that need to be worked around. For example, gcj can compile .java source files and .class bytecode files to .o files alike. There are, however, more bugs in the .java to .o compilation process, notably with handling anonymous inner classes. Also, the infamous Miranda bug, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411, needs to be carefully worked around by adding abstract method declarations to abstract java classes for the methods of interfaces they claim to implement. Differences in reserved words between Java, C++ and Python can also cause problems. For example, 'delete' is a perfectly acceptable name for a method in Java but a reserved keyword in C++. Similarly, 'or' is acceptable in Java but a reserved keyword in Python. Some of these incompatibilities can be dealt with in the SWIG layer, others need to be patched in the original Java sources. Matching the memory models -------------------------- PyLucene involves three languages, Java, C++ and Python, each with their own conflicting memory management policies. Java's memory is garbage collected, C++ memory is more or less managed by hand and Python's memory objects are reference counted. The GNU Java class header file generator, gcjh, makes Java classes available to C++/CNI programs as C++ classes via the C++ headers it generates. SWIG, which generates the glue code between Python and C++, is not aware that the C++ pointers it wraps for Python's use are not really C++ but pointers to Java objects that cannot be explicitely 'deleted' when Python releases them. The gcj runtime library, libgcj, includes a garbage collector which keep track of Java objects in the Java heap, on the stack and in static variables but not in the Python heap. To solve these problems all Java objects returned to Python are kept in a Java IdentityHashMap where they are reference counted. That way, they are not garbage collected until they are removed from the table when their refcount reaches zero. The SWIG-generated 'delete_*' functions invoked when Python's refcount to wrapped Java objects reaches zero are patched to decrement their refcount in the IdentityHashMap instead. Matching thread support and concurrency models ---------------------------------------------- Both Python and Java support threads. On the platforms PyLucene is currently supported, Mac OS X, Linux and Windows, Python and Java use the same operating system thread implementations. One would hope that threads created in Python could be used to call into Java and vice-versa, but it is not that simple. Before any memory can be allocated by a thread using gcj's Java runtime library, libgcj, the garbage collector has to be made aware of the thread and its stack. The current implementation of the garbage collector, boehm-gc, does not support the registration of a thread after the fact. Hans Boehm, the author of the garbage collector module, intends to eventually support this but making it work reliably on the three supported PyLucene operating systems is a lost cause at the moment. Luckily, Python is a lot more nimble about threading support. It happily runs in threads it did not create and can even be coaxed into treating a foreign thread as one of its own. PyLucene exposes a new class, called PythonThread, which is a subclass of Python's threading.Thread class whose start() method delegates to Java the creation, initialization and starting of the actual operating system thread. There are then in fact two distinct thread objects, one Python, one Java, that use the same operating system thread. Any thread in python wishing to use the PyLucene APIs needs to be an instance of this PythonThread class. Python's threading support is not truly concurrent. While a Python thread is running, it is holding a Global Interpreter Lock (GIL) that prevents any other Python thread in the same process from running. Java, on the other hand, is fully concurrent. Several Java threads can run truly concurrently on hardware that supports it. Because of this concurrency mismatch, deadlocks may occur if the GIL is not released whenever a call is made from Python to Java since Java code may call back into Python when invoking Python 'extensions' to Java Lucene classes. Releasing the GIL also helps the Python process in not appearing hung when a longer running Java Lucene API such as IndexWriter.optimize() is in progress. 'Extending' Java classes from Python ------------------------------------ Many areas of the Lucene API expect the programmer to provide their own implementation or specialization of a feature where the default is inappropriate. For example, text analyzers and tokenizers are an area where many parameters and environmental or cultural factors are calling for customization. PyLucene enables this by providing Java extensions that serve as proxies for Java to call back into the Python implementations of these customizations. Technically, the PyLucene programmer is not providing an 'extension' but a Python implementation of a set of methods encapsulated by a Python class whose instances are wrapped by the Java proxies provided by PyLucene. For example, the code below, extracted from a PyLucene unit test, defines a custom analyzer using a custom token stream that returns the tokens '1', '2', '3', '4', '5' for any document it is called on. All that is needed in order to provide a custom analyzer in Python is defining a class that implements a method called 'tokenStream'. The presence of the 'tokenStream' method is detected by the corresponding SWIG type handler and the python instance passed in is wrapped by a new Java PythonAnalyzer instance that extends Lucene's abstract Analyzer class. In other words, SWIG in reverse. class _analyzer(object): def tokenStream(self, fieldName, reader): class _tokenStream(object): def __init__(self): self.tokens = ['1', '2', '3', '4', '5'] self.increments = [1, 2, 1, 0, 1] self.i = 0 def next(self): if self.i == len(self.tokens): return None t = Token(self.tokens[self.i], self.i, self.i) t.setPositionIncrement(self.increments[self.i]) self.i += 1 return t return _tokenStream() analyzer = _analyzer() store = RAMDirectory() writer = IndexWriter(store, analyzer, True) d = Document() d.add(Field.Text("field", "bogus")) writer.addDocument(d) writer.optimize() writer.close() Supporting downcasting ---------------------- Python's type system does not require type casting. On the other hand, downcasting is a very common operation in Java. SWIG will wrap a C++ object with a Python object matching the object's declared protocol. For example, if a Lucene API is declared to return Query, the resulting Python wrapper implements the Query methods, exactly. If the wrapped object is actually an instance of a subclass of Query, such as BooleanQuery, the subclass's methods are not available on the Python proxy. Where this is a problem, PyLucene extends the types in question to have is() type checkers and to() type casters to work this around. For example: analyzer = StandardAnalyzer() query = QueryParser.parse("a AND b", "data", analyzer).toBooleanQuery() Pythonic API flavors -------------------- Java is a rather verbose programming language. In places where it made sense, PyLucene added some pythonic extensions to the Lucene APIs such as iterators or property accessors. For example, one of the most commonly used Lucene classes, Hits, which returns the documents found in a search, is not iterable in Java Lucene, a hit counter is used instead. PyLucene wraps this nicely with a Python iterator. This can be illustrated as follows: The Java loop: for (int i = 0; i < hits.length(); i++) { Document doc = hits.doc(i); System.out.println(hits.score(i) + ":" + doc.get("title")); } with PyLucene becomes: for i, doc in hits: print "%s:%s" %(hits.score(i), doc['title']) Error reporting --------------- Java exceptions are caught by the Python - Java call boundary and wrapped into a Python JavaError exception. Errors that occur in Python code called from Java, are also caught there and reported as usual. In effect, for every call made from Python to Java, the glue code is the following: try { PythonThreadState state; // release/restore GIL $action // call Java API } catch (org::osafoundation::util::PythonException *e) { return NULL; // report Python error } catch (java::lang::Throwable *e) { PyErr_SetObject(PyExc_JavaError, // wrap Java exception jo2pr_type(e, "java::lang::Throwable *")); return NULL; // report Java error } The PythonThreadState type is a simple C++ class that ensures that the Python thread state is saved and the GIL released when it enters into scope and conversely that the Python thread state is restored and the GIL reacquired when it goes out of scope. class PythonThreadState { private: PyThreadState *state; public: PythonThreadState() { state = PyEval_SaveThread(); } ~PythonThreadState() { PyEval_RestoreThread(state); } }; Samples ------- A large number of samples are shipped with PyLucene. Most notably, all the samples published in the "Lucene in Action" book that did not depend on a third party Java library for which there was no obvious Python equivalent were ported to Python and PyLucene. "Lucene in Action" is a great companion to learning Lucene. Having all the samples available in Python should make it even easier for Python developers. "Lucene in Action" was written by Erik Hatcher and Otis Gospodnetic, both part of the Java Lucene development team, and is available from Manning Publications at http://www.manning.com/hatcher2. Future work ----------- Most of PyLucene's SWIG code is boilerplate code that could also be generated before being fed to SWIG. Such a Java classes to SWIG generation tool remains to be written. The same gcj/SWIG-based techniques could be used for other languages supported by SWIG such as Perl, Ruby, Lisp, etc... Various people have shown interest in a Ruby version for a while now. It would be exciting to see PyLucene morph into SWIGLucene as support for some of these languages is added. PyLucene could be ported to other operating systems. That effort is effectively bound by the state of the implementation of libgcj on these platforms. In particular, threading and garbage collection support can be problematic. GNU gcj is under very active development and progress is made on a regular basis. Acknowledgements ---------------- PyLucene wouldn't be possible without the tireless efforts of the people contributing to the open source projects below: - the contributors to the GCC/GCJ compiler suite, http://gcc.gnu.org/onlinedocs/gcc/Contributors.html - the GNU classpath team http://savannah.gnu.org/project/memberlist.php?group_id=85 - the Java Lucene developers, http://jakarta.apache.org/lucene/docs/whoweare.html - the SWIG developers, http://www.swig.org/guilty.html - the Open Source Applications Foundation, hosting the PyLucene project http://www.osafoundation.org Thank you all ! License ------- This paper is licensed under a Creative Commons License: http://creativecommons.org/licenses/by/2.0