Paper for PyCon 2005 30mn talk about PyLucene
=============================================

Title : Pulling Java Lucene into Python: PyLucene
Author: Andi Vajda


 Abstract
 --------

 As OSAF needed an open source text search engine library for its Python
 based project, Chandler, we made the following bet: what if we pulled
 together Java Lucene, GNU's gcj Java compiler and SWIG to build a Python
 extension ?

 This paper examines the issues of pulling an open source Java library into
 Python, matching the differences in memory management, thread support and
 type systems. It also describes a technique for extending a Java class from
 Python by providing a python implementation to a Java extension proxy.


 Introduction
 ------------

 OSAF's flagship project, Chandler (http://www.osafoundation.org), is a
 personal information manager. As such, it needs the ability to run
 unstructured full text queries over arbitrarily large repositories of
 text.

 There are not that many open source text search engines available. Lucene
 is considered among the better ones and it is licensed under the Apache
 license, both of which make it a very attractive solution.

 But Lucene is written in Java. 


 Why PyLucene ?
 --------------

 Java presents quite a challenge since it is a very insular environment.
 It doesn't play well, both culturally and technically, with other non-Java
 environments. The Java Native Interface (JNI) is clunky, the Java Runtime
 Environment is something we didn't want to be emcumbered with either.

 Really, for Lucene to be an acceptable solution for Chandler, it had
 to be able to be compiled down to a simple Python extension that is loaded
 and managed just like any other Python extension written in C or C++.

 Enter GNU's Java Compiler, gcj.

 GCJ compiles a set of Java classes together into a shared library and
 exports these classes as if they were implemented in C++ via the Common
 Native Interface (CNI) which makes them very easily accessible to other
 C/C++ programs.

 GCJ does not depend on the Apple or Sun Java Runtime Environments. It comes
 with its own, 100% open source, clean room implementation of the Java
 runtime classes.

 Enter the interface compiler, SWIG.

 SWIG is an interface compiler that connects programs written in C and C++
 with scripting languages such as Perl, Python, Ruby, and Tcl. It works by
 taking the declarations found in C/C++ header files and using them to
 generate the wrapper code that scripting languages need to access the
 underlying C/C++ code. (quoted from http://www.swig.org/exec.html)

 Using SWIG and gcj allows to keep the resulting Python extension very close
 to the actual Java Lucene library which is under active development. This
 approach, while involving many thousands of lines of boilerplate code,
 almost all generated by SWIG, ultimately yields a much closer and more up
 to date library than a handcrafted port.

 Alternatives to the gcj/SWIG approach include a manual port of Java Lucene
 to Python such as Lupy, which is ten times slower than Java Lucene and way
 behind the Java Lucene version; a manual port to C++ integratable into
 Python such as CLucene, which is four times faster than Java Lucene but
 comes with its own set of C++ bugs and is also way behind the original; or
 finally, a JPipe-based solution, which is very clever but relies on the JNI
 and the JRE.

 
 PyLucene Architecture
 ---------------------

 PyLucene is compiled as a shared library - a python extension - loaded into
 the Python process by an 'import' statement.
 On Mac OS X and Linux, gcj's libgcj.so and libstdc++.so need to be shipped
 along with _PyLucene.pyd. On Windows however, all these libraries
 are linked together into a single shared library of a few megabytes in
 size.

  -----------              --------------          -------------------
  | Java    | --- gcj ---> | lucene.o   |->------<-| PyLucene_wrap.o |
  | Lucene  | --- gcjh     --------------     |    -------------------
  -----------      |                          v             ^
                   v                   -----------------    |
                ---------------        | _PyLucene.pyd |   g++
                | C++ headers | -->|   -----------------    |
                ---------------    |                        |
                                   |--- SWIG ---> ---------------------
  --------------                   |              | PyLucene_wrap.cxx |
  | PyLucene.i | ----------------->|              ---------------------
  --------------                                  |    PyLucene.py    |
                                                  ---------------------


 Preparing the Java Lucene library for compilation by gcj
 --------------------------------------------------------

 The GNU java compiler, gcj, has been under very active development for a
 number of years now and is quite usable. Still, it comes with a number of
 bugs and kinks that need to be worked around. For example, gcj can compile
 .java source files and .class bytecode files to .o files alike. There are,
 however, more bugs in the .java to .o compilation process, notably with
 handling anonymous inner classes. Also, the infamous Miranda bug,
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411, needs to be carefully
 worked around by adding abstract method declarations to abstract java
 classes for the methods of interfaces they claim to implement.

 Differences in reserved words between Java, C++ and Python can also cause
 problems. For example, 'delete' is a perfectly acceptable name for a method
 in Java but a reserved keyword in C++. Similarly, 'or' is acceptable in
 Java but a reserved keyword in Python. Some of these incompatibilities can
 be dealt with in the SWIG layer, others need to be patched in the original
 Java sources.


 Matching the memory models
 --------------------------

 PyLucene involves three languages, Java, C++ and Python, each with their
 own conflicting memory management policies. Java's memory is garbage
 collected, C++ memory is more or less managed by hand and Python's memory
 objects are reference counted.

 The GNU Java class header file generator, gcjh, makes Java classes
 available to C++/CNI programs as C++ classes via the C++ headers it
 generates.

 SWIG, which generates the glue code between Python and C++, is not aware
 that the C++ pointers it wraps for Python's use are not really C++ but
 pointers to Java objects that cannot be explicitely 'deleted' when Python
 releases them.

 The gcj runtime library, libgcj, includes a garbage collector which keep
 track of Java objects in the Java heap, on the stack and in static
 variables but not in the Python heap.

 To solve these problems all Java objects returned to Python are kept in a
 Java IdentityHashMap where they are reference counted. That way, they are
 not garbage collected until they are removed from the table when their
 refcount reaches zero. The SWIG-generated 'delete_*' functions invoked
 when Python's refcount to wrapped Java objects reaches zero are patched
 to decrement their refcount in the IdentityHashMap instead.


 Matching thread support and concurrency models
 ----------------------------------------------

 Both Python and Java support threads. On the platforms PyLucene is currently
 supported, Mac OS X, Linux and Windows, Python and Java use the same
 operating system thread implementations. One would hope that threads
 created in Python could be used to call into Java and vice-versa, but it is
 not that simple.

 Before any memory can be allocated by a thread using gcj's Java runtime
 library, libgcj, the garbage collector has to be made aware of the thread
 and its stack. The current implementation of the garbage collector,
 boehm-gc, does not support the registration of a thread after the
 fact. Hans Boehm, the author of the garbage collector module, intends to
 eventually support this but making it work reliably on the three supported
 PyLucene operating systems is a lost cause at the moment.

 Luckily, Python is a lot more nimble about threading support. It happily
 runs in threads it did not create and can even be coaxed into treating a
 foreign thread as one of its own.

 PyLucene exposes a new class, called PythonThread, which is a subclass of
 Python's threading.Thread class whose start() method delegates to Java the
 creation, initialization and starting of the actual operating system
 thread. There are then in fact two distinct thread objects, one Python,
 one Java, that use the same operating system thread.
 Any thread in python wishing to use the PyLucene APIs needs to be an
 instance of this PythonThread class.

 Python's threading support is not truly concurrent. While a Python thread
 is running, it is holding a Global Interpreter Lock (GIL) that prevents
 any other Python thread in the same process from running.
 Java, on the other hand, is fully concurrent. Several Java threads can run
 truly concurrently on hardware that supports it.

 Because of this concurrency mismatch, deadlocks may occur if the GIL is not
 released whenever a call is made from Python to Java since Java code may
 call back into Python when invoking Python 'extensions' to Java Lucene
 classes.
 Releasing the GIL also helps the Python process in not appearing hung when
 a longer running Java Lucene API such as IndexWriter.optimize() is in
 progress.


 'Extending' Java classes from Python
 ------------------------------------

 Many areas of the Lucene API expect the programmer to provide their own
 implementation or specialization of a feature where the default is
 inappropriate. For example, text analyzers and tokenizers are an area where
 many parameters and environmental or cultural factors are calling for
 customization.

 PyLucene enables this by providing Java extensions that serve as proxies
 for Java to call back into the Python implementations of these
 customizations.

 Technically, the PyLucene programmer is not providing an 'extension' but a
 Python implementation of a set of methods encapsulated by a Python class
 whose instances are wrapped by the Java proxies provided by PyLucene.

 For example, the code below, extracted from a PyLucene unit test, defines a
 custom analyzer using a custom token stream that returns the tokens '1',
 '2', '3', '4', '5' for any document it is called on. 

 All that is needed in order to provide a custom analyzer in Python is
 defining a class that implements a method called 'tokenStream'. The
 presence of the 'tokenStream' method is detected by the corresponding SWIG
 type handler and the python instance passed in is wrapped by a new Java
 PythonAnalyzer instance that extends Lucene's abstract Analyzer class.

 In other words, SWIG in reverse.

      class _analyzer(object):
           def tokenStream(self, fieldName, reader):
               class _tokenStream(object):
                   def __init__(self):
                       self.tokens = ['1', '2', '3', '4', '5']
                       self.increments = [1, 2, 1, 0, 1]
                       self.i = 0
                   def next(self):
                       if self.i == len(self.tokens):
                           return None
                       t = Token(self.tokens[self.i], self.i, self.i)
                       t.setPositionIncrement(self.increments[self.i])
                       self.i += 1
                       return t
               return _tokenStream()

       analyzer = _analyzer()

       store = RAMDirectory()
       writer = IndexWriter(store, analyzer, True)

       d = Document()
       d.add(Field.Text("field", "bogus"))
       writer.addDocument(d)
       writer.optimize()
       writer.close()


 Supporting downcasting
 ----------------------

 Python's type system does not require type casting. On the other hand,
 downcasting is a very common operation in Java. SWIG will wrap a C++ object
 with a Python object matching the object's declared protocol. 

 For example, if a Lucene API is declared to return Query, the resulting
 Python wrapper implements the Query methods, exactly. If the wrapped object
 is actually an instance of a subclass of Query, such as BooleanQuery, the
 subclass's methods are not available on the Python proxy.

 Where this is a problem, PyLucene extends the types in question to have
 is<Type>() type checkers and to<Type>() type casters to work this around.
 For example: 
   analyzer = StandardAnalyzer()
   query = QueryParser.parse("a AND b", "data", analyzer).toBooleanQuery()


 Pythonic API flavors
 --------------------

 Java is a rather verbose programming language. In places where it made
 sense, PyLucene added some pythonic extensions to the Lucene APIs such as
 iterators or property accessors. For example, one of the most commonly used
 Lucene classes, Hits, which returns the documents found in a search, is not
 iterable in Java Lucene, a hit counter is used instead. PyLucene wraps this
 nicely with a Python iterator. This can be illustrated as follows:

   The Java loop:

       for (int i = 0; i < hits.length(); i++) {
           Document doc = hits.doc(i);
           System.out.println(hits.score(i) + ":" + doc.get("title"));
       }

   with PyLucene becomes:

       for i, doc in hits:
           print "%s:%s" %(hits.score(i), doc['title'])


 Error reporting
 ---------------

 Java exceptions are caught by the Python - Java call boundary and wrapped
 into a Python JavaError exception. Errors that occur in Python code called
 from Java, are also caught there and reported as usual.

 In effect, for every call made from Python to Java, the glue code is the
 following:

       try {
           PythonThreadState state;             // release/restore GIL
           $action                              // call Java API
       } catch (org::osafoundation::util::PythonException *e) {
           return NULL;                         // report Python error
       } catch (java::lang::Throwable *e) {
           PyErr_SetObject(PyExc_JavaError,     // wrap Java exception
                           jo2pr_type(e, "java::lang::Throwable *"));
           return NULL;                         // report Java error
       }

 The PythonThreadState type is a simple C++ class that ensures that the
 Python thread state is saved and the GIL released when it enters into scope
 and conversely that the Python thread state is restored and the GIL
 reacquired when it goes out of scope.

       class PythonThreadState {
         private:
           PyThreadState *state;
         public:
           PythonThreadState()
           {
               state = PyEval_SaveThread();
           }
           ~PythonThreadState()
           {
               PyEval_RestoreThread(state);
           }
       };


 Samples
 -------

 A large number of samples are shipped with PyLucene. Most notably, all the
 samples published in the "Lucene in Action" book that did not depend on a
 third party Java library for which there was no obvious Python equivalent
 were ported to Python and PyLucene.

 "Lucene in Action" is a great companion to learning Lucene. Having all the
 samples available in Python should make it even easier for Python
 developers.

 "Lucene in Action" was written by Erik Hatcher and Otis Gospodnetic,
 both part of the Java Lucene development team, and is available from
 Manning Publications at http://www.manning.com/hatcher2.


 Future work
 -----------

 Most of PyLucene's SWIG code is boilerplate code that could also be
 generated before being fed to SWIG. Such a Java classes to SWIG generation
 tool remains to be written.

 The same gcj/SWIG-based techniques could be used for other languages
 supported by SWIG such as Perl, Ruby, Lisp, etc... Various people have
 shown interest in a Ruby version for a while now. It would be exciting to
 see PyLucene morph into SWIGLucene as support for some of these languages
 is added.

 PyLucene could be ported to other operating systems. That effort is
 effectively bound by the state of the implementation of libgcj on these
 platforms. In particular, threading and garbage collection support can be
 problematic. GNU gcj is under very active development and progress is made
 on a regular basis.


 Acknowledgements
 ----------------
 
 PyLucene wouldn't be possible without the tireless efforts of the people
 contributing to the open source projects below:

  - the contributors to the GCC/GCJ compiler suite,
    http://gcc.gnu.org/onlinedocs/gcc/Contributors.html

  - the GNU classpath team
    http://savannah.gnu.org/project/memberlist.php?group_id=85

  - the Java Lucene developers,
    http://jakarta.apache.org/lucene/docs/whoweare.html

  - the SWIG developers,
    http://www.swig.org/guilty.html

  - the Open Source Applications Foundation, hosting the PyLucene project
    http://www.osafoundation.org

 Thank you all !


 License
 -------

 This paper is licensed under a Creative Commons License:
     http://creativecommons.org/licenses/by/2.0