Mod_Python - Integrating Python with Apache

Gregory Trubetskoy

Abstract

Mod_python [1] is an Apache server [2] module that embeds the Python interpreter within the server and provides an interface to Apache server internals as well as a basic framework for simple application development in this environment. The advantages of mod_python are versatility and speed.

This paper describes mod_python with the focus on the implementation, its philosophy and challenges.

It is intended for an audience already familiar with web application development in general and Apache in particular, as well as preferably mod_python itself. Knowledge of C and some understanding of Python internals is helpful as well.

Project Goal

Quite simply, the goal is the integration of Python and Apache. Apache is a sort of Swiss Army knife of web serving, especially the upcoming 2.0 version, which does not limit itself to HTTP but can serve any protocol for which there exists a module. Mod_python aims to provide direct access to the riches of this functionality for Python developers.

While speed is definitely a key benefit of mod_python and is taken very seriously during design decisions, it would be wrong to identify it as the sole reason for mod_python's existence.

At least for now, providing "inline Python" type functionality à la PHP [15] is not a goal of this project. This is because the integration with Apache can still use a lot of improvement, and there does not seem to be a clear consensus within the Python community on how to embed Python code in HTML, with quite a few modules floating around, each doing it their own way.

Project Status

Mod_python was initially released in April 2000 as a replacement for an earlier project called Httpdapy [3] (1998), which in turn was a port to Apache of Nsapy [4] (1997). Nsapy was based on an embedding example by Aaron Watters in the Internet Programming with Python [5] book.

Mod_python is stable enough to be used in production. The latest stable version at the time of this writing is 2.7.6. This version is written for version 1.3 of the Apache server. All of the development effort these days is focused on the next major version of mod_python, 3.0, which will support the upcoming Apache 2.0.

Quick Intro

Mod_python consists of two components - an Apache dynamically loadable module mod_python.so (this module can also be statically linked into Apache) and a Python package mod_python.

Assuming that mod_python is loaded into Apache, consider this configuration excerpt:

DocumentRoot /foo/bar
<Directory /foo/bar>
          AddHandler python-program .py
          PythonHandler hello
</Directory>

The following script named hello.py resides in the /foo/bar directory:

from mod_python import apache

def handler(req):
        req.send_http_header()
        req.write("hello %s" % req.remote_host)
        return apache.OK

A request to http://yourdomain/somefile.py would result in a page showing "hello 1.2.3.4" where 1.2.3.4 is the IP of the client.

Just about every mod_python script begins with "from mod_python import apache". apache is a module inside the mod_python package that provides the interface to Apache constants (such as OK) and many useful functions. Note also the Request object req, which provides information about the current request, the connection and an interface to more internal Apache functions, in this example the send_http_header() method to send HTTP headers and the write() method to send data back to the client.

Apache Modules, Request Phases and Mod_python

Apache processes incoming requests in phases. A phase is one of a series of small tasks that each need to take place to service a request. For example, there is a phase during which a URI is mapped to a file on disk, a phase during which authentication happens, a phase to generate the content, etc. Altogether, Apache 1.3 has 10 phases (11 if you consider clean-ups a phase).

The key architectural feature of the Apache server is that it can allow a module to process any phase of a request. This way a module can augment the server behavior in any way whatsoever. (Module in this context does not refer to a Python module; an Apache module is usually a shared library or DLL that gets loaded at server startup, though modules can also be statically linked with the server.)

Mod_python is an Apache module. What makes it different from most other Apache modules is that it does nothing by itself; instead, it makes it possible to do in Python what Apache modules written in C do. To put it another way, it delegates phase processing to user-written Python code.

This figure shows a diagram of Apache request processing.




Each Apache module can provide a handler function for any of the request processing phases. There are 4 types of return values possible for every handler.

  1. DECLINED means the module declined to handle this phase; Apache moves to the next module in the module list.

  2. OK means that this phase has been processed; Apache will move on to the next phase without giving any more modules an opportunity to handle this phase.

  3. An error return (which is any HTTP [7] error constant) will cause Apache to produce an error page and jump to the Logging phase.

  4. A special value of DONE means the whole request has been serviced; Apache will jump to the Logging phase.

The DECLINED return is somewhat deceiving, because many modules actually perform some action and then return DECLINED to give other modules an opportunity to handle the phase. The example below illustrates how the DECLINED return can be used in a handler that inserts a silly reply header into every request:

from mod_python import apache

def fixup(req):

    req.headers_out["X-Grok-this"] = "Python-Psychobabble"
    return apache.DECLINED
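
The four return values listed above can be modelled outside Apache. The following is only a toy sketch of the dispatch rules as this paper describes them, not Apache's actual code: run_request(), run_phase(), the phase names and the numeric values of the constants are all assumptions made for illustration, and the logging phase is dispatched the same way as any other phase.

```python
# Toy model of phase dispatch; every name and value here is illustrative.
DECLINED, OK, DONE = -1, 0, -2


def run_phase(phase, modules, trace):
    """Offer one phase to each module in turn until one does not decline."""
    for index, module in enumerate(modules):
        handler = module.get(phase)
        if handler is None:
            continue
        result = handler()
        trace.append((phase, index, result))
        if result != DECLINED:
            return result          # OK, DONE or an HTTP error ends the phase
    return DECLINED


def run_request(phases, modules):
    """Run all phases; the last entry in phases is the logging phase."""
    trace = []
    status = OK
    for phase in phases[:-1]:
        result = run_phase(phase, modules, trace)
        if result not in (OK, DECLINED):
            status = result        # DONE or an error: skip to logging
            break
    run_phase(phases[-1], modules, trace)
    return status, trace


headers = {}


def add_header():
    # Does some work, then declines so that other modules still get a turn.
    headers["X-Grok-this"] = "Python-Psychobabble"
    return DECLINED


modules = [
    {"fixup": add_header},
    {"fixup": lambda: OK, "content": lambda: 404},
    {"fixup": lambda: OK, "content": lambda: OK, "log": lambda: OK},
]
status, trace = run_request(["fixup", "content", "log"], modules)
```

In this run the third module never sees the fixup phase, because the second one returned OK, and the 404 from the content phase sends the request straight to logging.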

At this point it should be a bit clearer how this functionality is different from a CGI environment. Comparing CGI with mod_python is not very meaningful, because the scope of CGI is much narrower. One difference is that CGI is intended exclusively for dynamic content generation, which is not a requirement for mod_python scripts. For example, consider a mod_python script that implements a custom logging mechanism for the entire server, which plays no role in content generation.

Apache Objects

Apache request processing makes use of a few important C structures, access to which is available through mod_python.

request_rec - the Request Record

request_rec is probably the largest and most frequently encountered structure. It contains all the information associated with processing a request (about 50 members total).

Mod_python provides a wrapper around request_rec, a built-in type mp_request. The mp_request type is not meant to be used directly. Instead, each mod_python handler gets a reference to an instance of a Request class, a regular Python class which is a wrapper around mp_request (which is a wrapper around request_rec). This is so that mod_python users could attach their own attributes to the Request instance as a way to maintain state across different phases.
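
The reason for the extra wrapper layer can be shown with a small sketch. This is not mod_python's source: BuiltinRequest is a hypothetical stand-in for a C-implemented type with a fixed set of members (imitated here with __slots__), and the Request class below is a simplified assumption of how an ordinary Python class can delegate to it while still accepting user attributes.

```python
class BuiltinRequest(object):
    """Stand-in for a built-in type: only the listed members exist."""
    __slots__ = ("uri", "filename")

    def __init__(self, uri, filename):
        self.uri = uri
        self.filename = filename


class Request(object):
    """Regular Python class wrapping the fixed-layout object."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only called when normal lookup fails, so attributes attached by
        # the user are found first and everything else is delegated.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)


raw = BuiltinRequest("/somefile.py", "/foo/bar/somefile.py")
try:
    raw.user_state = 1
    raw_accepts_attributes = True
except AttributeError:
    raw_accepts_attributes = False

req = Request(raw)
req.user_state = {"authenticated": True}   # e.g. set during an early phase
```

A handler for a later phase that receives the same req instance can then read req.user_state, while req.uri still comes from the wrapped object.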

The Request class provides methods for sending headers and writing data to the client.

conn_rec - the Connection Record

conn_rec keeps all the information associated with the connection. It is a separate structure from request_rec because HTTP [7] allows for multiple requests to be serviced over the same connection.

The connection record is accessible in mod_python through the mp_conn built-in type, a reference to which is always available via the connection member of the Request object (req.connection).

server_rec - the Server Record

server_rec keeps all the information associated with the virtual server, such as the server name, its IP, port number, etc. It is available via the server member of the Request object (req.server).

ap_table - Apache table

All key/value lists (for example RFC 822 [9] headers) in Apache are stored in tables. A table is a construct very similar to a Python dictionary, except that both keys and values must be strings, key lookups are case insensitive and a table can have duplicate keys. Internally, Apache tables differ from Python dictionaries in that lookups do not use hashing, but rather a simple sequential search (although there was a proposal to use hashing in Apache 2.0).

Mod_python provides a wrapper for tables, an mp_table object, which acts very much like a Python dictionary. If there are duplicate keys, mp_table will return a list. To allow addition of duplicate keys, mp_table provides an add() method.

Here is some code to illustrate how mp_table acts:

from mod_python import apache

def handler(req):

    t = apache.make_table()
    t["Set-Cookie"] = "Foo: bar;"
    t.add("Set-Cookie", "Bar: foo;")

    s = t["Set-Cookie"]  # s is ["Foo: bar;", "Bar: foo;"]

    return apache.DECLINED
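
For readers without a running server, the table semantics described above (string keys and values, case-insensitive sequential lookup, duplicate keys, a list returned for duplicates) can be imitated in pure Python. ToyTable is a hypothetical name and a rough model only; it is not the mp_table implementation.

```python
class ToyTable(object):
    """Rough pure-Python model of the table behaviour described above."""

    def __init__(self):
        self._items = []   # (key, value) pairs, searched sequentially

    @staticmethod
    def _check(key, value):
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("table keys and values must be strings")

    def __setitem__(self, key, value):
        # Plain assignment replaces every existing entry with this key.
        self._check(key, value)
        wanted = key.lower()
        self._items = [(k, v) for (k, v) in self._items if k.lower() != wanted]
        self._items.append((key, value))

    def add(self, key, value):
        # add() keeps existing entries, which allows duplicate keys.
        self._check(key, value)
        self._items.append((key, value))

    def __getitem__(self, key):
        wanted = key.lower()
        found = [v for (k, v) in self._items if k.lower() == wanted]
        if not found:
            raise KeyError(key)
        return found[0] if len(found) == 1 else found


t = ToyTable()
t["Set-Cookie"] = "Foo: bar;"
t.add("set-cookie", "Bar: foo;")
t["Content-Type"] = "text/plain"

single = t["content-type"]
both = t["SET-COOKIE"]
```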

Subinterpreters

The Python C API has a function to initialize a sub-interpreter, Py_NewInterpreter(). Here is an excerpt from the Python/C API Reference manual [6] documenting this function:

Create a new sub-interpreter. This is an (almost) totally separate environment for the execution of Python code. In particular, the new interpreter has separate, independent versions of all imported modules, including the fundamental modules __builtin__ , __main__ and sys . The table of loaded modules (sys.modules) and the module search path (sys.path) are also separate. The new environment has no sys.argv variable. It has new standard I/O stream file objects sys.stdin, sys.stdout and sys.stderr (however these refer to the same underlying FILE structures in the C library).

This valuable feature of Python is not available from within Python itself, so most Python users are not even aware of it. But it makes good sense to take advantage of this functionality for mod_python, where one Apache process can be responsible for any number of unrelated applications at the same time. By default, mod_python creates a subinterpreter for each virtual server, but this behavior can be altered.

When a subinterpreter is created, a reference to it is saved in a Python dictionary keyed by subinterpreter names, which are always strings. This dictionary is internal to mod_python.

During phase processing, prior to executing the user Python code, mod_python has to decide which interpreter to use. By default, the interpreter name will be the name of the virtual server, which is available via the req->server->server_hostname Apache variable. If PythonInterpPerDirectory is On, then the name of the interpreter will be the directory being accessed (from req->filename), and with PythonInterpPerDirective On, it will be the directory where the Python*Handler directive currently in effect is specified (which can be some parent directory). The interpreter name can also be forced using the PythonInterpreter directive.

Once mod_python has a name for the interpreter, it checks the dictionary of subinterpreters for this name. If the name exists, mod_python switches to that subinterpreter; otherwise a new subinterpreter is created.
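
The naming rules and the lookup-or-create step can be summarized in a few lines of Python. This is a sketch under stated assumptions, not mod_python's C code: the function names, the plain dict standing in for the Apache configuration, ToyInterpreter, and in particular the order in which the directives take precedence are all assumptions made here for illustration.

```python
import posixpath

_interpreters = {}   # interpreter name (always a string) -> interpreter


class ToyInterpreter(object):
    """Placeholder for a real subinterpreter."""

    def __init__(self, name):
        self.name = name


def interpreter_name(config, server_hostname, filename, directive_dir):
    """Pick the interpreter name; the precedence order is assumed."""
    if config.get("PythonInterpreter"):
        return config["PythonInterpreter"]          # name forced explicitly
    if config.get("PythonInterpPerDirectory") == "On":
        return posixpath.dirname(filename)          # directory being accessed
    if config.get("PythonInterpPerDirective") == "On":
        return directive_dir                        # where the directive lives
    return server_hostname                          # default: virtual server


def get_interpreter(name):
    """Return the interpreter for this name, creating it on first use."""
    if name not in _interpreters:
        _interpreters[name] = ToyInterpreter(name)
    return _interpreters[name]


args = ("www.example.com", "/foo/bar/somefile.py", "/foo")
name_default = interpreter_name({}, *args)
name_per_dir = interpreter_name({"PythonInterpPerDirectory": "On"}, *args)
name_per_directive = interpreter_name({"PythonInterpPerDirective": "On"}, *args)
name_forced = interpreter_name({"PythonInterpreter": "shared"}, *args)
```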

Phase Processing Inside Mod_python

After mod_python has been given control by Apache to process a phase of a request, it steps through the following actions. (This is a simplified list.)

Memory Management and Cleanups

Memory management is always a challenge for long running processes. One has to be very careful to always remember to free all memory allocated during request processing, no matter what errors take place.

To combat this problem, Apache provides memory pools. The Apache API has a rich set of functions for allocating memory, manipulating strings, lists, etc., and each of these functions always takes a pool pointer. For example, instead of allocating memory using malloc() et al, Apache modules allocate memory using ap_palloc() and passing it a pool pointer. All memory allocated in such a way can then be freed at once by destroying the pool. Apache creates several pools with varying lifetimes, and modules can create their own pools as well. The pool probably used the most is the request pool, which is created for every request and is destroyed at the end of the request.

Unfortunately, the Python interpreter cannot use Apache pools. So for the most part, the mod_python programmer is at the mercy of the Python reference counting and garbage collecting mechanism (or lack thereof). In most cases it works just fine. In those cases where you do see the Apache process growing, the simplest solution is to configure the server to recycle itself every few thousand requests using the MaxRequestsPerChild directive.

Apache provides APIs to execute cleanup functions just before a pool is destroyed. A cleanup is registered by calling the ap_register_cleanup() C function which takes three arguments: a pool pointer, a function pointer, and a void pointer to some arbitrary data. Just before the pool is destroyed, the function will be called and passed the pointer as the only argument. Mod_python uses cleanups internally to destroy mp_request and mp_table objects.

Cleanups are available to mod_python users via req.register_cleanup() and req.server.register_cleanup(). The former runs after every request; the latter runs when the server exits.
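
The relationship between pool lifetimes and cleanups can be illustrated with a toy model. ToyPool is a hypothetical class, not an Apache or mod_python API, and running the cleanups in reverse order of registration is an assumption of this sketch.

```python
class ToyPool(object):
    """Toy pool: remembers cleanups and runs them when destroyed."""

    def __init__(self):
        self._cleanups = []

    def register_cleanup(self, function, data=None):
        # Same shape as the C call: a function plus one opaque argument.
        self._cleanups.append((function, data))

    def destroy(self):
        # Run cleanups just before the pool goes away (order is assumed).
        while self._cleanups:
            function, data = self._cleanups.pop()
            function(data)


log = []

server_pool = ToyPool()                     # lives as long as the server
server_pool.register_cleanup(log.append, "server cleanup")

for number in (1, 2):
    request_pool = ToyPool()                # created for every request
    request_pool.register_cleanup(log.append, "request %d cleanup" % number)
    log.append("request %d handled" % number)
    request_pool.destroy()                  # end of the request

server_pool.destroy()                       # server exit
```

Each request cleanup fires as soon as its request ends, while the server cleanup fires only once, at the very end.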

Standard Handlers

As an astute reader has probably noticed, mod_python (or rather Apache) associates a handler with a directory (SetHandler) or a file type (AddHandler), but not a specific file. In the quick example in the beginning of this paper it really doesn't matter what file is being accessed in the "/foo/bar" directory. As long as its name ends with .py, the same hello handler will be invoked, always yielding the same result. In fact the file referred to in the URI doesn't even need to exist.

A natural question would then be "Why can't I access multiple mod_python scripts in one directory?" (or "This isn't very useful!"). The answer here is that mod_python expects there to be an intermediate layer between it and the application. This layer (handler) is up to the user's imagination, but a couple of functional handlers (standard handlers) are bundled with mod_python.

CGI Handler (mod_python.cgihandler)

This handler is for users who want to use their existing CGI code with mod_python. This handler sets up a fake CGI environment and runs the user program. A couple of interesting implementation challenges were encountered here.

At first, this handler used to set up the CGI environment through the standard os.environ object. For whatever reason (Python bug?) this frequent environment manipulation introduced a memory leak (about a kilobyte per request), so as a quick hack, os.environ was replaced with a regular dictionary object. This works fine for the most part, but is a problem for scripts that use environment as a way to communicate with subsequently called programs, notably some database interfaces which expect database server information in an environment variable.

Another problem was that since cgihandler uses import/reload to run a module, "indirect" module imports by the "main" module would become no-ops after the first hit. This became a problem for users who expected the top level code in those indirectly imported modules to be executed for every hit. To solve this problem, cgihandler now examines the sys.modules variable before and after importing the user scripts, and in the end, deletes any newly appeared modules from sys.modules, causing those modules to be imported again next time.
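
The sys.modules technique described above can be sketched as follows. This is not the cgihandler source: import_fresh(), demo() and the module names toy_script and toy_helper are hypothetical, and the sketch uses the importlib module of current Python versions rather than whatever the handler itself calls.

```python
import importlib
import os
import shutil
import sys
import tempfile


def import_fresh(module_name):
    """Import a module, then forget every module that import brought in."""
    before = set(sys.modules)
    try:
        return importlib.import_module(module_name)
    finally:
        for name in set(sys.modules) - before:
            del sys.modules[name]     # forces a real import next time


def demo():
    """Count how often the top-level code of an indirect import runs."""
    directory = tempfile.mkdtemp()
    log_path = os.path.join(directory, "hits.log")
    with open(os.path.join(directory, "toy_helper.py"), "w") as f:
        # Top-level code with a visible side effect.
        f.write("with open(%r, 'a') as log:\n    log.write('x')\n" % log_path)
    with open(os.path.join(directory, "toy_script.py"), "w") as f:
        f.write("import toy_helper\n")
    sys.path.insert(0, directory)
    importlib.invalidate_caches()
    try:
        for _ in range(3):            # three simulated hits
            import_fresh("toy_script")
        with open(log_path) as log:
            return len(log.read())
    finally:
        sys.path.remove(directory)
        shutil.rmtree(directory, ignore_errors=True)
```

Without the clean-up in import_fresh(), the helper's top-level code would run only on the first of the three hits.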

Last but not least, the CGI specification [14] strongly recommends that the server set the current directory to the directory in which the script is located. There is no thread-safe way of changing the current directory, and so the cgihandler uses a thread lock in multithreaded environments (e.g. Win32). The lock is held for as long as the script runs, essentially forcing the server to process one cgihandler request at a time.

Given all of the above problems, the cgihandler is not a recommended development environment, but is regarded as a stop gap measure for users who have a lot of legacy CGI code, and should be used with caution and only if really necessary.

Publisher Handler (mod_python.publisher)

The publisher handler is probably the best way to start writing web applications using mod_python. The functionality of the publisher handler was inspired by the ZPublisher, a component of Zope [10].

The idea is that a URI is mapped to some object inside a module, the "/" in the URI having the same meaning as a "." in Python. So http://somedomain/somedir/module/object/method would invoke method method of object object inside module module in directory somedir, and the return value of the method would be sent to the client.

Here is a "hello world" example:

def hello(req, who="nobody"):

    return "Hello, %s!" % who

If the file containing this code is called myapp.py in directory somedir, then the hello function can be accessed via http://somedomain/somedir/myapp/hello which should result in a page showing "Hello, nobody!", whereas http://somedomain/somedir/myapp/hello?who=John should result in "Hello, John!".

Note that the first argument is a Request object, which means all the advanced mod_python functionality is still available when using the publisher handler.
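
The mapping from URI path to Python object can be sketched in a few lines. This is a simplified illustration, not the publisher source: publish() is a hypothetical function, the SimpleNamespace object stands in for the imported module myapp.py, query-string parsing is left out, and refusing names that begin with an underscore is a precaution added in this sketch.

```python
import types


def publish(module, path, req, **form):
    """Resolve 'object/method' inside module and return the page text."""
    obj = module
    for name in path.strip("/").split("/"):
        if name.startswith("_"):
            raise LookupError("refusing private name %r" % name)
        obj = getattr(obj, name)       # "/" in the URI acts like "."
    if callable(obj):
        return str(obj(req, **form))   # form fields become keyword arguments
    return str(obj)


def hello(req, who="nobody"):
    return "Hello, %s!" % who


myapp = types.SimpleNamespace(hello=hello)   # stands in for module myapp

page_default = publish(myapp, "/hello", req=None)
page_john = publish(myapp, "/hello", req=None, who="John")
```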

Debugging

Debugging mod_python applications can be difficult. Mod_python provides support for the Python debugger (pdb) via the PythonEnablePdb configuration directive, but its usability is limited because the debugger is an interactive tool that uses standard input and output and therefore can only be used when Apache is running in foreground mode (-X switch in Apache 1.3 or -DONE_PROCESS in 2.0).

Mod_python sends any traceback information to the server log, and with the PythonDebug directive set to On (default is Off), the traceback information is also sent to the client.

For programmers who like to use the print statement as a debugging tool, the technique favored by the author is to instead raise a variable optionally surrounded by "`" (back quotes) from any point in the code with the PythonDebug directive On. This will make the value of the variable appear on the browser and is as effective as print.

Threads

Mod_python is thread-safe and runs fine on Win32, where Apache is multithreaded.

One should be careful to make sure that any extension modules that an application uses are thread-safe as well. For example, many database access drivers on Windows are not thread safe, and some kind of a thread lock needs to be used to make sure no two threads try to run the driver code in parallel.
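
One way to apply such a lock is to wrap the unsafe calls in a decorator. The sketch below is only an illustration: serialized() and toy_driver_query() are hypothetical names, and the "driver" is a stand-in that merely records how many threads are inside it at once.

```python
import functools
import threading
import time


def serialized(function):
    """Allow only one thread at a time to run the wrapped function."""
    lock = threading.Lock()

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        with lock:
            return function(*args, **kwargs)

    return wrapper


state = {"active": 0, "max_active": 0}


@serialized
def toy_driver_query(sql):
    # Stand-in for a driver call that must not run in parallel.
    state["active"] += 1
    state["max_active"] = max(state["max_active"], state["active"])
    time.sleep(0.001)
    state["active"] -= 1
    return "result of " + sql


def demo():
    threads = [
        threading.Thread(target=toy_driver_query, args=("SELECT 1",))
        for _ in range(8)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["max_active"]
```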

Interestingly, the Python interpreter itself isn't completely thread safe, and to run multiple threads it maintains a thread lock that is released every 10 Python bytecode instructions to let other threads run. If any, the negative impact of that is most likely negligible.

On Design and Implementation

Mod_perl

Those familiar with mod_perl [11] will notice that some functionality of mod_python is remarkably similar to mod_perl; for example, the names of the Apache configuration directives are exactly the same except that the word Perl is replaced with Python.

It would be wrong not to say that much of mod_python functionality, especially in the area of Apache configuration, was intentionally made functionally similar to mod_perl. Under the hood they have next to nothing in common, mainly because Perl and Python interpreters are quite different.

There were good reasons for similarities though. First, there is no sense in reinventing the wheel - mod_perl has encountered and solved many problems just as applicable to mod_python. Second, since both projects had similar goals, except the language of choice was different, it made sense to keep the outside look consistent, especially the Apache configuration. Oftentimes the person who has to deal with the Apache config is a System Administrator, not a programmer, and consistency would make the SysAdmin's job easier.

Python vs C

In a web application environment speed and low overhead are extremely important. Many people don't appreciate how really important it is until their site gets featured on another big volume site (the so called "/. effect") but instead of getting lots of hard earned publicity, they get a bunch of frustrated web surfers trying to get to a site so overloaded that no one can access it.

Considering this angle, C always wins over Python. If the author of mod_python had more time, a much larger percentage of mod_python would be implemented in C. But given the length of time it takes to write quality C code, initially a decision was made to implement in C only those parts which cannot be done in Python.

SWIG

SWIG [13] was given some consideration as a tool to provide the mapping to Apache C structures (such as request_rec). The main advantages of SWIG are the speed and ease with which an interface to a C library can be created, but there are a few problems with it. The resulting C code is not necessarily meant to be easy to read, and SWIG itself becomes yet another tool that is required for compilation in an already pretty complicated build environment. Altogether, for a long-term project like mod_python, where quality is more important than the timeline, SWIG does not seem to be the right choice.

Future Direction and Apache 2.0

As has been mentioned before, the main focus of development today is compatibility with Apache 2.0. Apache 2.0 is architecturally quite a bit different from its predecessor (1.3), so much so that it would not be very easy or practical to try to write code that works with both 1.3 and 2.0. It is possible, but the code becomes a tangle of #ifdef statements because the majority of the API functions have been renamed. So the next major version of mod_python will support Apache 2.0 only.

Apache 2.0 is actually a combination of two software packages. One is the server itself, the other is the underlying library, the Apache Portable Runtime (APR) [12]. The APR is a general purpose library designed to provide functionality common in daemons of all kinds and to abstract the OS specifics (thus "Portable"). Future versions of mod_python will eventually provide an interface to a large part or perhaps all of the APR.

Another big improvement in 2.0 is the introduction of filters and connection handlers. The alpha version of mod_python 3.0 already supports filters. (A filter would be the right place to implement inline Python). A connection handler is a handler at a level below HTTP. Using a connection handler one could implement an entirely different protocol, e.g. FTP. At the time of this writing mod_python 3.0 alpha does not support connection handlers, but such support is in the plans.

References

[1] Mod_python. http://www.modpython.org/
[2] Apache HTTP Server. http://httpd.apache.org/
[3] Httpdapy. http://www.ispol.com/home/grisha/httpdapy
[4] Nsapy. http://www.ispol.com/home/grisha/nsapy
[5] Aaron Watters, Guido van Rossum, James C. Ahlstrom, Internet Programming with Python, M&T Books, 1996.
[6] Guido van Rossum, Fred L. Drake, Jr, Python/C API Reference Manual, PythonLabs. http://www.python.org/doc/current/api/.
[7] R. Fielding, UC Irvine, J. Gettys, J. Mogul, DEC, H. Frystyk, T. Berners-Lee, MIT/LCS, "Hyper Text Transfer Protocol -- HTTP/1.1", RFC 2068, IETF January 1997. http://www.ietf.org/rfc/rfc2068.txt
[9] Crocker, D., "Standard for the Format of ARPA Internet Text Messages", STD 11, RFC 822, UDEL, August 1982. http://www.ietf.org/rfc/rfc822.txt
[10] Zope. http://www.zope.org/
[11] Mod_perl, Apache/Perl Integration. http://perl.apache.org/
[12] Apache Portable Runtime. http://apr.apache.org/
[13] Simplified Wrapper and Interface Generator. http://www.swig.org/
[14] Ken A L Coar, The WWW Common Gateway Interface Version 1.1. http://cgi-spec.golux.com/draft-coar-cgi-v11-03.txt
[15] PHP. http://www.php.net/