From Python to PLT Scheme

The Python programming language [1] is a descendant of the ABC programming language, which was a teaching language created by Guido van Rossum in the early 1980s. It includes a sizeable standard library and powerful primitive data types. It has three major interpreters: CPython [2], currently the most widely used interpreter, is implemented in the C language; another Python interpreter, Jython [3], is written in Java; Python has also been ported to .NET [4].

MzScheme is an interpreter [5] for the PLT Scheme programming language [6], which is a dialect of the Scheme language. MzScheme compiles syntactically valid programs into an internal bytecode representation before evaluation. MrEd is a graphical user interface toolkit [7] that extends PLT Scheme and works uniformly across several platforms (Windows, Mac OS X, and the X Window System.) Originally meant for Scheme, DrScheme [8, 9] is an integrated development environment (IDE) based on MzScheme -- it is a MrEd application -- with support for embedding third-party extensions. DrScheme provides developers with useful and modular development tools, such as syntax or flow analyzers. Because MzScheme's syntax system includes precise source information, any reference by a development tool to such data can be mapped back to a reference to the original program text.

DrScheme is thus no longer just a development environment for Scheme. It can now potentially play the role of a program development environment for any language, which users can select from a menu (figure 1). When using any language from within the IDE, the program developer may use DrScheme's development tools, such as Syntax Check, which checks a program's syntax and highlights its bindings (figure 2), or MrFlow, which analyses a program's possible flow of values (MrFlow is still under development though). Also, any new tool added to the DrScheme IDE is supposed to work automatically with all the languages that DrScheme supports (figure 2).

To support a new language, however, DrScheme needs a translator for programs written in that language. In the case of adding Python support to DrScheme, this is the task of the Python-to-Scheme compiler described in this paper. The compiler is packaged as a DrScheme language tool, thus introducing Python as a language in DrScheme's list of choices (figure 1).

The compiler was created as an experiment in porting a language like Python to DrScheme. With Python available as a DrScheme language, Python programmers can use the DrScheme IDE and its accompanying tools to develop Python programs. It also gives Scheme programmers access to the large amount of Python code available on the Internet. The compiler still suffers from several limitations though, primarily relating to the runtime support. The performance of the generated code is also currently poor compared to CPython. While we expect some of the limitations to disappear and the performance to get better as the generated code is optimized, we already consider the experiment to be successful for the insights we have gained about the problem of embedding a real-world language into DrScheme.

Section 2 of this paper presents the overall architecture of the compiler system, including details about code generation and the runtime system. Section 3 describes the current status of the compiler, gives an idea of the current performance of the generated MzScheme code, and evaluates the successfulness of the whole experiment. Section 4 relates other works to this paper. Section 5 lists some of the major parts that still need to be worked on, and we conclude in section 6.

2 Architecture

This section describes the architecture of the Spy compiler. The compiler has a conventional structure with three major components: the front-end, which uses a lexical analyzer to read program text and a parser to check the syntax of the tokens produced by the scanner; the backend, which is a code generator using the parser's output to create MzScheme code; and the runtime system, which provides low-level functions that the generated code makes use of. This section delineates these three components. Section 2.1 describes the scanner and parser; section 2.2, the code generator; and section 2.4, the runtime system.

Note that, even though CPython is based on a virtual machine, we did not consider compiling CPython byte code instead of compiling Python source code. While compiling CPython byte code to Scheme is certainly doable, the semantic mismatch between the stack-based byte code and Scheme is big enough that DrScheme's tools would most likely give poor results on byte code (in addition to the problem of mapping those results for the byte code back into results for Python source code, since, unlike DrScheme, CPython does not preserve in the byte code much information about the source code to byte code transformation).

2.1 Lexical and Syntax Analysis

Python program text is read by the lexical analyzer and transformed into tokens, including special tokens representing indentation changes in the Python source code. From this stream of tokens the parser generates abstract syntax trees (ASTs) in the form of MzScheme objects, with one class for each Python syntactic category. The indentation tokens are used by the parser to determine the extent of code blocks. The list of generated ASTs is then passed on to the code generator.

2.2 Code Generation

The code generator produces Scheme code from a list of ASTs by doing a simple tree traversal and emitting equivalent MzScheme code. The following subsections explain the generation of the MzScheme code for the most important parts of the Python language. They also describe some of the problems we encountered.

2.2.1 Function Definitions

Python functions have a few features not present in Scheme functions. Tuple variables are automatically unpacked, arguments may be specified by keyword instead of position, and those arguments left over (for which no key matches) are placed in a special dictionary argument. These features are implemented using a combination of compile-time rewriting (e.g. for arguments specified by keywords) and runtime processing (e.g. conversion of leftover arguments into a Python tuple). Luckily, function call argument processing is mostly handled by parts of the CPython runtime system embedded in Spy. Python's return statement is emulated using MzScheme escape continuations. For example the following small Python function:

(define f
  (make-py-function
    (make-py-code
      (lambda (x y z rest dict)
        (call-with-escape-continuation
          (lambda (return)
            (return (make-py-number 7)))))
      'f      ;; fn name
      3       ;; required args
      3       ;; total positional args
      true    ;; var-args?
      true))) ;; keyword args?

2.2.2 Function Applications

Functions are applied through py-apply. A function object is passed as the first argument to py-apply, followed by a list of supplied positional arguments (in the order they were supplied), and a list of supplied keyword arguments (also in order). So, for example, the function call addOne(2) becomes:

(py-apply add_one
          (list (make-py-number 2))
          null)

The py-apply function extracts from the addOne function object a PyCode object and hands it to a modified version of CPython's PyEval_EvalCodeEx. PyEval_EvalCodeEx, in turn, extracts from the PyCode object a Scheme procedure that simulates the behavior of the Python function when it is applied to its simulated Python arguments by py-apply.

2.2.3 Class Definitions

In Python classes are also objects. A given class has a unique object representing it and all instances of that class use a reference to that unique class object to describe the class they belong to. The class of class objects (i.e. the type of an object representing a type) is the type special object / class. The type of type is type itself (i.e. type is an object whose class is represented by the object itself). With this in mind consider this small Python class, which inherits from two classes A and B that are not shown here:

In this class C, three members are defined: the two static fields and the method m, which adds the value of the first static field to its argument. This class is converted by the code generator into a static method call to the __call__ method of the type class (which is also a callable object). This call returns a new class object which is then assigned to the variable C:

(define C
  (py-method-call type '__call__
    (list
      (make-py-string 'C)
      (make-py-tuple (list A B))
      (list
        (lambda (this-class)
          (list
            'some_static_field
            (make-py-number 7)))
        (lambda (this-class)
          (list
            'another_static_field
            (let-values
              ([(some_static_field)
                (values (python-get-member
                          this-class
                          'some_static_field #f))])
              (make-py-number 3))))
        (lambda (this-class)
          (list
            'm
            (make-py-function*
               (lambda (this x)
                 (python-method-call
                   (python-get-attribute
                      C 'some_static_field)
                   '__add__
                   (list x)))
               'm
               2
               2
               false
               false)))))))

All instances of the class C then refer to that new class object to represent the class they belong to.

At class creation time member fields (but not methods) have access to the previously created fields and methods. So for example some_static_field must be bound to its value when evaluating the expression used to initialize the field another_static_field in the class C above. To emulate this the generated Scheme code that initializes a field must always be a function that receives as value for its this-class argument the class object currently being created, to allow for the extraction of already created fields and methods from that class object if necessary.

Since the MzScheme object system segregates classes and objects, does not allow either to be modified at runtime, and does not support multiple inheritance, the Python object system could not be mapped to the MzScheme one. All Python object are therefore emulated using MzScheme hash tables (which is also what they are internally in CPython).

2.3 Variable Assignments

Identifiers are normally bound either at the top level or inside functions. Identifiers from imported modules are bound differently (see section 2.3.1).

Assignments at the top level are translated into defines for first assignments or set!s for mutative assignments. In the following Python listing, the first line defines x, while the second line mutates x and defines y as the same value 2 (which is only evaluated once).

Identifiers defined inside functions are bound using let. For example, consider the following function that uses a single variable, x, defined on the fly.

Its body is translated into this Scheme equivalent (omitting the escape continuation code used to handle possible return statements):

(lambda ()
  (let ([x undefined])
    (let ([rhs (make-py-number 1)])
      (set! x rhs))
    py-none))

As a current shortcoming of the compiler, all variables defined throughout the body of a Python function are defined at once in a single let at the start of the corresponding Scheme function. To ensure that using a variable before it is defined still results in a runtime error the let-bound variables have to be given the value undefined. While this works fine in practice, it does not provide for good error messages though. This will be fixed in the future (see section 5).

When a global statement names any variable, the named variable is simply omitted from the Scheme function's initial let bindings, thereby allowing assignments to said variable to mutate an identifier existing in an outer scope (if it exists, otherwise a runtime error occurs).

2.3.1 Importing Modules

Unlike MzScheme modules, Python modules allow assignments to identifiers defined in other modules. Python also allows cycles between modules. It was therefore not possible to map Python modules to MzScheme modules. Rather Python modules are emulated using MzScheme namespaces.

In order to import a Python module at runtime -- and, in fact, to initialize the environment at startup -- the runtime system creates a new MzScheme namespace and populates it with the built-in Python library. The runtime system then compiles the requested module and evaluates it in this new namespace. Finally, new bindings for the necessary values are copied from that namespace into the original namespace of the module importer. For example, when evaluating the statement import popen from os, only the binding for popen is copied into the original namespace from the new one created to compile the os module. A module is always only compiled once, even if it is imported multiple times.

Since import m only copies over a reference to module m and its namespace, references to values in module m, such as m.x, are shared between modules importing m. However, a statement of the form from m import x copies the value of x into the current module namespace. There is no sharing of x between modules then.

2.4 The Runtime System

The Python runtime system can be divided into two parts: modules that are written in Python and modules that are written in C. The code generation described above can be applied to both user code and the parts of the Python runtime that are written in Python. This means that Python programmers can use these runtime modules as they would normally do. This also means that Scheme programmers have access to the parts of the Python runtime written in Python by simply invoking the compiler on them and evaluating the resulting MzScheme code (although there is currently no simple API provided to do that).

The C-level modules of the Python runtime have been compiled as MzScheme extensions using modified CPython C headers and Spy's own implementation of the core CPython API. There is no wrapping and unwrapping of Scheme or Python objects across C foreign-function interface boundaries; a Python C API PyObject pointer points to a MzScheme C API Scheme_Object.

Note that there is currently no way for the Python programmer using DrScheme to access the underlying MzScheme runtime. Giving such access is easy to do through the use of a Python module naming convention that can be treated as a special case by the code generator (e.g. import mzscheme or import ... from mzscheme).

3 Status

Most of the Python language has been implemented, with the exception of the yield and exec statements. The Python eval function has not been implemented yet either but since import is implemented and since it evaluates entire Python files, the necessary machinery to implement both exec and eval is already available. There is no plan to support Unicode strings until MzScheme itself supports them in the upcoming version 300 this year. There is also currently no support for documentation strings.

Because Python features like modules or objects have very dynamic behaviors and therefore must be emulated using MzScheme namespaces and hash tables (respectively), the code generated by our system is in general significantly bigger than the original Python code. See for example the simple Python class from section 2.2.3 that expands into about 30 lines of MzScheme code. In general a growth factor of about three in the number of lines of code can be expected. The generated code also involves a large number of calls to internal runtime functions to do anything from continually converting MzScheme values into Python values and back (or more precisely into the internal representation of Python values our system is using and back) to simulating a call to the __call__ method of the type class object. Finally, each module variable, class field or method access potentially involves multiple namespace or hashtable lookups done at the Scheme level. As a result the performance of the resulting code is poor compared to the performance of the original Python code running on CPython. While no systematic performance measurement has been made yet, anecdotal evidence on a few test programs before some optimizations were added shows a slowdown by around three orders of magnitude.

Using DrScheme tools on Python programs has given mixed results. Syntax Check, which checks a program's syntax and highlights its bindings using arrows, has been successfully used on Python programs without requiring any change to the tool's code (figure 2). Some features of the Python language make Syntax Check slightly less useful for Python programs than for Scheme programs though. For example, since an object can change class at runtime, it is not possible to relate a given method call to a specific method definition using just a simple syntactic analysis of the program. This is a limitation inherent to the Python language though, not to Syntax Check.

A tool like MrFlow, which statically analyzes a program to predict its possible runtime flow of values, could potentially be able to relate a given method call to a given method definition. While MrFlow can already be used on Python programs without any change, it does not currently compute any meaningful information: MrFlow does not know yet how to analyze several of the MzScheme features used in the generated code (e.g. namespaces). Even once this problem is solved, MrFlow will probably still compute poor results. Since all Python classes and object are emulated using MzScheme hash tables, and since value flow analyses are unable to differentiate between runtime hash table keys, MrFlow will compute extremely conservative results for all the object oriented aspects of a Python program. In general there is probably no easy way to statically and efficiently analyze the generated code. In fact there is probably no way to do good value-flow analysis of Python programs at all given Python's extremely dynamic notion of objects and modules.

Another DrScheme tool, the Stepper, does not currently work with Python programs. The Stepper allows a programmer to run a program interactively step by step. To work with Python the Stepper would need to have access to a decompiler, a program capable of transforming a generated but partially reduced MzScheme program back into an equivalent Python program. Creating such a decompiler is a non-trivial task given the complexity of the code generated by the compiler.

Due to the difficulties encountered with the Python runtime and due to the current poor performance of the code generated, the compiler should be considered to be still at an experimental stage. The fact that most of the Python language has been implemented and that a DrScheme tool like Syntax Check can be used on Python programs without any change is encouraging though. The experience that has been gained in porting a real-world language to DrScheme is also valuable. We therefore consider the experiment to be successful, even if a lot of work still remains to be done.

4 Other Work

Over the past years there have been several discussions [10, 11] on Guile related mailing lists about creating a Python to Guile translator. A web site [12] for such a project even exists, but does not contain any software. Richard Stallman indicated [13] that a person has been working on ``finishing up a translator from Python to Scheme'' but no other information on that project could be found.

Jython, built on top of the Java Virtual Machine, is another implementation of the Python language. The implementation is mature and gives users access to the huge Java runtime. All the Python modules implemented in C in CPython were simply re-implemented in Java. Maintaining the CPython and Jython runtimes synchronous requires constant work though.

Python for .NET is an exploratory implementation of the Python language for the .NET framework and has therefore severe limitations (e.g. no multiple inheritance). Like Jython it gives access to the underlying runtime system and libraries. Only a handful of modules from the Python runtime have been implemented. Among those, the ones written in Python became accessible to the user after being modified to fit within the more limited Python language implemented by the interpreter. A few modules originally written in C in CPython were re-implemented using the C# language. There is also another implementation of the Python language for the .NET platform called IronPython (presented at PyCon 2004).

Compilers for other languages beside Python are being developed for DrScheme. Matthew Flatt developed an implementation of the Algol60 language as a proof of concept. David Goldberg is currently working on a compiler for the OCaml language called Dromedary, and Kathy Gray is working on a DrScheme embedding of Java called ProfessorJ.

5 Future Work

In its present state the biggest limitation of the compiler is the lack of full access to the C-level Python runtime. As such we are currently focusing most of our development efforts in that area, embedding the necessary framework one module at a time.

While the performance of the generated code is poor, no attempt has yet been made at profiling it. The performance will be better once the code generator has been modified to create more optimized code, although it is unclear to us at this stage how much improvement can be expected in this regard. The need to simulate some of the main Python features (e.g. the object and module systems) and the large number of runtime function calls and lookups involved means than the generated code will probably never have a performance level on par with the CPython system although an acceptable level should be within reach.

As described in section 3, a few parts of the Python language remain to be implemented. We do not anticipate any problem with these. There is also a general need for better error messages and a more complete test suite.

6 Conclusion

A new implementation of the Python language is now available, based on the MzScheme interpreter and the DrScheme IDE. While most of the core language has been implemented a lot of work remains to be done on the implementation of the Python runtime and on improving the performance. Despite this Python developers can already benefit from some of DrScheme's development tools to write Python code, and Scheme programmers start now to have access to the large number of existing Python libraries.

We are actively seeking volunteers to help continue this project (see section 5). For more information, visit our web site [14] and email us at either {dsilva, meunier} at ccs.neu.edu.

7 Acknowledgment

Many thanks to Scott Owens for developing the DrScheme parser tools and creating the first version of the MzScheme lexer and parser grammar for the Python language.

[1]	G. van Rossum, Python Reference Manual. PythonLabs, February 2003.
[2]	G. van Rossum, ``The Python interpreter.'' PythonLabs, February 2003. Version 2.3a2.
[3]	S. Pedroni and N. Rappin, Jython Essentials. O'Reilly, March 2002.
[4]	M. Hammond, ``Python for .NET: Lessons learned.'' ActiveState Tool Corporation, November 2000.
[5]	M. Flatt, ``The MzScheme interpreter.'' PLT, December 2002. Version 203.
[6]	M. Flatt, PLT MzScheme: Language Manual. PLT, December 2002.
[7]	M. Flatt, PLT MrEd: Graphical Toolbox Manual. PLT, December 2002.
[8]	R. B. Findler, C. Flanagan, M. Flatt, S. Krishnamurthi, and M. Felleisen, ``DrScheme: A Pedagogic Programming Environment for Scheme,'' Programming Languages: Implementations, Logics, and Programs, vol. 1292, pp. 369-388, September 1997.
[9]	``Drscheme.'' http://www.drscheme.org.
[10]	``Guile archive.'' http://www.cygwin.com/ml/guile/2000-01/threads.html#00382, January 2000.
[11]	``Guile archive.'' http://sources.redhat.com/ml/guile/2000-07/threads.html#00072, July 2000.
[12]	T. Bushnell. http://savannah.gnu.org/projects/gpc/.
[13]	R. Stallman, ``Invited Talk: My Lisp Experiences and the Development of GNU Emacs,'' in Proc. International Lisp Conference, October 2002.
[14]	``Spy.'' http://spyweb.hopto.org.