PEP: 219
Title: Stackless Python
Version: $Revision: 745 $
Author: Gordon McMillan <gmcm at>
Status: Draft
Type: Standards Track
Created: 14-Aug-2000
Python-Version: 2.1


    This PEP discusses changes required to core Python in order to
    efficiently support generators, microthreads and coroutines. It is
    related to PEP 220, which describes how Python should be extended
    to support these facilities. The focus of this PEP is strictly on
    the changes required to allow these extensions to work.

    While these PEPs are based on Christian Tismer's Stackless[1]
    implementation, they do not regard Stackless as a reference
    implementation.  Stackless (with an extension module) implements
    continuations, and from continuations one can implement
    coroutines, microthreads (as has been done by Will Ware[2]) and
    generators. But in more that a year, no one has found any other
    productive use of continuations, so there seems to be no demand
    for their support.

    However, Stackless support for continuations is a relatively minor
    piece of the implementation, so one might regard it as "a"
    reference implementation (rather than "the" reference


    Generators and coroutines have been implemented in a number of
    languages in a number of ways. Indeed, Tim Peters has done pure
    Python implementations of generators[3] and coroutines[4] using
    threads (and a thread-based coroutine implementation exists for
    Java). However, the horrendous overhead of a thread-based
    implementation severely limits the usefulness of this approach.

    Microthreads (a.k.a "green" or "user" threads) and coroutines
    involve transfers of control that are difficult to accommodate in
    a language implementation based on a single stack. (Generators can
    be done on a single stack, but they can also be regarded as a very
    simple case of coroutines.)

    Real threads allocate a full-sized stack for each thread of
    control, and this is the major source of overhead. However,
    coroutines and microthreads can be implemented in Python in a way
    that involves almost no overhead.  This PEP, therefor, offers a
    way for making Python able to realistically manage thousands of
    separate "threads" of activity (vs. todays limit of perhaps dozens
    of separate threads of activity).

    Another justification for this PEP (explored in PEP 220) is that
    coroutines and generators often allow a more direct expression of
    an algorithm than is possible in today's Python.


    The first thing to note is that Python, while it mingles
    interpreter data (normal C stack usage) with Python data (the
    state of the interpreted program) on the stack, the two are
    logically separate. They just happen to use the same stack.

    A real thread gets something approaching a process-sized stack
    because the implementation has no way of knowing how much stack
    space the thread will require. The stack space required for an
    individual frame is likely to be reasonable, but stack switching
    is an arcane and non-portable process, not supported by C.

    Once Python stops putting Python data on the C stack, however,
    stack switching becomes easy.

    The fundamental approach of the PEP is based on these two
    ideas. First, separate C's stack usage from Python's stack
    usage. Secondly, associate with each frame enough stack space to
    handle that frame's execution.

    In the normal usage, Stackless Python has a normal stack
    structure, except that it is broken into chunks. But in the
    presence of a coroutine / microthread extension, this same
    mechanism supports a stack with a tree structure.  That is, an
    extension can support transfers of control between frames outside
    the normal "call / return" path.


    The major difficulty with this approach is C calling Python. The
    problem is that the C stack now holds a nested execution of the
    byte-code interpreter. In that situation, a coroutine /
    microthread extension cannot be permitted to transfer control to a
    frame in a different invocation of the byte-code interpreter. If a
    frame were to complete and exit back to C from the wrong
    interpreter, the C stack could be trashed.

    The ideal solution is to create a mechanism where nested
    executions of the byte code interpreter are never needed. The easy
    solution is for the coroutine / microthread extension(s) to
    recognize the situation and refuse to allow transfers outside the
    current invocation.

    We can categorize code that involves C calling Python into two
    camps: Python's implementation, and C extensions. And hopefully we
    can offer a compromise: Python's internal usage (and C extension
    writers who want to go to the effort) will no longer use a nested
    invocation of the interpreter. Extensions which do not go to the
    effort will still be safe, but will not play well with coroutines
    / microthreads.

    Generally, when a recursive call is transformed into a loop, a bit
    of extra bookkeeping is required. The loop will need to keep it's
    own "stack" of arguments and results since the real stack can now
    only hold the most recent. The code will be more verbose, because
    it's not quite as obvious when we're done. While Stackless is not
    implemented this way, it has to deal with the same issues.

    In normal Python, PyEval_EvalCode is used to build a frame and
    execute it. Stackless Python introduces the concept of a
    FrameDispatcher. Like PyEval_EvalCode, it executes one frame. But
    the interpreter may signal the FrameDispatcher that a new frame
    has been swapped in, and the new frame should be executed. When a
    frame completes, the FrameDispatcher follows the back pointer to
    resume the "calling" frame.

    So Stackless transforms recursions into a loop, but it is not the
    FrameDispatcher that manages the frames. This is done by the
    interpreter (or an extension that knows what it's doing).

    The general idea is that where C code needs to execute Python
    code, it creates a frame for the Python code, setting its back
    pointer to the current frame. Then it swaps in the frame, signals
    the FrameDispatcher and gets out of the way. The C stack is now
    clean - the Python code can transfer control to any other frame
    (if an extension gives it the means to do so).

    In the vanilla case, this magic can be hidden from the programmer
    (even, in most cases, from the Python-internals programmer). Many
    situations present another level of difficulty, however.

    The map builtin function involves two obstacles to this
    approach. It cannot simply construct a frame and get out of the
    way, not just because there's a loop involved, but each pass
    through the loop requires some "post" processing. In order to play
    well with others, Stackless constructs a frame object for map

    Most recursions of the interpreter are not this complex, but
    fairly frequently, some "post" operations are required. Stackless
    does not fix these situations because of amount of code changes
    required. Instead, Stackless prohibits transfers out of a nested
    interpreter. While not ideal (and sometimes puzzling), this
    limitation is hardly crippling.


    For normal Python, the advantage to this approach is that C stack
    usage becomes much smaller and more predictable. Unbounded
    recursion in Python code becomes a memory error, instead of a
    stack error (and thus, in non-Cupertino operating systems,
    something that can be recovered from).  The price, of course, is
    the added complexity that comes from transforming recursions of
    the byte-code interpreter loop into a higher order loop (and the
    attendant bookkeeping involved).

    The big advantage comes from realizing that the Python stack is
    really a tree, and the frame dispatcher can transfer control
    freely between leaf nodes of the tree, thus allowing things like
    microthreads and coroutines.


    [3] Demo/threads/ in the source distribution