PEP 273 -- Import Modules from Zip Archives

PEP:	273
Title:	Import Modules from Zip Archives
Version:	$Revision: 1933 $
Last-Modified:	$Date: 2004-09-27 18:11:15 -0700 (Mon, 27 Sep 2004) $
Author:	James C. Ahlstrom <jim at interet.com>
Status:	Final
Type:	Standards Track
Created:	11-Oct-2001
Post-History:	26-Oct-2001
Python-Version:	2.3

Abstract

    This PEP adds the ability to import Python modules (*.py and
    *.py[co] files) and packages from zip archives.  The same code
    is used to speed up normal directory imports, provided
    os.listdir is available.

Note

    Zip imports were added to Python 2.3, but the final implementation
    uses an approach different from the one described in this PEP.
    The 2.3 implementation is SourceForge patch #652586, which adds
    new import hooks described in PEP 302.  

    The rest of this PEP is therefore only of historical interest.

Specification

    Currently, sys.path is a list of directory names as strings.  If
    this PEP is implemented, an item of sys.path can be a string
    naming a zip file archive.  The zip archive can contain a
    subdirectory structure to support package imports.  The zip
    archive satisfies imports exactly as a subdirectory would.
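
    As an illustration of the behavior described above, here is a
    minimal sketch that builds a small archive containing a package
    and imports from it by placing the archive name on sys.path.  It
    runs on the zip import support that actually shipped (the PEP 302
    based mechanism mentioned in the Note), not on the implementation
    this PEP proposes.  The helper import_from_archive and the
    package name pep273_q are hypothetical names chosen for the
    example.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_from_archive(files, module_name):
    """Write `files` (archive name -> source text) into a new zip
    archive, put the archive on sys.path, and import `module_name`."""
    workdir = tempfile.mkdtemp()
    archive = os.path.join(workdir, "Archive.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for name, source in files.items():
            zf.writestr(name, source)
    # The sys.path item is the archive file itself, not a directory.
    sys.path.insert(0, archive)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(archive)


# A package is just a subdirectory inside the archive.
modfoo = import_from_archive(
    {"pep273_q/__init__.py": "", "pep273_q/modfoo.py": "ANSWER = 42\n"},
    "pep273_q.modfoo",
)
```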

    The implementation is in C code in the Python core and works on
    all supported Python platforms.

    Any files may be present in the zip archive, but only *.py and
    *.py[co] files are available for import.  Zip import of dynamic
    modules (*.pyd, *.so) is disallowed.

    Just as sys.path currently has default directory names, a default
    zip archive name is added too.  Otherwise there is no way to
    import all Python library files from an archive.

Subdirectory Equivalence

    The zip archive must be treated exactly as a subdirectory tree so
    we can support package imports based on current and future rules.
    All zip data is taken from the Central Directory.  The data must
    be correct; malformed zip files are not accommodated.

    Suppose sys.path contains "/A/B/SubDir" and "/C/D/E/Archive.zip",
    and we are trying to import modfoo from the Q.R package.  Then
    import.c will generate a list of paths and extensions and will
    look for the file.  The list of generated paths does not change
    for zip imports.  Suppose import.c generates the path
    "/A/B/SubDir/Q/R/modfoo.pyc".  Then it will also generate the path
    "/C/D/E/Archive.zip/Q/R/modfoo.pyc".  Finding the SubDir path is
    exactly equivalent to finding "Q/R/modfoo.pyc" in the archive.
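
    The equivalence can be sketched as a simple prefix split: a
    generated path that begins with a known archive name is divided
    into the archive and the name to look up inside it.  This is an
    illustrative Python sketch, not the C logic in import.c, and the
    function name split_zip_path is hypothetical.

```python
def split_zip_path(path, zip_entries):
    """Return (archive, member) if `path` lies inside one of the zip
    archives named in `zip_entries`; otherwise return None."""
    for archive in zip_entries:
        prefix = archive + "/"
        if path.startswith(prefix):
            return archive, path[len(prefix):]
    return None


zip_entries = ["/C/D/E/Archive.zip"]
inside = split_zip_path("/C/D/E/Archive.zip/Q/R/modfoo.pyc", zip_entries)
outside = split_zip_path("/A/B/SubDir/Q/R/modfoo.pyc", zip_entries)
```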

    Suppose you zip up /A/B/SubDir/* and all its subdirectories.  Then
    your zip file will satisfy imports just as your subdirectory did.

    Well, not quite.  You can't satisfy dynamic modules from a zip
    file.  Dynamic modules have extensions like .dll, .pyd, and .so.
    They are operating system dependent, and probably can't be loaded
    except from a file.  It might be possible to extract the dynamic
    module from the zip file, write it to a plain file and load it.
    But that would mean creating temporary files, and dealing with all
    the dynload_*.c, and that's probably not a good idea.

    When trying to import *.pyc, *.pyo will be used instead if *.pyc
    is not available, and vice versa when looking for *.pyo.  If
    neither *.pyc nor *.pyo is available, or if the magic numbers
    are invalid, then *.py will be compiled and used to satisfy the
    import, but the compiled file will not be saved.  Python would
    normally write it to the same directory as *.py, but we do not
    want to write to the zip file.  We could write to the directory
    containing the zip archive, but that would clutter it up, which
    is undesirable if that directory is /usr/bin, for example.

    Failing to write the compiled files will make zip imports very slow,
    and the user will probably not figure out what is wrong.  So it
    is best to put *.pyc and *.pyo in the archive with the *.py.
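
    The fallback order described above can be sketched as follows.
    This is a simplified illustration that assumes the set of names
    in the archive is already known; it omits the magic number check,
    and the function name pick_candidate is hypothetical.

```python
def pick_candidate(base, wanted_ext, available):
    """Return (name, needs_compile) for the file that would satisfy
    the import, or None if nothing in `available` matches.

    `wanted_ext` is ".pyc" or ".pyo"; the other compiled form is
    tried next, and the source file is the last resort.
    """
    other_ext = ".pyo" if wanted_ext == ".pyc" else ".pyc"
    for ext in (wanted_ext, other_ext):
        if base + ext in available:
            return base + ext, False
    if base + ".py" in available:
        # Compiled in memory on every import; the result is not saved.
        return base + ".py", True
    return None


names = {"Q/R/modfoo.py", "Q/R/modfoo.pyo", "Q/R/modbar.py"}
compiled = pick_candidate("Q/R/modfoo", ".pyc", names)
source_only = pick_candidate("Q/R/modbar", ".pyc", names)
```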

Efficiency

    The only way to find files in a zip archive is linear search.  So
    for each zip file in sys.path, we search for its names once, and
    put the names plus other relevant data into a static Python
    dictionary.  The key is the archive name from sys.path joined with
    the file name (including any subdirectories) within the archive.
    This is exactly the name generated by import.c, and makes lookup
    easy.
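
    A rough Python sketch of such a dictionary is shown below, using
    the standard zipfile module to read the archive's directory.  The
    data stored per entry (compression type, size, header offset) is
    an assumption about what "other relevant data" might include, and
    index_archive is a hypothetical name.

```python
import io
import zipfile


def index_archive(archive_name, zf, cache):
    """Add one entry per archive member to `cache`, keyed by the
    archive name joined with the member name."""
    for info in zf.infolist():
        if info.is_dir():
            continue
        key = archive_name + "/" + info.filename
        cache[key] = (info.compress_type, info.file_size, info.header_offset)
    return cache


# Build a tiny archive in memory so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Q/R/modfoo.pyc", b"placeholder bytes")

with zipfile.ZipFile(buf) as zf:
    cache = index_archive("/C/D/E/Archive.zip", zf, {})
```

    After the one-time scan, checking whether a generated path exists
    is a single dictionary lookup rather than a search of the archive.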

    This same mechanism is used to speed up directory (non-zip) imports.
    See below.

zlib

    Compressed zip archives require zlib for decompression.  Prior to
    any other imports, we attempt an import of zlib.  Import of
    compressed files will fail with a message "missing zlib" unless
    zlib is available.
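
    The check can be sketched in Python as below.  The sketch uses
    the standard zipfile module for reading, which is an assumption
    made for illustration; the proposed implementation is in C.  The
    function name read_member is hypothetical.

```python
import io
import zipfile

try:
    import zlib
except ImportError:
    zlib = None


def read_member(zf, name):
    """Return the bytes of `name`, refusing compressed members when
    zlib could not be imported."""
    info = zf.getinfo(name)
    if info.compress_type == zipfile.ZIP_DEFLATED and zlib is None:
        raise ImportError("missing zlib")
    return zf.read(name)


# An uncompressed (stored) member never needs zlib.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("modfoo.py", "ANSWER = 42\n")

with zipfile.ZipFile(buf) as zf:
    source = read_member(zf, "modfoo.py")
```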

Booting

    Python imports site.py itself, and this imports os, nt, ntpath,
    stat, and UserDict.  It also imports sitecustomize.py which may
    import more modules.  Zip imports must be available before site.py
    is imported.

    Just as there are default directories in sys.path, there must be
    one or more default zip archives too.

    The problem is what the name should be.  The name should be linked
    with the Python version, so the Python executable can correctly
    find its corresponding libraries even when there are multiple
    Python versions on the same machine.

    We add one name to sys.path.  On Unix, the directory is
    sys.prefix + "/lib", and the file name is
    "python%s%s.zip" % (sys.version[0], sys.version[2]).
    So for Python 2.2 and prefix /usr/local, the path
    /usr/local/lib/python2.2/ is already on sys.path, and
    /usr/local/lib/python22.zip would be added.
    On Windows, the file is the full path to python22.dll, with
    "dll" replaced by "zip".  The zip archive name is always inserted
    as the second item in sys.path.  The first is the directory of the
    main.py (thanks Tim).
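
    The naming rule can be written out as a short sketch.  The
    function names are hypothetical, and the Windows path in the
    example is an assumed install location used only for
    illustration.

```python
def default_zip_name_unix(version, prefix):
    """Build the default archive path from a version string such as
    "2.2.1" and an installation prefix.  Like the expression in the
    text, this assumes single-digit major and minor versions."""
    return "%s/lib/python%s%s.zip" % (prefix, version[0], version[2])


def default_zip_name_windows(dll_path):
    """Replace the trailing "dll" of the Python DLL path with "zip"."""
    if not dll_path.lower().endswith("dll"):
        raise ValueError("expected a path ending in 'dll'")
    return dll_path[:-3] + "zip"


unix_name = default_zip_name_unix("2.2.1", "/usr/local")
windows_name = default_zip_name_windows(r"C:\Python22\python22.dll")
```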

Directory Imports

    The static Python dictionary used to speed up zip imports can be
    used to speed up normal directory imports too.  For each item in
    sys.path that is not a zip archive, we call os.listdir, and add
    the directory contents to the dictionary.  Then instead of calling
    fopen() in a double loop, we just check the dictionary.  This
    greatly speeds up imports.  If os.listdir doesn't exist, the
    dictionary is not used.
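
    A minimal sketch of the idea follows.  It indexes only the top
    level of each directory and ignores package subdirectories, which
    a real implementation would also need to handle.  The function
    names index_directory and find_module_file are hypothetical.

```python
import os
import tempfile


def index_directory(dirname, cache):
    """Record every name in `dirname` as a key of `cache`."""
    try:
        names = os.listdir(dirname)
    except OSError:
        return cache  # unreadable or missing directory: nothing to add
    for name in names:
        cache[os.path.join(dirname, name)] = None
    return cache


def find_module_file(path_entries, filenames, cache):
    """Walk the same double loop as the fopen() search, but test
    membership in `cache` instead of touching the file system."""
    for entry in path_entries:
        for filename in filenames:
            candidate = os.path.join(entry, filename)
            if candidate in cache:
                return candidate
    return None


# Self-contained demonstration using a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "modfoo.py"), "w") as f:
    f.write("ANSWER = 42\n")

demo_cache = index_directory(demo_dir, {})
found = find_module_file([demo_dir], ["modfoo.pyc", "modfoo.py"], demo_cache)
```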

Benchmarks

    Case  Original 2.2a3    Using os.listdir   Zip Uncomp  Zip Compr
    ---- -----------------  -----------------  ----------  ----------
      1  3.2 2.5 3.2->1.02  2.3 2.5 2.3->0.87  1.66->0.93  1.5->1.07
      2  2.8 3.9 3.0->1.32  Same as Case 1.
      3  5.7 5.7 5.7->5.7   2.1 2.1 2.1->1.8   1.25->0.99  1.19->1.13
      4  9.4 9.4 9.3->9.35  Same as Case 3.

    Case 1: Local drive C:, sys.path has its default value.
    Case 2: Local drive C:, directory with files is at the end of sys.path.
    Case 3: Network  drive, sys.path has its default value.
    Case 4: Network  drive, directory with files is at the end of sys.path.

    Benchmarks were performed on a Pentium 4 clone, 1.4 GHz, 256 MB RAM.
    The machine was running Windows 2000 with a Linux/Samba network server.
    Times are in seconds, and are the time to import about 100 Lib modules.
    Case 2 and 4 have the "correct" directory moved to the end of sys.path.
    "Uncomp" means uncompressed zip archive, "Compr" means compressed.

    Initial times are after a re-boot of the system; the time after
    "->" is the time after repeated runs.  Times to import from C:
    after a re-boot are rather highly variable for the "Original" case,
    but are more realistic.

Custom Imports

    The logic demonstrates the ability to import using default searching
    until a needed Python module (in this case, os) becomes available.
    This can be used to bootstrap custom importers.  For example, if
    "importer()" in __init__.py exists, then it could be used for imports.
    The "importer()" can freely import os and other modules, and these
    will be satisfied from the default mechanism.  This PEP does not
    define any custom importers, and this note is for information only.

Implementation

    A C implementation is available as SourceForge patch 492105.
    Superseded by patch 652586 and current CVS.
    http://python.org/sf/492105

    A newer version (updated for recent CVS by Paul Moore) is 645650.
    Superseded by patch 652586 and current CVS.
    http://python.org/sf/645650

    A competing implementation by Just van Rossum is 652586, which is
    the basis for the final implementation of PEP 302.  PEP 273 has
    been implemented using PEP 302's import hooks.
    http://python.org/sf/652586

Copyright

    This document has been placed in the public domain.