PEP 302 -- New Import Hooks

PEP:	302
Title:	New Import Hooks
Version:	$Revision: 1934 $
Last-Modified:	$Date: 2004-09-28 05:39:00 -0700 (Tue, 28 Sep 2004) $
Author:	Just van Rossum <just at letterror.com>, Paul Moore <gustav at morpheus.demon.co.uk>
Status:	Draft
Type:	Standards Track
Content-Type:	text/plain
Created:	19-Dec-2002
Python-Version:	2.3
Post-History:	19-Dec-2002

Abstract

    This PEP proposes to add a new set of import hooks that offer better
    customization of the Python import mechanism.  Contrary to the
    current __import__ hook, a new-style hook can be injected into the
    existing scheme, allowing for a finer grained control of how modules
    are found and how they are loaded.

Motivation

    The only way to customize the import mechanism is currently to
    override the built-in __import__ function.  However, overriding
    __import__ has many problems.  To begin with:

    - An __import__ replacement needs to *fully* reimplement the entire
      import mechanism, or call the original __import__ before or after
      the custom code.

    - It has very complex semantics and responsibilities.

    - __import__ gets called even for modules that are already in
      sys.modules, which is almost never what you want, unless you're
      writing some sort of monitoring tool.

    The situation gets worse when you need to extend the import
    mechanism from C: it's currently impossible, apart from hacking
    Python's import.c or reimplementing much of import.c from scratch.

    There is a fairly long history of tools written in Python that allow
    extending the import mechanism in various way, based on the
    __import__ hook.  The Standard Library includes two such tools:
    ihooks.py (by GvR) and imputil.py (Greg Stein), but perhaps the most
    famous is iu.py by Gordon McMillan, available as part of his
    Installer [1] package.  Their usefulness is somewhat limited because
    they are written in Python; bootstrapping issues need to worked
    around as you can't load the module containing the hook with the
    hook itself.  So if you want the entire Standard Library to be
    loadable from an import hook, the hook must be written in C.

Use cases

    This section lists several existing applications that depend on
    import hooks.  Among these, a lot of duplicate work was done that
    could have been saved if there had been a more flexible import hook
    at the time.  This PEP should make life a lot easier for similar
    projects in the future.

    Extending the import mechanism is needed when you want to load
    modules that are stored in a non-standard way.  Examples include
    modules that are bundled together in an archive; byte code that is
    not stored in a pyc formatted file; modules that are loaded from a
    database over a network.

    The work on this PEP was partly triggered by the implementation of
    PEP 273 [2], which adds imports from Zip archives as a built-in
    feature to Python.  While the PEP itself was widely accepted as a
    must-have feature, the implementation left a few things to desire.
    For one thing it went through great lengths to integrate itself with
    import.c, adding lots of code that was either specific for Zip file
    imports or *not* specific to Zip imports, yet was not generally
    useful (or even desirable) either.  Yet the PEP 273 implementation
    can hardly be blamed for this: it is simply extremely hard to do,
    given the current state of import.c.

    Packaging applications for end users is a typical use case for
    import hooks, if not *the* typical use case.  Distributing lots of
    source or pyc files around is not always appropriate (let alone a
    separate Python installation), so there is a frequent desire to
    package all needed modules in a single file.  So frequent in fact
    that multiple solutions have been implemented over the years.

    The oldest one is included with the Python source code: Freeze [3].
    It puts marshalled byte code into static objects in C source code.
    Freeze's "import hook" is hard wired into import.c, and has a couple
    of issues.  Later solutions include Fredrik Lundh's Squeeze [4],
    Gordon McMillan's Installer [1] and Thomas Heller's py2exe [5].
    MacPython ships with a tool called BuildApplication.

    Squeeze, Installer and py2exe use an __import__ based scheme (py2exe
    currently uses Installer's iu.py, Squeeze used ihooks.py), MacPython
    has two Mac-specific import hooks hard wired into import.c, that are
    similar to the Freeze hook.  The hooks proposed in this PEP enables
    us (at least in theory; it's not a short term goal) to get rid of
    the hard coded hooks in import.c, and would allow the
    __import__-based tools to get rid of most of their import.c
    emulation code.

    Before work on the design and implementation of this PEP was
    started, a new BuildApplication-like tool for MacOS X prompted one
    of the authors of this PEP (JvR) to expose the table of frozen
    modules to Python, in the imp module.  The main reason was to be
    able to use the freeze import hook (avoiding fancy __import__
    support), yet to also be able to supply a set of modules at
    runtime.  This resulted in sf patch #642578 [6], which was
    mysteriously accepted (mostly because nobody seemed to care either
    way ;-).  Yet it is completely superfluous when this PEP gets
    accepted, as it offers a much nicer and general way to do the same
    thing.

Rationale

    While experimenting with alternative implementation ideas to get
    built-in Zip import, it was discovered that achieving this is
    possible with only a fairly small amount of changes to import.c.
    This allowed to factor out the Zip-specific stuff into a new source
    file, while at the same time creating a *general* new import hook
    scheme: the one you're reading about now.

    An earlier design allowed non-string objects on sys.path.  Such an
    object would have the necessary methods to handle an import.  This
    has two disadvantages: 1) it breaks code that assumes all items on
    sys.path are strings; 2) it is not compatible with the PYTHONPATH
    environment variable.  The latter is directly needed for Zip
    imports.  A compromise came from Jython: allow string *subclasses*
    on sys.path, which would then act as importer objects.  This avoids
    some breakage, and seems to work well for Jython (where it is used
    to load modules from .jar files), but it was perceived as an "ugly
    hack".

    This lead to a more elaborate scheme, (mostly copied from McMillan's
    iu.py) in which each in a list of candidates is asked whether it can
    handle the sys.path item, until one is found that can.  This list of
    candidates is a new object in the sys module: sys.path_hooks.

    Traversing sys.path_hooks for each path item for each new import can
    be expensive, so the results are cached in another new object in the
    sys module: sys.path_importer_cache.  It maps sys.path entries to
    importer objects.

    To minimize the impact on import.c as well as to avoid adding extra
    overhead, it was chosen to not add an explicit hook and importer
    object for the existing file system import logic (as iu.py has), but
    to simply fall back to the built-in logic if no hook on
    sys.path_hooks could handle the path item.  If this is the case, a
    None value is stored in sys.path_importer_cache, again to avoid
    repeated lookups.  (Later we can go further and add a real importer
    object for the built-in mechanism, for now, the None fallback scheme
    should suffice.)

    A question was raised: what about importers that don't need *any*
    entry on sys.path? (Built-in and frozen modules fall into that
    category.)  Again, Gordon McMillan to the rescue: iu.py contains a
    thing he calls the "metapath".  In this PEP's implementation, it's a
    list of importer objects that is traversed *before* sys.path.  This
    list is yet another new object in the sys.module: sys.meta_path.
    Currently, this list is empty by default, and frozen and built-in
    module imports are done after traversing sys.meta_path, but still
    before sys.path.  (Again, later we can add real frozen, built-in and
    sys.path importer objects on sys.meta_path, allowing for some extra
    flexibility, but this could be done as a "phase 2" project, possibly
    for Python 2.4.  It would be the finishing touch as then *every*
    import would go through sys.meta_path, making it the central import
    dispatcher.)

Specification part 1: The Importer Protocol

    This PEP introduces a new protocol: the "Importer Protocol".  It is
    important to understand the context in which the protocol operates,
    so here is a brief overview of the outer shells of the import
    mechanism.

    When an import statement is encountered, the interpreter looks up
    the __import__ function in the built-in name space.  __import__ is
    then called with four arguments, amongst which are the name of the
    module being imported (may be a dotted name) and a reference to the
    current global namespace.

    The built-in __import__ function (known as PyImport_ImportModuleEx
    in import.c) will then check to see whether the module doing the
    import is a package or a submodule of a package.  If it is indeed a
    (submodule of a) package, it first tries to do the import relative
    to the package (the parent package for a submodule).  For example if
    a package named "spam" does "import eggs", it will first look for a
    module named "spam.eggs".  If that fails, the import continues as an
    absolute import: it will look for a module named "eggs".  Dotted
    name imports work pretty much the same: if package "spam" does
    "import eggs.bacon" (and "spam.eggs" exists and is itself a
    package), "spam.eggs.bacon" is tried.  If that fails "eggs.bacon" is
    tried.  (There are more subtleties that are not described here, but
    these are not relevant for implementers of the Importer Protocol.)

    Deeper down in the mechanism, a dotted name import is split up by
    its components.  For "import spam.ham", first an "import spam" is
    done, and only when that succeeds is "ham" imported as a submodule
    of "spam".

    The Importer Protocol operates at this level of *individual*
    imports.  By the time an importer gets a request for "spam.ham",
    module "spam" has already been imported.

    The protocol involves two objects: an importer and a loader.  An
    importer object has a single method:

        importer.find_module(fullname, path=None)

    This method will be called with the fully qualified name of the
    module.  If the importer is installed on sys.meta_path, it will
    receive a second argument, which is None for a top-level module, or
    package.__path__ for submodules or subpackages[7].  It should return
    a loader object if the module was found, or None if it wasn't.  If
    find_module() raises an exception, it will be propagated to the
    caller, aborting the import.

    A loader object also has one method:

        loader.load_module(fullname)

    This method returns the loaded module.  In many cases the importer
    and loader can be one and the same object: importer.find_module()
    would just return self.

    The 'fullname' argument of both methods is the fully qualified
    module name, for example "spam.eggs.ham".  As explained above, when
    importer.find_module("spam.eggs.ham") is called, "spam.eggs" has
    already been imported and added to sys.modules.  However, the
    find_module() method isn't necessarily always called during an
    actual import: meta tools that analyze import dependencies (such as
    freeze, Installer or py2exe) don't actually load modules, so an
    importer shouldn't *depend* on the parent package being available in
    sys.modules.

    The load_module() method has a few responsibilities that it must
    fulfill *before* it runs any code:

    - If there is an existing module object named 'fullname' in
      sys.modules, the loader must use that existing module.
      (Otherwise, the reload() builtin will not work correctly.)
      If a module named 'fullname' does not exist in sys.modules,
      the loader must create a new module object and add it to
      sys.modules.

      In C code, all of these requirements can be met simply by using
      the PyImport_AddModule() function, which returns the existing
      module or creates a new one and adds it to sys.modules for you.
      In Python code, you can use something like:

        module = sys.modules.setdefault(fullname, new.module(fullname))

      to accomplish the same results.

      Note that the module object *must* be in sys.modules before the
      loader executes the module code.  This is crucial because the
      module code may (directly or indirectly) import itself; adding
      it to sys.modules beforehand prevents unbounded recursion in the
      worst case and multiple loading in the best.

    - The __file__ attribute must be set.  This must be a string, but it
      may be a dummy value, for example "<frozen>".  The privilege of
      not having a __file__ attribute at all is reserved for built-in
      modules.

    - If it's a package, the __path__ variable must be set.  This must
      be a list, but may be empty if __path__ has no further
      significance to the importer (more on this later).

    - It should add an __loader__ attribute to the module, set to the
      loader object.  This is mostly for introspection, but can be used
      for importer-specific extras, for example getting data associated
      with an importer.

    If the module is a Python module (as opposed to a built-in module or
    a dynamically loaded extension), it should execute the module's code
    in the module's global name space (module.__dict__).

    Here is a minimal pattern for a load_module() method:

        def load_module(self, fullname):
            ispkg, code = self._get_code(fullname)
            mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
            mod.__file__ = "<%s>" % self.__class__.__name__
            mod.__loader__ = self
            if ispkg:
                mod.__path__ = []
            exec code in mod.__dict__
            return mod

Specification part 2: Registering Hooks

    There are two types of import hooks: Meta hooks and Path hooks.
    Meta hooks are called at the start of import processing, before any
    other import processing (so that meta hooks can override sys.path
    processing, or frozen modules, or even built-in modules).  To
    register a meta hook, simply add the importer object to
    sys.meta_path (the list of registered meta hooks).

    Path hooks are called as part of sys.path (or package.__path__)
    processing, at the point where their associated path item is
    encountered.  A path hook is registered by adding an importer
    factory to sys.path_hooks.

    sys.path_hooks is a list of callables, which will be checked in
    sequence to determine if they can handle a given path item.  The
    callable is called with one argument, the path item.  The callable
    must raise ImportError if it is unable to handle the path item, and
    return an importer object if it can handle the path item.  The
    callable is typically the class of the import hook, and hence the
    class __init__ method is called.  (This is also the reason why it
    should raise ImportError: an __init__ method can't return anything.
    This would be possible with a __new__ method in a new style class,
    but we don't want to require anything about how a hook is
    implemented.)

    The results of path hook checks are cached in
    sys.path_importer_cache, which is a dictionary mapping path entries
    to importer objects.  The cache is checked before sys.path_hooks is
    scanned.  If it is necessary to force a rescan of sys.path_hooks, it
    is possible to manually clear all or part of
    sys.path_importer_cache.

    Just like sys.path itself, the new sys variables must have specific
    types:

        sys.meta_path and sys.path_hooks must be Python lists.
        sys.path_importer_cache must be a Python dict.

    Modifying these variables in place is allowed, as is replacing them
    with new objects.

Packages and the role of path

    If a module has a __path__ attribute, the import mechanism will
    treat it as a package.  The __path__ variable is used instead of
    sys.path when importing submodules of the package.  The rules for
    sys.path therefore also apply to pkg.__path__.  So sys.path_hooks is
    also consulted when pkg.__path__ is traversed.  Meta importers don't
    necessarily use sys.path at all to do their work and may therefore
    ignore the value of pkg.__path__.  In this case it is still advised
    to set it to list, which can be empty.

Optional Extensions to the Importer Protocol

    The Importer Protocol defines two optional extensions.  One is to
    retrieve data files, the other is to support module packaging tools
    and/or tools that analyze module dependencies (for example Freeze
    [3]).  The latter category of tools usually don't actually *load*
    modules, they only need to know if and where they are available.
    Both extensions are highly recommended for general purpose
    importers, but may safely be left out if those features aren't
    needed.

    To retrieve the data for arbitrary "files" from the underlying
    storage backend, loader objects may supply a method named get_data:

        loader.get_data(path)

    This method returns the data as a string, or raise IOError if the
    "file" wasn't found.  It is meant for importers that have some
    file-system-like properties.  The 'path' argument is a path that can
    be constructed by munging module.__file__ (or pkg.__path__ items)
    with the os.path.* functions, for example:

        d = os.path.dirname(__file__)
        data = __loader__.get_data(os.path.join(d, "mydata.txt"))
    
    The following set of methods may be implemented if support for (for
    example) Freeze-like tools is desirable.  It consists of three
    additional methods which, to make it easier for the caller, each of
    which should be implemented, or none at all.

        loader.is_package(fullname)
        loader.get_code(fullname)
        loader.get_source(fullname)

    All three methods should raise ImportError if the module wasn't
    found.

    The loader.is_package(fullname) method should return True if the
    module specified by 'fullname' is a package and False if it isn't.

    The loader.get_code(fullname) method should return the code object
    associated with the module, or None if it's a built-in or extension
    module.  If the loader doesn't have the code object but it _does_
    have the source code, it should return the compiled the source code.
    (This is so that our caller doesn't also need to check get_source()
    if all it needs is the code object.)

    The loader.get_source(fullname) method should return the source code
    for the module as a string (using newline characters for line
    endings) or None if the source is not available (yet it should still
    raise ImportError if the module can't be found by the importer at
    all).

Integration with the 'imp' module

    The new import hooks are not easily integrated in the existing
    imp.find_module() and imp.load_module() calls.  It's questionable
    whether it's possible at all without breaking code; it is better to
    simply add a new function to the imp module.  The meaning of the
    existing imp.find_module() and imp.load_module() calls changes from:
    "they expose the built-in import mechanism" to "they expose the
    basic *unhooked* built-in import mechanism".  They simply won't
    invoke any import hooks.  A new imp module function is proposed (but
    not yet implemented) under the name "get_loader", which is used as
    in the following pattern:

        loader = imp.get_loader(fullname, path)
        if loader is not None:
            loader.load_module(fullname)

    In the case of a "basic" import, one the imp.find_module() function
    would handle, the loader object would be a wrapper for the current
    output of imp.find_module(), and loader.load_module() would call
    imp.load_module() with that output.

    Note that this wrapper is currently not yet implemented, although a
    Python prototype exists in the test_importhooks.py script (the
    ImpWrapper class) included with the patch.

Forward Compatibility

    Existing __import__ hooks will not invoke new-style hooks by magic,
    unless they call the original __import__ function as a fallback.
    For example, ihooks.py, iu.py and imputil.py are in this sense not
    forward compatible with this PEP.

Open Issues

    Modules often need supporting data files to do their job,
    particularly in the case of complex packages or full applications.
    Current practice is generally to locate such files via sys.path (or
    a package.__path__ attribute).  This approach will not work, in
    general, for modules loaded via an import hook.

    There are a number of possible ways to address this problem:

    - "Don't do that".  If a package needs to locate data files via its
      __path__, it is not suitable for loading via an import hook.  The
      package can still be located on a directory in sys.path, as at
      present, so this should not be seen as a major issue.

    - Locate data files from a standard location, rather than relative
      to the module file.  A relatively simple approach (which is
      supported by distutils) would be to locate data files based on
      sys.prefix (or sys.exec_prefix).  For example, looking in
      os.path.join(sys.prefix, "data", package_name).

    - Import hooks could offer a standard way of getting at data files
      relative to the module file.  The standard zipimport object
      provides a method get_data(name) which returns the content of the
      "file" called name, as a string.  To allow modules to get at the
      importer object, zipimport also adds an attribute "__loader__"
      to the module, containing the zipimport object used to load the
      module.  If such an approach is used, it is important that client
      code takes care not to break if the get_data method (or the
      __loader__ attribute) is not available, so it is not clear that
      this approach offers a general answer to the problem.

    Requiring loaders to set the module's __loader__ attribute means
    that the loader will not get thrown away once the load is complete.
    This increases memory usage, and stops loaders from being
    lightweight, "throwaway" objects.  As loader objects are not
    required to offer any useful functionality (any such functionality,
    such as the zipimport get_data() method mentioned above, is
    optional) it is not clear that the __loader__ attribute will be
    helpful, in practice.

    On the other hand, importer objects are mostly permanent, as they
    live or are kept alive on sys.meta_path, sys.path_importer_cache, so
    for a loader to keep a reference to the importer costs us nothing
    extra.  Whether loaders will ever need to carry so much independent
    state for this to become a real issue is questionable.

    It was suggested on python-dev that it would be useful to be able to
    receive a list of available modules from an importer and/or a list
    of available data files for use with the get_data() method.  The
    protocol could grow two additional extensions, say list_modules()
    and list_files().  The latter makes sense on loader objects with a
    get_data() method.  However, it's a bit unclear which object should
    implement list_modules(): the importer or the loader or both?

    This PEP is biased towards loading modules from alternative places:
    it currently doesn't offer dedicated solutions for loading modules
    from alternative file formats or with alternative compilers.  In
    contrast, the ihooks module from the standard library does have a
    fairly straightforward way to do this.  The Quixote project [8] uses
    this technique to import PTL files as if they are ordinary Python
    modules.  To do the same with the new hooks would either mean to add
    a new module implementing a subset of ihooks as a new-style
    importer, or add a hookable built-in path importer object.

    There is no specific support within this PEP for "stacking" hooks. 
    For example, it is not obvious how to write a hook to load modules
    from ..tar.gz files by combining separate hooks to load modules from
    .tar and ..gz files.  However, there is no support for such stacking
    in the existing hook mechanisms (either the basic "replace
    __import__" method, or any of the existing import hook modules) and
    so this functionality is not an obvious requirement of the new
    mechanism.  It may be worth considering as a future enhancement,
    however.

    It is possible (via sys.meta_path) to add hooks which run before
    sys.path is processed.  However, there is no equivalent way of
    adding hooks to run after sys.path is processed.  For now, if a hook
    is required after sys.path has been processed, it can be simulated
    by adding an arbitrary "cookie" string at the end of sys.path, and
    having the required hook associated with this cookie, via the normal
    sys.path_hooks processing.  In the longer term, the path handling
    code will become a "real" hook on sys.meta_path, and at that stage
    it will be possible to insert user-defined hooks either before or
    after it.

Implementation

    The PEP 302 implementation has been integrated with Python as of
    2.3a1.  An earlier version is available as SourceForge patch
    #652586, but more interestingly, the SF item contains a fairly
    detailed history of the development and design.
    http://www.python.org/sf/652586

    PEP 273 has been implemented using PEP 302's import hooks.

References and Footnotes

    [1] Installer by Gordon McMillan
    http://www.mcmillan-inc.com/install1.html

    [2] PEP 273, Import Modules from Zip Archives, Ahlstrom
    http://www.python.org/peps/pep-0273.html

    [3] The Freeze tool
    Tools/freeze/ in a Python source distribution

    [4] Squeeze
    http://starship.python.net/crew/fredrik/ipa/squeeze.htm

    [5] py2exe by Thomas Heller
    http://py2exe.sourceforge.net/

    [6] imp.set_frozenmodules() patch
    http://www.python.org/sf/642578

    [7] The path argument to importer.find_module() is there because the
    pkg.__path__ variable may be needed at this point.  It may either
    come from the actual parent module or be supplied by
    imp.find_module() or the proposed imp.get_loader() function.

    [8] Quixote, a framework for developing Web applications
    http://www.mems-exchange.org/software/quixote/

Copyright

    This document has been placed in the public domain.

Abstract

Motivation

Use cases

Rationale

Specification part 1: The Importer Protocol

Specification part 2: Registering Hooks

Packages and the role of __path__

Optional Extensions to the Importer Protocol

Integration with the 'imp' module

Forward Compatibility

Open Issues

Implementation

References and Footnotes

Copyright

Packages and the role of path