PEP 296 -- Adding a bytes Object Type

PEP:	296
Title:	Adding a bytes Object Type
Version:	$Revision: 1681 $
Last-Modified:	$Date: 2003-09-07 06:55:30 -0700 (Sun, 07 Sep 2003) $
Author:	xscottg at yahoo.com (Scott Gilbert)
Status:	Rejected
Type:	Standards Track
Created:	12-Jul-2002
Python-Version:	2.3
Post-History:

Notice

    This PEP is withdrawn by the author.

Abstract

    This PEP proposes the creation of a new standard type and builtin
    constructor called 'bytes'.  The bytes object is an efficiently
    stored array of bytes with some additional characteristics that
    set it apart from several implementations that are similar.

Rationale

    Python currently has many objects that implement something akin to
    the bytes object of this proposal.  For instance the standard
    string, buffer, array, and mmap objects are all very similar in
    some regards to the bytes object.  Additionally, several
    significant third party extensions have created similar objects to
    try and fill similar needs.  Frustratingly, each of these objects
    is too narrow in scope and is missing critical features to make it
    applicable to a wider category of problems.

Specification

    The bytes object has the following important characteristics:

    1. Efficient underlying array storage via the standard C type "unsigned
    char".  This allows fine grain control over how much memory is
    allocated.  With the alignment restrictions designated in the next
    item, it is trivial for low level extensions to cast the pointer
    to a different type as needed.
    
    Also, since the object is implemented as an array of bytes, it is
    possible to pass the bytes object to the extensive library of
    routines already in the standard library that presently work with
    strings.  For instance, the bytes object in conjunction with the
    struct module could be used to provide a complete replacement for
    the array module using only Python script.

    If an unusual platform comes to light, one where there isn't a
    native unsigned 8 bit type, the object will do its best to
    represent itself at the Python script level as though it were an
    array of 8 bit unsigned values.  It is doubtful whether many
    extensions would handle this correctly, but Python script could be
    portable in these cases.

    2. Alignment of the allocated byte array is whatever is promised by the
    platform implementation of malloc.  A bytes object created from an
    extension can be supplied that provides any arbitrary alignment as
    the extension author sees fit.

    This alignment restriction should allow the bytes object to be
    used as storage for all standard C types - including PyComplex
    objects or other structs of standard C type types.  Further
    alignment restrictions can be provided by extensions as necessary.

    3. The bytes object implements a subset of the sequence operations
    provided by string/array objects, but with slightly different
    semantics in some cases.  In particular, a slice always returns a
    new bytes object, but the underlying memory is shared between the
    two objects.  This type of slice behavior has been called creating
    a "view".  Additionally, repetition and concatenation are
    undefined for bytes objects and will raise an exception.

    As these objects are likely to find use in high performance
    applications, one motivation for the decision to use view slicing
    is that copying between bytes objects should be very efficient and
    not require the creation of temporary objects.  The following code
    illustrates this:

        # create two 10 Meg bytes objects
        b1 = bytes(10000000)
        b2 = bytes(10000000)

        # copy from part of one to another with out creating a 1 Meg temporary
        b1[2000000:3000000] = b2[4000000:5000000]

    Slice assignment where the rvalue is not the same length as the
    lvalue will raise an exception.  However, slice assignment will
    work correctly with overlapping slices (typically implemented with
    memmove).

    4. The bytes object will be recognized as a native type by the pickle and
    cPickle modules for efficient serialization.  (In truth, this is
    the only requirement that can't be implemented via a third party
    extension.)

    Partial solutions to address the need to serialize the data stored
    in a bytes-like object without creating a temporary copy of the
    data into a string have been implemented in the past.  The tofile
    and fromfile methods of the array object are good examples of
    this.  The bytes object will support these methods too.  However,
    pickling is useful in other situations - such as in the shelve
    module, or implementing RPC of Python objects, and requiring the
    end user to use two different serialization mechanisms to get an
    efficient transfer of data is undesirable.

    XXX: Will try to implement pickling of the new bytes object in
    such a way that previous versions of Python will unpickle it as a
    string object.

    When unpickling, the bytes object will be created from memory
    allocated from Python (via malloc).  As such, it will lose any
    additional properties that an extension supplied pointer might
    have provided (special alignment, or special types of memory).

    XXX: Will try to make it so that C subclasses of bytes type can
    supply the memory that will be unpickled into.  For instance, a
    derived class called PageAlignedBytes would unpickle to memory
    that is also page aligned.

    On any platform where an int is 32 bits (most of them), it is
    currently impossible to create a string with a length larger than
    can be represented in 31 bits.  As such, pickling to a string will
    raise an exception when the operation is not possible.

    At least on platforms supporting large files (many of them),
    pickling large bytes objects to files should be possible via
    repeated calls to the file.write() method.

    5. The bytes type supports the PyBufferProcs interface, but a bytes object
    provides the additional guarantee that the pointer will not be
    deallocated or reallocated as long as a reference to the bytes
    object is held.  This implies that a bytes object is not resizable
    once it is created, but allows the global interpreter lock (GIL)
    to be released while a separate thread manipulates the memory
    pointed to if the PyBytes_Check(...) test passes.

    This characteristic of the bytes object allows it to be used in
    situations such as asynchronous file I/O or on multiprocessor
    machines where the pointer obtained by PyBufferProcs will be used
    independently of the global interpreter lock.

    Knowing that the pointer can not be reallocated or freed after the
    GIL is released gives extension authors the capability to get true
    concurrency and make use of additional processors for long running
    computations on the pointer.

    6. In C/C++ extensions, the bytes object can be created from a supplied
    pointer and destructor function to free the memory when the
    reference count goes to zero.

    The special implementation of slicing for the bytes object allows
    multiple bytes objects to refer to the same pointer/destructor.
    As such, a refcount will be kept on the actual
    pointer/destructor.  This refcount is separate from the refcount
    typically associated with Python objects.

    XXX: It may be desirable to expose the inner refcounted object as an
    actual Python object.  If a good use case arises, it should be possible
    for this to be implemented later with no loss to backwards compatibility.

    7. It is also possible to signify the bytes object as readonly, in this
    case it isn't actually mutable, but does provide the other features of a
    bytes object.

    8. The bytes object keeps track of the length of its data with a Python
    LONG_LONG type.  Even though the current definition for PyBufferProcs
    restricts the length to be the size of an int, this PEP does not propose
    to make any changes there.  Instead, extensions can work around this limit
    by making an explicit PyBytes_Check(...) call, and if that succeeds they
    can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer call to
    get the pointer and full length of the object as a LONG_LONG.

    The bytes object will raise an exception if the standard PyBufferProcs
    mechanism is used and the size of the bytes object is greater than can be
    represented by an integer.

    From Python scripting, the bytes object will be subscriptable with longs
    so the 32 bit int limit can be avoided.

    There is still a problem with the len() function as it is PyObject_Size()
    and this returns an int as well.  As a workaround, the bytes object will
    provide a .length() method that will return a long.

    9. The bytes object can be constructed at the Python scripting level by
    passing an int/long to the bytes constructor with the number of bytes to
    allocate.  For example:

       b = bytes(100000) # alloc 100K bytes

    The constructor can also take another bytes object.  This will be useful
    for the implementation of unpickling, and in converting a read-write bytes
    object into a read-only one.  An optional second argument will be used to
    designate creation of a readonly bytes object.

    10. From the C API, the bytes object can be allocated using any of the
    following signatures:

       PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
       PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly
                void (*dest)(void *ptr, void *user), void* user);
    
    In the PyBytes_FromPointer(...) function, if the dest function pointer is
    passed in as NULL, it will not be called.  This should only be used for
    creating bytes objects from statically allocated space.
    
    The user pointer has been called a closure in other places.  It is a
    pointer that the user can use for whatever purposes.  It will be passed to
    the destructor function on cleanup and can be useful for a number of
    things.  If the user pointer is not needed, NULL should be passed instead.
 
    11. The bytes type will be a new style class as that seems to be where all
    standard Python types are headed.

Contrast to existing types

    The most common way to work around the lack of a bytes object has been to
    simply use a string object in its place.  Binary files, the struct/array
    modules, and several other examples exist of this.  Putting aside the
    style issue that these uses typically have nothing to do with text
    strings, there is the real problem that strings are not mutable, so direct
    manipulation of the data returned in these cases is not possible.  Also,
    numerous optimizations in the string module (such as caching the hash
    value or interning the pointers) mean that extension authors are on very
    thin ice if they try to break the rules with the string object.

    The buffer object seems like it was intended to address the purpose that
    the bytes object is trying fulfill, but several shortcomings in its
    implementation [1] have made it less useful in many common cases.  The
    buffer object made a different choice for its slicing behavior (it returns
    new strings instead of buffers for slicing and other operations), and it
    doesn't make many of the promises on alignment or being able to release
    the GIL that the bytes object does.

    Also in regards to the buffer object, it is not possible to simply replace
    the buffer object with the bytes object and maintain backwards
    compatibility.  The buffer object provides a mechanism to take the
    PyBufferProcs supplied pointer of another object and present it as its
    own.  Since the behavior of the other object can not be guaranteed to
    follow the same set of strict rules that a bytes object does, it can't be
    used in places that a bytes object could.

    The array module supports the creation of an array of bytes, but it does
    not provide a C API for supplying pointers and destructors to extension
    supplied memory.  This makes it unusable for constructing objects out of
    shared memory, or memory that has special alignment or locking for things
    like DMA transfers.  Also, the array object does not currently pickle.
    Finally since the array object allows its contents to grow, via the extend
    method, the pointer can be changed if the GIL is not held while using it.

    Creating a buffer object from an array object has the same problem of
    leaving an invalid pointer when the array object is resized.

    The mmap object caters to its particular niche, but does not attempt to
    solve a wider class of problems.

    Finally, any third party extension can not implement pickling without
    creating a temporary object of a standard python type.  For example in the
    Numeric community, it is unpleasant that a large array can't pickle
    without creating a large binary string to duplicate the array data.

Backward Compatibility

    The only possibility for backwards compatibility problems that the author
    is aware of are in previous versions of Python that try to unpickle data
    containing the new bytes type.

Reference Implementation

    XXX: Actual implementation is in progress, but changes are still possible
    as this PEP gets further review.

    The following new files will be added to the Python baseline:

        Include/bytesobject.h  # C interface
        Objects/bytesobject.c  # C implementation
        Lib/test/test_bytes.py # unit testing
        Doc/lib/libbytes.tex   # documentation

    The following files will also be modified:

        Include/Python.h       # adding bytesmodule.h include file
        Python/bltinmodule.c   # adding the bytes type object
        Modules/cPickle.c      # adding bytes to the standard types
        Lib/pickle.py          # adding bytes to the standard types

    It is possible that several other modules could be cleaned up and
    implemented in terms of the bytes object.  The mmap module comes to mind
    first, but as noted above it would be possible to reimplement the array
    module as a pure Python module.  While it is attractive that this PEP
    could actually reduce the amount of source code by some amount, the author
    feels that this could cause unnecessary risk for breaking existing
    applications and should be avoided at this time.

Additional Notes/Comments

    - Guido van Rossum wondered whether it would make sense to be able
    to create a bytes object from a mmap object.  The mmap object
    appears to support the requirements necessary to provide memory
    for a bytes object.  (It doesn't resize, and the pointer is valid
    for the lifetime of the object.)  As such, a method could be added
    to the mmap module such that a bytes object could be created
    directly from a mmap object.  An initial stab at how this would be
    implemented would be to use the PyBytes_FromPointer() function
    described above and pass the mmap_object as the user pointer.  The
    destructor function would decref the mmap_object for cleanup.

    - Todd Miller notes that it may be useful to have two new functions:
    PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer that are
    similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but
    support getting a LONG_LONG length in addition to the void* pointer.
    These functions would allow extension authors to work transparently with
    bytes object (that support LONG_LONG lengths) and most other buffer like
    objects (which only support int lengths).  These functions could be in
    lieu of, or in addition to, creating a specific PyByte_GetReadBuffer() and
    PyBytes_GetWriteBuffer() functions.

    XXX: The author thinks this is very a good idea as it paves the way for
    other objects to eventually support large (64 bit) pointers, and it should
    only affect abstract.c and abstract.h.  Should this be added above?

    - It was generally agreed that abusing the segment count of the
    PyBufferProcs interface is not a good hack to work around the 31 bit
    limitation of the length.  If you don't know what this means, then you're
    in good company.  Most code in the Python baseline, and presumably in many
    third party extensions, punt when the segment count is not 1.

References

    [1] The buffer interface
        http://mail.python.org/pipermail/python-dev/2000-October/009974.html

Copyright

    This document has been placed in the public domain.