PEP 237 -- Unifying Long Integers and Integers

PEP:	237
Title:	Unifying Long Integers and Integers
Version:	$Revision: 1882 $
Author:	Moshe Zadka, Guido van Rossum
Status:	Draft
Type:	Standards Track
Created:	11-Mar-2001
Python-Version:	2.2
Post-History:	16-Mar-2001, 14-Aug-2001, 23-Aug-2001

Abstract

    Python currently distinguishes between two kinds of integers
    (ints): regular or short ints, limited by the size of a C long
    (typically 32 or 64 bits), and long ints, which are limited only
    by available memory.  When operations on short ints yield results
    that don't fit in a C long, they raise an error.  There are some
    other distinctions too.  This PEP proposes to do away with most of
    the differences in semantics, unifying the two types from the
    perspective of the Python user.

Rationale

    Many programs find a need to deal with larger numbers after the
    fact, and changing the algorithms later is bothersome.  It can
    hinder performance in the normal case, when all arithmetic is
    performed using long ints whether or not they are needed.

    Having the machine word size exposed to the language hinders
    portability.  For examples Python source files and .pyc's are not
    portable between 32-bit and 64-bit machines because of this.

    There is also the general desire to hide unnecessary details from
    the Python user when they are irrelevant for most applications.
    An example is memory allocation, which is explicit in C but
    automatic in Python, giving us the convenience of unlimited sizes
    on strings, lists, etc.  It makes sense to extend this convenience
    to numbers.

    It will give new Python programmers (whether they are new to
    programming in general or not) one less thing to learn before they
    can start using the language.

Implementation

    Initially, two alternative implementations were proposed (one by
    each author):

    1. The PyInt type's slot for a C long will be turned into a 

        union {
            long i;
            struct {
                unsigned long length;
                digit digits[1];
            } bignum;
        };

       Only the n-1 lower bits of the long have any meaning; the top
       bit is always set.  This distinguishes the union.  All PyInt
       functions will check this bit before deciding which types of
       operations to use.

    2. The existing short and long int types remain, but operations
       return a long int instead of raising OverflowError when a
       result cannot be represented as a short int.  A new type,
       integer, may be introduced that is an abstract base type of
       which both the int and long implementation types are
       subclassed.  This is useful so that programs can check
       integer-ness with a single test:

           if isinstance(i, integer): ...

    After some consideration, the second implementation plan was
    selected, since it is far easier to implement, is backwards
    compatible at the C API level, and in addition can be implemented
    partially as a transitional measure.

Incompatibilities

    The following operations have (usually subtly) different semantics
    for short and for long integers, and one or the other will have to
    be changed somehow.  This is intended to be an exhaustive list.
    If you know of any other operation that differ in outcome
    depending on whether a short or a long int with the same value is
    passed, please write the second author.

    - Currently, all arithmetic operators on short ints except <<
      raise OverflowError if the result cannot be represented as a
      short int.  This will be changed to return a long int instead.
      The following operators can currently raise OverflowError: x+y,
      x-y, x*y, x**y, divmod(x, y), x/y, x%y, and -x.  (The last four
      can only overflow when the value -sys.maxint-1 is involved.)

    - Currently, x<<n can lose bits for short ints.  This will be
      changed to return a long int containing all the shifted-out
      bits, if returning a short int would lose bits (where changing
      sign is considered a special case of losing bits).

    - Currently, hex and oct literals for short ints may specify
      negative values; for example 0xffffffff == -1 on a 32-bit
      machine.  This will be changed to equal 0xffffffffL (2**32-1).

    - Currently, the '%u', '%x', '%X' and '%o' string formatting
      operators and the hex() and oct() built-in functions behave
      differently for negative numbers: negative short ints are
      formatted as unsigned C long, while negative long ints are
      formatted with a minus sign.  This will be changed to use the
      long int semantics in all cases (but without the trailing 'L'
      that currently distinguishes the output of hex() and oct() for
      long ints).  Note that this means that '%u' becomes an alias for
      '%d'.  It will eventually be removed.

    - Currently, repr() of a long int returns a string ending in 'L'
      while repr() of a short int doesn't.  The 'L' will be dropped;
      but not before Python 3.0.

    - Currently, an operation with long operands will never return a
      short int.  This *may* change, since it allows some
      optimization.  (No changes have been made in this area yet, and
      none are planned.)

    - The expression type(x).__name__ depends on whether x is a short
      or a long int.  Since implementation alternative 2 is chosen,
      this difference will remain.  (In Python 3.0, we *may* be able
      to deploy a trick to hide the difference, because it *is*
      annoying to reveal the difference to user code, and more so as
      the difference between the two types is less visible.)

    - Long and short ints are handled different by the marshal module,
      and by the pickle and cPickle modules.  This difference will
      remain (at least until Python 3.0).

    - Short ints with small values (typically between -1 and 99
      inclusive) are "interned" -- whenever a result has such a value,
      an existing short int with the same value is returned.  This is
      not done for long ints with the same values.  This difference
      will remain.  (Since there is no guarantee of this interning, is
      is debatable whether this is a semantic difference -- but code
      may exist that uses 'is' for comparisons of short ints and
      happens to work because of this interning.  Such code may fail
      if used with long ints.)

Literals

    A trailing 'L' at the end of an integer literal will stop having
    any meaning, and will be eventually become illegal.  The compiler
    will choose the appropriate type solely based on the value.
    (Until Python 3.0, it will force the literal to be a long; but
    literals without a trailing 'L' may also be long, if they are not
    representable as short ints.)

Built-in Functions

    The function int() will return a short or a long int depending on
    the argument value.  In Python 3.0, the function long() will call
    the function int(); before then, it will continue to force the
    result to be a long int, but otherwise work the same way as int().
    The built-in name 'long' will remain in the language to represent
    the long implementation type (unless it is completely eradicated
    in Python 3.0), but using the int() function is still recommended,
    since it will automatically return a long when needed.

C API

    The C API remains unchanged; C code will still need to be aware of
    the difference between short and long ints.  (The Python 3.0 C API
    will probably be completely incompatible.)

    The PyArg_Parse*() APIs already accept long ints, as long as they
    are within the range representable by C ints or longs, so that
    functions taking C int or long argument won't have to worry about
    dealing with Python longs.

Transition

    There are three major phases to the transition:

    A. Short int operations that currently raise OverflowError return
       a long int value instead.  This is the only change in this
       phase.  Literals will still distinguish between short and long
       ints.  The other semantic differences listed above (including
       the behavior of <<) will remain.  Because this phase only
       changes situations that currently raise OverflowError, it is
       assumed that this won't break existing code.  (Code that
       depends on this exception would have to be too convoluted to be
       concerned about it.)  For those concerned about extreme
       backwards compatibility, a command line option (or a call to
       the warnings module) will allow a warning or an error to be
       issued at this point, but this is off by default.

    B. The remaining semantic differences are addressed.  In all cases
       the long int semantics will prevail.  Since this will introduce
       backwards incompatibilities which will break some old code,
       this phase may require a future statement and/or warnings, and
       a prolonged transition phase.  The trailing 'L' will continue
       to be used for longs as input and by repr().

    C. The trailing 'L' is dropped from repr(), and made illegal on
       input.  (If possible, the 'long' type completely disappears.)

    Phase A will be implemented in Python 2.2.

    Phase B will be implemented gradually in Python 2.3 and Python
    2.4.  Envisioned stages of phase B:

    B0. Warnings are enabled about operations that will change their
        numeric outcome in stage B1, in particular hex() and oct(),
        '%u', '%x', '%X' and '%o', hex and oct literals in the
        (inclusive) range [sys.maxint+1, sys.maxint*2+1], and left
        shifts losing bits.

    B1. The new semantic for these operations are implemented.
        Operations that give different results than before will *not*
        issue a warning.

    We propose the following timeline:

    B0. Python 2.3.

    B1. Python 2.4.

    Phase C will be implemented in Python 3.0 (at least two years
    after Python 2.4 is released).

OverflowWarning

    Here are the rules that guide warnings generated in situations
    that currently raise OverflowError.  This applies to transition
    phase A.  Historical note:  despite that phase A was completed in
    Python 2.2, and phase B0 in Python 2.3, nobody noticed that
    OverflowWarning was still generated in Python 2.3.  It was finally
    disabled in Python 2.4.  The Python builtin OverflowWarning, and
    the corresponding C API PyExc_OverflowWarning, are no longer
    generated or used in Python 2.4, but will remain for the (unlikely)
    case of user code until Python 2.5.

    - A new warning category is introduced, OverflowWarning.  This is
      a built-in name.

    - If an int result overflows, an OverflowWarning warning is
      issued, with a message argument indicating the operation,
      e.g. "integer addition".  This may or may not cause a warning
      message to be displayed on sys.stderr, or may cause an exception
      to be raised, all under control of the -W command line and the
      warnings module.

    - The OverflowWarning warning is ignored by default.

    - The OverflowWarning warning can be controlled like all warnings,
      via the -W command line option or via the
      warnings.filterwarnings() call.  For example:

        python -Wdefault::OverflowWarning

      cause the OverflowWarning to be displayed the first time it
      occurs at a particular source line, and

        python -Werror::OverflowWarning

      cause the OverflowWarning to be turned into an exception
      whenever it happens.  The following code enables the warning
      from inside the program:

        import warnings
        warnings.filterwarnings("default", "", OverflowWarning)

      See the python man page for the -W option and the the warnings
      module documentation for filterwarnings().

    - If the OverflowWarning warning is turned into an error,
      OverflowError is substituted.  This is needed for backwards
      compatibility.

    - Unless the warning is turned into an exceptions, the result of
      the operation (e.g., x+y) is recomputed after converting the
      arguments to long ints.

Example

    If you pass a long int to a C function or built-in operation that
    takes an integer, it will be treated the same as as a short int as
    long as the value fits (by virtue of how PyArg_ParseTuple() is
    implemented).  If the long value doesn't fit, it will still raise
    an OverflowError.  For example:

      def fact(n):
          if n <= 1:
              return 1
          return n*fact(n-1)

      A = "ABCDEFGHIJKLMNOPQ"
      n = input("Gimme an int: ")
      print A[fact(n)%17]

    For n >= 13, this currently raises OverflowError (unless the user
    enters a trailing 'L' as part of their input), even though the
    calculated index would always be in range(17).  With the new
    approach this code will do the right thing: the index will be
    calculated as a long int, but its value will be in range.

Resolved Issues

    These issues, previously open, have been resolved.

    - What to do about sys.maxint?  Leave it in, since it is still
      relevant whenever the distinction between short and long ints is
      still relevant (e.g. when inspecting the type of a value).

    - Should we remove '%u' completely?  Remove it.

    - Should we warn about << not truncating integers?  Yes.

    - Should the overflow warning be on a portable maximum size?  No.

Implementation

    The implementation work for the Python 2.x line is completed;
    phase A was released with Python 2.2, phase B0 with Python 2.3,
    and phase B1 will be released with Python 2.4 (and is already in
    CVS).

Copyright

    This document has been placed in the public domain.