PEP 261 -- Support for "wide" Unicode characters

PEP:	261
Title:	Support for "wide" Unicode characters
Version:	$Revision: 1603 $
Last-Modified:	$Date: 2003-04-21 08:20:13 -0700 (Mon, 21 Apr 2003) $
Author:	Paul Prescod <paulp at activestate.com>
Status:	Final
Type:	Standards Track
Created:	27-Jun-2001
Python-Version:	2.2
Post-History:	27-Jun-2001

Abstract

    Python 2.1 unicode characters can have ordinals only up to
    2**16 - 1.  This range corresponds to a range in Unicode known as
    the Basic Multilingual Plane. There are now characters in Unicode
    that live on other "planes". The largest addressable character in
    Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff). For
    readability, we will call this TOPCHAR and call characters above
    2**16 - 1, up to and including TOPCHAR, "wide characters".
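
    The arithmetic behind these limits can be checked with a short
    sketch. The names PLANES, PLANE_SIZE, BMP_MAX, plane_of and
    is_wide are illustrative only and are not part of this proposal.

```python
# Unicode is organised as 17 planes of 2**16 code points each.
PLANES = 17
PLANE_SIZE = 2 ** 16

BMP_MAX = PLANE_SIZE - 1            # largest ordinal in Python 2.1
TOPCHAR = PLANES * PLANE_SIZE - 1   # largest addressable code point


def plane_of(code_point):
    """Return the plane number (0 is the BMP) that holds code_point."""
    if not 0 <= code_point <= TOPCHAR:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    return code_point // PLANE_SIZE


def is_wide(code_point):
    """True for code points that lie outside the BMP."""
    return plane_of(code_point) > 0
```

    For example, plane_of(0x1D11E) is 1, so that code point counts as
    a wide character, while is_wide(0x41) is false.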

Glossary

    Character

        Used by itself, means an addressable unit of a Python
        Unicode string.

    Code point

        A code point is an integer between 0 and TOPCHAR.
        If you imagine Unicode as a mapping from integers to
        characters, each integer is a code point. But the 
        integers between 0 and TOPCHAR that do not map to
        characters are also code points. Some will someday 
        be used for characters. Some are guaranteed never 
        to be used for characters.

    Codec

        A set of functions for translating between physical
        encodings (e.g. on disk or coming in from a network)
        and logical Python objects.

    Encoding

        Mechanism for representing abstract characters in terms of
        physical bits and bytes. Encodings allow us to store
        Unicode characters on disk and transmit them over networks
        in a manner that is compatible with other Unicode software.

    Surrogate pair

        Two physical characters that represent a single logical
        character. Part of a convention for representing 32-bit
        code points in terms of two 16-bit code points.

    Unicode string

        A Python type representing a sequence of code points with
        "string semantics" (e.g. case conversions, regular
        expression compatibility, etc.). Constructed with the
        unicode() function.
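
    The surrogate pair convention defined above can be sketched as
    follows. The constants are those of the UTF-16 encoding form
    rather than anything stated in this PEP, and the function names
    are hypothetical. The sketch works on plain integers so that it
    behaves the same on any build.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

    The two functions are inverses of each other over the range
    0x10000 to 0x10ffff.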

Proposed Solution

    One solution would be to merely increase the maximum ordinal 
    to a larger value. Unfortunately the only straightforward
    implementation of this idea is to use 4 bytes per character.
    This has the effect of doubling the size of most Unicode 
    strings. In order to avoid imposing this cost on every
    user, Python 2.2 will allow the 4-byte implementation as a
    build-time option. Users can choose whether they care about
    wide characters or prefer to preserve memory.

    The 4-byte option is called "wide Py_UNICODE". The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
      length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
      length-one string on wide Python builds. On narrow builds it will 
      raise ValueError.

        ISSUE 

            Python currently allows \U literals that cannot be
            represented as a single Python character. It generates two
            Python characters known as a "surrogate pair". Should this
            be disallowed on future narrow Python builds?

        Pro:

            Python already allows the construction of a surrogate pair
            for a large unicode literal character escape sequence.
            This is basically designed as a simple way to construct
            "wide characters" even in a narrow Python build. It is also
            somewhat logical considering that the Unicode-literal syntax
            is basically a short-form way of invoking the unicode-escape
            codec.

        Con:

            Surrogates could be easily created this way but the user
            still needs to be careful about slicing, indexing, printing 
            etc. Therefore some have suggested that Unicode
            literals should not support surrogates.


        ISSUE 

            Should Python allow the construction of characters that do
            not correspond to Unicode code points?  Unassigned Unicode 
            code points should obviously be legal (because they could 
            be assigned at any time). But code points above TOPCHAR are 
            guaranteed never to be used by Unicode. Should we allow access 
            to them anyhow?

        Pro:

            If a Python user thinks they know what they're doing why
            should we try to prevent them from violating the Unicode
            spec? After all, we don't stop 8-bit strings from
            containing non-ASCII characters.

        Con:

            Codecs and other Unicode-consuming code will have to be
            careful of these characters which are disallowed by the
            Unicode specification.

    * ord() is always the inverse of unichr()

    * There is an integer value in the sys module that describes the
      largest ordinal for a character in a Unicode string on the current
      interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
      of Python and TOPCHAR on wide builds.

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of 
               unichr (if they differ)? There has also been a
               suggestion of sys.unicodewidth which can take the 
               values 'wide' and 'narrow'.

    * every Python Unicode character represents exactly one Unicode code 
      point (i.e. Python Unicode Character = Abstract Unicode character).

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, and as variable-length sequences
      in UTF-8 and UTF-16). This is the main part of the implementation 
      left to be done.

    * There is a convention in the Unicode world for encoding a 32-bit
      code point in terms of two 16-bit code points. These are known
      as "surrogate pairs". Python's codecs will adopt this convention
      and encode 32-bit code points as surrogate pairs on narrow Python
      builds. 

        ISSUE 

            Should there be a way to tell codecs not to generate
            surrogates and instead treat wide characters as 
            errors?

        Pro:

            I might want to write code that works only with
            fixed-width characters and does not have to worry about
            surrogates.


        Con:

            No clear proposal of how to communicate this to codecs.

    * there are no restrictions on constructing strings that use 
      code points "reserved for surrogates" improperly. These are
      called "isolated surrogates". The codecs should disallow reading
      these from files, but you could construct them using string 
      literals or unichr().
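
    The rules above for sys.maxunicode and unichr() can be modelled
    in a few lines. This is only a sketch: build_width and
    model_unichr_length are hypothetical helpers, and the strings
    'narrow' and 'wide' are borrowed from the sys.unicodewidth
    suggestion, which is not an accepted API. On Python 3.3 and
    later there is no narrow build, so sys.maxunicode always equals
    TOPCHAR there.

```python
import sys

NARROW_MAX = 2 ** 16 - 1   # sys.maxunicode on a narrow build
TOPCHAR = 0x10FFFF         # sys.maxunicode on a wide build


def build_width(maxunicode=None):
    """Return 'narrow' or 'wide' for the given (or current) sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == NARROW_MAX:
        return "narrow"
    return "wide"


def model_unichr_length(i, maxunicode):
    """Model the unichr() rules for a build whose limit is maxunicode.

    Returns the length of the string unichr(i) would produce (always 1),
    or raises ValueError when i is outside the build's range.
    """
    if not 0 <= i <= maxunicode:
        raise ValueError("unichr() arg not in range(0x%x)" % (maxunicode + 1))
    return 1
```

    Under this model, model_unichr_length(0x1D11E, NARROW_MAX) raises
    ValueError, while the same call with TOPCHAR returns 1.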

Implementation

    There is a new define:

        #define Py_UNICODE_SIZE 2

    To test whether UCS-2 or UCS-4 is in use, use the derived macro
    Py_UNICODE_WIDE, which is defined only when UCS-4 is in use.

    There is a new configure option:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode      same as "=ucs2"
        --disable-unicode     entirely removes the Unicode functionality.

    It is also proposed that one day --enable-unicode will just
    default to the width of your platform's wchar_t.

    Windows builds will be narrow for a while, for three reasons:
    there have been few requests for wide characters; those requests
    are mostly from hard-core programmers with the ability to buy
    their own Python; and Windows itself is strongly biased towards
    16-bit characters.

Notes

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding for their files on disk or sent over the network. 
    It only allows them to do so. For example, ASCII is still a
    legitimate (7-bit) Unicode encoding.
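
    To illustrate that the in-memory width is independent of the
    encoding used on disk, the following sketch measures the same
    string in several encodings. encoded_sizes is a hypothetical
    helper; the codec names are the standard ones shipped with
    current Python releases.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to the number of bytes text occupies in it."""
    return dict((name, len(text.encode(name))) for name in encodings)


# One ASCII letter, one other BMP character, and one wide character.
sample = u"A\u00e9\U0001d11e"
sizes = encoded_sizes(sample)
```

    Only the fixed-width "utf-32-le" result spends four bytes on
    every character. Text that is pure ASCII can still be written
    with the "ascii" codec at one byte per character.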

    It has been proposed that there should be a module that handles
    surrogates in narrow Python builds for programmers. If someone 
    wants to implement that, it will be another PEP. It might also be 
    combined with features that allow other kinds of character-,
    word- and line-based indexing.
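
    A minimal sketch of what such a module might offer is shown
    below. It is not a proposed API: iter_code_points and
    code_point_at are hypothetical names, and the input is modelled
    as a sequence of 16-bit integers so that the sketch does not
    depend on the build it runs on.

```python
def iter_code_points(units):
    """Yield code points from 16-bit units, joining surrogate pairs.

    Isolated surrogates are passed through unchanged.
    """
    units = list(units)
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by a low surrogate: one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def code_point_at(units, index):
    """Return the index-th code point (not the index-th 16-bit unit)."""
    for position, code_point in enumerate(iter_code_points(units)):
        if position == index:
            return code_point
    raise IndexError("code point index out of range")
```

    Note that code_point_at has to scan from the start of the
    sequence, so indexing by code point takes linear time in this
    sketch.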

Rejected Suggestions

    More or less the status quo

        We could officially say that Python characters are 16-bit and
        require programmers to implement wide characters in their
        application logic by combining surrogate pairs. This is a heavy 
        burden because emulating 32-bit characters is likely to be
        very inefficient if it is coded entirely in Python. Plus these
        abstracted pseudo-strings would not be legal as input to the
        regular expression engine.

    "Space-efficient Unicode" type

        Another class of solution is to use some efficient storage
        internally but present an abstraction of wide characters to
        the programmer. Any of these would require a much more complex
        implementation than the accepted solution. For instance consider
        the impact on the regular expression engine. In theory, we could
        move to this implementation in the future without breaking Python
        code. A future Python could "emulate" wide Python semantics on
        narrow Python. Guido is not willing to undertake the
        implementation right now.

    Two types

        We could introduce a 32-bit Unicode type alongside the 16-bit
        type. There is a lot of code that expects there to be only a 
        single Unicode type.

    This PEP represents the least-effort solution. Over the next
    several years, 32-bit Unicode characters will become more common
    and that may either convince us that we need a more sophisticated 
    solution or (on the other hand) convince us that simply 
    mandating wide Unicode characters is an appropriate solution.
    Right now the two options on the table are do nothing or do
    this.

References

    Unicode Glossary: http://www.unicode.org/glossary/

Copyright

    This document has been placed in the public domain.