PEP: 218
Title: Adding a Built-In Set Object Type
Version: $Revision: 1893 $
Author: gvwilson at ddj.com (Greg Wilson), python at rcn.com (Raymond Hettinger)
Status: Final
Type: Standards Track
Python-Version: 2.2
Created: 31-Jul-2000
Post-History: 

Introduction

    This PEP proposes adding a Set module to the standard Python
    library, and to then make sets a built-in Python type if that
    module is widely used.  After explaining why sets are desirable,
    and why the common idiom of using dictionaries in their place is
    inadequate, we describe how we intend built-in sets to work, and
    then how the preliminary Set module will behave.  The last
    section discusses the mutability (or otherwise) of sets and set
    elements, and the solution which the Set module will implement.


Rationale

    Sets are a fundamental mathematical structure, and are very
    commonly used in algorithm specifications.  They are much less
    frequently used in implementations, even when they are the "right"
    structure.  Programmers frequently use lists instead, even when
    the ordering information in lists is irrelevant, and by-value
    lookups are frequent.  (Most medium-sized C programs contain a
    depressing number of start-to-end searches through malloc'd
    vectors to determine whether particular items are present or
    not...)

    Programmers are often told that they can implement sets as
    dictionaries with "don't care" values.  Items can be added to
    these "sets" by assigning the "don't care" value to them;
    membership can be tested using "dict.has_key"; and items can be
    deleted using "del".  However, the other main operations on sets
    (union, intersection, and difference) are not directly supported
    by this representation, since their meaning is ambiguous for
    dictionaries containing key/value pairs.


Proposal

    The long-term goal of this PEP is to add a built-in set type to
    Python.  This type will be an unordered collection of unique
    values, just as a dictionary is an unordered collection of
    key/value pairs.

    Iteration and comprehension will be implemented in the obvious
    ways, so that:

        for x in S:

    will step through the elements of S in arbitrary order, while:

        set(x**2 for x in S)

    will produce a set containing the squares of all elements in S,
    Membership will be tested using "in" and "not in", and basic set
    operations will be implemented by a mixture of overloaded
    operators:

        |               union
        &               intersection
        ^               symmetric difference
        -               asymmetric difference
        == !=           equality and inequality tests
        < <= >= >       subset and superset tests


    and methods:

        S.add(x)        Add "x" to the set.

        S.update(s)     Add all elements of sequence "s" to the set.

        S.remove(x)     Remove "x" from the set.  If "x" is not
                        present, this method raises a LookupError
                        exception.

        S.discard(x)    Remove "x" from the set if it is present, or
                        do nothing if it is not.

        S.pop()         Remove and return an arbitrary element,
                        raising a LookupError if the element is not
                        present.

        S.clear()       Remove all elements from this set.

        S.copy()        Make a new set.

        s.issuperset()  Check for a superset relationship.

        s.issubset()    Check for a subset relationship.
        

    and two new built-in conversion functions:

        set(x)          Create a set containing the elements of the
                        collection "x".

        frozenset(x)    Create an immutable set containing the elements
                        of the collection "x".

    Notes:

    1. We propose using the bitwise operators "|&" for intersection
       and union.  While "+" for union would be intuitive, "*" for
       intersection is not (very few of the people asked guessed what
       it did correctly).

    2. We considered using "+" to add elements to a set, rather than
       "add".  However, Guido van Rossum pointed out that "+" is
       symmetric for other built-in types (although "*" is not).  Use
       of "add" will also avoid confusion between that operation and
       set union.


Set Notation

    The PEP originally proposed {1,2,3} as the set notation and {-} for
    the empty set.  Experience with Python 2.3's sets.py showed that
    the notation was not necessary.  Also, there was some risk of making
    dictionaries less instantly recognizable.

    It was also contemplated that the braced notation would support set
    comprehensions; however, Python 2.4 provided generator expressions
    which fully met that need and did so it a more general way.
    (See PEP 289 for details on generator expressions).

    So, Guido ruled that there would not be a set syntax; however, the
    issue could be revisited for Python 3000 (see PEP 3000).


History

    To gain experience with sets, a pure python module was introduced
    in Python 2.3.  Based on that implementation, the set and frozenset
    types were introduced in Python 2.4.  The improvements are:

        * Better hash algorithm for frozensets
        * More compact pickle format (storing only an element list
          instead of a dictionary of key:value pairs where the value
          is always True).
        * Use a __reduce__ function so that deep copying is automatic.
        * The BaseSet concept was eliminated.
        * The union_update() method became just update().
        * Auto-conversion between mutable and immutable sets was dropped.
        * The _repr method was dropped (the need is met by the new
          sorted() built-in function).

    Tim Peters believes that the class's constructor should take a
    single sequence as an argument, and populate the set with that
    sequence's elements.  His argument is that in most cases,
    programmers will be creating sets from pre-existing sequences, so
    that this case should be the common one.  However, this would
    require users to remember an extra set of parentheses when
    initializing a set with known values:

    >>> Set((1, 2, 3, 4))       # case 1

    On the other hand, feedback from a small number of novice Python
    users (all of whom were very experienced with other languages)
    indicates that people will find a "parenthesis-free" syntax more
    natural:

    >>> Set(1, 2, 3, 4)         # case 2

    Ultimately, we adopted the first strategy in which the initializer
    takes a single iterable argument.


Mutability

    The most difficult question to resolve in this proposal was
    whether sets ought to be able to contain mutable elements.  A
    dictionary's keys must be immutable in order to support fast,
    reliable lookup.  While it would be easy to require set elements
    to be immutable, this would preclude sets of sets (which are
    widely used in graph algorithms and other applications).

    Earlier drafts of PEP 218 had only a single set type, but the
    sets.py implementation in Python 2.3 has two, Set and
    ImmutableSet.  For Python 2.4, the new built-in types were named
    set and frozenset which are slightly less cumbersome.

    There are two classes implemented in the "sets" module.  Instances
    of the Set class can be modified by the addition or removal of
    elements, and the ImmutableSet class is "frozen", with an
    unchangeable collection of elements.  Therefore, an ImmutableSet
    may be used as a dictionary key or as a set element, but cannot be
    updated.  Both types of set require that their elements are
    immutable, hashable objects.  Parallel comments apply to the "set"
    and "frozenset" built-in types.


Copyright

    This document has been placed in the Public Domain.