This is a summary of traffic on the python-dev mailing list between September 01, 2002 and September 15, 2002 (exclusive). It is intended to inform the wider Python community of ongoing developments on the list. To comment on anything mentioned here, just post to python-list@python.org or comp.lang.python in the usual way. Give your posting a meaningful subject line, and if it's about a PEP, include the PEP number (e.g. Subject: PEP 201 - Lockstep iteration). All python-dev members are interested in seeing ideas discussed by the community, so don't hesitate to take a stance on a PEP (or anything else for that matter) if you have an opinion.

This is the second summary written by Brett Cannon (hopefully my sophomoric performance will be better than most sophomore music albums).

Summaries by me (2002-09-15 to ... when I burn out) are archived at:
http://www.ocf.berkeley.edu/~bac/python-dev/summaries/index.php
You can find summaries by Michael Hudson (2002-02-01 to 2002-07-05) at:
http://starship.python.net/crew/mwh/summaries/index.html
Summaries by A.M. Kuchling (2000-12-01 to 2001-01-31) are at:
http://www.amk.ca/python/dev/

Please note that this summary is written using reStructuredText which can be found at http://docutils.sourceforge.net/rst.html . Any unfamiliar punctuation is probably markup for reST; you can safely ignore it (although I suggest learning reST; it's nice and is accepted for PEP markup). Also, because of the wonders of reformatting thanks to whatever you are using to read this, I cannot guarantee you will be able to run this text through DocUtils as-is. If you want to do that, get the original text from the archive.

I am considering keeping a list of names that people are often referred to in emails. This would serve a dual purpose: it would give people who read emails from the list a reference for figuring out who is who, and it would make the summaries easier for me because I could then refer to people by the names I know them by. =) Any comments on this idea are appreciated.

To commit or not commit

Walter Dorwald asked if there were "any objections against committing the patch" for implementing PEP 293 (Codec Error Handling Callbacks). Guido asked what Martin V. Lowis and M.A. Lemburg had to say about it. MAL responded that he was +1 on the patch. Martin was "concerned about the massive amounts of C code, most of which could be expressed way more compact in Python code", but "Walter convinced [MvL] that this does have a real performance impact for real data" so he would live with it. In the end he gave it his vote.

Walter said he would check it in (and he has). The PEP has now been moved to the finished PEP list.
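
As a rough illustration of what PEP 293 makes possible, the sketch below registers an error handling callback with codecs.register_error() and uses it while encoding. The handler name "bracket" and the function bracket_codepoints are made up for this example, and it is written in present-day Python 3 syntax rather than anything from the patch itself.

```python
import codecs


def bracket_codepoints(exc):
    """Replace each unencodable character with a [U+XXXX] marker."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors; pass anything else along.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "bracket" is a hypothetical handler name chosen for this example.
codecs.register_error("bracket", bracket_codepoints)

encoded = "price: 5\u20ac".encode("ascii", errors="bracket")
print(encoded)
```

The callback receives the exception object, so it can inspect exactly which slice of the input failed before deciding what to substitute.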

Proposed Mixins for Wide Interfaces

Raymond Hettinger suggested adding mixin classes that automatically implement magic methods when certain basic magic methods were already implemented (e.g., "given an __eq__ method in a subclass, adds a __ne__ method"). David Abrahams said that he thought "these are a great idea, in the context of an understanding of what we want interfaces to be, say, and do." Guido brought up some points about the initial suggestions Raymond made. He then said that he thought that there wasn't "enough here to warrant putting this into the standard library"; the issue will be revisited when a standard type or interface hierarchy is added to Python (not in 2.3).
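
To make the idea concrete, here is a minimal sketch of such a mixin, assuming the __eq__/__ne__ pairing quoted above. The names NotEqualMixin and Point are invented for illustration and are not taken from Raymond's proposal. (Python 3 now derives __ne__ from __eq__ by default, so the mixin is only spelled out here to show the mechanism.)

```python
class NotEqualMixin:
    """Hypothetical mixin: derive __ne__ from the subclass's __eq__."""

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            # Let Python try the reflected operation on the other operand.
            return result
        return not result


class Point(NotEqualMixin):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


print(Point(1, 2) != Point(1, 3))
print(Point(1, 2) != Point(1, 2))
```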

mysterious hangs in socket code

Jeremy Hylton wrote some threaded code to fetch some web pages that hung when performing a slow DNS operation. Apparently, in Python 2.1 "it produces a steady stream of output -- urls and the time it took to load them". In Python 2.2 and 2.3, though, "it produces little bursts of output, then pauses for a long time, then repeats". Jeremy guessed that it might have something to do with Linux's getaddrinfo() being thread-safe by allowing only a single lookup at a time. Aahz said that "gethostbyname() IIRC has frequently been non-reentrant".

Two random and nearly unrelated ideas

Skip Montanaro had two ideas; one was to make the info in Misc/NEWS (which is a summary of what has been changed in Python for each release) a web page and the other was "to get rid of the ticker altogether in systems with proper signal support" (see the 2002-08-16 - 2002-09-01 summary for an explanation of what the ticker is). That would get rid of the polling of the ticker and thus reduce the overhead on threads.

For the first idea, Guido asked Skip to try marking it up in reST to see what the resulting page would look like.

In response to the second idea, Oren Tirosh said it couldn't be done until "all Python I/O calls are converted to be EINTR-safe" (EINTR-safe means being able to handle the EINTR error, which is what is returned "When an I/O operation is interrupted by an unmasked signal"). That "requires a lot of work in some of the hairiest places in the Python codebase." Fredrik Lundh said that this "sounds like a good topic for a 'here's what I learned when trying to fix this problem' PEP". This is most likely in reference to Skip writing the patch to make the ticker global instead of a per-thread issue. Guido said, in terms of signals, to "just say no"; "it is impossible to write correct code in the presence of signals". Guido, in a later email, gave this whole idea a vote of -1,000,000; so it ain't ever going to happen. Some discussion on signals ensued, but Guido never budged from his position.

Oren pointed out that if some C code used signals, any Python code that did not check whether an IOError was caused by EINTR would not restart the interrupted call properly, even though there was no reason for it to have stopped. Oren's code for doing that check:

while 1:
    try:
        <code>
    except IOError, exc:
        if exc.errno == errno.EINTR:
            continue
        else:
            raise

Oren said that Python could add the loop in the C code of the core where EINTR might be raised ("Only low-level functions like os.read and os.write that map directly to stdio functions should ever return EINTR"). The proposed idea was to wrap only those functions that might raise EINTR and that can be re-entered safely.
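
For readers on a current interpreter, the retry loop can be sketched as a small helper in Python 3 syntax. The names retry_on_eintr and flaky_read are made up for this illustration, and flaky_read only simulates an interrupted call. Note that since Python 3.5 (PEP 475) the standard library retries most interrupted system calls on its own, so this is a sketch of the technique rather than something new code usually needs.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func repeatedly until it finishes without an EINTR error."""
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:  # IOError is an alias of OSError in Python 3
            if exc.errno != errno.EINTR:
                raise
            # Interrupted by a signal: loop around and try again.


attempts = []


def flaky_read():
    """Simulated I/O call that is 'interrupted' on its first two attempts."""
    attempts.append(1)
    if len(attempts) < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return "data"


result = retry_on_eintr(flaky_read)
print(result, len(attempts))
```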

Should KeyError use repr() on its arguments?

Originally, when an exception was raised with an optional object describing why it was raised (such as KeyError("there is no spoon"), where "there is no spoon" is the optional argument stored in <exception>.args), str(<exception>) simply returned str() of that argument. Now KeyError calls repr() on the argument instead; str(<exception>) == repr(<exception>.args[0]). Much better. =)
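
The difference is easy to see on a current interpreter, where KeyError still uses repr() for a single argument while other exceptions use str(). This is just a small demonstration in Python 3; the variable names are arbitrary.

```python
key_error = KeyError("there is no spoon")
value_error = ValueError("there is no spoon")

key_text = str(key_error)      # uses repr() of the single argument
value_text = str(value_error)  # uses str() of the single argument

print(key_text)
print(value_text)

# With repr(), a missing empty-string key is still visible in the message.
empty_key_text = str(KeyError(""))
print(empty_key_text)
```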

New 'spambayes' project on SourceForge

Thanks to great work done by Tim Peters and several other contributors, Barry Warsaw started an SF project to host the spambayes code. It can be found at http://sf.net/projects/spambayes . There are two mailing lists: http://mail.python.org/mailman-21/listinfo/spambayes and http://mail.python.org/mailman-21/listinfo/spambaye-checkins (yes, that is Mailman 2.1, and yes, you will "help be a guinea pig for Mailman 2.1").

Subsecond time stamps

Martin V. Lowis wanted to introduce subsecond timestamps on platforms that supported it. He suggested adding another field to stat, creating a new type, or making st_mtime a floating point number. The first option is easy, the second has the usual problems of defining a new type, and the third does not guarantee enough accuracy.

Paul Svensson and Guido said that the last option (turning st_mtime into a float) was the most Pythonic. MvL agreed, but worried about breaking code that expected an int. Guido then suggested that maybe the new field is the way to go; define something like st_mtimef that will contain the float if available or contain an int otherwise. Tim Peters also weighed in with his IEEE 754 voodoo about how a float can hold enough info to be accurate up to 100 nanoseconds if you only span 33 years. That causes an issue starting in 2003 since that is 33 years past the epoch (1970).
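
Tim's arithmetic can be checked with a short sketch. It assumes timestamps are stored as IEEE 754 doubles (53 significant bits) counting seconds since the epoch; float_spacing is a helper written for this example, not a library function.

```python
import math


def float_spacing(x):
    """Gap between x and the next larger double (x positive and finite)."""
    # frexp gives x == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(x)
    # A double carries 53 significant bits, so the last bit is worth this much.
    return math.ldexp(1.0, exponent - 53)


SECONDS_PER_YEAR = 365.25 * 24 * 3600

for years in (1, 33, 35):
    timestamp = years * SECONDS_PER_YEAR
    print(years, "years:", float_spacing(timestamp) * 1e9, "ns")
```

Under these assumptions the spacing is 2**-23 seconds (roughly 119 ns) until the timestamp reaches 2**30 seconds, a little over 34 years, and doubles after that, which is in line with the "100 nanoseconds for 33 years" figure.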

But then MvL discovered that st_mtime was already a float on the Mac; had that caused issues? Jack Jansen of course chimed in on this by saying that it caused him a headache about once a year in the form of a failing test (other issues caused by timestamps are the Classic Macs having the epoch at 1904 and not using UTC time). He said he would prefer to see the timestamp as a cookie that was passed into a function that spit out "something guaranteed to be of your liking".

To address the other issues that Jack mentioned, Guido suggested that all timestamps be converted to UTC time with the epoch at 1970.

MvL put up SF patch 606592, which has already been closed, that makes all the relevant changes to have timestamps return floats.

64-bit process optimization 1

Bob Ledwith posted a simple patch for Include/object.h that changed the order of certain parts of the PyObject_HEAD macros, affecting PyObject and PyVarObject. This was for a 64-bit platform performance boost (40% for large data sets according to Bob). The reordering eliminated some padding in the struct and allows more Python objects to fit in the L2 cache, or at least that is what Bob thinks is going on.

Guido pointed out that this would save 8 bytes per object; he thought all of this was "Interesting!". But alas, using this patch would break binary compatibility. Guido was not sure, though, whether it had been broken yet between Python 2.2 and 2.3 and thus he might be "being too conservative here" in terms of saying that it should be held back for now.
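
The padding effect can be approximated with ctypes. The field order below is an assumption for illustration (an int reference count, a type pointer, then an int size, versus the pointer first); the summary does not spell out the exact layout of the patch, and the class names are invented.

```python
import ctypes


class HeaderBefore(ctypes.Structure):
    # Assumed original order: int, pointer, int.
    _fields_ = [("ob_refcnt", ctypes.c_int),
                ("ob_type", ctypes.c_void_p),
                ("ob_size", ctypes.c_int)]


class HeaderAfter(ctypes.Structure):
    # Assumed reordering: pointer first, then the two ints side by side.
    _fields_ = [("ob_type", ctypes.c_void_p),
                ("ob_refcnt", ctypes.c_int),
                ("ob_size", ctypes.c_int)]


before = ctypes.sizeof(HeaderBefore)
after = ctypes.sizeof(HeaderAfter)
print("before:", before, "after:", after, "saved:", before - after)
```

On a platform with 8-byte pointers and 4-byte ints the first layout needs padding after each int, while the second packs the two ints together; on a 32-bit platform the two sizes come out the same.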

A problem Guido pointed out for 64-bit systems is that theoretically the reference count for an object could go negative with enough references as things stand now. Guido then suggested that perhaps refcnt (struct item that holds the reference count) should be a long. And while dealing with that, Guido suggested that anything that stores a length should store that number in a long.

Chime in Tim Peters. He pointed out that it was agreed upon years ago to move refcnt to long but no one had bothered to do it. Heck, even Guido thought for a long time that it was a long when it wasn't; it required Tim to "beat that out of [Guido] <wink>" to stop him from saying that it was a long. He then pointed out that Win64 was still only 4 bytes for a long; what was really desired was for it to be Py_intptr_t which is the Python way for spelling the C99 type that we wanted. Apparently C99 has a way to specify that things be a specific byte length (now if everyone just had a C99 compiler we wouldn't need these macros; oh, to dream...).

Tim also pointed out that the type holding a length should be size_t, since that is what strlen() and malloc() are restricted by. He said that he writes all of his "string-slinging code as using size_t vars now".
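
To see how these C types compare on whatever machine you are using, a quick ctypes sketch like the following prints their sizes. It only reports the local platform; the comparison with Win64 comes from Tim's remark above, not from running this code.

```python
import ctypes

# Sizes in bytes of the C types discussed in the thread, on this platform.
sizes = {
    "long": ctypes.sizeof(ctypes.c_long),
    "void *": ctypes.sizeof(ctypes.c_void_p),
    "size_t": ctypes.sizeof(ctypes.c_size_t),
    "ssize_t": ctypes.sizeof(ctypes.c_ssize_t),  # signed counterpart of size_t
}

for name, size in sizes.items():
    print(name, size)
```

Where long is narrower than a pointer, a long cannot count every object or byte that the address space could hold, which is the motivation for a pointer-sized integer type.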

Tim pointed out that the issue then became "Whether it's worth the pain to change this stuff" which "depends on whether we think 64-bit boxes are just another passing fad like the Internet <wink>". =)

Martin V. Lowis agreed with the changing of refcnt to a long but had reservations about using size_t for the length field (ob_size). He pointed out that some objects put negative values into that field.

Fredrik suggested that the proposed changes be the default on 64-bit systems, since people there are more likely to be willing to recompile than people on 32-bit systems. He also suggested making it a compiler option. Guido thought it was a good idea. But then Mats Wichmann discovered that the switch to long killed the performance boost. So Guido reiterated that he thinks it should be a compiler option only on 64-bit systems; have "compat", "optimal", and "right" compiler options.

As of yet nothing has been done about this.

Weeding out obsolete modules and Demos

Jack Jansen noticed that there are demos for some of the SGI-specific modules that use severely outdated systems and hardware (stuff discontinued 8 to 12 years ago). Guido gave the go-ahead to yank them from CVS.

This has yet to be done.

utf8 issue

(This thread actually started in August) There was a bug in Python 2.2 that raised a UnicodeError when trying to decode a lone surrogate (explanation of surrogates to follow this summary). This caused issues in importing .pyc files that contained a lone surrogate because marshal (which is what is used to create .pyc files) encodes Unicode literals in UTF-8. This has all been fixed in Python 2.3, but Guido was wondering how to backport this for Python 2.2.2.

The option of bumping the magic number for .pyc files was raised and instantly thrown out by Guido; "Bumping MAGIC is a no-no between dot releases". So M.A. Lemburg suggested to either fix the Unicode encoder or change the Unicode decoder to handle the malformed Unicode. MAL wasn't sure, though, if some security issue would be raised by the latter option.

Guido said go for the latter and didn't see any possible security issue since "If someone you don't trust can write your .pyc files, they can cause your interpreter to crash by inserting bogus bytecode".

Explanation of lone surrogates:

In Unicode, a surrogate pair is when you create the representation of a character by using two values. So, for instance, UTF-32 can cover the entire Unicode space (since Unicode is 20 bits, although MvL says it is really more like 21 bits), but UTF-16 can't. To solve the issue for an encoding that cannot cover all possible characters in a single value, a character can be represented as a pair of UTF-16 values. The high surrogate covers the high 10 bits while the low surrogate covers the lower 10 bits. High and low surrogates can never be the same since they are defined by ranges of possible values and those ranges do not overlap. So with the proper high and low surrogate paired together you can make any possible Unicode character.
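
The split into two 10-bit halves can be sketched as follows. The function name to_surrogate_pair is invented for this example; the constants 0x10000, 0xD800 and 0xDC00 are the standard UTF-16 values, and the sketch is in Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```

The result can be cross-checked against the interpreter's own UTF-16 encoder, which should produce the same two 16-bit values for the same character.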

The problem in Python 2.2.1 is that when there is only a lone surrogate (instead of there being a pair of surrogates), the encoder for UTF-8 messes up and leaves off a UTF-8 value. The following line is an example:

>>> u'\ud800'.encode('utf-8')
'\xa0\x80'  #In Python 2.2.1
'\xed\xa0\x80'  #In Python 2.3a0

Notice how in Python 2.3a0 the extra value is inserted so as to make the representation a complete Unicode character instead of only encoding the half of the surrogate pair that the encoder was given.
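
For comparison, current Python 3 interpreters take a stricter line: the UTF-8 codec refuses a lone surrogate by default and only produces the three-byte form when asked to via the "surrogatepass" error handler. The snippet below is a small demonstration of that present-day behaviour, not of the 2.2 or 2.3 code discussed in the thread.

```python
lone = "\ud800"

try:
    lone.encode("utf-8")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "rejected"

# Explicitly allow surrogates to pass through the UTF-8 encoder.
passed_through = lone.encode("utf-8", errors="surrogatepass")

print(strict_result, passed_through)
```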

You can read http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html for more info. Thanks go to Fredrik for the link and Guido for some clarification.

Documentation inconsistency in re

Christopher Craig noticed that the docs for the re module's \b metacharacter were incorrect; they said that "the end of a word is indicated by whitespace or a non-alphanumeric character". That would indicate that an underscore would be the end of a word, which turns out to be false. Fredrik said that "\b is defined in terms of \w and \W" and thus treats underscore as an alphanumeric character. The documentation has been fixed.
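
A quick check of the behaviour in question, written for a current interpreter; the sample strings are arbitrary.

```python
import re

pattern = re.compile(r"\bfoo\b")

# A space or a hyphen ends the word, so the pattern matches.
space_match = bool(pattern.search("foo bar"))
hyphen_match = bool(pattern.search("foo-bar"))

# An underscore counts as a word character, so there is no boundary after "foo".
underscore_match = bool(pattern.search("foo_bar"))

print(space_match, hyphen_match, underscore_match)
```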

Codecs lookup order

Francois Pinard discovered that for the codecs module "one should be careful about not [altered emphasis] naming a module after the encoding name, when closely following the documentation in the Library Reference manual". This is because the codecs module first searches the registry of codecs, then searches for a module with the same name and uses that module. The issue comes up when the module does not contain a function named getregentry(); "`encodings.lookup()` expects a `getregentry` function in that module, does not find it, and raises a CodecRegistryError, not leaving a chance to subsequent codec search functions to be used".

M.A. Lemburg said that this has been fixed in Python 2.3 and will be in 2.2.2 by having encodings.lookup() return None if getregentry() is not found and thus allowing the search to continue.
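
The "return None and let the search continue" contract can be sketched with two search functions registered through codecs.register(). Everything named here (declining_search, accepting_search, and the encoding name "demoalias") is hypothetical, and the example simply reuses the UTF-8 codec under a made-up name in Python 3.

```python
import codecs

utf8 = codecs.lookup("utf-8")
calls = []


def declining_search(name):
    """Search function that never finds anything, so it returns None."""
    calls.append(("declining", name))
    return None


def accepting_search(name):
    """Search function that answers only for the made-up name 'demoalias'."""
    calls.append(("accepting", name))
    if name != "demoalias":
        return None
    return codecs.CodecInfo(utf8.encode, utf8.decode, name="demoalias")


codecs.register(declining_search)
codecs.register(accepting_search)

# The first function declines, so the lookup moves on to the second one.
encoded = "abc".encode("demoalias")
print(encoded, calls[:2])
```

Because a search function signals "not mine" by returning None rather than raising, later search functions still get their chance, which is the behaviour the fix restores.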

raw headers in rfc822.Message

John Spurling provided a two-line hack to keep the raw headers in an rfc822.Message . Barry responded that email.Message.Message keeps the raw headers around.

But the reason I am summarizing this is that the thread quickly turned to how to properly generate a patch. Patches should be generated using UNIX diff with either the -c or -u option, with preference for -c (using cvs diff -c is even better; it puts the version of the file you are diffing against in the output); Mac folk can send MPW diffs, but UNIX diff is the definite preference. Always put the files in the order diff -c OLD_FILE NEW_FILE . And always post the patches to SourceForge! Getting random patches, no matter how small, on the list is annoying (at least to me) because the point of the list is to discuss the design and implementation of Python, not to patch Python. SF is used so that Python-dev does not need to be bothered with mundane problems like applying patches (and to annoy Aahz with SF's UI sucking in Lynx =). So please, for my sake and everyone else on Python-dev, use SF!
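
If you have never looked at a context diff, the standard library's difflib module can produce output in the same style as diff -c. This is a stand-in for illustration only (the thread asks for UNIX diff or cvs diff, not difflib), and the file contents below are made up.

```python
import difflib

# Made-up file contents; in practice these would be read from the two files.
old_lines = ["def greet():\n", "    print 'hello'\n"]
new_lines = ["def greet():\n", "    print('hello')\n"]

# Old file first, new file second, matching diff -c OLD_FILE NEW_FILE.
patch = "".join(difflib.context_diff(old_lines, new_lines,
                                     fromfile="OLD_FILE", tofile="NEW_FILE"))
print(patch)
```

Lines starting with "!" mark the changed text, shown once as it was in the old file and once as it is in the new one.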

For a funny email from Raymond Hettinger about developing for Python read http://mail.python.org/pipermail/python-dev/2002-September/028725.html .

type categories

Yes, the same thread from the last summary is back. This thread has become the bane of my summarizing existence. =)

Aahz asked "why wouldn't we simply use attributes to hold" interfaces that a class implemented (think of __slots__). David Abrahams then brought up the idea of just adding interfaces to the __class__ attribute.

Guido then chimed in on the attributes idea. He pointed out that this is how Zope does it, using the __inherits__ attribute. The limitation is that "it isn't automatically merged properly on multiple inheritance, and adding one new interface to it means you have to copy or reference the base class __inherits__ attribute". And as for David's idea of just adding to __class__, that doesn't work because there is no way to limit the interface; you need "Something like private inheritance" for when an interface is broken by some inherited class. David subsequently added the issue of being able to disinherit when an interface is not valid but is inherited by default as another problem for using inheritance for interfaces.
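
The merging limitation is easy to reproduce with any plain class attribute. In the sketch below the attribute name interfaces, the classes, and the helper merged_interfaces are all invented for illustration; this is not how Zope or any proposal in the thread actually spells it.

```python
class Reader:
    interfaces = ("Readable",)


class Writer:
    interfaces = ("Writable",)


class ReadWriter(Reader, Writer):
    pass  # inherits only Reader.interfaces through normal attribute lookup


def merged_interfaces(cls):
    """Collect the hypothetical `interfaces` attribute across the whole MRO."""
    seen = []
    for klass in cls.__mro__:
        for name in vars(klass).get("interfaces", ()):
            if name not in seen:
                seen.append(name)
    return tuple(seen)


print(ReadWriter.interfaces)
print(merged_interfaces(ReadWriter))
```

Ordinary attribute lookup stops at the first base class that defines the attribute, so the second base's entry is lost unless something walks the inheritance chain and merges the values by hand.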

David then brought up the issue that Python is so dynamic that you could inject an interface through black magic code if you used __class__ like he suggested. If the injected interface didn't work because of the inheritance chain, then you have a problem.

Barry Warsaw brought in his objections. He tried playing Devil's Advocate by saying that Guido had said that inheritance would not be the only way to handle interfaces, but that it would be the predominant way. But this duality would complicate any conformsto()-like function since it would have to handle two different ways for a class to get an interface. Barry then brought up the objection that he didn't like the idea of using straight inheritance because he wanted a syntactic way to separate out interfaces.

As a side note, Guido pointed out that __slots__ is provisional; nicer syntax will eventually surface when Guido gets over his "fear of adding new keywords".

flextype.c -- extended type system

Christian Tismer has come up with a replacement for the etype which is "a hidden structure that extends types when they are allocated on the heap" (you can find it in Objects/typeobject.c in the CVS). There is a limitation with the etype where it could not be extended by metatypes. Well, Chris worked his magic and came up with a new flextype that allows overriding of methods. So with Christian's code you would be able to override methods in a type without having to hack something together to handle the overriding correctly; it would be handled automatically.

Through some clarification from Christian and Guido, it was pointed out to me (as of this moment I am the only one to make any noise on this thread, and it was for this summary) that this simplifies an esoteric issue; note the use of the words "metatype" above. This is type/metatype black magic hacking. Spiffy, but something most of us "normal" folk will not have to worry about.