The first version I came up with was totally straightforward:
def f1(list): string = "" for item in list: string = string + chr(item) return stringThat can't be the fastest way to do it, said my friend. How about this one:
def f2(list): return reduce(lambda string, item: string + chr(item), list, "")This version performs exactly the same set of string operations as the first one, but gets rid of the for loop overhead in favor of the faster, implied loop of the reduce() function.
Sure, I replied, but it does so at the cost of a function call (the lambda function) per list item. I betcha it's slower, since function call overhead in Python is bigger than for loop overhead.
(OK, so I had already done the comparisons. f2() took 60% more time than f1(). So there :-)
Hmm, said my friend. I need this to be faster. OK, I said, how about this version:
def f3(list): string = "" for character in map(chr, list): string = string + character return stringTo both our surprise, f3() clocked twice as fast as f1()! The reason that this surprised us was twofold: first, it uses more storage (the result of map(chr, list) is another list of the same length); second, it contains two loops instead of one (the one implied by the map() function, and the for loop).
Of course, space versus time is a well-known trade-off, so the first one shouldn't have surprised us. However, how come two loops are faster than one? Two reasons.
First, in f1(), the built-in function chr() is looked up on every iteration, while in f3() it is only looked up once (as the argument to map()). This look-up is relatively expensive, I told my friend, since Python's dynamic scope rules mean that it is first looked up (unsuccessfully) in the current module's global dictionary, and then in the dictionary of built-in function (where it is found). Worse, unsuccessful dictionary lookups are (on average) a bit slower than successful ones, because of the way the hash chaining works.
The second reason why f3() is faster than f1() is that the call to chr(item), as executed by the bytecode interpreter, is probably a bit slower than when executed by the map() function - the bytecode interpreter must execute three bytecode instructions for each call (load 'chr', load 'item', call), while the map() function does it all in C.
This led us to consider a compromise, which wouldn't waste extra space, but which would speed up the lookup for the chr() function:
def f4(list): string = "" lchr = chr for item in list: string = string + lchr(item) return stringAs expected, f4() was slower than f3(), but only by 25%; it was about 40% faster than f1() still. This is because local variable lookups are much faster than global or built-in variable lookups: the Python "compiler" optimizes most function bodies so that for local variables, no dictionary lookup is necessary, but a simple array indexing operation is sufficient. The relative speed of f4() compared to f1() and f3() suggests that both reasons why f3() is faster contribute, but that the first reason (fewer lookups) is a bit more important. (To get more precise data on this, we would have to instrument the interpreter.)
Still, our best version, f3(), was only twice as fast as the most straightforward version, f1(). Could we do better?
I was worried that the quadratic behavior of the algorithm was killing us. So far, we had been using a list of 256 integers as test data, since that was what my friend needed the function for. But what if it were applied to a list of two thousand characters? We'd be concatenating longer and longer strings, one character at a time. It is easy to see that, apart from overhead, to create a list of length N in this way, there are 1 + 2 + 3 + ... + (N-1) characters to be copied in total, or N*(N-1)/2, or 0.5*N**2 - 0.5*N. In addition to this, there are N string allocation operations, but for sufficiently large N, the term containing N**2 will take over. Indeed, for a list that's 8 times as long (2048 items), these functions all take much more than 8 times as long; close to 16 times as long, in fact. I didn't dare try a list of 64 times as long.
There's a general technique to avoid quadratic behavior in algorithms like this. I coded it as follows for strings of exactly 256 items:
def f5(list): string = "" for i in range(0, 256, 16): # 0, 16, 32, 48, 64, ... s = "" for character in map(chr, list[i:i+16]): s = s + character string = string + s return stringUnfortunately, for a list of 256 items, this version ran a bit slower (though within 20%) of f3(). Since writing a general version would only slow it down more, we didn't bother to pursue this path any further (except that we also compared it with a variant that didn't use map(), which of course was slower again).
Finally, I tried a radically different approach: use only implied loops. Notice that the whole operation can be described as follows: apply chr() to each list item; then concatenate the resulting characters. We were already using an implied loop for the first part: map(). Fortunately, there are some string concatenation functions in the string module that are implemented in C. In particular, string.joinfields(list_of_strings, delimiter) concatenates a list of strings, placing a delimiter of choice between each two strings. Nothing stops us from concatenating a list of characters (which are just strings of length one in Python), using the empty string as delimiter. Lo and behold:
import string def f6(list): return string.joinfields(map(chr, list), "")This function ran four to five times as fast as our fastest contender, f3(). Moreover, it doesn't have the quadratic behavior of the other versions.
import array def f7(list): return array.array('B', list).tostring()This is about three times as fast as f6(), or 12 to 15 times as fast as f3()! it also uses less intermediate storage - it only allocates 2 objects of N bytes (plus fixed overhead), while f6() begins by allocating a list of N items, which usually costs 4N bytes (8N bytes on a 64-bit machine) - assuming the character objects are shared with similar objects elsewhere in the program (like small integers, Python caches strings of length one in most cases).
Stop, said my friend, before you get into negative times - this is fast enough for my program. I agreed, though I had wanted to try one more approach: write the whole function in C. This could have minimal storage requirements (it would allocate a string of length N right away) and save a few instructions in the C code that I knew were there in the array module, because of its genericity (it supports integer widths of 1, 2, and 4 bytes). However, it wouldn't be able to avoid having to extract the items from the list one at a time, and to extract the C integer from them, both of which are fairly costly operations in the Python-C API, so I expected at most modest speed up compared to f7(). Given the effort of writing and testing an extension (compared to whipping up those Python one-liners), as well as the dependency on a non-standard Python extension, I decided not to pursue this option...
import time def timing(f, n, a): print f.__name__, r = range(n) t1 = time.clock() for i in r: f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a) t2 = time.clock() print round(t2-t1, 3)
But this time, it was relatively painless. There are two candidates, the obvious:
def g1(string): return map(ord, string)and the somewhat less obvious:
import array def g2(string): return array.array('b', string).tolist()Timing these reveals that g2() is about five times as fast as g1(). There's a catch though: g2() returns integers in the range -128..127, while g1() returns integers in the range 0..255. If you need the positive integers, g1() is going to be faster than anything postprocessing you could do on the result from g2(). (Note: since this essay was written, the 'B' typecode was added to the array module, which stores unsigned bytes, so there's no reason to prefer g1() any more.)