"Scripting Language" My Arse: Using Python for Voice over IP

[This is the paper, not the talk. The talk is yet to come]

[third draft]


A common complaint made of Python is that it is not suitable for serious
application development, and is only suitable for "scripting" or
"prototyping" tasks. The Shtoom toolkit 
(http://divmod.org/Home/Projects/Shtoom) is a Voice over IP (VoIP) 
framework implemented in Python using the Twisted framework. It 
includes 'shtoom' itself, a software phone using the toolkit.

This paper covers the basics of SIP and RTP (the protocols underlying
Voice over IP), examines some of the issues relating to the
implementation of Shtoom (with a digression on issues relating
to timing), and will hopefully help demonstrate why implementing
applications in Python is perfectly feasible.


Why would I choose Python for VoIP? Python's a high-level language, with
many constructs that make it extremely pleasant to work with. In 
addition, the Twisted framework provides an efficient and elegant 
model for implementing network protocols. 

In implementing the software phone, a nice-to-have was that the phone
would work in a cross-platform way - I am not aware of any existing
cross-platform software phones.

Why would I not choose Python for VoIP? The primary reason would seem 
to be performance - VoIP is a complex beast, with requirements for 
throwing around packets of audio at some speed. It would seem from 
a first look that an interpreted language like Python would not be 
suitable for this task. 

Why Shtoom?

I had a need for a VoIP client that could be scripted for automated
testing of our Cisco gateways running a number of large, complex IVR
scripts (an IVR is one of those automated phone systems you interact
with via the phone, pushing phone buttons to respond to menus). In
addition, I was looking for a replacement for the current conference
calling application we use (stupidmcu, derived from OpenH323's openmcu).

We had been using OpenH323[openh323] internally for a number of
applications, so my first approach was to examine its suitability for
wrapping in Python. It's a large complex C++ library and, as usual for
a large C++ library, it implements its own basic types. I started down
the path of using Boost.Python to wrap this library, but abandoned
this after a few days' work and pain. Just wrapping the basic types it
needed would have been a couple of weeks' tedious work. As a programmer
who prefers to code in Python, this struck me as a very very boring
approach. In addition, the OpenH323 libraries were (in my experience)
extremely awkward to debug -- this is largely because the underlying 
H.323 protocol is itself a nightmare. I'll come back to H.323 in a bit.

I then investigated using SIP (the Session Initiation Protocol) instead 
of H.323. SIP is the Internet's answer to H.323 (much more on SIP in 
later sections). There was a partial implementation of the SIP protocol 
as part of Twisted (enough to implement a SIP Registration server), so 
this was a good base to begin with. I'd already implemented RTP (the
Real-time Transport Protocol), the UDP-based protocol that carries the
audio over the Internet, in C code, so I felt I was up to the task.

Why Python?

There's a few obvious reasons for choosing Python for Shtoom:

It's easy to work with, and to debug. For implementing a network
protocol from scratch, Python is hard to beat. 

It's cross platform - while my initial requirements were for something
that would work on Linux and Solaris, having it work on other platforms
would be a nice-to-have. There's a variety of UI toolkits available from
Python, including one (Tkinter) that's cross platform.

And finally, of course, Python is fun to write. 

Why not Python?

The first concern I had was whether Python would be fast enough to
handle VoIP. VoIP is a lot of little packets flying back and forth,
and with certain applications (such as conferencing) you need to do
software mixing of multiple audio samples down to a single sample.

The next concern about Python was the interfaces to the audio hardware,
in particular, capturing audio. We'll cover this more, later.

There's no single user interface for Python. I regard this as something
of a positive - Shtoom has a pluggable user interface layer. Currently
the code has Qt, GNOME, Tk and command line user interfaces. An MFC 
(Windows) and Cocoa (Mac OS X) UI are planned.

The underlying RTP protocol has fairly harsh timing requirements - you
need to send a packet of audio every 20ms. This requirement was my
major concern about Python's suitability for this task.

Why Twisted?

Twisted is an open-source Python framework for writing network
applications, using an asynchronous event model. I'd previously used
Twisted in another project [pydirector] and was impressed with the
stability and flexibility of the core library. 

Twisted also features a whole pile of useful code that was already 
available - this meant I could concentrate on the interesting bits 
of the problem.

Voice over IP - A Short and Biased Summary

Voice over IP (VoIP) refers to the carriage of telephone calls over the
Internet, rather than the traditional public switched telephone network
(PSTN) -- the copper wires and fibres that connect every house together.
VoIP is used heavily by carriers (telephone companies) for their
internal networks, and is gaining increasing popularity as high-speed
Internet links to the home become more common.

As well as being considerably cheaper than traditional phone calls
(effectively free, assuming your Internet link is already paid for),
VoIP allows for a variety of more sophisticated telephone services, such
as video, multi-party communications (conferencing), and, well, pretty 
much anything you can think of. This is one of the most exciting aspects
of the Internet taking over the telephone world - it takes control of
the network off the existing carriers, and allows for a wide variety
of people to come up with new and interesting services.

Once your phone call is being routed over the Internet, it can, in
theory, go anywhere. Well, anywhere that's on the Internet. This of
course probably doesn't include your mother, or the friend who's walking
down the street with a mobile phone. To get around this problem, many
people provide gateways to the PSTN from VoIP. Most of these gateways
are commercial, but they are usually much cheaper than the phone call
over a landline would be. The standardisation of the VoIP protocols also
means you have a large variety of companies who can accept your business.

So how do you use this wonderful VoIP thingy? Well, obviously you're
going to need an Internet link. And then you need a device that allows
you to enter a phone number or net address, connects you to the other
end, and then transmits the audio over the Internet. There are two sorts
of devices that can do this.

The first is a hardware phone. These have gone from being an expensive
toy requiring extensive infrastructure and used only by large corporates
only a couple of years ago, to a much more affordable consumer item that
you can pick up for around US$100-US$200 today. SIPphone[sipphone],
started up by MP3.com's Michael Robertson, sells an adapter that has a
phone jack one side and an Ethernet port on the other side. You simply
plug an existing handset into one side of the adapter, an Ethernet cable
into the other side, and you're ready to go. Other carriers, such as
Vonage, also provide these interfaces. There are also dedicated SIP
phones - these look something like a regular phone, but with an Ethernet
port on the back.

The second is known as a soft-phone, or, to use a term most people
should be familiar with, a computer program. (Telephony types _love_
their terminology). It uses the existing PC sound hardware (speakers
and microphone) and communicates via an existing Internet connection.
There are a few free softphones out there, as well as many commercial
phones. I had a look through the existing phones before I started on
the implementation of Shtoom, to figure out what I liked and disliked.
At the moment, the most polished of the free phones that I looked at 
is XTen's X-Lite. This is a closed-source Windows phone, so was only 
useful to me for interoperability testing.

In addition many chat programs, including Microsoft's Messenger and
Apple's iChat, are in fact SIP clients - they use SIP under the hood
for voice chats.

VoIP: The Protocols


Once upon a time, the only VoIP protocol was H.323. This was a standard
created by the ITU-T, the same organisation that gave us the runaway
success of the X.500 directory service and X.400 email. H.323 has much
in common with other ITU-T standards - it features a complex binary wire
protocol, a nightmarish implementation, and a bulk that can be used
to fell medium-to-large predatory animals. OpenH323, an open-source
implementation of this protocol, consists of over 7 MB of C++ code (the
UNIX utility 'wc' reports that it's over 2.4 million lines of code).
This doesn't include the code to actually encode and decode the audio.
I don't intend to cover H.323 in detail in this paper - there are many
fine resources on the net for you to peruse if you wish to inflict this
on yourself. It should be noted, though, that H.323 is only one of a
suite of protocols - it depends on H.225, H.245 and a swarm of other
protocols.

I'm unaware of anyone having implemented even a fraction of H.323 in
Python. Doing so would require a special kind of dedication, and quite
possibly a large amount of whiskey and prescription medication.


SIP (the Session Initiation Protocol) is a creation of the IETF, the 
organisation that produces Internet standards. While it is a complex
protocol, it features many advantages over H.323:

  - It uses text message bodies, in a format that should be familiar
    to anyone who's looked at the headers of an email message or a
    web request.

  - It is based on a variety of existing IETF protocols, including SDP
    and HTTP. 

  - It wasn't designed by an organisation of telcos based in Switzerland.

One common complaint about SIP regards its complexity. While it is on the
large end of a typical Internet protocol (the base RFC, 3261, comes
in at 269 pages), the problem it's solving is a complex one and
to supplant H.323 it needs to support a ridiculous number of options.
But again, it's only complex compared to other Internet protocols.
Compared to ITU protocols, it's a work of austere elegance.

An aside on Standards

Standards are good. Standards make a lot of pain go away, and make
everything easier. This is particularly true for VoIP -- the 
whole point of VoIP is being able to talk to other people. This 
obviously gets somewhat tricky if the phones don't talk the 
same protocol. There's a number of non-standard approaches out there. 

The most visible is Skype [skype], a Windows softphone that uses a
proprietary protocol. Skype claim a whole pile of benefits to having
their own protocol - it works better with a variety of firewalls, it's
more efficient, blah blah blah. Unfortunately the trade off for this is
that you can only talk to other Skype users - anyone on a non-Windows
platform need not apply. There's also unlikely to be the variety of 
hardware SIP phones and phone adapters that you can get for SIP.
In addition, when Skype eventually gets gatewayed to the existing PSTN 
network, it's very likely that your only choice for a gateway will be 
Skype itself.

Another non-standard is Asterisk's IAX. While this is an open protocol,
the only documentation of it is in the C code of Asterisk. This would
be an amount of not-fun to reverse-engineer. Worse yet, the fact that
it's not documented anywhere means it could change as the Asterisk code 
changes. Once the Asterisk project takes the time to write down their
protocol, I'll consider implementing it. 

Implementing VoIP

The two main divisions of work in implementing a VoIP application are
the implementation of SIP, which controls the call negotiation and
setup, and the implementation of the underlying protocol that passes
the audio back and forth. The latter uses a protocol known as RTP, the
Real-time Transport Protocol[rtp]. This is a quite venerable Internet
protocol, initially developed for use in multicast applications.

RTP consists of small packets of audio, transmitted over UDP. A typical
packet carries just 20ms of audio. There is a companion protocol, RTCP
(the RTP Control Protocol), that is used to communicate information
such as delivery reports. The audio can be in a number of different
formats - the format negotiation is explicitly _not_ part of RTP, but is
left to a higher level protocol, such as SIP.
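
To make the packet layout concrete, here is a minimal sketch of the 12-byte
fixed RTP header as defined in RFC 3550, with no CSRC list and no header
extension. The function names are illustrative and are not Shtoom's; payload
type 0 is the static assignment for G.711 ULAW.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static payload type for G.711 ULAW


def build_rtp_packet(seq, timestamp, ssrc, payload,
                     payload_type=PT_PCMU, marker=False):
    """Prepend the 12-byte fixed RTP header to a payload (hypothetical helper)."""
    byte0 = RTP_VERSION << 6  # version=2, padding=0, extension=0, CSRC count=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet):
    """Split a packet back into its header fields and payload."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }
```

For 20ms of ULAW audio at 8kHz the payload is 160 bytes, so each packet is
172 bytes before the UDP and IP headers are added. The sequence number goes
up by one per packet and the timestamp by 160.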

One interesting aspect of implementing SIP is that every SIP
implementation is both a client and a server. Either end of a SIP
conversation can initiate a request or reply to a request. This is quite
different to HTTP, which SIP superficially resembles. The protocol 
itself is also quite stateful - in the implementation there's a number
of state machines for handling the various states of a call.

Shtoom Details

Shtoom Architecture

Ooo. ASCII art::

    +-----------+
    |    UI     |
    +-----------+
          |         +-------+
   +-------------+ /|  SIP  |
   |             |/ +-------+
   | application |
   |             |\ +-------+
   +-------------+ \|  RTP  |
          |         +-------+
    +-----------+
    |   audio   |
    +-----------+

The application is the core element of a Shtoom application. It controls
the flow of calls, handles the (high level) incoming events, and deals 
with the flow of data between the other components (for instance, between
the audio layer and the RTP layer). 

The audio layer is an abstraction on top of the audio hardware and any 
audio codecs that might be present. The application calls into the audio
layer to query and select audio formats, and to deliver and retrieve 
audio.

The UI layer is only present on those applications that require a user
interface (currently only the phone). The application passes requests
to the UI (for instance, when an incoming call arrives) and the UI calls
into the application when the user requests something (for instance, when
the user enters an address and hits 'call').

The SIP layer is an implementation of SIP. It listens for requests and
responses and passes higher level requests to the application. At the 
moment Shtoom's SIP implementation is not complete - I'm adding to it 
as I hit a requirement for a new feature.

An RTP layer is created for each incoming or outgoing call. It merely 
passes the audio to and from the network. Each RTP layer is responsible
for its own timer loop. In the future, it would be possible for an RTP
layer to be instantiated on a different machine, to allow load spreading.

Multiple User Interfaces

One nice thing about Python is the wide variety of user interfaces
available, and the ease of working with them. I don't think any 
application implemented in a lower-level language would attempt to
ship with 4, 5 or 6 user interfaces. In Python, though, this is really
quite easy. In addition, I've made efforts in Shtoom to produce a 
higher-level API to reduce my workload. 

One thing that's reduced my workload significantly is the Preferences
interface. Trying to maintain various preferences dialogs and keep them
in sync for the different platforms struck me as a very boring task,
so instead I developed code that described the preferences available
in an application, and then the user interface layer inspects the
options object to build the preferences UI. This allows me to tweak the
preferences without having to rebuild the dialogs in each UI. There's
additional code that works from the same options object to build a
command-line parser (using optparse) and to load and save from 
Config.ini-style settings files. This is probably useful enough 
that I'll look at releasing this independently of Shtoom.
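
The idea of describing the options once and deriving everything else from
that description can be sketched as follows. This is not Shtoom's options
code: the option names are made up, and it uses the standard library's
argparse and configparser modules rather than optparse.

```python
import argparse
import configparser

# Hypothetical option descriptions: (name, type, default, help text).
OPTIONS = [
    ("username", str, "", "SIP username to register as"),
    ("listenport", int, 5060, "UDP port to listen on for SIP"),
    ("audio", str, "oss", "audio backend to use"),
]


def build_parser(options):
    """Build a command-line parser from the option descriptions."""
    parser = argparse.ArgumentParser(prog="phone")
    for name, typ, default, helptext in options:
        parser.add_argument("--" + name, type=typ, default=default,
                            help=helptext)
    return parser


def load_ini(options, text, section="settings"):
    """Read the same options from ini-style text, falling back to defaults."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    values = {}
    for name, typ, default, _helptext in options:
        raw = cfg.get(section, name, fallback=None)
        values[name] = default if raw is None else typ(raw)
    return values
```

A GUI preferences dialog would walk the same OPTIONS list, creating one
widget per entry, so adding an option means touching only the list.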

Another reason for Shtoom's multiple user interfaces (aside from
indecision on my part) was a desire to have a nice example of the
different user interface toolkits and how they interface with Python.
Hopefully this will be useful in the future - both for people trying
to choose between UI toolkits and for people wondering about converting
from one toolkit to another. I'm not aware of any projects that provide
the same UI using a number of different toolkits. 

However, I'm not silly enough to offer an opinion as to which one I
consider "the best" - no matter which I choose, someone will disagree
violently, and attempt to engage me in a long and tedious discussion
about the merits of their toolkit of choice. I really don't care enough
to put myself through this. My only comment would be that while Tk is
very simple and easy to code, it's... very simple. A lot of things you
take for granted in a more modern toolkit require additional packages on
top of Tk.

Other Shtoom applications

While the most visible part of shtoom is the phone application, there
are a number of other applications in the package. This will grow as 
I have time to write them, and as I develop the Doug application server
further (more on Doug, soon). The first two are a simple announcements 
server (available by placing a call to 'sip:testcall@divmod.com')
and a basic voicemail server. The latter plays a per-user announcement,
then records the audio from the person calling. When the person hangs
up the call it saves the audio off - this can then be sent to the user
as an email, or whatever. There's also a simple echo server - it simply
replays the audio sent to it back to the caller. This is extremely useful
for debugging.

The next application that I intend to ship is shtoomcu - a conferencing
server. Multiple people call into the conferencing server and can talk
to each other. This is less complex than it sounds - you simply keep 
track of all participants in a conference, and when a bit of audio 
comes in, you pass it to the other users. The tricky bit is mixing 
audio - when multiple people are talking, you need to make sure that
you do the right thing and mix the audio samples together. I'll come
back to this a bit later in the paper in a discussion on performance.

A bit further down the track, the conferencing will also be folded into 
the phone program - this will allow users to connect together multiple 
calls into a single multi party call. 

Doug: The Shtoom Application Server

The next big step is implementing a full voice application server, known
as Doug. This will be an event driven application server for writing
voice applications. If you've used the Tcl engine embedded in Cisco
voice gateways, you'll have an idea of the sorts of things possible. I
don't pretend to know what the next great voice applications will be -
but I'd like to make it easy for people to write these applications.

Doug is in a very early stage as yet. Once it's done, the previously
mentioned applications (message, voicemail, conferencing) will be 
rewritten as Doug applications.

Timing and Buffering

There are a few issues you encounter as part of implementing RTP
(the low-level protocol used to transmit the audio data).

The first is the simple trade-off of buffering vs latency. Simply put,
the more you buffer before playing, the more robust you are in the
face of a glitchy network that delays individual packets of audio, but 
the more delay there is in playing the audio. For now, shtoom takes a 
simple approach of not buffering at all - as soon as a packet arrives, 
it is sent to the audio device. And every 20ms, a lump of audio is read
from the audio device and sent to the network. While this doesn't give
the best performance in the world, it's the easiest to implement. In
practice, this is usually "good enough", but I do intend to revisit this 
issue in the future.

The next issue is that RTP requires a reliable source of audio - you 
need to send the audio every 20ms. The real problem with this is that 
most modern computers have a timer clock that runs at only 100Hz. This 
means that the resolution of the timer is just 10ms. This has the 
unfortunate implication that if you miss the 20ms clock tick (even by 
a single millisecond) you get a 10ms delay. This delay is quite obvious 
to the listener and can render an audio stream unusable, even if only 
one in 10 samples is delayed in this way.
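
The arithmetic behind this can be shown with a deliberately simplified
model, which assumes a timer can only fire on tick boundaries and that the
ticks are aligned with the start of the wait:

```python
import math


def actual_wakeup_ms(requested_ms, tick_ms=10):
    """Round a requested delay up to the next timer tick (simplified model)."""
    return math.ceil(requested_ms / tick_ms) * tick_ms


# A wait of exactly 20ms lands on a tick, but one that slips to 21ms
# is not serviced until the 30ms tick.
on_time = actual_wakeup_ms(20)
slipped = actual_wakeup_ms(21)
```

Under this model a one millisecond slip costs a full extra tick, which is
half the duration of the audio in the packet.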

I initially assumed that this real-time requirement would make Python an
unsuitable language for implementing RTP - indeed, in a previous (non
open-source) project, I just assumed that this would be the case, and
implemented the RTP component of the application in C. This time around,
though, I tested my assumption first. I was pleasantly surprised.

Timing Strategies

The first and most obvious approach to this sort of timing is to use a
timer signal - on UNIX, for instance, there is a setitimer() call that
allows you to specify a repeating timer loop, implemented via signals.
This has a few problems - it's non-portable, it relies on signals, and
doesn't work if you have multiple timer loops in a single application. 
(Did I mention that it relies on signals?)

Nonetheless, this was the first approach I took, to determine whether
Python was able to package up a bundle of audio and send it out within
the 20ms available. I was quite happy to find that this was in fact 
extremely easy. On my (admittedly overpowered) laptop this takes 
less than a third of a millisecond. So, having determined that this wasn't
going to be a problem, I went back to the original problem of getting
the timing right.

The second approach is to schedule a call, and have the call reschedule
itself immediately. Something like::

    def nextpacket(self):
        # Reschedule first, so the work below doesn't push back the next call
        reactor.callLater(0.020, self.nextpacket)
        # Send the current packet
        # Read the audio for the next packet

The problem here is that if there is a delay in calling the nextpacket()
routine for some reason, the next packet might miss the 20ms timer and
instead hit a 30ms timer. You can do a hacky workaround for this by setting
the timer to, say, 18ms, and hoping that any delay will fit inside this
2ms window of error. This is extremely ugly and rather brittle. 

The approach Shtoom now uses is to use a construct called LoopingCall,
developed by JP Calderone. The guts of the LoopingCall are as follows::

    def _loop(self):
        # Call the function, with the stored args and kwargs
        self.f(*self.a, **self.kw)
        # Count this call; the next is due at starttime + count * interval
        self.count += 1
        # Offset of the start time from now (negative once running)
        fromNow = self.starttime - time.time()
        # Offset of the next due time from the start time
        fromStart = self.count * self.interval
        # Seconds from now until the next call is due
        delay = fromNow + fromStart
        if delay > 0:
            self.call = reactor.callLater(delay, self._loop)

The approach is that the LoopingCall calls the function, then determines 
when the next timer call is due. It then schedules a timer call for the
delay needed. 
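
The difference between the two scheduling strategies can be seen without
Twisted at all. The following simulation is a sketch under simple
assumptions: times are in milliseconds, each timer fires a fixed amount
late, and the function names are made up for illustration.

```python
def naive_fire_times(interval, latencies):
    """Each call is scheduled `interval` after the previous one actually ran."""
    now = 0
    times = []
    for late in latencies:
        now += interval + late  # lateness is never recovered
        times.append(now)
    return times


def anchored_fire_times(interval, latencies, start=0):
    """Each call is scheduled for start + count * interval, as in _loop."""
    now = start
    times = []
    for count, late in enumerate(latencies, start=1):
        due = start + count * interval
        delay = max(due - now, 0)  # shrinks to absorb earlier lateness
        now = now + delay + late
        times.append(now)
    return times


latencies = [2] * 50  # every timer fires 2ms late
naive_end = naive_fire_times(20, latencies)[-1]
anchored_end = anchored_fire_times(20, latencies)[-1]
```

After 50 packets the naive loop has drifted 100ms behind schedule, while
the anchored loop is never more than the 2ms of a single late timer behind.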

This approach has proven rock-solid in use, and remnants of the previous
code that used setitimer() have been removed from the codebase. This 
removed the first major concern I had about implementing SIP in Python.
The next, mixing together audio, seemed like a harder problem.


Many people in the computer industry are obsessed with performance
over all else. This often misses the point - the question to be asked
for most applications is not "how fast is it?" but instead "is it 
fast enough?" 

For most of shtoom, Python is easily fast enough. To really put it
to the test, though, I concentrated on one of the most CPU intensive 
components - the mixing of audio for conference calls.

In a conference call, we have many audio sources contributing to 
the audio transmitted out. For each user, we want to find the "loudest"
N audio sources (not including the user) and mix the audio samples
together.

A simple approach to take is to take each of the contributing audio
sources, estimate their volume (using a simple root-mean-squared
function), sort by power, and then take the top N (for my example,
N is 4). We then scale each audio signal by 1/4 and add them together.

I first tried an implementation in straight Python. [listing mixPython]
In this (and all following examples) the input is a list of 320 byte
strings - these are each 160 16 bit signed sample values, representing
20ms of audio. The output should also be a 320 byte string, in the 
same format.

We first take the RMS of each audio chunk and sort them by this 
value. We then take the top 4 samples, scale them down, then
add them. On my test machine, feeding in 18 audio samples, selecting
the top 4, and then mixing them together took around 2.2ms. This is
purely in Python, and only minimal efforts to optimise this were 
taken (I'm sure people can point to obvious speedups).
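
A minimal sketch of the approach just described follows. It is not the
mixPython listing, and it is written for current Python 3 using only the
array and math modules. It assumes the samples are little-endian, which the
text does not state, and it makes no claim to reproduce the timings quoted
here.

```python
import array
import math
import sys


def decode(chunk):
    """Turn a byte string into an array of signed 16 bit samples."""
    samples = array.array("h")
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian
    return samples


def encode(samples):
    """Turn a sequence of sample values back into a byte string."""
    out = array.array("h", samples)
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()


def rms(samples):
    """Root-mean-square power estimate for one chunk."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix(chunks, top_n=4):
    """Mix the top_n loudest chunks, scaling each by 1/top_n first."""
    decoded = [decode(chunk) for chunk in chunks]
    loudest = sorted(decoded, key=rms, reverse=True)[:top_n]
    mixed = [0] * len(decoded[0])
    for samples in loudest:
        for i, s in enumerate(samples):
            mixed[i] += s // top_n
    return encode(mixed)
```

Scaling each source by 1/top_n before adding means the sum of top_n sources
cannot overflow a signed 16 bit sample, at the cost of a quieter mix when
fewer people are talking.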

The second approach I tried was to use the Numarray [numarray]
(formerly Numeric) Python extension. This is shown in 
[listing mixNumeric]. This turned out to be slightly slower
than the pure-Python implementation (around 2.4ms). Examining
this closer, it showed that while the scaling and adding were
about 3 times faster, this was outweighed by the increased time
taken in constructing the arrays. It's possible that there is 
a way to improve this by re-using existing array objects - I've
not looked too closely at this.

Psyco was also tried - this produced only minimal speedups. I'm
open to ideas as to why. 

Next was to start implementing sections of the code in Pyrex [pyrex].
Pyrex is a dialect of Python that is translated directly into C code.
It's by far the most pleasant way to write C extensions for Python.
Examining the Python code in detail revealed that the most expensive
part of the calculation was the RMS - it was responsible for around
65% of the time taken.  Moving just the RMS operation to Pyrex 
reduced that component from around 1.4ms to just 0.35ms - taking 
the overall time to just over 1ms.

Finally, I noticed that the standard Python module 'audioop' 
had most of the functions I needed, implemented in C code. Using
these reduced the time taken to around 120 microseconds (0.12ms).
This is an impressive speedup of roughly 18 times, and as an added
bonus, the code is considerably smaller and easier to work out.

This isn't _quite_ the end of the calculations, though - this 
only does mixing for a single output. We need to do this for 
each participant in the conference. We can re-use a lot of 
the calculations at each stage - we only need to calculate the
power once for each sample, and all users that are not one of the N+1 
loudest samples can re-use the same output sample. For 
comparison's sake, times were taken for the stupid approach
(recalculating the scaling and adding for each user) and
for the smart method which does the minimum work necessary.
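
One way to get this reuse is to work out which sources each participant
should hear, then cache the mixed output by that set of sources. The sketch
below assumes the power of each participant's chunk has already been
calculated once; the function names and the `mix_sources` callback are
hypothetical.

```python
def plan_mixes(power, top_n=4):
    """Map each participant to the tuple of sources they should hear.

    `power` maps participant name -> power of their current chunk.
    """
    ranked = sorted(power, key=power.get, reverse=True)
    plan = {}
    for user in power:
        # The loudest top_n sources, never including the listener
        heard = [src for src in ranked if src != user][:top_n]
        plan[user] = tuple(heard)
    return plan


def mix_all(power, mix_sources, top_n=4):
    """Call mix_sources once per distinct source set and share the result."""
    cache = {}
    output = {}
    for user, sources in plan_mixes(power, top_n).items():
        if sources not in cache:
            cache[sources] = mix_sources(sources)
        output[user] = cache[sources]
    return output
```

With distinct power values, each of the N loudest participants gets their
own mix and everyone else shares one, so there are at most N+1 mixes to
compute however many people are in the conference.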

              1 channel   18 channels   18 channels
                            (dumb)        (smart)
Python          2.2ms        8.7ms         2.7ms     
Numeric         2.4ms        5.0ms         2.7ms
Pyrex           1.1ms        7.7ms         1.6ms
audioop         0.12ms       0.80ms        0.18ms

The "smart" approach also has the benefit that it scales up to
a large number of participants very well. 

So, back to the original point. Is this "fast enough"? Well, this
is for an audio sample of 20ms duration. The above code shows that
we can produce audio output in around 1% of the time limit we have
to meet. In the real world this would be even better performance - 
particularly with VoIP clients that have silence suppression, where
they don't send audio if the user isn't talking. Even taking the
pure-Python implementation, we're "fast enough", but only barely. 
But with a small amount of optimisation we can produce code that's
easily fast enough. All of the above methods were produced in about
3 hours of work. At least 45 minutes of that was reading Numeric's
documentation.

It should also be noted that for most cases, even the stupid approach
using straight Python is probably "fast enough" for a case with just
a handful of users. If the user count is 4 or below, we can skip the
entire RMS calculation and mix in all the user audio.

These results suggest that a whole host of other audio manipulation
tasks (such as silence detection) should also be quite possible in
Python.

Python and Audio Recording

There's no portable approach to capturing audio in the standard library.
This caused me some initial concern, but there's a solution for this: 
PortAudio [portaudio]. This is a platform independent library for 
accessing audio hardware, with an existing Python wrapper (fastaudio). 

(One minor aside is that the current release of PortAudio (v18) doesn't
work with ALSA, the new standard for Linux audio. ALSA is included by
default in the current Linux 2.6 kernel. On the positive side, the standard
library's ossaudiodev module works fine on this platform.)

Currently Shtoom uses either the ossaudiodev module or the fastaudio
wrapper of PortAudio. A native sound layer for Mac OS X is under
development, and I hope that a native Windows layer (using DirectSound)
will be available in the future.

Audio Encoding

As mentioned, RTP supports multiple audio encodings. SIP negotiates 
a common encoding for the participants of the call.

Shtoom's underlying audio layer reads and writes audio as signed 16 bit
PCM at 8kHz. This is then converted to whichever format is required for
the call.

The easiest audio codec to support is G.711 ULAW. This is 8 bit ULAW
at 8kHz (and is also the format used in an ISDN call). The standard
Python 'audioop' module can convert to and from this format. The downside
to this codec is that it consumes 64kbit/sec for each direction.

The next codec supported is GSM 06.10. This is a rather complex beast
to implement - fortunately, though, other people have already done the
work [gsm]. Itamar Shtull-Trauring wrote a simple wrapper around this 
library - it takes a sequence of 13-bit samples (the bottom three bits 
of the audio are discarded) and produces 33 bytes of output for each 
20ms of sound. GSM 06.10 consumes about 13 kbits/sec. 
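
The bit rates quoted for the two codecs follow directly from the frame
sizes. As a worked check, counting only the audio payload and ignoring the
RTP, UDP and IP headers:

```python
SAMPLE_RATE = 8000  # samples per second
FRAME_MS = 20       # audio carried per packet

FRAMES_PER_SECOND = 1000 // FRAME_MS
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000


def payload_bitrate(bytes_per_frame):
    """Bits per second of audio payload, in one direction."""
    return bytes_per_frame * 8 * FRAMES_PER_SECOND


pcm_bps = payload_bitrate(SAMPLES_PER_FRAME * 2)   # signed 16 bit PCM
ulaw_bps = payload_bitrate(SAMPLES_PER_FRAME * 1)  # one byte per sample
gsm_bps = payload_bitrate(33)                      # 33 bytes per 20ms frame
```

This gives 128000 bits/sec for the raw PCM, 64000 for ULAW and 13200 for
GSM 06.10, so GSM needs roughly a fifth of the bandwidth of ULAW.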

There's a variety of other codecs in the G.72x family - unfortunately
they are all patented up the wazoo, and require you to purchase licenses.
Shtoom will support these with C-code wrappers around the reference
implementations, but obtaining a license before using them is obviously 
up to the end-user. 
(Lawyers are welcome to let me know whether simply providing an interface
to a patented codec is going to get me sued - I hope not!)

A relative newcomer to the audio coding world is Speex - a non-patented 
codec developed by the Ogg Vorbis folks. I intend to support this in an 
upcoming release of Shtoom. Someone's already written a pySpeex wrapper 
around this audio codec.

DTMF (aka "the funny beeping")

RTP can carry more than just audio. One common thing to be carried is
DTMF (the signals that are sent when you press a button on a touch-tone
phone). Rather than sending the tone down the audio channel, these are
sent out-of-band, as a different sort of RTP packet (marked with a
different Payload Type). This means that you don't need to put expensive
signal processing in place to detect the magic audio tones - a very good
thing.
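
The text does not say which format these packets use. Assuming it is the
telephone-event payload of RFC 2833, which is the usual choice, the
four-byte payload can be sketched like this. The helper name and the
default volume are illustrative.

```python
import struct

# Event codes: digits are 0-9, then '*' = 10, '#' = 11, 'A'-'D' = 12-15
DTMF_EVENTS = {key: code for code, key in enumerate("0123456789*#ABCD")}


def build_dtmf_payload(key, duration, end=False, volume=10):
    """Build a telephone-event payload (hypothetical helper).

    duration is in RTP timestamp units (160 per 20ms at 8kHz);
    volume is the tone level in -dBm0, from 0 to 63.
    """
    event = DTMF_EVENTS[key]
    flags = (int(end) << 7) | (volume & 0x3F)  # E bit, reserved bit, volume
    return struct.pack("!BBH", event, flags, duration & 0xFFFF)
```

The payload is carried in an ordinary RTP packet whose Payload Type is a
dynamic value agreed during call setup, so the receiver only has to look
at the packet type to tell a key press from audio.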

While the Internet connectivity of phones allows a lot of other ways 
to interact with users, people who are still using a phone handset 
only have the DTMF buttons as a user interface. The goals of shtoom 
are that the user interface can generate DTMF (indeed, my original 
goal of a scripted client to test IVR systems requires it) and the
application side should allow interaction with DTMF in a simple way.
At the moment most of the shtoom innards for doing this are done -
it's now largely a matter of hooking the pieces together.

A SIP Call

So, how does this SIP thing all hook together? A simple example is 
probably in order, showing how shtoom handles the calls.

We'll show a small example here with one user calling another. (Under
international treaties describing technical papers these users must be
called "Alice" and "Bob".) We'll assume both Alice and Bob are using
shtoom - Alice is at home looking after a sick cat, while Bob is sitting 
at his desk at the office goofing off and looking for a reason to avoid work.

When Bob fired up his copy of shtoom after getting back from lunch, 
the first thing shtoom did was register Bob with a SIP location 
service. This consists of a message saying "Any call for Bob@divmod.com, 
send them to this IP address". The SIP proxy might request authentication 
(using HTTP's Digest Authentication), then registers the user.
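
For the curious, the digest calculation itself is only a few lines. This is a minimal sketch of the simplest form of HTTP Digest (no "qop" option), with made-up names and credentials; it is not lifted from Shtoom.

```python
from hashlib import md5


def _hexmd5(text):
    return md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Compute the 'response' value for a digest challenge (no qop).

    realm and nonce come from the server's challenge; method and uri
    describe the request being authenticated.
    """
    ha1 = _hexmd5("%s:%s:%s" % (username, realm, password))
    ha2 = _hexmd5("%s:%s" % (method, uri))
    return _hexmd5("%s:%s:%s" % (ha1, nonce, ha2))


# Hypothetical values - a real nonce is whatever the proxy sent.
response = digest_response("bob", "divmod.com", "not-my-password",
                           "REGISTER", "sip:divmod.com", "abc123")
print(response)
```

The client then repeats the REGISTER with this value in an Authorization header, so the password itself never crosses the wire.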

Alice is sitting at home bored and decides that boredom shared is
boredom halved, and places a call to her friend Bob@divmod.com. This
isn't Bob's work address - divmod.com in this case is a SIP location
server that Bob has an account on. Alice enters the address, and hits
Call. The Shtoom UI calls into the application layer - this in turn 
calls into the SIP layer. The SIP layer creates a new Call object. This
first determines which encodings are available, creates an INVITE 
message, and then sends it to the divmod.com SIP server. The INVITE
contains a few pieces of information:

  - The destination for the INVITE
  - Who's doing the calling
  - The network ports to use for the low level audio (SIP uses dynamically
    allocated ports)
  - A description of the media that the caller can use (audio encodings,
    video encodings and the like).
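
To make that list concrete, here is a rough sketch of what such a message looks like on the wire. It's heavily pared down (a real INVITE also carries things like Via branch parameters, From tags and Max-Forwards), the helper names are invented, and Alice's address and IP are placeholders.

```python
def build_sdp(user, local_ip, rtp_port, codecs):
    """Describe the media on offer. codecs is a list of
    (payload_type, name, clock_rate) tuples, most preferred first."""
    payload_types = " ".join(str(pt) for pt, _, _ in codecs)
    lines = [
        "v=0",
        "o=%s 0 0 IN IP4 %s" % (user, local_ip),
        "s=call",
        "c=IN IP4 %s" % local_ip,
        "t=0 0",
        # The dynamically allocated port the audio should be sent to.
        "m=audio %d RTP/AVP %s" % (rtp_port, payload_types),
    ]
    lines += ["a=rtpmap:%d %s/%d" % codec for codec in codecs]
    return "\r\n".join(lines) + "\r\n"


def build_invite(caller, callee, call_id, local_ip, sip_port, sdp):
    """Build a simplified INVITE carrying the SDP description as its body."""
    body = sdp.encode("ascii")
    head = [
        "INVITE sip:%s SIP/2.0" % callee,                 # destination
        "Via: SIP/2.0/UDP %s:%d" % (local_ip, sip_port),
        "From: <sip:%s>" % caller,                        # who's calling
        "To: <sip:%s>" % callee,
        "Call-ID: %s" % call_id,
        "CSeq: 1 INVITE",
        "Contact: <sip:%s@%s:%d>" % (caller.split("@")[0], local_ip, sip_port),
        "Content-Type: application/sdp",
        "Content-Length: %d" % len(body),
    ]
    return ("\r\n".join(head) + "\r\n\r\n").encode("ascii") + body


# Payload types 0 (PCMU, i.e. u-law) and 3 (GSM) are standard RTP numbers.
sdp = build_sdp("alice", "192.0.2.10", 5004, [(0, "PCMU", 8000), (3, "GSM", 8000)])
invite = build_invite("alice@example.com", "Bob@divmod.com",
                      "1234@192.0.2.10", "192.0.2.10", 5060, sdp)
print(invite.decode("ascii"))
```

Anyone who has stared at HTTP requests will find the result very familiar.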

The divmod.com SIP server looks up its internal database to figure out
how to currently contact Bob, and forwards the SIP invite on to Bob's
computer. The SIP layer in Bob's shtoom creates a new call (based on the
Call-ID header in the invite) and hands control off to this new call. The
first thing it does for a new call is pop up a message saying "Alice is 
calling, answer?"
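
The "find or create a call from the Call-ID" step is really just a dictionary lookup. The sketch below shows the idea; the class and function names are invented for this example and are not Shtoom's real classes.

```python
def get_header(message, name):
    """Return the value of the first header called name, or None."""
    head = message.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:      # skip the request/status line
        key, _, value = line.partition(":")
        if key.strip().lower() == name.lower():
            return value.strip()
    return None


class Call:
    def __init__(self, call_id):
        self.call_id = call_id
        self.events = []

    def handle(self, message):
        # A real call object would drive a state machine (and the UI)
        # here; this sketch just records the first line of each message.
        self.events.append(message.split("\r\n", 1)[0])


class SipLayer:
    def __init__(self):
        self.calls = {}    # Call-ID -> Call

    def dispatch(self, message):
        call_id = get_header(message, "Call-ID")
        call = self.calls.get(call_id)
        if call is None:
            # First message with this Call-ID: it's a new call.
            call = self.calls[call_id] = Call(call_id)
        call.handle(message)
        return call


layer = SipLayer()
call = layer.dispatch("INVITE sip:Bob@divmod.com SIP/2.0\r\n"
                      "Call-ID: 1234@192.0.2.10\r\n\r\n")
print(call.call_id, call.events)
```

Every later message carrying the same Call-ID ends up at the same call object.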

Bob clicks 'Yes', and this is passed through to the newly created call.
It sends back a 200 OK response to the proxy. (SIP shares much with HTTP, 
including the formatting of messages and many of the response codes. OK 
in SIP is 200, just like HTTP). The response includes Bob's real network
address, the list of media that Bob can handle from Alice's original
invite, and the network ports that Bob's phone program will be using.
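
Working out "the list of media that Bob can handle" is a simple filter over what Alice offered. A minimal sketch, assuming codecs are identified by name and that the caller's preference order should be kept (the function name is hypothetical):

```python
def common_codecs(offered, supported):
    """Return the offered codecs we also support, in the caller's order."""
    supported = set(supported)
    return [codec for codec in offered if codec in supported]


# Alice offers three codecs; Bob's phone only knows two of them.
print(common_codecs(["PCMU", "GSM", "G729"], ["GSM", "PCMU"]))
# ['PCMU', 'GSM']
```

If the list comes back empty there is nothing the two phones can both speak, and the call can't proceed.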

The proxy forwards the response back to Alice. Alice's phone receives 
the response, looks up the relevant call, and passes the response to it.
It next replies with an ACK request directly to Bob's computer, 
using the network address in his response. After the ACK is sent, the 
connection starts up - audio flows between the network ports negotiated 
in the INVITE/OK messages.

Eventually one party or the other will terminate the connection - at 
that point, they click 'Hang up'. The application passes this to the 
SIP layer, along with the current call id. The SIP layer asks the relevant
call object to format a BYE message, and sends it to the other phone. On
receiving a BYE, the receiving phone hangs up the line and sends back an OK.

Conclusions

So, the conclusions that can be drawn from this effort? Well, the 
first, and most obvious, is that Python's a hell of a lot more capable 
than you might think. I was actually surprised at how easily Python 
was able to handle what I was throwing at it - even without "cheating"
and using code from the standard library, it was actually fast enough.
Pyrex makes it pretty trivial to remove the bits of the code that are
potential bottlenecks. Pyrex rocks.

Once you remove the "performance" reason for avoiding Python, the 
list of reasons for not using Python is pretty thin. "It makes the
C++ coders cry", while true and fair, isn't a suitable justification.

The final point to be drawn from the performance section is that
when you're looking at a problem that seems "hard" - examine the
standard library. Python's "batteries included" philosophy, with its
extensive library, means that someone's quite possibly already done the
work for you. And there's no software methodology in the world that will
produce a result faster than "someone already did it for me".

Overall, the number one thing I've figured out from this whole exercise
is that people who refer to Python dismissively as "just a scripting
language" probably don't know what they're talking about.

Future Work

In no particular order, here's some future work I'll be looking at.

Video

My initial thoughts were that video would be completely impossible 
to handle with Python. Having been surprised once, though, I'm going
to check this. I suspect that the problems of platform-independent 
audio interfaces will be even worse for video capture, and that this 
will be the major pain with implementing video. 

Additional Platforms

There's a variety of new platform work I'd like to look into. One 
intriguing possibility is the Python interpreter bundled in the Nokia
6600 series. While the CPU of the phone is unlikely to be able to 
handle the requirements of Shtoom, if the Python platform exposes the
audio interfaces of the phone, this may be enough to allow the phone 
to work.

Additional Interoperation

It would be nice to interoperate with programs such as Messenger and
iChat - I've not begun this task. It appears that they use their own
protocols for some of the call setup work - hopefully this won't require 
extensive protocol reverse engineering.

I can only test Shtoom against implementations that I have access to - 
as time goes on, I hope to be able to expand the list of tested systems.

Instant Messaging (SIMPLE)

The IETF is moving towards standardising an instant messenger protocol
based on SIP. This could be an interesting direction to explore.

Additional Phone Features

Adding the ability to handle multiple calls is an obvious step for the
phone application. Most of the work is in the user interface side for 
this. One nice extension would be an ad-hoc conferencing facility. There's
no reason that any given phone couldn't patch two or more incoming calls
into the same audio session, or make an additional outbound call and
patch it into the existing call. 

Doug

There's still a large amount of work to get Doug to a state where I'm
happy with it. This will largely be an iterative process - as new 
requirements come up, the platform will change.

A Long Footnote: Firewalls and SIP

Firewalls and Network Address Translation (NAT) boxes are the bane of
VoIP. With dynamically allocated UDP ports, and an announcement protocol
that needs to know what ports the traffic is going to be on, it's almost
certain that every firewall known to man will screw it up in some way. 
There's a few solutions to this.

SIP Proxy

One solution is to have a SIP proxy server. This is a pain in the 
backside for all concerned - it's also not very practical. Most people
behind a firewall don't have the ability to run a proxy server. One 
day I might look at implementing this, but it's pretty low on my list
of things to do.

Fixed Port Numbers

Another approach is to have fixed local port numbers, and have the 
firewall forward those ports on to the internal system. This is OK if
it's only one or two people using the system, but quickly becomes a
nightmare to manage for any larger number of users.

SIP-aware firewalls

It would be nice if firewalls gained knowledge of SIP, and would
do the necessary magic to allow packets to flow through. This of
course then means you're relying on the firewall vendor to get it
right. This could be considered unlikely. Certain variants of Cisco's
IOS implement this, and it's reported that they actually work OK.

A Different Protocol

One solution might be to use a different protocol that's a little
more forgiving of firewalls. The problem here is getting the protocol
standardised and deployed widely. This might be a long term approach -
I've not seen much happening in this area.

STUN

STUN [stun] is a UDP protocol hack to help you determine what your
firewall is doing. Briefly, a STUN request involves sending a packet
from the port you're going to be communicating on, to a STUN server 
outside your firewall. The STUN server examines the packet, and replies
with the IP address and port number that it saw the packet come from.
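
The packets involved are tiny. The sketch below builds a request and picks the address out of a reply, assuming the original RFC 3489 wire layout (a 20-byte header followed by type/length/value attributes). It is an illustration with invented function names, not Shtoom's implementation, and it does no network I/O.

```python
import os
import socket
import struct

BINDING_REQUEST = 0x0001
BINDING_RESPONSE = 0x0101
MAPPED_ADDRESS = 0x0001    # attribute type


def build_binding_request():
    """Return (packet, transaction_id) for an attribute-less request."""
    transaction_id = os.urandom(16)
    # Header: message type, length of the attributes (none), transaction id.
    return struct.pack("!HH", BINDING_REQUEST, 0) + transaction_id, transaction_id


def parse_mapped_address(response, transaction_id):
    """Return (ip, port) as seen by the server, or None if not found."""
    if len(response) < 20:
        return None
    msg_type, length = struct.unpack("!HH", response[:4])
    if msg_type != BINDING_RESPONSE or response[4:20] != transaction_id:
        return None
    pos, end = 20, min(20 + length, len(response))
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[pos:pos + 4])
        value = response[pos + 4:pos + 4 + attr_len]
        if attr_type == MAPPED_ADDRESS and len(value) == 8:
            # One unused byte, address family (1 = IPv4), port, address.
            _, family, port = struct.unpack("!BBH", value[:4])
            if family == 0x01:
                return socket.inet_ntoa(value[4:8]), port
        pos += 4 + attr_len
    return None


request, tid = build_binding_request()
print(len(request))    # 20
```

In use, the request is sent with sendto() from the very socket the SIP or RTP traffic will use, and the reply that comes back on that socket is handed to parse_mapped_address().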

STUN is only half of the solution. You also need your firewall to 
do stateful UDP filtering - that is, if packets go out on a port,
allow the replies to come back in. A side-benefit of STUN is that 
the outbound request should allow the traffic to flow back in.

Note that STUN doesn't help you if your firewall doesn't handle 
stateful UDP filtering. In the words of one correspondent "STUN
just lets you discover how screwed you are". That is, it allows 
you to figure out whether your firewall is usable, and how you 
can work with it.

Shtoom implements STUN for both SIP and RTP traffic. The STUN
implementation will eventually be folded back into the core
Twisted framework - it's useful for any UDP protocol that needs
firewall traversal.

UPnP

Microsoft's entry in this cavalcade of horrors is Universal Plug
and Play (UPnP). This is a protocol that allows networked devices
to discover and control aspects of their local network. In the
case of a firewall, it allows an end-user system to request a
dynamic port-forwarding from the firewall to the box. Many network
administrators will probably (rightly) recoil at letting applications 
on a Windows box dictate firewall policy. UPnP, while implemented 
initially on Windows, is now an open protocol.

As an aside, UPnP's implementation (which features SOAP, HTTP over
multicast/broadcast UDP, and extremely odd XML) is a must-read for 
fans of unnatural and baroque network protocols. 
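
For a taste of the "HTTP over multicast UDP" part, this is roughly what the discovery request looks like. It's a minimal sketch of an SSDP search for an Internet Gateway Device; the function name is invented and this is not the code in Shtoom.

```python
# The well-known SSDP multicast group and port.
SSDP_ADDR = ("239.255.255.250", 1900)
IGD = "urn:schemas-upnp-org:device:InternetGatewayDevice:1"


def build_msearch(search_target=IGD, max_wait=2):
    """Build an SSDP discovery request: an HTTP-style message that gets
    sent in a single UDP datagram to the multicast group."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        "HOST: %s:%d" % SSDP_ADDR,
        'MAN: "ssdp:discover"',
        "MX: %d" % max_wait,       # seconds a device may wait before replying
        "ST: %s" % search_target,  # the kind of device we're looking for
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


print(build_msearch().decode("ascii"))
```

Any gateway that answers sends back an HTTP-style response (again over UDP) pointing at an XML description of itself, and it's SOAP from there on in.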

There's a partial implementation of UPnP in Shtoom - I hope to 
finish it in the not too distant future. It's not clear how useful
this will be - the first worm/virus that uses UPnP to punch holes
through firewalls will probably result in UPnP being disabled 
everywhere. Most of the routers that implement UPnP are also 
capable enough that it's unnecessary.

Acknowledgements

Thanks to the entire Divmod and Twisted teams for assistance in the 
development of Shtoom. Special thanks to Amir Bakhtiar for providing
hardware for testing, patient Windows user feedback, and for getting 
me along to PyCon in the first place. Thanks also to ekit for getting
me to the US, and for providing the impetus to do this stuff in the
first place.

Thanks also to Andy Hird, Dougal Scott, Cam Blackwood, Benno Rice and 
Toby Sargeant for feedback on earlier versions of this paper. 
Any mistakes remaining are, of course, entirely my fault. And my cats. 
They're always messing things up.

[ gsm ] http://
[ numarray ] http://
[ portaudio ] http://
[ pydirector ] http://
[ pyrex ] http://
[ rtp ] http://
[ shtoom ] http://divmod.org/Home/Projects/Shtoom
[ sipphone ] http://www.sipphone.com/

$Revision: 8580 $