"Scripting Language" My Arse: Using Python for Voice over IP ============================================================ [This is the paper, not the talk. The talk is yet to come] [third draft] Abstract -------- A common complaint made of Python is that it is not suitable for serious application development, and is only suitable for "scripting" or "prototyping" tasks. The Shtoom toolkit (http://divmod.org/Home/Projects/Shtoom) is a Voice over IP (VoIP) framework implemented in Python using the Twisted framework. It includes 'shtoom' itself, a software phone using the toolkit. This paper covers the basics of SIP and RTP (the protocols underlying Voice over IP), examines some of the issues relating to the implementation of Shtoom (with a digression on issues relating to timing), and will hopefully help demonstrate why implementing applications in Python is perfectly feasible. Introduction ------------ Why would I choose Python for VoIP? Python's a high-level language, with many constructs that make it extremely pleasant to work with. In addition, the Twisted framework provides an efficient and elegant model for implementing network protocols. In implementing the software phone, a nice-to-have was that the phone would work in a cross-platform way - I am not aware of any existing cross-platform software phones. Why would I not choose Python for VoIP? The primary reason would seem to be performance - VoIP is a complex beast, with requirements for throwing around packets of audio at some speed. It would seem from a first look that an interpreted language like Python would not be suitable for this task. Why Shtoom? ----------- I had a need for a VoIP client that could be scripted for automated testing of our cisco gateways running a number of large, complex IVR scripts (an IVR is one of those automated phone systems you interact with via the phone, pushing phone buttons to respond to menus). 
In addition, I was looking for a replacement for the current conference calling application we use (stupidmcu, derived from OpenH323's openmcu).

We had been using OpenH323 [openh323] internally for a number of applications, so my first approach was to examine its suitability for wrapping in Python. It's a large, complex C++ library and, as usual for a large C++ library, it implements its own basic types. I started down the path of using Boost.Python to wrap this library, but abandoned this after a few days' work and pain. Just wrapping the basic types it needed would have been a couple of weeks of tedious work. As a programmer who prefers to code in Python, this struck me as a very, very boring approach. In addition, the OpenH323 libraries were (in my experience) extremely awkward to debug -- this is largely because the underlying H.323 protocol is itself a nightmare. I'll come back to H.323 in a bit.

I then investigated using SIP (the Session Initiation Protocol) instead of H.323. SIP is the Internet's answer to H.323 (much more on SIP in later sections). There was a partial implementation of the SIP protocol as part of Twisted (enough to implement a SIP Registration server), so this was a good base to begin with. I'd already had experience implementing RTP (the Real-time Transport Protocol), the UDP-based protocol that carries the audio itself over the Internet, in C code, so I felt I was up to the task.

Why Python?
-----------

There are a few obvious reasons for choosing Python for Shtoom:

It's easy to work with, and to debug. For implementing a network protocol from scratch, Python is hard to beat.

It's cross platform - while my initial requirements were for something that would work on Linux and Solaris, having it work on other platforms would be a nice-to-have.

There's a variety of UI toolkits available from Python, as well as one (Tkinter) that's cross platform.

And finally, of course, Python is fun to write.

Why not Python?
---------------

The first concern I had was whether Python would be fast enough to handle VoIP. VoIP is a lot of little packets flying back and forth, and with certain applications (such as conferencing) you need to do software mixing of multiple audio samples down to a single sample.

The next concern about Python was the interfaces to the audio hardware, in particular, capturing audio. We'll cover this more, later.

There's no single user interface for Python. I regard this as something of a positive - Shtoom has a pluggable user interface layer. Currently the code has Qt, GNOME, Tk and command line user interfaces. An MFC (Windows) and Cocoa (Mac OS X) UI are planned.

The underlying RTP protocol has fairly harsh timing requirements - you need to send a packet of audio every 20ms. This requirement was my major concern about Python's suitability for this task.

Why Twisted?
------------

Twisted is an open-source Python framework for writing network applications, using an asynchronous event model. I'd previously used Twisted in another project [pydirector] and was impressed with the stability and flexibility of the core library. Twisted also features a whole pile of useful code that was already available - this meant I could concentrate on the interesting bits of the problem.

Voice over IP - A Short and Biased Summary
------------------------------------------

Voice over IP (VoIP) refers to the carriage of telephone calls over the Internet, rather than the traditional public switched telephone network (PSTN) -- the copper wires and fibres that connect every house together. VoIP is used heavily by carriers (telephone companies) for their internal networks, and is gaining increasing popularity as high-speed Internet links to the home become more common.
As well as being considerably cheaper than traditional phone calls (effectively free, assuming your Internet link is already paid for), VoIP allows for a variety of more sophisticated telephone services, such as video, multi-party communications (conferencing), and, well, pretty much anything you can think of. This is one of the most exciting aspects of the Internet taking over the telephone world - it takes control of the network off the existing carriers, and allows a wide variety of people to come up with new and interesting services.

Once your phone call is being routed over the Internet, it can, in theory, go anywhere. Well, anywhere that's on the Internet. This of course probably doesn't include your mother, or the friend who's walking down the street with a mobile phone. To get around this problem, many people provide gateways to the PSTN from VoIP. Most of these gateways are commercial, but they are usually much cheaper than the phone call over a landline would be. The standardisation of the VoIP protocols also means you have a large variety of companies who can accept your business.

So how do you use this wonderful VoIP thingy? Well, obviously you're going to need an Internet link. And then you need a device that allows you to enter a phone number or net address, connects you to the other end, and then transmits the audio over the Internet. There are two sorts of devices that can do this.

The first is a hardware phone. A couple of years ago these were an expensive toy requiring extensive infrastructure and used only by large corporates; today they are a much more affordable consumer item that you can pick up for around US$100-US$200. SIPphone [sipphone], started up by MP3.com's Michael Robertson, sells an adapter that has a phone jack on one side and an Ethernet port on the other. You simply plug an existing handset into one side of the adapter, an Ethernet cable into the other side, and you're ready to go.
Other carriers, such as Vonage, also provide these interfaces. There are also dedicated SIP phones - these look something like a regular phone, but with an Ethernet port on the back.

The second is known as a soft-phone, or, to use a term most people should be familiar with, a computer program. (Telephony types _love_ their terminology). It uses the existing PC sound hardware (speakers and microphone) and communicates via an existing Internet connection. There are a few free softphones out there, as well as many commercial phones. I had a look through the existing phones before I started on the implementation of Shtoom, to figure out what I liked and disliked. At the moment, the most polished of the free phones that I looked at is XTen's X-Lite. This is a closed-source Windows phone, so was only useful to me for interoperability testing.

In addition many chat programs, including Microsoft's Messenger and Apple's iChat, are in fact SIP clients - they use SIP under the hood for voice chats.

VoIP: The Protocols
-------------------

H.323
~~~~~

Once upon a time, the only VoIP protocol was H.323. This was a standard created by the ITU-T, the same organisation that gave us the runaway success of the X.500 directory service and X.400 email.

H.323 has much in common with other ITU-T standards - it features a complex binary wire protocol, a nightmarish implementation, and a bulk that can be used to fell medium-to-large predatory animals. OpenH323, an open-source implementation of this protocol, consists of over 7 MB of C++ code (the UNIX utility 'wc' reports that it's over 2.4 million lines of code). This doesn't include the code to actually encode and decode the audio.

I don't intend to cover H.323 in detail in this paper - there are many fine resources on the net for you to peruse if you wish to inflict this on yourself. It should be noted, though, that H.323 is only one of a suite of protocols - it depends on H.225, H.245 and a swarm of other protocols.
I'm unaware of anyone having implemented even a fraction of H.323 in Python. Doing so would require a special kind of dedication, and quite possibly a large amount of whiskey and prescription medication.

SIP
~~~

SIP (the Session Initiation Protocol) is a creation of the IETF, the organisation that produces Internet standards. While it is a complex protocol, it features many advantages over H.323:

- It uses text message bodies, in a format that should be familiar to anyone who's looked at the headers of an email message or a web request.

- It is based on a variety of existing IETF protocols, including SDP and HTTP.

- It wasn't designed by an organisation of telcos based in Switzerland.

One common complaint about SIP regards its complexity. While it is on the large end of a typical Internet protocol (the base RFC, 3261, comes in at 269 pages), the problem it's solving is a complex one and to supplant H.323 it needs to support a ridiculous number of options. But again, it's only complex compared to other Internet protocols. Compared to ITU protocols, it's a work of austere elegance.

An aside on Standards
~~~~~~~~~~~~~~~~~~~~~

Standards are good. Standards make a lot of pain go away, and make everything easier. This is particularly true for VoIP -- the whole point of VoIP is being able to talk to other people. This obviously gets somewhat tricky if the phones don't talk the same protocol.

There are a number of non-standard approaches out there. The most visible is Skype [skype], a Windows softphone that uses a proprietary protocol. Skype claim a whole pile of benefits to having their own protocol - it works better with a variety of firewalls, it's more efficient, blah blah blah. Unfortunately the trade-off for this is that you can only talk to other Skype users - anyone on a non-Windows platform need not apply. There's also unlikely to be the variety of hardware phones and phone adapters that you can get for SIP.
In addition, when Skype eventually gets gatewayed to the existing PSTN network, it's very likely that your only choice for a gateway will be Skype.

Another non-standard is Asterisk's IAX. While this is an open protocol, the only documentation of it is in the C code of Asterisk. This would be an amount of not-fun to reverse-engineer. Worse yet, the lack of any written specification means it could change whenever the Asterisk code changes. Once the Asterisk project takes the time to write down their protocol, I'll consider implementing it.

Implementing VoIP
-----------------

The two main divisions of work in implementing a VoIP application are the implementation of SIP, which controls the call negotiation and setup, and the implementation of the underlying protocol that passes the audio back and forth.

The latter uses a protocol known as RTP, the Real-time Transport Protocol [rtp]. This is a quite venerable Internet protocol, initially developed for use in Multicast applications. RTP consists of small packets of audio, transmitted as UDP. A typical packet holds just 20ms of audio. There is a companion protocol, RTCP (the RTP Control Protocol), that is used to communicate information such as delivery reports. The audio can be in a number of different formats - the format negotiation is explicitly _not_ part of RTP, but is left to a higher level protocol, such as SIP.

One interesting aspect of implementing SIP is that every SIP implementation is both a client and a server. Either end of a SIP conversation can initiate a request or reply to a request. This is quite different to HTTP, which SIP superficially resembles. The protocol itself is also quite stateful - in the implementation there are a number of state machines for handling the various states of a call.

Shtoom Details
--------------

Shtoom Architecture
~~~~~~~~~~~~~~~~~~~

Ooo.
ASCII art::

     +-----------+
     |    UI     |
     +-----------+
           |         +-------+
    +-------------+ /|  SIP  |
    |             |/ +-------+
    | application |
    |             |\ +-------+
    +-------------+ \|  RTP  |
           |         +-------+
     +-----------+
     |   audio   |
     +-----------+

The application is the core element of a Shtoom application. It controls the flow of calls, handles the (high level) incoming events, and deals with the flow of data between the other components (for instance, between the audio layer and the RTP layer).

The audio layer is an abstraction on top of the audio hardware and any audio codecs that might be present. The application calls into the audio layer to query and select audio formats, and to deliver and retrieve audio.

The UI layer is only present on those applications that require a user interface (currently only the phone). The application passes requests to the UI (for instance, when an incoming call arrives) and the UI calls into the application when the user requests something (for instance, when the user enters an address and hits 'call').

The SIP layer is an implementation of SIP. It listens for requests and responses and passes higher level requests to the application. At the moment Shtoom's SIP implementation is not complete - I'm adding to it as I hit a requirement for a new feature.

An RTP layer is created for each incoming or outgoing call. It merely passes the audio to and from the network. Each RTP layer is responsible for its own timer loop. In the future, it would be possible for an RTP layer to be instantiated on a different machine, to allow load spreading.

Multiple User Interfaces
~~~~~~~~~~~~~~~~~~~~~~~~

One nice thing about Python is the wide variety of user interfaces available, and the ease of working with them. I don't think any application implemented in a lower-level language would attempt to ship with 4, 5 or 6 user interfaces. In Python, though, this is really quite easy. In addition, I've made efforts in Shtoom to produce a higher-level API to reduce my workload.
One thing that's reduced my workload significantly is the Preferences interface. Trying to maintain various preferences dialogs and keep them in sync for the different platforms struck me as a very boring task, so instead I developed code that describes the preferences available in an application, and then the user interface layer inspects the options object to build the preferences UI. This allows me to tweak the preferences without having to rebuild the dialogs in each UI. There's additional code that works from the same options object to build a command-line parser (using optparse) and to load and save from Config.ini-style settings files. This is probably useful enough that I'll look at releasing this independently of Shtoom.

Another reason for Shtoom's multiple user interfaces (aside from indecision on my part) was a desire to have a nice example of the different user interface toolkits and how they interface with Python. Hopefully this will be useful in the future - both for people trying to choose between UI toolkits and for people wondering about converting from one toolkit to another. I'm not aware of any projects that provide the same UI using a number of different toolkits.

However, I'm not silly enough to offer an opinion as to which one I consider "the best" - no matter which I choose, someone will disagree violently, and attempt to engage me in a long and tedious discussion about the merits of their toolkit of choice. I really don't care enough to put myself through this. My only comment would be that while Tk is very simple and easy to code, it's... very simple. A lot of things you take for granted in a more modern toolkit require additional packages on top of Tk.

Other Shtoom applications
-------------------------

While the most visible part of shtoom is the phone application, there are a number of other applications in the package.
This number will grow as I have time to write them, and as I develop the Doug application server further (more on Doug, soon).

The first two are a simple announcements server (available by placing a call to 'sip:testcall@divmod.com') and a basic voicemail server. The latter plays a per-user announcement, then records the audio from the person calling. When the person hangs up the call it saves the audio off - this can then be sent to the user as an email, or whatever.

There's also a simple echo server - it simply replays the audio sent to it back to the caller. This is extremely useful for debugging.

The next application that I intend to ship is shtoomcu - a conferencing server. Multiple people call into the conferencing server and can talk to each other. This is less complex than it sounds - you simply keep track of all participants in a conference, and when a bit of audio comes in, you pass it to the other users. The tricky bit is mixing audio - when multiple people are talking, you need to make sure that you do the right thing and mix the audio samples together. I'll come back to this a bit later in the paper in a discussion on performance.

A bit further down the track, the conferencing will also be folded into the phone program - this will allow users to connect together multiple calls into a single multi-party call.

Doug: The Shtoom Application Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The next big step is implementing a full voice application server, known as Doug. This will be an event driven application server for writing voice applications. If you've used the Tcl engine embedded in Cisco voice gateways, you'll have an idea of the sorts of things possible. I don't pretend to know what the next great voice applications will be - but I'd like to make it easy for people to write these applications.

Doug is in a very early stage as yet. Once it's done, the previously mentioned applications (message, voicemail, conferencing) will be rewritten as Doug applications.
Timing and Buffering
--------------------

There are a few issues you encounter as part of implementing RTP (the low-level protocol used to transmit the audio data). The first is the simple trade-off of buffering vs latency. Simply put, the more you buffer before playing, the more robust you are in the face of a glitchy network that delays individual packets of audio, but the more delay there is in playing the audio.

For now, shtoom takes a simple approach of not buffering at all - as soon as a packet arrives, it is sent to the audio device. And every 20ms, a lump of audio is read from the audio device and sent to the network. While this doesn't give the best performance in the world, it's the easiest to implement. In practice, this is usually "good enough", but I do intend to revisit this issue in the future.

The next issue is that RTP requires a reliable source of audio - you need to send the audio every 20ms. The real problem with this is that most modern computers have a timer clock that runs at only 100Hz. This means that the resolution of the timer is just 10ms. This has the unfortunate implication that if you miss the 20ms clock tick (even by a single millisecond) you have to wait for the next tick, so the packet goes out 10ms late. This delay is quite obvious to the listener and can render an audio stream unusable, even if only one in 10 samples is delayed in this way.

I initially assumed that this real-time requirement would make Python an unsuitable language for implementing RTP - indeed, in a previous (non open-source) project, I just assumed that this would be the case, and implemented the RTP component of the application in C. This time around, though, I tested my assumption first. I was pleasantly surprised.

Timing Strategies
~~~~~~~~~~~~~~~~~

The first and most obvious approach to this sort of timing is to use a timer signal - on UNIX, for instance, there is a setitimer() call that allows you to specify a repeating timer loop, implemented via signals.
This has a few problems - it's non-portable, it relies on signals, and doesn't work if you have multiple timer loops in a single application. (Did I mention that it relies on signals?) Nonetheless, this was the first approach I took, to determine whether Python was able to package up a bundle of audio and send it out within the 20ms available. I was quite happy to find that this was in fact extremely easy. On my (admittedly overpowered) laptop this takes less than a third of a millisecond. So, having determined that this wasn't going to be a problem, I went back to the original problem of getting the timing right.

The second approach is to schedule a call, and have the call reschedule itself immediately. Something like::

    def nextpacket(self):
        reactor.callLater(0.020, self.nextpacket)
        # Send the current packet
        # read the next audio for the next packet

The problem here is that if there is a delay in calling the nextpacket() routine for some reason, the next packet might miss the 20ms timer and instead hit a 30ms timer. You can do a hacky workaround for this by setting the timer to, say, 18ms, and hoping that any delay will fit inside this 2ms window of error. This is extremely ugly and rather brittle.

The approach Shtoom now uses is a construct called LoopingCall, developed by JP Calderone. The guts of the LoopingCall are as follows::

    import time
    from twisted.internet import reactor

    def _loop(self):
        # Call the function, with the stored args and kwargs
        self.f(*self.a, **self.kw)
        # Now re-calculate the next timer delay
        self.count += 1
        # How long ago did we start? (a negative number of seconds)
        fromNow = self.starttime - time.time()
        # When should the next timer be scheduled, relative to the start?
        fromStart = self.count * self.interval
        delay = fromNow + fromStart
        if delay > 0:
            self.call = reactor.callLater(delay, self._loop)
            return

The approach is that the LoopingCall calls the function, then determines when the next timer call is due. Because the due time is computed from the original start time and the tick count, rather than from the time of the previous call, a late call does not push every later call back with it. It then schedules a timer call for the delay needed.
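The same idea can be tried outside Twisted. The following is a minimal sketch, not Shtoom's or Twisted's code: the names ``next_delay`` and ``run_loop`` are made up for illustration, and a plain sleep stands in for ``reactor.callLater``. It shows that the wait before each tick is derived from the start time and the tick count, so the time spent doing work is absorbed rather than accumulated.

```python
import time


def next_delay(start, count, interval, now):
    """Seconds to wait until tick number `count`, measured from `start`.

    A result <= 0 means that tick's deadline has already passed.
    """
    return (start + count * interval) - now


def run_loop(func, interval=0.020, ticks=5,
             clock=time.monotonic, sleep=time.sleep):
    """Call `func` `ticks` times, aiming for one call per `interval` seconds.

    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting (an assumption of this sketch).
    """
    start = clock()
    for count in range(1, ticks + 1):
        func()
        delay = next_delay(start, count, interval, clock())
        if delay > 0:
            sleep(delay)
        # If delay <= 0 we are already late, so carry on immediately.


if __name__ == "__main__":
    run_loop(lambda: print("tick at %.3f" % time.monotonic()))
```

If each call to ``func`` takes 3ms, the loop sleeps for 17ms rather than 20ms, so ticks stay on the 20ms grid instead of drifting to a 23ms period.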
This approach has proven rock-solid in use, and remnants of the previous code that used setitimer() have been removed from the codebase.

This removed the first major concern I had about implementing SIP in Python. The next, mixing together audio, seemed like a harder problem.

Performance
-----------

Many people in the computer industry are obsessed with performance over all else. This often misses the point - the question to be asked for most applications is not "how fast is it?" but instead "is it fast enough?"

For most of shtoom, Python is easily fast enough. To really put it to the test, though, I concentrated on one of the most CPU intensive components - the mixing of audio for conference calls.

In a conference call, we have many audio sources contributing to the audio transmitted out. For each user, we want to find the "loudest" N audio sources (not including the user) and mix the audio samples together.

A simple approach to take is to take each of the contributing audio sources, estimate their volume (using a simple root-mean-squared function), sort by power, and then take the top N (for my example, N is 4). We then scale each audio signal by 1/4 and add them together.

I first tried an implementation in straight Python. [listing mixPython] In this (and all following examples) the input is a list of 320 byte strings - these are each 160 16 bit signed sample values, representing 20ms of audio. The output should also be a 320 byte string, in the same format.

We first take the RMS of each audio chunk and sort them by this value. We then take the top 4 samples, scale them down, then add them. On my test machine, feeding in 18 audio samples, selecting the top 4, and then mixing them together took around 2.2ms. This is purely in Python, and only minimal efforts to optimise this were taken (I'm sure people can point to obvious speedups).

The second approach I tried was to use the Numarray [numarray] (formerly Numeric) Python extension.
This is shown in [listing mixNumeric]. This turned out to be slightly slower than the pure-Python implementation (around 2.4ms). Examining this closer, it showed that while the scaling and adding were about 3 times faster, this was outweighed by the increased time taken in constructing the arrays. It's possible that there is a way to improve this by re-using existing array objects - I've not looked too closely at this.

Psyco was also tried - this produced only minimal speedups. I'm open to ideas as to why.

Next was to start implementing sections of the code in Pyrex [pyrex]. Pyrex is a dialect of Python that is translated directly into C code. It's by far the most pleasant way to write C extensions for Python. Examining the Python code in detail revealed that the most expensive part of the calculation was the RMS - it was responsible for around 65% of the time taken. Moving just the RMS operation to Pyrex reduced that component from around 1.4ms to just 0.35ms - taking the overall time to just over 1ms.

Finally, I noticed that the standard Python module 'audioop' had most of the functions I needed, implemented in C code. Using these reduced the time taken to around 120 microseconds (0.12ms). This is an impressive speedup of nearly 20 times, and as an added bonus, the code is considerably smaller and easier to work out.

This isn't _quite_ the end of the calculations, though - this only does mixing for a single output. We need to do this for each participant in the conference. We can re-use a lot of the calculations at each stage - we only need to calculate the power once for each sample, and all users that are not one of the N+1 loudest samples can re-use the same output sample.

For comparison's sake, times were taken for the stupid approach (recalculating the scaling and adding for each user) and for the smart method which does the minimum work necessary.
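As a rough illustration of the single-output select-and-mix step described above (this is not the paper's own listing), here is a minimal pure-Python sketch. The function names ``rms`` and ``mix`` and the use of the standard ``array`` module are assumptions made for the example. Input chunks are 320-byte strings of 160 signed 16-bit samples in native byte order, as in the text.

```python
import math
from array import array

SAMPLES_PER_CHUNK = 160  # 20ms of audio at 8kHz, 16 bits per sample


def rms(chunk):
    """Root-mean-square level of one chunk of signed 16-bit samples."""
    samples = array("h", chunk)
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix(chunks, top_n=4):
    """Mix the `top_n` loudest chunks into a single chunk.

    Each selected chunk is scaled by 1/top_n before adding, so the
    sum always fits in a signed 16-bit sample.
    """
    loudest = sorted(chunks, key=rms, reverse=True)[:top_n]
    mixed = array("h", bytes(2 * SAMPLES_PER_CHUNK))  # start from silence
    for chunk in loudest:
        for i, sample in enumerate(array("h", chunk)):
            mixed[i] += int(sample / top_n)
    return mixed.tobytes()


if __name__ == "__main__":
    # Six constant-level test chunks; only the four loudest are mixed.
    levels = (100, 4000, -8000, 20, 1200, 400)
    chunks = [array("h", [v] * SAMPLES_PER_CHUNK).tobytes() for v in levels]
    print(array("h", mix(chunks))[0])
```

The "smart" variant discussed in the text would compute ``rms`` once per source and share the mixed output among all listeners who are not themselves among the loudest sources; this sketch deliberately leaves that out.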

========  =========  ==================  ===================
Method    1 channel  18 channels (dumb)  18 channels (smart)
========  =========  ==================  ===================
Python    2.2ms      8.7ms               2.7ms
Numeric   2.4ms      5.0ms               2.7ms
Pyrex     1.1ms      7.7ms               1.6ms
audioop   0.12ms     0.80ms              0.18ms
========  =========  ==================  ===================

The "smart" approach also has the benefit that it scales up to a large number of participants very well.

So, back to the original point. Is this "fast enough"? Well, this is for an audio sample of 20ms duration. The figures above show that, with audioop, we can produce audio output in around 1% of the time limit we have to meet. In the real world this would be even better performance - particularly with VoIP clients that have silence suppression, where they don't send audio if the user isn't talking.

Even taking the pure-Python implementation, we're "fast enough", but only barely. But with a small amount of optimisation we can produce code that's easily fast enough. All of the above methods were produced in about 3 hours of work. At least 45 minutes of that was reading Numeric's documentation.

It should also be noted that for most cases, even the stupid approach using straight Python is probably "fast enough" for a case with just a handful of users. If the user count is 4 or below, we can skip the entire RMS calculation and mix in all the user audio.

These results suggest that a whole host of other audio manipulation tasks (such as silence detection) should also be quite possible in Python.

Python and Audio Recording
--------------------------

There's no portable approach to capturing audio in the standard library. This caused me some initial concern, but there's a solution for this: PortAudio [portaudio]. This is a platform independent library for accessing audio hardware, with an existing Python wrapper (fastaudio).

(One minor aside is that the current release of PortAudio (v18) doesn't work with ALSA, the new standard for Linux audio. ALSA is included by default in the current Linux 2.6 kernel. On the positive side, the standard library's ossaudiodev module works fine on this platform.)
Currently Shtoom uses either the ossaudiodev module or the fastaudio wrapper of PortAudio. A native sound layer for Mac OS X is under development, and I hope that a native Windows layer (using DirectSound) will be available in the future.

Audio Encoding
--------------

As mentioned, RTP supports multiple audio encodings. SIP negotiates a common encoding for the participants of the call. Shtoom's underlying audio layer reads and writes audio as signed 16 bit PCM at 8kHz. This is then converted to whichever format is required for the call.

The easiest audio codec to support is G.711 ULAW. This is 8 bit ULAW at 8kHz (and is also the format used in an ISDN call). The standard Python 'audioop' module can convert to and from this format. The downside to this codec is that it consumes 64kbit/sec for each direction.

The next codec supported is GSM 06.10. This is a rather complex beast to implement - fortunately, though, other people have already done the work [gsm]. Itamar Shtull-Trauring wrote a simple wrapper around this library - it takes a sequence of 13-bit samples (the bottom three bits of the audio are discarded) and produces 33 bytes of output for each 20ms of sound. GSM 06.10 consumes about 13kbit/sec.

There's a variety of other codecs in the G.72x family - unfortunately they are all patented up the wazoo, and require you to purchase licenses. Shtoom will support these with C-code wrappers around the reference implementations, but obtaining a license before using them is obviously up to the end-user. (Lawyers are welcome to let me know whether simply providing an interface to a patented codec is going to get me sued - I hope not!)

A relative newcomer to the audio coding world is Speex - a non-patented codec developed by the Ogg Vorbis folks. I intend to support this in an upcoming release of Shtoom. Someone's already written a pySpeex wrapper around this audio codec.

DTMF (aka "the funny beeping")
------------------------------

RTP can carry more than just audio.
One common thing to be carried is DTMF (the signals that are sent when you press a button on a touch-tone phone). Rather than sending the tone down the audio channel, these are sent out-of-band, as a different sort of RTP packet (one with a different Payload Type marker). This means that you don't need to put expensive signal processing in place to detect the magic audio tones - a very good thing.

While the Internet connectivity of phones allows a lot of other ways to interact with users, people who are still using a phone handset only have the DTMF buttons as a user interface. The goals of shtoom are that the user interface can generate DTMF (indeed, my original goal of a scripted client to test IVR systems requires it) and the application side should allow interaction with DTMF in a simple way. At the moment most of the shtoom innards for doing this are done - it's now largely a matter of hooking the pieces together.

A SIP Call
----------

So, how does this SIP thing all hook together? A simple example is probably in order, showing how shtoom handles the calls.

We'll show a small example here with one user calling another. (Under international treaties describing technical papers these users must be called "Alice" and "Bob".) We'll assume both Alice and Bob are using shtoom - Alice is at home looking after a sick cat, while Bob is sitting at his desk at the office goofing off and looking for a reason to avoid work.

When Bob fired up his copy of shtoom after getting back from lunch, the first thing shtoom did was register Bob with a SIP location service. This consists of a message saying "Any calls for Bob@divmod.com, send them to this IP address". The SIP proxy might request authentication (using HTTP's Digest Authentication), then registers the user.

Alice is sitting at home bored and decides that boredom shared is boredom halved, and places a call to her friend Bob@divmod.com.
This isn't Bob's work address - divmod.com in this case is a SIP location server that Bob has an account on. Alice enters the address, and hits Call. The Shtoom UI calls into the application layer - this in turn calls into the SIP layer.

The SIP layer creates a new Call object. This first determines which encodings are available, creates an INVITE message, and then sends this to the divmod.com SIP server. The INVITE contains a few pieces of information:

- The destination for the INVITE
- Who's doing the calling
- The network ports to use for the low level audio (SIP uses dynamically allocated ports)
- A description of the media that the caller can use (audio encodings, video encodings and the like).

The divmod.com SIP server looks up its internal database to figure out how to currently contact Bob, and forwards the SIP INVITE on to Bob's computer.

The SIP layer in Bob's shtoom creates a new call (based on the Call-ID header in the INVITE) and hands control off to this new call. The first thing it does for a new call is pop up a message saying "Alice is calling, answer?" Bob clicks 'Yes', and this is passed through to the newly created call. It sends back a 200 OK response to the proxy. (SIP shares much with HTTP, including the formatting of messages and many of the response codes. OK in SIP is 200, just like HTTP.) The response includes Bob's real network address, the list of media that Bob can handle from Alice's original INVITE, and the network ports that Bob's phone program will be using. The proxy forwards the response back to Alice.

Alice's phone receives the response, looks up the relevant call, and passes the response to it. It next replies with an ACK request directly to Bob's computer, using the network address in his response.

After the ACK is sent, the connection starts up - audio flows between the network ports negotiated in the INVITE/OK messages.

Eventually one party or the other will terminate the connection - at that point, they click 'Hang up'.
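
To make the INVITE described above a little more concrete, here is a rough sketch of how such a message might be formatted. It is not Shtoom's code: the function name, the example addresses, and the fixed tag and branch values are all made up for illustration, and a real SIP stack generates unique tag and branch values and handles many more headers. The SDP body offers the two codecs discussed earlier, using their standard static RTP payload types (0 for G.711 ULAW, 3 for GSM).

```python
def build_invite(caller, callee, local_ip, rtp_port, call_id):
    """Format a minimal SIP INVITE with an SDP body (illustrative sketch only)."""
    # SDP body: where to send audio, and which encodings the caller can handle.
    sdp_lines = [
        "v=0",
        "o=- 0 0 IN IP4 %s" % local_ip,
        "s=call",
        "c=IN IP4 %s" % local_ip,
        "t=0 0",
        "m=audio %d RTP/AVP 0 3" % rtp_port,  # RTP port plus offered payload types
        "a=rtpmap:0 PCMU/8000",               # G.711 ULAW
        "a=rtpmap:3 GSM/8000",                # GSM 06.10
    ]
    body = "\r\n".join(sdp_lines) + "\r\n"

    user = caller.split("@")[0]
    headers = [
        "INVITE sip:%s SIP/2.0" % callee,     # the destination for the INVITE
        # Hypothetical fixed branch/tag values; real stacks make these unique.
        "Via: SIP/2.0/UDP %s:5060;branch=z9hG4bK-example" % local_ip,
        "Max-Forwards: 70",
        "From: <sip:%s>;tag=example" % caller,  # who's doing the calling
        "To: <sip:%s>" % callee,
        "Call-ID: %s" % call_id,
        "CSeq: 1 INVITE",
        "Contact: <sip:%s@%s>" % (user, local_ip),
        "Content-Type: application/sdp",
        "Content-Length: %d" % len(body.encode("ascii")),
    ]
    # As in HTTP, a blank line separates the headers from the body.
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    print(build_invite("alice@example.com", "Bob@divmod.com",
                       "192.0.2.10", 40000, "call-1@192.0.2.10"))
```

The 200 OK that comes back has the same shape: a status line in place of the request line, and an SDP body listing the subset of media the answering side accepts along with its own RTP port.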
The application passes the hang-up request to the SIP layer, along with the current call id. The SIP layer asks the relevant call object to format a BYE message, and sends it to the other phone. On receiving a BYE, the receiving phone hangs up the line and sends back an OK response.

Conclusions
-----------

So, the conclusions that can be drawn from this effort?

Well, the first, and most obvious, is that Python's a hell of a lot more capable than you might think. I was actually surprised at how easily Python was able to handle what I was throwing at it - even without "cheating" and using code from the standard library, it was actually fast enough.

Pyrex makes it pretty trivial to remove the bits of the code that are potential bottlenecks. Pyrex rocks.

Once you remove the "performance" reason for avoiding Python, the list of reasons for not using Python is pretty thin. "It makes the C++ coders cry", while true and fair, isn't a suitable justification.

The final point to be drawn from the performance section is that when you're looking at a problem that seems "hard" - examine the standard library. Python's "batteries included" philosophy, with its extensive library, means that someone's quite possibly already done the work for you. And there's no software methodology in the world that will produce a result faster than "someone already did it for me".

Overall, the number one thing I've figured out from this whole exercise is that people who refer to Python dismissively as "just a scripting language" probably don't know what they're talking about.

Future Work
-----------

In no particular order, here's some future work I'll be looking at.

Video
~~~~~

My initial thoughts were that video would be completely impossible to handle with Python. Having been surprised once, though, I'm going to check this. I suspect that the problems of platform-independent audio interfaces will be even worse for video capture, and that this will be the major pain with implementing video.

Additional Platforms
~~~~~~~~~~~~~~~~~~~~

There's a variety of new platform work I'd like to look into. One intriguing possibility is the Python interpreter bundled in the Nokia 6600 series. While the CPU of the phone is unlikely to be able to handle the requirements of Shtoom, if the Python platform exposes the audio interfaces of the phone, this may be enough to allow the phone to work.

Additional Interoperation
~~~~~~~~~~~~~~~~~~~~~~~~~

It would be nice to interoperate with programs such as Messenger and iChat - I've not begun this task. It appears that they use their own protocols for some of the call setup work - hopefully this won't require extensive protocol reverse engineering.

I can only test Shtoom against implementations that I have access to - as time goes on, I hope to be able to expand the list of tested systems.

Instant Messaging (SIMPLE)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The IETF is moving towards standardising an instant messenger protocol based on SIP. This could be an interesting direction to explore.

Additional Phone Features
~~~~~~~~~~~~~~~~~~~~~~~~~

Adding the ability to handle multiple calls is an obvious step for the phone application. Most of the work is in the user interface side for this.

One nice extension would be an ad-hoc conferencing facility. There's no reason that any given phone couldn't patch two or more incoming calls into the same audio session. Or allow the phone to make an additional outbound call, and patch it into the existing call.

Doug
~~~~

There's still a large amount of work to get Doug to a state where I'm happy with it. This will largely be an iterative process - as new requirements come up, the platform will change.

A Long Footnote: Firewalls and SIP
----------------------------------

Firewalls and Network Address Translation (NAT) boxes are the bane of VoIP.
With dynamically allocated UDP ports, and an announcement protocol that needs to know what ports the traffic is going to be on, it's almost certain that every firewall known to man will screw it up in some way. There's a few solutions to this.

Proxies
~~~~~~~

One solution is to have a SIP proxy server. This is a pain in the backside for all concerned - it's also not very practical. Most people behind a firewall don't have the ability to run a proxy server. One day I might look at implementing this, but it's pretty low on my list of things to do.

Fixed Port Numbers
~~~~~~~~~~~~~~~~~~

Another approach is to have fixed local port numbers, and have the firewall forward those ports onto the inbound system. This is OK if it's only one or two people using the system, but quickly becomes a nightmare to manage for any larger number of users.

SIP-aware firewalls
~~~~~~~~~~~~~~~~~~~

It would be nice if firewalls gained knowledge of SIP, and would do the necessary magic to allow packets to flow through. This of course then means you're relying on the firewall vendor to get it right. This could be considered unlikely. Certain variants of Cisco's IOS implement this, and it's reported that they actually work OK.

A Different Protocol
~~~~~~~~~~~~~~~~~~~~

One solution might be to use a different protocol that's a little more forgiving of firewalls. The problem here is getting the protocol standardised and deployed widely. This might be a long term approach - I've not seen much happening in this area.

STUN
~~~~

STUN [stun] is a UDP protocol hack to help you determine what your firewall is doing. Briefly, a STUN request involves sending a packet from the port you're going to be communicating on, to a STUN server outside your firewall. The STUN server examines the packet, and replies with the IP address and port number that it saw the packet come from.

STUN is only half of the solution.
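
The request/response exchange just described can be sketched in a few lines of Python. This is a minimal illustration and not Shtoom's actual STUN code: it assumes the classic STUN message layout (a 20-byte header of message type, length and a 16-byte transaction id, followed by attributes, with the MAPPED-ADDRESS attribute carrying the address the server saw), and the function names are invented for this example.

```python
import os
import socket
import struct

BINDING_REQUEST = 0x0001    # "tell me what address you see me coming from"
BINDING_RESPONSE = 0x0101
MAPPED_ADDRESS = 0x0001     # attribute holding the observed IP address and port


def build_binding_request(transaction_id=None):
    """Return the bytes of a STUN Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(16)
    # Header: message type, attribute length (0 here), 16-byte transaction id.
    return struct.pack("!HH16s", BINDING_REQUEST, 0, transaction_id)


def parse_mapped_address(packet):
    """Return (ip, port) from a Binding Response, or None if not present."""
    if len(packet) < 20:
        return None
    msg_type, length, _tid = struct.unpack("!HH16s", packet[:20])
    if msg_type != BINDING_RESPONSE:
        return None
    offset, end = 20, 20 + length
    # Walk the attribute list: each is a 2-byte type, 2-byte length, then value.
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value = packet[offset + 4:offset + 4 + attr_len]
        if attr_type == MAPPED_ADDRESS and len(value) == 8:
            _unused, family, port = struct.unpack("!BBH", value[:4])
            if family == 0x01:  # IPv4
                return socket.inet_ntoa(value[4:8]), port
        offset += 4 + attr_len
    return None
```

In use, the request would be sent with a UDP socket's sendto() from the same local port that will later carry the SIP or RTP traffic, and the server's reply handed to parse_mapped_address(). The address and port that come back are what the outside world sees, which is what needs to go into the SIP messages.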
For STUN to be of any use, your firewall also needs to do stateful UDP filtering - that is, if packets go out on a port, allow the replies to come back in. A side-benefit of STUN is that the outbound request should allow the traffic to flow back in.

Note that STUN doesn't help you if your firewall doesn't handle stateful UDP filtering. In the words of one correspondent, "STUN just lets you discover how screwed you are". That is, it allows you to figure out whether your firewall is usable, and how you can work with it.

Shtoom implements STUN for both SIP and RTP traffic. The STUN implementation will eventually be folded back into the core Twisted framework - it's useful for any UDP protocol that needs firewall traversal.

UPnP
~~~~

Microsoft's entry in this cavalcade of horrors is Universal Plug and Play (UPnP). This is a protocol that allows networked devices to discover and control aspects of their local network. In the case of a firewall, it allows an end-user system to request a dynamic port-forwarding from the firewall to the box. Many network administrators will probably (rightly) recoil at letting applications on a Windows box dictate firewall policy. UPnP, while implemented initially on Windows, is now an open protocol.

As an aside, UPnP's implementation (which features SOAP, HTTP over multicast/broadcast UDP, and extremely odd XML) is a must-read for fans of unnatural and baroque network protocols.

There's a partial implementation of UPnP in Shtoom - I hope to finish it in the not too distant future. It's not clear how useful this will be - the first worm/virus that uses UPnP to punch holes through firewalls will probably result in UPnP being disabled everywhere. Most of the routers that implement UPnP are also capable enough that it's unnecessary.

Acknowledgments
---------------

Thanks to the entire Divmod and Twisted teams for assistance in the development of Shtoom.
Special thanks to Amir Bakhtiar for providing hardware for testing, patient Windows user feedback, and for getting me along to PyCon in the first place. Thanks also to ekit for getting me to the US, and for providing the impetus to do this stuff in the first place.

Thanks also to Andy Hird, Dougal Scott, Cam Blackwood, Benno Rice and Toby Sargeant for feedback on earlier versions of this paper. Any mistakes remaining are, of course, entirely my fault. And my cats. They're always messing things up.

[ gsm ] http://
[ numarray ] http://
[ portaudio ] http://
[ pydirector ] http://
[ pyrex ] http://
[ rtp ] http://
[ shtoom ] http://divmod.org/Home/Projects/Shtoom
[ sipphone ] http://www.sipphone.com/

$Revision: 8580 $