>>> a = [1, 2, 3] >>> b = [a, a] >>> b[0] is b[1] 1 >>> import marshal >>> serialization = marshal.dumps(b) >>> b1 = marshal.loads(serialization) >>> b == b1 1 >>> b1[0] is b1[1] 0 >>> b[0].append(4) >>> b1[0].append(4) >>> b == b1 0 >>> print b [[1, 2, 3, 4], [1, 2, 3, 4]] >>> print b1 [[1, 2, 3, 4], [1, 2, 3]] >>>-- in other words, before marshalling b its two elements were provably the same object, but after unmarshalling we have a list of two different objects (whose value is the same). Not only is this less efficient (the value of a is marshalled twice) but also are the semantics are different: b[0].append(4) has a different effect on the value of b than b1[0].append(4) has on the value of b1.
This problem is easily solved by using a bottom-up serialization algorithm which assigns unique (within a particular serialization context) identifiers (uids) to each object and allows references to objects by their uid.
A harder problem is that of marshalling user-defined class instances and certain built-in objects like open files or sockets.
There are two problems with marshalling class instances: (1) how to find the class at unmarshal time, and (2) how to create an instance of it, given the possibility that it has a constructor method (__init__) requiring arguments.
Re (1): marshalling the class itself would be a possibility. This would even seem quite logical if we require marshalling (serializing) to follow all pointers out of an object -- the instance contains a pointer to the class, after all. However, there are two reasons why I don't like this: first, classes are often quite big (marshalling the class would imply marshalling the code for all its methods and base classes); second, I want to be able to load the instances back into a process that has a modified version of the class (e.g. one with a bug fixed).
A simple solution that will work in most (but not all!) cases is to indicate the class to which a marshalled instance belongs by a (modulename, classname) pair. The unmarshalling code can then import the module (if it hasn't been imported yet) and extract the class from it in order to create a new instance.
Since classes may be created dynamically, knowing a class' name and even the module in which it was created are not necessarily sufficient to be able to uniquely identify it. However this is sufficiently rare that it is probably acceptable to disallow marshalling instances of such dynamically created classes.
A minor sub-problem is how to find the module name given only the class instance (the class name is found as x.__class__.__name__). I suggest that the class creation code is modified to automatically add the module name to the class, so it can be accessed as x.__class__.__module__. In a prototype implemented in Python without modifications to the interpreter, we can scan the class dictionary for functions (i.e. methods) -- a function object contains a pointer to its globals which by definition is the dictionary of the containing module (and the key '__name__' in that dictionary gives the module name).
Re (2): A pragmatic solution would be to only allow marshalling class instances whose constructor accepts an empty argument list. This means that the constructor is called by the unmarshalling process to create an empty instance, after which the instance variables are filled in one by one through normal attribute assignments. (In a prototype implemented in unmodified Python, we have no other choice than to do it this way.)
A problem with this approach is that it requires a lot of extra work: the constructor probably assigns default values to all instance variables, which are then overwritten by the unmarshalled values. It also won't work if the class traps attribute assignments using a __setattr__ method (or a class-wide access statement).
This could be solved by adding a low-level class instantiation function which does not call the constructor. The unmarshalling code could then extract __dict__ from the "virginal" object and place instance variables in it. In order to allow the class some control over unmarshalled objects, it could define a __finalize__ method which would be called (if it existed) to finish the initialization. (This may be necessary e.g. when the class wants to keep a list or count of its instances, or if the instance needs to be linked up to some outside entity such as a UNIX file descriptor or an electric switch.)
Some more things: (3) how to marshal built-in objects like open files, sockets, windows; (4) can't we provide a way for user-defined classes to override the serialization -- this seems much more powerful than only having a hook for __finalize__; (5) how about marshalling objects that are internal to the interpreter such as stack frames, tracebacks and modules.
Re (3): Such objects usually have some of their relevant "state" stored in tables that reside in the operating system kernel. In general it is hopeless to try to restore this -- you may be able to save and restore the filename, seek position and read/write mode of a disk file, assuming you will be unmarshalling in the same environment, but "special files", sockets and windows are usually only meaningful after a handshake with some server process. I propose that these object types are marshalled as some kind of placeholder (maybe a "closed" object of the appropriate variety) so a __finalize__ method can be written that attempts to renegotiate the connection. (Raising an exception at either marshal or unmarshal time would be rather antisocial -- in many cases marshalling will be used to save a copy of important data before dying, and it should be appropriately robust.)
Re (4): I think not -- this would break robustness. (Note however that we must allow built-in object types to override the marshalling code -- this is an interesting problem in itself since built-in types may be created by dynamically loaded modules.)
Re (5): "live" stack frames will no longer be alive after unmarshalling, but can still be inspected. Both stack frames and tracebacks (which contain stack frames) can point to the globals in a module. Part of me says that these dictionaries should be marshalled as if they were ordinary objects (and if the 2nd/3rd arguments to exec or eval() are used, they needn't belong to a module at all). Part of me says that this will be too expensive and that it's better to link them to the corresponding modules in the unmarshalling process -- if only because this is also done for class instances.
Anders Lindstrom has written a nice prototype of a persistent heap (and this without looking at these notes :-).
import SharedMemoryHeap import RemoteServerHeap heap1 = SharedMemoryHeap('') heap2 = RemoteServerHeap(' : ') heap3 = RemoteServerHeap(' : ') foo1 = heap1.create('foo') foo2 = heap2.create('foo') bar1 = heap1.load('bar', READONLY) bar2 = heap2.load('bar', READWRITE) ... foo1.commit() # commit this particular object heap2.commit() # commit all modified objects on this heap
--Guido van Rossum, CWI, Amsterdam