Flattening Python Objects
(Guido van Rossum, 14-17 November 1994)
(Revised: 1 December 1994)
Origin of the name 'flattening'
Because I want to leave the original 'marshal' module alone, and
Jim complained that 'serialization' also means something totally
different that's actually relevant in the context of concurrent access
to persistent objects, I'll use the term 'flattening' from now on.
Monty Python fans can associate this with 16 ton weights, while snake
lovers can think of a Python's favorite way to kill its dinner. (The
Modula-3 system uses the term 'pickled' data for this concept. They
have probably solved all problems already, and in a type-safe manner
:-)
File formats
While I intend to rewrite this code in C for efficiency, I will
create a compatible prototype in Python first. With the efficient
version in mind, I would like to use a binary file format. However,
for debugging, it is much easier if the file format is ASCII. I'm not
sure what I'll do yet. I expect that in either case it will be
possible to gain a large factor in space by applying a general data
compression algorithm such as gzip to the binary files -- just not as
much as for flattening algorithms that generate Python source code
(since source code is more voluminous than a specially-designed
encoding).
Security
The unflattener should not use Python's eval() function or exec
statement. If it does, it is vulnerable to attacks that present it
with executable Python code that does something undesired. A second
kind of attack: if (like the marshal module) the flatten module could
transfer code objects, and if (unlike the marshal module) it could
also transfer function or method objects, an attacker might be able to
smuggle an object into an unflattening program that has a method which
is called implicitly by the interpreter, such as __repr__ or
__getattr__. This may be prevented by not flattening code, function
or method objects at all.
Abstract flatten/unflatten operations
The lowest level interface to the flatten module will support the
following operations:
- initialize a flattener for writing to a stream
- flatten an object to the stream, recursively
- initialize an unflattener for reading from a stream
- unflatten an object from the stream, recursively
The flattening and unflattening algorithms are designed together
since they must be each other's logical mirror image. The flattening
algorithm uses a dictionary mapping object id's to objects. It uses
recursion too. The unflattening algorithm uses a similar dictionary
and a stack. This doesn't contradict the goal that they be mirror
images: using a stack is equivalent to recursion.
Instruction stream
The flattened data stream contains single-byte 'instructions' and
multi-byte 'data items'. In the debugging (prototype) version of the
flattener, instructions and data items are printable ASCII strings and
are delimited by newline characters. A more efficient version may use
binary encodings. We will have to see whether this saves enough time
and space to be worth the effort (and worth the resulting lack of
debuggability of flattened objects).
The virtual machine interpreting the instruction stream uses two
data structures: a stack and a dictionary. The stack is used as
intermediate storage; many instructions push data on the stack or pop
it off. When the instruction stream ends, the top of the stack
contains the object of interest. (There shouldn't be anything else
left on the stack.) The dictionary is used as a way to refer to
objects by their 'local name'.
Local names are integers. They are meaningful only within one
flattened data stream. Not all objects need to have a local name:
numbers, for instance, may remain anonymous. Objects that occur only
once in the flattened object also needn't have a local name. It is up
to the flattener to decide whether to assign local names to objects --
it may optimize for space by assigning local names sparingly. It is
also up to the flattener to choose a name for an object. The
prototype uses the object's id() but for the meaning of the flattened
stream this is not important, as long as local names are unique. Note
that without local names, data sharing cannot be represented -- and a
special case of data sharing is recursion.
We also define a way to refer to objects with global names. These
are not defined by the flatten module but by its caller. Global names
are strings. The interpretation of global names is up to the caller.
I propose that the following instructions be defined. I will give
them symbolic names, which makes it easier to reason about them. Note
that there are no flow control instructions -- the instruction stream
is interpreted strictly sequentially. This is done so that the
instruction stream can be read off a network connection, for instance.
- PUT local_name
- Store the object on the stack top as 'local_name'; leave it on
the stack.
- GET local_name
- Push the object 'local_name' onto the stack.
- PERSID persistent_id
- Push the object 'persistent_id' onto the stack.
- POP
- Discard the object on the stack top.
- DUP
- Push the object on the stack top onto the stack.
- NONE
- Push the object 'None' onto the stack.
- INT value
- Push the integer 'value' onto the stack.
- LONG value
- Push the long integer 'value' onto the stack.
- FLOAT value
- Push the floating point number 'value' onto the stack.
- STRING value
- Push the string 'value' onto the stack.
- MARK
- Push a unique 'marker' onto the stack. Markers are used to
create tuples, lists etc. without having to specify a count.
- TUPLE
- Create a tuple from the stack entries until the topmost marker
(the deepest stack entry becomes the first tuple item). Everything up
to and including the marker is removed from the stack and the new
tuple is pushed onto it.
- LIST
- Create a list from the stack entries until the topmost marker
(the deepest stack entry becomes the first list item). Everything up
to and including the marker is removed from the stack and the new
list is pushed onto it.
- DICT
- Create a dictionary from pairs of stack entries until the topmost
marker (the deepest stack entry becomes the first dictionary key, the
entry immediately above it the corresponding value, and so on).
Everything up to and including the marker is removed from the stack
and the new dictionary is pushed onto it.
- INST module_name class_name
- Create an argument list by taking the stack entries until the
topmost marker. Then instantiate class 'class_name' defined in module
'module_name', passing these arguments to the constructore.
Everything up to and including the marker is removed from the stack
and the new instance is pushed onto it.
- Classes can define what arguments should be used by defining a
discipline __getinitargs__ which should return a tuple. The default
argument list is an empty tuple.
- APPEND
- Append the object on the stack top to the list object immediately
underneath it. The stack top is popped, leaving the list object on
top of the stack.
- SETITEM
- The object on the stack top is a value, the object immediately
underneath it is a key, and the object underneath that is a
dictionary. Store the value in the dictionary under the key, popping
the value and the key from the stack, leaving the dictionary object on
top of the stack.
- BUILD
- The object on the stack top is an argument, the object
immediately underneath it is a class instance. If the instance's
class defines a __setstate__ discipline, it is called with the
argument as parameter. if not, a default 'setstate' operation is
invoked. After the 'setstate' operation, the argument is popped off
the stack, leaving the class instance on top of the stack.
- The default 'setstate' operation requires a dictionary as
argument. It takes the keys of the dictionary as slot names and for
each key sets the instance slot to the correponding value in the
dictionary. I'm not sure if this should use setattr(inst, key, value)
or inst.__dict__[key] = value (there is a difference if the class has
a __setattr__ discipline!).
- Classes can define an __getstate__ discipline which should return
an argument for the __setstate__ discipline. The default 'setstate'
operation returns the instance's __dict__ dictionary.
Questions
- The discipline names __setstate__, __getstate__ and
__getinitargs__ aren't very pretty. Any better ideas?
- What to do about 64-bit integers? If the flatten algorithm uses
id() for local names, it will generate 64-bit local names when it runs
on a 64-bit machine, since id() returns its argument's address cast to
a (C long) integer. Solution: on 64-bit machines, the flatten
algorithm should not use id() directly for local names -- instead, it
should assign local names sequentially starting at 0, 1, 2, ... and
maintain a mapping from id() to local name. This needn't be slow so
we can do this on all architectures. (Truncating to the lowest 31
bits can't be done because two objects might map to the same local
name.) At least it saves us from the misundersanding that the local
name would be unique per process!
- How should new built-in types allow themselves to be flattened
and unflattened? Maybe use the INST and BUILD instructions, with a
'.' for the module name and the type name for the class name. These
types can then be registered (presuming they have globally unique
names). Better: use the extension module name that defines the type,
and let the extension module export a name that creates 'bare'
instances.
- What about recursion amongst globally named objects? We will
need the ability to patch them in later! Maybe use proxies at all
times??? Or simply rule out recursion at this level? (Does that make
sense at all???) (What does Jim Roskind's persistent module do?)
- The flattener must be able to find the module containing a class.
A way to do this would be exhaustive search of all modules. This is
not as bad as it sounds, as it can be cached and shared between all
flatteners:
import sys
from types import *
classes = {}
for modname in sys.modules.keys():
mod = sys.modules[modname]
dict = vars(mod)
for varname in dict.keys():
var = dict[varname]
if type(var) is ClassType:
classes[var] = (modname, varname)
# Now 'classes' is a mapping: class -> (modname, classname)
Prototype
The following is a tentative implementation, the files
'flatten.py' and 'pos.py'. It doesn't do anything about persistent
id's yet. It finds a class's module name by scanning all modules for
classes (and caching the results).
- flatten.py -- The main algorithm.
- pos.py -- Persistent objects on top of it.
This emulates the interface for persistent objects implemented by Anders
Lindstrom .
- copy.py -- A deep copying algorithm that
uses the __getinitargs__, __getstate__ and __setstate__ disciplines.