Yup. Great thread direction!
> Bill Janssen:
> > "marshal.dump" already does this (storeString) for some things. I'd
> > like to see it extended so that there's a simple way to have everything
> > dump-able and load-able.
>
> Indeed. I am thinking about something like this. Here are some ideas
> (thinking aloud):
>
> With every object dumped I also dump its "id" (this is just its
> address, but any unique identifier will do).
I think the concept of the "externalized name" of an object that
persists is critical, but I *think* that "id" is a weak choice. For
my persistent objects, I need the object to persist across process
boundaries. As a result, an "id" (which currently corresponds to an
address in a single process space) suffers from a great risk of reuse.
In fact, even within a single process, if an object is saved off to a
persistent store, and its memory reallocated, then "id" collisions are
*very* possible.  It is also pretty critical (to garbage collection)
to realize *when* the externalized name has been captured.  Note that
IF the name has never been extracted, then at process termination,
there is *no* reason to save the object.
I think that there are two kinds of externalized names that must be
usable by the "system."  The first kind of name is generated by
Python, and must be unique *not* within the process at a specific
time, but rather must be unique across all processes, and all time.
Fundamentally, this requires consultation with the persistent store to
establish a unique name. In my current implementation, I combine a
unix process id (or equivalent) with the class name, and a sequence
number, and check for uniqueness (and try the next sequence number if
the current one is in use).
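In sketch form, the scheme looks something like this (where
"store.has_name()" is just a stand-in for whatever uniqueness lookup
the real persistent store provides):

import os

def make_system_name(store, class_name):
    # Combine a unix process id, the class name, and a sequence
    # number; consult the persistent store until the candidate name
    # is unused.  'store.has_name' is a hypothetical lookup.
    seq = 0
    while 1:
        candidate = "%d.%s.%d" % (os.getpid(), class_name, seq)
        if not store.has_name(candidate):
            return candidate
        seq = seq + 1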
The second sort of externalized name is supplied by the user (i.e.,
supplied by the calling code). It is critical that the calling code
be able to specify a "name" for certain root objects, so that they can
be "looked up" without having ever run the program before (and
acquired a name). I call this name an "alias." Just as it is
possible to reincarnate an object given its "system" name, it is
possible to reincarnate (via a slightly different load method) from an
"alias."
> I maintain a hash table
> (Python dict) of objects already dumped,
This use of an external hash table is *very* nice.  I didn't think
of it originally, and used to implant the persistent name into the
object's name space (i.e., I used to create an "__objname__" slot).
This "design error" reflected my C mentality, wherein it is cheap to
look at a member, but (relatively) expensive to use a hash table.
Since all slot name lookup proceeds through dictionary/hash tables,
it is funnily enough *just* as cheap to use the external table as to
locate the data in the object :-).  When I changed to this approach
my code got *much* cleaner, as I stopped injecting anything into the
user's (instance) name space.  Commenting on this is a major aside,
but it is a cute element of Python, and of its distinction from
languages such as C.
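For the record, the external-table approach amounts to something
like the following (a sketch; "new_name" and "write_body" stand in
for the real machinery):

def dump(obj, memo, out):
    # 'memo' maps id(obj) -> persistent name for everything already
    # dumped, so nothing need be injected into the object's own
    # name space (no "__objname__" slot required).
    key = id(obj)
    if memo.has_key(key):
        out.write("ref %s\n" % memo[key])
        return
    memo[key] = name = new_name()      # hypothetical name generator
    write_body(obj, name, memo, out)   # hypothetical serializer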
> and if a request to dump an
> object requires me to dump a sub-object that I've already dumped, I
> dump a reference to the id instead.
Ah... I called this saving the "transitive closure" of the
persistent objects, but was told by a Professor friend that this is
really called "orthogonal persistence."  There are IMHO two distinct
types of "subobjects" (re: objects that are pointed to by members or
slot names of an instance, either directly or indirectly (such as
via a list)).  Some subobjects are "persistent," and some are not.
Persistent objects have a functional attribute that forces multiple
reincarnations of a single persistent object to resolve to a single
physical Python object.  The idea is that if I have a persistent
object Pa, and it has two pointers to the persistent object Pb, then
saving and resurrecting Pa will only cause one instance of Pb to be
reincarnated (and the multiple refs should automagically point to
the single lone copy of Pb).  In contrast, "non-persistent" objects
such as numbers and strings do not have this guaranteed "singular
reincarnation" attribute.  For example, if Pa has two separate
references to the same physical string object "eight", then at
reincarnation time there *might* be two distinct string objects
created (since strings are not persistent).  I guess I'm willing to
hear arguments against my proposal and definition of two distinct
kinds of objects, but it seems like a waste of time.  Note that the
distinction between being "equal" objects (all strings containing
"eight" compare equal) and being the same object (re: the "is"
operator) is very subtle.
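The singular-reincarnation guarantee can be implemented with one
more table (a sketch; "make_shell" and "read_body" are hypothetical
names, and the shell idea is discussed near the end of this post):

_live = {}        # persistent name -> the single live instance

def reincarnate(store, name):
    # Multiple references to one persistent name must resolve to a
    # single physical Python object, so cache by name on first load
    # (entered in the table *before* Load, so loops terminate).
    if _live.has_key(name):
        return _live[name]
    obj = _live[name] = make_shell(store, name)
    obj.Load(store.read_body(name))
    return obj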
> (With sub-object I just mean an
> object contained in another object, e.g. 1 is a sub-object of
> [1,2,3].) This allows me to dump recursive objects (e.g. lists that
> have been inserted into themselves), saves space when dumping an
> object that contains many references to the same sub-object, and (most
> importantly) retain pointer semantics, so that e.g. graphs represented
> as objects pointing to each other can be dumped correctly.
This feature of dumping arbitrary object graphs seemed critical to
saving and reincarnating objects. My current implementation does
this, including arbitrary graphs of non-persistent objects (i.e., even
recursively defined lists, and user defined objects with recursive
content pointers, and tuples, and arbitrary combinations). Note that
when an object is "persistent," then the "loops in the graph" are
automatically broken at the entry to the object, as only the name of
the object is saved.  When a clump of data (i.e., lists, tuples, user
defined objects) is not persistent, and it does have internal self
references (i.e., looping references), then it is harder to work with
(but that is why programmers like me make money ;-) ), and my current
code handles all this automagically.
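For lists, for example, the loop handling reduces to the same memo
trick, with small local ids instead of persistent names (a sketch;
"dump_item" is the hypothetical general dispatcher):

def dump_list(lst, memo, out):
    key = id(lst)
    if memo.has_key(key):            # loop detected: emit a reference
        out.write("ref %d\n" % memo[key])
        return
    memo[key] = len(memo)            # assign a small local id
    out.write("list %d %d\n" % (memo[key], len(lst)))
    for item in lst:
        dump_item(item, memo, out)

# e.g. a list inserted into itself:
#   l = [1, 2]; l.append(l)
# dumps its third element as "ref 0" rather than recursing forever.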
> I don't know yet what is the best approach to dumping things that
> contain references to the OS environment, like sockets, windows, pipes
> or open files -- I'll probably punt on these, or create a dummy object
> instead. Open files may be recreatable by recording the filename,
> mode and seek position, but for the others recreation requires too
> much context.
My current implementation punted on these as well.  Below is my
actual function dispatch table for creating the persistent
representation of data; it doubles as the list of object types that
I can save.  Note that the left hand column contains Python types:
import sys    # needed only so that type(sys) names the module type

# Prepr_list, Prepr_dict, Prepr_tuple, Prepr_module, Prepr_class,
# Prepr_instance and the PersistentObject base class are defined
# elsewhere in my code.
Prepr_func_table = {
    type( None ):                repr,            # simple types: repr()
    type( "string" ):            repr,
    type( 1 ):                   repr,
    type( 1L ):                  repr,
    type( 3.14 ):                repr,
    type( [] ):                  Prepr_list,      # containers recurse
    type( {} ):                  Prepr_dict,
    type( () ):                  Prepr_tuple,
    type( sys ):                 Prepr_module,    # a module
    type( PersistentObject ):    Prepr_class,     # a user defined class
    type( PersistentObject() ):  Prepr_instance,  # a user defined instance
}
Note that the last few are a bit tricky, and consist of a module
name, a user defined class name, and a user defined instance.  If
folks are confused by why I use a "function dispatch table" defined
as above, see my older posting about how to create efficient switch
statements, or achieve cheap function overloading based on the types
of arguments in Python.
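The table is then consumed by a trivial driver, roughly:

def Prepr(data):
    # Cheap "switch statement": dispatch on the exact type of data.
    return Prepr_func_table[type(data)](data)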
> For class instances, a default dumping method will simply dump all the
> instance variables. Classes can override this by providing a
> __dump__() method (though I'm not sure about the exact interface
> here).
I called my method the Prepr() method, as an abbreviation for
Persistent REPResentation, and a bit of a take-off on repr().  It
would be nice to have a Python standard for this, such as "__dump__"
or similar. Note that in my implementation, only persistent objects,
which is to say user defined objects derived from my persistent base
class, have such a method. (Or more precisely, I only try to call such
methods during recursive saves if the user defined object is derived
from my PersistentObject base class).
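A plausible shape for the default behavior (save all the instance
variables, per the suggestion above) might be the following in the
base class; this is a sketch, not my actual code:

class PersistentObject:
    def Prepr(self):
        # Default persistent representation: every instance
        # variable, each converted via the dispatch table above.
        pairs = []
        for name in self.__dict__.keys():
            pairs.append((name, Prepr(self.__dict__[name])))
        return pairs

    def Load(self, pairs):
        # Fill a hollow shell back in from its saved pairs.
        for name, value in pairs:
            setattr(self, name, value)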
> There will also be a __load__() method.
I called my method "Load()" so we are even closer already on this
naming convention :-).
> A problem is how to
> represent the class to which the instance belongs to.
Indeed!  This was a topic of some prior communication with Guido,
although I'm not sure I explained at the time why it was of such
interest (and I may not have posted that discussion).
> I don't want to
> dump the class definition itself -- usually, the receiving end has
> already imported the module that contains it anyway, and I feel that
> it must be possible to use a different version of the class (as long
> as it's backwards compatible; the __load__() method could do
> conversion if necessary). (The same applies to functions and modules.)
I agree completely that dumping all the class definitions is bad. I
thought about this a *long* time, and eventually realized that
attempts to consistently save all the methods would be *very*
detrimental to my work. At first I thought "methods plus data equal
objects," and so I considered it desirable, but too painful to save
all the methods. As a result, my initial implementation did not save
any methods. Later I realized that IF I wanted to save methods, then
to be consistent, I would *really* have to save all the functions that
those methods called, and on and on recursively. In the end, saving
an object would involve saving pretty near the entire code space
(ugh). Worse yet, at reincarnation time, I'd either have *many*
copies of the identical code, or I'd go crazy trying to decide whether
to merge repeated code segments or allow them to remain separate.
IMHO Bottom line: saving methods is bad. Saving data is good.
> Representing the class in a form that will enable the receiving end to
> find it will be difficult though -- not all classes can be expressed
> in the form "modulename.classname"
This was a *major* problem, which ended (for us) with an acceptable
restriction on the nature of objects that can be persistent. Only
objects that have their class definition at the outermost lexical
scope of a module can be made persistent. As a result, the
constructor for *any* such class can be written in the form:
Module("modulename").class_name
I mentioned in a recent posting why the above format is cleaner than
"modulename.classname" in terms of pollution of the name space needed
at reincarnation time.
> , and even this information isn't
> readily available (unless I put it in the class definition as a hidden
> attribute, I'll have to search the class for a method and then look in
> the method's "globals" dictionary...). Possibly the __dump__() method
> (or a separate method) will be allowed to override this too, for full
> flexibility.
The gotcha that Guido mentioned is that it is hard to find the module
name for an arbitrary class.  I *used* to think that this was all I
needed, but when I began to consider classes defined within
functions, or within other classes, I realized that the module name
was not all that I needed :-(.  When I established our restriction
shown above, it was no big deal to also require that persistent
classes have *some* methods, and hence it is possible to find what
module provided the class definition (i.e., it is the module that
provided the method definition, which is easy to find), and the
above format will always represent a class.
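Concretely, the lookup can be done like so (a sketch; in newer
Pythons the function attribute is spelled __globals__ rather than
func_globals):

def class_module_name(klass):
    # Any method will do: its function object remembers the globals
    # of the module where it (and hence the class) was defined.
    for value in klass.__dict__.values():
        if hasattr(value, "func_globals"):     # it is a function
            return value.func_globals["__name__"]
    raise ValueError, "class has no methods to inspect"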
> Any contributions?
There are two other points that you didn't mention. Perchance the
first is not a significant issue because the intent of this discussion
was to implement a Python internal persistence/restoration structure.
For our development (without getting inside Python), we needed a way
to create "the hollow shell" of a user defined object at reincarnation
time. To support this need, we added the extra requirement to all
persistent objects that they must support a constructor call that
takes no arguments. For us, that was a pretty easy requirement to
accept. It might be nice if there was a hollow shell constructor
(i.e., it makes an instance, with all the base classes and classname
pointers in place, but it does not run *any* init methods).  If we
had this, then we could remove one restriction... and I *think* the
method is useful to expose for other related reasons anyway.
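For concreteness, here is roughly what our current convention
amounts to (a sketch; "class_of" is a hypothetical store lookup):

class Account(PersistentObject):
    def __init__(self):
        # Must be constructible with *no* arguments, so the loader
        # can make a hollow shell and then call Load() to fill it.
        self.owner = None
        self.balance = 0

def make_shell(store, name):
    # With the restriction in place, a hollow shell is just a
    # no-argument call on the class.
    return class_of(store, name)()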
The second problem, which I have *not* yet figured out (and I've been
told by a Professor type is a "known hard problem") involves attempts
to change code, and continue to use the existing persistent data
store. It is a crying shame that after all this work at
modularization and oo methodologies, that I have to tell the
programmers: "Time to enhance and fix your code, but *don't* change
the names of any members of your persistent objects, or the world will
end." There are lots of approaches to this problem, and I'm not sure
what the end all answer will be. I suspect it will be a compromise on
the problem, which allows us to reach a solution. ;-)
I will try to convince my management to release my design document and
persistent object code. Until that point, all I can do is shout words
of encouragement and hope we can keep up with actual Python practices.
There are a pile of very hard and interesting questions that our
current implementation does not fully address. These include: 1)
Locking of objects for multiprocess access; 2) Using a *real* db
rather than the file system; 3) The ability to "commit" and
"rollback" transactions.
Jim (*happy* to see this topic get some press)
Jim Roskind
voice: 408.982.4469
fax: 408.986.1889
jar@infoseek.com