Re: Directory usage tree's in perl & python.

Tim Peters (tim@ksr.com)
Wed, 22 Jan 92 01:37:13 EST

> [steve m]
> There has been a series of good-perl/bad-perl exchanges ...
> ...
> If I could read Perl better without poking into the manual, It might
> have been more obvious - I didn't have the perl man with me, ...

Wouldn't have helped <grin> -- seriously, the key to the Perl version is
the "*kid" construction, which is playing games with Perl's internal
symbol tables as a way to fake a "real" data structure. It's very
tricky stuff, & the Perl book rushes over it, giving only a few trivial
examples after a hand-waving explanation. Perl wizards all know how to
use it, though -- they don't have much choice <grin/frown>.

> ... they had asked for nodes sorted by size, ( at least the terminals,
> I'm not sure if they meant the whole tree. )

Pretty sure Felix Lee meant the whole tree, at each level.

> ... "Dangling "|"'s" : This is what made me realize it needs a
> recursive solution.

The Perl version also has "dangling |'s", & I *like* them! Agree
they're hard to get rid of if you really want to; but think Lee had the
"sorted at each level" property in mind when he claimed a natural
advantage for recursive data structures (you need to sort at an
arbitrary number of different levels, without mixing things up across
levels -- hard to do with a flat data structure!).

FYI, I'll attach a different Python stab below. Notes:

- This tries to look more like the Perl version in the appearance of
the output.

- The Perl version was written to accept an arbitary number of
directory arguments (@lines = `du @ARGV`), but it actually (in
effect) ignores all but one of them; the Python version pays
attention to all of them.

- When invoked via, e.g., "dutree /dir1/dir2/dir3", the Perl version
drops the "/dir1/dir2/" portion on the floor by the time it prints
stuff; the Python version retains it.

- The Perl version is at least 3x faster than the Python version. This
is more my doing than Python's: I didn't try to take advantage of
du's output being a postorder traversal, or play any clever tricks
based on looking for longest matching string prefixes etc. The sole
goal I had in writing a Python version was to minimize my own
programming time, so it's straightforward and the pieces are (in some
well-hidden <grin> sense) more general than they need to be. Despite
that, the Python version turned out to be the same number of lines as
the Perl version.

- This is certainly easier to program in Python than in Perl! The only
place that felt a little strained was in the sorting, where the
"size" field had to be the first component so that Python's
list.sort method would use the size as the primary sort key. Perl
does have an advantage here in that its "sort" verb allows specifying
an arbitrary Perl routine as the comparison predicate; but one could
code up a similar verb in Python.

When multiple sub-directories have the same size, my Python version
lists them in reverse alphabetical order; the Perl version
effectively lists them in a random order. Possible to fix either
version to be nicer about this, but I think it's a bit more strained
to fix in Python via fooling the built-in "sort".

- I'm afraid my Python version may be as hard to understand as the Perl
version. But hope that's just a side-effect of using complex data
structures without explaining them <grin>.

Briefly, the 'suck_up_du' routine constructs a dictionary mapping
path components into [size, dictionary] pairs, where the contained
dictionary is again of the same type (mapping the next path
components into [size, dictionary] pairs, and so on & so on). This
makes it very easy to capture the parent-child relationships without
needing string analysis any fancier than splitting on '/'.

But Python doesn't have a way to sort dictionaries directly, so the
"dict2list" function converts the recursive dictionary structure into
a list of [size, path_component_name, list_of_children] triples,
where the list_of_children is again of the same (list-of-triples)
type. The built-in list.sort() method automatically reorders the
triples by the "size" field, and a list.reverse() then gets them in
descending order. There's also some trickiness to deal with "empty"
path components arising from my being too lazy earlier to factor out
longest matching directory prefixes.

The output routine 'dump' is then a trivial recursive walk over the
recursive list-of-triples structure.

Example output:

8161 /lang/work/build/kap/kap12/Python
| 1605 src
| 839 readline
| | 256 doc
| | 14 examples
| 532 demo
| | 206 scripts
| | 176 sgi
| | | 69 gl
| | | | 27 glstdwin
| | | 64 gl_panel
| | | | 30 flying
| | | | 17 twoview
| | | | 9 nurbs
| | | | 5 apanel
| | | 20 al
| | | 17 audio_stdwin
| | | 4 audio
| | 131 stdwin
| | | 67 ibrowse
| | 17 sockets
| 284 lib
| 189 doc
| 76 misc
| 31 local

I strongly encourage anyone comparing Python with Perl to take a whack
at this exercise -- very educational! Hats off to Mr Lee for a fine
example ...

hoping-to-see-a-better-python-solution-too!-ly y'rs - tim

Tim Peters Kendall Square Research Corp
tim@ksr.com, ksr!tim@uunet.uu.net

#!/usr/local/bin/python

import string, posix, sys

def suck_up_du( dirlist ):
answer = {}
for line in posix.popen( 'du ' + dirlist, 'r' ).readlines():
[size, path] = string.split( line[:-1] ) # chop newline
size = eval(size) # convert to int
insert( answer, path, size )
return answer

def insert( dict, path, size ):
components = string.splitfields( path, '/' )
for chunk in components[:-1]:
if not dict.has_key( chunk ): dict[chunk] = [-1, {}]
dict = dict[chunk][1]
chunk = components[-1]
if not dict.has_key( chunk ): dict[chunk] = [-1, {}]
dict[chunk][0] = size

def dict2list( nameprefix, dict ):
answer = []
for key in dict.keys():
[size, kids] = dict[key]
key = nameprefix + key
if size < 0:
for triple in dict2list( key + '/', kids ):
answer.append( triple )
else:
answer.append( [size, key, dict2list('',kids)] )
answer.sort(); answer.reverse()
return answer

def dump( prefix, list ):
for [size,name,kids] in list:
print prefix + string.rjust(`size`,6), name
dump( prefix + ' '*5 + '|', kids )

def dudump( dirlist ):
dump( '', dict2list( '', suck_up_du( dirlist ) ) )

dudump( string.join( sys.argv[1:] ) )

>>> END OF MSG