Re: Sugar for regular expression groupings.

Tim Peters (tim@ksr.com)
Fri, 19 Feb 93 20:58:08 EST

> [Tracy Tims writes ...]
> I find myself using regex.compile() frequently for parsing lines from
> various data-files. ...
> [notes that breaking out fields is clumsy & error-prone, a la ...]
>
>     format = regex.compile( a_pattern)
>
>     if format.match( data) != -1:
>         old_ver = data[format.regs[1][0]:format.regs[1][1]]
>         new_ver = data[format.regs[2][0]:format.regs[2][1]]
>         user = data[format.regs[3][0]:format.regs[3][1]]
>         date = data[format.regs[4][0]:format.regs[4][1]]
>         time = data[format.regs[5][0]:format.regs[5][1]]
>         host = data[format.regs[6][0]:format.regs[6][1]]
>         dir = data[format.regs[7][0]:format.regs[7][1]]
> ...
> [but with some changes ...]
> I could reduce the example above to the following:
>
>     if format.match( data) != -1:
>         old_ver, new_ver, user, date, time, host, dir \
>             = format.groups(1,2,3,4,5,6,7)
>
> Code using the groups() method is easier to modify and maintain
> because it doesn't have the internal interdependencies that the first
> example has (and the fpformat.py idiom also has).

I'll whine louder <grin>: even given those changes, you still have a
maintenance nightmare because of the ubiquitous reliance on meaningless
little integers to denote the fields. If the format of the data file
changes, or even if it doesn't but you decide you need to extract some
more info "in the middle" (a common desire as applications grow fancier),
the relative indices of the parentheses may change. Then you've got to
track down all the references in the code & change 'em.
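
To make the hazard concrete, here is a minimal sketch. It assumes the
present-day "re" module rather than the "regex" module used elsewhere in
this message (the pattern syntax differs), and the data line and both
patterns are invented purely for illustration. Adding one group "in the
middle" silently changes what index 2 refers to:

```python
import re

# Hypothetical data line and patterns, invented for illustration only.
line = "1.2 1.3 tracy"

# Original pattern: group 1 = old version, group 2 = new version, group 3 = user.
before = re.compile(r"(\d+\.\d+) (\d+\.\d+) (\w+)")

# Later we also want the major number of the old version, so a new group
# appears "in the middle" and every later group shifts up by one.
after = re.compile(r"((\d+)\.\d+) (\d+\.\d+) (\w+)")

new_ver_before = before.match(line).group(2)
new_ver_after = after.match(line).group(2)   # same index, different field

print(new_ver_before)
print(new_ver_after)
```

Every place in the code that says "2" now has to be found and changed
by hand, which is exactly the chore described above.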

Attached is module regex2.py, a homegrown way to worm around those
problems. I haven't gotten around to documenting it, but an example
should make it clear enough to understand what the code is doing. I'll
use the regexp from fpformat.py for familiarity's sake:

>>> import regex2
>>> fpre = '^\([-+]?\)0*\([0-9]*\)\(\(\.[0-9]*\)?\)\(\([eE][-+]?[0-9]+\)?\)$'
>>> decoder = regex2.compile( fpre, 'all','sign','int','frac','junk','exp')
>>> decoder.match('-2.3e45')
7
>>> decoder.matches_by_index(0,2,2,6)
('-2.3e45', '2', '2', 'e45')
>>> decoder.matches_by_name('sign','int','frac','exp')
('-', '2', '.3', 'e45')
>>> decoder.match('abc')
-1
>>> decoder.matches_by_name('int')
regex2: last match failed
Stack backtrace (innermost last):
  File "<stdin>", line 1
  File "./regex2.py", line 41
    raise error
>>>

So your "groups" method is named "matches_by_index" here, & there's also
a "matches_by_name" method that confines the dependence on little
integers to a single line (the fields are (optionally) named at the time
the regexp is compiled). It also arranges to gripe if a "matches_by_..."
method is invoked when the preceding search or match attempt failed; that
may or may not be a feature (eye of the beholder; obviously, I like it
that way).
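
For comparison only, here is a sketch of the same naming idea under the
assumption that the present-day "re" module is available; it is not what
regex2.py does. There the names can live inside the pattern itself via
(?P<name>...), so no integer shows up at all. The fpformat.py pattern is
transcribed by hand into the newer syntax, and the two inner helper
groups are made non-capturing:

```python
import re

# Modern `re` syntax (not the old `regex` syntax used in this message):
# each field is named inside the pattern itself via (?P<name>...).
fp = re.compile(
    r"^(?P<sign>[-+]?)0*(?P<int>[0-9]*)"
    r"(?P<frac>(?:\.[0-9]*)?)"
    r"(?P<exp>(?:[eE][-+]?[0-9]+)?)$"
)

m = fp.match("-2.3e45")
fields = m.group("sign", "int", "frac", "exp")
print(fields)

# A failed match yields None rather than -1, so trying to pull fields out
# of it raises an error instead of quietly returning stale data.
failed = fp.match("abc")
print(failed)
```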

Give it a try & see how you like it! This kind of thing is very easy to
do in Python, so play around & see if you can come up with something you
like better.

the-trick-to-motivating-guido-to-change-python-is-to-come-up-with-
something-*he*-likes-better<grin>-ly y'rs - tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp

Module regex2.py:

import regex

error = 'regex2: last match failed'

def compile( pattern, *field_names ):
    answer = _Regex2()
    answer.pattern = pattern # currently unused
    answer.compiled_re = regex.compile( pattern )
    answer.matched_string = None
    answer.name2index = {}
    i = 0
    for name in field_names:
        answer.name2index[name] = i
        i = i + 1
    return answer

class _Regex2:
    def match( self, string ):
        a = self.compiled_re.match( string )
        if a >= 0: self.matched_string = string
        else: self.matched_string = None
        return a

    def search( self, string ):
        a = self.compiled_re.search( string )
        if a >= 0: self.matched_string = string
        else: self.matched_string = None
        return a

    def matches_by_index( self, *indices ):
        if self.matched_string is None:
            raise error
        answer = ()
        for n in indices:
            start, end = self.compiled_re.regs[n]
            answer = answer + (self.matched_string[start:end],)
        return answer

    def matches_by_name( self, *field_names ):
        if self.matched_string is None:
            raise error
        answer = ()
        for name in field_names:
            i = self.name2index[name]
            start, end = self.compiled_re.regs[i]
            answer = answer + (self.matched_string[start:end],)
        return answer
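
For readers without the "regex" module, the following is a rough sketch
of how the same wrapper might look on top of the present-day "re"
module. It is an approximation, not a drop-in replacement: the names
Regex2 and Regex2Error are hypothetical, the pattern is transcribed by
hand into "re" syntax, and a group that takes no part in a match comes
back as None here rather than as a string.

```python
import re


class Regex2Error(Exception):
    """Raised when fields are requested after a failed match (hypothetical name)."""


class Regex2:
    def __init__(self, pattern, *field_names):
        self.compiled_re = re.compile(pattern)
        self.last_match = None
        # The n-th name refers to group n; group 0 is the whole match.
        self.name2index = {name: i for i, name in enumerate(field_names)}

    def match(self, string):
        self.last_match = self.compiled_re.match(string)
        # Mimic the old interface: length of the match, or -1 on failure.
        return -1 if self.last_match is None else self.last_match.end()

    def search(self, string):
        self.last_match = self.compiled_re.search(string)
        # Mimic the old interface: start position of the match, or -1.
        return -1 if self.last_match is None else self.last_match.start()

    def matches_by_index(self, *indices):
        if self.last_match is None:
            raise Regex2Error("last match failed")
        return tuple(self.last_match.group(n) for n in indices)

    def matches_by_name(self, *field_names):
        return self.matches_by_index(*(self.name2index[n] for n in field_names))


# The fpformat.py pattern, rewritten by hand in `re` syntax.
fpre = r"^([-+]?)0*([0-9]*)((\.[0-9]*)?)(([eE][-+]?[0-9]+)?)$"
decoder = Regex2(fpre, "all", "sign", "int", "frac", "junk", "exp")

print(decoder.match("-2.3e45"))
print(decoder.matches_by_name("sign", "int", "frac", "exp"))
```

Keeping the match object around, instead of the matched string plus the
"regs" tuple, is a design choice of this sketch; the gripe-on-failure
behavior is carried over unchanged.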

END OF MSG