Re: Sugar for regular expression groupings.

Tim Peters (tim@ksr.com)
Sun, 21 Feb 93 21:41:19 EST

> [guido]
> ...
> I like the idea of saving a reference to the last matched string and
> providing a varargs function to access substrings. I'll try to put
> this (as C code) in the next release.

Cool!

> I don't think the interface to access substrings by name instead of by
> number buys you much

Not initially, no ... it's a year later when the format changes that the
pain begins <smile>. Still, I wouldn't _recommend_ people generally use
a name interface either, cuz no matter how it's done it's gonna be pretty
slow. For that reason, I don't use a name interface myself for regexp-
crunching on large volumes of data.

> (except an advantage over Perl :-).

Having a regexp _object_ is an advantage over Perl already; the Perl
folks ask for "something like that" regularly. But getting the effect of
named fields is already easy in Perl. E.g., for the fpformat.py example:

>>> fpre = '^\([-+]?\)0*\([0-9]*\)\(\(\.[0-9]*\)?\)\(\([eE][-+]?[0-9]+\)?\)$'
>>> decoder = regex2.compile( fpre, 'all','sign','int','frac','junk','exp')
>>> decoder.match('-2.3e45')
7
>>> decoder.matches_by_name('sign','int','frac','exp')
('-', '2', '.3', 'e45')

In idiomatic Perl that looks like:

$fpre = '^([-+]?)0*(\d*)((\.\d*)?)((e[-+]?\d+)?)$';
($sign,$int,$frac,$junk,$exp) = '-2.3e45' =~ /$fpre/io;
print "$sign $int $frac $exp\n";

which prints "- 2 .3 e45". Python would have a pretty close equivalent
if a compiled regexp's search method returned a tuple with one element
per pair of meta-parentheses in the regexp:

>>> sign,int,frac,junk,exp,junk2 = decoder.hypothetical_search('-2.3e45')

where `sign' etc are bound to None if the search fails. This way has
attractions too.
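
A minimal sketch of that tuple-returning search, for anyone who wants to
play with it. It assumes the present-day `re` module rather than regex2,
so the pattern is respelled with unescaped grouping parentheses; the
function name is hypothetical, as above. The pattern has six pairs of
parentheses, so six names get unpacked (`junk2` is just a made-up name
for the last one):

```python
import re

# The fpformat pattern, respelled for the `re` module (an assumption:
# plain parentheses group here, where the post's syntax uses \( and \)).
FPRE = re.compile(r'^([-+]?)0*([0-9]*)((\.[0-9]*)?)(([eE][-+]?[0-9]+)?)$')

def hypothetical_search(compiled, text):
    """Return one element per parenthesized group, all None on failure."""
    m = compiled.search(text)
    if m is None:
        # Same tuple length either way, so unpacking never breaks.
        return (None,) * compiled.groups
    return m.groups()

sign, int_, frac, junk, exp, junk2 = hypothetical_search(FPRE, '-2.3e45')
failed = hypothetical_search(FPRE, 'abc')
```

Keeping the tuple the same length on failure is the point: the caller can
unpack first and test `sign is None` afterwards.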

Lots of ways to skin this cat <smile>! I hope someone who does a lot of
regexp crunching (I really don't) tries out several approaches & says what
they like best.
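
For comparison, here is a rough sketch of the by-name flavor layered over
plain group numbers. It is only a stand-in built on the present-day `re`
module, not the regex2 code itself: the class name `NamedRegex` is
invented, and having match() return the matched length (-1 on failure) is
an assumption made to mirror the session shown earlier.

```python
import re

class NamedRegex:
    """Hypothetical wrapper: look up matched groups by name."""

    def __init__(self, pattern, *names):
        self._prog = re.compile(pattern)
        # names[0] labels the whole match (group 0), names[1] group 1, ...
        self._index = {name: i for i, name in enumerate(names)}
        self._last = None  # most recent match object, if any

    def match(self, text):
        """Match at the start of text; return matched length or -1."""
        self._last = self._prog.match(text)
        return -1 if self._last is None else self._last.end()

    def matches_by_name(self, *names):
        """Return the substrings for the given names from the last match."""
        if self._last is None:
            raise ValueError("no successful match to read from")
        return tuple(self._last.group(self._index[n]) for n in names)

fpre = r'^([-+]?)0*([0-9]*)((\.[0-9]*)?)(([eE][-+]?[0-9]+)?)$'
decoder = NamedRegex(fpre, 'all', 'sign', 'int', 'frac', 'junk', 'exp')
length = decoder.match('-2.3e45')
fields = decoder.matches_by_name('sign', 'int', 'frac', 'exp')
```

Each lookup costs a dictionary probe on top of the group fetch, which is
the kind of overhead that adds up on large volumes of data.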

> You can always define constants to name the substrings near the place
> where you write down the pattern.

That's fine by me, although there's a little danger from unintended
namespace collisions.

A suggestion for people who intend to do that: Instead of defining the
"constants" like this:

ALL = 0
SIGN = 1
INT = 2
FRAC = 3
JUNK = 4
EXP = 5

Do it like this:

[ALL, SIGN, INT, FRAC, JUNK, EXP] = range(6)

You'll be glad you did when things change ...
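
A small sketch of why, assuming the present-day `re` module and the
fpformat pattern respelled in its syntax (both assumptions, not part of
the original example):

```python
import re

# One line names every group; the numbers are never typed by hand.
[ALL, SIGN, INT, FRAC, JUNK, EXP] = range(6)

fpre = re.compile(r'^([-+]?)0*([0-9]*)((\.[0-9]*)?)(([eE][-+]?[0-9]+)?)$')
m = fpre.match('-2.3e45')

# Match.group accepts several indices and returns a tuple of substrings.
fields = m.group(SIGN, INT, FRAC, EXP)

# If a new group is later added in the middle of the pattern, only the
# naming line changes, e.g. (NEWFIELD is a hypothetical name):
#     [ALL, SIGN, NEWFIELD, INT, FRAC, JUNK, EXP] = range(7)
# With the one-per-line form, every later constant needs renumbering.
```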

agreeably y'rs - tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp