Re: Sugar for regular expression groupings.

Tim Peters (tim@ksr.com)
Tue, 23 Feb 93 01:02:51 EST

> [tracy]
> I ended up not writing my own regex class with symbolic group names for
> two reasons: if the feature is supplied in the standard python module
> it will be faster, and it will be more portable (or distributable).

Agree "faster" is likely, but don't understand the portable/distributable
point: a module coded _in_ standard Python is quite portable! The
advantage to doing it that way is it gives people a chance to experiment
with the interface, before freezing it into the standard distribution.

> ... I can live with integer group names.

Me too!

> [various improvements: groupnames attribute; optional varargs to
> 'compile' to initialize groupnames; modify 'groups' method to allow
> strings (names) in addition to integer indices]
>
> This ends up looking like your solution, but the relationship
> between 'regs', groups(), and 'groupnames' is explicit. This
> is useful because it increases the number of "fruitful
> interactions".

Well, there's nothing to stop you from writing a portable module, in std
Python, that does exactly all that today. If you did that & distributed
the module, and people liked it a lot, Guido might get interested in
hacking a C version <grin>.
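
A rough sketch of what such a wrapper module might look like. The class
name NamedRegex and its methods are made up for illustration, and it uses
the `re` module from later Python versions as a stand-in for regex (so
groups are written with bare parens rather than \( \)):

```python
import re

class NamedRegex:
    """Hypothetical wrapper: a compiled pattern plus a name->index table."""

    def __init__(self, pattern, *groupnames):
        self.prog = re.compile(pattern)
        # First name maps to group 1, second to group 2, and so on.
        self.groupnames = {name: i for i, name in enumerate(groupnames, 1)}
        self.last = None

    def match(self, text):
        # Remember the most recent match so groups() can refer to it.
        self.last = self.prog.match(text)
        return self.last is not None

    def groups(self, *keys):
        # Each key is either an integer index or a name from groupnames.
        if self.last is None:
            raise ValueError('no successful match yet')
        indices = [self.groupnames[k] if isinstance(k, str) else k
                   for k in keys]
        return tuple(self.last.group(i) for i in indices)
```

Usage would be along the lines of

    decode = NamedRegex(r'[^0-9]*([0-9]+)[ \t]+([A-Za-z_.-]+)', 'number', 'label')
    if decode.match(line):
        n, l = decode.groups('number', 'label')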

> Improvement 2:
> Add syntax to regular expressions so that groups can be named
> in place, yielding the group dictionary. (This is a *big*
> advantage over perl.)
>
> For example:
> re = '[^0-9]*\(<number>[0-9]+\)[ \t]+\(<label>[A-Za-z_-.]+\)'
> decode = regex.compile( re)
> n, l = decode.groups( 'number', 'label')
>
> I like this idea, because then I can build complicated regular
> expressions in substrings, and then catenate them together
> into the final regular expression before compiling. It also
> completely eliminates group-counting, and it provides a visual
> indication of which groups are just for grouping, and which
> are for substring extraction.

Agree that _is_ nice. But again, it's something you _can_ do today, in
your own Python module (in your compile method, you "just" need to
analyze the pattern string before invoking regex.compile, stripping out
the '<name>' portions and saving away the derived name->index dict;
there's really no need to touch the current regex implementation, except
in that it's likely you'll wind up using more sets of parens than
regexmodule is currently compiled to handle).
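
The stripping step could go something like the sketch below. The function
name strip_group_names is hypothetical, and it takes a shortcut: it treats
every backslash pair the same way, so a \( inside a [...] character class
would be miscounted. It only does the string analysis; the result would
then be handed to regex.compile.

```python
def strip_group_names(pattern):
    """Return (plain_pattern, name_to_index) for an Emacs-style pattern.

    A group written as \\(<name>...\\) has its <name> tag removed, and the
    name is mapped to that group's 1-based index. Hypothetical helper.
    """
    out = []
    names = {}
    group_count = 0
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == '\\' and i + 1 < n:
            # Copy the backslash and the character it escapes as a unit.
            nxt = pattern[i + 1]
            out.append(ch + nxt)
            i += 2
            if nxt == '(':
                group_count += 1
                if i < n and pattern[i] == '<':
                    end = pattern.find('>', i)
                    if end == -1:
                        raise ValueError('unterminated group name')
                    # Record the name and skip over the <name> tag.
                    names[pattern[i + 1:end]] = group_count
                    i = end + 1
        else:
            out.append(ch)
            i += 1
    return ''.join(out), names
```

Unnamed groups still get counted, so a name always maps to the index the
underlying regex package will use for that group.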

> But what python really needs are LALR(1) parser objects, don't you
> think?

Hard to say! I confess I came to UNIX(tm) late in life, & never did
grasp the fascination with regexps. I find them awfully cryptic &
clumsy as soon as they go beyond the trivial. E.g., here's one from
python-mode.el, to match a Python line that opens a code block:

\\([^#'\n\\]\\|'\\([^'\n\\]\\|\\\\.\\)*'\\|\\\\\n\\)*:\\([ \t]\\|\\\\\n\\)*\\(#.*\\)?$

I can't even read that anymore! Least not without a lot of tedious
effort.

A std parsing approach might be better after all (how many more desperate
net msgs will we read asking how to capture the concept of nesting
brackets via regexps <0.9 grin>?). But not sure: the only truly
_pleasant_ pattern-matching language I've used is SNOBOL4, & even it was
clumsy for dealing with left recursion.

Suspect we agree that regexps aren't the right way to go for complex
pattern-matching tasks. On the other hand, I do think they're fine for
simple tasks, so maybe keeping them clumsy to use is doing most users a
favor <0.9 grin>.

> [guido]
> I'm afraid the trouble with this one [tracy's '<name>' extension] is
> that the syntax of Python regular expressions is defined by the GNU
> Emacs regular expression package.

Ya, but it's not an essential extension -- the <name> constructs are
syntactic sugar that could be stripped out before the Emacs package is
invoked. Not saying you _should_, just saying that it's not hard to do.

> ... Is it really that hard to count occurrences of \(?

Well, it _is_ error-prone: I remember when quoted strings were introduced
into Fortran, and hearing "is it really that hard to count the number of
characters in a Hollerith?" (hint: the answer is "no" <grin>).

I think what it _does_ do is impose an unnatural implementation layer
between the way we think of the problem & the way we need to code the
solution. On the other hand, in those cases where regexps get so fancy
that the need for counting parens goes above 3 (truly my _comfortable_
mental limit!), regexps probably aren't the right tool for the job anyway
...

> ...
> Would you folks settle for a recursive descent parser generator (like
> the one used to build the Python parser)? That one I know how to
> hack...

I'd like to see someone suggest a specific interface, & code it up in
Python, so we could get a feel for how it works in practice.

In the meantime, I believe everyone agrees that a regex method supporting
varargs "integer group names" would be a valuable extension -- right?
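
For concreteness, here's roughly what I mean, written as a plain function
rather than a method. The name and signature are illustrative only; it
assumes 'regs' is a sequence of (start, end) index pairs, one per group
with the whole match at index 0, and that an unmatched group shows up as
(-1, -1):

```python
def groups(text, regs, *indices):
    """Return a tuple of the substrings for the requested group indices.

    Illustrative sketch: regs[i] is assumed to be the (start, end) pair
    for group i, with (-1, -1) meaning the group did not participate.
    """
    result = []
    for i in indices:
        start, end = regs[i]
        result.append(text[start:end] if start >= 0 else None)
    return tuple(result)
```

So after a successful match on a line, something like

    n, l = groups(line, regs, 1, 2)

pulls out both substrings in one call.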

mostly-just-thinking-out-loud-ly y'rs - tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp