It's pretty simple. There is an HTML parsing library called
"htmllib.py" (included in the standard distribution, I think). What
you can do is derive a class from one of its parser classes and give
it methods 'start_a' and 'end_a' (among others). For instance, you
might have:

    # assuming FormattingParser is the parser class from htmllib
    from htmllib import FormattingParser

    class AnchorEater(FormattingParser):
        def start_a(self, attrs):
            ...
        def end_a(self):
            ...

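If your Python no longer ships htmllib (it was dropped in Python 3),
the same subclass-and-override idea can be sketched with the standard
html.parser module. This is only a rough sketch: the names
AnchorCollector, hrefs, and find_hrefs are made up for illustration,
and html.parser has no per-tag start_a/end_a hooks, so the tag check
happens inside handle_starttag instead.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical example class: collects the HREF of every <A> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names;
        # attrs arrives as a list of (name, value) pairs.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def find_hrefs(text):
    """Feed text to a fresh parser and return the HREFs it saw."""
    parser = AnchorCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs
```

Calling find_hrefs() on a string of HTML returns the HREF values in
the order the anchors appear.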
Alternatively, you can use some of the code I wrote for 'dancer',
specifically htmlparsemodule.c. All you need to do is pass it an
instance of a class with these methods:

    handledata
    unknown_starttag
    unknown_endtag

plus methods called 'start_XXX', 'do_XXX', and 'end_XXX', where XXX is
any markup type (A, IMG, EM, etc.). For instance:

    import htmlparse

    class AnchorEater:
        def handledata(self, text):
            pass    # ignore text
        def unknown_starttag(self, tag, attrs):
            pass    # ignore every tag but A's
        def unknown_endtag(self, tag):
            pass
        def start_a(self, attrs):
            print 'Found an anchor! HREF is', attrs['href']
        def end_a(self):
            print 'end of tag'

    text = ('this is a <A HREF="http:foo"> test </A> ' +
            'of a <A HREF="ftp://bob.com/bar.ps"> parser </A>')
    htmlparse.parse(text, AnchorEater())

Results in:

    Found an anchor! HREF is http:foo
    end of tag
    Found an anchor! HREF is ftp://bob.com/bar.ps
    end of tag

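For anyone curious how the start_XXX / end_XXX lookup can work, here
is a rough pure-Python sketch of that kind of dispatch. It is not the
code in htmlparsemodule.c: the tokenizer is a deliberately naive
regular expression, the names TOKEN, ATTR, and parse are invented for
illustration, and I am assuming 'do_XXX' is the fallback handler for
tags that have no 'start_XXX' method. It is written for Python 3, so
the handler below collects values rather than using the print
statement.

```python
import re

# Naive tokenizer: a start tag, an end tag, or a run of plain text.
# No comments, no entities, only double-quoted attribute values.
TOKEN = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>|([^<]+)')
ATTR = re.compile(r'([A-Za-z_][-\w]*)\s*=\s*"([^"]*)"')


def parse(text, handler):
    """Walk text and call the matching methods on handler."""
    for match in TOKEN.finditer(text):
        closing, tag, rest, data = match.groups()
        if data is not None:
            handler.handledata(data)
            continue
        tag = tag.lower()
        if closing:
            # Look for end_XXX, else fall back to unknown_endtag.
            method = getattr(handler, 'end_' + tag, None)
            if method is not None:
                method()
            else:
                handler.unknown_endtag(tag)
            continue
        # Attribute names are lowercased, so HREF becomes 'href'.
        attrs = {name.lower(): value for name, value in ATTR.findall(rest)}
        # Look for start_XXX, then do_XXX, else unknown_starttag.
        method = (getattr(handler, 'start_' + tag, None)
                  or getattr(handler, 'do_' + tag, None))
        if method is not None:
            method(attrs)
        else:
            handler.unknown_starttag(tag, attrs)


class AnchorEater:
    """Same shape as the handler above, but it records what it sees."""

    def __init__(self):
        self.hrefs = []
        self.unknown = []

    def handledata(self, text):
        pass    # ignore text

    def unknown_starttag(self, tag, attrs):
        self.unknown.append(tag)

    def unknown_endtag(self, tag):
        pass

    def start_a(self, attrs):
        self.hrefs.append(attrs['href'])

    def end_a(self):
        pass
```

The whole trick is the getattr() lookup: the parser never needs a
table of known tags, it just asks the handler object whether a method
with the right name exists.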
-- Steven Miale [smiale@cs.indiana.edu] HTTP://cs.indiana.edu/hyplan/smiale.html