python网络编程学习笔记(七):HTML和XHTML解析(HTMLParser、BeautifulSo

2019-10-06 14:22:10王旭

    def __init__(self):
        self.taglevels=[]
        self.handledtags=['title','body']
        self.processing=None
        HTMLParser.HTMLParser.__init__(self)
    def handle_starttag(self,tag,attrs):
        if tag in self.handledtags:
            self.data=''
            self.processing=tag
    def handle_data(self,data):
        if self.processing:
            self.data +=data
    def handle_endtag(self,tag):
        if tag==self.processing:
            print str(tag)+':'+str(tp.gettitle())
            self.processing=None
    def handle_entityref(self,name):
        if entitydefs.has_key(name):
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&'+name+';')

    def handle_charref(self,name):
        try:
            charnum=int(name)
        except ValueError:
            return
        if charnum<1 or charnum>255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.data
fd=open('test1.html')
tp=TitleParser()
tp.feed(fd.read())

运行结果为:
title: XHTML 与" HTML 4.01 "标准没有太多的不同
body:
i love÷ you×

3、提取链接
例4:


<html>
<head>
<title> XHTML 与" HTML 4.01 "标准没有太多的不同</title>
</head>
<body>

<a href="http://pypi.python.org/pypi" title="link1">i love÷ you×</a>
</body>
</html>

这里在handle_starttag(self,tag,attrs)中,tag=a时,attrs记录了属性值,因此只需要将attrs中name=href的value提出即可。具体如下:


##@小五义:
##HTMLParser示例:提取链接
# -*- coding: cp936 -*-