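This article walks through ways to convert HTML to plain text in Python. The details are as follows:
Today a project required converting HTML to plain text. A quick search online showed that Python offers quite a few ways to do it.
Here are the two methods I tried myself today, for the benefit of later readers: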
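Method 1:
1. Install nltk; it is available from PyPI. The clean_html function used here is implemented in NLTK 2.x only; NLTK 3 removed it and raises NotImplementedError.
(Note: it depends on the following packages: numpy, PyYAML)
2. Test code: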
>>> import nltk
>>> aa = r'''
<html>
<body>
<b>Project:</b> DeHTML<br>
<b>Description</b>:<br>
This small script is intended to allow conversion from HTML markup to
plain text.
</body>
</html>
'''
>>> aa
'\n<html>\n <body>\n <b>Project:</b> DeHTML<br>\n <b>Description</b>:<br>\n This small script is intended to allow conversion from HTML markup to \n plain text.\n </body>\n </html>\n '
>>> print nltk.clean_html(aa)
Project: DeHTML
Description :
This small script is intended to allow conversion from HTML markup to
plain text.
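
To give a sense of what a tag-stripping call like this has to do, here is a rough regex-based sketch that uses only the Python 3 standard library. It is an approximation for illustration, not NLTK's implementation; the name strip_tags is hypothetical, and regular expressions can mishandle malformed HTML.

```python
import re
from html import unescape


def strip_tags(html):
    """Rough tag stripper (hypothetical helper, not NLTK's code)."""
    # Drop <script> and <style> blocks entirely, including their contents.
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
    # Drop comments, then replace every remaining tag with a space.
    text = re.sub(r'(?s)<!--.*?-->', ' ', text)
    text = re.sub(r'(?s)<[^>]*>', ' ', text)
    # Decode entities such as &amp; and collapse whitespace runs.
    text = unescape(text)
    return re.sub(r'\s+', ' ', text).strip()
```

Unlike the output shown above, this sketch flattens everything onto one line, because every tag and line break is reduced to a single space.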
Method 2:
If nltk feels too heavyweight for the job, you can write the code yourself:
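# Python 2 module name; in Python 3 use: from html.parser import HTMLParser
from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []  # collected text fragments

    def handle_data(self, data):
        # Called for each run of text between tags.
        text = data.strip()
        if len(text) > 0:
            # Collapse runs of whitespace into a single space.
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')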
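
The listing above breaks off after handle_data. As a self-contained illustration of the same idea, here is a minimal Python 3 sketch. The names TextExtractor and html_to_text, and the choice to turn <br> and <p> into line breaks, are assumptions made for this example and are not taken from the original script.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._parts = []

    def handle_data(self, data):
        # Collapse whitespace runs, as the handle_data above does.
        text = re.sub(r'[ \t\r\n]+', ' ', data.strip())
        if text:
            self._parts.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # Assumption: <br> and <p> should become line breaks in the output.
        if tag in ('br', 'p'):
            self._parts.append('\n')

    def text(self):
        # Remove the spaces left around line breaks, then trim the ends.
        return re.sub(r' *\n *', '\n', ''.join(self._parts)).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling html_to_text(aa) on the sample string from Method 1 should give output along the same lines as the nltk result, with one line per <br>.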










