Example: multithreaded scraping of Tianya post content with Python

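This example uses re, urllib, and threading to scrape the content of a Tianya post with multiple threads. Set url to the first page of the Tianya post you want to scrape, and set file_name to the name of the file the downloaded text is saved to.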
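The two regular expressions do most of the work in the listing below, so here is a minimal sketch of how they behave. It is written for Python 3 and runs against a made-up HTML fragment: SAMPLE_HTML, the class name "next", and the helper names extract_posts and extract_page_count are assumptions for illustration, and the real Tianya markup may differ.

```python
import re

# Hypothetical HTML fragment; the real Tianya markup may differ.
SAMPLE_HTML = '''
<span>时间：2014-01-01 10:00:00</span>
<div class="bbs-content clearfix">
    first reply<br>second line
</div>
<span>时间：2014-01-01 10:05:00</span>
<div class="bbs-content">
    another reply
</div>
<a href="/post-16-996521-3.shtml">3</a> <a href="/post-16-996521-2.shtml" class="next">下页</a>
'''

# Post pattern: group 1 is the timestamp, group 2 is the post body.
# re.DOTALL lets "." match newlines, so a body may span several lines.
TEXT_PATTERN = re.compile(
    r'<span>时间：(.*?)</span>.*?<div class="bbs-content.*?>\s*(.*?)</div>',
    re.DOTALL,
)

# Pager pattern: the number in the link just before the "下页" (next page) link.
PAGE_PATTERN = re.compile(
    r'<a href="\S*?">(\d*)</a>\s*<a href="\S*?" class="\S*?">下页</a>'
)


def extract_posts(html):
    """Return a list of (timestamp, body) tuples, one per post."""
    return [(ts, body.strip()) for ts, body in TEXT_PATTERN.findall(html)]


def extract_page_count(html):
    """Return the total page count, or None if no pager is found."""
    match = PAGE_PATTERN.search(html)
    if match and match.group(1):
        return int(match.group(1))
    return None


posts = extract_posts(SAMPLE_HTML)
page_count = extract_page_count(SAMPLE_HTML)
```

Because the pattern has two groups, findall returns one tuple per post, which is why the listing joins each tuple into a single string before storing it.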

#coding:utf-8

import urllib
import re
import threading
import os, time

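class Down_Tianya(threading.Thread):
    """Download one page of the post in its own thread."""
    def __init__(self, url, num, dt):
        threading.Thread.__init__(self)
        self.url = url        # URL of the page to fetch
        self.num = num        # page number, used as the dictionary key
        self.txt_dict = dt    # dictionary shared by all threads

    def run(self):
        print 'downloading from %s' % self.url
        self.down_text()

    def down_text(self):
        """Scrape the page at self.url and store its posts in the dictionary, keyed by page number."""
        html_content = urllib.urlopen(self.url).read()
        # Group 1 is the post time, group 2 is the post body.
        text_pattern = re.compile(r'<span>时间：(.*?)</span>.*?<div class="bbs-content.*?>\s*(.*?)</div>', re.DOTALL)
        text = text_pattern.findall(html_content)
        # Join each (time, body) tuple into one string.
        text_join = ['\r\n\r\n\r\n\r\n'.join(item) for item in text]
        self.txt_dict[self.num] = text_join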

def page(url):
    """Get the total number of pages from the first page's URL."""
    html_page = urllib.urlopen(url).read()
    # The last page number is the link just before the "下页" (next page) link.
    page_pattern = re.compile(r'<a href="\S*?">(\d*)</a>\s*<a href="\S*?" class="\S*?">下页</a>')
    page_result = page_pattern.search(html_page)
    if page_result:
        page_num = int(page_result.group(1))
        return page_num

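def write_text(txt_dict, fn):
    """Write the dictionary to a text file in key (page number) order; each value is the list of posts on that page."""
    with open(fn, 'w+') as tx_file:
        pn = len(txt_dict)
        for i in range(1, pn+1):
            tx_list = txt_dict[i]
            for tx in tx_list:
                # Turn HTML line breaks into newlines and drop non-breaking-space entities.
                tx = tx.replace('<br>', '\r\n').replace('<br />', '\r\n').replace('&nbsp;', '')
                tx_file.write(tx.strip()+'\r\n'*4)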

def main():
    url = 'http://bbs.tianya.cn/post-16-996521-1.shtml'
    file_name = 'abc.txt'
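The listing stops at this point, so what follows is not the original main but a hedged sketch of one way the pieces could be driven: start one thread per page, wait for all of them, then read the shared dictionary back in page order. It is written for Python 3 and does no network access. The names PageWorker, fake_fetch, page_url, and download_all are hypothetical, and the URL scheme (the page number is the last "-N" before ".shtml") is an assumption based only on the sample url above.

```python
import threading


def fake_fetch(url):
    """Stand-in for a network call (hypothetical); returns canned text per URL."""
    return "content of " + url


def page_url(first_url, page_number):
    """Build a page URL, assuming the page number is the last '-N' before '.shtml'."""
    prefix, _, _ = first_url.rpartition("-")
    return "%s-%d.shtml" % (prefix, page_number)


class PageWorker(threading.Thread):
    """Fetch one page and store the result under its page number."""

    def __init__(self, url, num, results, fetch=fake_fetch):
        super().__init__()
        self.url = url
        self.num = num
        self.results = results
        self.fetch = fetch

    def run(self):
        # Each thread writes only to its own key in the shared dictionary.
        self.results[self.num] = self.fetch(self.url)


def download_all(first_url, page_count, fetch=fake_fetch):
    """Fetch pages 1..page_count concurrently and return them in page order."""
    results = {}
    workers = [
        PageWorker(page_url(first_url, n), n, results, fetch)
        for n in range(1, page_count + 1)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # wait until every page has been stored
    # Reassemble by key, regardless of which thread finished first.
    return [results[n] for n in range(1, page_count + 1)]


pages = download_all("http://bbs.tianya.cn/post-16-996521-1.shtml", 3)
```

Keying the results by page number is what lets the threads finish in any order while the output file still comes out in reading order.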
