Web Scraping in Practice 1: Multi-Process Scraping of a Quotes Site

import re
from multiprocessing import Pool

import requests


def get_html(url, header=None):
    '''
    :param url: http://quotes.toscrape.com/
    :param header: request headers; this site has no anti-scraping measures, so they can be omitted
    :return: the response text, or None if the request failed
    '''
    response = requests.get(url, headers=header, timeout=3)
    # A status code of 200 means success
    if response.status_code == 200:
        # Set the encoding from the detected content encoding
        response.encoding = response.apparent_encoding
        return response.text
    else:
        print('Request to {} failed: {}'.format(url, response.status_code))
        return None


def parser_html(html):
    '''
    :param html: the HTML to process
    :return: the data found on the current page
    '''
    # All quotes on the current page (list)
    span = re.findall('<span class="text" itemprop="text">(.*?)</span>', html)
    # All authors on the current page (list)
    small = re.findall('<small class="author" itemprop="author">(.*?)</small>', html)
    # All tag containers; re.S lets '.' match across line breaks
    div = re.findall('<div class="tags">(.*?)</div>', html, re.S)
    all_a = []  # one '/'-joined tag string per quote
    for tags in div:
        # Collect the text of every <a class="tag"> inside this container
        tag = re.findall('<a class="tag" href=".*?">(.*?)</a>', tags)
        all_a.append('/'.join(tag))
    data = []  # final records
    for quote, author, tag in zip(span, small, all_a):
        data.append('Quote: ' + quote + ' Author: ' + author + ' Tags: ' + tag)
    return data


def save_data(data, path=''):
    '''
    :param data: the data to save
    :param path: the file path to save to
    :return:
    '''
    # Append one record per line
    with open(path, 'a', encoding='utf-8') as f:
        for i in data:
            f.write(i + '\n')


def main(url):
    # Fetch the current page
    html = get_html(url)
    if html is None:
        # The request failed, so there is nothing to parse
        return
    # Parse the current page
    data = parser_html(html)
    # Save the current data set
    save_data(data, 'quotes.txt')


if __name__ == '__main__':
    # Create the process pool
    pool = Pool()
    # Page through the site and submit every URL to the pool
    for page in range(1, 11):
        url = f'http://quotes.toscrape.com/page/{page}/'
        pool.apply_async(main, args=(url,))
    pool.close()  # no more tasks will be submitted
    pool.join()  # the main process waits for the workers to finish
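
The regular expressions used in parser_html can be tried out offline before hitting the network. The sketch below is only an illustration: SAMPLE_HTML is a hand-written, hypothetical snippet that imitates the markup those patterns expect, and parse_quotes is a hypothetical helper that returns dictionaries instead of joined strings. It is not the real page source and not part of the original script.

```python
import re

# Hypothetical sample that imitates the markup the patterns expect
# (hand-written for this sketch, not copied from the real site).
SAMPLE_HTML = '''
<div class="quote">
    <span class="text" itemprop="text">Sample quote one.</span>
    <span>by <small class="author" itemprop="author">Author A</small></span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/life/page/1/">life</a>
        <a class="tag" href="/tag/love/page/1/">love</a>
    </div>
</div>
<div class="quote">
    <span class="text" itemprop="text">Sample quote two.</span>
    <span>by <small class="author" itemprop="author">Author B</small></span>
    <div class="tags">
        Tags:
    </div>
</div>
'''


def parse_quotes(html):
    """Hypothetical helper: same patterns as parser_html, dict output."""
    texts = re.findall(r'<span class="text" itemprop="text">(.*?)</span>', html)
    authors = re.findall(r'<small class="author" itemprop="author">(.*?)</small>', html)
    # re.S lets '.' span the line breaks inside each tag container
    tag_blocks = re.findall(r'<div class="tags">(.*?)</div>', html, re.S)
    tags = [re.findall(r'<a class="tag" href=".*?">(.*?)</a>', block)
            for block in tag_blocks]
    # zip stops at the shortest list, so a mismatch cannot raise IndexError
    return [{'text': t, 'author': a, 'tags': g}
            for t, a, g in zip(texts, authors, tags)]


quotes = parse_quotes(SAMPLE_HTML)
for quote in quotes:
    print(quote)
```

A quote with no tags still yields an entry (an empty list), so the three result lists stay aligned as long as every quote block contains a tags container.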
