Scraping a Car Sales Leaderboard with aiohttp and asyncio

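This scrape takes an asynchronous approach, using asynchronous requests to collect data from http://db.auto.sohu.com/cxdata/. The goal is to extract the monthly sales figures for every model of every car brand. The data is loaded via AJAX, so the underlying requests still have to be found by capturing network traffic. The hard parts are keeping the data matched up (brand to model to monthly figures) and working out the overall crawl flow. The code is at https://github.com/dongxun1/The-Cars-Sales-Nums. The repository includes a custom MySQL storage module, a configuration file, and cars.py with the scraping functions. The detailed extraction steps and reasoning are in README.md.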
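To make the data-matching problem concrete, here is a minimal sketch of how one model's response could be turned into rows that carry the brand, the model and the month together. The sample XML is hypothetical: it is only shaped after the regular expressions used in the main code, and the real response from the site may be laid out differently. The function name parse_sales and the sample brand and model names are also assumptions made for illustration.

```python
import re

# Hypothetical sample shaped after the regular expressions used in the scraper;
# the real response from the site may differ in layout.
SAMPLE_XML = '''<model id="1001" name="Model A">
  <sales date="2018-01" salesNum="1200"/>
  <sales date="2018-02" salesNum="950"/>
</model>'''

SALES_RE = re.compile(r'date="(.*?)" salesNum="(\d+)"')
NAME_RE = re.compile(r'name="(.*?)"')


def parse_sales(brand_name, xml_text):
    """Return one (brand, model, month, sales) row per monthly record."""
    match = NAME_RE.search(xml_text)
    if match is None:
        # No model name found, so there is nothing to attach the sales to.
        return []
    model_name = match.group(1)
    return [
        (brand_name, model_name, month, int(num))
        for month, num in SALES_RE.findall(xml_text)
    ]


rows = parse_sales("Brand X", SAMPLE_XML)
for row in rows:
    print(row)
```

Keeping the brand name and model name inside every row means a row can be stored on its own without losing track of which brand or model it belongs to.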

Main code:

from Test.Homework.configs import *
import aiohttp
import asyncio
import datetime
import re
import pymongo
from Test.Homework.configs_mysql import MYSQL

print(datetime.datetime.today(), 'Scraping started')


class AioCar(object):

    def __init__(self):
        # The full list of brand ids comes from this URL. Note that this is
        # the overall list: each brand id in turn maps to several models.
        self.url_1 = 'http://db.auto.sohu.com/cxdata/xml/basic/brandList.xml'

    async def have_brand_id(self, url):
        """
        Fetch the brand ids (brand_id).
        :param url: URL of the brand list
        :return: list of brand id strings
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text()
                brand_ids = re.findall(r'id="(\d+)"', result, re.S)
                # Note: do not join the ids into one string. Looping over a
                # string does not work here; each id must be a separate item.
                return brand_ids

    async def have_brand_name(self, url):
        """
        Extract the brand name from a URL such as
        http://db.auto.sohu.com/cxdata/xml/basic/brand145ModelListWithCorp.xml
        The response lists several entries, each covering multiple models.
        :param url: URL of the brand's model list
        :return: brand_name
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text(encoding="GBK")
                brand_name = re.findall(r'brand name="(.*?)"', result, re.S)[0]
                return brand_name

    async def have_ids(self, url):
        """
        Same URL as above; fetch the ids. These are the truly unique ids,
        one per model. Take care to keep the data matched up.
        :param url: URL of the brand's model list
        :return: list of model id strings
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text(encoding="gbk")
                ids = re.findall(r'id="(\d+)"', result, re.S)
                return ids

    async def have_leardboard_message(self, url):
        """
        Fetch the sales data for one model id.
        :param url: sales URL built from the model id
        :return: list of [model name, date, sales number]
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                try:
                    result = await resp.text(encoding="GBK")
                except UnicodeError:
                    result = await resp.text(encoding="utf-8")
                datas = re.findall(r'date="(.*?)" salesNum="(\d+)"', result, re.S)
                item = []
                name = re.findall(r'name="(.*?)"', result, re.S)[0]
                for data in datas:
                    date = data[0]  # renamed so it does not shadow the datetime module
                    sale_nums = data[1]
                    data_list = [name, date, sale_nums]
                    item.append(data_list)
                return item

    @staticmethod
    def save_message_to_mongodb(data):
        try:
            # `db` is the module-level handle created in the __main__ block.
            # insert_one replaces Collection.insert, which PyMongo 4 removed.
            db[MONGO_COLLECTION].insert_one(data)
        except Exception as e:
            print(e.args)

    async def main(self):
        # gather returns a list, so the brand ids are at index 0 of the result.
        return await asyncio.gather(self.have_brand_id(self.url_1))


if __name__ == '__main__':
    aio_car = AioCar()
    mysql = MYSQL()
    loop = asyncio.new_event_loop()
    results = loop.run_until_complete(aio_car.main())
    url_2 = 'http://db.auto.sohu.com/cxdata/xml/basic/brand{}ModelListWithCorp.xml'  # contains the brand name and the ids
    url_3 = 'http://db.auto.sohu.com/cxdata/xml/sales/model/model{}sales.xml'
    client = pymongo.MongoClient(MONGO_URL)
    db = client[MONGO_DB]
    for brand_id in results[0]:
        url = url_2.format(brand_id)
        task = aio_car.have_brand_name(url)  # coroutine that fetches the brand name
        brand_name = loop.run_until_complete(task)  # the brand-level name, i.e. the first name found
        ids = loop.run_until_complete(aio_car.have_ids(url))  # run on the event loop to get the ids
        for model_id in ids:
            url = url_3.format(model_id)
            task = aio_car.have_leardboard_message(url)
            datas = loop.run_until_complete(task)
            for item in datas:
                item = [brand_name, item[0], item[1], item[2]]
                item = " ".join(item)
                item = {'result': item}
                aio_car.save_message_to_mongodb(item)  # save to MongoDB
                mysql.insert(item)  # save to MySQL
    print('Scraping finished', datetime.datetime.today())
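The main block above awaits one request at a time through run_until_complete, so the requests run one after another even though the functions are coroutines. One possible way to overlap them, sketched here under assumptions, is to share a single session and run many fetches through asyncio.gather with a semaphore capping how many are in flight. The helper names fetch_text and gather_limited, the limit of 10, and the fake fetcher in the demo are all hypothetical and not part of the original repository.

```python
import asyncio


async def fetch_text(session, url, encoding="GBK"):
    """Download one URL through a shared aiohttp.ClientSession."""
    async with session.get(url) as resp:
        return await resp.text(encoding=encoding)


async def gather_limited(fetch, urls, limit=10):
    """Run fetch(url) for every URL concurrently, at most `limit` at a time.

    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(guarded(url) for url in urls))


async def demo():
    # Stand-in for a network call so the sketch runs offline.
    async def fake_fetch(url):
        await asyncio.sleep(0.01)
        return "body of " + url

    urls = ["model{}sales.xml".format(i) for i in range(5)]
    return await gather_limited(fake_fetch, urls, limit=2)


demo_results = asyncio.run(demo())
print(demo_results)
```

In a real run, the fake fetcher would be replaced by something like lambda u: fetch_text(session, u), with session opened once via async with aiohttp.ClientSession() and reused for every request instead of creating a new session per call.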
