I. Collecting Douban Movie Top 250 Data

1. Open the Douban Top 250 page

Douban Movie Top 250

2. Open the browser developer tools

3. Inspect the Top 250 page to find the relevant requests

Right-click ---- Inspect

4. Add the third-party libraries in PyCharm

In PyCharm: File ---- Settings ---- Python Interpreter ---- + ---- (install the bs4, requests, lxml, etc. packages)

Note: if installing a library fails, the pip version may be too old and need upgrading.

Check the pip version in a cmd window:

pip -V

Upgrade pip:

python -m pip install --upgrade pip

5. Writing the scraper
(1) Import the library: import requests
(2) Send a request to the server: response = requests.get(url)
(3) Get the page data: html = response.text

Anti-anti-scraping ---- disguise the request as a browser
1. Define a headers variable: h = {'User-Agent': '...'}
2. Send the request to the server with the header attached: response = requests.get(url, headers=h) (see the sketch below)
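
A minimal sketch of these two steps (the User-Agent value here is illustrative; copy the real string from your browser's developer tools):

import requests

# Disguise the request as a browser; without the header Douban tends to refuse bots
h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
response = requests.get('https://movie.douban.com/top250', headers=h)
html = response.text
print(response.status_code)  # 200 means the request got through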

6. Using the BeautifulSoup class from the bs4 library

1. Import it: from bs4 import BeautifulSoup
2. Create a soup object: soup = BeautifulSoup(html, 'lxml')
3. Call the select() method on the soup object with a CSS selector
4. Extract the data with any of the following (compared in the sketch below):
get_text()
text
string
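
A small sketch of steps 2-4, assuming html already holds the downloaded page source; the selector is the one used for film titles below:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')                      # parse with the lxml parser
titles = soup.select('div.hd > a > span:nth-child(1)')  # CSS selector -> list of tags
first = titles[0]
print(first.get_text())  # method: all text inside the tag
print(first.text)        # attribute: same result as get_text()
print(first.string)      # only non-None when the tag has a single text child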

7. Saving to CSV

1. First put the data into a list
2. Create the csv file: with open('xxx.csv', 'a', newline='') as f:
3. Call the writer() method of the csv module: w = csv.writer(f)
4. Write the list data into the file: w.writerows(listdata)

import requests,csv
from bs4 import BeautifulSoup
def getHtml(url):
    # Data collection
    h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
    response = requests.get(url, headers=h)
    html = response.text
    # Data parsing: regex, BeautifulSoup, or XPath -- BeautifulSoup here
    soup = BeautifulSoup(html, 'lxml')
    filmtitle = soup.select('div.hd > a > span:nth-child(1)')
    ct = soup.select('div.bd > p:nth-child(1)')
    score = soup.select('div.bd > div > span.rating_num')
    evalue = soup.select('div.bd > div > span:nth-child(4)')
    filmlist = []
    for t, c, s, e in zip(filmtitle, ct, score, evalue):
        title = t.text
        content = c.text
        filmscore = s.text
        num = e.text.strip('人评价')
        director = content.strip().split()[1]
        if '主演:' in content:
            actor = content.strip().split('主演:')[1].split()[0]
        else:
            actor = None
        year = content.strip().split('/')[-3].split()[-1]
        area = content.strip().split('/')[-2].strip()
        filmtype = content.strip().split('/')[-1].strip()
        listdata = [title, director, actor, year, area, filmtype, filmscore, num]
        filmlist.append(listdata)
    # Store the data
    with open('douban250.csv', 'a', encoding='utf-8', newline='') as f:
        w = csv.writer(f)
        w.writerows(filmlist)

# Write the header row once, then crawl all 10 pages (25 films per page)
listtitle = ['title', 'director', 'actor', 'year', 'area', 'type', 'score', 'evaluate']
with open('douban250.csv', 'a', encoding='utf-8', newline='') as f:
    w = csv.writer(f)
    w.writerow(listtitle)
for i in range(0, 226, 25):
    getHtml('https://movie.douban.com/top250?start=%s&filter=' % i)

Note

h is the User-Agent string taken from the Request Headers shown in the site's developer tools.

II. Anjuke Data Collection

1. The Anjuke page

Anjuke

2. Import the parser: from lxml import etree

3. Convert the collected string into an HTML element tree: etree.HTML(html)

4. Call xpath() on the converted data: data.xpath(path expression)

import requests
from lxml import etree
def getHtml(url):
    h = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=h)
    html = response.text
    # Data parsing
    data = etree.HTML(html)
    name = data.xpath('//span[@class="items-name"]/text()')
    print(name)
getHtml("https://bj.fang.anjuke.com/?from=AF_Home_switchcity")

III. Lagou Data Collection

(1) Collecting with requests

1. Import the library
2. Send a request to the server
3. Download the data

(2) POST requests (the parameters are not in the URL)

1. Find the POST parameters (the request's Form Data) through page inspection and define them in a dictionary
2. Send the request to the server with requests.post(url, data=formdata), as sketched below
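
A minimal sketch of a POST request (the first/pn/kd parameter names match the Form Data that Lagou's search endpoint expects; the values are illustrative):

import requests

formdata = {'first': 'true', 'pn': 1, 'kd': 'python'}  # Form Data copied from DevTools
h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
response = requests.post('https://www.lagou.com/jobs/v2/positionAjax.json',
                         headers=h, data=formdata)     # parameters travel in the request body
print(response.text)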

(3) Data parsing with json

1. Import the module: import json
2. Convert the collected JSON data into a Python dictionary: json.loads(json string)
3. Access the values through the dictionary keys, as in the sketch below
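
A small sketch of steps 2-3, using a made-up response shaped like Lagou's (the real key path content -> positionResult -> result is the one the full program below drills into):

import json

jsonhtml = '{"content": {"positionResult": {"result": [{"positionName": "python工程师"}]}}}'
dictdata = json.loads(jsonhtml)                       # JSON string -> Python dict
ct = dictdata['content']['positionResult']['result']  # drill down by key
print(ct[0]['positionName'])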

(4) Storing data in MySQL (sketched below)

1. Install and import pymysql: import pymysql
2. Open a connection: conn = pymysql.Connect(host, port, user, password, db, charset='utf8')
3. Write the SQL statement
4. Define a cursor: cursor = conn.cursor()
5. Execute the SQL with the cursor: cursor.execute(sql)
6. Commit the data to the database: conn.commit()
7. Close the cursor: cursor.close()
8. Close the connection: conn.close()

This scraper requires a logged-in account, so h uses the full data from the Request Headers rather than just a User-Agent.
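
A minimal sketch of the eight steps above (the lagou database and a two-column lg table are placeholders here; the full program below inserts nine columns):

import pymysql

# 2. open a connection (use your own password)
conn = pymysql.Connect(host='localhost', port=3306, user='root',
                       password='your_password', db='lagou', charset='utf8')
# 3. write the SQL statement
sql = "insert into lg values('%s','%s');" % ('python工程师', '10k-20k')
# 4.-6. cursor, execute, commit
cursor = conn.cursor()
try:
    cursor.execute(sql)
    conn.commit()
except:
    conn.rollback()  # undo the statement if it failed
# 7.-8. close the cursor, then the connection
cursor.close()
conn.close()

The complete program follows: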

import requests, json, csv, time, pymysql

keyword = input("Enter the job title to search for: ")

def getHtml(url):
    # Data collection. The headers are copied from DevTools ---- Network ----
    # Request Headers of a logged-in session; the cookie and the two
    # x-anit-forge values are account-specific and expire, so paste your own.
    h = {
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'cookie': '...',  # full session cookie omitted here for brevity
        'origin': 'https://www.lagou.com',
        'referer': 'https://www.lagou.com/wn/jobs?px=default&cl=false&fromSearch=true&labelWords=sug&suginput=python&kd=python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88&pn=1',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/30',
        'x-anit-forge-code': '7cd23e45-40de-449e-b201-cddd619b37f3',
        'x-anit-forge-token': '9b30b067-ac6d-4915-af3e-c308979313d0'
    }
    # Define the POST parameters, one page at a time
    for num in range(1, 31):
        formdata = {'first': 'true', 'pn': num, 'kd': keyword}
        # Send the request to the server
        response = requests.post(url, headers=h, data=formdata)
        # Download the data
        jsonhtml = response.text
        print(jsonhtml)
        # Parse the data
        dictdata = json.loads(jsonhtml)
        ct = dictdata['content']['positionResult']['result']
        positionlist = []
        for i in range(0, dictdata['content']['pageSize']):
            positionName = ct[i]['positionName']
            companyFullName = ct[i]['companyFullName']
            city = ct[i]['city']
            district = ct[i]['district']
            companySize = ct[i]['companySize']
            education = ct[i]['education']
            salary = ct[i]['salary']
            salaryMonth = ct[i]['salaryMonth']
            workYear = ct[i]['workYear']
            datalist = [positionName, companyFullName, city, district,
                        companySize, education, salary, salaryMonth, workYear]
            positionlist.append(datalist)
            # Open the connection and build the SQL
            conn = pymysql.Connect(host='localhost', port=3306, user='root',
                                   passwd='277877061#xyl', db='lagou', charset='utf8')
            sql = "insert into lg values('%s','%s','%s','%s','%s','%s','%s','%s','%s');" % (
                positionName, companyFullName, city, district,
                companySize, education, salary, salaryMonth, workYear)
            # Create the cursor and execute
            cursor = conn.cursor()
            try:
                cursor.execute(sql)
                conn.commit()
            except:
                conn.rollback()
            cursor.close()
            conn.close()
        # Throttle the requests
        time.sleep(3)
        with open('拉勾网%s职位信息.csv' % keyword, 'a', encoding='utf-8', newline='') as f:
            w = csv.writer(f)
            w.writerows(positionlist)

# Write the header row, then call the function
title = ['职位名称','公司名称','所在城市','所属地区','公司规模','教育水平','薪资范围','薪资月','工作经验']
with open('拉勾网%s职位信息.csv' % keyword, 'a', encoding='utf-8', newline='') as f:
    w = csv.writer(f)
    w.writerow(title)
getHtml('https://www.lagou.com/jobs/v2/positionAjax.json?needAddtionalResult=false')

IV. Epidemic Data Collection

360 epidemic data

import requests, json, csv, pymysql

def getHtml(url):
    h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=h)
    joinhtml = response.text
    # The response is JSONP ("jsonp2({...})"): strip the wrapper to get pure JSON
    directro = joinhtml.split('(')[1]
    directro2 = directro[0:-2]
    dictdata = json.loads(directro2)
    ct = dictdata['country']
    positionlist = []
    for i in range(0, 196):
        provinceName = ct[i]['provinceName']
        cured = ct[i]['cured']
        diagnosed = ct[i]['diagnosed']
        died = ct[i]['died']
        diffDiagnosed = ct[i]['diffDiagnosed']
        datalist = [provinceName, cured, diagnosed, died, diffDiagnosed]
        positionlist.append(datalist)
        # Store into MySQL
        conn = pymysql.Connect(host='localhost', port=3306, user='root',
                               passwd='277877061#xyl', db='lagou', charset='utf8')
        sql = "insert into yq values('%s','%s','%s','%s','%s');" % (
            provinceName, cured, diagnosed, died, diffDiagnosed)
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            conn.commit()
        except:
            conn.rollback()
        cursor.close()
        conn.close()
    # Store into CSV
    with open('360疫情数据.csv', 'a', encoding='utf-8', newline='') as f:
        w = csv.writer(f)
        w.writerows(positionlist)

# Write the header row, then call the function
title = ['provinceName','cured','diagnosed','died','diffDiagnosed']
with open('360疫情数据.csv', 'a', encoding='utf-8', newline='') as f:
    w = csv.writer(f)
    w.writerow(title)
getHtml('https://m.look.360.cn/events/feiyan?sv=&version=&market=&device=2&net=4&stype=&scene=&sub_scene=&refer_scene=&refer_subscene=&f=jsonp&location=true&sort=2&_=1626165806369&callback=jsonp2')

V. Scrapy Data Collection

(1) Steps to create a spider

1. Create the project; on the command line: scrapy startproject project_name
2. Create the spider file; in the command line, cd to the spiders folder, then: scrapy genspider spider_file_name website_domain
3. Run the program: scrapy crawl spider_file_name

(2) Scrapy project structure

1. spiders folder: where the parsing code goes
2. __init__.py: package marker; only a folder containing this file is a Python package
3. items.py: defines the data structure, i.e. the fields to scrape
4. middlewares.py: the middleware
5. pipelines.py: the pipeline file, which defines where the data is stored
6. settings.py: the configuration file; usually sets the user agent and enables the pipelines (the generated layout is shown below)
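
For reference, this is the layout that scrapy startproject anjuke (the example project below) generates:

anjuke/
    scrapy.cfg            # deploy configuration
    anjuke/               # the project package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py   # spider files created by scrapy genspider go here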

(3) Scrapy workflow

1. Configure settings.py:

(1) USER_AGENT: "browser string"

(2) ROBOTSTXT_OBEY: False

(3) Edit ITEM_PIPELINES to enable the pipelines

2. Define the fields to scrape in items.py:

field_name = scrapy.Field()

3. Write the parsing code in the parse() method of the spider file in the spiders folder and hand the results to the item object
4. Define the storage destination in pipelines.py
5. Paginated crawling (see the sketch after this list):

1) Find the pattern in the URLs
2) Define a page counter in the spider file; build the next URL with an if check and string formatting
3) yield scrapy.Request(newurl, callback=self.parse)
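
A minimal sketch of this pagination pattern, assuming a hypothetical site whose list pages end in /p1, /p2, ... (the Anjuke spider below uses exactly this structure):

import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['https://example.com/list/p1']  # hypothetical URL
    pagenum = 1

    def parse(self, response):
        # ... extract and yield items here ...
        # Build the next URL until page 5, then stop
        if self.pagenum < 5:
            self.pagenum += 1
            newurl = 'https://example.com/list/p{}'.format(self.pagenum)
            yield scrapy.Request(newurl, callback=self.parse)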

Examples

(1) Anjuke data collection

The project files:

venv ---- right-click ---- New ---- Directory ---- anjuke data collection

venv ---- right-click ---- Open in ---- Terminal, then on the command line type:

scrapy startproject anjuke

and press Enter.

Inspect the auto-generated files.

In the terminal, change into the spiders directory:

cd anjuke\anjuke\spiders

On the command line type (this creates the ajk.py file):

scrapy genspider ajk bj.fang.anjuke.com/?from=navigation

Run it (on the command line):

scrapy crawl ajk

To run without the log output (also in the terminal):

scrapy crawl ajk --nolog

1. ajk.py

import scrapy
from anjuke.items import AnjukeItem

class AjkSpider(scrapy.Spider):
    name = 'ajk'
    # allowed_domains = ['bj.fang.anjuke.com/?from=navigation']
    start_urls = ['http://bj.fang.anjuke.com/loupan/all/p1']
    pagenum = 1

    def parse(self, response):
        item = AnjukeItem()
        # Parse the data
        name = response.xpath('//span[@class="items-name"]/text()').extract()
        temp = response.xpath('//span[@class="list-map"]/text()').extract()
        place = []
        district = []
        for i in temp:
            # "[ place ] district" -> split into the two columns
            placetemp = "".join(i.split("]")[0].strip("[").strip().split())
            districttemp = i.split("]")[1].strip()
            place.append(placetemp)
            district.append(districttemp)
        # apartment = response.xpath('//a[@class="huxing"]/span[not(@class)]/text()').extract()
        area1 = response.xpath('//span[@class="building-area"]/text()').extract()
        area = []
        for j in area1:
            areatemp = j.strip("建筑面积:").strip('㎡')
            area.append(areatemp)
        price = response.xpath('//p[@class="price"]/span/text()').extract()
        # Hand the cleaned data to the item
        item['name'] = name
        item['place'] = place
        item['district'] = district
        # item['apartment'] = apartment
        item['area'] = area
        item['price'] = price
        yield item
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            print(a, b, c, d, e)
        # Paginated crawling
        if self.pagenum < 5:
            self.pagenum += 1
            newurl = "https://bj.fang.anjuke.com/loupan/all/p{}/".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)

2. pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv, pymysql

# Store to CSV
class AnjukePipeline:
    def open_spider(self, spider):
        self.f = open("安居客北京.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'place', 'district', 'area', 'price']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        # writerow() writes one record; writerows() writes a list of records.
        # The item holds five parallel lists, so zip them into rows first.
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

# Store to MySQL
class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost", port=3306, user='root',
                                    password='277877061#xyl', db="anjuke", charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            sql = 'insert into ajk values("%s","%s","%s","%s","%s");' % (a, b, c, d, e)
            self.cursor.execute(sql)
            self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

3. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class AnjukeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    place = scrapy.Field()
    district = scrapy.Field()
    # apartment = scrapy.Field()
    area = scrapy.Field()
    price = scrapy.Field()

4. settings.py

# Scrapy settings for anjuke project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'anjuke'

SPIDER_MODULES = ['anjuke.spiders']
NEWSPIDER_MODULE = 'anjuke.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'anjuke.pipelines.AnjukePipeline': 300,
    'anjuke.pipelines.MySqlPipeline': 301,
}

# All other auto-generated settings (concurrency, delays, cookies, middlewares,
# autothrottle, HTTP cache) are left at their commented-out defaults.

5. middlewares.py

middlewares.py is left exactly as generated by scrapy startproject: it defines AnjukeSpiderMiddleware and AnjukeDownloaderMiddleware with the default from_crawler, process_spider_input/output/exception, process_start_requests, process_request/response/exception, and spider_opened hooks, none of which are modified for this project.

6. MySQL

1) Create the anjuke database

mysql> create database anjuke;

2) Use the database

mysql> use anjuke;

3) Create the table

mysql> create table ajk(
    -> name varchar(100), place varchar(100), distarct varchar(100),
    -> area varchar(100), price varchar(50));
Query OK, 0 rows affected (0.03 sec)

4) View the table structure

mysql> desc ajk;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| name     | varchar(100) | YES  |     | NULL    |       |
| place    | varchar(100) | YES  |     | NULL    |       |
| distarct | varchar(100) | YES  |     | NULL    |       |
| area     | varchar(100) | YES  |     | NULL    |       |
| price    | varchar(50)  | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+

5) View the table data

mysql> select * from ajk;
+-------------------------------------+-----------------------+-----------------------------------------------------------+--------------+--------+
| name                                | place                 | distarct                                                  | area         | price  |
+-------------------------------------+-----------------------+-----------------------------------------------------------+--------------+--------+
| 和锦华宸品牌馆                      | 顺义马坡              | 和安路与大营二街交汇处东北角                              | 95-180       | 400    |
| 保利建工•和悦春风                   | 大兴庞各庄            | 永兴河畔                                                  | 76-109       | 36000  |
| 华樾国际·领尚                       | 朝阳东坝              | 东坝大街和机场二高速交叉路口西方向3...                    | 96-196.82    | 680    |
| 金地·璟宸品牌馆                     | 房山良乡              | 政通西路                                                  | 97-149       | 390    |
| 中海首钢天玺                        | 石景山古城            | 北京市石景山区古城南街及莲石路交叉口...                   | 122-147      | 900    |
| 傲云品牌馆                          | 朝阳孙河              | 顺黄路53号                                                | 87288        | 2800   |
| 首开香溪郡                          | 通州宋庄              | 通顺路                                                    | 90-220       | 39000  |
| 北京城建·北京合院                   | 顺义顺义城            | 燕京街与通顺路交汇口东800米(仁和公...                    | 93-330       | 850    |
| 金融街武夷·融御                     | 通州政务区            | 北京城市副中心01组团通胡大街与东六环...                   | 99-176       | 66000  |
| 北京城建·天坛府品牌馆               | 东城天坛              | 北京市东城区景泰路与安乐林路交汇处                        | 60-135       | 800    |
| 亦庄金悦郡                          | 大兴亦庄              | 新城环景西一路与景盛南二街交叉口                          | 72-118       | 38000  |
| 金茂·北京国际社区                   | 顺义顺义城            | 水色时光西路西侧                                          | 50-118       | 175    |

(2) PCauto (太平洋汽车) data collection

1. tpy.py

import scrapy
from taipingyangqiche.items import TaipingyangqicheItem

class TpyqcSpider(scrapy.Spider):
    name = 'tpyqc'
    allowed_domains = ['price.pcauto.com.cn']
    start_urls = ['http://price.pcauto.com.cn/top/']
    pagenum = 1

    def parse(self, response):
        item = TaipingyangqicheItem()
        name = response.xpath('//p[@class="sname"]/a/text()').extract()
        temperature = response.xpath('//p[@class="col rank"]/span[@class="fl red rd-mark"]/text()').extract()
        price = response.xpath('//p/em[@class="red"]/text()').extract()
        brand2 = response.xpath('//p[@class="col col1"]/text()').extract()
        brand = []
        for j in brand2:
            # strip the "品牌:" / "排量:" labels and line breaks
            areatemp = j.strip('品牌:').strip('排量:').strip('\r\n')
            brand.append(areatemp)
        brand = [i for i in brand if i != '']
        rank2 = response.xpath('//p[@class="col"]/text()').extract()
        rank = []
        for j in rank2:
            # strip the "级别:" / "变速箱:" labels and line breaks
            areatemp = j.strip('级别:').strip('变速箱:').strip('\r\n')
            rank.append(areatemp)
        rank = [i for i in rank if i != '']
        item['name'] = name
        item['temperature'] = temperature
        item['price'] = price
        item['brand'] = brand
        item['rank'] = rank
        yield item
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            print(a, b, c, d, e)
        if self.pagenum < 6:
            self.pagenum += 1
            newurl = "https://price.pcauto.com.cn/top/k0-p{}.html".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)

2. pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv, pymysql

# Store to CSV
class TaipingyangqichePipeline:
    def open_spider(self, spider):
        self.f = open("太平洋.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'temperature', 'price', 'brand', 'rank']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        # The item holds five parallel lists; zip them into rows
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        print(self.listtemp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

# Store to MySQL
class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost", port=3306, user='root',
                                    password='277877061#xyl', db="taipy", charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            sql = 'insert into tpy values("%s","%s","%s","%s","%s");' % (a, b, c, d, e)
            self.cursor.execute(sql)
            self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

3. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class TaipingyangqicheItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    temperature = scrapy.Field()
    price = scrapy.Field()
    brand = scrapy.Field()
    rank = scrapy.Field()

4. settings.py

# Scrapy settings for taipingyangqiche project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'taipingyangqiche'

SPIDER_MODULES = ['taipingyangqiche.spiders']
NEWSPIDER_MODULE = 'taipingyangqiche.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'taipingyangqiche.pipelines.TaipingyangqichePipeline': 300,
    'taipingyangqiche.pipelines.MySqlPipeline': 301,
}

# All other auto-generated settings (concurrency, delays, cookies, middlewares,
# autothrottle, HTTP cache) are left at their commented-out defaults.

5. middlewares.py

middlewares.py is again the unmodified scrapy startproject template: it defines TaipingyangqicheSpiderMiddleware and TaipingyangqicheDownloaderMiddleware with the default hooks, and no changes are needed for this project.
