Python Web Crawling: A Data Scraping Project Example

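I. Crawling data with Scrapy
- (1) Preparation
- (2) Goals:
- (3) Steps:
  - 1. Create the project:
  - 2. Create the spider:
  - 3. Open the project:
  - 4. Create the launcher script:
  - 5. Write the spider:
  - 6. Run and test: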

I. Crawling data with Scrapy

(1) Preparation

Install Scrapy (details omitted).

(2) Goals:

1. Page URL: https://ke.qq.com/course/list?mt=1001&page=1
(a course list page on Tencent Classroom)
2. Crawl pages 1 to 20.
3. Fields to extract (each XPath is relative to one course <li> element):
course = scrapy.Field() # course name ./h4/a/text()
schedule = scrapy.Field() # course progress ./div[1]/span/text()
company = scrapy.Field() # provider ./div[1]/a/text()
pay = scrapy.Field() # price ./div[2]/span[1]/text()
hot = scrapy.Field() # popularity ./div[2]/span[2]/text()
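
To see how these relative paths select their nodes, here is a minimal offline sketch. It uses the standard library's ElementTree instead of Scrapy's selectors, so `.text` stands in for `text()`. The markup in `SAMPLE` is a hypothetical, simplified course card; the real page's HTML is an assumption and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one course card (an assumption,
# not the real page source).
SAMPLE = """
<ul class="course-card-list">
  <li>
    <h4><a>Sample course</a></h4>
    <div><span>12 lessons</span><a>Sample school</a></div>
    <div><span>Free</span><span>
        345 enrolled
    </span></div>
  </li>
</ul>
""".strip()


def first_text(node, path):
    """Return the stripped text of the first match, or '' if nothing matches."""
    found = node.find(path)
    return (found.text or "").strip() if found is not None else ""


def parse_cards(markup):
    """Extract the five fields from every <li> under the root <ul>."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall("./li"):
        rows.append({
            "course": first_text(li, "./h4/a"),
            "schedule": first_text(li, "./div[1]/span"),
            "company": first_text(li, "./div[1]/a"),
            "pay": first_text(li, "./div[2]/span[1]"),
            "hot": first_text(li, "./div[2]/span[2]"),
        })
    return rows


rows = parse_cards(SAMPLE)
print(rows[0])
```

The positional predicates `div[1]` and `div[2]` count only the `<div>` children of the `<li>`, which is why the same `span` name can be reused in both paths.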

(3) Steps:

1. Create the project:

Create the Scrapy project under the scrapyProject directory:
D:…\scrapyProject>scrapy startproject ke

2. Create the spider:

D:…\scrapyProject>cd ke
D:…\scrapyProject\ke>scrapy genspider keLesson ke.qq.com

3. Open the project:

Open the ke project located under the scrapyProject directory.
Note: do not open the inner ke subdirectory that sits inside ke.

4. Create the launcher script:

Create a run.py file and add the launch command:

from scrapy.cmdline import execute
execute(["scrapy", "crawl", "keLesson"])

5. Write the spider:

keLesson.py:

import scrapy
from ke.items import KeItem


class KelessonSpider(scrapy.Spider):
    name = 'keLesson'
    # Must be the domain of the site being crawled
    allowed_domains = ['ke.qq.com']

    # Generate the URLs of list pages 1-20
    def start_requests(self):
        for i in range(1, 21):  # range() excludes its end value
            url = 'https://ke.qq.com/course/list?mt=1001&page={}'.format(i)  # list page URL
            yield scrapy.Request(url=url)

    def parse(self, response):
        li_list = response.xpath("//ul[@class='course-card-list']/li")
        for li in li_list:
            item = KeItem()
            # default='' keeps a missing node from producing None
            item['course'] = li.xpath("./h4/a/text()").extract_first(default='')          # course name
            item['schedule'] = li.xpath("./div[1]/span/text()").extract_first(default='')  # course progress
            item['company'] = li.xpath("./div[1]/a/text()").extract_first(default='')      # provider
            item['pay'] = li.xpath("./div[2]/span[1]/text()").extract_first(default='')    # price
            hot = li.xpath("./div[2]/span[2]/text()").extract_first(default='')            # popularity
            item['hot'] = hot.replace('\n', '').strip()
            yield item
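
The goal is pages 1 to 20, and Python's `range()` stops before its end value, so the upper bound needs care. The sketch below builds the page URLs with the standard library only; `page_urls` and its parameters are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://ke.qq.com/course/list"


def page_urls(first=1, last=20, mt=1001):
    """Build the list-page URLs for pages first..last, inclusive."""
    # range() excludes its end value, hence last + 1.
    return [
        f"{BASE_URL}?{urlencode({'mt': mt, 'page': page})}"
        for page in range(first, last + 1)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

With `range(1, 20)` the list would end at page 19, one short of the stated goal.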

items.py:

import scrapy


class KeItem(scrapy.Item):
    course = scrapy.Field()     # course name      ./h4/a/text()
    schedule = scrapy.Field()   # course progress  ./div[1]/span/text()
    company = scrapy.Field()    # provider         ./div[1]/a/text()
    pay = scrapy.Field()        # price            ./div[2]/span[1]/text()
    hot = scrapy.Field()        # popularity       ./div[2]/span[2]/text()

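pipelines.py:

from itemadapter import ItemAdapter


class KePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        fields = ('course', 'schedule', 'company', 'pay', 'hot')
        # utf-8 avoids encoding errors with Chinese text; None becomes ''
        with open('keLesson_info.txt', 'a', encoding='utf-8') as f:
            f.write(';'.join(adapter.get(k) or '' for k in fields) + '\n')
        # Return the item so any later pipeline still receives it
        return item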
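
The pipeline writes one semicolon-separated line per item. String concatenation fails with a TypeError if any field is None, which happens whenever an XPath matches nothing. A minimal standalone sketch of a safer formatter follows; `format_record`, `append_record` and the sample item are hypothetical and use plain dicts instead of Scrapy items.

```python
FIELDS = ("course", "schedule", "company", "pay", "hot")


def format_record(item, sep=";"):
    """Join the five fields into one line; missing or None values become ''."""
    return sep.join((item.get(name) or "").strip() for name in FIELDS) + "\n"


def append_record(path, item):
    """Append one formatted record to the file at path."""
    # utf-8 is set explicitly so the output does not depend on the
    # platform's default encoding.
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_record(item))


# Hypothetical item with one missing field and one padded field.
sample = {
    "course": "示例课程",
    "schedule": None,
    "company": "Sample school",
    "pay": "Free",
    "hot": " 345 ",
}
line = format_record(sample)
print(repr(line))
```

An empty field still keeps its separator, so every line has the same number of columns when the file is read back.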

Note: enable the pipeline in settings.py.
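
Assuming the project is named ke and the pipeline class is KePipeline, as in the steps above, enabling it would look roughly like this fragment of settings.py. The value 300 is only an ordering priority (lower numbers run first); any value in the usual 0-1000 range works when there is a single pipeline.

```python
# settings.py (fragment)
# Maps the pipeline's import path to its ordering priority.
ITEM_PIPELINES = {
    "ke.pipelines.KePipeline": 300,
}
```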

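6. Run and test:

Then run run.py and check whether the target pages are crawled.

With ROBOTSTXT_OBEY = True (the value in a newly generated project), Scrapy honors the robots protocol: the site's robots.txt file defines which paths crawlers may fetch, and requests to disallowed paths are dropped. If the target pages are not crawled for that reason, open settings.py and change
ROBOTSTXT_OBEY = True
to
ROBOTSTXT_OBEY = False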
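
To illustrate how robots rules decide whether a URL may be fetched, here is a small sketch using the standard library's `urllib.robotparser` as a stand-in for Scrapy's own robots handling. The rules in `RULES` are hypothetical and are not the real robots.txt of ke.qq.com; `allowed` is a helper name introduced for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not the site's real robots.txt).
RULES = """
User-agent: *
Disallow: /course/
""".strip().splitlines()


def allowed(url, user_agent="*"):
    """Return True if the hypothetical rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(user_agent, url)


list_page = "https://ke.qq.com/course/list?mt=1001&page=1"
home_page = "https://ke.qq.com/"
print(allowed(list_page))
print(allowed(home_page))
```

Under rules like these, a crawler that obeys robots.txt would skip every list page, which matches the symptom of a crawl that finishes without writing any items.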

The crawl then produces the file keLesson_info.txt.