Python Web Scraping: Crawling Blog Data with Scrapy

Create a new Scrapy spider file:

# -*- coding: utf-8 -*-
import scrapy


class CsdnBlogSpider(scrapy.Spider):
    name = 'csdn_blog'
    # 'csdn.net' covers both so.csdn.net (search) and blog.csdn.net (articles)
    allowed_domains = ['csdn.net']
    keyword = 'another'

    def start_requests(self):
        # Request search result pages 1 to 10 for the keyword
        for pn in range(1, 11):
            url = 'https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&viparticle=&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn, self.keyword)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Collect the article links from one search result page
        href_s = response.xpath('//div[@class="search-list-con"]/dl//span[@class="mr16"]/../../dt/div/a[1]/@href').extract()
        for href in href_s:
            yield scrapy.Request(url=href, callback=self.parse2)

    def parse2(self, response):
        item = dict(
            # extract_first() returns the first match, or None if there is none
            title=response.xpath('//h1[@class="title-article"]/text()').extract_first(),
            # Raw response body as bytes
            data=response.body,
        )
        yield item

    # start_urls = ['http://blog.csdn.net/']
    #
    # def parse(self, response):
    #     pass
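
The paged search URL in start_requests can also be assembled with the standard library instead of % formatting, which additionally URL-encodes keywords such as C++. This is only a minimal sketch: build_search_url and SEARCH_ENDPOINT are hypothetical names that do not appear in the original spider, and the parameter list is copied from the URL above.

```python
from urllib.parse import urlencode

# Search endpoint taken from the spider above.
SEARCH_ENDPOINT = 'https://so.csdn.net/so/search/s.do'


def build_search_url(page, keyword):
    """Return the search URL for one result page (hypothetical helper)."""
    params = [
        ('p', page),      # page number (pn in the spider)
        ('q', keyword),   # search keyword, URL-encoded by urlencode
        ('t', 'blog'),
        ('viparticle', ''),
        ('domain', ''),
        ('o', ''),
        ('s', ''),
        ('u', ''),
        ('l', ''),
        ('f', ''),
        ('rbg', '0'),
    ]
    return SEARCH_ENDPOINT + '?' + urlencode(params)


if __name__ == '__main__':
    for pn in range(1, 4):
        print(build_search_url(pn, 'another'))
```

For a plain ASCII keyword this yields the same URL string as the % formatting used in the spider.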

In settings.py, uncomment these two settings:

DOWNLOADER_MIDDLEWARES = {
    's1.middlewares.S1DownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    's1.pipelines.S1Pipeline': 300,
}

In middlewares.py, modify the process_request method of the downloader middleware class:

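# Imports needed at the top of middlewares.py
import urllib.request as ur

import user_agent  # module providing get_user_agent_pc(), as used in this post
from scrapy.http.headers import Headers

    # Inside the downloader middleware class: this is the method you usually override
    def process_request(self, request, spider):
        # Header values are stored as bytes, so build them with the Headers class
        request.headers = Headers({
            'User-Agent': user_agent.get_user_agent_pc(),
        })
        # Set a proxy IP fetched from the proxy API
        request.meta['proxy'] = 'http://' + ur.urlopen('http://api.ip.data5u.com/dynamic/get.html?order=06b5d4a85d10b5cbe9db1e5a3b9fa2e1&sep=4').read().decode('utf-8').strip()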
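
The process_request above calls the proxy API once for every outgoing request. One possible refinement is to reuse a fetched proxy for a short time. The following is a rough sketch, not the author's method: ProxyCache, fetch, ttl and clock are hypothetical names, and the 30-second lifetime is an arbitrary assumption.

```python
import time


class ProxyCache:
    """Reuse a fetched proxy address for `ttl` seconds (hypothetical helper)."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch    # callable returning 'host:port' text
        self._ttl = ttl        # assumed lifetime of one proxy, in seconds
        self._clock = clock    # injectable clock, handy for testing
        self._proxy = None
        self._expires = 0.0

    def get(self):
        """Return 'http://host:port', fetching a new address when stale."""
        now = self._clock()
        if self._proxy is None or now >= self._expires:
            self._proxy = 'http://' + self._fetch().strip()
            self._expires = now + self._ttl
        return self._proxy


if __name__ == '__main__':
    # Stand-in for the real API call; the address is a placeholder.
    cache = ProxyCache(fetch=lambda: '127.0.0.1:8888\n')
    print(cache.get())
```

In the middleware, fetch would wrap the ur.urlopen(...).read().decode('utf-8') call from the original code, and process_request would set request.meta['proxy'] = cache.get().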

Modify pipelines.py:


class S1Pipeline(object):
    def process_item(self, item, spider):
        # Save each article's raw HTML under its title
        with open('blog_html/%s.html' % item['title'], 'wb') as f:
            f.write(item['data'])
        return item
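
The pipeline above assumes that the blog_html folder already exists and that the title is usable as a file name. However, extract_first() returns None when no title is found, and titles may contain characters such as / or ?. Below is a small sketch of a more defensive version, using the same kind of character filter as the urllib script later in this post. safe_filename, save_html and the 'untitled' fallback are hypothetical, not part of the original project.

```python
import os
import re

# Characters that are not allowed in Windows file names.
_ILLEGAL = re.compile(r'[\\/:*<>|"?]')


def safe_filename(title, fallback='untitled'):
    """Strip illegal characters; fall back when the title is missing."""
    cleaned = _ILLEGAL.sub('', title or '').strip()
    return cleaned or fallback


def save_html(directory, title, data):
    """Write the raw bytes to <directory>/<title>.html and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if needed
    path = os.path.join(directory, safe_filename(title) + '.html')
    with open(path, 'wb') as f:
        f.write(data)
    return path
```

Inside process_item this would be called as save_html('blog_html', item['title'], item['data']) before returning the item.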

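That is basically all. If you don't want to type scrapy crawl followed by the spider name in the console every time, you can add a start file to the spiders folder containing the following code: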

from scrapy import cmdline

# 'csdn_blog' is the spider's name attribute defined above
cmdline.execute('scrapy crawl csdn_blog'.split())

Done.

Below is a non-distributed crawler built on urllib and lxml; it does the same job as the Scrapy project above:

It does not use a proxy IP, so use it with caution: getting your IP banned would not be great.

import re
import urllib.request as ur
from urllib.parse import quote

import lxml.etree as le
import user_agent


def getRequest(url):
    # Build a request with a desktop User-Agent and a browser cookie
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': 'TY_SESSION_ID=14e93d1c-5cfb-4692-8416-dc2df061bb5c; JSESSIONID=68E9815DA238619AB37E640211691B8B; uuid_tt_dd=10_20594510460-1585746871024-545182; dc_session_id=10_1585746871024.456447; dc_sid=e127d0cf2db7a2e5cf7ded7b0b7d1880; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1585746876; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_20594510460-1585746871024-545182; c-toolbar-writeguide=1; announcement=%257B%2522isLogin%2522%253Afalse%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F105203745%2522%252C%2522announcementCount%2522%253A1%252C%2522announcementExpire%2522%253A78705482%257D; firstDie=1; __guid=129686286.421372154518304900.1585746901554.712; monitor_count=1; c_ref=https%3A//blog.csdn.net/; __gads=ID=86722e1f5d97e31d:T=1585746904:S=ALNI_MaIZXWpb5EgzqK0TDZB-yNS9h6l_g; searchHistoryArray=%255B%2522python%2522%252C%2522opencv%2522%255D; dc_tos=q8426m; c-login-auto=3; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1585746960'
        }
    )


hot_word = ['C++', 'java', 'python', 'PHP', 'Go', 'Objective-C', 'SQL', 'PL/SQL', 'C', 'Swift', 'Visual Basic']

if __name__ == '__main__':
    keyword = input('Enter keyword: ')
    pn_start = int(input('Start page: '))
    pn_end = int(input('End page: '))
    for pn in range(pn_start, pn_end + 1):
        # quote() keeps non-ASCII or special keywords valid inside the URL
        request = getRequest('https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&viparticle=&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn, quote(keyword)))
        response = ur.urlopen(request).read()
        href_s = le.HTML(response).xpath('//div[@class="search-list-con"]/dl//span[@class="mr16"]/../../dt/div/a[1]/@href')
        for href in href_s:
            try:
                response_blog = ur.urlopen(getRequest(href)).read()
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0]
                # Remove characters that are illegal in file names
                title = re.sub(r'[\\/:*<>|"?]', '', title)
                # The blog folder must already exist
                with open('blog/%s.html' % title, 'wb') as f:
                    f.write(response_blog)
                print(title)
            except Exception as e:
                print(e)

# def getProxyOpener():
#     # proxy_address = ur.urlopen(
#     #     'http://api.ip.data5u.com/dynamic/get.html?order=d314e5e5e19b0dfd19762f98308114ba&sep=4').read().decode(
#     #     'utf-8').strip()
#     proxy_handler = ur.ProxyHandler(
#         {
#             'http': '58.218.214.147:4029'
#         }
#     )
#     return ur.build_opener(proxy_handler)
# print(response)
# print(href_s)
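
The commented-out getProxyOpener hints at how a proxy could be added to this script. Here is a minimal sketch of that idea using urllib's ProxyHandler and build_opener. get_proxy_opener is a hypothetical name, and the address is the placeholder from the commented-out code. Because the search URL uses https, the sketch registers the proxy for both schemes; whether a given proxy actually supports https is an assumption.

```python
import urllib.request as ur


def get_proxy_opener(proxy_address):
    """Build an opener that sends http and https traffic through a proxy."""
    proxy_handler = ur.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address,  # the search URL uses https
    })
    return ur.build_opener(proxy_handler)


if __name__ == '__main__':
    # Placeholder address taken from the commented-out code above.
    opener = get_proxy_opener('58.218.214.147:4029')
    # opener.open(request) would then replace ur.urlopen(request)
    print([type(h).__name__ for h in opener.handlers])
```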
