Python Baidu keyword crawler: crawling 1000 Baidu Baike pages related to a keyword

# coding: utf8
# author: Jery
# datetime: 2019/4/12 19:22
# software: PyCharm
# function: crawl 1000 Baidu Baike pages related to the keyword "python", collecting each page's title and summary

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup


class SpiderMain(object):
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.outputer = DataOutputer()

    # Main crawler: calls the methods of the four helper classes to run the crawl
    def crawl(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print("crawl #{}: {}".format(count, new_url))
                html_content = self.downloader.download(new_url)
                # URLs and data extracted from the new page
                new_urls, new_data = self.parser.parse(new_url, html_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 1000:
                    break
                count += 1
            except Exception as e:
                # Report the cause instead of swallowing every error silently
                print("crawl failed! {}".format(e))
        self.outputer.output_html()


# URL manager: handles adding and removing URLs

class UrlManager:
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            # Go through add_new_url so already-crawled URLs are not queued again
            self.add_new_url(url)


# Downloads the page source

class HtmlDownloader:
    def download(self, url):
        if url is None:
            return None
        response = urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()


# Extracts the required content from the downloaded page

class HtmlParser:
    def parse(self, page_url, html_content):
        if page_url is None or html_content is None:
            # Return a pair so the caller can still unpack the result
            return None, None
        soup = BeautifulSoup(html_content, 'lxml', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Collect every link whose href contains /view/
        links = soup.find_all('a', href=re.compile(r'/view/.*'))
        for link in links:
            new_url = "https://baike.baidu.com" + link['href']
            new_urls.add(new_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        # Record the page URL as well, since the output table prints it
        res_data = {'url': page_url}
        # Entry title, e.g. "Python"
        title_node = soup.find("dl", {"class": "lemmaWgt-lemmaTitle lemmaWgt-lemmaTitle-"}).dd.h1
        res_data['title'] = title_node.get_text()
        summary_node = soup.find('div', {"class": "lemma-summary"})
        res_data['summary'] = summary_node.get_text()
        return res_data


# Writes the collected data to a table in an HTML file

class DataOutputer:
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Open as UTF-8 text so the strings can be written without encoding them by hand
        with open('output.html', 'w', encoding='utf-8') as output:
            output.write("<html><body><table>")
            # One table row per page: url, title, summary
            for data in self.datas:
                output.write("<tr>")
                output.write("<td>{}</td>".format(data['url']))
                output.write("<td>{}</td>".format(data['title']))
                output.write("<td>{}</td>".format(data['summary']))
                output.write("</tr>")
            output.write("</table></body></html>")


if __name__ == '__main__':
    root_url = "https://baike.baidu.com/item/Python/407313"
    obj_spider = SpiderMain()
    obj_spider.crawl(root_url)
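
The scheduling loop and the URL manager are the parts of this crawler that are easiest to get subtly wrong, so here is a minimal offline sketch of just that piece. It is not the code above: the names UrlFrontier, crawl, fetch_links and FAKE_SITE are made up for illustration, the "pages" are a hard-coded dictionary rather than real Baidu Baike pages, and it uses a deque plus a set (so pages come out in breadth-first order) instead of the two sets used in the listing.

```python
from collections import deque


class UrlFrontier:
    """Hypothetical stand-in for UrlManager: queue of pending URLs plus a seen-set."""

    def __init__(self):
        self.pending = deque()  # FIFO queue, so the crawl is breadth-first
        self.seen = set()       # every URL ever queued, crawled or not

    def add(self, url):
        # Ignore empty values and anything already queued or visited.
        if not url or url in self.seen:
            return
        self.seen.add(url)
        self.pending.append(url)

    def add_many(self, urls):
        for url in urls or ():
            self.add(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.popleft()


def crawl(root_url, fetch_links, limit):
    """Visit at most `limit` pages starting from root_url.

    fetch_links(url) is assumed to return an iterable of URLs found on that page.
    """
    frontier = UrlFrontier()
    frontier.add(root_url)
    visited = []
    while frontier.has_next() and len(visited) < limit:
        url = frontier.next()
        visited.append(url)
        frontier.add_many(fetch_links(url))
    return visited


# A tiny in-memory "site" standing in for real pages (made-up data).
FAKE_SITE = {
    "/item/A": ["/item/B", "/item/C", "/item/A"],
    "/item/B": ["/item/A", "/item/C"],
    "/item/C": ["/item/D"],
    "/item/D": [],
}

if __name__ == "__main__":
    order = crawl("/item/A", lambda u: FAKE_SITE.get(u, []), limit=3)
    print(order)  # ['/item/A', '/item/B', '/item/C']
```

Each page is visited once even though A, B and C link back to one another, and the limit stops the loop before /item/D is fetched. Swapping the lambda for a function that downloads and parses a real page would give the same control flow as the full crawler.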
