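Scraping Word Entries from Baidu Hanyu

Requirements:

Look up each word in the HSK vocabulary list and scrape its pinyin, definition, and synonyms/near-synonyms/antonyms.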

Language and IDE:

Python
PyCharm

Target site:

Baidu Hanyu: https://hanyu.baidu.com/

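Analyzing the target page:

The home page itself has no content; you have to run a search, which redirects to a results page.

Press F12 to inspect the page source after JavaScript has run.

Right-click and choose "View page source" to see the raw HTML.

Comparing the two shows that the page loaded after the search redirect is static, so there is no need to reverse-engineer any requests or to use the selenium library.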

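P.S. What F12 shows is the DOM as the browser has built it, which is not necessarily identical to the raw HTML source the server sent. That is why the two views have to be compared.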
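The same comparison can be done in code: take the raw HTML, with no browser involved, and check whether the element you need is already in it. Below is a minimal sketch using only the standard library. The helper name `has_element_id` and the sample HTML are made up for illustration; the `id` values simply mirror the ones used in the XPath expressions later in this post.

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects every id attribute found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(raw_html, element_id):
    """Return True if the raw (non-rendered) HTML already contains the id."""
    collector = _IdCollector()
    collector.feed(raw_html)
    return element_id in collector.ids


# Hypothetical raw HTML, standing in for what urllib would download.
sample = '<html><body><div id="synonym"><a>x</a></div></body></html>'
print(has_element_id(sample, "synonym"))
print(has_element_id(sample, "antonym"))
```

If the element is present in the raw download, plain HTTP requests are enough; if it only appears in the F12 view, the content is filled in by JavaScript.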

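Crawling approach

1. Fetch the page
2. Create a dictionary to hold the scraped data
3. Save the dictionary as a JSON file so that it is easy to import into MySQL
4. Connect to the database and use a for loop to crawl the common HSK words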
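Put together, the four steps form a small pipeline. The sketch below shows only the control flow. `crawl`, `fetch`, `parse`, and `save` are hypothetical names, and the stand-in functions in the usage part replace the real network, XPath, and file code so that it runs offline.

```python
def crawl(words, fetch, parse, save):
    """Run fetch -> parse -> save for each word; return the words that failed."""
    failed = []
    for kw in words:                  # step 4: loop over the word list
        try:
            html = fetch(kw)          # step 1: get the page
            record = parse(kw, html)  # step 2: build the dictionary
            save(record)              # step 3: store it
        except Exception:
            failed.append(kw)         # e.g. the site has no entry for this word
    return failed


# Stand-in functions so the sketch runs without network or database access.
def fake_fetch(kw):
    if kw == "missing":
        raise ValueError("no entry")
    return "<p>" + kw + "</p>"


def fake_parse(kw, html):
    return {"关键词": kw, "释义": [html]}


saved = []
failed = crawl(["你好", "missing", "谢谢"], fake_fetch, fake_parse, saved.append)
print(len(saved), failed)
```

Collecting the failed words instead of silently skipping them makes it possible to retry them later.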

Implementation

Import the required packages

import urllib.request
from lxml import etree
from urllib.parse import urlencode, unquote
import requests
import re
import json
import time
import pymysql

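1. Fetch the page

def Net(url, headers):
    # Download the page and return the decoded HTML; on failure, wait 10 seconds and retry
    try:
        request = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(request, timeout=0.7).read().decode("utf8")
        return html
    except Exception:
        time.sleep(10)
        return Net(url, headers)  # return the result of the retry to the caller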
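Retrying by recursion never gives up: a page that keeps failing makes the function recurse until it runs into Python's recursion limit. One alternative is a loop with a bounded number of attempts. This is a sketch rather than the code used in this post. `fetch_with_retry`, `attempts`, and `delay` are assumed names, and the downloader is passed in as a callable so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            time.sleep(delay)  # wait before the next attempt
    raise last_error


# A stand-in downloader that fails twice, then succeeds.
calls = {"n": 0}


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


print(fetch_with_retry(flaky, "https://hanyu.baidu.com/s"))
```

In real use, `fetch` would wrap the `urllib.request.urlopen` call and `delay` would be set to a few seconds.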

def get_baidu_page(kw, url):
    # Fetch the HTML page
    # Simulated request headers. Accept-Encoding is left out because urllib does not
    # decompress gzip/br responses automatically, which would break .decode("utf8").
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    }
    html = Net(url, headers)
    content = etree.HTML(str(html))

2. Create the dictionary

    dic1 = {}
    # The word list contains both single characters and multi-character words, and the
    # site lays the two kinds of page out differently, so branch on the length.
    if len(kw) == 1:
        link_list_pinyin = content.xpath('//div[@class="pronounce"]//b/text()')  # pinyin
        link_list1 = content.xpath('//div//p/text()')  # detailed information
        link_synonym = content.xpath('//div[@id="synonym"]//a/text()')  # synonyms
        link_antonym = content.xpath('//div[@id="antonym"]//a/text()')  # antonyms
        link_redical = content.xpath('//li[@id="radical"]/span/text()')  # radical
        link_stroke = content.xpath('//li[@id="stroke_count"]/span/text()')  # stroke count
        link_content = content.xpath('//div[@class="tab-content"]/a/text()')  # related words
        dic1["关键词"] = kw  # keyword
        dic1['拼音'] = link_list_pinyin  # pinyin
        dic1["释义"] = link_list1  # definition
        dic1["近义词"] = link_synonym  # synonyms
        dic1["反义词"] = link_antonym  # antonyms
        dic1["部首"] = link_redical  # radical
        dic1["笔画"] = link_stroke  # stroke count
        dic1["相关组词"] = link_content  # related words
    else:
        # Get the detailed information
        link_list1 = content.xpath('//div//p/text()')
        link_list_pinyin = content.xpath('//div/dl/dt[@class="pinyin"]/text()')  # pinyin
        link_synonym = content.xpath('//div[@id="synonym"]//a/text()')  # synonyms
        link_antonym = content.xpath('//div[@id="antonym"]//a/text()')  # antonyms
        dic1["关键词"] = kw
        dic1['拼音'] = link_list_pinyin
        dic1["释义"] = link_list1
        dic1["近义词"] = link_synonym
        dic1["反义词"] = link_antonym
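XPath `text()` queries return lists of raw strings, and those can carry stray whitespace or be empty altogether. A small cleanup helper can normalize the lists before they go into the dictionary. The name `clean_texts` and the sample data below are invented for illustration.

```python
def clean_texts(texts):
    """Strip whitespace from each string and drop the ones left empty."""
    cleaned = []
    for text in texts:
        stripped = text.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


# Sample data shaped like an XPath text() result (invented for illustration).
raw = ["\n  [ nǐ hǎo ]  ", "   ", "\n"]
print(clean_texts(raw))
```

It would be applied at assignment time, for example `dic1['拼音'] = clean_texts(link_list_pinyin)`.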

3. Save to a file

def save_file(dic):
    # Write the dictionary to the file
    json_str = json.dumps(dic, ensure_ascii=False, indent=4)
    with open("result.json", "a", encoding="utf8") as file1:
        file1.write(json_str)
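Appending pretty-printed objects to one file in `"a"` mode leaves `result.json` as a run of concatenated objects, which `json.load` cannot read back in a single call. If that matters for the later import step, one alternative is to write one compact JSON object per line (the JSON Lines convention). The helper names, the `.jsonl` file name, and the sample records below are assumptions made for this sketch.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one record as a single line of JSON."""
    with open(path, "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read every record back into a list of dictionaries."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage with a temporary file and invented sample records.
path = os.path.join(tempfile.mkdtemp(), "result.jsonl")
append_jsonl(path, {"关键词": "你好", "拼音": ["[ nǐ hǎo ]"]})
append_jsonl(path, {"关键词": "谢谢", "拼音": ["[ xiè xie ]"]})
records = read_jsonl(path)
print(len(records), records[0]["关键词"])
```

Each line is then a complete JSON document, so a crawl that is interrupted and resumed still leaves a readable file.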

4. Connect to the database and crawl in a loop

# Connect to the database
conn = pymysql.connect(host="localhost", user="root", password="123456", database="sys")
cursor = conn.cursor()
sql = "select WORD from bucong"
cursor.execute(sql)
results = cursor.fetchall()
# kw = input("Enter the keyword to search for: ")
for row in results[426:]:
    kw = row[0]
    print(kw)
    word = {"wd": kw}
    key = urllib.parse.urlencode(word)
    url = "https://hanyu.baidu.com/s"
    fullurl = url + "?" + key + "&ptype=zici"
    # Baidu Hanyu has no entry for some characters and words, and the program raises
    # an exception for them, so the call is wrapped in exception handling.
    try:
        get_baidu_page(kw, fullurl)
    except Exception:
        pass
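The query string can also be built in a single `urlencode` call instead of concatenating `&ptype=zici` by hand. A minimal sketch follows. `build_url` is a hypothetical helper, while the base URL and the `wd` and `ptype` parameters are the ones used above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "https://hanyu.baidu.com/s"


def build_url(kw):
    """Return the search URL for one keyword, percent-encoding it as UTF-8."""
    return BASE_URL + "?" + urlencode({"wd": kw, "ptype": "zici"})


url = build_url("你好")
print(url)
# Decoding the query string recovers the original keyword.
print(parse_qs(urlsplit(url).query)["wd"][0])
```

Passing all parameters through `urlencode` keeps the escaping in one place if more parameters are added later.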

Complete code

"""Part 1: Baidu word scraper"""
import urllib.request
from lxml import etree
from urllib.parse import urlencode, unquote
import requests
import re
import json
import time
import pymysql


def digui(url, headers):
    # Download the page and return the decoded HTML; on failure, wait 10 seconds and retry
    try:
        request = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(request, timeout=0.7).read().decode("utf8")
        return html
    except Exception:
        time.sleep(10)
        return digui(url, headers)  # return the result of the retry to the caller


def get_baidu_page(kw, url):
    """Fetch the HTML page, extract the fields, and save them."""
    # Dictionary that holds the data we want
    dic1 = {}
    # Simulated request headers. Accept-Encoding is left out because urllib does not
    # decompress gzip/br responses automatically, which would break .decode("utf8").
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    }
    html = digui(url, headers)
    content = etree.HTML(str(html))
    if len(kw) == 1:
        # Single character: the page layout differs from the one used for words
        link_list_pinyin = content.xpath('//div[@class="pronounce"]//b/text()')  # pinyin
        link_list1 = content.xpath('//div//p/text()')  # detailed information
        link_synonym = content.xpath('//div[@id="synonym"]//a/text()')  # synonyms
        link_antonym = content.xpath('//div[@id="antonym"]//a/text()')  # antonyms
        link_redical = content.xpath('//li[@id="radical"]/span/text()')  # radical
        link_stroke = content.xpath('//li[@id="stroke_count"]/span/text()')  # stroke count
        link_content = content.xpath('//div[@class="tab-content"]/a/text()')  # related words
        dic1["关键词"] = kw
        dic1['拼音'] = link_list_pinyin
        dic1["释义"] = link_list1
        dic1["近义词"] = link_synonym
        dic1["反义词"] = link_antonym
        dic1["部首"] = link_redical
        dic1["笔画"] = link_stroke
        dic1["相关组词"] = link_content
    else:
        # Multi-character word: get the detailed information
        link_list1 = content.xpath('//div//p/text()')
        link_list_pinyin = content.xpath('//div/dl/dt[@class="pinyin"]/text()')  # pinyin
        link_synonym = content.xpath('//div[@id="synonym"]//a/text()')  # synonyms
        link_antonym = content.xpath('//div[@id="antonym"]//a/text()')  # antonyms
        dic1["关键词"] = kw
        dic1['拼音'] = link_list_pinyin
        dic1["释义"] = link_list1
        dic1["近义词"] = link_synonym
        dic1["反义词"] = link_antonym
    save_file(dic1)


def save_file(dic):
    # Write the dictionary to the file
    json_str = json.dumps(dic, ensure_ascii=False, indent=4)
    with open("result.json", "a", encoding="utf8") as file1:
        file1.write(json_str)


if __name__ == "__main__":
    # Read the keywords from the database and build the URL for each one
    # Connect to the database
    conn = pymysql.connect(host="localhost", user="root", password="123456", database="sys")
    cursor = conn.cursor()
    sql = "select WORD from bucong"
    cursor.execute(sql)
    results = cursor.fetchall()
    # kw = input("Enter the keyword to search for: ")
    for row in results[426:]:
        kw = row[0]
        print(kw)
        word = {"wd": kw}
        key = urllib.parse.urlencode(word)
        url = "https://hanyu.baidu.com/s"
        fullurl = url + "?" + key + "&ptype=zici"
        # Some characters and words have no entry on the site; skip them on error
        try:
            get_baidu_page(kw, fullurl)
        except Exception:
            pass

Result
