Baidu crawler with selenium + beautifulsoup: code notes for a Baidu search-keyword crawler

Import modules

from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
import openpyxl
from openpyxl import load_workbook
import re
from bs4 import BeautifulSoup

Keyword list

# Keyword list
kws=["人工智能透明","算法透明","推荐算法透明","推送透明","黑箱","算法黑箱","推荐算法黑箱","推送黑箱","算法公开","算法可解释"]
kw=kws[0]

# First, start the browser
driver=webdriver.Chrome()
#driver.get(ur[0])
driver.get('https://www.baidu.com/s?ie=UTF-8&wd=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD%E9%80%8F%E6%98%8E')
driver.find_element(By.XPATH,"/html/body/div[2]/div[2]/div/div/a[5]").click()  # click the 资讯 (News) tab
time.sleep(5)
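
The hard-coded search URL above carries the first keyword in percent-encoded UTF-8 form. A minimal sketch of building the same URL from any keyword with the standard library is shown here; `build_search_url` and `BAIDU_SEARCH` are hypothetical names introduced for illustration and are not part of the original script.

```python
from urllib.parse import quote

# Base of the search URL used in the script above
BAIDU_SEARCH = "https://www.baidu.com/s?ie=UTF-8&wd="

def build_search_url(keyword):
    """Return the Baidu search URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the keyword
    return BAIDU_SEARCH + quote(keyword)

url = build_search_url("人工智能透明")
print(url)
```

With a helper like this, the first keyword could be loaded with `driver.get(build_search_url(kws[0]))` instead of pasting an encoded URL by hand.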

# Run (starts at index 1: kws[0] is already loaded by the driver.get call above)
for ikw in range(1,len(kws)):
    kw=kws[ikw]
    print(kw)
    wb = openpyxl.Workbook()        # create an Excel workbook
    sheet = wb.active               # get the active worksheet
    sheet.title = kw
    wb.save(r"baidunews-{}.xlsx".format(kw))
    driver.switch_to.window(driver.window_handles[-1])  # go back to the search page
    search_button=driver.find_element(By.ID,"kw")
    search_button.clear()
    search_button.send_keys(kw)  # fill in the search box
    driver.find_element(By.ID,"su").click()  # search
    time.sleep(5)
    datamining(wb,kw,driver)

# Single test run (datamining must be defined and a workbook wb created first)
datamining(wb,kw,driver)

# Main function
def datamining(wb,kw,driver):
    sheet = wb.active
    # Header row: date, title, source, summary, URL
    sheet.cell(1,1).value="日期"
    sheet.cell(1,2).value="标题"
    sheet.cell(1,3).value="文章来源"
    sheet.cell(1,4).value="简述"
    sheet.cell(1,5).value="URL"
    current_row=sheet.max_row+1
    for x in range(100):
        # Work on the current page
        driver.switch_to.window(driver.window_handles[-1])
        soup = BeautifulSoup(driver.page_source,"html.parser")
        # Find the result elements on this page
        linkElems = soup.select('div.c-container')
        if not linkElems:
            print("No results found on this page")
            return 0
        if x==0:
            former_linkElems=0
        if former_linkElems==linkElems:
            print("Two consecutive pages are identical")
            return 0
        Elems_title_list=[]
        this_page_case_list=[]
        Elems_title_list.append(linkElems[0].get_text().strip())
        this_page_case_list.append(linkElems[0])
        for i in range(1,len(linkElems)):
            title=linkElems[i].get_text().strip()
            # Skip ads ("广告") and "others also searched" ("大家还在搜") blocks
            if re.search("广告",title) or re.search("大家还在搜",title):
                continue
            if title not in Elems_title_list:
                Elems_title_list.append(title)
                this_page_case_list.append(linkElems[i])
        # Number of cases on this page
        this_page_case_num=len(this_page_case_list)
        print("Page "+str(x+1)+": "+str(this_page_case_num)+" cases on this page")
        for i in range(this_page_case_num):
            # Title
            case_title=Elems_title_list[i]
            # Time
            try:
                case_time = this_page_case_list[i].select('span.c-color-gray2')
                case_time=case_time[0].get_text().strip()
            except IndexError:
                case_time="NaN"
            # Source
            try:
                case_sourse = this_page_case_list[i].select('span.c-color-gray')
                case_sourse=case_sourse[0].get_text().strip()
            except IndexError:
                case_sourse="NaN"
            # Summary
            try:
                case_short = this_page_case_list[i].select('span.content-right_8Zs40')
                case_short=case_short[0].get_text().strip()
            except IndexError:
                case_short="NaN"
            # URL
            urls=this_page_case_list[i].find_all('a', href=True,target="_blank")
            case_url=urls[0]['href'] if urls else "NaN"
            # Write the row in the same order as the header (date first, then title)
            sheet.cell(current_row,1).value=case_time
            sheet.cell(current_row,2).value=case_title
            sheet.cell(current_row,3).value=case_sourse
            sheet.cell(current_row,4).value=case_short
            sheet.cell(current_row,5).value=case_url
            wb.save(r"E:\桌面备份\武大帅爬虫任务\baidunews-{}.xlsx".format(kw))
            print(kw+": saved up to row "+str(current_row))
            current_row=sheet.max_row+1
        # Go to the next page
        former_linkElems=linkElems
        try:
            button=driver.find_element(By.XPATH,"/html/body/div/div[3]/div[2]/div/a[last()]")
        except Exception:
            # Retry the same lookup once if the first attempt fails
            button=driver.find_element(By.XPATH,"/html/body/div/div[3]/div[2]/div/a[last()]")
        if re.search("下一页",button.get_attribute('innerHTML')):
            button.click()
            time.sleep(6)
        else:
            print("No next-page button; stopping the crawl")
            return 0
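
The per-page filtering in datamining (drop entries whose text contains "广告" or "大家还在搜", then drop repeated texts) can be tried in isolation on plain strings. The following is only a sketch under assumptions: `keep_unique_results` and `SKIP_PATTERN` are hypothetical names, the sample strings are made up, and unlike the function above it also applies the ad check to the first entry.

```python
import re

# Markers for ads and "others also searched" blocks, as used in datamining
SKIP_PATTERN = re.compile("广告|大家还在搜")

def keep_unique_results(texts):
    """Return texts without ad-like entries or duplicates, keeping order."""
    seen = set()
    kept = []
    for text in texts:
        text = text.strip()
        # Skip unwanted blocks and anything already collected
        if SKIP_PATTERN.search(text) or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept

# Made-up sample texts standing in for the text of div.c-container elements
sample = ["算法透明 报道A", "广告 某产品", "算法透明 报道A", "大家还在搜 算法", " 算法黑箱 报道B "]
result = keep_unique_results(sample)
print(result)
```

Using a set for the seen texts keeps the duplicate check cheap, while the separate list preserves the order in which results appeared on the page.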
