Scraping novels with Python: headless browser + automated crawler

For learning purposes only. No commercial use, and please do not repost without permission.

First, find the site's search endpoint, its request method, and its request parameters. In this case the method is POST.

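The request parameters are searchtype (set to articlename) and searchkey (the novel title to search for).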
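As a minimal sketch of what that form looks like, the snippet below builds the two fields and prints the URL-encoded body. The helper name build_search_payload and the keyword "example" are made up for illustration. In the script later in this post the same dictionary is passed to requests.post; the real search URL is withheld by the author, so no request is sent here.

```python
from urllib.parse import urlencode


def build_search_payload(keyword: str) -> dict:
    """Return the form fields for the search POST (hypothetical helper)."""
    return {"searchtype": "articlename", "searchkey": keyword}


# "example" is a placeholder keyword, not a real novel title.
payload = build_search_payload("example")

# Show what a form-encoded body for this payload looks like on the wire.
encoded_body = urlencode(payload)
print(encoded_body)
```

Passing a dict as the data argument of requests.post makes requests form-encode it in the same way.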

Next, get the detail (introduction) page of the matching novel.
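The detail-page link and the novel title are both pulled out of each search-result anchor with regular expressions. The sketch below shows that step on a made-up anchor: SAMPLE_ANCHOR, the example.com URLs, and the helper parse_search_result are assumptions for illustration, and the real markup on the target site may differ. It anchors the patterns on the href and alt attributes, which is a slightly stricter variant of the patterns used in the script.

```python
import re

# Hypothetical search-result anchor; the real site's markup may differ.
SAMPLE_ANCHOR = (
    '<a href="http://example.com/book/123.html">'
    '<img alt="示例小说" src="http://example.com/cover/123.jpg"/></a>'
)

# Detail-page link: an absolute URL ending in .html inside href="...".
LINK_PATTERN = re.compile(r'href="(http[^"]+\.html)"')
# Novel title: the alt text of the cover image.
NAME_PATTERN = re.compile(r'alt="([^"]+)"')


def parse_search_result(anchor_html: str):
    """Return (detail_link, novel_name), or None if either part is missing."""
    link = LINK_PATTERN.search(anchor_html)
    name = NAME_PATTERN.search(anchor_html)
    if link is None or name is None:
        return None
    return link.group(1), name.group(1)


result = parse_search_result(SAMPLE_ANCHOR)
print(result)
```

Returning None for anchors that do not match avoids the IndexError that indexing an empty findall result would raise.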

For the required classes and how to obtain the browser driver, see:

"Python selenium4: scraping with a headless browser and storing the results in MySQL" (fuchto's blog, CSDN): https://blog.csdn.net/fuchto/article/details/124480885?spm=1001.2014.3001.5502
The browser driver must be downloaded for the matching browser version; the current version is shown in the browser's settings. selenium on PyPI: https://pypi.org/project/selenium/
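A quick way to check that a downloaded driver fits the installed browser is to compare version numbers. The sketch below assumes that matching the major version (the first number) is enough, which is the usual rule of thumb for Chrome and ChromeDriver. The helper names and the version strings are made up for illustration.

```python
def major_version(version: str) -> int:
    """Return the first numeric component of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Placeholder version strings; read the real ones from the browser settings
# and from the driver download page.
print(driver_matches_browser("101.0.4951.54", "101.0.4951.41"))
print(driver_matches_browser("101.0.4951.54", "100.0.4896.60"))
```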

Without further ado, here is the code:

import requests
from bs4 import BeautifulSoup
import re
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# The Select class is needed for <select> drop-downs
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
import math
import json


def bubbleSort(arr):
    """Sort the list in place in ascending order and return it."""
    n = len(arr)
    # Walk over every element
    for i in range(n):
        # The last i elements are already in place
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


def str_replaces(arr):
    """Replace the ".html" suffix of every link with a space (stripped later)."""
    arrs = []
    for item in arr:
        arrs.append(item.replace(".html", ' '))
    return arrs


def start():
    fiction = input("Enter the novel you want to fetch: ")
    # Search for the novel (POST request).
    # The author does not publish the real address.
    search_url = "contact the author privately for the request URL"
    # Request parameters
    search_data = {'searchtype': 'articlename', 'searchkey': fiction}
    # Send the request
    response = requests.post(search_url, search_data)
    # Set the encoding of the response body
    response.encoding = 'utf-8'
    # print(response.text)
    # Parse the response HTML
    soup = BeautifulSoup(response.text, "html.parser")
    # Select the <a> tags of the search results
    fiction_a_link = soup.select("#content > div > ul > a")
    # Absolute path of the directory containing this script
    basedir = os.path.abspath(os.path.dirname(__file__))
    # Loop over the novels that were found
    for item in fiction_a_link:
        # Regex: everything from "http" through "html"
        pattern = re.compile('http.+html')
        fiction_link = pattern.findall(str(item))
        # Novel title
        fiction_pattern = re.compile(r'alt="(\w+)"')
        fiction_name = fiction_pattern.findall(str(item))
        print("\nNovel title: " + fiction_name[0])
        # Directory for this novel, next to the script
        novel_dir = os.path.join(basedir, fiction_name[0])
        if not os.path.exists(novel_dir):
            print("\nCreating directory " + novel_dir)
            # Create the full path, so it does not depend on the working directory
            os.mkdir(novel_dir)
        else:
            print("\nDirectory already exists")
        # Fetch the novel's detail page
        print("\nDetail page URL: " + fiction_link[0])
        details_response = requests.get(fiction_link[0])
        details_response.encoding = 'utf-8'
        details_soup = BeautifulSoup(details_response.text, "html.parser")
        fiction_list = details_soup.select("#content > div.articleInfo > div.articleInfoRight > ol > p.right > a")
        print("\nChapter list link")
        print(fiction_list)
        list_pattern = re.compile(r'href=\"(.+?)\"')
        str_fiction_list = str(fiction_list[0])
        list_link = list_pattern.findall(str_fiction_list)
        print("\nChapter list request URL")
        print(list_link)
        lists_response = requests.get(str(list_link[0]))
        lists_response.encoding = lists_response.apparent_encoding
        lists_soup = BeautifulSoup(lists_response.text, 'html.parser')
        lists_html = lists_soup.select("#newlist")
        lists_pattern = re.compile(r'href=\"(.+?)\"')
        # Chapter links
        chapter_link = lists_pattern.findall(str(lists_html))
        # Remove the ".html" suffix
        chapter_link = str_replaces(chapter_link)
        # Sort ascending (the values are compared as strings)
        chapter_link = bubbleSort(chapter_link)
        for value in chapter_link:
            # if int(value) >= 189883:
            # Chapter detail URL
            chapter_details_link = str(list_link[0]) + value.strip() + ".html"
            # Visit the page with a headless browser
            chrome_options = Options()
            chrome_options.add_argument('--headless')
            chrome_options.add_argument('--disable-gpu')
            # Raw string, so the backslashes in the Windows path are kept literally
            s = Service(r"D:\pythonVendor\chrome\chromedriver.exe")
            driver = webdriver.Chrome(service=s, options=chrome_options)
            # Open the page and grab the rendered HTML
            driver.get(chapter_details_link)
            chapter_soup = BeautifulSoup(driver.page_source, 'html.parser')
            # Shut the browser down as soon as the page source has been read
            driver.quit()
            title_html = chapter_soup.select("body > div.readerListBody > div.readerTitle")
            title_pattern = re.compile(r"<h1>(.+?)</h1>")
            title_name = title_pattern.findall(str(title_html))
            print("\nChapter title: " + str(title_name[0]))
            # Chapter content
            content = chapter_soup.select("#content")
            # Drop the <p data-id="99" ...> paragraphs
            content_pattern = re.compile(r'<p data-id="99" .+?>(.+?)</p>', re.S)
            content = content_pattern.sub('', str(content[0]))
            # Replace the remaining tags with line breaks
            tag_pattern = re.compile(r'<[^>]+>', re.S)
            content = tag_pattern.sub('\r\n', content)
            # Remove the site's "latest URL" notice and the spaces
            content = content.replace('最新网址：www.umiwx.com', '')
            content = content.replace(' ', '')
            # Build the chapter file path; "：" and "|" are replaced with spaces
            chapter_name = str(title_name[0]).replace("：", ' ').replace("|", ' ')
            chapter_path = os.path.join(novel_dir, chapter_name + ".txt")
            with open(chapter_path, "w", encoding="utf-8") as chapter:
                chapter.write(content)
            print("\nChapter " + str(title_name[0]) + " saved")
            # Wait between chapters to avoid hammering the site
            time.sleep(30)
        print("Novel " + fiction_name[0] + " has been fully scraped")


if __name__ == "__main__":
    start()
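Two details of the script are worth a closer look. The chapter IDs are sorted as strings, so IDs with different digit counts would come out in the wrong order ("100" sorts before "99"). And only "：" and "|" are replaced in chapter titles, while Windows rejects several other characters in file names. The sketch below shows one possible way to handle both. The helper names sort_chapter_ids and safe_filename are made up for illustration, and the sketch assumes chapter IDs are plain decimal numbers, as the commented-out int(value) check in the script suggests.

```python
import re


def sort_chapter_ids(ids):
    """Sort chapter IDs by numeric value; non-numeric entries are dropped."""
    numeric = [s.strip() for s in ids if s.strip().isdecimal()]
    return sorted(numeric, key=int)


# Characters Windows does not allow in file names, plus the full-width colon
# that the script already replaces.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|：]')


def safe_filename(title: str, replacement: str = " ") -> str:
    """Replace characters that are invalid in Windows file names."""
    return _FORBIDDEN.sub(replacement, title).strip()


# Plain string sorting versus numeric sorting of the same IDs.
print(sorted(["100", "99"]))
print(sort_chapter_ids(["100 ", "99 ", "index"]))
print(safe_filename("第1章：开始|上"))
```

Dropping non-numeric entries is a design choice for this sketch: it keeps stray links such as an index page out of the chapter loop, at the cost of silently skipping anything that is not a number.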
