Python in Practice: Scraping a Single Novel from Xuanyu Reading Network (轩宇阅读网)

import os
import re

import requests
from bs4 import BeautifulSoup

# Baiyexing (白夜行)
# URL = 'https://www.xyyuedu.com/wgmz/dongyeguiwu/baiyexingxs/'
# DIR = r'D:\test\0602\baiyexing'
# Hongloumeng (红楼梦)
# URL = 'https://www.xyyuedu.com/gdmz/sidamingzhu/hlmeng/index.html'
# DIR = r'D:\test\0602\hongloumeng'
# Guoqu Wo Siqu de Jia (过去我死去的家)
# URL = 'https://www.xyyuedu.com/wgmz/dongyeguiwu/guoquwosiqudejia/index.html'
# DIR = r'D:\test\0602\gqwsqdj'
# Yingliechuan (英烈传)
URL = 'https://www.xyyuedu.com/gdmz/yingliechuan/index.html'
DIR = r'D:\test\mingzhu\中国古典文学\英烈传'


def getHomeHtml(url, isGetStatusCode=False):
    '''
    Request a page.
    :param url: page URL
    :param isGetStatusCode: if True, return only the HTTP status code
    :return: the page HTML decoded as GBK, or the status code
    '''
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    if isGetStatusCode:
        return response.status_code
    # The site serves GBK-encoded pages; undecodable bytes are dropped.
    return response.content.decode('gbk', 'ignore')


def getChapterUrlList(homeHtml):
    '''
    Collect the URLs of every chapter.
    :param homeHtml: HTML of the novel's index page
    :return: dict mapping "<index><chapter title>" to that chapter's list of page URLs
    '''
    dict_chapter = {}
    # Chapter links are nested either three or two directories deep.
    pattern = r'<a href="(/[A-Za-z]+/[A-Za-z]+/[A-Za-z]+/[0-9]+\.html)"'
    urlList = re.findall(pattern, homeHtml)
    if len(urlList) == 0:
        pattern = r'<a href="(/[A-Za-z]+/[A-Za-z]+/[0-9]+\.html)"'
        urlList = re.findall(pattern, homeHtml)
    urlList_re = []
    for url in urlList:
        url = 'https://www.xyyuedu.com' + url
        urlList_sub = [url]
        # A chapter may continue on xxx_2.html ... xxx_9.html; probe until a page is missing.
        for i in range(2, 10):
            url_next = url.replace('.html', '') + '_%s' % i + '.html'
            statusCode = getHomeHtml(url_next, isGetStatusCode=True)
            print('%s   %s' % (url_next, statusCode))
            if statusCode != 200:
                break
            urlList_sub.append(url_next)
        urlList_re.append(urlList_sub)
    pattern_chapter = r'title="(.+?)"   target="_blank"'
    chapterList = re.findall(pattern_chapter, homeHtml)
    # zip stops at the shorter list, so a title/link count mismatch cannot raise IndexError.
    for i, (chapter, urls) in enumerate(zip(chapterList, urlList_re)):
        # '?' is not allowed in Windows file names, so strip it from the title.
        dict_chapter[str(i + 1) + chapter.replace('?', '')] = urls
    return dict_chapter


def saveChapterText(name, urlList):
    '''
    Append the text of each page of a chapter to a txt file.
    :param name: chapter name, used as the file name
    :param urlList: URLs of the pages that make up the chapter
    :return:
    '''
    for url in urlList:
        chapterHtml = getHomeHtml(url)
        # Name the parser explicitly so the result does not depend on what is installed.
        bf = BeautifulSoup(chapterHtml, 'html.parser')
        if writeDown(name, bf.find_all('p')):
            print('Downloaded %s' % name)
            continue
        print('No text found, switching to the second parsing mode')
        if writeDown_mode2(name, bf.find_all('div', id='onearcxsbd')):
            print('Downloaded %s' % name)
        else:
            print('Failed: no text found')


def writeDown_mode2(name, chapterTextList):
    '''
    Second parsing mode: the chapter text is stored in one enclosing <div>.
    :param name: chapter name
    :param chapterTextList: list of <div> tags
    :return: True if any text was written
    '''
    isMatch = False
    for text in chapterTextList:
        text = str(text).replace('<p>', '').replace('</p>', '').replace('<br/>', '') \
            .replace('&lt', '').replace(' ', '').split('<!--分页-->')[0] \
            .replace('<divclass="onearcxsbd"id="onearcxsbd">', '')
        if len(text) != 0:
            writeInText(name, text + '\r\n')
            isMatch = True
    return isMatch


def writeDown(name, chapterTextList):
    '''
    First parsing mode: the chapter text is stored in <p> tags.
    :param name: chapter name
    :param chapterTextList: list of <p> tags
    :return: True if any text was written
    '''
    # Paragraphs containing any of these strings are site boilerplate, not novel text.
    skipWords = ('微信扫码关注', '互联网信息管理办法', '声明', '分页', '开始', '回目录', '轩宇阅读网')
    isMatch = False
    for text in chapterTextList:
        text = str(text).replace('<p>', '').replace('</p>', '').replace('<br/>', '') \
            .replace('&lt', '').replace(' ', '').split('<!--分页-->')[0]
        if len(text) != 0 and not any(word in text for word in skipWords):
            writeInText(name, text + '\r\n')
            isMatch = True
    return isMatch


def writeInText(name, text):
    '''
    File operation: append text to <DIR>/<name>.txt.
    :param name: chapter name
    :param text: text to append
    :return:
    '''
    fileName = os.path.join(DIR, '%s.txt' % name)
    print('Writing to file %s' % fileName)
    with open(fileName, 'a+', encoding='utf-8') as fb:
        fb.write(text)


def reptileNovel(url, dir):
    '''
    Entry point for external callers.
    :param url: URL of the novel's index page
    :param dir: output directory (created if missing)
    :return:
    '''
    global URL
    global DIR
    DIR = dir
    URL = url
    os.makedirs(dir, exist_ok=True)
    homeHtml = getHomeHtml(url)
    dict_chapter = getChapterUrlList(homeHtml)
    for chapterName in dict_chapter.keys():
        # print(chapterName, dict_chapter[chapterName])
        saveChapterText(chapterName, dict_chapter[chapterName])


if __name__ == '__main__':
    reptileNovel(URL, DIR)

Running the script produces the result shown in the figure below.
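
The two string-handling steps in the script, pulling chapter links out of the index page and building the candidate URLs for a chapter's continuation pages, can be tried without touching the network. What follows is only a minimal sketch: `SAMPLE_HTML` is a made-up fragment, not real markup from the site, and the helper names `extract_chapter_urls` and `candidate_page_urls` are hypothetical and do not appear in the script above.

```python
import re

BASE = 'https://www.xyyuedu.com'

# Hypothetical index-page fragment, invented for illustration only.
SAMPLE_HTML = '''
<a href="/gdmz/yingliechuan/101.html" title="Chapter One"   target="_blank">Chapter One</a>
<a href="/gdmz/yingliechuan/102.html" title="Chapter Two"   target="_blank">Chapter Two</a>
'''

# Same shape as the two-level pattern in the script, with the dot escaped.
LINK_PATTERN = re.compile(r'<a href="(/[A-Za-z]+/[A-Za-z]+/[0-9]+\.html)"')


def extract_chapter_urls(html):
    """Return absolute chapter URLs found in an index-page fragment."""
    return [BASE + path for path in LINK_PATTERN.findall(html)]


def candidate_page_urls(first_page_url, max_pages=9):
    """Return the first page followed by its possible continuation pages.

    Assumes first_page_url ends with '.html'. Continuation pages follow the
    xxx_2.html ... xxx_<max_pages>.html convention that the script probes.
    """
    stem = first_page_url[:-len('.html')]
    continuation = ['%s_%d.html' % (stem, i) for i in range(2, max_pages + 1)]
    return [first_page_url] + continuation


chapter_urls = extract_chapter_urls(SAMPLE_HTML)
print(chapter_urls)
print(candidate_page_urls(chapter_urls[0])[:3])
```

In the script itself each candidate URL is requested in turn and probing stops at the first non-200 status code, so only the pages that actually exist end up in the chapter's URL list.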
