Python crawler: scraping novels from Qidian

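Another crawler written in Python, this time mainly scraping novels in the xuanhuan (fantasy) category.
Target site: Qidian
Modules used: bs4 and os (the code below also imports requests and time)
Basic approach:
Fetch the HTML of the target page, load it into a BeautifulSoup (bs4) object, and work on it from there.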
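
The approach above can be sketched on a small inline HTML string, so that it runs without touching the network. The markup below is made up for illustration and only loosely imitates a ranking block; the real Qidian page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration, not copied from qidian.com
SAMPLE_HTML = """
<div class="book-list">
  <ul>
    <li><a href="//www.example.com/book/1/">Book One</a></li>
    <li><a href="//www.example.com/book/2/">Book Two</a></li>
  </ul>
</div>
"""

# Load the markup into a BeautifulSoup object
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
books = [(a.text, a["href"]) for a in soup.select(".book-list a")]

for title, href in books:
    print(title, href)
```

The same two steps (parse, then select) are what the scripts in this post apply to the live pages.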

First, find the endpoint: https://www.qidian.com/xuanhuan. As you can see, the request method is GET.

Fetch the full HTML of the xuanhuan listing page, then load it into a BeautifulSoup object to work on it:

import requests
from bs4 import BeautifulSoup

url = "https://www.qidian.com/xuanhuan"
method = 'get'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "Referer": "https://www.qidian.com"}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
# print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')
# Each .book-list element is one ranking block on the page
xuanhuan = soup.select('.book-list')
print('book-list:', xuanhuan)
number = 0

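The headers are a simple way of getting past some basic anti-scraping checks.
Because there are many pages and links to fetch, it is worth wrapping the BeautifulSoup logic in a class: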

from bs4 import BeautifulSoup
import requests


class soupx:
    def soup(self, method, url):
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "Referer": "https://www.qidian.com"}
        res = requests.request(method, url, headers=headers)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        return soup

Complete code:

import os
from reptile.soup4 import soupx
import time

path = 'D:/xiaoshuo/'
# os.makedirs fails if the directory already exists, so check first
if os.path.exists(path):
    print('Directory already exists')
    flag = 1
else:
    os.makedirs(path)
    flag = 0

url = "https://www.qidian.com/xuanhuan"
method = 'get'
# headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","Referer":"https://www.qidian.com"}
soup = soupx().soup(method=method,url=url)
# The BeautifulSoup logic is wrapped in a class (soupx) and reused from here on
# res = requests.get(url,headers=headers)
# res.encoding = 'utf-8'
# print(res.text)
# soup = BeautifulSoup(res.text,'html.parser')
xuanhuan = soup.select('.book-list')
print('book-list:',xuanhuan)
number = 0
for book in xuanhuan:
    # Each block is one of the daily / weekly / monthly top-ten xuanhuan rankings
    print('book:', book)
    soup1 = book.select('a')
    # Discard the three anchors that follow the first one, as the original does
    soup1.pop(1)
    soup1.pop(1)
    soup1.pop(1)
    number += 1
    print('soup1:', soup1)
    time.sleep(0.5)
    for article in soup1:
        # Get the book title and link
        print('article:', article)
        name = article.text
        herf = article['href']
        herf_article = 'https:' + herf  # the href is protocol-relative, so prepend https:
        print(name, ":", herf_article)
        file = os.path.join(path, name)
        print(file)
        # Get the chapter links from the book page
        article_soup = soupx().soup(method=method, url=herf_article)
        chapter = article_soup.select('.volume')
        print('chapter:', chapter)
        data_list = chapter[0]('li')  # calling a tag is shorthand for find_all
        # print(data_list)
        # Open (or create) the output file; it is closed automatically
        with open(file + '.txt', 'w', encoding='utf-8') as file_name:
            for data in data_list:
                # Get the chapter title and text
                chapter_href = "https:" + data.select('a')[0]['href']
                chapter_soup = soupx().soup(method='get', url=chapter_href)
                chapter_name = chapter_soup.select('.content-wrap')[0].text
                chapter_text = chapter_soup.select('.read-content')[0].text
                file_name.write(chapter_name)
                print(chapter_name)
                file_name.write(chapter_text + '\n')
                time.sleep(0.5)  # pause between requests
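
Two details in the script above are easy to trip over: the links on the page are protocol-relative (they start with //), and a book title may contain characters that Windows does not allow in file names. Below is a minimal sketch of two helpers for these cases, using only the standard library. The names absolute_url and safe_filename, and the book path in the demo call, are hypothetical and not part of the original script.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.qidian.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or protocol-relative href against the base URL."""
    return urljoin(base, href)


def safe_filename(name, replacement="_"):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a placeholder if nothing usable is left
    return cleaned or "untitled"


# Made-up book path, for illustration only
print(absolute_url("//www.qidian.com/book/123/"))
print(safe_filename('Who? Me: a "test"'))
```

Compared with string concatenation, urljoin also copes with hrefs that are already absolute or that start with a single slash.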

Screenshot of partial results:

= = = = = = = = = = = = = = divider = = = = = = = = = = = = = = = = = = =
The following script scrapes company information for Changsha.

import os
import time
import math
from reptile.soup4 import soupx

path = 'D:\\xiaoshuo\\wql'
PAGE_SIZE = 36  # page size the original divides the company total by


def save_companies(company_list, folder):
    # Write each company's name and contact text to <folder>/<company name>.txt
    for company in company_list:
        company_name = company.select('a')[0].text
        company_href = company.select('a')[0]['href']
        print(company_name, ':', company_href)
        path3 = os.path.join(folder, company_name)
        company_message = soupx().soup('get', url=company_href)
        try:
            # The original takes the fourth link in .tab as the contact page
            company_ph_href = company_message.select('.tab')[0]('a')[3]['href']
            company_phone = soupx().soup('get', url=company_ph_href)
            company_text = company_phone.select('.contact')[0].text
        except Exception as e:
            print(e)
        else:
            with open(path3 + '.txt', 'w', encoding='utf-8') as file_name:
                file_name.write(company_name)
                file_name.write(company_text + '\n')


soup = soupx().soup(method='get', url='http://wap.huangye88.com/b2b/dq-changsha/')
title = soup.select('.notop')
for i in title:
    sort_name = i.select('.le')[0].text
    li_sort = i.select('a')
    for a in li_sort:
        li_sort_href = a['href']
        li_sort_name = a.text
        # print(li_sort_name,':',li_sort_href)
        path1 = os.path.join(path, li_sort_name)
        if os.path.exists(path1):
            print('Directory already exists')
        else:
            os.makedirs(path1)
        soup2 = soupx().soup('get', url=li_sort_href)
        li_soup2 = soup2.select('.listwords')[0]('a')
        for c in li_soup2:
            soup2_herf = c['href']
            soup2_name = c.text
            print(soup2_name, ":", soup2_herf)
            soup3 = soupx().soup('get', url=soup2_herf)
            li_soup3 = soup3.select('.listwords')[0]('a')
            for z in li_soup3:
                soup3_herf = z['href']
                soup3_name = z.text
                print(soup3_name, ":", soup3_herf)
                path2 = os.path.join(path1, soup3_name)
                os.makedirs(path2, exist_ok=True)
                soup4 = soupx().soup('get', url=soup3_herf)
                company_list = soup4.select('.com-item')
                number = soup4.select('.total')[0]('span')[0].text
                # First page
                save_companies(company_list, path2)
                # Remaining pages: follow the link from the page just visited.
                # Assumes the first link in .nextPage always points to the next page.
                numb = math.ceil(int(number) / PAGE_SIZE)
                page = soup4
                for num in range(numb - 1):
                    company_hrefs = page.select('.nextPage')[0]('a')[0]['href']
                    page = soupx().soup('get', company_hrefs)
                    save_companies(page.select('.com-item'), path2)
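
The paging logic in this script rests on one piece of arithmetic: with 36 companies per page (the figure the script divides by), a category with N companies spans ceil(N / 36) pages, so ceil(N / 36) - 1 next-page links have to be followed after the first page. Below is a minimal sketch of that loop, with the network call replaced by a dictionary of fake pages so that it runs offline. The names page_count, crawl_pages, fetch and next_link are hypothetical, and FAKE_SITE is invented test data.

```python
import math

PAGE_SIZE = 36  # items per listing page, as assumed in the script


def page_count(total_items, page_size=PAGE_SIZE):
    """Number of listing pages needed to show total_items entries."""
    return math.ceil(total_items / page_size)


def crawl_pages(first_url, total_items, fetch, next_link):
    """Visit the first page, then follow next-page links for the rest.

    fetch(url) returns a page object; next_link(page) returns the URL of
    the following page. Both are supplied by the caller.
    """
    pages = [fetch(first_url)]
    for _ in range(page_count(total_items) - 1):
        # Take the link from the page just visited, not from the first page
        pages.append(fetch(next_link(pages[-1])))
    return pages


# Fake three-page site standing in for real HTTP responses
FAKE_SITE = {
    "p1": {"items": list(range(36)), "next": "p2"},
    "p2": {"items": list(range(36, 72)), "next": "p3"},
    "p3": {"items": list(range(72, 80)), "next": None},
}

pages = crawl_pages("p1", 80, FAKE_SITE.__getitem__, lambda page: page["next"])
print(page_count(80), sum(len(p["items"]) for p in pages))
```

Passing fetch and next_link in as arguments keeps the loop independent of requests and BeautifulSoup, which makes it easy to check before pointing it at a live site.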
