Python 3 Web Scraping in Practice: Downloading Avatars with the requests Library and Regular Expressions

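Site URL: https://www.woyaogexing.com/touxiang/qinglv/new/

Browsing the page shows that every image links to another page.

So we first collect, from the listing page, the URL of the HTML page behind each image, and then extract the images from those pages.

Fetch the HTML of the page to scrape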

import requests

# Fetch the listing page and decode the body as UTF-8
response = requests.get('https://www.woyaogexing.com/touxiang/qinglv/new/')
response.encoding = 'utf-8'
print(response.text)
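The fetch step recurs several times in this post, so it can be wrapped in a small helper. The following is only a sketch: the name `fetch_html`, the 10-second timeout and the status check are assumptions that do not appear in the original code. The helper takes any object with a `get` method, such as a `requests.Session`, which also makes it easy to try out without network access.

```python
def fetch_html(session, url, timeout=10):
    """Return the HTML of url decoded as UTF-8; raise on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # requests raises HTTPError for error status codes
    response.encoding = "utf-8"  # same decoding as in the snippet above
    return response.text


# Usage with the real library (needs network access):
#   import requests
#   html = fetch_html(requests.Session(),
#                     "https://www.woyaogexing.com/touxiang/qinglv/new/")
```

Without a timeout, `requests` can wait indefinitely on a server that never answers, which is why one is passed here.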

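The URLs we need sit in href attributes in the HTML, in the form /touxiang/qinglv/20xx/<number>.html.

Use a regular expression to pick out these URLs: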

import re
import requests

response = requests.get('https://www.woyaogexing.com/touxiang/qinglv/new/')
response.encoding = 'utf-8'
html = response.text
# Raw string, so that \d and \. reach the regex engine unchanged
pattern = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"', re.S)
urls = re.findall(pattern, html)
for url in urls:
    print(url)
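To see what the pattern matches without hitting the site, it can be run against a small piece of markup. The HTML below is made up for illustration (the real page may be structured differently, and the second page number is hypothetical), and the helper name `extract_detail_links` is an assumption. The sketch also removes duplicates, since a listing page may link to the same detail page more than once.

```python
import re

# Same pattern as in the post, written as a raw string
LINK_PATTERN = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"')


def extract_detail_links(html):
    """Return the matching detail-page paths, de-duplicated, in page order."""
    return list(dict.fromkeys(LINK_PATTERN.findall(html)))


# Made-up sample markup, not copied from the real site
sample_html = '''
<a href="/touxiang/qinglv/2021/1142841.html"><img src="a.jpg"></a>
<a href="/touxiang/qinglv/2021/1142841.html">title</a>
<a href="/touxiang/qinglv/2021/1142800.html">other</a>
<a href="/touxiang/nan/2021/1142799.html">different category</a>
'''

links = extract_detail_links(sample_html)
print(links)
# ['/touxiang/qinglv/2021/1142841.html', '/touxiang/qinglv/2021/1142800.html']
```

`dict.fromkeys` keeps the first occurrence of each key in insertion order, so the links come back in the order they appear on the page.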

For each of these URLs, fetch the HTML once more:

import requests

url = '/touxiang/qinglv/2021/1142841.html'
# url already starts with '/', so the host must not end with one
response = requests.get('https://www.woyaogexing.com' + url)
response.encoding = 'utf-8'
html = response.text
print(html)
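Joining the host and the path by hand makes it easy to end up with a doubled or missing slash. One alternative is `urllib.parse.urljoin` from the standard library, which resolves a link against the page it was found on. In this sketch the image file name `example.jpeg` and the host number `img2` are placeholders, not values taken from the site.

```python
from urllib.parse import urljoin

# The listing page on which the links were found
BASE = "https://www.woyaogexing.com/touxiang/qinglv/new/"

# A root-relative path replaces the whole path of the base URL
page_url = urljoin(BASE, "/touxiang/qinglv/2021/1142841.html")
print(page_url)
# https://www.woyaogexing.com/touxiang/qinglv/2021/1142841.html

# A protocol-relative URL ('//host/...') only inherits the scheme
image_url = urljoin(BASE, "//img2.woyaogexing.com/2021/example.jpeg")
print(image_url)
# https://img2.woyaogexing.com/2021/example.jpeg
```

The same call therefore handles both the detail-page paths and the `//img...` image links that appear later in the post.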

The image URLs are now visible in this HTML.

Filter them with a regular expression:

import requests
import re

url = '/touxiang/qinglv/2021/1142841.html'
response = requests.get('https://www.woyaogexing.com' + url)
response.encoding = 'utf-8'
html = response.text
# Image links are protocol-relative ('//img...') and end in .jpeg
pattern = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d.*?\.jpeg)"', re.S)
pic_urls = re.findall(pattern, html)
for pic_url in pic_urls:
    print(pic_url)
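One thing to watch with this pattern: because of `re.S`, the `.*?` may run across quotes and line breaks until it reaches the next `.jpeg"`. If the page ever contained a link to a non-JPEG file on the same image host, the match would swallow everything in between. The sketch below compares the pattern from the post with a slightly tighter variant that uses `[^"]*?`; the variant and the sample markup (including the file names) are assumptions for illustration, not something taken from the site.

```python
import re

# Pattern from the post
LOOSE = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d.*?\.jpeg)"', re.S)
# Variant: the captured URL may not contain a double quote
TIGHT = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d[^"]*?\.jpeg)"')

# Made-up sample: a PNG link followed by a JPEG link
sample_html = '''
<a href="//img2.woyaogexing.com/2021/04/01/banner.png">
<a href="//img2.woyaogexing.com/2021/04/01/abc123.jpeg">
'''

loose_matches = LOOSE.findall(sample_html)
tight_matches = TIGHT.findall(sample_html)

print(len(loose_matches))  # 1, a single match that starts at banner.png
print(tight_matches)       # ['//img2.woyaogexing.com/2021/04/01/abc123.jpeg']
```

On pages where every link on the image host ends in `.jpeg`, both patterns return the same list.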

All that remains is to save the images to local disk.
Complete code:

import re
import os
import requests

i = 0  # running counter used to name the saved files


def get_one_page(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    return html


def get_urls(html):
    pattern = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"', re.S)
    urls = re.findall(pattern, html)
    return urls


def get_pic_url(html):
    pattern = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d.*?\.jpeg)"', re.S)
    pic_urls = re.findall(pattern, html)
    return pic_urls


def save_pic(url, pic_path):
    global i
    if not os.path.exists(pic_path):
        os.mkdir(pic_path)
    with open(os.path.join(pic_path, str(i) + '.jpg'), 'wb') as f:
        f.write(requests.get(url).content)
    i += 1


def main():
    html = get_one_page('https://www.woyaogexing.com/touxiang/qinglv/new/')
    urls = get_urls(html)
    for url in urls:
        sub_html = get_one_page('https://www.woyaogexing.com' + url)
        pic_urls = get_pic_url(sub_html)
        for pic_url in pic_urls:
            # Image links are protocol-relative, so prepend a scheme
            save_pic('http:' + pic_url, 'D:\\test\\')


if __name__ == '__main__':
    main()
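The saving step can also be written without a global counter. The following is a sketch under a few assumptions: the function name `save_images` and its `download` parameter are hypothetical, and the download itself is passed in as a callable so that the file handling can be tried without network access. It keeps the 0.jpg, 1.jpg, ... naming used in the complete code.

```python
import os


def save_images(pic_urls, pic_dir, download):
    """Write each URL's bytes to pic_dir as 0.jpg, 1.jpg, ...; return the paths."""
    os.makedirs(pic_dir, exist_ok=True)  # also creates missing parent folders
    saved = []
    for index, url in enumerate(pic_urls):
        path = os.path.join(pic_dir, f"{index}.jpg")
        with open(path, "wb") as f:
            f.write(download(url))  # download(url) must return bytes
        saved.append(path)
    return saved


# With requests installed, a downloader could look like this:
#   import requests
#   download = lambda url: requests.get(url, timeout=10).content
```

Note that the counter restarts at 0 on every call, so to number the images of all pages consecutively, collect the image URLs of every page into one list first and call the function once.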

Result: the downloaded images end up in D:\test\ as 0.jpg, 1.jpg, and so on.
