Downloading Videos with Scrapy

1. Preliminary setup

Add browser (User-Agent) spoofing and IP proxies.

settings file:

BOT_NAME = 'xinpianchang'
SPIDER_MODULES = ['xinpianchang.spiders']
NEWSPIDER_MODULE = 'xinpianchang.spiders'
ROBOTSTXT_OBEY = False
# Log level to display
LOG_LEVEL = 'ERROR'
# Enable the downloader middleware defined in the middlewares file
DOWNLOADER_MIDDLEWARES = {
    'xinpianchang.middlewares.XinpianchangDownloaderMiddleware': 543,
}

middlewares file:

import random

from fake_useragent import UserAgent
from scrapy import signals


class XinpianchangDownloaderMiddleware:
    http = ['xxxxx:xxxx']   # find your own IP proxies online
    https = ['xxxxx:xxxx']  # find your own IP proxies online

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Intercept requests
    def process_request(self, request, spider):
        request.headers['User-Agent'] = str(UserAgent().random)
        # Scrapy's HttpProxyMiddleware reads the lowercase 'proxy' meta key
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.https)
        return None

    # Intercept all responses
    def process_response(self, request, response, spider):
        return response

    # Intercept all exceptions
    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
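
The scheme check in process_request can be pulled out into a small helper and tried outside Scrapy. The sketch below is only an illustration: the name pick_proxy and the placeholder proxy addresses are assumptions, and it uses urllib.parse.urlsplit instead of splitting the URL on ':'.

```python
import random
from urllib.parse import urlsplit

# Placeholder proxy pools (assumed values for illustration only;
# substitute real host:port pairs).
HTTP_PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]
HTTPS_PROXIES = ["10.0.0.3:8443"]


def pick_proxy(url, http_pool=HTTP_PROXIES, https_pool=HTTPS_PROXIES):
    """Return a proxy URL whose scheme matches the scheme of the request URL."""
    scheme = urlsplit(url).scheme
    if scheme == "http":
        return "http://" + random.choice(http_pool)
    # Everything else is treated as https, mirroring the middleware's else branch.
    return "https://" + random.choice(https_pool)


if __name__ == "__main__":
    print(pick_proxy("https://www.xinpianchang.com/a11844431?from=ArticleList"))
```

Inside the middleware, the returned value would be assigned to request.meta['proxy'].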

2. Analyzing the page

To start, look at the URL of a single video page:

https://www.xinpianchang.com/a11844431?from=ArticleList

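Open the developer tools (F12) and look for the video's URL:

We only need two values from the page source to fetch the JSON data: the appKey and the ID that follows media/ in the request URL.

So the next task is to find the appKey in the page source.

However, the page source comes in two variants and can differ from one refresh to the next, so the extraction has to handle both: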

import re

# text holds the HTML source of the video page
select = re.compile('<html  xmlns:wb="http://open.weibo.com/wb">')
if select.search(text) is None:
    url = re.compile(r'"appKey":"(?P<appKey>.*?)"')
else:
    url = re.compile(r'appKey = "(?P<appKey>.*?)";')
appKey = url.search(text)
print(appKey.group('appKey'))
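
The two-variant lookup can also be written as one helper that returns None instead of raising when nothing matches. This is a minimal sketch, not the site's documented format: the helper name extract_value and both sample pages are made up for illustration, and only the two regular-expression shapes come from the snippet above.

```python
import re

# Marker that distinguishes the two page variants (taken from the snippet above).
WEIBO_HTML_TAG = '<html  xmlns:wb="http://open.weibo.com/wb">'

# Pattern templates for the JSON-style and inline-script variants.
JSON_STYLE = r'"{key}":"(?P<value>.*?)"'
SCRIPT_STYLE = r'{key} = "(?P<value>.*?)";'

# Made-up sample pages, for demonstration only.
SAMPLE_JSON_PAGE = '<html><script>{"appKey":"abc123","vid":"XYZ"}</script></html>'
SAMPLE_SCRIPT_PAGE = (
    WEIBO_HTML_TAG + '<script>var appKey = "abc123"; var vid = "XYZ";</script></html>'
)


def extract_value(text, key):
    """Return the value of `key` from either page variant, or None if absent."""
    style = SCRIPT_STYLE if WEIBO_HTML_TAG in text else JSON_STYLE
    match = re.search(style.format(key=re.escape(key)), text)
    return match.group("value") if match else None


if __name__ == "__main__":
    print(extract_value(SAMPLE_JSON_PAGE, "appKey"))
    print(extract_value(SAMPLE_SCRIPT_PAGE, "appKey"))
```

Because the key is a parameter, the same helper can be reused for any other value that appears in the same two shapes.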

That works. The next step is to extract each video's own ID (vid):

select = re.compile('<html  xmlns:wb="http://open.weibo.com/wb">')
if select.search(text) is None:
    url = re.compile(r'"vid":"(?P<vid>.*?)"')
else:
    url = re.compile(r'vid = "(?P<vid>.*?)";')
vid = url.search(text)
print(vid.group('vid'))

That works too.

Now we can request the JSON data:

href = f"https://mod-api.xinpianchang.com/mod/api/v2/media/{vid.group('vid')}?appKey={appKey.group('appKey')}"
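
If either value might contain characters that are not URL-safe, the endpoint can be assembled with the standard library's quoting helpers rather than a bare f-string. A small sketch under that assumption; the function name build_media_url is hypothetical and the sample arguments are made up.

```python
from urllib.parse import quote, urlencode

# Endpoint prefix taken from the href line above.
API_BASE = "https://mod-api.xinpianchang.com/mod/api/v2/media"


def build_media_url(vid, app_key):
    """Build the JSON endpoint URL from a video ID and an appKey."""
    # quote() escapes the path segment; urlencode() escapes the query value.
    return f"{API_BASE}/{quote(vid, safe='')}?{urlencode({'appKey': app_key})}"


if __name__ == "__main__":
    print(build_media_url("XYZ", "abc123"))
```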

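Opening any of the resulting links returns the data:

The video is available in several resolutions. I default to the clearest one, which is the first entry:

The title and download URL are retrieved successfully: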

def get_mp4(self, response):
    print(response.json()['data']['title'])
    title = response.json()['data']['title']
    url = response.json()['data']['resource']['progressive'][0]['url']
    print(url)
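
The same lookup can be tried on a plain dictionary without running the spider. The payload below is a made-up sample that contains only the keys used above (data, title, resource, progressive, url); the real response is assumed to carry more fields, and the function name first_progressive is hypothetical.

```python
import json

# Made-up payload with only the keys the walkthrough relies on.
SAMPLE_PAYLOAD = json.loads("""
{
  "data": {
    "title": "Demo clip",
    "resource": {
      "progressive": [
        {"url": "https://example.com/video-high.mp4"},
        {"url": "https://example.com/video-low.mp4"}
      ]
    }
  }
}
""")


def first_progressive(payload):
    """Return (title, url) for the first entry in the progressive list."""
    data = payload["data"]
    streams = data["resource"]["progressive"]
    if not streams:
        raise ValueError("no progressive streams in response")
    return data["title"], streams[0]["url"]


if __name__ == "__main__":
    print(first_progressive(SAMPLE_PAYLOAD))
```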

3. Saving the mp4

The next step is the download itself, which requires configuring items and a pipeline.

First, update settings:

ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'xinpianchang.pipelines.VideoDownloadPipeline': 1,  # the number is the priority
}
FILES_STORE = 'video'

items

import scrapy


class XinpianchangItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

Pipeline

import scrapy
from scrapy.pipelines.files import FilesPipeline


class VideoDownloadPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Request the video URL (file_urls holds a single URL string here);
        # meta carries the file name through to file_path
        yield scrapy.Request(url=item['file_urls'], meta={'title': item['files']})

    def file_path(self, request, response=None, info=None, *, item=None):
        filename = request.meta['title']  # file name for the video
        return filename  # path of the downloaded file, relative to FILES_STORE

    def item_completed(self, results, item, info):
        print(item['files'], 'is ok!')
        return item
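
Because file_path returns the title unchanged, a title containing a path separator or other reserved character could end up in an unexpected location or fail to save. One possible guard, as a sketch only: the helper safe_filename and its replacement rules are assumptions and not part of the original project.

```python
import re


def safe_filename(title, extension=".mp4"):
    """Replace characters that are problematic in file names, then add the extension."""
    # Collapse each run of reserved or control characters into one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "untitled") + extension


if __name__ == "__main__":
    print(safe_filename("a/b"))
```

With this helper the spider would store the bare title in the item, and file_path would return safe_filename(request.meta['title']).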

Spider file

import re

import scrapy

from ..items import XinpianchangItem


class Xin1Spider(scrapy.Spider):
    name = 'xin1'

    def start_requests(self):
        yield scrapy.Request(
            'https://www.xinpianchang.com/channel/index/id-85/sort-like/duration_type-0'
            '/resolution_type-/type-?from=articleListPage',
            self.parse,
        )

    def get_mp4(self, response):
        # Read the title and the first (clearest) download URL from the JSON
        data = response.json()['data']
        title = data['title']
        url = data['resource']['progressive'][0]['url']
        item = XinpianchangItem()
        item['file_urls'] = url
        item['files'] = title + '.mp4'
        yield item

    def videopage(self, response):
        text = response.text
        # The page source comes in two variants; choose the matching patterns
        select = re.compile('<html  xmlns:wb="http://open.weibo.com/wb">')
        if select.search(text) is None:
            a = re.compile(r'"appKey":"(?P<appKey>.*?)"')
            v = re.compile(r'"vid":"(?P<vid>.*?)"')
        else:
            a = re.compile(r'appKey = "(?P<appKey>.*?)";')
            v = re.compile(r'vid = "(?P<vid>.*?)";')
        appKey = a.search(text)
        vid = v.search(text)
        href = f"https://mod-api.xinpianchang.com/mod/api/v2/media/{vid.group('vid')}?appKey={appKey.group('appKey')}"
        yield scrapy.Request(href, callback=self.get_mp4)

    def parse(self, response):
        # Collect the article IDs from the listing page and visit each video page
        ids = response.xpath('/html/body/div[7]/div[2]/ul/li/@data-articleid').getall()
        for article_id in ids:
            href = 'https://www.xinpianchang.com/a' + article_id
            yield scrapy.Request(href, callback=self.videopage)

The videos download successfully.

