http://www.xigua66.com/ is a video site. It may trigger antivirus warnings, so open it with caution.

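1. The HTTP process

A ts file is a media segment referenced by an m3u8 playlist. m3u8 is the playlist format of HLS (HTTP Live Streaming), the video streaming standard introduced by Apple; it is a variant of m3u whose encoding is UTF-8. It is an index file format: the video is cut into many short ts segments that are stored on the server (nowadays usually in the server's memory, to reduce the number of I/O accesses), and the client parses the m3u8 to obtain the segment paths and then requests them.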
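As a rough illustration of what "parsing the m3u8" involves, the sketch below pulls the segment entries out of a playlist's text. The sample playlist and the helper name `parse_media_playlist` are made up for this example, and real playlists can carry further tags (encryption keys, variant streams) that this sketch simply skips.

```python
def parse_media_playlist(text):
    """Return [(duration_seconds, uri), ...] from the text of an m3u8 media playlist."""
    segments = []
    duration = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('#EXTINF:'):
            # "#EXTINF:<duration>,<optional title>" precedes each segment URI
            duration = float(line[len('#EXTINF:'):].split(',', 1)[0])
        elif line and not line.startswith('#'):
            # A non-tag line is the URI of the segment described just above
            if duration is not None:
                segments.append((duration, line))
                duration = None
    return segments


# Hypothetical playlist text, shaped like the ones this post deals with
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:11
#EXTINF:10.520000,
out000.ts
#EXTINF:5.680000,
out001.ts
#EXT-X-ENDLIST
"""

segments = parse_media_playlist(SAMPLE)
print(segments)
```

Each tuple pairs a segment's duration with the path the downloader has to request.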

The key step is getting hold of the playlist file:

self.playlist_url = re.findall("video: {\n            url: '(.*?)',", ts_data)[0]
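The pattern above depends on the exact line break and indentation in the page source. A slightly more tolerant variant might look like the sketch below. The two page fragments are invented for the example (only the playlist path is taken from this post), and `find_playlist_url` is a hypothetical helper, not part of the original script.

```python
import re

# Tolerant variant: \s* absorbs any run of whitespace, including newlines
PLAYLIST_RE = re.compile(r"video:\s*\{\s*url:\s*'(.*?)'")


def find_playlist_url(page_text):
    """Return the first playlist URL found in the page text, or None."""
    match = PLAYLIST_RE.search(page_text)
    return match.group(1) if match else None


# Hypothetical fragments of a player page, formatted in two different ways
PAGE_A = "video: {\n            url: '/2019/04/03/dkqcLONDC9I26yyG/playlist.m3u8',\n}"
PAGE_B = "video:{ url: '/2019/04/03/dkqcLONDC9I26yyG/playlist.m3u8', }"

print(find_playlist_url(PAGE_A))
print(find_playlist_url(PAGE_B))
```

Returning None instead of indexing `[0]` lets the caller decide what to do when the page layout changes.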

2. Downloading the ts files

Call the function in Python's urllib directly:

urllib.request.urlretrieve(url,target)

There are a few difficulties.

2.1 No response for a long time

Setting a socket timeout solves this:

socket.setdefaulttimeout(20)
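`socket.setdefaulttimeout` changes the default for every socket the process creates afterwards. If only the downloads should be affected, a per-request timeout is one alternative. The sketch below is not what the original script does: it swaps `urlretrieve` for `urlopen` plus `shutil.copyfileobj`, and the helper name `download` is an assumption of this example.

```python
import shutil
import urllib.request


def download(url, target, timeout=20):
    """Save url to target; the timeout applies to this request only."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(target, 'wb') as out:
        # Stream the body to disk in chunks instead of reading it all at once
        shutil.copyfileobj(response, out)
    return target
```

A call would look like `download(ts_url, 'out0000.ts', timeout=20)`, where `ts_url` stands for one segment address.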

2.2 Retrying on timeout without falling into an infinite loop

Set up a counter, count, and use a while loop:

try:
    urllib.request.urlretrieve(url, target)
except socket.timeout:
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url, target)
            break
        except socket.timeout:
            # Include the URL in every message, not only the first one
            err_info = '%s Reloading for %d time%s' % (url, count, '' if count == 1 else 's')
            print(err_info)
            count += 1
    if count > 5:
        print("downloading failed!")
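The same idea can be written as a small helper with a `for` loop. This is a sketch rather than the code used in this post: `retry_on_timeout` and `flaky_fetch` are hypothetical names, and `flaky_fetch` only simulates a download that times out twice so the example runs without a network. The default of six attempts mirrors the first try plus five retries above.

```python
import socket


def retry_on_timeout(fetch, max_attempts=6):
    """Call fetch() until it succeeds or max_attempts is reached.

    Returns True on success, False if every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            fetch()
            return True
        except socket.timeout:
            print('timed out (attempt %d of %d)' % (attempt, max_attempts))
    return False


# Stand-in for urlretrieve: times out twice, then succeeds (hypothetical)
calls = {'n': 0}


def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise socket.timeout('simulated timeout')


ok = retry_on_timeout(flaky_fetch)
print(ok, calls['n'])
```

With a real download the call would be `retry_on_timeout(lambda: urllib.request.urlretrieve(url, target))`.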

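2.3 The remote host closes the connection

Calling urlopen too frequently sometimes causes error 10054 (the remote host closed the connection). Downloading the file again resolves it.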

https://blog.csdn.net/qq_40910788/article/details/84844464

try:
    urllib.request.urlretrieve(url, target)
except socket.timeout:
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url, target)
            break
        except socket.timeout:
            err_info = '%s Reloading for %d time%s' % (url, count, '' if count == 1 else 's')
            print(err_info)
            count += 1
        except Exception:
            # Handle the remote host closing the connection
            self.download_file(url, target)
            break  # the recursive call already fetched the file
    if count > 5:
        print("downloading failed!")
except Exception:
    # Handle the remote host closing the connection.
    # Note: this recursion has no retry limit.
    self.download_file(url, target)
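Since the server tends to drop connections when requests arrive too quickly, one option is to wait a little longer before each new attempt and to cap the number of attempts. The sketch below shows that approach under stated assumptions: `download_with_backoff` and `reset_twice` are hypothetical names, the delays are arbitrary, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def download_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on OSError wait base_delay * 2**k seconds and try again.

    OSError covers socket.timeout, ConnectionResetError (WinError 10054)
    and urllib.error.URLError. The last error is re-raised if all attempts fail.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated download: the "server" resets the connection twice (hypothetical)
delays = []
attempts = {'n': 0}


def reset_twice():
    attempts['n'] += 1
    if attempts['n'] <= 2:
        raise ConnectionResetError(10054, 'simulated reset')
    return 'done'


result = download_with_backoff(reset_twice, sleep=delays.append)
print(result, delays)
```

Unlike the recursive call, this gives up after `max_attempts` tries and hands the error back to the caller.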

3. Multithreaded download

Since Python 3.2, the standard library has shipped a thread pool in the concurrent.futures package:

from concurrent.futures import ThreadPoolExecutor

The class offers several ways to dispatch work (submit, map, and so on). map is used here.

from concurrent.futures import ThreadPoolExecutor

self.pool = ThreadPoolExecutor(max_workers=10)

def download_for_multi_process(self, ts):
    url_header = re.findall('(http.*/)', self.playlist_url)[0]
    if ts[-1].startswith('out'):
        # Relative name: prepend the playlist's directory URL
        ts_url = url_header + ts[-1]
        # Download
        index = re.findall(r'out(.*)\.ts', ts[-1])[0]
        self.download_file(ts_url, self.target + '/out' + index.zfill(4) + '.ts')
        print(ts_url + '--->Done')
    elif ts[-1].endswith('.ts'):
        # Already an absolute URL
        ts_url = ts[-1]
        index = re.findall(r'out(.*)\.ts', ts[-1])[0]
        self.download_file(ts_url, self.target + '/out' + index.zfill(4) + '.ts')
        print(ts_url + '--->Done')
    else:
        print(ts[-1] + ' is invalid')

def download_with_multi_process(self, ts_list):
    print('Starting multithreaded download')
    print('Download links and status:')
    task = self.pool.map(self.download_for_multi_process, ts_list)  # non-blocking at this point
    for t in task:  # iterating the results blocks until each one is ready
        pass
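To make the difference between map and submit concrete, here is a small self-contained sketch. `fake_download` is a stand-in that only sleeps briefly, so nothing is actually fetched, and the segment names are just sample values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fake_download(name):
    """Stand-in for a segment download; sleeps briefly and returns a status string."""
    time.sleep(0.01)
    return name + ' done'


names = ['out000.ts', 'out001.ts', 'out002.ts']

with ThreadPoolExecutor(max_workers=3) as pool:
    # map: results come back in input order; a worker's exception
    # surfaces only when its result is reached during iteration
    mapped = list(pool.map(fake_download, names))

    # submit: one Future per task, handled here in completion order
    futures = {pool.submit(fake_download, n): n for n in names}
    finished = sorted(f.result() for f in as_completed(futures))

print(mapped)
```

Because map reports worker exceptions only during iteration, the empty `for` loop over the result is what makes failures visible at all.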

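4. Merging the ts files into an mp4

The Windows copy /b method has an upper limit on the number of ts files: above a certain count, copy /b *.ts new.ts no longer works. The merge is therefore done in stages: merge the files in batches first, then merge the batch outputs.

    def merge_ts_file_with_os(self):
        print('Starting merge')
        L = []
        file_dir = self.target
        for root, dirs, files in os.walk(file_dir):
            for file in files:
                if os.path.splitext(file)[1] == '.ts':
                    L.append(file)
        L.sort()
        # Split the sorted file list into batches of at most max_num files
        blocks = [L[i:i + self.max_num] for i in range(0, len(L), self.max_num)]
        # os.system('cd ...') would only affect its own subshell, so change
        # the working directory of this process instead
        os.chdir(self.target)
        tmp = []
        for index, block in enumerate(blocks):
            b = '+'.join(block)
            new_name = 'out_new_' + str(index).zfill(2) + '.ts'
            tmp.append(new_name)
            os.system('copy /b ' + b + ' ' + new_name)
        cmd = '+'.join(tmp)
        # Episode number: the last number in the page URL, plus one
        num = int(re.findall(r'player-(.*?)\.html', self.url)[0].split('-')[-1]) + 1
        os.system('copy /b ' + cmd + ' E' + str(num).zfill(2) + '.mp4')
        os.system('del /Q out*.ts')
        print('Merge complete')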
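The same byte-for-byte concatenation that copy /b performs can also be done in Python itself, which sidesteps the file-count limit and does not depend on the Windows shell. This is an alternative sketch, not the method used in this post; `merge_ts_files` is a hypothetical helper, and the output name is assumed not to match the segment pattern.

```python
import glob
import os
import shutil


def merge_ts_files(src_dir, output_path, pattern='out*.ts'):
    """Concatenate the matching segment files, in sorted name order, into output_path.

    Returns the number of segments written.
    """
    # The list is built before the output file is created
    parts = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(output_path, 'wb') as merged:
        for part in parts:
            with open(part, 'rb') as segment:
                # Append this segment's bytes to the output
                shutil.copyfileobj(segment, merged)
    return len(parts)
```

Sorting by name gives the right playback order here only because the downloader zero-pads the segment index (out0000.ts, out0001.ts, and so on).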

5. Source code

#coding:utf-8
import urllib.request
import http.cookiejar
import urllib.error
import urllib.parse
import re
import socket
import os
from concurrent.futures import ThreadPoolExecutor


class Xigua66Downloader:

    def __init__(self, url, target='.'):
        self.target = target
        self.url = url
        self.playlist_url = None
        self.max_num = 250
        self.header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
            "Connection": "keep-alive"
        }
        self.cjar = http.cookiejar.CookieJar()
        self.cookie = urllib.request.HTTPCookieProcessor(self.cjar)
        self.opener = urllib.request.build_opener(self.cookie)
        urllib.request.install_opener(self.opener)
        self.pool = ThreadPoolExecutor(max_workers=10)
        # Set the timeout to 20 s.
        # Using the socket module shortens the wait before each re-download.
        socket.setdefaulttimeout(20)

    def download_file(self, url, target):
        # Handle incomplete downloads without falling into an infinite loop
        try:
            urllib.request.urlretrieve(url, target)
        except socket.timeout:
            count = 1
            while count <= 5:
                try:
                    urllib.request.urlretrieve(url, target)
                    break
                except socket.timeout:
                    err_info = '%s Reloading for %d time%s' % (url, count, '' if count == 1 else 's')
                    print(err_info)
                    count += 1
                except Exception:
                    # Handle the remote host closing the connection
                    self.download_file(url, target)
                    break  # the recursive call already fetched the file
            if count > 5:
                print("downloading failed!")
        except Exception:
            # Handle the remote host closing the connection.
            # Note: this recursion has no retry limit.
            self.download_file(url, target)

    def open_web(self, url):
        try:
            response = self.opener.open(url, timeout=3)
        except urllib.error.URLError as e:
            # url may be a plain string or a Request object
            print('open ' + str(getattr(url, 'full_url', url)) + ' error')
            if hasattr(e, 'code'):
                print(e.code)
            if hasattr(e, 'reason'):
                print(e.reason)
            raise  # callers decode the result, so do not return None silently
        else:
            return response.read()

    # Step 1: get the real URL
    def get_available_IP(self):
        print('Fetching the real url')
        req = urllib.request.Request(url=self.url, headers=self.header)
        data = self.open_web(req).decode('gbk')
        target_js = re.findall('<ul id="playlist"><script type="text/javascript" src="(.*?)"></script>', data)[0]
        data = self.open_web("http://www.xigua66.com" + target_js).decode('gbk')
        data = urllib.parse.unquote(data)
        find_33uu = re.findall(r'33uu\$\$(.*)33uu\$\$', data)
        if len(find_33uu) == 0:
            find_zyp = re.findall(r'zyp\$\$(.*)zyp\$\$', data)
            if len(find_zyp) != 0:
                find = find_zyp[0]
                label = 'zyp'
            else:
                raise RuntimeError('neither a 33uu nor a zyp source was found')
        else:
            find = find_33uu[0]
            label = '33uu'
        tv_lists = re.findall(r'%u7B2C(.*?)%u96C6\$https://(.*?)\$', find)  # [(episode number, url)]
        return tv_lists, label

    # Step 2: get the number and names of the ts files
    def get_playlist(self, tv_lists, label):
        num = int(re.findall(r'player-(.*?)\.html', self.url)[0].split('-')[-1])
        url = 'https://' + tv_lists[num][-1]
        print('Starting download of episode ' + str(num + 1) + ':\n' + url)
        print('Fetching playlist_url')
        ts_data = self.open_web(url).decode('utf-8')
        if label == '33uu':
            self.playlist_url = re.findall(r"url: '(.*?\.m3u8)'", ts_data)[-1]
        else:  # label == 'zyp'
            self.playlist_url = re.findall(r"url: '(.*?\.m3u8)'", ts_data)[-1]
        # URL check; the value can take either of these forms:
        # /2019/04/03/dkqcLONDC9I26yyG/playlist.m3u8
        # https://www4.yuboyun.com/hls/2019/02/27/9eBF1A0o/playlist.m3u8
        if not self.playlist_url.startswith('http'):
            self.playlist_url = re.findall(r'(http.*?\.com)', url)[0] + self.playlist_url
        print(self.playlist_url)
        print('Fetching playlist')
        playlist_data = self.open_web(self.playlist_url).decode('utf-8')
        print('Playlist received')
        ts_list = re.findall(r'#EXTINF:(.*?),\n(.*?)\n', playlist_data)  # [(duration, ts filename)]
        return ts_list

    # Step 3: download the ts files
    def download_with_single_process(self, ts_list):
        url_header = re.findall('(http.*/)', self.playlist_url)[0]
        print('Starting single-threaded download\nDownload links and status:')
        for index, ts in enumerate(ts_list):
            if ts[-1].startswith('out'):
                ts_url = url_header + ts[-1]
                # Download
                self.download_file(ts_url, self.target + '/out' + str(index).zfill(4) + '.ts')
                print(ts_url + '--->Done')
            elif ts[-1].endswith('.ts'):
                ts_url = ts[-1]
                self.download_file(ts_url, self.target + '/out' + str(index).zfill(4) + '.ts')
                print(ts_url + '--->Done')
            else:
                print(ts[-1] + ' is invalid')
        print('All downloads complete')

    def download_for_multi_process(self, ts):
        url_header = re.findall('(http.*/)', self.playlist_url)[0]
        if ts[-1].startswith('out'):
            ts_url = url_header + ts[-1]
            # Download
            index = re.findall(r'out(.*)\.ts', ts[-1])[0]
            self.download_file(ts_url, self.target + '/out' + index.zfill(4) + '.ts')
            print(ts_url + '--->Done')
        elif ts[-1].endswith('.ts'):
            ts_url = ts[-1]
            index = re.findall(r'out(.*)\.ts', ts[-1])[0]
            self.download_file(ts_url, self.target + '/out' + index.zfill(4) + '.ts')
            print(ts_url + '--->Done')
        else:
            print(ts[-1] + ' is invalid')

    def download_with_multi_process(self, ts_list):
        print('Starting multithreaded download')
        print('Download links and status:')
        # Known issue: <urlopen error [WinError 10054] An existing connection
        # was forcibly closed by the remote host.>
        # The code could be optimized further; see
        # https://blog.csdn.net/qq_40910788/article/details/84844464
        task = self.pool.map(self.download_for_multi_process, ts_list)  # non-blocking at this point
        for t in task:  # iterating the results blocks until each one is ready
            pass
        # Alternative using multiprocessing.dummy:
        # from multiprocessing.dummy import Pool
        # pool = Pool(10)
        # pool.map(self.download_for_multi_process, ts_list)
        # pool.close()
        # pool.join()

    # Step 4: merge the ts files
    def merge_ts_file_with_os(self):
        print('Starting merge')
        L = []
        file_dir = self.target
        for root, dirs, files in os.walk(file_dir):
            for file in files:
                if os.path.splitext(file)[1] == '.ts':
                    L.append(file)
        L.sort()
        blocks = [L[i:i + self.max_num] for i in range(0, len(L), self.max_num)]
        # os.system('cd ...') would only affect its own subshell, so change
        # the working directory of this process instead
        os.chdir(self.target)
        tmp = []
        for index, block in enumerate(blocks):
            b = '+'.join(block)
            new_name = 'out_new_' + str(index).zfill(2) + '.ts'
            tmp.append(new_name)
            os.system('copy /b ' + b + ' ' + new_name)
        cmd = '+'.join(tmp)
        num = int(re.findall(r'player-(.*?)\.html', self.url)[0].split('-')[-1]) + 1
        os.system('copy /b ' + cmd + ' E' + str(num).zfill(2) + '.mp4')
        os.system('del /Q out*.ts')
        print('Merge complete')

    def merge_ts_file_with_ffmpeg(self):
        pass

    def main_process(self):
        available_IP, label = self.get_available_IP()
        ts_list = self.get_playlist(available_IP, label)
        self.download_with_multi_process(ts_list)
        self.merge_ts_file_with_os()


if __name__ == '__main__':
    web_url = "http://www.xigua66.com/mainland/yitiantulongji2019/player-0-36.html"
    down = Xigua66Downloader(web_url)
    available_IP, label = down.get_available_IP()
    ts_list = down.get_playlist(available_IP, label)
    down.download_with_multi_process(ts_list)
    down.merge_ts_file_with_os()
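The source leaves merge_ts_file_with_ffmpeg as a placeholder. One possible way to fill it in is ffmpeg's concat demuxer, which reads a text file listing the inputs. The sketch below only prepares that list file and the command; it does not run ffmpeg. `build_ffmpeg_concat` is a hypothetical helper, and the file names are assumed to contain no single quotes.

```python
import os


def build_ffmpeg_concat(ts_files, list_path, output_path):
    """Write an ffmpeg concat list file and return the command as an argument list."""
    with open(list_path, 'w', encoding='utf-8') as f:
        for name in ts_files:
            # Each line of the list file has the form: file '<path>'
            f.write("file '%s'\n" % os.path.abspath(name))
    # -safe 0 lets the concat demuxer accept absolute paths;
    # -c copy remuxes the streams without re-encoding
    return ['ffmpeg', '-f', 'concat', '-safe', '0',
            '-i', list_path, '-c', 'copy', output_path]
```

Running the result with `subprocess.run(cmd, check=True)` requires ffmpeg to be installed and on the PATH.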

6. Results

6.1 Getting the real address

>>> available_IP
'https://yuboyun.com/v/9eBF1A0o'

6.2 Getting the ts list

[(duration, filename), (...), ...]

>>> ts_list
[('10.520000', 'out000.ts'), ('5.680000', 'out001.ts'), ('2.280000', 'out002.ts'), ('1.680000', 'out003.ts'), ('5.680000', 'out004.ts'), ('5.440000', 'https://www.78pan.com/api/stats/hls/2019/02/27/9eBF1A0o/out005.ts'), ('3.800000', 'out006.ts'), ('6.240000', 'out007.ts'), ('4.080000', 'out008.ts'), ('5.440000', 'out009.ts'), ('6.040000', 'out010.ts'),  .....]
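The list mixes relative names (out000.ts) with a full URL (the out005.ts entry), which is why the downloader checks each entry before building the request address. As an alternative sketch, `urllib.parse.urljoin` handles both cases in one call. The two entries and the playlist URL below are copied from this post; the variable names are this example's own.

```python
from urllib.parse import urljoin

PLAYLIST_URL = 'https://www4.yuboyun.com/hls/2019/02/27/9eBF1A0o/playlist.m3u8'

ts_list = [
    ('10.520000', 'out000.ts'),
    ('5.440000', 'https://www.78pan.com/api/stats/hls/2019/02/27/9eBF1A0o/out005.ts'),
]

# urljoin resolves relative names against the playlist URL
# and leaves absolute URLs unchanged
urls = [urljoin(PLAYLIST_URL, name) for _, name in ts_list]

# Total playing time of the listed segments, in seconds
total_seconds = sum(float(duration) for duration, _ in ts_list)

for u in urls:
    print(u)
```

Summing the durations is a cheap sanity check that the playlist covers the whole episode.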

6.3 Downloading the files

Starting multithreaded download
Download links and status:
https://www4.yuboyun.com/hls/2019/02/27/9eBF1A0o/out003.ts--->Done
https://www4.yuboyun.com/hls/2019/02/27/9eBF1A0o/out002.ts--->Done
https://www4.yuboyun.com/hls/2019/02/27/9eBF1A0o/out007.ts--->Done

6.4 Merging the files
