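[Code] Scraping the NetEase Cloud Music comments for Jay Chou's 晴天
This follows other write-ups on scraping NetEase Cloud Music comments. The comments are loaded through typical AJAX requests, and most people go about cracking the JS encryption, which is somewhat tedious.
I scrape the comments of just this one song, Jay Chou's 晴天, because the data volume is fairly large. Later I will scrape the whole artist.
This is the encrypted API case: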
import requests


def get_comment():
    url = r'https://music.163.com/weapi/v1/resource/comments/R_SO_4_2069470?csrf_token='
    headers = {
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/song?id=482999668',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/71.0.3578.98 Safari/537.36',
        # Add cookies, otherwise the response is {'code': -460, 'msg': 'Cheating'}
        # Proxies are also required; here I made do with my own VPN instead
    }
    # Note: a dict keeps only the last value for a repeated key
    proxies = {
        'https': 'https://110.52.234.72',
        'https': 'https://119.101.112.66'
    }
    # The weapi endpoint expects these two encrypted fields as form data
    formdata = {
        "params": 'Wk3drXP2/Nj8YbOQoL3ORmBM784lqxwm0VELQyBipJWx/rd8fUmklRZ6vL+G1f2dbZ/8WE7f25gWe+2BdXp3+d2AwkiTy5DxeVd4SiHX5qat+jU642hSysQVtHDfJHmCi6rjndr/YEBSccqnzIbueeA9H08OlzAZoYa5T6xlbQpxgtTdX5E1MF6R71ykxkS8',
        "encSecKey": '1d5e93ee97662d6f9dfaf07dbe4b9d4f9ffe6b90b484d8acc14696214a556000198d51ce3d87d9123db07f96307c919c02d84fa4a204e9d0a387404141fd43400fb2ec9aaa07ae99d99df133cc6d4c31ee8ab7859d83351b154c1ab2bed81a84159a25956ed1485551639e37fc3502ab049a03051ca40f85ef4dd648aabe9286'
    }
    response = requests.post(url=url, headers=headers, data=formdata, proxies=proxies)
    print(response.status_code)
    result = response.json()
    print(result)
    comments = result.get("comments")
    for comment in comments:
        user = comment.get('user')
        img_url = user.get('avatarUrl')
        name = user.get('nickname')
        uid = user.get('userId')
        commentid = comment.get("commentId")
        commenttime = comment.get("time")
        content = comment.get('content')
        print(name, uid, img_url, commentid, commenttime, content)


get_comment()
I happened to come across an unencrypted URL and tried it, which produced requests.exceptions.ProxyError:
Process Process-4:
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
During handling of the above exception, another exception occurred:
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x0000019F0FB6EF28>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
During handling of the above exception, another exception occurred:
requests.exceptions.ProxyError: HTTPConnectionPool(host='127.0.0.1', port=1080): Max retries exceeded with url: http://music.163.com/api/v1/resource/comments/R_SO_4_186016?limit=20&offset=6840 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000019F0FB6EF28>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted',)))
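Two fixes are suggested online: configuring the requests parameters, and adding proxies.
I looked at some IP proxies, but in my tests they did not work.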
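One thing worth checking before writing the proxies off: a Python dict literal that repeats the 'https' key keeps only the last entry, and requests picks the proxy by the scheme of the target URL, so a proxies dict with only 'https' entries is not consulted for an http:// URL. Below is a minimal sketch of rotating through a list instead. The helper name pick_proxies and the addresses (placeholder documentation-range IPs) are assumptions for illustration, not part of the original script.

```python
import random

# Placeholder proxy addresses (documentation-only IP range), purely illustrative.
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def pick_proxies(pool=PROXY_POOL):
    """Return a requests-style proxies mapping for one randomly chosen proxy."""
    proxy = random.choice(pool)
    # Map both schemes so the proxy applies to http:// and https:// URLs alike.
    return {"http": proxy, "https": proxy}


# A dict literal with a repeated key silently keeps only the last value.
duplicated = {"https": "a", "https": "b"}
print(duplicated)
print(pick_proxies())
```

The returned mapping can be passed as the proxies argument of each request, so every attempt may go out through a different address.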
I configured the requests parameters, which seemed to help a little:
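# This fragment sits inside the request loop (hence the continue)
requests.adapters.DEFAULT_RETRIES = 5  # increase the number of reconnection attempts
s = requests.session()
s.keep_alive = False  # close surplus connections
response = s.post(url=url, headers=headers, proxies=proxies)
if response.status_code != 200:
    print(response.status_code)
    # If the request fails, sleep for up to 5 seconds, then post again
    time.sleep(random.random() * 5)
    continue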
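As far as I can tell, assigning requests.adapters.DEFAULT_RETRIES after import and setting keep_alive on the session may have little or no effect, since the adapter's default is bound when the class is defined and Session has no keep_alive setting. A hedged sketch of the documented route, mounting an HTTPAdapter with a urllib3 Retry policy, looks like this. The function name make_session, the retry count, the backoff factor and the status list are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5):
    """Build a Session whose adapters retry failed connections."""
    retry = Retry(
        total=total_retries,
        backoff_factor=0.5,  # wait progressively longer between attempts
        status_forcelist=(502, 503, 504),  # also retry on these responses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    # Ask the server to close the connection after each response.
    session.headers["Connection"] = "close"
    return session


session = make_session()
print(session.get_adapter("http://music.163.com/").max_retries.total)
```

Building the session once and reusing it for every page also avoids opening a fresh connection pool on each loop iteration.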
Around page 4800, requests.exceptions.ChunkedEncodingError appeared. Some workarounds follow.
Process Process-4805:
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
During handling of the above exception, another exception occurred:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Override the run method of the Thread class: if a thread exits because of an error, create a new thread and keep going:
import time
from threading import Thread


class HandleChunkedEncodingError(Thread):
    def __init__(self, target, name, args):
        Thread.__init__(self)
        self.name = name
        self.args = args
        self.target = target

    def run(self):
        while True:
            try:
                self.target(*self.args)
            except Exception as e:
                # print('thread', self.name, 'running error: ', e)
                time.sleep(5)
                # Create a new thread to redo the work
                rethd = HandleChunkedEncodingError(target=self.target, name=self.name, args=self.args)
                rethd.start()
                rethd.join()
                break  # the replacement thread has done the work, so do not run the target again
            else:
                break
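Spawning a replacement thread is not the only way to recover. A simpler variant, sketched here under my own assumptions, retries the target inside the same thread and gives up after a fixed number of attempts. The class name RetryingThread, the max_attempts and delay parameters, and the flaky test function are hypothetical and do not come from the original script.

```python
import time
from threading import Thread


class RetryingThread(Thread):
    """Thread that re-runs its target until it finishes without raising."""

    def __init__(self, target, name, args=(), max_attempts=5, delay=0.0):
        super().__init__(name=name)
        self._work = target
        self._work_args = args
        self.max_attempts = max_attempts
        self.delay = delay
        self.attempts = 0

    def run(self):
        while self.attempts < self.max_attempts:
            self.attempts += 1
            try:
                self._work(*self._work_args)
            except Exception as exc:
                print(self.name, "attempt", self.attempts, "failed:", exc)
                time.sleep(self.delay)
            else:
                break


calls = []


def flaky(page):
    """Simulated page fetch that fails on its first two calls."""
    calls.append(page)
    if len(calls) < 3:
        raise ConnectionError("simulated broken connection")


t = RetryingThread(target=flaky, name="thd7", args=(7,))
t.start()
t.join()
print(t.attempts)
```

Capping the attempts keeps a permanently failing page from looping forever, at the cost of having to record and revisit the pages that gave up.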
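Around page 9000, an error reported that the MongoDB port was in use. Perhaps the processes doing the inserts were not joined in time? I am not sure.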
inset error localhost:27017: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
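One plausible cause, which is only a guess and not something the run confirmed, is that opening a new client connection for every insert uses up local ports faster than Windows releases them. A sketch of the usual remedy is to create the client once and share it between threads. To keep the block runnable without a database, FakeClient stands in for the real client; with pymongo the factory would be something like lambda: pymongo.MongoClient(host='localhost', port=27017), whose client is documented as thread-safe. The names get_client and FakeClient are hypothetical.

```python
import threading

_client = None
_client_lock = threading.Lock()


def get_client(make_client):
    """Create the client on first use and hand the same object to every caller."""
    global _client
    with _client_lock:
        if _client is None:
            _client = make_client()
        return _client


class FakeClient:
    """Stand-in for a database client; counts how many were constructed."""
    created = 0

    def __init__(self):
        FakeClient.created += 1


threads = [threading.Thread(target=get_client, args=(FakeClient,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(FakeClient.created)
```

With this shape, the insert function would fetch the shared client instead of constructing one per comment.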
Using Process turned out to hog the CPU, so at around page 10,000 I switched to Thread. CPU usage dropped, but then this showed up:
Exception in thread Thread-341:
for comment in get_comment(i):
TypeError: 'NoneType' object is not iterable
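But I damn well did check for that:
if comments is not None:
    return comments  # if no comments came back, post again; otherwise return
Changed to:
if comments is None:
    continue
else:
    return comments  # if no comments came back, post again; otherwise return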
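A way to make the caller's for loop safe regardless of what the server sends is to have the fetch function always return a list. The sketch below retries a bounded number of times and falls back to an empty list. The name fetch_comments, the fetch_page callable and the max_attempts and delay parameters are assumptions for illustration; the fake responses only mimic the two shapes seen in this post.

```python
import time


def fetch_comments(fetch_page, page, max_attempts=5, delay=0.0):
    """Call fetch_page(page) until it yields a list of comments.

    fetch_page is a hypothetical callable returning the decoded JSON dict.
    Always returns a list, so callers can iterate without a None check.
    """
    for _ in range(max_attempts):
        try:
            result = fetch_page(page)
        except Exception:
            result = None
        comments = result.get("comments") if isinstance(result, dict) else None
        if comments is not None:
            return comments
        time.sleep(delay)
    return []


# First response is a rejection, second one carries comments.
responses = iter([
    {"code": -460, "msg": "Cheating"},
    {"comments": [{"content": "hi"}]},
])
found = fetch_comments(lambda page: next(responses), page=0)
print(found)
```

Returning an empty list after the last attempt means a page can be skipped silently, so in practice the skipped page numbers should be logged somewhere.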
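Around page 35,000, opening NetEase Cloud Music showed this notice: "This site is being upgraded in line with GDPR and service is suspended; please look forward to our return!"
The IP had probably switched to Europe.
After switching back, the web page was fine again, but unfortunately the response was once more {'code': -460, 'msg': 'Cheating'}.
I found out that the cookie needs the _ntes_nuid field, a 32-character mix of letters and digits. Surprisingly, changing just one character of it is enough to post again.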
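The one-character change can be sketched as a small helper. The function name mutate_nuid, the alphabet and the sample value are assumptions made up for illustration; the only thing taken from the observation above is that the value has 32 characters of letters and digits and that a single changed character was enough.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits


def mutate_nuid(nuid, rng=random):
    """Return a copy of nuid with exactly one character replaced."""
    index = rng.randrange(len(nuid))
    # Exclude the current character so the result always differs.
    choices = [c for c in ALPHABET if c != nuid[index]]
    return nuid[:index] + rng.choice(choices) + nuid[index + 1:]


# Made-up 32-character value, not a real cookie.
original = "0123456789abcdef0123456789abcdef"
changed = mutate_nuid(original)
print(changed)
print(sum(a != b for a, b in zip(original, changed)))
```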
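Printing one line per page got far too long and made my eyes glaze over, and I stopped writing to a .log file. Instead, the progress message now refreshes in place: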
sys.stdout.write("\r{0}".format(info))
sys.stdout.flush()
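Well, all in all it is not bad: 120,000 comments per hour. Some people scrape with selenium and get 3,000 comments per hour. selenium is really not suited to scraping.
This is the unencrypted API case: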
import time
import sys
import random
from threading import Thread

import requests
import pymongo


def get_comment(i):
    offset = str(i * 20)
    url = r'http://music.163.com/api/v1/resource/comments/R_SO_4_186016?' \
          'limit=20&offset=' + offset
    replace = random.randint(1, 9)
    headers = {
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/song?id=482999668',
        'Origin': 'https://music.163.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/71.0.3578.98 Safari/537.36',
        # Add cookies, otherwise the response is {'code': -460, 'msg': 'Cheating'}
        'Cookie': '_ntes_nnid=3{}533f97b25070a32c249f59513ad20c,1{}92582485123; _ntes_nuid=3{}533f97b25070a32c249f59513ad20c;.............'.format(replace, replace, replace)
    }
    # These IPs are probably invalid (and a dict keeps only the last value for a repeated key)
    proxies = {
        'https': '123.163.117.246:33915',
        'https': '180.125.17.139:49562',
        'https': '117.91.246.244:36838',
        'https': '121.227.24.195:40862',
        'https': '180.116.55.146:37993',
        'https': '117.90.7.57:28613',
        'https': '193.112.111.90:51933',
        'https': '61.130.236.216:58127',
        'https': '60.189.203.144:16820',
        'https': '111.177.185.148:9999'
    }
    while True:
        requests.adapters.DEFAULT_RETRIES = 5  # increase the number of reconnection attempts
        s = requests.session()
        s.keep_alive = False  # close surplus connections
        response = s.get(url=url, headers=headers, proxies=proxies)
        # Skip 500 here so internal server errors do not flood the screen
        if response.status_code != 200 and response.status_code != 500:
            # print(response.status_code)
            # If the request fails, sleep 2 seconds, then request again
            time.sleep(2)
            continue
        try:
            result = response.json()
            comments = result.get("comments")
            if comments is None:  # I have doubts about this check
                continue
            else:
                return comments
        except Exception as e:
            # print('get error:', e)  # show the exception
            continue


def insert2db(item):
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.music
    collection = db.comment2
    collection.insert_one(item)


def save(page):
    for comment in get_comment(page):
        user = comment.get('user')
        img_url = user.get('avatarUrl')
        name = user.get('nickname')
        uid = user.get('userId')
        cid = comment.get("commentId")
        ctime = comment.get("time")
        content = comment.get('content')
        item = {
            'name': name,
            'uid': uid,
            'img_url': img_url,
            'cid': cid,
            'ctime': ctime,
            'content': content
        }
        # Keep retrying the insert until it succeeds
        while True:
            try:
                insert2db(item)
            except Exception as e:
                # print('inset error:', e)
                # time.sleep(0.5)
                pass
            else:
                break


class HandleChunkedEncodingError(Thread):
    def __init__(self, target, name, args):
        Thread.__init__(self)
        self.name = name
        self.args = args
        self.target = target

    def run(self):
        while True:
            try:
                self.target(*self.args)
            except Exception as e:
                # print('thread', self.name, 'running error: ', e)
                time.sleep(5)
                # Create a new thread to redo the work
                rethd = HandleChunkedEncodingError(target=self.target, name=self.name, args=self.args)
                rethd.start()
                rethd.join()
                break  # the replacement thread has done the work, so do not run the target again
            else:
                break


def main():
    for i in range(90840, 104485, 10):
        ps = []
        # Pause between batches of threads (the multiplier is 0, so currently disabled)
        time.sleep(random.random() * 0)
        for j in range(10):
            p = HandleChunkedEncodingError(target=save, name='thd' + str(i + j), args=(i + j,))
            ps.append(p)
        for p in ps:
            p.start()
        for p in ps:
            p.join()
        # Crawl 10 pages per batch so that problems can be detected
        info = 'page ' + str(i + 10) + ' done'
        sys.stdout.write("\r{0}".format(info))
        sys.stdout.flush()


if __name__ == '__main__':
    main()
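The batch-of-ten pattern in main can also be expressed with the standard library's thread pool, which reuses a fixed set of worker threads rather than creating ten new ones per batch. This is a sketch of an alternative, not how the script above works. The function crawl_pages, its workers parameter and the fake_save stand-in are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_pages(save, pages, workers=10):
    """Run save(page) for every page on a fixed pool of worker threads.

    Returns the pages whose save() raised, so they can be retried later.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(save, page): page for page in pages}
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
    return sorted(failed)


def fake_save(page):
    """Stand-in for save(); fails on every fourth page."""
    if page % 4 == 0:
        raise RuntimeError("simulated failure")


failed_pages = crawl_pages(fake_save, range(1, 11))
print(failed_pages)
```

Collecting the failed pages in one list makes a second pass over just those pages straightforward.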