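Notes on scraping NetEase Cloud Music comments. The comments are loaded through AJAX in the usual way, and most people go about it by cracking the JS encryption, which is rather tedious.
I scrape the comments of a single song, Jay Chou's 晴天 (Sunny Day), because its comment volume is fairly large. Scraping a whole artist is left for later.
This is the encrypted-API case: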

import requests


def get_comment():
    url = r'https://music.163.com/weapi/v1/resource/comments/R_SO_4_2069470?csrf_token='
    headers = {
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/song?id=482999668',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/71.0.3578.98 Safari/537.36',
        # Add cookies here, otherwise the server returns {'code': -460, 'msg': 'Cheating'}
        # Proxies are also required; I made do with my own VPN here
    }
    # Note: a dict keeps only the last value for a repeated key,
    # so only the second proxy is actually used.
    proxies = {
        'https': 'https://110.52.234.72',
        'https': 'https://119.101.112.66'
    }
    formdata = {
        "params": 'Wk3drXP2/Nj8YbOQoL3ORmBM784lqxwm0VELQyBipJWx/rd8fUmklRZ6vL+G1f2dbZ/8WE7f25gWe+2BdXp3+d2AwkiTy5DxeVd4SiHX5qat+jU642hSysQVtHDfJHmCi6rjndr/YEBSccqnzIbueeA9H08OlzAZoYa5T6xlbQpxgtTdX5E1MF6R71ykxkS8',
        "encSecKey": '1d5e93ee97662d6f9dfaf07dbe4b9d4f9ffe6b90b484d8acc14696214a556000198d51ce3d87d9123db07f96307c919c02d84fa4a204e9d0a387404141fd43400fb2ec9aaa07ae99d99df133cc6d4c31ee8ab7859d83351b154c1ab2bed81a84159a25956ed1485551639e37fc3502ab049a03051ca40f85ef4dd648aabe9286'
    }
    # The encrypted form fields must be sent as the POST body
    # (the original call defined formdata but never passed it).
    response = requests.post(url=url, headers=headers, data=formdata, proxies=proxies)
    print(response.status_code)
    result = response.json()
    print(result)
    comments = result.get("comments")
    for comment in comments:
        user = comment.get('user')
        img_url = user.get('avatarUrl')
        name = user.get('nickname')
        uid = user.get('userId')
        commentid = comment.get("commentId")
        commenttime = comment.get("time")
        content = comment.get('content')
        print(name, uid, img_url, commentid, commenttime, content)


get_comment()
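The field extraction at the end of get_comment can also live in a small pure function, which makes it testable without touching the network. A minimal sketch, assuming the response has the shape used above (a "comments" list whose items carry "user", "commentId", "time" and "content"); the name parse_comments and the sample payload are made up for illustration:

```python
def parse_comments(result):
    """Flatten the 'comments' list of an API response into plain dicts.

    Returns an empty list when 'comments' is missing or None, so callers
    can iterate over the result without a separate None check.
    """
    rows = []
    for comment in result.get('comments') or []:
        user = comment.get('user') or {}
        rows.append({
            'name': user.get('nickname'),
            'uid': user.get('userId'),
            'img_url': user.get('avatarUrl'),
            'cid': comment.get('commentId'),
            'ctime': comment.get('time'),
            'content': comment.get('content'),
        })
    return rows


# Made-up payload in the assumed shape, not real data.
sample = {
    'comments': [
        {
            'user': {'nickname': 'alice', 'userId': 1,
                     'avatarUrl': 'http://example.com/a.jpg'},
            'commentId': 100,
            'time': 1546300800000,
            'content': 'hello',
        }
    ]
}
print(parse_comments(sample))
print(parse_comments({'code': -460, 'msg': 'Cheating'}))  # no 'comments' key
```

Returning an empty list for an error payload means a rejected request simply yields nothing to iterate over.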

I then stumbled on an unencrypted URL and tried it, which raised requests.exceptions.ProxyError:

Process Process-4:
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
During handling of the above exception, another exception occurred:
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x0000019F0FB6EF28>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
During handling of the above exception, another exception occurred:
requests.exceptions.ProxyError: HTTPConnectionPool(host='127.0.0.1', port=1080): Max retries exceeded with url: http://music.163.com/api/v1/resource/comments/R_SO_4_186016?limit=20&offset=6840 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000019F0FB6EF28>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted',)))

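There are two fixes suggested online: configuring requests parameters, and adding proxies.
I looked at a few IP proxies, but they did not work when I tested them.
Configuring the requests parameters seemed to help somewhat: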
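One thing worth checking when proxies appear to have no effect: a Python dict keeps only the last value for a repeated key, and requests picks the proxy by the URL's scheme, so a proxies dict with several 'https' entries holds a single proxy, and that one is not applied to an http:// URL. A minimal sketch of choosing one proxy per request instead; pick_proxies is a hypothetical helper and the addresses are placeholders from a documentation range, not working proxies:

```python
import random

# Placeholder addresses (TEST-NET range); substitute proxies you have verified.
PROXY_POOL = [
    'http://192.0.2.10:8080',
    'http://192.0.2.11:8080',
    'http://192.0.2.12:8080',
]


def pick_proxies(pool, rng=random):
    """Return a requests-style proxies dict routing both schemes through one random proxy."""
    proxy = rng.choice(pool)
    return {'http': proxy, 'https': proxy}


duplicated = {'https': 'a', 'https': 'b'}  # repeated key: only the last value survives
print(duplicated)

print(pick_proxies(PROXY_POOL))
```

The returned dict can be passed as the proxies argument of requests.get or requests.post, with a fresh pick for each request.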

        requests.adapters.DEFAULT_RETRIES = 5  # increase the number of reconnection attempts
        s = requests.session()
        s.keep_alive = False  # close surplus connections
        response = s.post(url=url, headers=headers, proxies=proxies)
        if response.status_code != 200:
            print(response.status_code)
            # If the request fails, sleep for a random 0-5 seconds, then post again
            time.sleep(random.random()*5)
            continue
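As far as I can tell, assigning requests.adapters.DEFAULT_RETRIES after import and setting s.keep_alive do not change how a Session behaves in current requests releases, so the improvement I saw may have been incidental. The documented route to retries is mounting an HTTPAdapter with a urllib3 Retry policy. A minimal sketch, assuming requests and urllib3 are installed; make_session and the specific retry numbers are illustrative choices:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff=0.5):
    """Build a Session that retries failed connections and selected 5xx responses."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # wait progressively longer between attempts
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    # Ask the server to close the connection after each response
    # instead of keeping it alive in the pool.
    session.headers['Connection'] = 'close'
    return session


session = make_session()
print(session.get_adapter('http://music.163.com/').max_retries.total)
```

By default urllib3 retries on status codes only for idempotent methods such as GET, so this fits the unencrypted GET endpoint better than the POST one.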

Around page 4800, requests.exceptions.ChunkedEncodingError appeared. Some workarounds follow.

Process Process-4805:
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
During handling of the above exception, another exception occurred:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Override the run method of the Thread class so that, if the thread exits with an error, a new thread is created and the work carries on:

class HandleChunkedEncodingError(Thread):
    def __init__(self, target, name, args):
        Thread.__init__(self)
        self.name = name
        self.args = args
        self.target = target

    def run(self):
        while True:
            try:
                self.target(*self.args)
            except Exception as e:
                # print('thread', self.name, 'running error: ', e)
                time.sleep(5)
                # Create a new thread and let it redo the work
                rethd = HandleChunkedEncodingError(target=self.target, name=self.name, args=self.args)
                rethd.start()
                rethd.join()
                # The replacement thread has finished the job, so stop here
                # rather than looping and running the target a second time.
                break
            else:
                break
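Since run already loops, the same effect can be had without spawning replacement threads at all. A minimal sketch of a bounded retry helper; call_with_retries, its parameters and the flaky demo function are hypothetical names, not part of my script:

```python
import time


def call_with_retries(func, args=(), max_attempts=5, delay=0.0):
    """Call func(*args), retrying on any exception up to max_attempts times.

    Re-raises the last exception if every attempt fails, so a permanently
    broken page cannot loop forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay)  # pause before the next attempt


calls = {'n': 0}


def flaky(page):
    """Demo target: fails twice, then succeeds."""
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated broken connection')
    return 'page %d ok' % page


print(call_with_retries(flaky, args=(7,)))
print(calls['n'])
```

A plain Thread(target=call_with_retries, args=(save, (page,))) would then replace the custom subclass.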

Around page 9000, MongoDB reported that the port was in use. Perhaps the processes doing the inserts were not joined in time? I am not sure.

inset error localhost:27017: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
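One possible cause, which is a guess rather than something the log confirms, is that a new MongoClient is opened for every inserted comment, and many short-lived connections on Windows can run the machine out of local ports. A minimal sketch of reusing a single client; get_collection, save_items and FakeCollection are hypothetical names, the database and collection names (music, comment2) are the ones my script uses, and insert_one is used because Collection.insert was removed in PyMongo 4:

```python
import functools


@functools.lru_cache(maxsize=None)
def get_collection(host='localhost', port=27017):
    """Create the MongoClient once per process and reuse its connection pool."""
    import pymongo  # imported lazily so the rest of the sketch runs without pymongo
    client = pymongo.MongoClient(host=host, port=port)
    return client.music.comment2


def save_items(collection, items):
    """Insert each item through the shared collection handle; return how many were written."""
    count = 0
    for item in items:
        collection.insert_one(item)
        count += 1
    return count


class FakeCollection:
    """Stand-in with the same insert_one method, so the sketch runs without a server."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


fake = FakeCollection()
print(save_items(fake, [{'cid': 1}, {'cid': 2}]))
```

In the real script the call would be save_items(get_collection(), [item]). MongoClient is documented as thread-safe, so all worker threads can share the one instance.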

Process turned out to hog the CPU, so at around page 10,000 I switched to Thread. CPU usage dropped, but then a new error showed up:

Exception in thread Thread-341:
    for comment in get_comment(i):
TypeError: 'NoneType' object is not iterable

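But I damn well did check for that:

            # If no comments came back, post again; otherwise return them
            if comments is not None:
                return comments

Changed to:

            # If no comments came back, post again; otherwise return them
            if comments is None:
                continue
            else:
                return comments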

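Around page 35,000, opening NetEase Cloud Music showed the notice "This site is being upgraded to comply with GDPR and service is suspended; please look forward to our return!" My IP had presumably switched to Europe.
After switching back the page loaded normally, but unfortunately the API answered {'code': -460, 'msg': 'Cheating'} again.
I found that the cookie needs a _ntes_nuid field, a 32-character alphanumeric value. Surprisingly, changing a single character of it is enough for the request to go through.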
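The trick can be automated by changing one character of the value on each request. A minimal sketch, assuming the value consists of lowercase hexadecimal characters; mutate_nuid is a hypothetical helper and the base value is a placeholder, not a real cookie:

```python
import random

HEX_DIGITS = '0123456789abcdef'


def mutate_nuid(nuid, rng=random):
    """Return a copy of nuid with exactly one character replaced by a different hex digit."""
    pos = rng.randrange(len(nuid))
    choices = [c for c in HEX_DIGITS if c != nuid[pos]]
    return nuid[:pos] + rng.choice(choices) + nuid[pos + 1:]


base = '0123456789abcdef0123456789abcdef'  # placeholder value, 32 characters
fresh = mutate_nuid(base)
cookie = '_ntes_nuid={}'.format(fresh)
print(cookie)
```

Excluding the current character from the choices guarantees that the value really changes on every call.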

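Printing one line per page got too long and hard on the eyes, so instead of writing to a .log file I switched to a progress message that refreshes in place: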

sys.stdout.write("\r{0}".format(info))
sys.stdout.flush()
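Wrapped in a helper, the same idea looks like this. A minimal sketch; show_progress is a hypothetical name, and the padding is an addition so that a shorter message fully overwrites a longer previous one:

```python
import sys


def show_progress(info, stream=sys.stdout, width=40):
    """Rewrite the current terminal line with info, left-aligned and padded to width."""
    stream.write('\r{0:<{1}}'.format(info, width))
    stream.flush()


for page in range(10, 40, 10):
    show_progress('page {} done'.format(page))
print()  # move to a fresh line once the loop finishes
```

The carriage return moves the cursor back to the start of the line without advancing to the next one, which is what makes the message refresh in place.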

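Well, overall it is decent: about 120,000 comments per hour. Some people scrape with selenium and get about 3,000 per hour, so selenium really is not suited to scraping.
This is the unencrypted-API case: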

import time
import sys
import random
from threading import Thread

import requests
import pymongo


def get_comment(i):
    offset = str(i*20)
    url = r'http://music.163.com/api/v1/resource/comments/R_SO_4_186016?' \
          'limit=20&offset='+offset
    replace = random.randint(1,9)
    headers = {
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/song?id=482999668',
        'Origin': 'https://music.163.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/71.0.3578.98 Safari/537.36',
        # Add cookies, otherwise the server returns {'code': -460, 'msg': 'Cheating'}
        'Cookie': '_ntes_nnid=3{}533f97b25070a32c249f59513ad20c,1{}92582485123; _ntes_nuid=3{}533f97b25070a32c249f59513ad20c;.............'.format(replace, replace, replace)
    }
    proxies = {
        'https' : '123.163.117.246:33915',
        'https' : '180.125.17.139:49562',
        'https' : '117.91.246.244:36838',
        'https' : '121.227.24.195:40862',
        'https' : '180.116.55.146:37993',
        'https' : '117.90.7.57:28613',
        'https' : '193.112.111.90:51933',
        'https' : '61.130.236.216:58127',
        'https' : '60.189.203.144:16820',
        'https' : '111.177.185.148:9999'
    }    # these IPs are probably dead
    while True:
        requests.adapters.DEFAULT_RETRIES = 5  # increase the number of reconnection attempts
        s = requests.session()
        s.keep_alive = False  # close surplus connections
        response = s.get(url=url, headers=headers, proxies=proxies)
        # 500 is let through so internal server errors do not flood the screen
        if response.status_code != 200 and response.status_code != 500:
            # print(response.status_code)
            # If the request fails, sleep 2 seconds, then request again
            time.sleep(2)
            continue
        try:
            result = response.json()
            comments = result.get("comments")
            if comments is None:  # I am not sure about this part
                continue
            else:
                return comments
        except Exception as e:
            # print('get error:', e)                # show the exception
            continue


def insert2db(item):
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.music
    collection = db.comment2
    # insert_one replaces the older insert(), which was removed in PyMongo 4
    collection.insert_one(item)


def save(page):
    for comment in get_comment(page):
        user = comment.get('user')
        img_url = user.get('avatarUrl')
        name = user.get('nickname')
        uid = user.get('userId')
        cid = comment.get("commentId")
        ctime = comment.get("time")
        content = comment.get('content')
        item = {
            'name': name,
            'uid': uid,
            'img_url': img_url,
            'cid': cid,
            'ctime': ctime,
            'content': content
        }
        # Keep retrying the insert until it succeeds
        while True:
            try:
                insert2db(item)
            except Exception as e:
                # print('inset error:', e)
                # time.sleep(0.5)
                pass
            else:
                break


class HandleChunkedEncodingError(Thread):
    def __init__(self, target, name, args):
        Thread.__init__(self)
        self.name = name
        self.args = args
        self.target = target

    def run(self):
        while True:
            try:
                self.target(*self.args)
            except Exception as e:
                # print('thread', self.name, 'running error: ', e)
                time.sleep(5)
                # Create a new thread and let it redo the work
                rethd = HandleChunkedEncodingError(target=self.target, name=self.name, args=self.args)
                rethd.start()
                rethd.join()
                # The replacement thread has finished the job, so stop here
                # rather than looping and running the target a second time.
                break
            else:
                break


def main():
    # Resume from page 90840; crawl 10 pages per batch so problems get noticed
    for i in range(90840, 104485, 10):
        ps = []
        # Pause between batches; the multiplier is currently 0, so no sleep happens
        time.sleep(random.random() * 0)
        for j in range(10):
            p = HandleChunkedEncodingError(target=save, name='thd'+str(i+j), args=(i + j,))
            ps.append(p)
        for p in ps:
            p.start()
        for p in ps:
            p.join()
        info = 'page ' + str(i+10) + ' done'
        sys.stdout.write("\r{0}".format(info))
        sys.stdout.flush()


if __name__ == '__main__':
    main()
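Starting and joining ten threads per batch can also be expressed with a thread pool from the standard library, which caps concurrency and reuses threads. A minimal sketch; crawl_pages is a hypothetical name and fake_save stands in for the save function above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_pages(save, pages, max_workers=10):
    """Run save(page) for every page on a pool of worker threads.

    Returns the pages that succeeded and the pages that raised, so failures
    can be retried later instead of being lost silently.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(save, page): page for page in pages}
        for future in as_completed(futures):
            page = futures[future]
            if future.exception() is None:
                done.append(page)
            else:
                failed.append(page)
    return sorted(done), sorted(failed)


def fake_save(page):
    """Demo target: pretend that page 3 hits a broken connection."""
    if page == 3:
        raise ConnectionError('simulated failure')


done, failed = crawl_pages(fake_save, range(6))
print(done, failed)
```

Collecting the failed pages makes it possible to feed them back into crawl_pages in a later pass.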
