Crawling Zhihu by Topic ID

Python crawler

Load the page from the topic URL, filter the content, and store the answer text and image URLs.
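
The filtering step can be sketched on its own, without touching the network. The snippet below is a minimal illustration that uses the same two regular expressions as the script; the helper name `extract_text_and_images` and the sample HTML are made up for the example and are not real Zhihu output.

```python
import re


def extract_text_and_images(content_html):
    """Return (paragraphs, image_urls) pulled from one answer's HTML body."""
    # Image URLs are read from the data-original attribute, as in the script.
    image_urls = re.findall(r'data-original="([^"]+)"', content_html)
    paragraphs = []
    for block in re.findall(r"<p>.*?</p>", content_html):
        # Drop the few inline tags the script handles; <br/> becomes a newline.
        for tag in ("<p>", "</p>", "<b>", "</b>"):
            block = block.replace(tag, "")
        paragraphs.append(block.replace("<br/>", "\n"))
    return paragraphs, image_urls


# Hypothetical sample input, only for illustration.
sample = '<p><b>Lake</b> view<br/>at dawn</p><img data-original="https://example.com/a.jpg"/>'
paragraphs, image_urls = extract_text_and_images(sample)
print(paragraphs)
print(image_urls)
```

Regex-based stripping only covers the tags listed; any other markup inside a paragraph would pass through unchanged.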

This is a quick script that fetches the text and images of the answers under a Zhihu question.

Most of the time went into adjusting the layout of the saved file.
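
Since the layout took the most tweaking, one option is to keep it in a single function so it can be adjusted in one place. The sketch below assumes an answer is a dict with the `url`, `voteup_count`, and `author` fields that the script reads; the function name `format_answer` and the sample values are hypothetical.

```python
def format_answer(item, paragraphs, image_urls):
    """Build the text block written to the output file for one answer."""
    author = item["author"]
    lines = [
        # Header line: answer URL followed by its upvote count.
        "answer:\t{}  \t\t voteup_count:\t{}".format(item["url"], item["voteup_count"]),
        # Author line: display name and profile link.
        "author:\t{}:\thttps://www.zhihu.com/people/{}".format(
            author["name"], author["url_token"]
        ),
        "",  # blank line between the header and the body
    ]
    lines.extend(paragraphs)   # answer text, one paragraph per line
    lines.extend(image_urls)   # image URLs after the text
    return "\n".join(lines) + "\n"


# Hypothetical sample record, only for illustration.
item = {
    "url": "https://example.com/answers/1",
    "voteup_count": 42,
    "author": {"name": "alice", "url_token": "alice-01"},
}
text = format_answer(item, ["First paragraph."], ["https://example.com/a.jpg"])
print(text)
```

Returning a string instead of writing directly makes the layout easy to check before anything is saved.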

It can be run directly. The code is as follows:

import re
import json
import requests
import urllib3

# Requests below use verify=False, so silence the resulting TLS warnings.
urllib3.disable_warnings()

t = 30  # only keep answers with more than this many upvotes

# 26830927: "Where is the most beautiful natural scenery in China?"
qid = 26830927

# Image URLs already written, to avoid duplicates.
img_urls = []


def get_answers():
    page_no = 0
    with open("answer.cache", "a", encoding="utf-8") as answer_cache:
        while True:
            print(page_no + 1)
            answer_cache.write("Page " + str(page_no + 1) + ":\t\n")
            is_end = get_answers_by_page(page_no, answer_cache)
            page_no += 1
            # Stop after 4 pages, or earlier once the last page is reached.
            if page_no >= 4 or is_end:
                break


def get_answers_by_page(page_no, answer_cache):
    # Page offset, determined by limit.
    offset = page_no * 20
    url = (
        "https://www.zhihu.com/api/v4/questions/{}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment"
        "%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky"
        "%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
        "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info"
        "%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp"
        "%2Cis_labeled%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A"
        "%5D.topics&limit=20&offset={}&platform=desktop&sort_by=default"
    ).format(qid, offset)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/69.0.3497.100 Safari/537.36",
    }
    r = requests.get(url, verify=False, headers=headers)
    data = json.loads(r.content.decode("utf-8"))
    items = data["data"]
    for item in items:
        if item.get("voteup_count") > t:
            print(item)
            # Header: answer URL, upvote count, author name and profile link.
            answer_cache.write("\n\nanswer" + ":\t" + item.get("url") + "  ")
            answer_cache.write("\t\t voteup_count:\t" + str(item.get("voteup_count")) + "\n")
            answer_cache.write("author:\t" + item.get("author").get("name") + ":\t" +
                               "https://www.zhihu.com/people/" + item.get("author").get("url_token") + "\n\n")
            # Pull image URLs and paragraph text out of the answer HTML.
            matched_img_url = re.findall(r'data-original="([^"]+)"', item.get("content"))
            cons = re.findall("<p>.*?</p>", item.get("content"), re.U)
            for con in cons:
                con = (con.replace("<p>", "").replace("</p>", "")
                          .replace("</b>", "").replace("<b>", "")
                          .replace("<br/>", "\n"))
                answer_cache.write(con + "\n")
            for img_url in matched_img_url:
                if img_url not in img_urls:
                    img_urls.append(img_url)
                    answer_cache.write(img_url + "\n")
    # Assumption: the API reports the last page under "paging" rather than
    # on each answer item, so the flag is read from there.
    return bool(data.get("paging", {}).get("is_end"))


if __name__ == "__main__":
    get_answers()

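The paging logic in get_answers (offset = page number × limit, with a cap of 4 pages) can also be tried offline. The sketch below is one possible arrangement, not the script's own structure: `iter_answers` and `fake_fetch` are hypothetical names, the fetch function is passed in so no HTTP request is made, and the end-of-results flag is assumed to sit under a `paging` key in the response.

```python
def iter_answers(fetch_page, limit=20, max_pages=4):
    """Yield answers page by page until the end flag is set or max_pages is hit."""
    for page_no in range(max_pages):
        data = fetch_page(offset=page_no * limit, limit=limit)
        yield from data.get("data", [])
        # Assumption: the end flag is reported under "paging" in the response.
        if data.get("paging", {}).get("is_end"):
            break


def fake_fetch(offset, limit):
    """Stand-in for the HTTP call: 45 fake answers served in pages."""
    ids = list(range(45))[offset:offset + limit]
    return {
        "data": [{"id": i} for i in ids],
        "paging": {"is_end": offset + limit >= 45},
    }


answers = list(iter_answers(fake_fetch))
print(len(answers))
```

In real use, `fetch_page` would wrap the `requests.get` call and return the decoded JSON for the given offset.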