Scraping Toutiao articles with Python

Scraping Toutiao "美文" (literary prose) articles with Python and storing them in MongoDB

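# Author: song
import json
import re
from multiprocessing import Pool
from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from requests import RequestException

# connect=False defers the connection until first use,
# so each worker process opens its own connection.
client = pymongo.MongoClient('localhost', connect=False)
db = client['toutiaowenzhang']


def get_index(offset):
    """Fetch one page of search results (JSON text) for the keyword '美文'."""
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': '美文',
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        # The request must sit inside the try block,
        # otherwise RequestException is never caught.
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def get_urls(html):
    """Yield the article URL of every item in the search result JSON."""
    if not html:  # get_index() returns None on failure
        return
    data = json.loads(html)
    if data and 'data' in data:
        for item in data.get('data') or []:
            yield item.get('article_url')


def get_index_detail(url):
    """Fetch the HTML of a single article page."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_detail(html):
    """Extract the title and body text from an article page."""
    try:
        soup = BeautifulSoup(html, 'lxml')
        title = soup.select('title')[0].get_text()
        compile_allarticle = re.compile('content.*?&lt;div&gt(.*?)&lt;/div&gt;', re.S)
        allarticle = re.findall(compile_allarticle, html)
        # article = re.sub('(&lt;.*?&lt;span&gt;)', '', allarticle[0])  # match only the unwanted part
        # Simply strip every letter, digit and entity character.
        # Joining the matches avoids the list brackets that str(allarticle) would leave behind.
        article = re.sub(r'[a-zA-Z0-9/#;&._]', '', ''.join(allarticle)).strip()
        return {'title': title, 'article': article}
    except (TypeError, IndexError):
        # html is None (for example a 404 page) or the page has no <title>
        return None


def save_to_mongodb(result):
    # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
    if db['toutiaowenzhang'].insert_one(result).acknowledged:
        print('successful')
    else:
        print('fail')


def main(offset):
    html = get_index(offset)
    for item in get_urls(html):
        if item:
            ab = get_index_detail(item)
            result = parse_detail(ab)
            if result:  # skip pages that could not be parsed
                save_to_mongodb(result)


if __name__ == '__main__':
    groups = [x * 20 for x in range(3)]  # offsets 0, 20, 40
    pool = Pool()
    pool.map(main, groups)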
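
The parsing step above removes markup by deleting every ASCII letter and digit, which also deletes any Latin text or numbers that belong to the article itself. One possible alternative, sketched below, is to unescape the HTML entities first and then strip only the tags. This is only a sketch: `SAMPLE_PAGE` is a made-up string that imitates the escaped markup the regex in the script appears to expect (real Toutiao pages may look different), the names `extract_article`, `CONTENT_RE` and `TAG_RE` are hypothetical, and the pattern here assumes a `;` after the first `&gt`, which the script's pattern does not have.

```python
import html
import re

# Made-up sample imitating the escaped markup the script's regex targets.
SAMPLE_PAGE = (
    "<html><head><title>Sample title</title></head><body><script>"
    "var info = {content: '&lt;div&gt;&lt;p&gt;Spring 2018&lt;/p&gt;"
    "&lt;p&gt;美文&lt;/p&gt;&lt;/div&gt;'};"
    "</script></body></html>"
)

# Capture everything between the escaped <div> and </div> after 'content'.
CONTENT_RE = re.compile(r"content.*?&lt;div&gt;(.*?)&lt;/div&gt;", re.S)
# Match any real HTML tag once the entities have been unescaped.
TAG_RE = re.compile(r"<[^>]+>")


def extract_article(page):
    """Return the article text without tags, or None if nothing matches."""
    match = CONTENT_RE.search(page)
    if match is None:
        return None
    markup = html.unescape(match.group(1))  # '&lt;p&gt;' becomes '<p>'
    return TAG_RE.sub("", markup).strip()


print(extract_article(SAMPLE_PAGE))
```

With this approach the letters and digits in "Spring 2018" survive, whereas the character-class substitution in the script would remove them.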

Reposted from: https://www.cnblogs.com/master-song/p/8922850.html
