No.5 Web Scraping Study: MongoDB Scraping Practice on the Hupu Forum (Tang Song, ed., 《Python网络爬虫从入门到实践》, pp. 116-123)

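Task: collect the data for all posts on the Hupu 步行街 (Buxingjie) board, including post title, post link, author, author link, creation time, reply count, view count, last reply user, and last reply time. The URL is: https://bbs.hupu.com/bxj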

The complete code using MySQL as the data store is as follows:

import requests
from bs4 import BeautifulSoup
import pymysql
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
data_list = []


def get_info(url):
    # Fetch one list page and collect one dict per post into data_list
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    names = soup.select('div > div.post-title > a')
    authors = soup.select('div > div.post-auth > a')
    times = soup.select('div > div.post-time')
    replys = soup.select('div > div.post-datum')
    for name, author, posttime, reply in zip(names, authors, times, replys):
        data = {
            'nameik': 'https://bbs.hupu.com' + name['href'],
            'name': name.get_text().strip(),
            'author': author.get_text().strip(),
            'authorik': author['href'],
            'posttime': posttime.get_text().strip(),
            # post-datum text is split on '/': first part is replies, second is views
            'reply': reply.get_text().strip().split('/')[0],
            'reading': reply.get_text().strip().split('/')[1]
        }
        print(data)
        data_list.append(data)


'''
Table creation statement (the table name must match the INSERT in get_sql)
CREATE TABLE `hupu` (
  `nameik` varchar(200) DEFAULT NULL,
  `name` varchar(200) DEFAULT NULL,
  `author` varchar(200) DEFAULT NULL,
  `authorik` varchar(200) DEFAULT NULL,
  `posttime` varchar(200) DEFAULT NULL,
  `reply` varchar(200) DEFAULT NULL,
  `reading` varchar(200) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
'''


def get_sql(data_list):
    # Build the INSERT statement from the keys of the first record
    data = data_list[0]
    cols = ", ".join('`{}`'.format(k) for k in data.keys())
    val_cols = ', '.join('%({})s'.format(k) for k in data.keys())
    sql = """INSERT INTO hupu(%s) VALUES(%s)""" % (cols, val_cols)
    return sql


def get_mysql():
    # Insert every collected record in one batch
    conn = pymysql.connect(host='localhost', user='root', passwd='123456', db='mydb', port=3306, charset='utf8')
    cursor = conn.cursor()
    sql = get_sql(data_list)
    cursor.executemany(sql, data_list)
    conn.commit()
    conn.close()


if __name__ == '__main__':
    urls = ['https://bbs.hupu.com/bxj-{}'.format(str(i)) for i in range(0, 11)]
    for url in urls:
        get_info(url)
        time.sleep(2)
    # Write to MySQL once, after all pages have been scraped
    get_mysql()
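To make the statement that get_sql builds more concrete, the following minimal sketch runs the same string-building logic on one sample record. The function name build_insert_sql and the sample values are hypothetical and only for illustration; the key names are taken from the code above. Column names are interpolated into the SQL text, while the values stay as %(key)s placeholders, which pymysql fills in from each dict when executemany is called.

```python
def build_insert_sql(table, record):
    """Build a parameterised INSERT whose columns mirror the dict keys."""
    cols = ", ".join("`{}`".format(k) for k in record.keys())
    val_cols = ", ".join("%({})s".format(k) for k in record.keys())
    return "INSERT INTO {}({}) VALUES({})".format(table, cols, val_cols)


# Hypothetical sample record; the values are invented for illustration.
sample = {
    "nameik": "https://bbs.hupu.com/12345.html",
    "name": "sample title",
    "reply": "12",
}

sql = build_insert_sql("hupu", sample)
print(sql)
# INSERT INTO hupu(`nameik`, `name`, `reply`) VALUES(%(nameik)s, %(name)s, %(reply)s)
```

Because the column list comes from the first record only, this approach assumes every dict in the batch has the same keys.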

The code that stores the data in MongoDB is as follows:

import requests
from bs4 import BeautifulSoup
import pymongo
import time

client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
hupustreet = mydb['hupu']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}


def get_info(url):
    # Fetch one list page and insert one document per post
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    names = soup.select('div > div.post-title > a')
    authors = soup.select('div > div.post-auth > a')  # name, link
    times = soup.select('div > div.post-time')
    replys = soup.select('div > div.post-datum')
    for name, author, posttime, reply in zip(names, authors, times, replys):
        data = {
            '帖子链接': 'https://bbs.hupu.com' + name['href'],      # post link
            '帖子名称': name.get_text().strip(),                    # post title
            '作者': author.get_text().strip(),                      # author
            '作者链接': author['href'],                             # author link
            '创建时间': posttime.get_text().strip(),                # creation time
            '回复数': reply.get_text().strip().split('/')[0],       # reply count
            '浏览数': reply.get_text().strip().split('/')[1]        # view count
        }
        print(data)
        hupustreet.insert_one(data)


if __name__ == '__main__':
    urls = ['https://bbs.hupu.com/bxj-{}'.format(str(i)) for i in range(0, 11)]
    for url in urls:
        get_info(url)
        time.sleep(2)
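Both versions split the post-datum text on '/' and store the two halves as strings. If numeric counts are wanted, a small helper along the following lines could convert them before insertion. This is only a sketch: the helper name parse_counts and the sample strings are assumptions for illustration, and the actual text format on the live page should be checked before relying on it.

```python
def parse_counts(text):
    """Split 'replies / views' text into two ints.

    Returns (None, None) when the text does not have exactly two parts,
    and None for any part that is not a plain decimal number.
    """
    parts = [p.strip() for p in text.strip().split("/")]
    if len(parts) != 2:
        return None, None
    reply, reading = (int(p) if p.isdecimal() else None for p in parts)
    return reply, reading


# Hypothetical sample strings, not taken from the live page.
print(parse_counts("12 / 3456"))
print(parse_counts("n/a"))
```

Storing integers rather than strings would allow sorting and range queries on the reply and view counts in either database.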
