Web Scraping: A Hands-On Example with requests and BeautifulSoup

Scraping attraction information from the TripAdvisor (猫途鹰) travel site: https://www.tripadvisor.cn/Attractions-g60763-Activities-New_York_City_New_York.html

from bs4 import BeautifulSoup
import requests
import time

url_saves = 'http://www.tripadvisor.com/Saves#37685322'
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
# Listing pages 2 onward: the oa{} segment is the result offset, in steps of 30.
urls = ['https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#ATTRACTION_LIST'.format(str(i)) for i in range(30, 930, 30)]

# Fill in your own browser's User-Agent and your logged-in Cookie.
headers = {
    'User-Agent': '',
    'Cookie': '',
}


def get_attractions(url, data=None):
    """Print title, image URL, and category text for each attraction on one listing page."""
    wb_data = requests.get(url)
    time.sleep(4)  # pause between requests to avoid hammering the site
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    titles = soup.select('div.property_title > a[target="_blank"]')
    imgs = soup.select('img[width="160"]')
    cates = soup.select('div.p13n_reasoning_v2')
    if data is None:
        # zip stops at the shortest list, so a missing image or category shortens the output.
        for title, img, cate in zip(titles, imgs, cates):
            data = {
                'title': title.get_text(),
                'img': img.get('src'),
                'cate': list(cate.stripped_strings),
            }
            print(data)


def get_favs(url, data=None):
    """Print title, image URL, and address for each saved item (requires the login Cookie)."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('a.location-name')
    imgs = soup.select('div.photo > div.sizedThumb > img.photo_image')
    metas = soup.select('span.format_address')
    if data is None:
        for title, img, meta in zip(titles, imgs, metas):
            data = {
                'title': title.get_text(),
                'img': img.get('src'),
                'meta': list(meta.stripped_strings),
            }
            print(data)


for single_url in urls:
    get_attractions(single_url)
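
The listing URLs above differ only in the `oa{}` segment, which appears to be a result offset that advances in steps of 30. The following is a minimal sketch of that pattern as a reusable function. The names `build_page_urls`, `PAGE_SIZE`, `last_offset`, and `include_first_page` are introduced here for illustration, and treating the URL without an offset as the first page is an assumption based on the URLs in the code above.

```python
BASE_URL = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
PAGED_URL = ('https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-'
             'New_York_City_New_York.html#ATTRACTION_LIST')
PAGE_SIZE = 30  # step used by range(30, 930, 30) in the original code


def build_page_urls(last_offset=900, include_first_page=True):
    """Return listing-page URLs for offsets 30, 60, ..., last_offset."""
    urls = [PAGED_URL.format(offset)
            for offset in range(PAGE_SIZE, last_offset + 1, PAGE_SIZE)]
    if include_first_page:
        # Assumption: the URL without an offset is the first page of results.
        urls.insert(0, BASE_URL)
    return urls


page_urls = build_page_urls()
print(len(page_urls))
print(page_urls[1])
```

Passing `include_first_page=False` yields the same 30 URLs as the original list comprehension, so the scraping loop could iterate over either list.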

Scraping the desktop version of the site is easily restricted. If scraping fails, try the mobile version instead.

headers = {
    'User-Agent': '',  # mobile device user agent from Chrome
}

mb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(mb_data.text, 'lxml')
imgs = soup.select('div.thumb.thumbLLR.soThumb > img')
for i in imgs:
    print(i.get('src'))
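
The "try desktop first, then mobile" idea can be sketched as a small helper. This is only an illustration under stated assumptions: `fetch_with_fallback` and its parameters are hypothetical names, and treating any non-200 status as "restricted" is a simplification that the original post does not specify. The `fetch` argument is injected so the logic can be exercised without network access; the stub classes below stand in for a real HTTP client.

```python
def fetch_with_fallback(url, fetch, desktop_headers, mobile_headers):
    """Try the desktop request first; retry with mobile headers if it fails.

    `fetch` is any callable with the signature fetch(url, headers=...) that
    returns an object with a `status_code` attribute, e.g. requests.get.
    """
    response = fetch(url, headers=desktop_headers)
    if response.status_code == 200:
        return response, 'desktop'
    # Assumption: a non-200 status means the desktop request was restricted.
    response = fetch(url, headers=mobile_headers)
    return response, 'mobile'


class StubResponse:
    """Stand-in for an HTTP response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code


def stub_fetch(url, headers=None):
    """Pretend the site rejects every user agent except the one named 'mobile-ua'."""
    allowed = headers is not None and headers.get('User-Agent') == 'mobile-ua'
    return StubResponse(200 if allowed else 403)


response, variant = fetch_with_fallback(
    'https://example.com/', stub_fetch,
    desktop_headers={'User-Agent': 'desktop-ua'},
    mobile_headers={'User-Agent': 'mobile-ua'},
)
print(variant, response.status_code)
```

With requests installed, `requests.get` can be passed as `fetch`. A site may also answer with status 200 and a blocked page, in which case checking whether the expected selectors matched anything is probably a better failure signal than the status code.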

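The headers dictionary supplies the HTTP request headers sent when fetching a page, so that the server treats the request as coming from a person using a browser.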

Enter chrome://version in the Google Chrome address bar to see the browser's user agent, then add that user agent string to the headers.
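
Because the code in this post leaves `'User-Agent'` as an empty string to be filled in, a small guard can catch a forgotten value before any request is sent. The sketch below is an illustration only: `build_headers` is a hypothetical helper, and `EXAMPLE_USER_AGENT` is a placeholder rather than a real browser string.

```python
def build_headers(user_agent, cookie=None):
    """Build a headers dict suitable for requests.get(url, headers=...)."""
    user_agent = user_agent.strip()
    if not user_agent:
        raise ValueError('User-Agent is empty; copy it from chrome://version')
    headers = {'User-Agent': user_agent}
    if cookie:
        # Only needed for pages that require a logged-in session.
        headers['Cookie'] = cookie
    return headers


# Placeholder for illustration only; paste the real string from chrome://version.
EXAMPLE_USER_AGENT = 'Mozilla/5.0 (placeholder; replace with your own user agent)'
example_headers = build_headers(EXAMPLE_USER_AGENT)
print(example_headers)
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`.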
