Scraping Qunar hotel information and reviews

Step 1: Get the city list

import requests
import json
import codecs

# Qunar city list
url = "https://touch.qunar.com/h-api/hotel/hotelcity/en"
s = requests.get(url)

# Save the raw response so the later steps can read it back
file = codecs.open('./city.json', 'w', 'utf-8')
file.write(s.text)
file.close()

Result: the response body is saved to ./city.json.
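The step 2 code walks city.json as a list of groups under "data", where each group maps a key to a list of entries carrying cityUrl and cityName. The sketch below shows that traversal on its own. The sample values and the helper name iter_cities are made up for illustration; only the overall shape is taken from how the later code indexes the file.

```python
import json

# Hypothetical sample that mirrors the shape the step 2 code relies on:
# {"data": [ {group_key: [ {"cityUrl": ..., "cityName": ...}, ... ]}, ... ]}
SAMPLE = '''
{"data": [
  {"A": [{"cityUrl": "anshan", "cityName": "鞍山"}]},
  {"B": [{"cityUrl": "beijing_city", "cityName": "北京"},
         {"cityUrl": "baoding", "cityName": "保定"}]}
]}
'''


def iter_cities(city_json):
    """Yield (cityUrl, cityName) pairs from the parsed city list."""
    for group in city_json["data"]:
        for _key, entries in group.items():
            for entry in entries:
                yield entry["cityUrl"], entry["cityName"]


cities = list(iter_cities(json.loads(SAMPLE)))
print(cities)
```

With the real file, json.load on ./city.json would take the place of json.loads(SAMPLE).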

Step 2: Scrape hotel information for each city in the list (using Hanting Hotel, 汉庭酒店, as the example)

Two points to note:

1. The request must carry a JSON payload.
The fromDate--toDate range must not lie in the past relative to the current time.

payload = {
    "b": {
        "bizVersion": "17",
        "cityUrl": cityid,
        "fromDate": fromDate,
        "toDate": toDate,
        "q": "汉庭酒店",
        "qFrom": 3,
        "start": start,
        "num": 20,
        "minPrice": 0,
        "maxPrice": -1,
        "level": "",
        "sort": 0,
        "cityType": 1,
        "fromForLog": 1,
        "uuid": "",
        "userName": "",
        "userId": "",
        "fromAction": "",
        "searchType": 0,
        "hourlyRoom": False,
        "locationAreaFilter": [],
        "comprehensiveFilter": [],
        "channelId": 1,
    },
    "qrt": "h_hlist",
    "source": "website",
}

2. Send the request with the headers below, filling in your own cookie.

headers = {
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9",
    "content-length": "389",
    "content-type": "application/json;charset=UTF-8",
    "cookie": "your cookie",
    "origin": "https://hotel.qunar.com",
    "referer": r_url,
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
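Since the date window matters, it can help to build the payload in one place and derive both dates from a single "today" value. The following is a minimal sketch under that assumption. The function name build_payload and its keyword and today parameters are hypothetical; the field names and values are copied from the payload above.

```python
import datetime
import json


def build_payload(city_url, start, keyword="汉庭酒店", today=None):
    """Build the list-request payload with a today-to-tomorrow date window."""
    today = today or datetime.date.today()
    from_date = today.strftime("%Y-%m-%d")
    to_date = (today + datetime.timedelta(days=1)).strftime("%Y-%m-%d")
    return {
        "b": {
            "bizVersion": "17", "cityUrl": city_url,
            "fromDate": from_date, "toDate": to_date,
            "q": keyword, "qFrom": 3, "start": start, "num": 20,
            "minPrice": 0, "maxPrice": -1, "level": "", "sort": 0,
            "cityType": 1, "fromForLog": 1, "uuid": "", "userName": "",
            "userId": "", "fromAction": "", "searchType": 0,
            "hourlyRoom": False, "locationAreaFilter": [],
            "comprehensiveFilter": [], "channelId": 1,
        },
        "qrt": "h_hlist",
        "source": "website",
    }


# A fixed date is passed here only to make the example reproducible.
payload = build_payload("beijing_city", 0, today=datetime.date(2024, 1, 31))
body = json.dumps(payload)  # the string that would be sent as the POST body
print(payload["b"]["fromDate"], payload["b"]["toDate"])
```

Passing today explicitly is only a convenience for testing; leaving it out falls back to the current date, which matches what the original code does.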

The code:

import json
import requests
import datetime
import urllib.parse as p
import time
import codecs
import csv
import re


def get_session(cityid, city):
    # Search window: today to tomorrow
    fromDate = datetime.date.today().strftime("%Y-%m-%d")
    toDate = (datetime.date.today() + datetime.timedelta(days=1)).strftime("%Y-%m-%d")
    url = "https://hotel.qunar.com/cn/{}/?fromDate={}&toDate={}&cityName={}&from=qunarindex&cityurl=".format(
        cityid, fromDate, toDate, p.quote(city))
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "referer": "https://www.qunar.com/",
    }
    # Open the listing page first; its URL is reused later as the referer
    session = requests.session()
    res = session.get(url=url, headers=headers)
    return session, url, fromDate, toDate


def get_data(session, cityid, fromDate, toDate, start, url, headers):
    payload = {
        "b": {
            "bizVersion": "17", "cityUrl": cityid, "fromDate": fromDate, "toDate": toDate,
            "q": "汉庭酒店", "qFrom": 3, "start": start, "num": 20,
            "minPrice": 0, "maxPrice": -1, "level": "", "sort": 0, "cityType": 1,
            "fromForLog": 1, "uuid": "", "userName": "", "userId": "", "fromAction": "",
            "searchType": 0, "hourlyRoom": False, "locationAreaFilter": [],
            "comprehensiveFilter": [], "channelId": 1,
        },
        "qrt": "h_hlist",
        "source": "website",
    }
    data = json.dumps(payload)
    res = session.post(url=url, data=data, headers=headers)
    if start == 0:
        # The first page also returns the total hit count (tcount)
        print(json.loads(res.text))
        return res, session, json.loads(res.text)["data"]["tcount"]
    else:
        return res, session


def get_pages(cityid, city):
    session, r_url, fromDate, toDate = get_session(cityid, city)
    url = "https://hotel.qunar.com/napi/list"
    headers = {
        "accept": "application/json, text/plain, */*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9",
        "content-length": "389",
        "content-type": "application/json;charset=UTF-8",
        "cookie": "your cookie",
        "origin": "https://hotel.qunar.com",
        "referer": r_url,
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    }
    start = 0
    res, session, end_num = get_data(session, cityid, fromDate, toDate, start, url, headers)
    with open("{}.csv".format(city), 'a+', encoding="utf-8") as f:
        f.write("name,url,city,seqno\n")
    # The original loop started at 20, which dropped the hotels on the first page;
    # start at 0 and reuse the response fetched above.
    for start in range(0, end_num, 20):
        if start > 0:
            res, session = get_data(session, cityid, fromDate, toDate, start, url, headers)
        if res.status_code == 200:
            with open("{}.csv".format(city), 'a+', encoding="utf-8") as f:
                time.sleep(2)
                hotels = json.loads(res.text)['data']['hotels']
                for i in range(0, len(hotels)):
                    print(hotels[i]['seqNo'])
                    # seqNo looks like "<cityid>_<number>"; keep the part after the city id
                    seqNo = re.findall(cityid + '_(.*)', str(hotels[i]['seqNo']))
                    print(seqNo)
                    f.write(hotels[i]['name'] + ',' + 'https://hotel.qunar.com/cn/' + cityid
                            + '/dt-' + seqNo[0] + ',' + city + ',' + hotels[i]['seqNo'] + '\n')
        else:
            print("Failed to fetch data")
            time.sleep(2)


def get_city():
    with codecs.open('./city.json', 'r', 'utf-8') as f:
        return json.loads(f.read())


if __name__ == "__main__":
    d1 = get_city()
    for k in range(len(d1['data'])):
        items = d1['data'][k].items()
        for key, value in items:
            # Note: this starts at index 10, so the first ten cities of each
            # group are skipped; change 10 to 0 to cover every city.
            for j in range(10, len(value)):
                cityid = value[j]['cityUrl']
                city = value[j]['cityName']
                print(cityid)
                get_pages(cityid, city)

Result: one CSV file per city, named after the city, with the columns name,url,city,seqno.
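Two small pieces of the step 2 logic are easy to check in isolation: the page offsets derived from tcount, and the detail URL built from seqNo. The sketch below is an illustration only. The helper names page_starts and detail_url and the sample seqNo values are hypothetical; the "<cityid>_<suffix>" layout is inferred from the regular expression in the code above.

```python
import re


def page_starts(total, page_size=20):
    """Return the 'start' offsets needed to cover `total` results."""
    return list(range(0, total, page_size))


def detail_url(cityid, seq_no):
    """Build the hotel detail URL, or return None if seq_no lacks the city prefix."""
    # re.escape guards against city ids containing regex metacharacters.
    match = re.fullmatch(re.escape(cityid) + r"_(.*)", str(seq_no))
    if match is None:
        return None
    return "https://hotel.qunar.com/cn/{}/dt-{}".format(cityid, match.group(1))


offsets = page_starts(45)
url_ok = detail_url("beijing_city", "beijing_city_1234")
url_bad = detail_url("beijing_city", "shanghai_city_1")
print(offsets, url_ok, url_bad)
```

Returning None for a non-matching seqNo avoids the IndexError that seqNo[0] would raise when re.findall finds nothing.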

Step 3: Scrape hotel reviews

import requests
import csv
import json

# new.csv is read with the same columns as the step 2 output:
# name,url,city,seqno
f = open(r'new.csv', 'r', encoding='gbk')
with open('remark1.csv', 'a+', encoding='utf-8') as f_remark:
    f_remark.write("name,star,feed,time\n")
reader = csv.reader(f)
for item in reader:
    print(reader.line_num)
    if reader.line_num == 1:    # skip the header row
        continue
    if reader.line_num == 361:  # stop at row 361
        break
    if reader.line_num < 236:   # resume from row 236
        continue
    print("Current row:", item)
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9",
        # "content-length": "389",
        # "content-type": "application/json;charset=UTF-8",
        "cookie": "your cookie",
        # "origin": "https://hotel.qunar.com",
        "referer": item[1],
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    }
    # Fetch up to 100 pages of 10 reviews per hotel
    for i in range(100):
        try:
            url = "https://hotel.qunar.com/napi/ugcCmtList?hotelSeq={}&page={}&size=10".format(item[3], i + 1)
            response = requests.get(url=url, headers=headers)
            res = json.loads(response.text)
            for j in range(len(res['data']['list'])):
                # Each 'content' field is itself a JSON string
                content = json.loads(res['data']['list'][j]['content'])
                star = content['evaluation']
                # Strip line breaks and swap ASCII commas for full-width ones
                # so the review text does not break the hand-built CSV row
                feed = content['feedContent'].replace('\n', '').replace('\r', '').replace(',', ',')
                modtime = content['modtime']
                if star == '' or feed == '' or str(modtime) == '':
                    continue
                with open('remark1.csv', 'a+', encoding='utf-8') as f_remark:
                    f_remark.write(item[0] + ',' + str(star) + ',' + str(feed) + ',' + str(modtime) + '\n')
                print(feed)
        except Exception:
            # Any failure (bad response, missing field) ends paging for this hotel
            break
f.close()

Result: the reviews are appended to remark1.csv with the columns name,star,feed,time.
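The review response nests a JSON string inside each list entry, and review text can contain commas. As a possible alternative to replacing commas by hand, the sketch below parses the nested content and lets csv.writer handle quoting. The function name extract_rows and the sample response are made up for illustration; only the field names (data, list, content, evaluation, feedContent, modtime) come from the code above.

```python
import csv
import io
import json


def extract_rows(hotel_name, response_json):
    """Turn one parsed review-list response into rows of [name, star, feed, time]."""
    rows = []
    for entry in response_json["data"]["list"]:
        content = json.loads(entry["content"])  # 'content' is a JSON string
        star = content.get("evaluation", "")
        feed = str(content.get("feedContent", "")).replace("\n", "").replace("\r", "")
        modtime = content.get("modtime", "")
        if star == "" or feed == "" or str(modtime) == "":
            continue  # skip incomplete reviews, as the original code does
        rows.append([hotel_name, star, feed, modtime])
    return rows


# Hypothetical response with one complete and one empty review.
sample = {"data": {"list": [
    {"content": json.dumps({"evaluation": 5, "feedContent": "干净,安静\n",
                            "modtime": 1600000000000})},
    {"content": json.dumps({"evaluation": 4, "feedContent": "",
                            "modtime": 1600000000001})},
]}}

rows = extract_rows("汉庭酒店", sample)
buffer = io.StringIO()  # stands in for the remark1.csv file handle
writer = csv.writer(buffer)
writer.writerow(["name", "star", "feed", "time"])
writer.writerows(rows)
print(buffer.getvalue())
```

Because csv.writer quotes any field that contains a comma, the review text can be stored unchanged and read back with csv.reader.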
