Scraping Qunar Hotel Information and Reviews

Step 1: Get the city list

import requests
import codecs

# Qunar city list
url = "https://touch.qunar.com/h-api/hotel/hotelcity/en"
s = requests.get(url)

# Save the raw JSON response; step 2 reads it back from this file.
with codecs.open('./city.json', 'w', 'utf-8') as file:
    file.write(s.text)

Result: the city list is saved to ./city.json.
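The step 2 code reads city.json as a top-level "data" list whose items map a group key to a list of city entries, each with "cityUrl" and "cityName". A minimal sketch of walking that layout follows. The sample string and its cityUrl values are made up for illustration, and the helper name iter_cities is hypothetical; the real file may carry more fields.

```python
import json

# Hypothetical sample that mimics the layout the step 2 code relies on.
# The cityUrl values below are illustrative, not taken from the real API.
sample = (
    '{"data": ['
    '{"B": [{"cityUrl": "beijing_city", "cityName": "北京"}]}, '
    '{"S": [{"cityUrl": "shanghai_city", "cityName": "上海"}]}'
    ']}'
)


def iter_cities(city_json):
    """Yield (cityUrl, cityName) pairs from the parsed city list."""
    for group in city_json["data"]:
        for _key, entries in group.items():
            for entry in entries:
                yield entry["cityUrl"], entry["cityName"]


cities = list(iter_cities(json.loads(sample)))
print(cities)
```

Flattening the nested groups into one generator keeps the later scraping loop independent of how the cities are grouped in the file.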

Step 2: Scrape hotel information from the city list (using Hanting Hotel as the example)

Two things to note:

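1. The request must carry a JSON payload.
The current time must not be later than the fromDate--toDate range; the code below uses today as fromDate and tomorrow as toDate.

payload = {
    "b": {
        "bizVersion": "17",
        "cityUrl": cityid,
        "fromDate": fromDate,
        "toDate": toDate,
        "q": "汉庭酒店",  # search keyword: Hanting Hotel
        "qFrom": 3,
        "start": start,
        "num": 20,
        "minPrice": 0,
        "maxPrice": -1,
        "level": "",
        "sort": 0,
        "cityType": 1,
        "fromForLog": 1,
        "uuid": "",
        "userName": "",
        "userId": "",
        "fromAction": "",
        "searchType": 0,
        "hourlyRoom": False,
        "locationAreaFilter": [],
        "comprehensiveFilter": [],
        "channelId": 1,
    },
    "qrt": "h_hlist",
    "source": "website",
}

2. Send the request with the headers below, filling in your own cookie.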
headers = {
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9",
    "content-length": "389",
    "content-type": "application/json;charset=UTF-8",
    "cookie": "your cookie",
    "origin": "https://hotel.qunar.com",
    "referer": r_url,
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
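The date rule in note 1 can be checked before any request is sent. The sketch below reads the rule as "fromDate must not be in the past and must come before toDate", which is an interpretation of the note rather than documented API behavior. The helper names date_window and check_window are hypothetical.

```python
import datetime
import json


def date_window(today=None, nights=1):
    """Return (fromDate, toDate) strings starting at `today`."""
    today = today or datetime.date.today()
    to_date = today + datetime.timedelta(days=nights)
    return today.strftime("%Y-%m-%d"), to_date.strftime("%Y-%m-%d")


def check_window(from_date, to_date, today=None):
    """True if today <= fromDate < toDate (assumed reading of the note)."""
    today = today or datetime.date.today()
    start = datetime.datetime.strptime(from_date, "%Y-%m-%d").date()
    end = datetime.datetime.strptime(to_date, "%Y-%m-%d").date()
    return today <= start < end


from_date, to_date = date_window(datetime.date(2024, 1, 31))
print(from_date, to_date)

# json.dumps turns Python's False into JSON's false, so the payload dict
# can be written with Python literals and serialized once before posting.
body = json.dumps({"fromDate": from_date, "toDate": to_date, "hourlyRoom": False})
print(body)
```

Passing `today` explicitly makes the helpers easy to test; in the scraper itself the default of the current date would be used.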

The code:

import json
import requests
import datetime
import urllib.parse as p
import time
import codecs
import re

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"


def get_session(cityid, city):
    # Open the city's hotel page first so the session carries the site's cookies.
    fromDate = datetime.date.today().strftime("%Y-%m-%d")
    toDate = (datetime.date.today() + datetime.timedelta(days=1)).strftime("%Y-%m-%d")
    url = "https://hotel.qunar.com/cn/{}/?fromDate={}&toDate={}&cityName={}&from=qunarindex&cityurl=".format(
        cityid, fromDate, toDate, p.quote(city))
    headers = {"user-agent": USER_AGENT, "referer": "https://www.qunar.com/"}
    session = requests.session()
    session.get(url=url, headers=headers)
    return session, url, fromDate, toDate


def get_data(session, cityid, fromDate, toDate, start, url, headers):
    # Request one page of 20 hotels, beginning at offset `start`.
    payload = {
        "b": {
            "bizVersion": "17", "cityUrl": cityid,
            "fromDate": fromDate, "toDate": toDate,
            "q": "汉庭酒店",  # search keyword: Hanting Hotel
            "qFrom": 3, "start": start, "num": 20,
            "minPrice": 0, "maxPrice": -1, "level": "", "sort": 0,
            "cityType": 1, "fromForLog": 1, "uuid": "", "userName": "",
            "userId": "", "fromAction": "", "searchType": 0,
            "hourlyRoom": False, "locationAreaFilter": [],
            "comprehensiveFilter": [], "channelId": 1,
        },
        "qrt": "h_hlist",
        "source": "website",
    }
    return session.post(url=url, data=json.dumps(payload), headers=headers)


def write_hotels(res, cityid, city):
    # Append one page of hotels to <city>.csv.
    hotels = json.loads(res.text)['data']['hotels']
    with open("{}.csv".format(city), 'a+', encoding="utf-8") as f:
        for hotel in hotels:
            # seqNo looks like "<cityid>_<number>"; the number forms the detail-page URL.
            seqNo = re.findall(cityid + '_(.*)', str(hotel['seqNo']))
            f.write(hotel['name'] + ',' + 'https://hotel.qunar.com/cn/' + cityid
                    + '/dt-' + seqNo[0] + ',' + city + ',' + hotel['seqNo'] + '\n')


def get_pages(cityid, city):
    session, r_url, fromDate, toDate = get_session(cityid, city)
    url = "https://hotel.qunar.com/napi/list"
    headers = {
        "accept": "application/json, text/plain, */*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9",
        "content-length": "389",
        "content-type": "application/json;charset=UTF-8",
        "cookie": "your cookie",
        "origin": "https://hotel.qunar.com",
        "referer": r_url,
        "user-agent": USER_AGENT,
    }
    with open("{}.csv".format(city), 'a+', encoding="utf-8") as f:
        f.write("name,url,city,seqno\n")
    # The first response also reports the total number of matches (tcount).
    res = get_data(session, cityid, fromDate, toDate, 0, url, headers)
    end_num = json.loads(res.text)["data"]["tcount"]
    for start in range(0, end_num, 20):
        if start > 0:
            res = get_data(session, cityid, fromDate, toDate, start, url, headers)
        if res.status_code == 200:
            time.sleep(2)
            # The original discarded the first page; it is written here as well.
            write_hotels(res, cityid, city)
        else:
            print("Failed to fetch data")
            time.sleep(2)


def get_city():
    with codecs.open('./city.json', 'r', 'utf-8') as f:
        return json.loads(f.read())


if __name__ == "__main__":
    d1 = get_city()
    for group in d1['data']:
        for key, value in group.items():
            # As in the original, the first 10 cities of each group are skipped.
            for j in range(10, len(value)):
                cityid = value[j]['cityUrl']
                city = value[j]['cityName']
                print(cityid)
                get_pages(cityid, city)

Result: one CSV file per city with the columns name, url, city, seqno.
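Two details of the step 2 code are easy to check in isolation: how the detail-page URL is derived from seqNo, and how the page offsets are generated. The sketch below also writes rows with csv.writer instead of joining strings by hand, a swapped-in technique (not what the original does) that quotes hotel names containing commas. The helper names and the sample values are hypothetical.

```python
import csv
import io
import re


def detail_url(cityid, seq_no):
    """Build the hotel detail URL from a seqNo of the form '<cityid>_<number>'."""
    match = re.search(re.escape(cityid) + r"_(.*)", seq_no)
    if match is None:
        return None  # seqNo does not belong to this city
    return "https://hotel.qunar.com/cn/{}/dt-{}".format(cityid, match.group(1))


def page_starts(total, page_size=20):
    """Offsets to request so that every one of `total` results is covered."""
    return list(range(0, total, page_size))


# Illustrative values only; real seqNo strings come from the list API.
url = detail_url("beijing_city", "beijing_city_123")

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["name", "url", "city", "seqno"])
writer.writerow(["Hanting Hotel, East Gate", url, "北京", "beijing_city_123"])
print(buf.getvalue())
print(page_starts(45))
```

Returning None for a non-matching seqNo avoids the IndexError that indexing an empty findall result would raise.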

Step 3: Scrape hotel reviews

import requests
import csv
import json

# Rows of new.csv handled in this run (line numbers as in the original script).
START_LINE = 236
END_LINE = 361

with open('remark1.csv', 'a+', encoding='utf-8') as f_remark:
    f_remark.write("name,star,feed,time\n")

# new.csv holds the hotel list; its columns are used as name, url, city, seqno.
with open(r'new.csv', 'r', encoding='gbk') as f:
    reader = csv.reader(f)
    for item in reader:
        print(reader.line_num)
        if reader.line_num == 1:  # skip the header row
            continue
        if reader.line_num == END_LINE:
            break
        if reader.line_num < START_LINE:
            continue
        print("Current row:", item)
        headers = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9",
            "cookie": "your cookie",
            "referer": item[1],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        }
        for i in range(100):  # at most 100 pages of 10 reviews per hotel
            try:
                url = "https://hotel.qunar.com/napi/ugcCmtList?hotelSeq={}&page={}&size=10".format(item[3], i + 1)
                response = requests.get(url=url, headers=headers)
                res = json.loads(response.text)
                for review in res['data']['list']:
                    # Each review's content field is itself a JSON string.
                    content = json.loads(review['content'])
                    star = content['evaluation']
                    # Strip line breaks and swap ASCII commas so the CSV row stays intact.
                    feed = content['feedContent'].replace('\n', '').replace('\r', '').replace(',', '，')
                    mod_time = content['modtime']
                    if star == '' or feed == '' or str(mod_time) == '':
                        continue
                    with open('remark1.csv', 'a+', encoding='utf-8') as f_remark:
                        f_remark.write(item[0] + ',' + str(star) + ',' + str(feed) + ',' + str(mod_time) + '\n')
                    print(feed)
            except Exception:
                # Stop paging this hotel on any request or parsing error.
                break

Result: the reviews are appended to remark1.csv with the columns name, star, feed, time.
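The review parsing in step 3 can be tried offline, without a network request. The sketch below mirrors what the script does with each list item: decode the nested JSON string in "content", clean the review text, and skip incomplete entries. The sample item is invented for illustration (the real field types, such as the format of modtime, are not confirmed here), and parse_review is a hypothetical helper name.

```python
import json


def parse_review(raw_item):
    """Return (star, feed, time) as strings, or None if a field is empty."""
    # The content field holds a JSON document encoded as a string.
    content = json.loads(raw_item["content"])
    star = content.get("evaluation", "")
    feed = content.get("feedContent", "")
    # Remove line breaks and replace ASCII commas so the text fits one CSV cell.
    feed = feed.replace("\n", "").replace("\r", "").replace(",", "，")
    mod_time = content.get("modtime", "")
    if star == "" or feed == "" or str(mod_time) == "":
        return None
    return str(star), feed, str(mod_time)


# Invented sample item; only the three keys used by the script are included.
sample_item = {
    "content": json.dumps(
        {"evaluation": 5, "feedContent": "Clean room,\nquiet", "modtime": 1600000000000}
    )
}
empty_item = {"content": json.dumps({"evaluation": 5, "feedContent": "", "modtime": 1})}

parsed = parse_review(sample_item)
print(parsed)
print(parse_review(empty_item))
```

Using `.get` with a default means a missing key is treated like an empty value and the review is skipped, instead of raising and ending the paging loop for that hotel.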
