Scraping Qunar Hotel Information with Selenium

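On the Qunar website, when we try to scrape hotel information with the usual requests library, we find that the URL does not change when we turn the page, so the page source returned always contains only the first page. Is there a way to scrape across pages? This is where Selenium, a framework commonly used for scraping, comes in. This article walks through how to use Selenium to scrape hotel information from Qunar; I hope you get something out of it.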

Installing the selenium third-party library:

Install it with the command pip install selenium. The download can be slow, so you can point pip at a mirror instead:

pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
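To confirm that the install succeeded before going further, a small check like the following can help. It is only a sketch using the standard library; the helper name installed_version is made up for this example.

```python
from importlib import metadata, util


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    # find_spec returns None when the top-level package cannot be imported
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("selenium:", installed_version("selenium") or "not installed")
```

If it prints "not installed", pip probably installed into a different Python environment than the one you are running.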

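Installing the browser driver:

First determine the version of your browser, because the driver has to match it. I am using Google Chrome here.

Drivers can be downloaded from the following sites:

Google Chrome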

https://chromedriver.storage.googleapis.com/index.html

Firefox

https://github.com/mozilla/geckodriver/releases

Edge

Microsoft Edge WebDriver - Microsoft Edge Developer

After downloading, put the driver executable in the Python installation directory.

That completes the setup Selenium needs. Next, we scrape the site.
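Putting the driver in the Python installation directory works when that directory is on the PATH, which is an assumption about your setup. The sketch below checks whether a driver executable can be found on the PATH; find_driver is a hypothetical helper name, and "chromedriver" is the usual executable name for Chrome. Recent Selenium releases (4.6 and later) also ship Selenium Manager, which can usually locate or download a matching driver on its own, so a missing driver is not always fatal.

```python
import shutil


def find_driver(executable: str = "chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_driver()
    print(path if path else "chromedriver was not found on PATH")
```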

Import the required libraries:

from selenium import webdriver
from lxml import html
import time
import re
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
import json

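Analyzing the page:

First, let's be clear about what we want to extract. The fields I want are the hotel name, price, tier (for example comfort or upscale), score, overall rating, number of reviews, and approximate location.

When we visit the URL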

https://hotel.qunar.com/cn/fuzhou_fujian?fromDate=2023-05-20&toDate=2023-05-21&cityName=%E7%A6%8F%E5%B7%9E

we find that we have to log in before the page content can be viewed.

So when scraping with Selenium, we need to simulate a login in order to get the complete page source.
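The search URL above carries the check-in date, the check-out date and a percent-encoded city name as query parameters. The sketch below rebuilds that URL from its parts. The helper build_search_url is hypothetical, and whether other cities follow the same path pattern as fuzhou_fujian is an assumption that has not been checked.

```python
from urllib.parse import quote, urlencode

BASE = "https://hotel.qunar.com/cn"


def build_search_url(city_slug: str, from_date: str, to_date: str, city_name: str) -> str:
    """Assemble a hotel search URL with the same query parameters as the one above."""
    query = urlencode(
        {"fromDate": from_date, "toDate": to_date, "cityName": city_name},
        quote_via=quote,  # percent-encode the city name as UTF-8
    )
    return f"{BASE}/{city_slug}?{query}"


if __name__ == "__main__":
    print(build_search_url("fuzhou_fujian", "2023-05-20", "2023-05-21", "福州"))
```

Building the URL this way makes it easy to change the dates without editing the encoded string by hand.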

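Step 1:

We need to simulate logging in to the site. My approach is to capture the cookies of a logged-in session and then reuse them to simulate the login.

The code is as follows: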

option = ChromeOptions()
# Configure the browser so that it does not advertise that it is being automated
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Set the language and encoding
option.add_argument('lang=zh_CN.UTF-8')
browser = webdriver.Chrome(options=option)
browser.get('https://hotel.qunar.com/cn/fuzhou_fujian?fromDate=2023-04-15&toDate=2023-04-16&cityName=%E7%A6%8F%E5%B7%9E')
# Wait 30 seconds so that you can log in by hand
time.sleep(30)
dictCookies = browser.get_cookies()  # get the cookies as a list of dicts
jsonCookies = json.dumps(dictCookies)  # serialize to a string for saving
with open('cookie.txt', 'w') as f:
    f.write(jsonCookies)
print('Cookies saved successfully!')

Here, after Selenium opens the page, we wait 30 seconds so that we can fill in our login details by hand. We then read the cookies and save them to the file "cookie.txt".
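The save and the later load of the cookie file can be wrapped in two small helpers so that both sides agree on the file format and encoding. This is a sketch, not the author's code; save_cookies and load_cookies are names invented for this example, and the cookie list is whatever browser.get_cookies() returned.

```python
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to `path` as UTF-8 JSON."""
    Path(path).write_text(json.dumps(cookies, ensure_ascii=False), encoding="utf-8")


def load_cookies(path):
    """Read the cookie list back; return [] if the file does not exist."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding="utf-8"))
```

With a live browser this would be called as save_cookies(browser.get_cookies(), 'cookie.txt'). Returning an empty list for a missing file lets the caller decide whether to fall back to a manual login.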

Step 2:

With the cookies saved, we can start scraping. First, we use the cookies collected in the previous step to simulate the login:

def crack_permissions(browser):
    # Sleep so that a slow page load can finish
    time.sleep(5)
    # Read the cookie file to get the user's login cookies
    with open('cookie.txt', 'r', encoding='utf8') as f:
        listCookies = json.loads(f.read())
    # Add the cookies to the browser
    for cookie in listCookies:
        cookie_dict = {
            'domain': '.qunar.com',
            'name': cookie.get('name'),
            'value': cookie.get('value'),
            "expires": '',
            'path': '/',
            'httpOnly': False,
            'HostOnly': False,
            'Secure': False
        }
        browser.add_cookie(cookie_dict)
    # Refresh the page so that the cookies take effect
    browser.refresh()
    time.sleep(2)
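The cookie dictionary above includes keys such as "expires" and 'HostOnly' that are not part of the cookie format in the W3C WebDriver specification, which defines name, value, path, domain, secure, httpOnly, expiry and sameSite. A stricter variant, shown here only as a sketch, keeps just those keys before calling browser.add_cookie. The helper to_webdriver_cookie is hypothetical, and forcing the domain to .qunar.com simply mirrors what the code above does.

```python
# Cookie keys defined by the W3C WebDriver specification.
ALLOWED_KEYS = ("name", "value", "path", "domain", "secure", "httpOnly", "expiry", "sameSite")


def to_webdriver_cookie(raw: dict, domain: str = ".qunar.com") -> dict:
    """Keep only WebDriver cookie keys, force the domain, and default the path."""
    cookie = {k: raw[k] for k in ALLOWED_KEYS if k in raw and raw[k] is not None}
    cookie["domain"] = domain
    cookie.setdefault("path", "/")
    return cookie
```

Inside the loop this would be used as browser.add_cookie(to_webdriver_cookie(cookie)). Unlike the original, it preserves the saved expiry and secure flags instead of overwriting them.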

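Step 3:

Once logged in, we can start fetching the page source. We need to set the number of pages to scrape; I set it to 120. Note that after every page turn we have to scroll down so that the page loads and renders its content; otherwise we do not get the complete page. The scrolling code is: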

# Simulate scrolling to the bottom of the page
for j in range(1, 4):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
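Scrolling a fixed three times may be too few on a slow connection and more than needed on a fast one. An alternative, sketched below under the assumption that the page grows taller as more hotels load, is to keep scrolling until the page height stops changing. The function scroll_to_bottom and its parameters are invented for this example; browser is any Selenium WebDriver instance.

```python
import time


def scroll_to_bottom(browser, pause: float = 1.0, max_rounds: int = 10) -> int:
    """Scroll down until the page height stops growing; return the number of scrolls."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        rounds += 1
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return rounds
```

The max_rounds cap keeps the loop from running forever on a page that never stops growing.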

Step 4:

Next we parse the page source. I use XPath for this:

# Get the page source
resp = browser.page_source
# Load etree for XPath parsing
etree = html.etree
xml = etree.HTML(resp)

Then extract each field:

# Holds the fields of the hotel currently being parsed
mess_dict = {}
for k in range(1, 21):
    # name: hotel name
    name = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[1]/a/text()')
    if len(name) > 0:
        mess_dict['name'] = name[0]
    else:
        mess_dict['name'] = ''
    # Hotel price
    price = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[2]/p[1]/a/text()')
    if len(price) > 0:
        try:
            mess_dict['price'] = int(price[0])
        except ValueError:
            mess_dict['price'] = 0
    else:
        mess_dict['price'] = 0
    # Tier, for example comfort or upscale
    dangciText = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[1]/span[2]/text()')
    if len(dangciText) > 0:
        mess_dict['dangciText'] = dangciText[0]
    else:
        mess_dict['dangciText'] = ''
    # Hotel score
    score = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[1]/text()')
    if len(score) > 0:
        try:
            mess_dict['score'] = float(score[0])
        except ValueError:
            mess_dict['score'] = 0.0
    else:
        mess_dict['score'] = 0.0
    # Overall rating
    commentDesc = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[2]/text()')
    if len(commentDesc) > 0:
        # This element sometimes holds the review count instead of the rating text,
        # so check whether the extracted text is actually a review count
        tmp = re.findall('共(.*?)条评论', commentDesc[0])
        if len(tmp) > 0:
            mess_dict['commentDesc'] = ''
        else:
            mess_dict['commentDesc'] = commentDesc[0]
    else:
        mess_dict['commentDesc'] = ''
    # Number of reviews
    commentCount = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[3]/text()')
    if len(commentCount) == 0:
        commentCount = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[2]/text()')
    if len(commentCount) > 0:
        tmp = re.findall('共(.*?)条评论', commentCount[0])
        if len(tmp) > 0:
            try:
                mess_dict['commentCount'] = int(tmp[0])
            except ValueError:
                mess_dict['commentCount'] = 0
        else:
            mess_dict['commentCount'] = 0
    else:
        mess_dict['commentCount'] = 0
    # Approximate location of the hotel
    locationInfo = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[3]/text()')
    if len(locationInfo) > 0:
        mess_dict['locationInfo'] = locationInfo[0]
    else:
        mess_dict['locationInfo'] = ''
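The extraction code repeats the same pattern for every field: take the first XPath result, convert it, and fall back to a default. Two small helpers can express that pattern once. This is a sketch rather than the author's code; first_or and parse_comment_count are invented names, and the pattern here matches digits only, which is slightly stricter than the '共(.*?)条评论' pattern used above.

```python
import re

# Matches strings such as '共123条评论' ("123 reviews in total")
COMMENT_COUNT = re.compile(r"共(\d+)条评论")


def first_or(values, default, cast=str):
    """Return cast(values[0]), or `default` if the list is empty or the cast fails."""
    if not values:
        return default
    try:
        return cast(values[0])
    except (TypeError, ValueError):
        return default


def parse_comment_count(text: str) -> int:
    """Extract the number from a review-count string; return 0 if it does not match."""
    match = COMMENT_COUNT.search(text)
    return int(match.group(1)) if match else 0
```

With these, the price field for example becomes first_or(price, 0, int) and the score becomes first_or(score, 0.0, float), where price and score are the lists returned by xml.xpath.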

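Step 5:

The steps above give us the information we want. It has to be stored somewhere; I store it in a MySQL database. The complete script below imports this function from a module named save_data: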

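import pymysql


def Connect_Sql(data_name: str):
    # Connect to the local MySQL server and select the given database
    db = pymysql.connect(host='localhost', user='root', password='root', db=data_name, port=3306)
    return db


def save_data_sql(data):
    try:
        conn = Connect_Sql('ptu')
        cursor = conn.cursor()
        try:
            # One placeholder per field; the driver escapes the values
            sql = "insert into hotel_mess values (%s,%s,%s,%s,%s,%s,%s,%s)"
            cursor.execute(sql, data)
        except Exception as exc:
            print("Insert failed (missing or invalid data):", exc)
        conn.commit()
        cursor.close()
        conn.close()
    except Exception as exc:
        print("Database operation failed:", exc)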
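The insert statement above relies on the hotel_mess table having exactly eight columns in the right order, but the table definition is not shown. The sketch below illustrates one possible layout and an insert that names its columns explicitly. To stay self-contained it uses Python's built-in sqlite3 module in place of MySQL, so the placeholder is ? rather than %s; the column names and types are assumptions taken from the keys of mess_dict, not the author's actual schema.

```python
import sqlite3

# Hypothetical column names, mirroring the scraped fields plus the date
COLUMNS = ("name", "price", "dangciText", "score",
           "commentDesc", "commentCount", "locationInfo", "date_time")


def create_table(conn):
    """Create a hypothetical hotel_mess table with one column per scraped field."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hotel_mess ("
        "name TEXT, price INTEGER, dangciText TEXT, score REAL, "
        "commentDesc TEXT, commentCount INTEGER, locationInfo TEXT, date_time TEXT)"
    )


def save_row(conn, row):
    """Insert one 8-field row using named columns and bound parameters."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO hotel_mess ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        conn.execute(sql, row)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    save_row(conn, ("A Hotel", 328, "舒适型", 4.7, "很好", 123, "near the station", "2023-05-20"))
    print(conn.execute("SELECT * FROM hotel_mess").fetchall())
```

Naming the columns in the insert means a later change to the table, such as an added id column, does not silently shift the values.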

The resulting hotel information looks roughly like this:

The complete code is as follows:

Getting the cookie:

from selenium import webdriver
from lxml import html
import time
import re
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
import json
from save_data import save_data_sql
import datetime

option = ChromeOptions()
# Configure the browser so that it does not advertise that it is being automated
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Set the language and encoding
option.add_argument('lang=zh_CN.UTF-8')
browser = webdriver.Chrome(options=option)
browser.get('https://hotel.qunar.com/cn/fuzhou_fujian?fromDate=2023-04-15&toDate=2023-04-16&cityName=%E7%A6%8F%E5%B7%9E')
# Wait 30 seconds so that you can log in by hand
time.sleep(30)
dictCookies = browser.get_cookies()  # get the cookies as a list of dicts
jsonCookies = json.dumps(dictCookies)  # serialize to a string for saving
with open('cookie2.txt', 'w') as f:
    f.write(jsonCookies)
print('Cookies saved successfully!')

Scraping the hotel information:

from selenium import webdriver
from lxml import html
import time
import re
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
import json
from save_data import save_data_sql


# Create the browser driver
def get_driver():
    option = ChromeOptions()
    # Configure the browser so that it does not advertise that it is being automated
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # Set the language and encoding
    option.add_argument('lang=zh_CN.UTF-8')
    browser = webdriver.Chrome(options=option)
    return browser


# Restore the saved login cookies to simulate a logged-in session and get past the login wall
def crack_permissions(browser):
    # Sleep so that a slow page load can finish
    time.sleep(5)
    # Read the cookie file to get the user's login cookies
    with open('cookie2.txt', 'r', encoding='utf8') as f:
        listCookies = json.loads(f.read())
    # Add the cookies to the browser
    for cookie in listCookies:
        cookie_dict = {
            'domain': '.qunar.com',
            'name': cookie.get('name'),
            'value': cookie.get('value'),
            "expires": '',
            'path': '/',
            'httpOnly': False,
            'HostOnly': False,
            'Secure': False
        }
        browser.add_cookie(cookie_dict)
    # Refresh the page so that the cookies take effect
    browser.refresh()
    time.sleep(2)


# Run the scraping task
def start_task(browser):
    for i in range(120):
        # Simulate scrolling to the bottom of the page
        for j in range(1, 4):
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
        # Get the page source
        resp = browser.page_source
        # Load etree for XPath parsing
        etree = html.etree
        xml = etree.HTML(resp)
        # The date the listings apply to
        date_time = '2023-05-20'
        mess_dict = {}
        for k in range(1, 21):
            # name: hotel name
            name = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[1]/a/text()')
            if len(name) > 0:
                mess_dict['name'] = name[0]
            else:
                mess_dict['name'] = ''
            # Hotel price
            price = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[2]/p[1]/a/text()')
            if len(price) > 0:
                try:
                    mess_dict['price'] = int(price[0])
                except ValueError:
                    mess_dict['price'] = 0
            else:
                mess_dict['price'] = 0
            # Tier, for example comfort or upscale
            dangciText = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[1]/span[2]/text()')
            if len(dangciText) > 0:
                mess_dict['dangciText'] = dangciText[0]
            else:
                mess_dict['dangciText'] = ''
            # Hotel score
            score = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[1]/text()')
            if len(score) > 0:
                try:
                    mess_dict['score'] = float(score[0])
                except ValueError:
                    mess_dict['score'] = 0.0
            else:
                mess_dict['score'] = 0.0
            # Overall rating
            commentDesc = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[2]/text()')
            if len(commentDesc) > 0:
                # This element sometimes holds the review count instead of the rating text,
                # so check whether the extracted text is actually a review count
                tmp = re.findall('共(.*?)条评论', commentDesc[0])
                if len(tmp) > 0:
                    mess_dict['commentDesc'] = ''
                else:
                    mess_dict['commentDesc'] = commentDesc[0]
            else:
                mess_dict['commentDesc'] = ''
            # Number of reviews
            commentCount = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[3]/text()')
            if len(commentCount) == 0:
                commentCount = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[2]/span[2]/text()')
            if len(commentCount) > 0:
                tmp = re.findall('共(.*?)条评论', commentCount[0])
                if len(tmp) > 0:
                    try:
                        mess_dict['commentCount'] = int(tmp[0])
                    except ValueError:
                        mess_dict['commentCount'] = 0
                else:
                    mess_dict['commentCount'] = 0
            else:
                mess_dict['commentCount'] = 0
            # Approximate location of the hotel
            locationInfo = xml.xpath(f'//*[@id="hotel_lst_body"]/li[{k}]/div/div[3]/p[3]/text()')
            if len(locationInfo) > 0:
                mess_dict['locationInfo'] = locationInfo[0]
            else:
                mess_dict['locationInfo'] = ''
            # Write the record to the database
            save_data_sql((mess_dict['name'], mess_dict['price'], mess_dict['dangciText'], mess_dict['score'],
                           mess_dict['commentDesc'], mess_dict['commentCount'], mess_dict['locationInfo'], date_time))
            print(mess_dict)
        time.sleep(1)
        # Click the paging control to move to the next page
        browser.find_element(By.XPATH, '//*[@id="root"]/div/section/section[1]/aside[1]/div[7]/p[1]').click()
        time.sleep(1)


# Load the browser driver
browser = get_driver()
# Open the page
browser.get('https://hotel.qunar.com/cn/fuzhou_fujian?fromDate=2023-05-20&toDate=2023-05-21&cityName=%E7%A6%8F%E5%B7%9E')
# Restore the login
crack_permissions(browser=browser)
# Start the task
start_task(browser=browser)

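Summary

The main difficulty in scraping the hotel information is simulating the login. One pitfall is that after every page turn you have to scroll down and wait for the page to finish loading before reading its content. The other is parsing the page: elements are sometimes shifted out of their usual position, and those special cases need special handling.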
