Automating the Browser with Python

1. Install the library

pip install selenium # Install selenium on Windows
pip3 install selenium # Install selenium on Mac

2. Install the browser driver

Chrome browser (pick the driver version that matches your installed Chrome):

http://chromedriver.storage.googleapis.com/index.html?path=103.0.5060.134/

3. Set up the browser engine

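# Setup for the Chrome browser
from selenium import webdriver
# Import the webdriver module from the selenium package
driver = webdriver.Chrome()
# Use Chrome as the engine; this opens a real Chrome window
driver.quit()
# Quit the browser so it does not waste resources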

If you do not want a browser window to pop up on top of your other windows, use the headless setup below.

# Headless mode for a local Chrome browser:
from selenium import webdriver
# Import the webdriver module from the selenium package
from selenium.webdriver.chrome.options import Options
# Import the Options class from the options module
chrome_options = Options()
# Instantiate an Options object
chrome_options.add_argument('--headless') # Run Chrome in headless mode
driver = webdriver.Chrome(options=chrome_options)
# Use Chrome as the engine, running quietly in the background
driver.quit()

4. Getting data with selenium

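The selenium library can also parse and extract data. The underlying idea is the same as in BeautifulSoup (locate an element, then read its content), but some details and syntax differ.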

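from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
# Open the page; url is a placeholder for the address you want to load
label = driver.find_element(By.TAG_NAME, 'label')
# On Selenium 3 this was written find_element_by_tag_name('label')
print(label.text)
driver.quit()
# Quit the browser driver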

Parsing is done automatically by driver; extracting data is a method call on driver.

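Method	Locates by
find_element_by_id	id attribute
find_element_by_class_name	class attribute
find_element_by_name	name attribute
find_element_by_tag_name	tag name of the element
find_element_by_link_text	hyperlink, matched by its full link text
find_element_by_partial_link_text	hyperlink, matched by part of its link text
find_element_by_xpath	XPath: either the element's absolute position, or a path relative to an element that has an id or name attribute (ideally a parent element)
find_element_by_css_selector	CSS selector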

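The extracted data is a WebElement object, <class 'selenium.webdriver.remote.webelement.WebElement'>. It has a method of its own, .get_attribute(), which returns the value of an attribute given the attribute's name.
To extract multiple items, add an s after element (for example find_elements_by_tag_name); the result is a list of WebElement objects.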

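Besides the helper methods above, there are two general-purpose methods (both public, despite sometimes being described as private). The find_element_by_* helpers were deprecated in Selenium 4 and removed in later 4.x releases, so on a current install these two are the ones to use: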

from selenium.webdriver.common.by import By
driver.find_element(By.,'')
driver.find_elements(By.,'')

Attributes of the By class
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
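
The find_element_by_* helpers in the table and the By attributes line up one to one. The following is a minimal sketch of that correspondence in plain Python, so it runs without a browser. LEGACY_TO_BY and to_locator are hypothetical names introduced here for illustration; the strings are the By values listed above.

```python
# Hypothetical lookup table: legacy helper name -> By strategy string.
# The strings are the values listed for the By class above.
LEGACY_TO_BY = {
    "find_element_by_id": "id",
    "find_element_by_class_name": "class name",
    "find_element_by_name": "name",
    "find_element_by_tag_name": "tag name",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_xpath": "xpath",
    "find_element_by_css_selector": "css selector",
}


def to_locator(legacy_name, value):
    """Return the (by, value) pair to pass to find_element / find_elements."""
    # The plural helpers (find_elements_by_*) use the same strategies.
    singular = legacy_name.replace("find_elements_by_", "find_element_by_")
    return LEGACY_TO_BY[singular], value


by, value = to_locator("find_elements_by_tag_name", "label")
print(by, value)
```

With a real driver, the pair would be passed on as driver.find_elements(by, value), which should behave the same as driver.find_elements(By.TAG_NAME, 'label').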

4.1 XPATH

Sample page:

<form id="loginForm"><input name="username" type="text" /><input name="password" type="password" /><input name="continue" type="submit" value="Login" /><input name="continue" type="button" value="Clear" />
</form>

The form element can be located like this:

form = driver.find_element(By.XPATH, "/html/body/form[1]")
# Absolute path (breaks as soon as the page structure changes slightly)
form = driver.find_element(By.XPATH, "//form[1]")
# The first form element in the HTML page
form = driver.find_element(By.XPATH, "//form[@id='loginForm']")
# The form element whose id attribute has the value loginForm
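
To experiment with these three expressions without launching a browser, the sample form can be parsed with Python's standard xml.etree.ElementTree module. This is only a sketch under some assumptions: ElementTree implements a limited subset of XPath and expects paths relative to an element (hence "./" and ".//" instead of a leading "/"), whereas Selenium hands the expression to the browser. The html/body wrapper and the PAGE name are added here for illustration.

```python
import xml.etree.ElementTree as ET

# The sample form from above, wrapped in html/body so the paths resolve.
PAGE = """
<html><body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body></html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree paths are relative to root, hence "./" and ".//".
by_path = root.find("./body/form[1]")          # fixed path from the root
first_form = root.find(".//form[1]")           # first form under its parent
by_id = root.find(".//form[@id='loginForm']")  # form whose id is loginForm

# Check whether all three expressions selected the same element.
print(by_path is first_form and first_form is by_id)

# Attribute predicates work on the input elements as well.
clear_button = root.find(".//input[@type='button']")
print(clear_button.get("value"))
```

The fixed path stops matching as soon as another element is placed between body and form, while the id-based expression keeps working, which is why the attribute form is usually the safer choice.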

4.2 CSS_SELECTOR

Sample element:

<p class="content">Site content goes here.</p>

Find the p element:

driver.find_element(By.CSS_SELECTOR, 'p.content')

5. Operating the browser

.send_keys() # Simulate keyboard input, e.g. to fill in a form automatically
.click() # Click the element
.clear() # Clear the element's content
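
As a rough illustration of how the three methods combine, here is a sketch that fills in and submits the loginForm sample from section 4.1. The function name fill_login_form is hypothetical, and driver is assumed to be a Selenium WebDriver that has already loaded a page containing that form. The locator strategies are written as the plain strings "name" and "xpath", which are the values of By.NAME and By.XPATH.

```python
def fill_login_form(driver, username, password):
    """Type credentials into the sample loginForm and submit it.

    `driver` is assumed to be a Selenium WebDriver that has already
    loaded a page containing the form from section 4.1.
    """
    # "name" and "xpath" are the string values of By.NAME and By.XPATH.
    user_box = driver.find_element("name", "username")
    user_box.clear()              # remove any pre-filled text
    user_box.send_keys(username)  # simulate typing

    pass_box = driver.find_element("name", "password")
    pass_box.clear()
    pass_box.send_keys(password)

    # Two inputs share name="continue", so select the submit one by type.
    submit = driver.find_element(
        "xpath", "//input[@name='continue'][@type='submit']"
    )
    submit.click()
```

Calling clear() before send_keys() is a defensive choice: send_keys() appends to whatever text the field already holds.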

6. Using selenium together with BeautifulSoup

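Besides parsing and extracting data with selenium itself, there is another approach: use selenium to fetch the page, then hand it to BeautifulSoup for parsing and extraction. First get the complete page source with selenium, then let BeautifulSoup parse the HTML string into a BeautifulSoup object, and extract the data from that object.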

import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
# With the browser set up
driver.get(url)
# Wait for the browser to finish loading the data
time.sleep(3)
# Get the HTML as a string
HTML = driver.page_source
bs = BeautifulSoup(HTML, 'html.parser')
label = bs.find('label')
driver.quit()
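
The fixed time.sleep(3) above either waits too long or not long enough, depending on the page. One possible alternative is to poll until the content has actually appeared. The helper below is a generic sketch, not part of selenium; wait_until and its parameters are names chosen here for illustration. Selenium also provides its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # pause briefly before checking again
```

In the script above, time.sleep(3) could then be replaced by something like wait_until(lambda: '<label' in driver.page_source), assuming the page is considered loaded once a label tag shows up in its source.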

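The strength of the selenium library is that it is simple and intuitive. However, because it really does simulate a person operating a browser and has to wait for pages to load, it is relatively slow when scraping large amounts of data.
Finally, here is the Chinese-language documentation for the selenium library.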

https://selenium-python-zh.readthedocs.io/en/latest/locating-elements.html