Python Web Scraping: Using the requests Library

I. Installing the requests library

1. Install directly from the terminal with pip:

pip install requests

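2. Install through PyCharm: open Settings > Project > Python Interpreter, click "+", then search for and install requests.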

II. How requests sends requests over HTTP

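1. The HTTP protocol (using a request to Baidu as the example)
   (1) Request URL:
       https://www.baidu.com/
   (2) Request method:
       GET
   (3) Request headers:
       Cookie: may need attention, depending on the site.
       User-Agent: identifies the client as a browser.
       Note: copy the value from Request Headers in the browser's developer tools, for example:
       Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
       Host: www.baidu.com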
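The three parts listed above (URL, method, headers) can be inspected without sending anything over the network. The following is a minimal sketch, assuming requests is installed; `build_baidu_request` is a hypothetical helper name used only for illustration, and the User-Agent is the one quoted above.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
)


def build_baidu_request():
    """Build, but do not send, a GET request for the Baidu home page."""
    request = requests.Request(
        method="GET",
        url="https://www.baidu.com/",
        headers={"User-Agent": USER_AGENT},
    )
    # prepare() turns the high-level Request into the exact form that would be sent
    return request.prepare()


prepared = build_baidu_request()
print(prepared.method)         # request method
print(prepared.url)            # request URL
print(dict(prepared.headers))  # request headers set so far
```

Preparing a request this way is handy for checking what a scraper is about to send before it touches a real site.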

2. Common ways to call requests

import requests
r = requests.get('https://api.github.com/events')
r = requests.post('http://httpbin.org/post', data = {'key':'value'})
r = requests.put('http://httpbin.org/put', data = {'key':'value'})
r = requests.put('http://httpbin.org/put', data=data)  # the variable data must be defined beforehand
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

3. Example

import requests

response = requests.get(url='https://www.baidu.com/')
response.encoding = 'utf-8'
print(response)  # <Response [200]>
# Response status code
print(response.status_code)  # 200
# Response body as text
# print(response.text)
print(type(response.text))  # <class 'str'>
# Write the fetched content to baidu.html
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

III. GET requests

1. Using request headers (example: visiting Zhihu Explore)

(1) Scraping the page directly fails:

# Visit Zhihu Explore without any custom headers
import requests

response = requests.get(url='https://www.zhihu.com/explore')
print(response.status_code)  # 400
print(response.text)  # returns an error page

(2) Add request headers


import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
response = requests.get(url='https://www.zhihu.com/explore', headers=headers)
print(response.status_code)
print(response.text)
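One plausible reason the bare request is rejected is the User-Agent that requests sends by default, which announces the library rather than a browser. The sketch below compares the default with the custom header without sending anything; `effective_user_agent` is a hypothetical helper name, and whether Zhihu actually filters on this header is an assumption.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
)


def effective_user_agent(custom_headers=None):
    """Return the User-Agent a Session would send, without sending the request."""
    with requests.Session() as session:
        request = requests.Request(
            "GET", "https://www.zhihu.com/explore", headers=custom_headers
        )
        # prepare_request() merges the session's default headers with ours
        return session.prepare_request(request).headers["User-Agent"]


default_ua = effective_user_agent()
custom_ua = effective_user_agent({"user-agent": BROWSER_UA})
print(default_ua)  # starts with "python-requests/"
print(custom_ua)   # the browser string supplied above
```

Header names are matched case-insensitively, so `user-agent` replaces the default `User-Agent`.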

2. The params request parameter

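(1) On some sites the URL gets very long and contains a long run of unreadable percent-encoded characters. In that case, pass the query parameters through params instead of writing them into the URL by hand.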

import requests
from urllib.parse import urlencode

# Example: searching Baidu for "赵丽颖"
# url = 'https://www.baidu.com/s?wd=%E8%B5%B5%E4%B8%BD%E9%A2%96'
'''
Method 1: encode the query string yourself
url = 'https://www.baidu.com/s?' + urlencode({"wd": "赵丽颖"})
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
response = requests.get(url, headers=headers)
'''
# Method 2: let requests build the query string
url = 'https://www.baidu.com/s?'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
# Pass the query parameters to get() through params
response = requests.get(url, headers=headers, params={"wd": "赵丽颖"})
# response.url is the final URL; unless Baidu redirects, it ends with ?wd=%E8%B5%B5%E4%B8%BD%E9%A2%96
print(response.url)
# print(response.text)
with open('xukun.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
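To see that the two methods produce the same URL, the request can be prepared without being sent. This is a small sketch; `build_search_url` is a hypothetical helper name, and the endpoint is the Baidu search URL used above.

```python
from urllib.parse import urlencode

import requests


def build_search_url(keyword):
    """Let requests encode the query string, without sending the request."""
    request = requests.Request(
        "GET", "https://www.baidu.com/s", params={"wd": keyword}
    )
    return request.prepare().url


# Method 1: encode by hand with urlencode
manual_url = "https://www.baidu.com/s?" + urlencode({"wd": "赵丽颖"})
# Method 2: hand the dictionary to requests
built_url = build_search_url("赵丽颖")

print(manual_url)
print(built_url)
print(manual_url == built_url)
```

Passing params keeps the code readable and leaves the percent-encoding to the library.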

3. Using the cookies parameter

(1) Carry login cookies to get past GitHub's login check

Request URL:
    https://github.com/settings/emails
Request method:
    GET
Request headers:
    User-Agent
    Cookie: has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60

Method 1: put the cookies in the request headers

import requests

# Request URL
url = 'https://github.com/settings/emails'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    # Put the cookies directly in the request headers
    'Cookie': 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
}
github_res = requests.get(url, headers=headers)

Method 2: pass the cookies as an argument to get()

import requests

url = 'https://github.com/settings/emails'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
# The cookies argument expects one entry per cookie, so split the raw Cookie string into name/value pairs
cookie_string = 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
cookies = dict(pair.split('=', 1) for pair in cookie_string.split('; '))
github_res = requests.get(url, headers=headers, cookies=cookies)
# True when a string that only shows up for the logged-in account is present in the page
print('15622792660' in github_res.text)
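The cookies argument takes a mapping with one entry per cookie, while browsers show cookies as a single `name=value; name=value` string. The sketch below converts one into the other and checks the resulting header without sending anything; `parse_cookie_header` is a hypothetical helper name and the sample string is a shortened version of the cookie above.

```python
import requests


def parse_cookie_header(cookie_header):
    """Split a raw 'a=1; b=2' Cookie string into a {name: value} dictionary."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # partition() splits on the first '=' only, so values may contain '='
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


sample = "logged_in=yes; dotcom_user=TankJam; tz=Asia%2FShanghai"
cookies = parse_cookie_header(sample)

# Prepare (do not send) a request to see the Cookie header requests would build
prepared = requests.Request(
    "GET", "https://github.com/settings/emails", cookies=cookies
).prepare()

print(cookies)
print(prepared.headers.get("Cookie"))
```

Passing a dictionary such as `{'Cookie': '<whole string>'}` would instead create a single cookie literally named `Cookie`, which is unlikely to be what the server expects.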

IV. POST requests

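1. GET and POST at a glance
   (1) GET requests (GET is the default HTTP request method):
       * Normally carry no request body
       * The data travels in the URL, so its size is limited; HTTP itself sets no limit, but browsers and servers cap URL length, often at a few KB
       * The request data is visible in the browser's address bar

   (2) Common operations that issue GET requests:
       1. Entering a URL directly in the browser's address bar always sends a GET request
       2. Clicking a hyperlink on a page also sends a GET request
       3. Submitting a form uses GET by default, but the form can be set to POST

   (3) POST requests:
       (1). The data does not appear in the address bar
       (2). The protocol puts no upper limit on the data size (servers may impose their own)
       (3). There is a request body
       (4). Chinese characters in a form-encoded request body are URL-encoded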

Important: requests.post() is called the same way as requests.get(). The difference in practice is its data parameter, which carries the request body.
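The difference between the two methods shows up clearly when the same dictionary is sent once as params on a GET and once as data on a POST. The sketch below only prepares the requests; `prepare` is a hypothetical helper name, and the httpbin.org URLs are the ones used earlier in this article.

```python
import requests


def prepare(method, url, **kwargs):
    """Build the request exactly as it would be sent, without sending it."""
    return requests.Request(method, url, **kwargs).prepare()


get_request = prepare("GET", "http://httpbin.org/get", params={"name": "中文"})
post_request = prepare("POST", "http://httpbin.org/post", data={"name": "中文"})

# GET: the data ends up in the URL and there is no body
print(get_request.url)
print(get_request.body)

# POST: the URL stays clean and the data is URL-encoded into the body
print(post_request.url)
print(post_request.body)
print(post_request.headers["Content-Type"])
```

A dictionary passed as data is sent as a form (`application/x-www-form-urlencoded`), which is why the Chinese characters appear percent-encoded in the body.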

2. Logging in to GitHub automatically with POST

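To analyze a login flow, deliberately enter a wrong username or password in the login form and capture the traffic. If you enter the correct credentials, the browser redirects right away and the login request is gone before you can inspect it; you could wear yourself out and still never find that packet.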

'''
Log in to GitHub automatically with POST requests (cookies handled manually).
GitHub's anti-scraping checks:
    1. The login POST to /session must carry the cookies returned by the login page.
    2. The emails page must carry the cookies returned by the /session request.
'''
import requests
import re

# Step 1: visit the login page to obtain authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
login_res = requests.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
# print(authenticity_token)
login_cookies = login_res.cookies.get_dict()

# Step 2: send a POST request to /session with the token in the request body
session_url = 'https://github.com/session'
session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "username",
    "password": "githubpassword",
    'webauthn-support': "supported"
}

# Step 3: send the login request, then test whether the login worked
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    # allow_redirects=False
)
session_cookies = session_res.cookies.get_dict()
url3 = 'https://github.com/settings/emails'
email_res = requests.get(url3, cookies=session_cookies)
# '账号' ("account") is a placeholder; search for a string that only appears when logged in
print('账号' in email_res.text)
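The manual cookie passing above can also be left to `requests.Session`, which stores cookies from each response and sends them on later requests. The following is only a sketch under the same assumptions as the code above (that the login page still exposes `authenticity_token` and accepts these form fields); `extract_token` and `login_with_session` are hypothetical helper names, and the login function is defined but not called here.

```python
import re

import requests

TOKEN_PATTERN = re.compile(r'name="authenticity_token" value="(.*?)"', re.S)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)


def extract_token(html):
    """Pull the hidden authenticity_token value out of the login page HTML."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("authenticity_token not found in page")
    return match.group(1)


def login_with_session(username, password):
    """Same three steps as above, with the Session carrying the cookies."""
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    login_page = session.get("https://github.com/login", timeout=10)
    form_data = {
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": extract_token(login_page.text),
        "login": username,
        "password": password,
        "webauthn-support": "supported",
    }
    session.post(
        "https://github.com/session",
        data=form_data,
        headers={"Referer": "https://github.com/login"},
        timeout=10,
    )
    return session  # later session.get(...) calls reuse the stored cookies


# Offline check of the token extraction on a made-up HTML fragment
sample_html = '<input type="hidden" name="authenticity_token" value="abc123" />'
print(extract_token(sample_html))
```

With a session, the `cookies=` arguments from the manual version are no longer needed, since `session.get('https://github.com/settings/emails')` would send whatever cookies the login responses set.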

V. The response object

1. Response attributes

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.github.com', headers=headers)

# The response object
print(response.status_code)  # response status code
print(response.url)  # final URL of the response
print(response.text)  # body decoded as text
print(response.content)  # body as raw bytes
print(response.headers)  # response headers
print(response.history)  # list of redirect responses that led to this one
print(response.cookies)  # cookies set by the response
print(response.cookies.get_dict())  # cookies as a dictionary
print(response.cookies.items())  # cookies as a list of (name, value) tuples
print(response.encoding)  # character encoding used to decode text
print(response.elapsed)  # time between sending the request and receiving the response
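To try these attributes without depending on an external site, the sketch below starts a throwaway HTTP server on localhost that redirects `/old` to `/new`. The server, its paths, and the names `DemoHandler` and `fetch_demo` are all made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old redirects to /new, everything else returns a page."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = "你好, requests".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_demo():
    """Request /old from the local server and return the final response."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings from the environment
            url = f"http://127.0.0.1:{server.server_port}/old"
            return session.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()


response = fetch_demo()
print(response.status_code)  # status of the final response
print(response.url)          # URL after following the redirect
print(response.history)      # the redirect response(s) that came first
print(response.encoding)     # taken from the Content-Type header
print(response.text)         # body decoded with that encoding
print(response.elapsed)      # a datetime.timedelta
```

Note that `history` holds response objects rather than plain URLs, so each entry has its own `status_code`, `url`, and `headers`.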

VI. Advanced requests usage

1. Timeouts

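# Timeout settings
# timeout accepts a float or a tuple
# timeout=0.1         # a single value applies to both the connect timeout and the read timeout
# timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 is the read (data receiving) timeout
import requests

# A timeout this small is expected to raise a timeout exception
response = requests.get('https://www.baidu.com', timeout=0.0001)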
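When a timeout is exceeded, requests raises an exception instead of returning a response, so scraping code usually wraps the call in try/except. The sketch below demonstrates the tuple form against a deliberately slow local server; the server, its 0.5 second delay, and the names `SlowHandler` and `fetch_with_timeout` are made up for illustration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Local server that waits 0.5 s before answering."""

    def do_GET(self):
        time.sleep(0.5)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client already gave up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_timeout(timeout):
    """Return 'ok', 'connect timeout' or 'read timeout' for the given timeout."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings from the environment
            session.get(url, timeout=timeout)
        return "ok"
    except requests.exceptions.ConnectTimeout:
        return "connect timeout"
    except requests.exceptions.ReadTimeout:
        return "read timeout"
    finally:
        server.shutdown()
        server.server_close()


# (connect timeout, read timeout): the server answers after 0.5 s
fast_result = fetch_with_timeout((2, 0.1))  # read timeout shorter than the delay
slow_result = fetch_with_timeout((2, 2))    # read timeout longer than the delay
print(fast_result)
print(slow_result)
```

Both exception classes derive from `requests.exceptions.Timeout`, so a single `except requests.exceptions.Timeout` is enough when the distinction does not matter.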

2. Proxies

import random
import requests

# Format: {"scheme": "scheme://user:password@IP:port"}; the key must be the lowercase URL scheme
proxies = random.choice([
    {"http": "http://309435:szayclhp@114.67.228.116:16819"},
])
res = requests.get("http://www.hngp.gov.cn/kaifeng/ggcx?appCode=H62&pageSize=15&soCode=3a4137d48249496f9ffd731d474e84f0&pageNo=2", proxies=proxies, verify=False)
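The key of the proxies dictionary is matched against the scheme of the target URL, and that match is case-sensitive. The sketch below uses `requests.utils.select_proxy`, the helper requests itself calls when choosing a proxy, to show which entry would be picked; the proxy address `127.0.0.1:8080` is a hypothetical placeholder.

```python
from requests.utils import select_proxy

TARGET_URL = "http://www.hngp.gov.cn/kaifeng/ggcx"
PROXY_URL = "http://127.0.0.1:8080"  # hypothetical local proxy, for illustration only

# Lowercase scheme key: matches the "http" scheme of the target URL
lowercase_match = select_proxy(TARGET_URL, {"http": PROXY_URL})
# Uppercase key: no match, so the request would go out without a proxy
uppercase_match = select_proxy(TARGET_URL, {"HTTP": PROXY_URL})

print(lowercase_match)
print(uppercase_match)
```

A dictionary keyed with `"HTTP"` does not raise an error, which makes the mistake easy to miss: the request simply bypasses the proxy.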