urllib, urllib3, and requests in Python 3

urllib

Py2.x: the urllib and urllib2 libraries
Py3.x: the urllib library

Changes:

- `import urllib2` in Python 2.x becomes `import urllib.request` and `import urllib.error` in Python 3.x.
- `import urllib` in Python 2.x becomes `import urllib.request`, `import urllib.error`, and `import urllib.parse` in Python 3.x.
- `import urlparse` in Python 2.x becomes `import urllib.parse` in Python 3.x.
- `urllib2.urlopen` in Python 2.x becomes `urllib.request.urlopen` in Python 3.x.
- `urllib.urlencode` in Python 2.x becomes `urllib.parse.urlencode` in Python 3.x.
- `urllib.quote` in Python 2.x becomes `urllib.parse.quote` in Python 3.x.
- `cookielib.CookieJar` in Python 2.x becomes `http.cookiejar.CookieJar` in Python 3.x.
- `urllib2.Request` in Python 2.x becomes `urllib.request.Request` in Python 3.x.
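As a quick check of the mapping above, here is a small sketch of the Python 3 import locations. The sample strings passed to the functions are only illustrative.

```python
# Python 3 locations of the names listed above.
from urllib.request import urlopen, Request
from urllib.parse import urlencode, quote, urlparse
from urllib.error import HTTPError, URLError
from http.cookiejar import CookieJar

# None of these calls touch the network.
quoted = quote('hello world')
query = urlencode({'kd': 'Python', 'pn': 1})
host = urlparse('http://python.org/about/?x=1').netloc

print(quoted, query, host)
```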

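1. Basic usage

`urllib.request.urlopen`(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

- url: the URL to open

- data: the data to submit via POST

- timeout: the timeout for the request, in seconds

The cafile, capath, and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13; use context instead.

Fetch a page directly with urlopen() from the urllib.request module. The data returned by read() is of type bytes and must be decoded with decode() to obtain a str.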

from urllib import request
response = request.urlopen(r'http://python.org/')  # returns an http.client.HTTPResponse object
page = response.read()
page = page.decode('utf-8')

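The object returned by urlopen() provides these methods:

- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data

- info(): returns an HTTPMessage object holding the headers sent back by the remote server

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found

- geturl(): returns the URL of the resource actually retrieved, which differs from the requested URL if a redirect was followed

Since Python 3.9, info(), getcode(), and geturl() are deprecated in favor of the headers, status, and url attributes.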
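The methods above can be tried without any network access. The sketch below assumes a `data:` URL (which urlopen() also understands) in place of a real website, so the payload and content type are made up for illustration.

```python
from urllib import request

# A data: URL embeds its own payload, so no network access is needed here.
url = 'data:text/plain;charset=utf-8,hello%20urllib'

with request.urlopen(url) as response:
    raw = response.read()                       # bytes
    text = raw.decode('utf-8')                  # str
    headers = response.info()                   # message object holding the headers
    content_type = headers.get_content_type()
    final_url = response.geturl()

print(type(raw).__name__, text, content_type)
```

With an `http://` URL the same calls work on an HTTPResponse, and the status code is available as well.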

2. Using Request

`urllib.request.Request`(url, data=None, headers={}, method=None)

Wrap the request in a Request object, then pass it to urlopen() to fetch the page.

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')

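Header fields used to dress up the request:

- User-Agent: carries the browser name and version, the operating system name and version, and the default language

- Referer: can be used for hotlink protection; some sites show the image source as http://***.com, which is determined by checking the Referer

- Connection: controls whether the network connection stays open after the current request completes; keep-alive asks for a persistent connection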

3. POST data

`urllib.request.urlopen`(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request as a POST.

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {'first': 'true', 'pn': 1, 'kd': 'Python'}
data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')

`urllib.parse.urlencode`(query, doseq=False, safe='', encoding=None, errors=None)

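urlencode() converts a dict (or a sequence of two-element tuples) into a percent-encoded query string.

data = {'first': 'true', 'pn': 1, 'kd': 'Python'}
data = parse.urlencode(data).encode('utf-8')

After urlencode(), the data becomes first=true&pn=1&kd=Python. In a POST request this string is sent in the request body; appended to the URL as a query string, it would read:

http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python

POST data must be bytes or an iterable of bytes, not str, which is why the encode() call is needed.

The data can also be passed directly as an argument to urlopen():

page = request.urlopen(req, data=data).read()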
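To make the GET/POST distinction concrete, here is a minimal sketch that only builds Request objects and inspects them; nothing is sent. The variable names `form`, `get_req`, and `post_req` are illustrative.

```python
from urllib import parse, request

form = {'first': 'true', 'pn': 1, 'kd': 'Python'}

query = parse.urlencode(form)      # str: key=value pairs joined by '&'
body = query.encode('utf-8')       # bytes, as required for POST data

url = 'http://www.lagou.com/jobs/positionAjax.json'
get_req = request.Request(url + '?' + query)   # data travels in the URL
post_req = request.Request(url, data=body)     # data travels in the request body

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Supplying data is what switches the method from GET to POST; the URL itself is left unchanged.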

4. Exception handling

from urllib import request, parse, error

def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {'first': 'true', 'pn': 1, 'kd': 'Python'}
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    page = None  # stays None if the request fails
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)  # code is an attribute, not a method
        print(e.read().decode('utf-8'))
    return page
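Besides HTTPError, a request can also fail before any HTTP response arrives, in which case urllib raises URLError. The sketch below is one possible way to handle both. The `open_func` parameter and the two `fake_*` functions are hypothetical helpers, added only so the handlers can be exercised without network access.

```python
import io
from email.message import Message
from urllib import error, request

def fetch(url, open_func=request.urlopen, timeout=5):
    """Return (status, text); status is None when no HTTP response arrived.

    open_func is a hypothetical hook so the handlers can be exercised offline.
    """
    try:
        with open_func(url, timeout=timeout) as response:
            return response.status, response.read().decode('utf-8')
    except error.HTTPError as e:
        # Must come before URLError: HTTPError is a subclass of it.
        return e.code, e.read().decode('utf-8')
    except error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None, str(e.reason)

# Stand-ins for urlopen that fail the way a real request might.
def fake_not_found(url, timeout=None):
    raise error.HTTPError(url, 404, 'Not Found', Message(), io.BytesIO(b'page missing'))

def fake_unreachable(url, timeout=None):
    raise error.URLError('connection refused')

not_found = fetch('http://example.invalid/a', open_func=fake_not_found)
unreachable = fetch('http://example.invalid/b', open_func=fake_unreachable)
print(not_found, unreachable)
```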

5. Using a proxy

`urllib.request.ProxyHandler`(proxies=None)

When the site you want to scrape restricts access, you can fetch the data through a proxy.

def get_page(url):
    data = {'first': 'true', 'pn': 1, 'kd': 'Python'}
    proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set up the proxy
    opener = request.build_opener(proxy)  # build an opener that uses the proxy
    request.install_opener(opener)  # install the opener globally
    data = parse.urlencode(data).encode('utf-8')
    page = opener.open(url, data).read()
    page = page.decode('utf-8')
    return page
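The opener can be built and inspected without sending anything. In this sketch the proxy address is a made-up local one, chosen only for illustration.

```python
from urllib import request

# Hypothetical proxy address, for illustration only.
proxies = {'http': 'http://127.0.0.1:8080'}

proxy_handler = request.ProxyHandler(proxies)
opener = request.build_opener(proxy_handler)

# opener.open(url) would route http:// requests through the proxy.
# request.install_opener(opener) is only needed if plain urlopen()
# calls elsewhere in the program should use the proxy too.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```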

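6. Using cookies

urllib.request.HTTPCookieProcessor()

The pages being crawled may require login information. Every web page is accessed over HTTP, and HTTP is a stateless protocol, meaning it cannot by itself maintain state between requests. Cookies fill this gap: the server sets them at login and the client sends them back on later requests.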


import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar

url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=La2A2'
data = {
    'username': 'zhanghao',
    'password': 'mima',
}
postdata = urllib.parse.urlencode(data).encode('utf8')
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
req = urllib.request.Request(url, postdata, headers=header)
# Create a CookieJar object with http.cookiejar.CookieJar()
cjar = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor and build an opener from it
cookie = urllib.request.HTTPCookieProcessor(cjar)
opener = urllib.request.build_opener(cookie)
# Install the opener globally
urllib.request.install_opener(opener)
# Send the login request; cookies set by the server are stored in cjar
# and sent back automatically on later requests through this opener
response = urllib.request.urlopen(req)

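Using urllib3:

Making requests:

First, import the urllib3 module.

Then create a PoolManager instance to make requests. This object handles all the details of connection pooling and thread safety, so you don't have to.

Use the request() method to make a request.

request() returns an HTTPResponse object.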
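The steps above might look like the following sketch. The helper name `fetch` is an illustrative choice; the function is defined but not called here, since calling it needs network access.

```python
import urllib3

# One PoolManager is typically shared across the whole program.
http = urllib3.PoolManager()

def fetch(url):
    """GET url and return (status, body bytes, response headers)."""
    response = http.request('GET', url)
    return response.status, response.data, response.headers
```

For example, `status, body, headers = fetch('http://httpbin.org/robots.txt')` would return the status code, the raw body as bytes, and the response headers.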

request() can also attach additional information to a request, as described in the sections below.

Request data can include the following.

Headers:

To set request headers, pass a dictionary as the headers argument of request().

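Query parameters:

For GET, HEAD, and DELETE requests, simply pass the parameters as a dictionary in the fields argument.

For POST and PUT requests, you need to encode the parameters yourself and append them to the URL.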
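A minimal sketch of both cases follows. The helper names (`get_with_params`, `post_with_params`, `build_url`) are illustrative, and only `build_url` runs without network access.

```python
from urllib.parse import urlencode
import urllib3

http = urllib3.PoolManager()

def build_url(url, params):
    """Append params to url as a percent-encoded query string."""
    return url + '?' + urlencode(params)

def get_with_params(url, params):
    # GET/HEAD/DELETE: urllib3 encodes `fields` into the query string.
    return http.request('GET', url, fields=params)

def post_with_params(url, params):
    # POST/PUT: build the query string by hand and append it to the URL.
    return http.request('POST', build_url(url, params))

print(build_url('http://httpbin.org/post', {'arg': 'value'}))
```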

Form data:

For PUT and POST requests, urllib3 automatically encodes the dictionary passed as the fields argument into the request body as form data.

JSON:

To send JSON, pass the encoded JSON data as the body argument and set the Content-Type header in the headers argument.
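As a sketch of this pattern, the helpers below (`encode_json` and `post_json`, both illustrative names) serialize a dict and send it as the request body. Only the encoding step is executed here.

```python
import json
import urllib3

http = urllib3.PoolManager()

def encode_json(payload):
    """Serialize payload to UTF-8 JSON bytes suitable for a request body."""
    return json.dumps(payload).encode('utf-8')

def post_json(url, payload):
    return http.request(
        'POST',
        url,
        body=encode_json(payload),
        headers={'Content-Type': 'application/json'},
    )

print(encode_json({'attribute': 'value'}))
```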

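Files & binary data:

To upload a file using multipart/form-data encoding, use the same approach as for form data and specify the file field as a tuple of (file_name, file_data).

Specifying a filename is not strictly required, but it is recommended so that the request looks more like one sent by a browser. You can also add a third item to the tuple to specify the file's MIME type.

To send raw binary data, simply pass it as the body argument. Setting the Content-Type header is also recommended.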
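The sketch below covers both form fields and file uploads. The field name, filename, and file contents are made up. To show what ends up in the body without sending anything, it calls `urllib3.encode_multipart_formdata` directly, the same encoder urllib3 applies to `fields` by default.

```python
import urllib3

http = urllib3.PoolManager()

file_data = b'hello, urllib3'
# (filename, data, MIME type); the third item is optional.
fields = {'filefield': ('example.txt', file_data, 'text/plain')}

# Preview of a multipart body built from `fields`; nothing is sent.
body, content_type = urllib3.encode_multipart_formdata(fields)

def upload(url):
    # urllib3 encodes `fields` into a multipart/form-data body.
    return http.request('POST', url, fields=fields)

def upload_raw(url, data, mime='application/octet-stream'):
    # Raw bytes go in `body`; Content-Type is set by hand.
    return http.request('POST', url, body=data, headers={'Content-Type': mime})

print(content_type)
```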

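Timeout:

Timeouts let you control how long a request may run. For simple cases, set the timeout argument to a float.

For finer control, use a Timeout instance, which lets you set the connect timeout and the read timeout separately.

To apply the same timeout to every request, specify timeout at the PoolManager level, either as a float or as a Timeout instance.

A timeout specified on an individual request overrides the PoolManager-level timeout.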
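A sketch of the three levels of timeout control; the numeric values and the helper name `fetch_quickly` are illustrative.

```python
import urllib3

# Separate connect and read timeouts, in seconds (values are illustrative).
timeout = urllib3.Timeout(connect=1.0, read=2.0)

# Pool-wide default: every request through `http` uses this timeout...
http = urllib3.PoolManager(timeout=timeout)

def fetch_quickly(url):
    # ...unless a request passes its own value, which takes precedence.
    return http.request('GET', url, timeout=4.0)

print(timeout.connect_timeout, timeout.read_timeout)
```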

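Retrying requests:

urllib3 can automatically retry idempotent requests, using the same mechanism that handles redirects. Retries are controlled with the retries argument. By default, urllib3 retries a request 3 times and follows up to 3 redirects.

To change the number of retries, set retries to an integer.

To disable all retry and redirect logic, set retries to False.

To disable redirects but keep the retry logic, set redirect to False.

For finer control, use a Retry instance, which lets you customize how requests are retried.

For example, to retry 3 times but follow only 2 redirects, pass a Retry instance configured with those limits.

To apply the same retry policy to every request, specify retries at the PoolManager level, either as an integer or as a Retry instance.

A retries value specified on an individual request overrides the PoolManager-level value.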
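The retry options above could be sketched as follows. The helper function names are illustrative, and the functions are only defined, not called.

```python
import urllib3

# Up to 3 retries in total, of which at most 2 may be redirects.
retry = urllib3.Retry(3, redirect=2)

# Pool-wide default retry policy.
http = urllib3.PoolManager(retries=retry)

def fetch_no_retry(url):
    # retries=False disables both retries and redirect following.
    return http.request('GET', url, retries=False)

def fetch_no_redirect(url):
    # Keep retrying on errors, but do not follow redirects.
    return http.request('GET', url, redirect=False)

def fetch_more_retries(url):
    # An integer overrides the pool-wide policy for this request.
    return http.request('GET', url, retries=10)

print(retry.total, retry.redirect)
```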

requests

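Requests is an HTTP library written in Python and released under the Apache2 license. It is more concise than the urllib2 module. Requests supports HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data. It is a high-level wrapper over Python's built-in modules that makes network requests far more user-friendly; with Requests you can easily do just about anything a browser can. It is modern, internationalized, and friendly. Within a session, Requests keeps connections alive automatically (keep-alive).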

import json
import requests

# HTTP request types
# GET
r = requests.get('https://github.com/timeline.json')
# POST
r = requests.post("http://m.ctrip.com/post")
# PUT
r = requests.put("http://m.ctrip.com/put")
# DELETE
r = requests.delete("http://m.ctrip.com/delete")
# HEAD
r = requests.head("http://m.ctrip.com/head")
# OPTIONS
r = requests.options("http://m.ctrip.com/get")

# Getting the response content
print(r.content)  # the body as raw bytes
print(r.text)  # the body as decoded text

# Passing URL parameters
payload = {'keyword': '香港', 'salecityid': '2'}
r = requests.get("http://m.ctrip.com/webapp/tourvisa/visa_list", params=payload)
print(r.url)  # e.g. http://m.ctrip.com/webapp/tourvisa/visa_list?keyword=%E9%A6%99%E6%B8%AF&salecityid=2

# Getting/changing the response encoding
r = requests.get('https://github.com/timeline.json')
print(r.encoding)

# JSON handling
r = requests.get('https://github.com/timeline.json')
print(r.json())  # uses the built-in JSON decoder; no separate import json is needed

# Custom request headers
url = 'http://m.ctrip.com'
headers = {'User-Agent' : 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
r = requests.post(url, headers=headers)
print(r.request.headers)

# More complex POST requests
url = 'http://m.ctrip.com'
payload = {'some': 'data'}
# To send a JSON string rather than form-encoded data, serialize the dict with json.dumps() first
r = requests.post(url, data=json.dumps(payload))

# POSTing a multipart-encoded file
url = 'http://m.ctrip.com'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)

# Response status code
r = requests.get('http://m.ctrip.com')
print(r.status_code)

# Response headers
r = requests.get('http://m.ctrip.com')
print(r.headers)
print(r.headers['Content-Type'])
print(r.headers.get('content-type'))  # two ways to access individual response headers

# Cookies
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']  # read cookies

url = 'http://m.ctrip.com/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)  # send cookies

# Setting a timeout
r = requests.get('http://m.ctrip.com', timeout=0.001)

# Using proxies
proxies = {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.100:4444"}
r = requests.get('http://m.ctrip.com', proxies=proxies)

# If the proxy requires a username and password:
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}

JSON request:

#! /usr/bin/python3
import requests
import json

class url_request():
    def __init__(self):
        ''' init '''

if __name__ == '__main__':
    headers = {'Content-Type': 'application/json'}
    payload = {
        'CountryName': '中国',
        'ProvinceName': '四川省',
        'L1CityName': 'chengdu',
        'L2CityName': 'yibing',
        'TownName': '',
        'Longitude': '107.33393',
        'Latitude': '33.157131',
        'Language': 'CN'
    }
    # Serialize the payload so that the body matches the JSON Content-Type
    r = requests.post("http://www.xxxxxx.com/CityLocation/json/LBSLocateCity", headers=headers, data=json.dumps(payload))
    data = r.json()
    if r.status_code != 200:
        print('LBSLocateCity API Error' + str(r.status_code))
    print(data['CityEntities'][0]['CityID'])  # print the value of a key in the returned JSON
    print(data['ResponseStatus']['Ack'])
    # Pretty-print the JSON; ensure_ascii must be False, otherwise Chinese characters are shown as \u escapes
    print(json.dumps(data, indent=4, sort_keys=True, ensure_ascii=False))

XML request:

#! /usr/bin/python3
import requests

class url_request():
    def __init__(self):
        """init"""

if __name__ == '__main__':
    headers = {'Content-type': 'text/xml'}
    XML = '<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><Request xmlns="http://tempuri.org/"><jme><JobClassFullName>WeChatJSTicket.JobWS.Job.JobRefreshTicket,WeChatJSTicket.JobWS</JobClassFullName><Action>RUN</Action><Param>1</Param><HostIP>127.0.0.1</HostIP><JobInfo>1</JobInfo><NeedParallel>false</NeedParallel></jme></Request></soap:Body></soap:Envelope>'
    url = 'http://jobws.push.mobile.xxxxxxxx.com/RefreshWeiXInTokenJob/RefreshService.asmx'
    r = requests.post(url=url, headers=headers, data=XML)
    data = r.text
    print(data)

Handling error status codes

import requests

URL = 'http://ip.taobao.com/service/getIpInfo.php'  # Taobao IP address library API
try:
    r = requests.get(URL, params={'ip': '8.8.8.8'}, timeout=1)
    r.raise_for_status()  # raise an exception if the response status is a 4xx or 5xx error
except requests.RequestException as e:
    print(e)
else:
    result = r.json()
    print(type(result), result, sep='\n')
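Which status codes actually trigger raise_for_status() can be checked offline. The sketch below builds Response objects by hand instead of contacting a server; the helper name `status_raises` is illustrative.

```python
import requests

def status_raises(status_code):
    """Return True if raise_for_status() raises for this status code."""
    response = requests.Response()      # built by hand, no network involved
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return True
    return False

results = {code: status_raises(code) for code in (200, 301, 404, 500)}
print(results)
```

Only the 4xx and 5xx codes raise; a redirect status such as 301 does not.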

Uploading files

The requests module can also upload files; the file type is handled automatically:

import requests

url = 'http://127.0.0.1:8080/upload'
files = {'file': open('/home/rxf/test.jpg', 'rb')}
#files = {'file': ('report.jpg', open('/home/lyb/sjzl.mpg', 'rb'))}     # set the filename explicitly
r = requests.post(url, files=files)
print(r.text)

Even more conveniently, requests can upload a string as if it were a file:

import requests

url = 'http://127.0.0.1:8080/upload'
files = {'file': ('test.txt', b'Hello Requests.')}     # the filename must be set explicitly
r = requests.post(url, files=files)
print(r.text)
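To see what such an upload looks like on the wire without running a server, one option is to prepare the request instead of sending it. This sketch uses `requests.Request(...).prepare()`, which builds the headers and body locally.

```python
import requests

files = {'file': ('test.txt', b'Hello Requests.')}
req = requests.Request('POST', 'http://127.0.0.1:8080/upload', files=files)
prepared = req.prepare()   # builds headers and body without sending anything

content_type = prepared.headers['Content-Type']
print(content_type)
print(len(prepared.body), 'bytes in the multipart body')
```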

Basic authentication (HTTP Basic Auth)

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'passwd'))
# r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=('user', 'passwd'))    # shorthand
print(r.json())

Another very popular form of HTTP authentication is Digest Authentication, which Requests also supports out of the box:

from requests.auth import HTTPDigestAuth
requests.get(URL, auth=HTTPDigestAuth('user', 'pass'))

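Cookies and session objects

If a response contains cookies, you can access them quickly:

import requests

r = requests.get('http://www.google.com.hk/')
print(r.cookies['NID'])
print(tuple(r.cookies))

To send your own cookies to the server, use the cookies parameter:

import requests

url = 'http://httpbin.org/cookies'
cookies = {'testCookies_1': 'Hello_Python3', 'testCookies_2': 'Hello_Requests'}
# Cookie Version 0 specifies that special characters such as spaces, square brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, @, colons, and semicolons cannot be used in cookie content.
r = requests.get(url, cookies=cookies)
print(r.json())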

Session objects let you persist certain parameters across requests. Most conveniently, cookies are kept across all requests made from the same Session instance, and all of this is handled automatically.

import requests

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}
s = requests.Session()
s.headers.update(headers)
# s.auth = ('superuser', '123')
s.get('https://www.kuaipan.cn/account_login.htm')
_URL = 'http://www.kuaipan.cn/index.php'
s.post(_URL, params={'ac':'account', 'op':'login'},data={'username':'****@foxmail.com', 'userpwd':'********', 'isajax':'yes'})
r = s.get(_URL, params={'ac':'zone', 'op':'taskdetail'})
print(r.json())
s.get(_URL, params={'ac':'common', 'op':'usersign'})

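By default, requests does not retry a failed request. When running test cases, however, failures caused by network problems, service restarts, or other external factors are hard to avoid. To retry on failure, mount an HTTPAdapter with max_retries set on the Session instance.

request_retry = requests.adapters.HTTPAdapter(max_retries=3)
self.session.mount('https://', request_retry)
self.session.mount('http://', request_retry)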
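For finer control than an integer count, HTTPAdapter also accepts a urllib3 Retry object as max_retries. The sketch below is one possible configuration; the backoff factor and the list of status codes are illustrative choices, not recommendations from the original text.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative policy: 3 retries with exponential backoff, also retrying
# on a few server-side status codes.
retry_policy = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503, 504])

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
session.mount('https://', adapter)
session.mount('http://', adapter)

# Every request made through `session` now uses the mounted adapter.
print(session.get_adapter('https://example.com/') is adapter)
```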

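1: To keep cookies between requests, we can do this:

import requests
self.session = requests.Session()
self.session.get(login_url)  # the login state is preserved

2: Requests usually carry headers, so we typically write:

self.session.get(url, params=params, headers=headers)

The only inconvenience is that every later call has to be written this way, which bloats the code. Instead, we can set defaults in the constructor so that they apply to the whole session:

# Set the request headers
self.s = requests.Session()
self.s.headers = headers  # a dict of default headers
# Disable server certificate verification
self.s.verify = False
# Set the proxies
self.s.proxies = proxies  # a dict mapping scheme to proxy URL
# If the headers change later, just pass the new ones again.
self.s.get(url, params=params, headers=new_headers)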
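How session-level defaults combine with per-request values can be inspected without sending anything, using Session.prepare_request(). The User-Agent value below is hypothetical and the URL is only an example.

```python
import requests

session = requests.Session()
# Defaults applied to every request sent through this session.
session.headers.update({'User-Agent': 'my-client/1.0'})  # hypothetical value

# Per-request headers are merged with, and take precedence over, the defaults.
req = requests.Request(
    'GET',
    'http://httpbin.org/get',
    params={'q': 'python'},
    headers={'Accept': 'application/json'},
)
prepared = session.prepare_request(req)   # nothing is sent here

print(prepared.url)
print(prepared.headers['User-Agent'], prepared.headers['Accept'])
```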