BeautifulSoup4: An Introduction and a Zhihu Login Example

I. Introduction to BeautifulSoup4

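Like lxml, Beautiful Soup is an HTML/XML parser whose main job is to parse HTML/XML documents and extract data from them.
lxml only traverses the parts of a document it needs, whereas Beautiful Soup is built on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory overhead are therefore considerably higher, and it is slower than lxml.
Beautiful Soup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors and can use either the HTML parser in the Python standard library or the lxml parsers (HTML and XML).
Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4, which can be installed with pip: pip install beautifulsoup4
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/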
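As a minimal sketch of how the parser is chosen, the example below passes the parser name as the second argument to BeautifulSoup. The markup and variable names are made up for illustration, and "html.parser" (the standard-library parser) is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
snippet = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# "html.parser" is the parser shipped with Python's standard library.
# Passing "lxml" instead selects lxml's HTML parser (needs `pip install lxml`).
soup = BeautifulSoup(snippet, "html.parser")

title_text = soup.title.string   # text inside <title>
intro_text = soup.p.get_text()   # text inside the first <p>
print(title_text, "|", intro_text)
```

Switching parsers only changes the second argument; the rest of the API stays the same.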

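II. Usage

Sample code (the IPython sessions in this section were recorded under Python 2, which is why strings appear with a u prefix and print is used as a statement):

from bs4 import BeautifulSoup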

html = """
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>Insert title here</title></head><frameset rows="70%,*"><frame bordercolor="1" src="04_table.html" noresize="noresize" /><frameset cols="20%,*"><frame bordercolor="1" src="layout/b.html" noresize="noresize" /><frame bordercolor="1" noresize="noresize" name="content" /></frameset></frameset><body><a href="http://www.baidu.com" class="hehe" id="link1"><!-- hehehe --></a></body>
</html>
"""
bs = BeautifulSoup(html, "lxml")  # build the tree with lxml's HTML parser

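1. The four kinds of objects. Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup and Comment.

Tag: accessing a tag by name as an attribute (for example bs.head) returns the first tag with that name in the document.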

In [22]: bs.head
Out[22]: <head>\n<meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n<title>Insert title here</title>\n</head>

In [23]: print type(bs.head)
<class 'bs4.element.Tag'>

In [24]: bs.name
Out[24]: u'[document]'

In [25]: bs.head.name
Out[25]: 'head'

In [26]: bs.a.attrs
Out[26]: {'class': ['hehe'], 'href': 'http://www.baidu.com', 'id': 'link1'}

In [27]: bs.a['class']
Out[27]: ['hehe']

In [31]: bs.a['class']="haha"  # modify an attribute

In [32]: bs.a['class']
Out[32]: 'haha'

In [33]: del bs.a['class']

In [34]: bs.a.attrs
Out[34]: {'href': 'http://www.baidu.com', 'id': 'link1'}
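The attribute operations in the session above can be sketched as a small Python 3 script. The markup is a cut-down, hypothetical version of the sample document, and "html.parser" is an assumption made so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://www.baidu.com" class="hehe" id="link1">link</a>',
    "html.parser",
)
link = soup.a                    # first <a> tag in the document

original_class = link["class"]   # class is multi-valued, so this is a list
link["class"] = "haha"           # overwrite an existing attribute
link["title"] = "home"           # add a new attribute
del link["id"]                   # remove an attribute
missing = link.get("id")         # .get() returns None instead of raising KeyError

print(link.attrs)
```

Using .get() is a reasonable default when an attribute may be absent, since indexing with [] raises KeyError in that case.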

NavigableString: the text inside a tag.

In [45]: bs.title.string   # .string returns the tag's text content
Out[45]: u'Insert title here'

In [46]: print type(bs.title.string)
<class 'bs4.element.NavigableString'>

BeautifulSoup: the object that represents the document as a whole. Its name is '[document]' and it has no attributes of its own.

In [38]: bs.name
Out[38]: u'[document]'

In [39]: print type(bs.name)
<type 'unicode'>

In [40]: bs.attrs
Out[40]: {}

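Comment: a special type of NavigableString. Its value does not include the comment delimiters (<!-- and -->).

In [35]: bs.a.string  # content of the <a> tag, which here is a comment
Out[35]: u' hehehe '

In [36]: print type(bs.a.string)
<class 'bs4.element.Comment'>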
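Because a Comment prints like ordinary text, code that collects text usually has to check the type explicitly. The sketch below shows one way to do that; visible_string is a hypothetical helper name, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<a><!-- hehehe --></a><b>plain text</b>", "html.parser")


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(type(soup.a.string).__name__)   # Comment
print(visible_string(soup.a))         # None: the only child is a comment
print(visible_string(soup.b))         # plain text
```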

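2. Traversing the document tree

Direct children: the .contents and .children attributes
.contents returns a tag's child nodes as a list.
.children returns an iterator over the same child nodes, so it has to be looped over.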

In [52]: bs.frameset.contents
Out[52]:
[u'\n',
 <frame bordercolor="1" noresize="noresize" src="04_table.html"/>,
 u'\n',
 <frameset cols="20%,*">\n<frame bordercolor="1" noresize="noresize" src="layout/b.html"/>\n<frame bordercolor="1" name="content" noresize="noresize"/>\n</frameset>,
 u'\n']

In [53]: bs.frameset.contents[3]
Out[53]: <frameset cols="20%,*">\n<frame bordercolor="1" noresize="noresize" src="layout/b.html"/>\n<frame bordercolor="1" name="content" noresize="noresize"/>\n</frameset>
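To make the difference between the two attributes concrete, here is a small Python 3 sketch on invented markup: .contents is indexed like a list, while .children is consumed in a loop. Filtering on Tag skips the text nodes (such as the u'\n' entries seen above).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>one</li><li>two <b>bold</b></li></ul>", "html.parser")
ul = soup.ul

child_list = ul.contents          # a list, so indexing and len() work
first_child = child_list[0]

# .children yields the same nodes one at a time; keep only the tags.
child_names = [child.name for child in ul.children if isinstance(child, Tag)]

print(len(child_list), first_child, child_names)
```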

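All descendants: the .descendants attribute, which also has to be iterated over.
Node content: the .string attribute
If a tag has exactly one child and that child is a NavigableString, .string returns that child. If a tag's only child is another tag, .string returns whatever that child's .string returns. If a tag has more than one child, .string is None.

In plain terms: when a tag contains no other tags, .string returns its text; when it contains exactly one tag, .string returns the innermost text.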
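The .string rules and .descendants can be checked with a short Python 3 sketch. The markup below is hypothetical and chosen to cover three cases: a tag holding only text, a tag holding a single nested tag, and a tag holding several children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<div><p>only text</p><h1><b>nested</b></h1><ul><li>a</li><li>b</li></ul></div>",
    "html.parser",
)

text_only = soup.p.string           # the tag's single text child
single_child = soup.h1.string       # delegated to the only child, <b>
several_children = soup.ul.string   # ambiguous with two <li> children, so None

# .descendants walks children, grandchildren, and so on, in document order.
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]

print(text_only, single_child, several_children, descendant_tags)
```

When a tag may hold several children, get_text() is usually the safer way to collect all of its text.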

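3. Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)
The name argument accepts a string, a regular expression, or a list.
The text argument searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, or a list. Beautiful Soup 4.4.0 and later also accept this argument under the name string.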

In [93]: bs.find_all('a')
Out[93]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [94]: bs.find_all(['a','title'])
Out[94]:
[<title>Insert title here</title>,
 <a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [95]: bs.find_all(id="link1")
Out[95]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [96]: bs.find_all(text="hehehe")  # no match: the comment's text is ' hehehe ', spaces included, and a plain string must match exactly
Out[96]: []

In [97]: bs.find_all(text="Insert title here")
Out[97]: [u'Insert title here']

In [98]: import re

In [99]: bs.find_all(text=re.compile("Insert title"))
Out[99]: [u'Insert title here']
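The same kinds of searches can be sketched in Python 3 as follows. The markup is a simplified, made-up document, and the string keyword is used in place of text, which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<title>Insert title here</title>'
    '<a id="link1" href="http://www.baidu.com">Baidu</a><b>bold</b>',
    "html.parser",
)

by_name = soup.find_all("a")                            # tag name as a string
by_list = soup.find_all(["a", "title"])                 # any name in the list
by_regex = soup.find_all(re.compile("^b"))              # tag names starting with "b"
by_id = soup.find_all(id="link1")                       # keyword argument filters on an attribute
by_string = soup.find_all(string=re.compile("title"))   # searches text, returns strings
first_only = soup.find("a")                             # first match, or None

print([t.name for t in by_list], by_string)
```

find() is convenient when only the first match matters, since it returns a single element (or None) instead of a list.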

4. The select method (CSS selectors)

In [83]: bs.select('title')  # find by tag name
Out[83]: [<title>Insert title here</title>]

In [84]: bs.select('.hehe')  # find by class name
Out[84]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [85]: bs.select('#link1')  # find by id
Out[85]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [86]: bs.select('head > title')  # title tags that are direct children of head
Out[86]: [<title>Insert title here</title>]

In [87]: bs.select('a #link1')  # the space is a descendant combinator: an element with id link1 inside an <a>, so nothing matches here; 'a#link1' would select the <a> itself
Out[87]: []

In [90]: bs.select('a[herf="http://www.baidu.com"]')  # find by attribute; the attribute name is misspelled ("herf"), so nothing matches
Out[90]: []

In [91]: bs.select('a[href="http://www.baidu.com"]')  # find by attribute
Out[91]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

All of the select calls above return a list. Iterate over the list and call get_text() on each element to get its text.
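The difference between 'a#link1' and 'a #link1', and the use of get_text() on the results, can be sketched in Python 3 like this. The markup is invented for the example, and select_one is shown as a convenience for the single-result case.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="nav"><a id="link1" class="hehe" href="http://www.baidu.com">Baidu</a></div>',
    "html.parser",
)

same_element = soup.select("a#link1")     # an <a> whose own id is link1
descendant = soup.select("a #link1")      # an element with id link1 inside an <a>
inside_div = soup.select("div #link1")    # the descendant form does match under <div>
by_attr = soup.select('a[href="http://www.baidu.com"]')

# select() always returns a list, so loop over it to read the text.
texts = [tag.get_text() for tag in soup.select("a.hehe")]
first = soup.select_one("#link1")         # first match, or None

print(texts, len(descendant))
```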

III. Zhihu Login Example

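This example logs in to Zhihu and saves the home page as an HTML file.
Two kinds of captcha are handled. The first is an alphanumeric image: the image is saved locally and you type what it shows, for example jk12 (see the commented-out code in captcha()). The second asks you to click the upside-down characters: the image is also saved locally, and you enter the number of upside-down characters followed by their x coordinates. The coordinates follow a regular pattern, described in the code comments. For example, if the first and second characters are upside down, enter: 2,23,46
To test the example, change three things: your account, your password, and your home page address.
Alternatively, comment out the last three statements of zhihuLogin(), which fetch the home page and write it to my.html.

# coding: utf-8
# Python 3

import time

import requests
from bs4 import BeautifulSoup


def captcha(captcha_data):
    # Save the captcha image locally so it can be inspected by hand
    with open("captcha.jpg", "wb") as f:
        f.write(captcha_data)
    # For the alphanumeric captcha, use these two lines instead:
    # text = input("Enter the captcha: ")
    # return text
    text = input("Enter the number of upside-down characters and their x coordinates: ")
    # The first character is at [23,23], the second at [46,23], ...
    arr = text.split(",")
    if "1" == arr[0]:
        result = '{"img_size":[200,44],"input_points":[[%s,23]]}' % int(arr[1])
    else:
        result = '{"img_size":[200,44],"input_points":[[%s,23],[%s,23]]}' % (int(arr[1]), int(arr[2]))

    return result


def zhihuLogin():
    # Build a Session object, which keeps cookies between requests
    session = requests.Session()
    # Request headers. The original listing is missing the line that defined
    # them, so this User-Agent is an assumed placeholder; use your browser's.
    headers = {"User-Agent": "Mozilla/5.0"}

    # GET the login page to find the _xsrf value; the session records the cookies
    html = session.get("https://www.zhihu.com/#signin", headers=headers).text

    # Parse with the lxml parser
    bs = BeautifulSoup(html, "lxml")
    # _xsrf protects against CSRF (cross-site request forgery).
    # Such attacks typically ride on a user's cookies to send requests as a
    # user the site trusts, stealing user data or deceiving the web server.
    # The site therefore stores this MD5 string in a hidden form field and
    # uses it to check the user's cookie against the server-side session.
    _xsrf = bs.find("input", attrs={"name": "_xsrf"}).get("value")

    # captcha_url = "http://www.zhihu.com/captcha.gif?r=%d&type=login" % (time.time() * 1000)
    captcha_url = "https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn" % (time.time() * 1000)
    captcha_data = session.get(captcha_url, headers=headers).content
    text = captcha(captcha_data)

    data = {
        "_xsrf": _xsrf,
        "phone_num": "**your account**",
        "password": "**your password**",
        "captcha_type": "cn",
        "captcha": text,
    }

    response = session.post("https://www.zhihu.com/login/phone_num", data=data, headers=headers)
    print(response.text)

    # Fetch the home page with the logged-in session and save it
    response = session.get("**your home page address after login**", headers=headers)
    with open("my.html", "w", encoding="utf-8") as f:
        f.write(response.text)


if __name__ == "__main__":
    zhihuLogin()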
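The captcha string built by hand in the script can also be produced with json.dumps, which handles any number of clicked characters. This is only a sketch: build_captcha_payload, CHAR_STEP and Y_COORD are hypothetical names, and the rule that character n sits at x = 23 * n is extrapolated from the script's comment (first at [23,23], second at [46,23]), so positions beyond the second are an assumption.

```python
import json

CHAR_STEP = 23   # assumed horizontal spacing: character n is at x = 23 * n
Y_COORD = 23     # y coordinate used for every click in the script


def build_captcha_payload(positions):
    """Build the captcha field from 1-based positions of upside-down characters."""
    points = [[CHAR_STEP * p, Y_COORD] for p in positions]
    payload = {"img_size": [200, 44], "input_points": points}
    # Compact separators reproduce the spacing-free format used in the script.
    return json.dumps(payload, separators=(",", ":"))


print(build_captcha_payload([1, 2]))
```

Building the payload from a dictionary avoids a separate format string for each count of clicked characters.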
