BeautifulSoup4: an HTML parsing library

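BeautifulSoup is a Python library for extracting data from HTML and XML files. It is much simpler and more convenient to use than regular expressions and often saves a great deal of time.

BeautifulSoup also has official documentation in Chinese: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Installation

BeautifulSoup is easy to install with pip.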

pip install beautifulsoup4

A simple example

The following HTML snippet, an excerpt from Alice's Adventures in Wonderland, is used as the example throughout.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""

Page data we fetch usually arrives as a plain string like the one above, so the first step is to parse that string with BeautifulSoup. This yields a BeautifulSoup object, through which all further operations are performed.

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup(html_doc)

In [3]: soup.title
Out[3]: <title>The Dormouse's story</title>

In [4]: soup.title.name
Out[4]: 'title'

In [5]: soup.title.string
Out[5]: "The Dormouse's story"

In [6]: soup.title.parent.name
Out[6]: 'head'

In [7]: soup.p
Out[7]: <p class="title"><b>The Dormouse's story</b></p>

In [8]: soup.p['class']
Out[8]: ['title']

In [9]: soup.a
Out[9]: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [10]: soup.find_all('a')
Out[10]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [11]: soup.find(id="link3")
Out[11]: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

As you can see, this is far simpler than doing the same work with regular expressions.

Usage

Specifying a parser

In the example above, the page is parsed before any data is looked up:

soup = BeautifulSoup(html_doc)

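When no parser is named, BeautifulSoup automatically picks one that is available on the system. The main parsers are:

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best error tolerance; parses documents the way a browser does; produces HTML5-style documents	Slow; requires the external html5lib package

In large-scale crawling, the parsing step affects the speed of the whole crawler, so lxml is recommended because it is much faster. lxml has to be installed separately: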

pip install lxml

Once it is installed, pass lxml as the parser when parsing a page.

soup = BeautifulSoup(html_doc, 'lxml')

Tip: if an HTML or XML document is malformed, different parsers may return different results, so always name one parser explicitly.
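The effect of the parser choice can be seen on a deliberately broken fragment. The snippet below is only a sketch: the markup `<a></p>` is made up for illustration, and the second call assumes lxml is installed.

```python
from bs4 import BeautifulSoup

broken = "<a></p>"  # invalid on purpose: a stray closing </p>

# Naming the parser keeps results reproducible from machine to machine.
builtin = BeautifulSoup(broken, "html.parser")
fast = BeautifulSoup(broken, "lxml")  # assumes `pip install lxml` was run

print(builtin)  # <a></a>
print(fast)     # lxml additionally wraps the fragment in <html><body>
```

The two trees differ even though the input is identical, which is why code that relies on the default parser can behave differently on another machine.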

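Node objects

BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

tag

A tag object corresponds to a tag in the document, and it has many methods and attributes.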

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>

name

Every tag object has a name attribute that holds the name of the tag.
```
>>> tag.name
'b'
```
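Attributes

An HTML tag can have several attributes, and they are read from a tag the same way values are read from a dictionary.
```
>>> tag['class']
['boldest']
```
Multi-valued attributes such as class are always returned as a list, with one item per value.
```
>>> soup = BeautifulSoup('<p class="body strikeout"></p>')
>>> soup.p['class']
['body', 'strikeout']
```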
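Since attribute access works like a dictionary, the usual dictionary habits apply. A minimal sketch, using a made-up `<a>` tag and the built-in html.parser so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

markup = '<a href="/elsie" class="sister big" id="link1">Elsie</a>'  # illustrative markup
tag = BeautifulSoup(markup, "html.parser").a

print(tag.attrs)         # every attribute of the tag, as a dict
print(tag["href"])       # /elsie (single-valued attribute, returned as str)
print(tag["class"])      # ['sister', 'big'] (multi-valued attribute, returned as list)
print(tag.get("title"))  # None: .get() avoids a KeyError when the attribute is missing
```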

get_text()

The get_text() method returns all the text under a tag. Here soup is again the Alice document parsed earlier.

In [1]: soup.body.get_text()
Out[1]: "The Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
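get_text() also accepts a separator and a strip flag, which helps when the raw text is full of newlines. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

plain = soup.get_text()                  # pieces of text concatenated as-is
joined = soup.get_text("|", strip=True)  # each piece stripped, then joined with "|"

print(plain)   # Hello world!
print(joined)  # Hello|world|!
```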

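NavigableString

NavigableString means a string that can be navigated. The text wrapped inside a tag is normally a NavigableString.

In [1]: soup = BeautifulSoup('<p>No longer bold</p>')

In [2]: soup.p.string
Out[2]: 'No longer bold'

In [3]: type(soup.p.string)
Out[3]: bs4.element.NavigableString

BeautifulSoup

The BeautifulSoup object is the object obtained by parsing a page.

Comment

Comment represents comments in the page and other special strings; it is a special type of NavigableString.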
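A short sketch of how a Comment shows up in practice, on a made-up fragment. It assumes a bs4 version that accepts the `string=` argument of find_all() (4.4.0 or later).

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>visible<!-- hidden note --></p>", "html.parser")

# A function passed as string= is called with each string in the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(comments)           # [' hidden note ']
print(type(comments[0]))  # <class 'bs4.element.Comment'>
```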

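Tag and traversing the document tree

The tag object is arguably the most important object in BeautifulSoup; nearly all data extraction with BeautifulSoup revolves around it.

First, a node can contain multiple child nodes and multiple strings. For example, the html node contains the head and body nodes. BeautifulSoup therefore represents an HTML page as layers of nested nodes.

The examples below use the Alice in Wonderland document from above:

contents and children

contents returns all the direct children of a node, including any NavigableString objects among them, as a list.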

In [1]: soup.head.contents
Out[1]: [<title>The Dormouse's story</title>]

children also returns all the direct children of a node, but as an iterator, which uses less memory than a list.

In [1]: tags = soup.head.children

In [2]: tags
Out[2]: <list_iterator at 0x110f76940>

In [3]: for tag in tags:
   ...:     print(tag)
   ...:
<title>The Dormouse's story</title>

descendants

contents and children only return the direct children of a node, not deeper descendants. descendants returns every descendant node; like children, the result has to be iterated over or converted to another type before use.

In [1]: len(list(soup.body.descendants))
Out[1]: 19

In [2]: len(list(soup.body.children))
Out[2]: 6
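The difference between direct children and all descendants is easier to see on a tiny tree. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a<b>b</b></p><p>c</p></div>", "html.parser")
div = soup.div

direct = list(div.children)     # only the two <p> tags
nested = list(div.descendants)  # <p>, 'a', <b>, 'b', <p>, 'c'

print(len(direct), len(nested))  # 2 6
print(div.contents == direct)    # True: same nodes, contents is already a list
```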

string and strings

We often need the text value of a node. If the node contains exactly one string, string retrieves it directly.

In [1]: soup.title.string
Out[1]: "The Dormouse's story"

If the node contains more than one string, BeautifulSoup cannot tell which one is wanted and string returns None; use strings instead.

In [1]: list(soup.body.strings)
Out[1]:
["The Dormouse's story",'\n','Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']

stripped_strings drops the strings that consist only of whitespace and strips leading and trailing whitespace from the rest.

In [1]: list(soup.body.stripped_strings)
Out[1]:
["The Dormouse's story",'Once upon a time there were three little sisters; and their names were','Elsie', ',', 'Lacie', 'and', 'Tillie', ';\nand they lived at the bottom of a well.', '...']
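The three accessors can be compared side by side on a node that mixes text and a nested tag. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b></p>", "html.parser")

single = soup.b.string                   # <b> holds exactly one string
ambiguous = soup.p.string                # <p> holds a string and a tag, so this is None
pieces = list(soup.p.strings)            # every string, whitespace kept
clean = list(soup.p.stripped_strings)    # every string, whitespace stripped

print(single, ambiguous)  # two None
print(pieces, clean)      # ['one ', 'two'] ['one', 'two']
```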

Parent nodes: parent and parents

Sometimes we need the parent of a node, that is, the node that wraps the current one.

In [1]: soup.b.parent
Out[1]: <p class="title"><b>The Dormouse's story</b></p>

parents yields every ancestor of the current node, walking up to the top of the document.

In [1]: [i.name for i in soup.b.parents]
Out[1]: ['p', 'body', 'html', '[document]']

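Sibling nodes

Sibling nodes are nodes that share the same parent.

next_sibling and previous_sibling

Which sibling is selected depends on the position of the current node: next_sibling returns the sibling immediately after the current node, and previous_sibling returns the one immediately before it.

Consequently, the first sibling has no previous_sibling and the last one has no next_sibling. In both cases the attribute is None, so the interpreter prints nothing.
```
In [51]: soup.head.next_sibling
Out[51]: '\n'

In [52]: soup.head.previous_sibling

In [59]: soup.body.previous_sibling
Out[59]: '\n'
```

next_siblings and previous_siblings

Correspondingly, next_siblings yields all the siblings after the current node, and previous_siblings yields all the siblings before it.

In [47]: [i.name for i in soup.head.next_siblings]
Out[47]: [None, 'body']

In [48]: [i.name for i in soup.body.next_siblings]
Out[48]: []

In [49]: [i.name for i in soup.body.previous_siblings]
Out[49]: [None, 'head']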
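As the outputs above show, the nearest sibling is often a whitespace or punctuation string rather than a tag. One way around this, sketched here on made-up markup, is find_next_sibling(), which skips to the next sibling that matches a tag name:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw = first.next_sibling                  # the text between the two tags
next_tag = first.find_next_sibling("a")   # the next <a> sibling, strings skipped

print(repr(raw))       # ',\n'
print(next_tag["id"])  # link2
```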

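find_all()

Navigating through attributes as shown above only suits fairly simple cases, so BeautifulSoup also provides find_all(), which searches the whole document tree.

Note that find_all() can be called on almost any node object.

Searching by name

As shown below, find_all() finds every b tag in the document tree and returns them as a list.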

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]

If a list is passed in, a tag matches when its name equals any item in the list. The result below contains all the a tags and b tags.

>>> soup.find_all(["a", "b"])
[<b>The Dormouse's story</b>,<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

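Searching by attribute

A tag name alone is usually not enough, because many tags may share the same name. In that case we search by the tag's attributes.

To do so, pass a dictionary to the attrs argument.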

In [1]: soup.find_all(attrs={'class': 'sister'})
Out[1]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, every tag whose class attribute is sister was found.
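Besides the attrs dictionary, find_all() accepts attributes as keyword arguments. Because class is a reserved word in Python, the keyword is spelled class_. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<b class="title">Story</b>'
)
soup = BeautifulSoup(html, "html.parser")

sisters = soup.find_all("a", class_="sister")  # tag name plus class
by_id = soup.find_all(id="link2")              # any tag with this id
missing = soup.find(class_="nope")             # find() returns None when nothing matches

print(len(sisters), by_id[0].get_text(), missing)  # 2 Lacie None
```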

Searching by text

find_all() can also search by text content.

>>> soup.find_all(text="Elsie")
['Elsie']
>>> soup.find_all(text=["Tillie", "Elsie", "Lacie"])
['Elsie', 'Lacie', 'Tillie']

The results are string objects. To find the tag that contains a given text instead, add the tag name.

>>> soup.find_all("a", text="Elsie")
[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
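In Beautiful Soup 4.4.0 and later the same argument is called string, and text is kept as the older name. The sketch below assumes such a version and uses made-up markup:

```python
import re
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a> <a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")                # matching strings
tags = soup.find_all("a", string=re.compile("^L"))   # <a> tags whose text starts with L

print(exact)                   # ['Elsie']
print([t["id"] for t in tags])  # ['link2']
```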

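Restricting the search to direct children

By default find_all() searches all descendants. Setting the recursive argument to False restricts the search to direct children.

>>> soup.html.find_all("title")
[<title>The Dormouse's story</title>]
>>> soup.html.find_all("title", recursive=False)
[]

Filtering results with regular expressions

BeautifulSoup also works together with the re module: pass an object compiled by re.compile to find_all() to search with a regular expression.

In [1]: import re

In [2]: tags = soup.find_all(re.compile("^b"))

In [3]: [i.name for i in tags]
Out[3]: ['body', 'b']

Two tags whose names start with 'b' were found.

Likewise, a regular expression can be used to filter on tag attributes.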

In [1]: soup.find_all(attrs={'class': re.compile("si")})
Out[1]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
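Two related filters are worth knowing: a regular expression passed as a keyword argument for one attribute, and a plain function that decides tag by tag. The markup and ids below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="https://example.org/about" id="about">About</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Regex on one attribute: links whose href ends in "lacie".
by_href = soup.find_all("a", href=re.compile(r"lacie$"))

# Function filter: called with each Tag, keeps those for which it returns True.
no_class = soup.find_all(lambda tag: tag.name == "a" and not tag.has_attr("class"))

print([a["id"] for a in by_href])   # ['link2']
print([a["id"] for a in no_class])  # ['about']
```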

CSS selectors

BeautifulSoup also supports searching with CSS selectors. Call select() with a selector string to find tags using CSS selector syntax.

>>> soup.select("title")
[<title>The Dormouse's story</title>]
>>> soup.select("p > a")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
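When only one match is needed, select_one() returns the first matching tag (or None) instead of a list. The sketch below uses a cut-down copy of the example markup and assumes a bs4 version recent enough to ship with its CSS selector backend (4.7 or later).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a.sister")        # first <a> with class sister
by_id = soup.select_one("#link2")          # lookup by id
ends = soup.select('a[href$="tillie"]')    # attribute value ends with "tillie"

print(first["id"], by_id.get_text(), [a["id"] for a in ends])  # link1 Lacie ['link3']
```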

Using BeautifulSoup (bs4)

# -*- coding: utf-8 -*-
# 斌彬电脑
# @Time : 2018/9/4 0004 4:43
from bs4 import BeautifulSoup
import lxml     # parser backend; install with: pip install lxml

# HTML source
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""

# soup = BeautifulSoup(html)
soup = BeautifulSoup(html, 'lxml')       # name the parser explicitly

# print(soup)                    # the parsed HTML
# print(soup.title)              # <title>The Dormouse's story</title>
# print(soup.title.name)         # tag name: title
# print(soup.title.string)       # text inside the tag: The Dormouse's story
# print(soup.a.parent.name)      # name of the a tag's parent: p
# for i in soup.title.parents:   # names of all ancestors of title
#     print(i.name)

# print(soup.a)                  # the whole tag: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# print(soup.a['href'])          # http://example.com/elsie
# print(soup.a['class'])         # ['sister']

# tag = soup.a
# print(type(tag))               # <class 'bs4.element.Tag'>, a tag
# print(tag.name)                # a, the tag name
# print(tag['class'])            # ['sister'], the attribute value

# print(soup.get_text())         # get the text
# print(soup.body.get_text())    # get the text
# print(soup.b.string)           # get the text
# print(soup.b.get_text())       # get the text

# print(soup.b.strings)          # a generator: <generator object _all_strings at 0x00000226E13A11A8>

# for i in soup.head.children:   # all direct children
#     print(i)

# s1 = soup.find_all('a')        # all a tags
# for i in s1:
#     print(i)

# print(soup.find_all(class_='sister'))   # find all tags by class
# print(soup.find_all(id='link3'))        # find tags by id

# print(soup.find_all(text=['Tillie', 'Elsie']))   # search by text
# print(soup.find_all(text="Elsie"))               # search by text
# print(soup.find_all('a', text="Elsie"))          # search by text within a tags

# import re                      # use together with regular expressions
# for i in soup.find_all(re.compile("^b")):   # tag names starting with b
#     # print(i)
#     print(i.name)

# CSS selectors
# print(soup.select('title'))    # find by tag: [<title>The Dormouse's story</title>]
# print(soup.select('a'))
# print(soup.select('.sister'))  # find by class name
# print(soup.select('#link2'))   # find by id
# print(soup.select('p  a'))     # all a tags under p tags
# print(soup.select('p  #link1'))            # the element with id link1 under p tags
# print(soup.select("a[class='sister']"))    # a tags whose class is sister

con = soup.select('title')[0].get_text()     # get the text
print(con)

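Simpler and more efficient HTML data extraction: XPath

XPath is a language for finding information in XML documents. It navigates through the elements and attributes of an XML document.

Compared with BeautifulSoup, XPath is often more convenient for extracting data.

Installation

Many Python libraries provide XPath support, but the most fundamental and most efficient one is lxml. Its installation was covered in the BeautifulSoup section above.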

pip install lxml

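Syntax

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, or steps.

The following HTML document is used for the demonstrations: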

html_doc = '''
<html>
<head></head>
<body>
    <bookstore>
        <book category="COOKING">
            <title lang="en">Everyday Italian</title>
            <author>Giada De Laurentiis</author>
            <year>2005</year>
            <price>30.00</price>
        </book>
        <book category="CHILDREN">
            <title lang="en">Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
        </book>
        <book category="WEB">
            <title lang="en">XQuery Kick Start</title>
            <author>James McGovern</author>
            <author>Per Bothner</author>
            <author>Kurt Cagle</author>
            <author>James Linn</author>
            <author>Vaidyanathan Nagarajan</author>
            <year>2003</year>
            <price>49.99</price>
        </book>
        <book category="WEB">
            <title lang="en">Learning XML</title>
            <author>Erik T. Ray</author>
            <year>2003</year>
            <price>39.95</price>
        </book>
    </bookstore>
</body>
</html>'''

from lxml import etree
page = etree.HTML(html_doc)

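Path lookup

Expression	Description
nodename	Selects the child nodes of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere in the document, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

Find a child node of the current node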

In [1]: page.xpath('head')
Out[1]: [<Element head at 0x111c74c48>]

Search from the root node

In [2]: page.xpath('/html')
Out[2]: [<Element html at 0x11208be88>]

Search among all nodes in the whole document

In [3]: page.xpath('//book')
Out[3]:
[<Element book at 0x1128c02c8>,<Element book at 0x111c74108>, <Element book at 0x111fd2288>, <Element book at 0x1128da348>]

Select the parent of the current node

In [4]: page.xpath('//book')[0].xpath('..')
Out[4]: [<Element bookstore at 0x1128c0ac8>]

Select an attribute

In [5]: page.xpath('//book')[0].xpath('@category')
Out[5]: ['COOKING']
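One detail worth checking when calling xpath() on an element rather than on the whole page: a path that starts with // is still evaluated against the whole document, while .// stays inside the current element. The sketch below uses a small made-up XML string parsed with etree.fromstring so that it runs on its own.

```python
from lxml import etree

root = etree.fromstring(
    "<bookstore>"
    "<book><title>Everyday Italian</title></book>"
    "<book><title>Harry Potter</title></book>"
    "</bookstore>"
)
first_book = root.xpath("//book")[0]

relative = first_book.xpath(".//title/text()")  # searches inside first_book only
absolute = first_book.xpath("//title/text()")   # searches the whole document

print(relative)  # ['Everyday Italian']
print(absolute)  # ['Everyday Italian', 'Harry Potter']
```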

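Node lookup

Expression	Result
nodename[1]	Selects the first element.
nodename[last()]	Selects the last element.
nodename[last()-1]	Selects the second-to-last element.
nodename[position()<3]	Selects the first two child elements.
nodename[@lang]	Selects elements that have an attribute named lang.
nodename[@lang='eng']	Selects elements that have a lang attribute whose value is eng.

Select the second book element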

In [1]: page.xpath('//book[2]/@category')
Out[1]: ['CHILDREN']

Select the third-to-last book element

In [2]: page.xpath('//book[last()-2]/@category')
Out[2]: ['CHILDREN']

Select all elements from the second one onward

In [3]: page.xpath('//book[position() > 1]/@category')
Out[3]: ['CHILDREN', 'WEB', 'WEB']

Select the elements whose category attribute is WEB

In [4]: page.xpath('//book[@category="WEB"]/@category')
Out[4]: ['WEB', 'WEB']
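Predicates can also compare the value of a child element and can be combined with and. The following sketch uses a shortened, made-up version of the bookstore data, parsed as XML so that the block is self-contained.

```python
from lxml import etree

root = etree.fromstring("""
<bookstore>
  <book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="WEB"><title>XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
""")

# Numeric comparison on a child element.
expensive = root.xpath("//book[price > 35]/title/text()")

# Two conditions joined with "and".
cheap_web = root.xpath('//book[@category="WEB" and price < 45]/title/text()')

# XPath functions such as count() return a number instead of a list.
web_count = root.xpath('count(//book[@category="WEB"])')

print(expensive)  # ['XQuery Kick Start', 'Learning XML']
print(cheap_web)  # ['Learning XML']
print(web_count)  # 2.0
```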

Unknown nodes

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.

Match all elements under the first book element

In [1]: page.xpath('//book[1]/*')
Out[1]:
[<Element title at 0x111f76788>,<Element author at 0x111f76188>, <Element year at 0x1128c1a88>, <Element price at 0x1128c1cc8>]

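Getting the text of a node

Use text() to get the text directly under a node
```
In [1]: page.xpath('//book[1]/author/text()')
Out[1]: ['Giada De Laurentiis']
```
text() only returns the text nodes that are direct children of the node; text inside nested elements is not included.

Use string() to get all the text under a node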

In [2]: page.xpath('string(//book[1])')
Out[2]: '\n            Everyday Italian\n            Giada De Laurentiis\n            2005\n            30.00\n        '
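The difference between text() and string() shows up as soon as a node contains nested tags. A minimal sketch on a made-up paragraph:

```python
from lxml import etree

page = etree.HTML("<p>Hello <b>big</b> world</p>")

direct = page.xpath("//p/text()")      # text nodes that are direct children of <p>
all_text = page.xpath("//p//text()")   # text nodes at any depth under <p>
joined = page.xpath("string(//p)")     # one string with all the text concatenated

print(direct)    # ['Hello ', ' world']
print(all_text)  # ['Hello ', 'big', ' world']
print(joined)    # Hello big world
```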

Selecting multiple paths

The "|" operator in a path expression selects several paths at once.

In [1]: page.xpath('//book[1]/title/text() | //book[1]/author/text()')
Out[1]: ['Everyday Italian', 'Giada De Laurentiis']

Using XPath

# -*- coding: utf-8 -*-
# 斌彬电脑
# @Time : 2018/9/4 0004 7:07
from lxml import etree

# HTML source
html1= '''<html><head></head><body><bookstore><book category="COOKING"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book><book category="CHILDREN"><title lang="en">Harry Potter</title><author>J K. Rowling</author><year>2005</year><price>29.99</price></book><book category="WEB"><title lang="en">XQuery Kick Start</title><author>James McGovern</author><author>Per Bothner</author><author>Kurt Cagle</author><author>James Linn</author><author>Vaidyanathan Nagarajan</author><year>2003</year><price>49.99</price></book><book category="WEB"><title lang="en">Learning XML</title><author>Erik T. Ray</author><year>2003</year><price>39.95</price></book></bookstore>
</body></html>'''
html = etree.HTML(html1)
# print(html)
# print(html.xpath('/html/body/bookstore/book'))                     # start from the root
# print(html.xpath('/html/body/bookstore/book/title/text()')[2])     # text of the third title
# print(html.xpath('/html/body/bookstore/book/title/@lang'))         # lang attribute values of title

# print(html.xpath('//title/@lang'))      # find title anywhere in the document

# Finding nodes
# conn = html.xpath('//book')[0]
# print(conn.xpath('./price'))            # price under the current node
# print(conn.xpath('./price/text()'))     # text of price under the current node
# print(conn.xpath('./price/..'))         # parent of price: [<Element book at 0x2c67fc8>]
# print(html.xpath('//book[@category="WEB"]'))                # find nodes by attribute
# print(html.xpath('//book[@category="WEB"]/year/text()'))    # find nodes by attribute

# Indexes here start at 1, not 0
# print(html.xpath('//book[1]'))
# print(html.xpath('//book[2]'))
# print(html.xpath('//book[3]'))
# print(html.xpath('//book[4]'))
# print(html.xpath('//book[last()]'))       # the last one
# print(html.xpath('//book[last()-1]'))     # second to last
# print(html.xpath('//book[last()-2]'))     # third to last
# print(html.xpath('//book[last()-3]/author/text()'))     # fourth to last

# print(html.xpath('//*'))                  # all tags
# print(html.xpath('//book/*/text()'))      # all text under book tags, as a list
# print(html.xpath('string(//book)'))       # all text under the first matching book, as a str
# print(html.xpath('string(//book[1])'))    # all text under the first book, as a str
# print(html.xpath('string(//book[4])'))    # all text under the fourth book, as a str

# print(html.xpath('//book/title/text() | //book/author/text()'))          # select multiple paths
# print(html.xpath('//book[1]/title/text() | //book[1]/author/text()'))    # select multiple paths

Reposted from: https://www.cnblogs.com/gdwz922/p/9582285.html
