Python's Tasty Soup: An Introduction to BeautifulSoup

1. What BeautifulSoup is: a BeautifulSoup object represents the entire content of an HTML/XML document.
2. BeautifulSoup parsers (mk below is the markup string to parse):

The HTML parser that works with bs4 out of the box:

Usage: BeautifulSoup(mk, 'html.parser')
Requirement: install the bs4 library

lxml's HTML parser:

Usage: BeautifulSoup(mk, 'lxml')
Requirement: pip install lxml

lxml's XML parser:

Usage: BeautifulSoup(mk, 'xml')
Requirement: pip install lxml

The html5lib parser:

Usage: BeautifulSoup(mk, 'html5lib')
Requirement: pip install html5lib
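
Since lxml and html5lib are optional installs, code that should run anywhere can try a preferred parser and fall back to html.parser. A minimal sketch of that idea follows; the helper name make_soup and the preference order are assumptions made for illustration, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from broken HTML, so it is worth knowing which one was actually used.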

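3. Basic elements:

(1) Tag: a tag, the most basic unit for organizing information
(2) Name: the name of the tag
(3) Attributes: the attributes of the tag, organized as a dictionary
(4) NavigableString: the non-attribute string inside a tag
(5) Comment: the comment part of the string inside a tag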
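
The five elements listed above can be seen side by side in a small sketch. The markup is a made-up stand-in chosen so that one tag holds a comment and another holds plain text.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: <b> contains only a comment, <p> also contains plain text.
html = '<p class="title"><b><!--a comment--></b>plain text</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))   # Tag
print(p.name)               # Name
print(p.attrs)              # Attributes, organized like a dictionary

comment = soup.b.string     # the single child of <b> is the comment
text = p.contents[1]        # the plain string after </b>
print(type(comment).__name__, type(text).__name__)
```

Comment is a subclass of NavigableString, so when both kinds of string must be told apart, check for Comment first.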
4. A simple example: the snippet below shows some basic usage.

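import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text  # HTML source of the page, as a string
soup = BeautifulSoup(demo, "html.parser")

soup.a.name                # name of the first <a> tag
soup.a.parent.name         # name of its parent tag
soup.a.parent.parent.name  # name of its grandparent tag
tag = soup.a
tag.attrs                  # all attributes of the tag, organized like a dictionary
tag.attrs['class']         # value of the class attribute
tag.attrs['href']          # value of the href attribute
type(tag.attrs)            # type of the attribute container
type(tag)                  # type of the tag object

[Figure: output of the example above]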

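5. Traversing HTML content with bs4

Traversing the tag tree: downward traversal, upward traversal, and sibling (parallel) traversal.
.contents: a list of child nodes; every direct child of the tag is stored in the list.
.children: an iterator over child nodes; similar to .contents, used to loop over the direct children.
.descendants: an iterator over descendant nodes; covers every node below the tag, used for looping.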

I. Downward traversal:
1. Iterate over child nodes: for child in soup.body.children: print(child)
2. Iterate over descendant nodes: for child in soup.body.descendants: print(child)
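
To make the difference between children and descendants concrete, here is a small sketch on made-up markup. It keeps only Tag nodes so that text nodes do not clutter the result.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: <b> is a grandchild of <body>, not a child.
html = "<body><p><b>bold</b> text</p><a>link</a></body>"
soup = BeautifulSoup(html, "html.parser")

child_tags = [n.name for n in soup.body.children if isinstance(n, Tag)]
descendant_tags = [n.name for n in soup.body.descendants if isinstance(n, Tag)]

print(child_tags)       # ['p', 'a']
print(descendant_tags)  # ['p', 'b', 'a']
```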

II. Upward traversal
1. .parent: the parent tag of the node.
2. .parents: an iterator over the node's ancestor tags, used to loop over the ancestors.
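
A short sketch of walking upward with .parents, on made-up markup. The last item yielded is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```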
III. Sibling traversal
1. Iterate over the following siblings: for sibling in soup.a.next_siblings: print(sibling)
2. Iterate over the preceding siblings: for sibling in soup.a.previous_siblings: print(sibling)
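
Siblings are nodes under the same parent, and they include text nodes as well as tags. The sketch below, on made-up markup, shows this and filters the result down to tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: two links separated by plain text inside one <p>.
html = "<p>intro <a id='link1'>one</a> and <a id='link2'>two</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
following = list(first.next_siblings)      # text ' and ', the second <a>, text '.'
preceding = list(first.previous_siblings)  # text 'intro '

# Keep only the siblings that are tags.
tag_siblings = [s for s in following if isinstance(s, Tag)]
print(len(following), [t["id"] for t in tag_siblings], preceding)
```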

1. Fetching the python123 demo page

import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
demo  # show the raw HTML source

[Figure: output of the code above]

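2. Downward traversal: accessing child nodes

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.head
soup.head.contents        # child nodes of <head>
soup.body.contents        # child nodes of <body>
len(soup.body.contents)   # number of child nodes of <body>
soup.body.contents[1]     # second child node of <body>

[Figure: output of the code above]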
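
The count returned by len(...contents) is often larger than the number of child tags, because whitespace between tags is stored as text nodes. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; note the newlines between the tags.
html = "<body>\n<p>first</p>\n<p>second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

contents = soup.body.contents
print(len(contents))      # counts text nodes as well as tags
print(repr(contents[0]))  # a whitespace-only text node
print(contents[1])        # the first <p> tag

# Only the direct child tags, skipping text nodes.
child_tags = soup.body.find_all(True, recursive=False)
print([t.name for t in child_tags])
```

This is why index 1, not index 0, is the first tag when the markup starts with a newline.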

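3. Upward traversal: accessing the parent node

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.title.parent   # parent tag of <title>
soup.html.parent    # parent of <html>: the BeautifulSoup object itself

[Figure: output of the code above]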
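
The top of the tree behaves slightly differently from ordinary tags. This sketch, on made-up markup, checks the three cases: an ordinary tag, the html tag, and the BeautifulSoup object.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.parent.name)    # head
print(soup.html.parent is soup)  # True: <html> hangs directly off the soup object
print(soup.parent)               # None: the soup object has no parent
```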

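4. Sibling traversal

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.a.next_sibling                        # node right after the first <a> (may be text rather than a tag)
soup.a.next_sibling.next_sibling           # the node after that one
soup.a.previous_sibling                    # node right before the first <a>
soup.a.previous_sibling.previous_sibling   # the node before that one
soup.a.parent                              # parent tag of the first <a>

[Figure: output of the code above]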

5. HTML formatting and encoding with bs4

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
demo
soup = BeautifulSoup(demo, "html.parser")
soup.prettify()
print(soup.prettify())  # prints the markup with line breaks and indentation, so it is easy to read

[Figure: output of the code above]

6. Encoding in bs4

from bs4 import BeautifulSoup

# bs4 converts incoming documents to Unicode, so Chinese text needs no special handling
soup = BeautifulSoup("<p>中文</p>", "html.parser")
soup.p.string
print(soup.p.prettify())

[Figure: output of the code above]
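
When the input arrives as bytes in a non-UTF-8 encoding, the encoding can be passed in explicitly and the result written back out as UTF-8. The sketch below assumes GBK-encoded input purely for illustration; from_encoding and Tag.encode are bs4's own names, and the sample bytes are made up.

```python
from bs4 import BeautifulSoup

# Assumption for illustration: bytes as they might arrive from a GBK-encoded page.
raw = "<p>中文</p>".encode("gbk")

# Tell bs4 which encoding the bytes use; the parsed tree holds Unicode strings.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string
print(text)

# Serialize the tag back to bytes, this time as UTF-8.
utf8_bytes = soup.p.encode("utf-8")
print(utf8_bytes)
```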

7. A general approach to extracting information

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
demo
soup = BeautifulSoup(demo, "html.parser")
# print the href attribute of every <a> tag
for link in soup.find_all("a"):
    print(link.get('href'))

[Figure: output of the code above]
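
On real pages some a tags have no href, and many links are relative. A minimal sketch of handling both, using made-up markup and a hypothetical base URL; href=True keeps only tags that carry the attribute, and urljoin comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://example.com/courses/"  # hypothetical page address
# Hypothetical markup: a relative link, an anchor without href, an absolute link.
html = (
    '<a href="python.html">A</a>'
    '<a name="top">B</a>'
    '<a href="http://example.org/x">C</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Skip <a> tags without href and resolve relative links against the base URL.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```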

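8. Searching HTML content with bs4

import re  # regular expression library
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.find('a')          # first <a> tag
soup.find(['a', 'b'])   # first tag that is either <a> or <b>
for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)
# print the names of the tags whose name starts with "b"
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

[Figure: output of the code above]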
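
When a compiled regular expression is passed to find_all, bs4 applies it with search(), so an unanchored pattern matches anywhere in the tag name. The sketch below, on made-up markup, contrasts an unanchored pattern with one anchored by ^.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup containing tag names that start with, or merely contain, "b".
html = "<html><body><p><b>x</b></p><table><tbody></tbody></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

contains_b = [t.name for t in soup.find_all(re.compile("b"))]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

print(contains_b)     # ['body', 'b', 'table', 'tbody']
print(starts_with_b)  # ['body', 'b']
```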

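9. Example

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.find_all('p', 'course')          # <p> tags whose class is "course"
soup.find_all(id='link1')             # tags whose id is exactly "link1"
soup.find_all(id='link')              # a plain string must match the whole id, so "link" does not match "link1"
soup.find_all(id=re.compile('link'))  # tags whose id contains "link"

Output of the program:
![Screenshot of the program output](https://img-blog.csdnimg.cn/20200512164045619.png)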
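
The same four searches can be tried without network access. This sketch uses made-up markup loosely modeled on a page with two course links; the ids and class names are assumptions chosen to mirror the example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in markup.
html = """
<p class="title">Title</p>
<p class="course">See <a id="link1" href="#1">one</a> and <a id="link2" href="#2">two</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

course_paragraphs = soup.find_all("p", "course")  # second positional argument filters by class
exact = soup.find_all(id="link1")                 # exact string match on id
no_match = soup.find_all(id="link")               # no id is exactly "link"
partial = soup.find_all(id=re.compile("link"))    # regex search: id contains "link"

print(len(course_paragraphs), len(exact), len(no_match), [a["id"] for a in partial])
```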
