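PySpark Classification: NaiveBayes

NaiveBayes: naive Bayes classification

class pyspark.ml.classification.NaiveBayes(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

A naive Bayes classifier. It supports both multinomial and Bernoulli NB. Multinomial NB can handle finitely supported discrete data; for example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB. The input feature values must be non-negative.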
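To make the difference between the two model types concrete, here is a minimal sketch in plain Python (no Spark required) of the textbook per-class log-likelihoods. The function names and the parameter values `theta` and `p` are hypothetical, chosen only for illustration; this is not Spark's internal code.

```python
import math

def multinomial_log_likelihood(x, theta):
    """Sum of x_j * log(theta_j): features with x_j == 0 contribute nothing."""
    return sum(xj * math.log(tj) for xj, tj in zip(x, theta))

def bernoulli_log_likelihood(x, p):
    """Each feature is a 0/1 event: an absent feature contributes log(1 - p_j)."""
    if any(xj not in (0, 1) for xj in x):
        raise ValueError("Bernoulli NB expects 0/1 features")
    return sum(math.log(pj) if xj == 1 else math.log(1.0 - pj)
               for xj, pj in zip(x, p))

# Hypothetical per-class parameters for a 2-feature problem.
theta = [0.4, 0.6]   # multinomial: feature probabilities that sum to 1
p = [0.25, 0.5]      # Bernoulli: independent presence probabilities

x = [0, 1]
multi = multinomial_log_likelihood(x, theta)   # log(0.6)
bern = bernoulli_log_likelihood(x, p)          # log(1 - 0.25) + log(0.5)
print(multi, bern)
```

The point of the comparison is that the Bernoulli model also penalizes features that are absent, which is why it needs strictly binary input.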

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')

modelType = Param(parent='undefined', name='modelType', doc='The model type, a string (case-sensitive). Supported options: multinomial (default) and bernoulli.')

predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')

smoothing = Param(parent='undefined', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0')

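thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")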
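The Spark documentation for thresholds describes the rule as predicting the class with the largest p/t, where p is the class probability and t is its threshold. The sketch below illustrates that rule in plain Python. The function name `predict_with_thresholds`, the probability vector, and the threshold values are hypothetical, and the special case of a single zero threshold is deliberately left out.

```python
def predict_with_thresholds(probability, thresholds):
    """Return the index of the class with the largest p / t.

    Simplification: every threshold is assumed to be > 0; the case of a
    single zero threshold is not handled in this sketch.
    """
    if len(probability) != len(thresholds):
        raise ValueError("need one threshold per class")
    if any(t <= 0 for t in thresholds):
        raise ValueError("this sketch only handles thresholds > 0")
    scaled = [p / t for p, t in zip(probability, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)

# Hypothetical two-class probability vector.
probability = [0.4444, 0.5556]
plain = predict_with_thresholds(probability, [0.5, 0.5])     # equal thresholds: plain argmax
shifted = predict_with_thresholds(probability, [0.3, 0.7])   # lower threshold favors class 0
print(plain, shifted)
```

Lowering a class's threshold relative to the others makes that class easier to predict, which is useful when one kind of error is more costly than the other.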

model.pi: the log of the class prior probabilities

model.theta: the log of the class conditional probabilities
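For the multinomial model with a weight column, the two quantities can be written with the usual additive-smoothing estimates shown below. The notation is an assumption introduced here, not taken from the Spark source: W_c is the total weight of the rows in class c, W is the total weight of all rows, K is the number of classes, D is the number of features, S_{cj} is the weighted sum of feature j over the rows of class c, and \lambda is the smoothing parameter. These formulas agree with the values printed in step 03 of this article.

```latex
\pi_c = \log\frac{W_c + \lambda}{W + K\lambda},
\qquad
\theta_{cj} = \log\frac{S_{cj} + \lambda}{\sum_{j'=1}^{D} S_{cj'} + D\lambda}
```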

01. Create the dataset and inspect its structure

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.driver.host", "192.168.1.10")
         .config("spark.ui.showConsoleProgress", "false")
         .appName("NaiveBayes")
         .master("local[*]")
         .getOrCreate())
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
    Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
df.show()
df.printSchema()

Output:

+---------+-----+------+
| features|label|weight|
+---------+-----+------+
|[0.0,0.0]|  0.0|   0.1|
|[0.0,1.0]|  0.0|   0.5|
|[1.0,0.0]|  1.0|   1.0|
+---------+-----+------+

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- weight: double (nullable = true)

02. Fit the naive Bayes classifier, transform the data, and inspect the result

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
model = nb.fit(df)
model.transform(df).show()
print(model.transform(df).head(3))

Output:

+---------+-----+------+--------------------+--------------------+----------+
| features|label|weight|       rawPrediction|         probability|prediction|
+---------+-----+------+--------------------+--------------------+----------+
|[0.0,0.0]|  0.0|   0.1|[-0.8109302162163...|[0.44444444444444...|       1.0|
|[0.0,1.0]|  0.0|   0.5|[-1.3217558399823...|[0.59016393442622...|       0.0|
|[1.0,0.0]|  1.0|   1.0|[-1.7272209480904...|[0.32432432432432...|       1.0|
+---------+-----+------+--------------------+--------------------+----------+

[Row(features=DenseVector([0.0, 0.0]), label=0.0, weight=0.1, rawPrediction=DenseVector([-0.8109, -0.5878]), probability=DenseVector([0.4444, 0.5556]), prediction=1.0),
 Row(features=DenseVector([0.0, 1.0]), label=0.0, weight=0.5, rawPrediction=DenseVector([-1.3218, -1.6864]), probability=DenseVector([0.5902, 0.4098]), prediction=0.0),
 Row(features=DenseVector([1.0, 0.0]), label=1.0, weight=1.0, rawPrediction=DenseVector([-1.7272, -0.9933]), probability=DenseVector([0.3243, 0.6757]), prediction=1.0)]
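The rawPrediction and probability columns can be checked by hand. The sketch below assumes the multinomial score is the log prior plus the dot product of the log conditional probabilities with the feature vector, and that probability is a softmax of those scores; this assumption matches the numbers printed above. The `pi` and `theta` values are copied from the step 03 output in this article, and the helper names are hypothetical.

```python
import math

# Values copied from the printed model.pi and model.theta in this article.
pi = [-0.8109302162163285, -0.587786664902119]
theta = [[-0.91629073, -0.51082562],
         [-0.40546511, -1.09861229]]

def raw_prediction(x):
    """Per-class score: log prior plus dot product of log conditionals with x."""
    return [pi_c + sum(t * xj for t, xj in zip(theta_c, x))
            for pi_c, theta_c in zip(pi, theta)]

def probability(raw):
    """Normalize the log scores with a numerically stable softmax."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    raw = raw_prediction(x)
    print(x, [round(r, 4) for r in raw], [round(p, 4) for p in probability(raw)])
```

Note that the first row is predicted as class 1 even though its label is 0: an all-zero feature vector contributes nothing to the score, so the prediction falls back on the weighted priors, and class 1 carries more weight.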

03. Inspect the logs of the prior and conditional probabilities

print(model.pi)
print(model.theta)

Output:

[-0.8109302162163285,-0.587786664902119]
DenseMatrix([[-0.91629073, -0.51082562],[-0.40546511, -1.09861229]])
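These numbers can be reproduced from the three training rows without Spark. The sketch below assumes weighted counts with additive smoothing, which is an assumption about the estimate rather than a copy of Spark's code, though it does reproduce the printed values. Variable names are hypothetical.

```python
import math

# (label, weight, features) rows from step 01.
rows = [(0, 0.1, [0.0, 0.0]),
        (0, 0.5, [0.0, 1.0]),
        (1, 1.0, [1.0, 0.0])]
smoothing = 1.0
num_classes, num_features = 2, 2

# Accumulate the total weight per class and the weighted feature sums per class.
class_weight = [0.0] * num_classes
feature_sum = [[0.0] * num_features for _ in range(num_classes)]
for label, weight, features in rows:
    class_weight[label] += weight
    for j, value in enumerate(features):
        feature_sum[label][j] += weight * value

total_weight = sum(class_weight)
# Log priors: smoothed share of the total weight held by each class.
pi = [math.log((w + smoothing) / (total_weight + num_classes * smoothing))
      for w in class_weight]
# Log conditionals: smoothed share of each feature within its class.
theta = [[math.log((s + smoothing) / (sum(row) + num_features * smoothing))
          for s in row]
         for row in feature_sum]
print(pi)
print(theta)
```

Because the row weights enter both sums, changing the weight column changes the priors and the conditionals, not just the evaluation of the model.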
