Spark Text Processing - Article Classification

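Here I again use Spark, both for the data processing and for the algorithms, which are the ones that ship with Spark.
Spark provides term frequency-inverse document frequency (TF-IDF).
It assigns a weight to every term in a document. The weight is based on how often the term occurs in that document, and it is normalized globally by the inverse document frequency. For the derivation of the algorithm, see the introduction in the official documentation.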
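For reference, the weighting described above can be written out as below. This follows the formula given in the Spark MLlib documentation as far as I know; the symbols are my own notation, with |D| the number of documents in the corpus, DF(t, D) the number of documents that contain term t, and TF(t, d) the number of times t occurs in document d.

```latex
\[
\mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d)\cdot\mathrm{IDF}(t, D)
\]
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the feature vector, while rare terms are weighted up.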
Classification is done with NaiveBayes.
Let's look at a sample of the data (leave me a comment if you need the data or the code):
Four score and seven years ago our fathers brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition that all men are created equal.
• Now we are engaged in a great civil war, testing whether that nation, or any nation so
conceived and so dedicated, can long endure. We are met on a great battle-field of that
war. We have come to dedicate a portion of that field, as a final resting place for those who
here gave their lives that that nation might live. It is altogether fitting and proper that we
should do this.
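After processing, each document becomes an array of tokens, which is then hashed into a sparse term-frequency vector of the form (size, indices, values):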
ArrayBuffer(four, score, seven, year, ago, our, father, brought, forth, contin, new, nation,
conceiv, liberti, dedic, proposit, all, men, creat, equal)
• ArrayBuffer(now, we, engag, great, civil, war, test, whether, nation, ani, nation, so, conceiv,
so, dedic, can, long, endur, we, met, great, battl, field, war, we, have, come, dedic, portion,
field, final, rest, place, those, who, here, gave, live, nation, might, live, altogeth, fit, proper,
we, should, do)
(1000, [17,63,94,197,234,335,412,437,445,521,530,556,588,673,799,893,937,960,990],
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
• (1000, [17,21,22,37,63,92,167,211,240,256,270,272,393,395,445,449,
460,472,480,498,535,612,676,688,694,706,724,732,790,909,916,939,960,965,996],
[1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,
1.0,1.0,2.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0])
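To make the step from token arrays to sparse vectors concrete, here is a minimal sketch of the hashing trick in plain Scala. It is only an illustration: the object HashingSketch and its methods are hypothetical names, and it uses String.hashCode, so the indices it produces will not match the ones Spark's HashingTF produced above.

```scala
object HashingSketch {
  // Map a token to a bucket in [0, numFeatures); the modulo is made non-negative.
  def bucket(token: String, numFeatures: Int): Int = {
    val raw = token.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count tokens per bucket and return (size, sorted indices, counts),
  // the same shape as the sparse vectors printed above.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): (Int, Array[Int], Array[Double]) = {
    val counts = tokens
      .groupBy(t => bucket(t, numFeatures))
      .map { case (index, group) => index -> group.size.toDouble }
    val sorted = counts.toSeq.sortBy(_._1)
    (numFeatures, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

Calling HashingSketch.termFrequencies(Seq("war", "field", "war"), 1000) returns a triple whose counts sum to 3.0. Two different tokens can land in the same bucket, which is why the first vector above has 19 indices for 20 distinct tokens.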
Now straight to the code.
Load the data, then apply HashingTF to it:
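import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Each line of a .tokens file is one document stored as comma-separated tokens.
val mock = sc.textFile("/zhouxiaoke/mock.tokens")
val watch = sc.textFile("/zhouxiaoke/watch.tokens")
// Hash every token into one of 10000 term-frequency buckets.
val tf = new HashingTF(10000)
// Documents from mock.tokens get label 1.0, documents from watch.tokens get label 0.0.
val mockData = mock.map { line =>
  LabeledPoint(1.0, tf.transform(line.split(",").toSeq))
}
val watchData = watch.map { line =>
  LabeledPoint(0.0, tf.transform(line.split(",").toSeq))
}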
Apply IDF weighting to the data, then turn it into training and test sets:
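import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.regression.LabeledPoint

// Assumption: this listing does not show how splits and trainDocs are built.
// Here both classes are merged and split 70/30 at random; ratio and seed are illustrative.
val splits = mockData.union(watchData).randomSplit(Array(0.7, 0.3), seed = 42L)
// Fit IDF on the training features only; terms seen in fewer than 3 documents are ignored.
val trainDocs = splits(0).map(_.features)
val idfModel = new IDF(minDocFreq = 3).fit(trainDocs)
// Peek at the weighted training vectors (for inspection only).
val datasee = splits(0).map(point => idfModel.transform(point.features).toArray)
datasee.take(100)
// Re-weight both splits with the fitted IDF model.
val train = splits(0).map { point =>
  LabeledPoint(point.label, idfModel.transform(point.features))
}
val test = splits(1).map { point =>
  LabeledPoint(point.label, idfModel.transform(point.features))
}

// Train Naive Bayes with smoothing lambda = 1.0, then pair each prediction with its true label.
val nbmodel = NaiveBayes.train(train, lambda = 1.0)
val bayesTrain = train.map(p => (nbmodel.predict(p.features), p.label))
val bayesTest = test.map(p => (nbmodel.predict(p.features), p.label))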
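
As a rough guide to what the lambda argument does, the multinomial Naive Bayes model (which, as far as I know, is the MLlib default) can be written as below. The notation is my own: x_i is the TF-IDF weight of feature i in the document, N_{ci} is the total weight of feature i over the training documents of class c, n is the number of features, and \pi_c is the class prior.

```latex
\[
\hat{y}(x) = \arg\max_{c}\Big[\log \pi_c + \sum_{i=1}^{n} x_i \log \theta_{ci}\Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \lambda}{\sum_{j=1}^{n} N_{cj} + \lambda n}
\]
```

With lambda = 1.0 this is Laplace (add-one) smoothing: a feature never seen in class c still gets a small non-zero probability instead of driving the whole score to minus infinity.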

Then check the accuracy and the area under the ROC curve (AuROC):
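import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// bayesTrain and bayesTest hold (prediction, label) pairs. The predictions are hard
// 0/1 labels, so each ROC curve has a single operating point.
val metricsTrain = new BinaryClassificationMetrics(bayesTrain, 100)
val metricsTest = new BinaryClassificationMetrics(bayesTest, 100)
println(s"NB Training AuROC: ${metricsTrain.areaUnderROC()}")
println(s"NB Test AuROC: ${metricsTest.areaUnderROC()}")
// Accuracy: the fraction of test documents whose prediction equals the label.
val testAccuracy = bayesTest.filter { case (pred, label) => pred == label }.count().toDouble / bayesTest.count()
println(s"NB Test accuracy: $testAccuracy")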
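
To show what the AuROC number means, here is a small plain-Scala sketch that computes it from (score, label) pairs as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. AucSketch is a hypothetical helper written for illustration; it is not how BinaryClassificationMetrics computes the value internally, and it compares every pair, so it only suits small samples.

```scala
object AucSketch {
  // AuROC as P(score of a positive > score of a negative), counting ties as 0.5.
  def auroc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
    val neg = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    // Compare every positive score with every negative score.
    val wins = pos.flatMap { p =>
      neg.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0)
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

For example, AucSketch.auroc(Seq((1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0))) evaluates to 0.75: of the four positive/negative pairs, two are ranked correctly and two are ties.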

If you need the code and the data, leave your email address.
