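Spark GBT Algorithm
Gradient-boosted trees (GBT) are a popular classification and regression method built on ensembles of decision trees.
Compared with random forests, GBT often gives better results in practical applications.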
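To make the "ensemble of decision trees" idea concrete, here is a minimal sketch of boosting in plain Scala, without Spark. It is an illustration under simplifying assumptions, not Spark's implementation: it uses a single feature, squared loss (where the negative gradient is just the residual), and one-split stumps instead of full trees. The names `BoostingSketch`, `Stump`, `fitStump` and `fit` are hypothetical, as is the toy data.

```scala
// Minimal gradient boosting with decision stumps on one feature (squared loss).
// All names are hypothetical and unrelated to Spark's internal implementation.
object BoostingSketch {
  // A stump predicts `left` when x <= threshold, otherwise `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Try every distinct x as a threshold and keep the stump with the
  // smallest squared error against the targets.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    candidates.minBy { s =>
      pairs.map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum
    }
  }

  // Each round fits a stump to the current residuals and adds a shrunken
  // copy of it (scaled by stepSize) to the ensemble.
  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, stepSize: Double): Double => Double = {
    val base = mean(ys)
    val stumps = (1 to rounds).foldLeft(Vector.empty[Stump]) { (acc, _) =>
      val residuals = xs.zip(ys).map { case (x, y) =>
        y - (base + stepSize * acc.map(_.predict(x)).sum)
      }
      acc :+ fitStump(xs, residuals)
    }
    x => base + stepSize * stumps.map(_.predict(x)).sum
  }

  def main(args: Array[String]): Unit = {
    // Toy data, invented for illustration only.
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = Seq(1.0, 1.2, 0.9, 3.1, 2.9, 3.0)
    val model = fit(xs, ys, rounds = 25, stepSize = 0.1)
    xs.zip(ys).foreach { case (x, y) =>
      println(f"x=$x%.1f y=$y%.1f predicted=${model(x)}%.3f")
    }
  }
}
```

The `rounds` and `stepSize` arguments play roughly the role of `setMaxIter` and `setStepSize` on Spark's `GBTClassifier`, which additionally grows deeper trees and uses a logistic loss for classification.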
Straight to the code:
package mllib

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
  * Created by dongdong on 17/7/10.
  */
case class Fearture_One(
  cid: String,
  population_gender: String,
  population_age: Double,
  population_registered_gps_city: String,
  population_education_nature: String,
  population_university_level: String,
  sociality_channel_type: String,
  action_registered_channel: String,
  action_this_month_once_week_average_login_count: Double,
  population_censu_city: String,
  population_gps_city: String,
  population_own_cell_city: String,
  population_rank1_cell_city: String,
  population_rank1_cell_cnt: Double,
  population_rank2_cell_city: String,
  population_rank2_cell_cnt: Double,
  population_rank3_cell_city: String,
  population_rank3_cell_cnt: Double,
  population_gps_censu_flag: Double,
  population_own_censu_flag: Double,
  population_gps_own_flag: Double,
  population_own_txl_flag: Double,
  population_gps_txl_flag: Double,
  population_censu_txl_flag: Double,
  population_cnt_7day_province: Double,
  population_cnt_7day_city: Double,
  population_cnt_login: Double,
  population_before_apply_city: String,
  population_after_apply_city: String,
  population_before_in_apply_address: Double,
  population_before_after_apply_address: Double,
  population_in_after_apply_address: Double,
  population_re_address_steady: String,
  population_apply_address_steady: String,
  population_score_fake_gps: Double,
  population_score_fake_contacts: Double,
  text: String,
  flag: String
)

object GBT_Profile {

  def main(args: Array[String]): Unit = {
    val inpath1 = "/Users/ant_git/src/data/user_profile_train/part-00000"

    val spark = SparkSession
      .builder()
      .master("local[3]")
      .appName("GBT_Profile")
      .getOrCreate()
    import spark.implicits._

    // Read the raw text file and convert each line to a typed record.
    // Fields are separated by the \u0001 control character; the literal
    // string \N (a null marker) becomes "N" for strings and "0" for numbers.
    val originalData = spark.sparkContext
      .textFile(inpath1)
      .map(line => {
        val arr = line.split("\u0001")
        val cid = arr(0)
        val population_gender = arr(3).replace("\\N", "N")
        val population_age = arr(4).replace("\\N", "0").toDouble
        val population_registered_gps_city = arr(7).replace("\\N", "N")
        val population_education_nature = arr(10).replace("\\N", "N")
        val population_university_level = arr(11).replace("\\N", "N")
        val sociality_channel_type = arr(13).replace("\\N", "N")
        val action_registered_channel = arr(44).replace("\\N", "N")
        val action_this_month_once_week_average_login_count = arr(54).replace("\\N", "0").toDouble
        val population_censu_city = arr(63).replace("\\N", "N")
        val population_gps_city = arr(64).replace("\\N", "N")
        // val population_jz_city = arr(65).replace("\\N", "N")
        // val population_ip_city = arr(66).replace("\\N", "N")
        val population_own_cell_city = arr(67).replace("\\N", "N")
        val population_rank1_cell_city = arr(68).replace("\\N", "N")
        val population_rank1_cell_cnt = arr(69).replace("\\N", "0").toDouble
        val population_rank2_cell_city = arr(70).replace("\\N", "N")
        val population_rank2_cell_cnt = arr(71).replace("\\N", "0").toDouble
        val population_rank3_cell_city = arr(72).replace("\\N", "N")
        val population_rank3_cell_cnt = arr(73).replace("\\N", "0").toDouble
        // val population_jxl_call_max_city = arr(74).replace("\\N", "N")
        // val population_jxl_call_max_city_cnt = arr(75).replace("\\N", "0").toDouble
        // val population_anzhuo_30day_max_city = arr(76).replace("\\N", "N")
        // val population_anzhuo_30day_max_city_cnt = arr(77).replace("\\N", "0").toDouble
        val population_gps_censu_flag = arr(78).replace("\\N", "0").toDouble
        // val population_gps_jxl_flag = arr(79).replace("\\N", "0").toDouble
        // val population_gps_jz_flag = arr(80).replace("\\N", "0").toDouble
        // val population_ip_censu_flag = arr(81).replace("\\N", "0").toDouble
        // val population_ip_jxl_flag = arr(82).replace("\\N", "0").toDouble
        // val population_ip_jz_flag = arr(83).replace("\\N", "0").toDouble
        val population_own_censu_flag = arr(84).replace("\\N", "0").toDouble
        // val population_own_jxl_flag = arr(85).replace("\\N", "0").toDouble
        // val population_own_jz_flag = arr(86).replace("\\N", "0").toDouble
        val population_gps_own_flag = arr(87).replace("\\N", "0").toDouble
        // val population_gps_ip_flag = arr(88).replace("\\N", "0").toDouble
        // val population_ip_own_flag = arr(89).replace("\\N", "0").toDouble
        // val population_ip_txl_flag = arr(90).replace("\\N", "0").toDouble
        val population_own_txl_flag = arr(91).replace("\\N", "0").toDouble
        val population_gps_txl_flag = arr(92).replace("\\N", "0").toDouble
        val population_censu_txl_flag = arr(93).replace("\\N", "0").toDouble
        // val population_jxl_txl_flag = arr(94).replace("\\N", "0").toDouble
        // val population_jz_txl_flag = arr(95).replace("\\N", "0").toDouble
        val population_cnt_7day_province = arr(96).replace("\\N", "0").toDouble
        val population_cnt_7day_city = arr(97).replace("\\N", "0").toDouble
        val population_cnt_login = arr(102).replace("\\N", "0").toDouble
        val population_before_apply_city = arr(107).replace("\\N", "N")
        val population_after_apply_city = arr(108).replace("\\N", "N")
        val population_before_in_apply_address = arr(111).replace("\\N", "0").toDouble
        val population_before_after_apply_address = arr(112).replace("\\N", "0").toDouble
        val population_in_after_apply_address = arr(113).replace("\\N", "0").toDouble
        val population_re_address_steady = arr(116).replace("\\N", "N")
        val population_apply_address_steady = arr(117).replace("\\N", "N")
        val population_score_fake_gps = arr(127).replace("\\N", "0").toDouble
        val population_score_fake_contacts = arr(128).replace("\\N", "0").toDouble

        // Join all categorical fields into one "|"-separated string so they
        // can be tokenized and embedded with Word2Vec below.
        val text = population_gender + "|" +
          population_registered_gps_city + "|" +
          population_education_nature + "|" +
          population_university_level + "|" +
          sociality_channel_type + "|" +
          action_registered_channel + "|" +
          population_censu_city + "|" +
          population_gps_city + "|" +
          population_own_cell_city + "|" +
          population_rank1_cell_city + "|" +
          population_rank2_cell_city + "|" +
          population_rank3_cell_city + "|" +
          population_before_apply_city + "|" +
          population_after_apply_city + "|" +
          population_re_address_steady + "|" +
          population_apply_address_steady

        val flag = arr(141)

        Fearture_One(
          cid,
          population_gender,
          population_age,
          population_registered_gps_city,
          population_education_nature,
          population_university_level,
          sociality_channel_type,
          action_registered_channel,
          action_this_month_once_week_average_login_count,
          population_censu_city,
          population_gps_city,
          population_own_cell_city,
          population_rank1_cell_city,
          population_rank1_cell_cnt,
          population_rank2_cell_city,
          population_rank2_cell_cnt,
          population_rank3_cell_city,
          population_rank3_cell_cnt,
          population_gps_censu_flag,
          population_own_censu_flag,
          population_gps_own_flag,
          population_own_txl_flag,
          population_gps_txl_flag,
          population_censu_txl_flag,
          population_cnt_7day_province,
          population_cnt_7day_city,
          population_cnt_login,
          population_before_apply_city,
          population_after_apply_city,
          population_before_in_apply_address,
          population_before_after_apply_address,
          population_in_after_apply_address,
          population_re_address_steady,
          population_apply_address_steady,
          population_score_fake_gps,
          population_score_fake_contacts,
          text,
          flag
        )
      }).toDS

    // Index the string label column.
    val labelIndexer = new StringIndexer()
      .setInputCol("flag")
      .setOutputCol("indexedLabel")
      .fit(originalData)

    // Split the text column into tokens on "|".
    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setPattern("\\|")

    // Turn the tokens into a dense vector.
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("word2feature")
      .setVectorSize(100)
      //.setMinCount(1)
      .setMaxIter(10)

    // Numeric columns plus the Word2Vec output.
    val arr = Array(
      "population_age",
      "action_this_month_once_week_average_login_count",
      "population_rank1_cell_cnt",
      "population_rank2_cell_cnt",
      "population_rank3_cell_cnt",
      "population_gps_censu_flag",
      "population_own_censu_flag",
      "population_gps_own_flag",
      "population_own_txl_flag",
      "population_gps_txl_flag",
      "population_censu_txl_flag",
      "population_cnt_7day_province",
      "population_cnt_7day_city",
      "population_cnt_login",
      "population_before_in_apply_address",
      "population_before_after_apply_address",
      "population_in_after_apply_address",
      "population_score_fake_gps",
      "population_score_fake_contacts",
      "word2feature"
    )

    // Merge the columns into a single feature vector.
    val vectorAssembler = new VectorAssembler()
      .setInputCols(arr)
      .setOutputCol("assemblerVector")

    // Create the GBT classifier.
    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("assemblerVector")
      // number of boosting iterations
      .setMaxIter(25)
      // maximum tree depth
      .setMaxDepth(5)

    // Map predicted indices back to the original label strings.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val Array(trainingData, testData) = originalData.randomSplit(Array(0.8, 0.2))

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, tokenizer, word2Vec, vectorAssembler, gbt, labelConverter))

    // Fit on the training split only. The original post fitted on
    // originalData, which lets the model see the test rows and makes the
    // reported test error optimistic.
    val model = pipeline.fit(trainingData)

    val predictionResultDF = model.transform(testData)
    predictionResultDF.show(false)

    // Rows whose true label is 1.
    val label_1 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === 1)
      .count()

    // Correctly predicted rows of class 1.
    val correct_1 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === $"predictedLabel")
      .filter($"predictedLabel" === 1)
      .count()

    // Correctly predicted rows of class 0.
    val correct_0 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === $"predictedLabel")
      .filter($"predictedLabel" === 0)
      .count()

    // Write the ids predicted as class 1 to a single CSV file.
    predictionResultDF
      .select("cid", "predictedLabel")
      .filter($"predictedLabel" === 1)
      .repartition(1)
      .write
      .format("csv")
      .save("/Users/ant_git/Antifraud/src/data/predict/")

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracy = evaluator.evaluate(predictionResultDF)
    val error = 1.0 - accuracy
    println("Test Error = " + error)

    spark.stop()
  }
}
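The listing counts `label_1` (rows whose `flag` is 1) and `correct_1` (rows correctly predicted as 1) but only reports accuracy. If class 1 is rare, accuracy alone can look good while most positives are missed, so it may be worth turning those counts into precision and recall. The following is a minimal sketch in plain Scala; `ConfusionMetrics`, `Metrics` and `compute` are hypothetical names, and the numbers in `main` are made up for illustration, not results from this post.

```scala
// Precision, recall and F1 for one class, derived from three row counts.
// Hypothetical helper; not part of Spark or of the original listing.
object ConfusionMetrics {
  final case class Metrics(precision: Double, recall: Double, f1: Double)

  // truePositives:      rows where flag and predictedLabel are both 1 (correct_1)
  // actualPositives:    rows whose flag is 1 (label_1)
  // predictedPositives: rows whose predictedLabel is 1
  def compute(truePositives: Long, actualPositives: Long, predictedPositives: Long): Metrics = {
    val precision =
      if (predictedPositives == 0) 0.0 else truePositives.toDouble / predictedPositives
    val recall =
      if (actualPositives == 0) 0.0 else truePositives.toDouble / actualPositives
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Metrics(precision, recall, f1)
  }

  def main(args: Array[String]): Unit = {
    // Made-up counts for illustration only.
    val m = compute(truePositives = 30L, actualPositives = 40L, predictedPositives = 50L)
    println(f"precision=${m.precision}%.2f recall=${m.recall}%.2f f1=${m.f1}%.2f")
  }
}
```

In the listing, the third count would be the number of rows with `predictedLabel` equal to 1, which is the same filter used before the CSV is written.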
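Summary: the algorithm itself is already packaged by someone else; what matters most is how the features are processed. With good features, even a very simple algorithm can classify well; with poor features, even the best model will struggle to give good results. Feature selection is therefore a crucial part of machine learning.
Reposted from: https://my.oschina.net/u/3455048/blog/1358310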