Gradient-boosted trees (GBT) are a popular classification and regression method built on ensembles of decision trees.

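Compared with Random Forest, GBT often gives better results in practical applications, although the gap depends on the data and on tuning.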
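To make "ensembles of decision trees" a bit more concrete, here is a minimal sketch of the boosting idea in plain Scala: each round fits a tiny tree (a one-split stump) to the residuals left by the previous rounds and adds a damped copy of it to the model. This is a toy for squared-error regression on a single feature, not how Spark implements GBT; the names `BoostingSketch`, `Stump`, `fitStump` and `fit` are made up for this illustration.

```scala
// Toy gradient boosting with depth-1 trees (stumps) and squared-error loss.
// Illustrative only; all names here are hypothetical, not Spark APIs.
object BoostingSketch {
  // A stump predicts `left` when x <= threshold, otherwise `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(values: Seq[Double]): Double =
    if (values.isEmpty) 0.0 else values.sum / values.size

  // Try every observed x as a threshold and keep the one with the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val candidates = xs.distinct.map { t =>
      val (l, r) = xs.zip(targets).partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    candidates.minBy { s =>
      xs.zip(targets).map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum
    }
  }

  // Each round fits a stump to the current residuals and adds it to the ensemble.
  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, learningRate: Double): Double => Double = {
    val base = mean(ys)
    val stumps = (1 to rounds).foldLeft(List.empty[Stump]) { (acc, _) =>
      def current(x: Double): Double = base + learningRate * acc.map(_.predict(x)).sum
      val residuals = xs.zip(ys).map { case (x, y) => y - current(x) }
      fitStump(xs, residuals) :: acc
    }
    x => base + learningRate * stumps.map(_.predict(x)).sum
  }

  // Mean squared error of a fitted model on the given points.
  def mse(model: Double => Double, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - model(x), 2) })
}
```

Calling `BoostingSketch.fit(xs, ys, rounds = 10, learningRate = 0.5)` returns a plain function from `Double` to `Double`; more rounds shrink the training error, which is roughly the role `setMaxIter` plays for `GBTClassifier`.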

Straight to the code:

package mllib

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
 * Created by dongdong on 17/7/10.
 */
case class Fearture_One(
  cid: String,
  population_gender: String,
  population_age: Double,
  population_registered_gps_city: String,
  population_education_nature: String,
  population_university_level: String,
  sociality_channel_type: String,
  action_registered_channel: String,
  action_this_month_once_week_average_login_count: Double,
  population_censu_city: String,
  population_gps_city: String,
  population_own_cell_city: String,
  population_rank1_cell_city: String,
  population_rank1_cell_cnt: Double,
  population_rank2_cell_city: String,
  population_rank2_cell_cnt: Double,
  population_rank3_cell_city: String,
  population_rank3_cell_cnt: Double,
  population_gps_censu_flag: Double,
  population_own_censu_flag: Double,
  population_gps_own_flag: Double,
  population_own_txl_flag: Double,
  population_gps_txl_flag: Double,
  population_censu_txl_flag: Double,
  population_cnt_7day_province: Double,
  population_cnt_7day_city: Double,
  population_cnt_login: Double,
  population_before_apply_city: String,
  population_after_apply_city: String,
  population_before_in_apply_address: Double,
  population_before_after_apply_address: Double,
  population_in_after_apply_address: Double,
  population_re_address_steady: String,
  population_apply_address_steady: String,
  population_score_fake_gps: Double,
  population_score_fake_contacts: Double,
  text: String,
  flag: String
)

object GBT_Profile {

  def main(args: Array[String]): Unit = {

    val inpath1 = "/Users/ant_git/src/data/user_profile_train/part-00000"

    val spark = SparkSession
      .builder()
      .master("local[3]")
      .appName("GBT_Profile")
      .getOrCreate()

    import spark.implicits._

    // read the data and convert it to a Dataset
    val originalData = spark.sparkContext
      .textFile(inpath1)
      .map(line => {
        // fields are separated by the Ctrl-A character (U+0001);
        // a literal \N marks a missing value and is replaced by "N" (text) or "0" (numeric)
        val arr = line.split("\u0001")
        val cid = arr(0)
        val population_gender = arr(3).replace("\\N", "N")
        val population_age = arr(4).replace("\\N", "0").toDouble
        val population_registered_gps_city = arr(7).replace("\\N", "N")
        val population_education_nature = arr(10).replace("\\N", "N")
        val population_university_level = arr(11).replace("\\N", "N")
        val sociality_channel_type = arr(13).replace("\\N", "N")
        val action_registered_channel = arr(44).replace("\\N", "N")
        val action_this_month_once_week_average_login_count = arr(54).replace("\\N", "0").toDouble
        val population_censu_city = arr(63).replace("\\N", "N")
        val population_gps_city = arr(64).replace("\\N", "N")
        // val population_jz_city = arr(65).replace("\\N", "N")
        // val population_ip_city = arr(66).replace("\\N", "N")
        val population_own_cell_city = arr(67).replace("\\N", "N")
        val population_rank1_cell_city = arr(68).replace("\\N", "N")
        val population_rank1_cell_cnt = arr(69).replace("\\N", "0").toDouble
        val population_rank2_cell_city = arr(70).replace("\\N", "N")
        val population_rank2_cell_cnt = arr(71).replace("\\N", "0").toDouble
        val population_rank3_cell_city = arr(72).replace("\\N", "N")
        val population_rank3_cell_cnt = arr(73).replace("\\N", "0").toDouble
        // val population_jxl_call_max_city = arr(74).replace("\\N", "N")
        // val population_jxl_call_max_city_cnt = arr(75).replace("\\N", "0").toDouble
        // val population_anzhuo_30day_max_city = arr(76).replace("\\N", "N")
        // val population_anzhuo_30day_max_city_cnt = arr(77).replace("\\N", "0").toDouble
        val population_gps_censu_flag = arr(78).replace("\\N", "0").toDouble
        // val population_gps_jxl_flag = arr(79).replace("\\N", "0").toDouble
        // val population_gps_jz_flag = arr(80).replace("\\N", "0").toDouble
        // val population_ip_censu_flag = arr(81).replace("\\N", "0").toDouble
        // val population_ip_jxl_flag = arr(82).replace("\\N", "0").toDouble
        // val population_ip_jz_flag = arr(83).replace("\\N", "0").toDouble
        val population_own_censu_flag = arr(84).replace("\\N", "0").toDouble
        // val population_own_jxl_flag = arr(85).replace("\\N", "0").toDouble
        // val population_own_jz_flag = arr(86).replace("\\N", "0").toDouble
        val population_gps_own_flag = arr(87).replace("\\N", "0").toDouble
        // val population_gps_ip_flag = arr(88).replace("\\N", "0").toDouble
        // val population_ip_own_flag = arr(89).replace("\\N", "0").toDouble
        // val population_ip_txl_flag = arr(90).replace("\\N", "0").toDouble
        val population_own_txl_flag = arr(91).replace("\\N", "0").toDouble
        val population_gps_txl_flag = arr(92).replace("\\N", "0").toDouble
        val population_censu_txl_flag = arr(93).replace("\\N", "0").toDouble
        // val population_jxl_txl_flag = arr(94).replace("\\N", "0").toDouble
        // val population_jz_txl_flag = arr(95).replace("\\N", "0").toDouble
        val population_cnt_7day_province = arr(96).replace("\\N", "0").toDouble
        val population_cnt_7day_city = arr(97).replace("\\N", "0").toDouble
        val population_cnt_login = arr(102).replace("\\N", "0").toDouble
        val population_before_apply_city = arr(107).replace("\\N", "N")
        val population_after_apply_city = arr(108).replace("\\N", "N")
        val population_before_in_apply_address = arr(111).replace("\\N", "0").toDouble
        val population_before_after_apply_address = arr(112).replace("\\N", "0").toDouble
        val population_in_after_apply_address = arr(113).replace("\\N", "0").toDouble
        val population_re_address_steady = arr(116).replace("\\N", "N")
        val population_apply_address_steady = arr(117).replace("\\N", "N")
        val population_score_fake_gps = arr(127).replace("\\N", "0").toDouble
        val population_score_fake_contacts = arr(128).replace("\\N", "0").toDouble

        // join all categorical fields into one "|"-separated string for Word2Vec
        val text = Seq(
          population_gender,
          population_registered_gps_city,
          population_education_nature,
          population_university_level,
          sociality_channel_type,
          action_registered_channel,
          population_censu_city,
          population_gps_city,
          population_own_cell_city,
          population_rank1_cell_city,
          population_rank2_cell_city,
          population_rank3_cell_city,
          population_before_apply_city,
          population_after_apply_city,
          population_re_address_steady,
          population_apply_address_steady
        ).mkString("|")

        val flag = arr(141)

        Fearture_One(
          cid,
          population_gender,
          population_age,
          population_registered_gps_city,
          population_education_nature,
          population_university_level,
          sociality_channel_type,
          action_registered_channel,
          action_this_month_once_week_average_login_count,
          population_censu_city,
          population_gps_city,
          population_own_cell_city,
          population_rank1_cell_city,
          population_rank1_cell_cnt,
          population_rank2_cell_city,
          population_rank2_cell_cnt,
          population_rank3_cell_city,
          population_rank3_cell_cnt,
          population_gps_censu_flag,
          population_own_censu_flag,
          population_gps_own_flag,
          population_own_txl_flag,
          population_gps_txl_flag,
          population_censu_txl_flag,
          population_cnt_7day_province,
          population_cnt_7day_city,
          population_cnt_login,
          population_before_apply_city,
          population_after_apply_city,
          population_before_in_apply_address,
          population_before_after_apply_address,
          population_in_after_apply_address,
          population_re_address_steady,
          population_apply_address_steady,
          population_score_fake_gps,
          population_score_fake_contacts,
          text,
          flag
        )
      }).toDS

    // index the label column
    val labelIndexer = new StringIndexer()
      .setInputCol("flag")
      .setOutputCol("indexedLabel")
      .fit(originalData)

    // split the text into words
    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setPattern("\\|")

    // words to vector
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("word2feature")
      .setVectorSize(100)
      //.setMinCount(1)
      .setMaxIter(10)

    // numeric columns plus the Word2Vec output
    val arr = Array(
      "population_age",
      "action_this_month_once_week_average_login_count",
      "population_rank1_cell_cnt",
      "population_rank2_cell_cnt",
      "population_rank3_cell_cnt",
      "population_gps_censu_flag",
      "population_own_censu_flag",
      "population_gps_own_flag",
      "population_own_txl_flag",
      "population_gps_txl_flag",
      "population_censu_txl_flag",
      "population_cnt_7day_province",
      "population_cnt_7day_city",
      "population_cnt_login",
      "population_before_in_apply_address",
      "population_before_after_apply_address",
      "population_in_after_apply_address",
      "population_score_fake_gps",
      "population_score_fake_contacts",
      "word2feature"
    )

    // merge the columns into a single feature vector
    val vectorAssembler = new VectorAssembler()
      .setInputCols(arr)
      .setOutputCol("assemblerVector")

    // create the GBT classifier
    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("assemblerVector")
      // number of boosting iterations
      .setMaxIter(25)
      // maximum tree depth
      .setMaxDepth(5)

    // map predicted indices back to the original labels
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val Array(trainingData, testData) = originalData.randomSplit(Array(0.8, 0.2))

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, tokenizer, word2Vec, vectorAssembler, gbt, labelConverter))

    // fit on the training split only, so the test split stays unseen
    val model = pipeline.fit(trainingData)

    val predictionResultDF = model.transform(testData)

    predictionResultDF.show(false)

    // rows whose true label is 1
    val label_1 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === 1)
      .count()

    // correctly predicted 1s
    val correct_1 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === $"predictedLabel")
      .filter($"predictedLabel" === 1)
      .count()

    // correctly predicted 0s
    val correct_0 = predictionResultDF
      .select("cid", "flag", "predictedLabel")
      .filter($"flag" === $"predictedLabel")
      .filter($"predictedLabel" === 0)
      .count()

    println(s"label_1 = $label_1, correct_1 = $correct_1, correct_0 = $correct_0")

    // save the ids predicted as 1
    predictionResultDF
      .select("cid", "predictedLabel")
      .filter($"predictedLabel" === 1)
      .repartition(1)
      .write.format("csv")
      .save("/Users/ant_git/Antifraud/src/data/predict/")

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracy = evaluator.evaluate(predictionResultDF)
    val error = 1.0 - accuracy
    println("Test Error = " + error)

    spark.stop()
  }
}
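The listing counts `label_1` and `correct_1` but only prints the accuracy-based test error. If the positive class is rare, accuracy alone can look good while most positives are missed, so precision and recall for label "1" are worth a look. Below is a small sketch in plain Scala that works on collected `(flag, predictedLabel)` pairs; `BinaryMetrics`, `Report` and `report` are hypothetical names for this example, not Spark classes.

```scala
// Precision and recall for one class, computed from (actual, predicted) label pairs.
// A sketch with hypothetical names; it does not depend on Spark.
object BinaryMetrics {
  final case class Report(truePositives: Long, predictedPositives: Long, actualPositives: Long) {
    // Share of predicted positives that are really positive.
    def precision: Double =
      if (predictedPositives == 0) 0.0 else truePositives.toDouble / predictedPositives
    // Share of real positives that the model found.
    def recall: Double =
      if (actualPositives == 0) 0.0 else truePositives.toDouble / actualPositives
  }

  def report(pairs: Seq[(String, String)], positive: String = "1"): Report = {
    val tp = pairs.count { case (actual, predicted) => actual == positive && predicted == positive }
    val pp = pairs.count(_._2 == positive)
    val ap = pairs.count(_._1 == positive)
    Report(tp, pp, ap)
  }
}
```

In terms of the listing, `truePositives` corresponds to `correct_1` and `actualPositives` to `label_1`; the number of predicted positives is not counted there and would have to be added.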

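Summary: the algorithm itself is already packaged by someone else; what matters most is how the features are processed. With good features, even a very simple algorithm can classify well; with poor features, even the best model struggles to give good results. Feature selection is therefore critical in machine learning.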
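One concrete piece of feature processing in the code above is the cleanup of missing values: every field goes through `replace("\\N", ...)` and numeric fields are then parsed with `toDouble`, which throws on anything that is not a number. A slightly more defensive variant could look like the sketch below. It assumes `\N` is the only null marker in the file (the `\001` delimiter and `\N` marker match Hive's default text format, but that is a guess about where the data came from), and `FieldCleaning`, `numeric` and `categorical` are names invented for this example. Unlike the original `replace`, it only treats a field as missing when the whole field equals `\N`.

```scala
import scala.util.Try

// Defensive parsing of raw text fields; a sketch with hypothetical names.
object FieldCleaning {
  // Assumption: the two characters \N are the only marker for a missing value.
  val NullMarker = "\\N"

  // Numeric field: fall back to `default` when missing or not parseable.
  def numeric(raw: String, default: Double = 0.0): Double =
    if (raw == NullMarker) default else Try(raw.trim.toDouble).getOrElse(default)

  // Categorical field: map missing or blank values to an explicit placeholder token.
  def categorical(raw: String, placeholder: String = "N"): String =
    if (raw == NullMarker || raw.trim.isEmpty) placeholder else raw.trim
}
```

With these helpers a line such as `arr(4).replace("\\N", "0").toDouble` would become `FieldCleaning.numeric(arr(4))`.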

Reposted from: https://my.oschina.net/u/3455048/blog/1358310
