LightGBM Tutorial

  • Training lightgbm in R
    • Data input
    • Training, model evaluation, and prediction
    • Visualizing lightgbm for regression and classification
    • Multiclass classification
    • Sample weights

Guide to using lightgbm in Python, with examples

The link above covers Python usage notes and worked examples, so that machine-learning newcomers can get up to speed quickly; it is not repeated here.
The rest of this post walks through using lightgbm in R in detail.
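
Before following along, the package has to be installed and loaded; a minimal setup sketch, assuming the CRAN release of lightgbm:

install.packages("lightgbm") # CRAN release; building from the GitHub sources also works
library(lightgbm)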

Training lightgbm in R

Data input

Training data can be supplied in three ways: as a sparseMatrix, a dense matrix, or an lgb.Dataset.

sparseMatrix:

> data(agaricus.train, package = "lightgbm")
> data(agaricus.test, package = "lightgbm")
> train <- agaricus.train
> test <- agaricus.test
> print("Training lightgbm with sparseMatrix")
[1] "Training lightgbm with sparseMatrix"
> bst <- lightgbm(
+     data = train$data
+     , label = train$label
+     , num_leaves = 4L
+     , learning_rate = 1.0
+     , nrounds = 2L
+     , objective = "binary"
+ )

dense matrix:

> print("Training lightgbm with Matrix")
[1] "Training lightgbm with Matrix"
> bst <- lightgbm(
+     data = as.matrix(train$data)
+     , label = train$label
+     , num_leaves = 4L
+     , learning_rate = 1.0
+     , nrounds = 2L
+     , objective = "binary"
+ )

lgb.Dataset:

> dtrain <- lgb.Dataset(
+     data = train$data
+     , label = train$label
+ )
> bst <- lightgbm(
+     data = dtrain
+     , num_leaves = 4L
+     , learning_rate = 1.0
+     , nrounds = 2L
+     , objective = "binary"
+ )

Those are the three ways of feeding data into lightgbm. The amount of information lightgbm prints while training can likewise be set at three levels, shown below.
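
The level is controlled through the verbose argument of the training call; a minimal sketch (the level-to-output mapping in the comment is inferred from the logs that follow):

bst <- lightgbm(
    data = train$data
    , label = train$label
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
    , verbose = 0L # 0 = warnings only, 1 = adds info messages, 2 = adds debug output
)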

verbose = 0:


[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005476 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.

verbose = 1:

[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004427 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
[LightGBM] [Info] Start training from score -0.071580
[1]:    train's binary_logloss:0.198597
[2]:    train's binary_logloss:0.111535 

verbose = 2:


[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.930600
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.433362
[LightGBM] [Debug] init for col-wise cost 0.002494 seconds, init for row-wise cost 0.002800 seconds
[LightGBM] [Debug] col-wise cost 0.000429 seconds, row-wise cost 0.000184 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003106 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
[LightGBM] [Info] Start training from score -0.071580
[LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
[1]:    train's binary_logloss:0.198597
[LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
[2]:    train's binary_logloss:0.111535 

Training, model evaluation, and prediction

Training involves manual parameter tuning together with visualization. Individual parameter sweeps are not shown here; only the general form of the tuning setup is demonstrated.

dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE) # training set
dtest <- lgb.Dataset.create.valid(dtrain, data = test$data, label = test$label) # test set
valids <- list(train = dtrain, test = dtest) # named list of training and test sets

Note: lgb.train can evaluate the training and test sets at the same time, and more than one evaluation metric can be set.

For the tunable evaluation parameters, see the full lightgbm parameter reference.
The metrics shown below are the binary log loss, the binary error rate, and AUC.

bst <- lgb.train(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , valids = valids
    , eval = c("binary_error", "binary_logloss", "auc") # binary error rate, log loss, AUC
    , nthread = 2L
    , objective = "binary"
)

Accessing the stored evaluation results for the training and test sets:

bst$record_evals[["train"]] # nested list of per-iteration metric values
bst$record_evals[["test"]]
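
The nested list can be awkward to navigate; the package also offers lgb.get.eval.result to pull a single metric as a plain numeric vector (a sketch, assuming the metric names recorded above):

lgb.get.eval.result(bst, data_name = "test", eval_name = "binary_logloss")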

Cross-validation with the lgb.cv function:

nrounds <- 2L
param <- list(
    num_leaves = 4L
    , learning_rate = 1.0
    , objective = "binary"
)
lgb.cv(
    param
    , dtrain
    , nrounds
    , nfold = 5L
    , eval = "binary_error" # evaluation metric for cross-validation
)
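
lgb.cv returns a CVBooster object whose aggregated results can be inspected; a sketch, with field names assumed from the returned object:

cv_bst <- lgb.cv(param, dtrain, nrounds, nfold = 5L, eval = "binary_error")
cv_bst$record_evals$valid$binary_error$eval # mean metric per iteration across folds
cv_bst$best_iter # best iteration when early stopping is enabled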

Both the objective function and the evaluation function in lgb.cv can be user-defined:

logregobj <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label") # read the labels stored in dtrain
    preds <- 1.0 / (1.0 + exp(-preds))
    grad <- preds - labels
    hess <- preds * (1.0 - preds)
    return(list(grad = grad, hess = hess))
}
evalerror <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label")
    preds <- 1.0 / (1.0 + exp(-preds))
    err <- as.numeric(sum(labels != (preds > 0.5))) / length(labels)
    return(list(name = "error", value = err, higher_better = FALSE))
}
lgb.cv(
    params = param
    , data = dtrain
    , nrounds = nrounds
    , obj = logregobj
    , eval = evalerror
    , nfold = 5L
)

Note: the obj (objective) and eval (evaluation) arguments of lgb.train can also be user-defined, following the same pattern as the code above.
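
For concreteness, a sketch of passing the same custom functions to lgb.train, assuming its obj and eval arguments mirror those of lgb.cv:

bst_custom <- lgb.train(
    params = param
    , data = dtrain
    , nrounds = nrounds
    , obj = logregobj # custom objective: returns gradient and hessian
    , eval = evalerror # custom metric: returns name, value, higher_better
    , valids = list(test = dtest)
)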

If you run into a bug here, either give this feature a miss or reinstall the patched package from GitHub:

git clone git@github.com:jameslamb/LightGBM.git -b fix/custom-eva
cd LightGBM
Rscript build_r.R

Model prediction:

> ptrain <- predict(bst, agaricus.train$data, rawscore = TRUE) # raw scores (log odds)
> ptrain1 <- predict(bst, agaricus.train$data) # probability of the positive class
> head(ptrain)
[1]  0.9943657 -2.6300356 -2.6300356  0.9943657 -2.7839488 -2.6300356
> head(ptrain1)
[1] 0.72994936 0.06723022 0.06723022 0.72994936 0.05819774 0.06723022
> head(1 - ptrain1)
[1] 0.2700506 0.9327698 0.9327698 0.2700506 0.9418023 0.9327698
> log(0.72994936 / 0.2700506)
[1] 0.9943658
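
The raw score and the probability are linked by the logistic function; a one-line check of that relationship:

all.equal(1.0 / (1.0 + exp(-ptrain)), ptrain1) # sigmoid of the raw score should give the probability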

Visualizing lightgbm for regression and classification

The data are again the datasets bundled with lightgbm: agaricus.train and agaricus.test.

> .diverging_palette <- c(
+   "#A50026", "#D73027", "#F46D43", "#FDAE61", "#FEE08B"
+   , "#D9EF8B", "#A6D96A", "#66BD63", "#1A9850", "#006837"
+ )
>
> .prediction_depth_plot <- function(df) {
+   plot(
+     x = df$X
+     , y = df$Y
+     , type = "p"
+     , main = "Prediction Depth"
+     , xlab = "Leaf Bin"
+     , ylab = "Prediction Probability"
+     , pch = 19L
+     , col = .diverging_palette[df$binned + 1L]
+   )
+   legend(
+     "topright"
+     , title = "bin"
+     , legend = sort(unique(df$binned))
+     , pch = 19L
+     , col = .diverging_palette[sort(unique(df$binned + 1L))]
+     , cex = 0.7
+   )
+ }
>
> .prediction_depth_spread_plot <- function(df) {
+   plot(
+     x = df$binned
+     , xlim = c(0L, 9L)
+     , y = df$Z
+     , type = "p"
+     , main = "Prediction Depth Spread"
+     , xlab = "Leaf Bin"
+     , ylab = "Logloss"
+     , pch = 19L
+     , col = .diverging_palette[df$binned + 1L]
+   )
+   legend(
+     "topright"
+     , title = "bin"
+     , legend = sort(unique(df$binned))
+     , pch = 19L
+     , col = .diverging_palette[sort(unique(df$binned + 1L))]
+     , cex = 0.7
+   )
+ }
>
> .depth_density_plot <- function(df) {
+   plot(
+     x = density(df$Y)
+     , xlim = c(min(df$Y), max(df$Y))
+     , type = "p"
+     , main = "Depth Density"
+     , xlab = "Prediction Probability"
+     , ylab = "Bin Density"
+     , pch = 19L
+     , col = .diverging_palette[df$binned + 1L]
+   )
+   legend(
+     "topright"
+     , title = "bin"
+     , legend = sort(unique(df$binned))
+     , pch = 19L
+     , col = .diverging_palette[sort(unique(df$binned + 1L))]
+     , cex = 0.7
+   )
+ }

Note: these plots visualize, for each sample, the relationship between its leaf-index bin and its predicted value and log loss.

params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- lgb.train(
    params
    , dtrain
    , 50L
    , valids
    , min_data = 1L
    , learning_rate = 0.1
    , bagging_fraction = 0.1
    , bagging_freq = 1L
    , bagging_seed = 1L
)
new_data <- data.frame(
    X = rowMeans(predict(model, agaricus.test$data, predleaf = TRUE)) # mean index of the predicted leaf across trees
    , Y = pmin(pmax(predict(model, agaricus.test$data), 1e-15), 1.0 - 1e-15) # predictions, clipped away from 0 and 1
)
new_data$Z <- -1.0 * (agaricus.test$label * log(new_data$Y) + (1L - agaricus.test$label) * log(1L - new_data$Y)) # log loss
new_data$binned <- .bincode(
    x = new_data$X
    , breaks = quantile(x = new_data$X, probs = seq_len(9L) / 10.0)
    , right = TRUE
    , include.lowest = TRUE
)
new_data$binned[is.na(new_data$binned)] <- 0L
.prediction_depth_plot(df = new_data)
.prediction_depth_spread_plot(df = new_data)
.depth_density_plot(df = new_data)



Multiclass classification

Note: classifying the iris dataset with lightgbm.

> iris$Species <- as.numeric(as.factor(iris$Species)) - 1L
> train <- as.matrix(iris[c(1L:40L, 51L:90L, 101L:140L), ])
> test <- as.matrix(iris[c(41L:50L, 91L:100L, 141L:150L), ])
> dtrain <- lgb.Dataset(data = train[, 1L:4L], label = train[, 5L])
> dtest <- lgb.Dataset.create.valid(dtrain, data = test[, 1L:4L], label = test[, 5L])
> valids <- list(test = dtest)
>
>
> params <- list(objective = "multiclass", metric = "multi_error", num_class = 3L)
> model <- lgb.train(
+     params
+     , dtrain
+     , 100L
+     , valids
+     , min_data = 1L
+     , learning_rate = 1.0
+     , early_stopping_rounds = 10L
+ )
> my_preds <- predict(model, test[, 1L:4L]) # returns a flat vector
> my_preds <- predict(model, test[, 1L:4L], reshape = TRUE) # num_data x num_class matrix
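
To turn the probability matrix into class labels and check accuracy, a short sketch using only base R:

pred_class <- max.col(my_preds) - 1L # column with the largest probability, shifted to the 0-based labels
mean(pred_class == test[, 5L]) # test-set accuracy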

Sample weights

> weights1 <- rep(1.0 / 100000.0, 6513L)
> weights2 <- rep(1.0 / 100000.0, 1611L)
> data(agaricus.train, package = "lightgbm")
> train <- agaricus.train
> dtrain <- lgb.Dataset(train$data, label = train$label, weight = weights1)
> data(agaricus.test, package = "lightgbm")
> test <- agaricus.test
> dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label, weight = weights2)
> valids <- list(test = dtest)
> params <- list(
+     objective = "regression"
+     , metric = "l2"
+     , device = "cpu"
+     , min_sum_hessian = 10.0
+     , num_leaves = 7L
+     , max_depth = 3L
+     , nthread = 1L
+ )
> model <- lgb.train(
+     params
+     , dtrain
+     , 50L
+     , valids
+     , min_data = 1L
+     , learning_rate = 1.0
+     , early_stopping_rounds = 10L
+ )
> weight_loss <- as.numeric(model$record_evals$test$l2$eval)
>
> plot(weight_loss) # Shows how poor the learning was: a straight line!
> params <- list(
+     objective = "regression"
+     , metric = "l2"
+     , device = "cpu"
+     , min_sum_hessian = 1e-4
+     , num_leaves = 7L
+     , max_depth = 3L
+     , nthread = 1L
+ )
> model <- lgb.train(
+     params
+     , dtrain
+     , 50L
+     , valids
+     , min_data = 1L
+     , learning_rate = 1.0
+     , early_stopping_rounds = 10L
+ )
> small_weight_loss <- as.numeric(model$record_evals$test$l2$eval)
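
With weights of 1/100000, the hessian sum in any node is tiny, so min_sum_hessian = 10.0 blocks every split (hence the flat loss curve), while 1e-4 lets the model learn. Plotting the two curves together makes this visible; a sketch:

plot(weight_loss, type = "l", col = "red", xlab = "Iteration", ylab = "Test l2")
lines(small_weight_loss, col = "blue")
legend(
    "topright"
    , legend = c("min_sum_hessian = 10.0", "min_sum_hessian = 1e-4")
    , col = c("red", "blue")
    , lty = 1L
)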

Summary: that wraps up this lightgbm tutorial. If anything in this post is unclear, feel free to ask in the comments.
