
Decision Trees are popular Machine Learning algorithms used for both regression and classification tasks. Their popularity mainly arises from their interpretability and representability, as they mimic the way humans make decisions.


However, this interpretability comes at a price in terms of prediction accuracy. To overcome this drawback, some techniques have been developed with the goal of building strong and robust models out of ‘weak’ ones. Those techniques are known as ‘ensemble’ methods (I discussed some of them in my previous article here).


In this article, I’m going to dwell on four different ensemble techniques, all using Decision Trees as base learners, with the aim of comparing their performance in terms of accuracy and training time. The four algorithms I’m going to use are:


  • Random Forest
  • Gradient Boosting
  • XGBoost
  • LightGBM

To compare these methods’ performances, I initialized an artificial dataset as follows:


from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame

# generate 2d classification dataset
X, y = make_blobs(n_samples=10000, centers=3, n_features=2)
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df.head()

As you can see, our dataset contains observations having a vector of two predictors [x1,x2] and a categorical output with 3 classes [0,1,2].
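A quick sanity check of that structure could look like the sketch below. The `random_state=0` seed is an assumption added for reproducibility; the snippet above does not fix one.

```python
from sklearn.datasets import make_blobs
from pandas import DataFrame

# random_state=0 is an assumption added here so the check is reproducible
X, y = make_blobs(n_samples=10000, centers=3, n_features=2, random_state=0)
df = DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))

print(df.shape)                                 # number of rows and columns
print(sorted(df["label"].unique()))             # distinct class labels
print(df["label"].value_counts().sort_index())  # observations per class
```

With an integer `n_samples`, `make_blobs` splits the observations evenly across the centers, so the three classes are balanced.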


The final aim is to show how LightGBM outperforms (by far) the other algorithms.


Random Forest

Random Forest relies on the concept of bagging: if we could train multiple trees on different datasets and then use the average of their outputs (or, for classification, the majority vote) to predict the label of a new observation, we would get more accurate results. We can achieve that by creating a series of datasets, each a bootstrapped version of the original one, and then training one classifier on each of them.
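The bagging idea can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions: the dataset size, the seed, `n_trees = 25` and `max_depth=2` are all hypothetical choices, not values taken from the experiment in this article.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=2000, centers=3, n_features=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical ensemble size
trees = []
for _ in range(n_trees):
    # bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(max_depth=2).fit(X_train[idx], y_train[idx]))

# each row of `votes` holds one tree's predictions for the test set
votes = np.stack([tree.predict(X_test) for tree in trees])
# majority vote: most frequent label per test observation (labels are 0, 1, 2)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
bagged_acc = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_acc:.3f}")
```

Each tree sees a slightly different resampling of the training rows, and the vote smooths out their individual mistakes.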


Plus, the Random Forest algorithm adds a further constraint: when a tree is grown from a bootstrapped sample, each split may consider only a random subset of m of the p available covariates (with m<p). This decorrelates the trees, so that their errors are less likely to coincide and averaging them is more effective.
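In scikit-learn, the subset size m is controlled by the `max_features` parameter, and a fresh subset is drawn at every split. The sketch below compares m = 3 with m = p = 10, where the latter reduces to plain bagging. The synthetic dataset with 10 predictors is an assumption made here so that m < p is meaningful; it is not the two-feature dataset used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# hypothetical dataset with p = 10 predictors
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)

results = {}
for m in (3, 10):  # m = 3: random subset per split; m = 10 = p: plain bagging
    clf = RandomForestClassifier(n_estimators=50, max_features=m, random_state=0)
    results[m] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"m = {m:2d}: mean CV accuracy = {results[m]:.3f}")
```

Which value of m works best depends on the data, so it is usually treated as a hyperparameter to tune.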


Let us see how it performs on our artificial data:


# random forest
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

start = time.time()
clf_rf = RandomForestClassifier()  # default hyperparameters
scores = cross_val_score(clf_rf, X, y, cv=5)
acc_rf = scores.mean()
end = time.time()
temp_rf = end-start

To evaluate model performances in terms of accuracy, I will use the cross-validation approach on the training set, partitioning it into 5 folds. I’ve stored the time and accuracy results into variables which will be revealed at the end of this article.
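To make explicit what `cross_val_score(..., cv=5)` does, the loop below reproduces it by hand. It is a sketch on a smaller, seeded dataset (both assumptions made here to keep it fast and reproducible). For classifiers, an integer `cv` corresponds to unshuffled stratified folds, hence the `StratifiedKFold`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])                        # train on 4 folds
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

auto_scores = cross_val_score(clf, X, y, cv=5)
print(np.mean(manual_scores), auto_scores.mean())
```

Every observation is used for testing exactly once, so the mean of the five scores is a less optimistic estimate than accuracy measured on the training data itself.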


Gradient Boosting

The idea behind boosting is to build a series of trees sequentially, each one correcting the errors of the model built so far. At each iteration, a tree is fit on the dataset (X, r) rather than (X, y), where “r” denotes the residuals left by the current model. A shrunken version of this tree is then added to the model, and the procedure continues until the end of the loop (or until a stopping condition is reached).
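The loop described above can be sketched in a few lines. For simplicity this illustration uses a regression problem, where residuals are most natural; the toy target, the shrinkage factor and the number of rounds are all assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)  # hypothetical toy target

learning_rate = 0.1  # shrinkage factor (assumed value)
n_rounds = 100       # number of boosting iterations (assumed value)

prediction = np.zeros_like(y)  # start from the constant model f(x) = 0
residual = y.copy()            # r = y - f(x)
mse_history = []
for _ in range(n_rounds):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit on (X, r)
    update = learning_rate * tree.predict(X)                    # shrunken tree
    prediction += update
    residual -= update
    mse_history.append(np.mean(residual ** 2))

print(f"training MSE after 1 round: {mse_history[0]:.4f}, "
      f"after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

A small shrinkage factor means each tree contributes only a little, which is why many rounds are needed but also why the ensemble tends to overfit more slowly.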


With Gradient Boosting, the model is updated through a gradient descent procedure: each new tree is fit to the negative gradient of the loss function with respect to the current predictions, which plays the role of the residuals (to learn more about the functioning of this algorithm, here there is a very intuitive article).
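The link between residuals and gradient descent is easiest to see for the squared-error loss. That loss is an assumption made for this sketch (classification typically uses a different one, such as the log-loss), and the symbols $f_m$, $h_m$ and $\lambda$ are notation introduced here.

```latex
\begin{align}
  % squared-error loss for a single observation
  L\bigl(y, f(x)\bigr) &= \tfrac{1}{2}\,\bigl(y - f(x)\bigr)^2 \\
  % its negative gradient with respect to the current prediction is the residual
  -\frac{\partial L}{\partial f(x)} &= y - f(x) = r \\
  % update with shrinkage \lambda, where h_m is the tree fit on (X, r)
  f_m(x) &= f_{m-1}(x) + \lambda\, h_m(x)
\end{align}
```

For other losses the negative gradient is no longer the plain residual, but the update rule keeps the same form.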


So let’s initialize this algorithm too and train it on our data:


from sklearn.ensemble import GradientBoostingClassifier

start = time.time()
clf_gb = GradientBoostingClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf_gb, X, y, cv=5)
acc_gb = scores.mean()
end = time.time()
temp_gb = end-start

XGBoost

XGBoost is an “extreme” version of Gradient Boosting, in the sense that it is more efficient, flexible, and portable. Among the features that make this algorithm so performant are its parallelization of tree construction (using all CPU cores) and the possibility of distributing training across different machines to handle very large datasets. Plus, it is often more accurate than standard Gradient Boosting.


All these features made it the favourite algorithm on Kaggle for a very long time, until newer Gradient Boosting implementations appeared (among those, LightGBM).


So let’s train our XGBoost and store its results.


# installing the package for xgboost
!pip install xgboost
import xgboost as xgb

start = time.time()
clf_xgb = xgb.XGBClassifier()
scores = cross_val_score(clf_xgb, X, y, cv=5)
acc_xgb = scores.mean()
end = time.time()
temp_xgb = end-start

Note: unlike Random Forest and Gradient Boosting Classifier, which ship with scikit-learn, XGBoost and, later on, LightGBM are separate packages. We can easily install them via pip.


Now it’s time to train and evaluate the last ensemble algorithm and then compare all the obtained results.


LightGBM

LightGBM is yet another gradient boosting framework that uses a tree-based learning algorithm. Like its counterpart XGBoost, it focuses on computational efficiency and high predictive performance.


In recent times, LightGBM has enjoyed great success in many Kaggle competitions, outperforming XGBoost in terms of both training speed and prediction accuracy.


Let’s see whether this also holds for our artificial dataset.


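!pip install lightgbm
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',  # the labels take 3 values, so a binary objective does not apply
    'num_class': 3,
    'metric': 'multi_error',
}
lgb_train = lgb.Dataset(X, y, free_raw_data=False)

start = time.time()
scores = lgb.cv(
        params,
        lgb_train,
        num_boost_round=100,
        nfold=5,
        stratified=False
        )
end = time.time()
temp_lgb = end-start

# lgb.cv returns the mean error per boosting round; the key name varies across LightGBM versions
error_key = [k for k in scores if k.endswith('multi_error-mean')][0]
acc_lgb = (1 - scores[error_key][-1]) * 100  # accuracy in percent after the last round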

Conclusions

Now let us see the results of our algorithms:


import pandas as pd

data = dict([('LightGBM', [acc_lgb, temp_lgb]),
             ('XGBoost', [acc_xgb*100, temp_xgb]),
             ('Random Forest', [acc_rf*100, temp_rf]),
             ('Gradient Boosting', [acc_gb*100, temp_gb])])
df = pd.DataFrame(data).T.rename(columns={0: 'Accuracy', 1: 'Training Time'})
df

import matplotlib.pyplot as plt

fig, axes = plt.subplots(figsize=(12,8), nrows=1, ncols=2, sharey=True)

# helper that converts a duration in seconds into [minutes, seconds]
# for more readable training-time labels
def ret_time(temp):
    minutes = round(temp//60, 0)
    seconds = round(temp - 60*minutes, 0)
    return [int(minutes), int(seconds)]

# left panel: accuracy on 5-fold CV
ax1 = df.sort_values(by='Accuracy', ascending=True)["Accuracy"].plot(ax=axes[0], kind='barh', logx=True, xlim=(0,102))
ax1.text(96.15, 0, str(round(acc_rf*100,2))+'%', fontsize=15)
ax1.text(96.4, 1, str(round(acc_gb*100,2))+'%', fontsize=15)
ax1.text(96.44, 2, str(round(acc_xgb*100,2))+'%', fontsize=15)
ax1.text(100, 3, str(round(acc_lgb,2))+'%', fontsize=15)
ax1.set_title('Accuracy on 5-fold CV')

# right panel: training time
ax2 = df.sort_values(by='Accuracy', ascending=True)["Training Time"].plot(ax=axes[1], kind='barh', xlim=(0,900), colormap='viridis')
ax2.text(400, 0, str(ret_time(temp_rf)[0]) + 'm:' + str(ret_time(temp_rf)[1]) + 's', fontsize=15)
ax2.text(700, 1, str(ret_time(temp_gb)[0]) + 'm:' + str(ret_time(temp_gb)[1]) + 's', fontsize=15)
ax2.text(400, 2, str(ret_time(temp_xgb)[0]) + 'm:' + str(ret_time(temp_xgb)[1]) + 's', fontsize=15)
ax2.text(100, 3, str(ret_time(temp_lgb)[0]) + 'm:' + str(ret_time(temp_lgb)[1]) + 's', fontsize=15)
ax2.set_title('Training Time')

As you can see, in this experiment LightGBM is not only the model with the highest accuracy, but also the one with the lowest training time (by far).


Of course, one experiment is not enough to draw reliable conclusions. Plus, picking the best algorithm really depends on the task you are carrying out, as well as on the size (and nature) of the dataset.
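One way to make such a comparison more robust is to repeat it over several randomly generated datasets and report the mean and spread of the accuracy. The sketch below is restricted to the two scikit-learn models to avoid extra installs; the number of repetitions, the dataset size and `n_estimators=20` are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# factories so that every repetition gets a freshly seeded model
models = {
    "Random Forest": lambda seed: RandomForestClassifier(n_estimators=20, random_state=seed),
    "Gradient Boosting": lambda seed: GradientBoostingClassifier(
        n_estimators=20, max_depth=2, random_state=seed),
}
seeds = range(3)  # hypothetical number of repetitions

summary = {}
for name, make_model in models.items():
    accs = []
    for seed in seeds:
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=seed)
        accs.append(cross_val_score(make_model(seed), X, y, cv=5).mean())
    summary[name] = (np.mean(accs), np.std(accs))
    print(f"{name}: {summary[name][0]:.3f} ± {summary[name][1]:.3f}")
```

If the gap between two models is smaller than the spread across repetitions, a single run is not enough to rank them.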


Nevertheless, LightGBM has performed very well in many Kaggle competitions and is today one of the most popular classifiers available.


I hope you enjoyed this read! If you are interested in the topic, as well as in further “extreme” versions of Gradient Boosting, I suggest the references below:


Translated from: https://medium.com/dataseries/decision-tree-boosting-techniques-compared-5667bb2087ab



