The Difference Between Decision Trees and Boosted Trees: Decision Tree Boosting Techniques Compared

Decision Trees are popular Machine Learning algorithms used for both regression and classification tasks. Their popularity mainly stems from their interpretability: a fitted tree can be read as a sequence of simple if/else rules, which mimics the way people make decisions.

However, this interpretability comes at a price in terms of prediction accuracy. To overcome this limitation, some techniques have been developed with the goal of building strong and robust models starting from ‘weak’ ones. Those techniques are known as ‘ensemble’ methods (I discussed some of them in a previous article).

In this article, I’m going to focus on four different ensemble techniques, all using a Decision Tree as base learner, with the aim of comparing their performance in terms of accuracy and training time. The four algorithms I’m going to use are:

Random Forest
Gradient Boosting
XGBoost
LightGBM

To compare these methods’ performances, I initialized an artificial dataset as follows:

from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame

# generate 2d classification dataset
X, y = make_blobs(n_samples=10000, centers=3, n_features=2)
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df.head()

As you can see, our dataset contains observations having a vector of two predictors [x1,x2] and a categorical output with 3 classes [0,1,2].

The final aim is to show how LightGBM outperforms (by far) the other algorithms.

Random Forest

Random forest relies on the concept of bagging: if we could train multiple trees on different datasets and then use the average of their outputs (or, in the case of classification, the majority vote) to predict the label of a new observation, we would get more accurate results. We can achieve that by creating a series of datasets, each a bootstrapped version of the original one, and then training one classifier on each of them.

In addition, the Random Forest algorithm adds a further constraint: whenever a tree grown on a bootstrapped sample considers a split, it may only choose among a random subset of m of the p available predictors (with m<p). This does not make the trees fully independent, but it makes them less correlated with one another, so that averaging them reduces variance more effectively.
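
The two ingredients described above (bootstrapped samples and a random subset of m predictors) can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not scikit-learn's implementation: each "tree" is a one-split decision stump, and the names fit_stump, fit_forest and predict_forest are hypothetical. Because a stump has a single split, drawing the feature subset once per stump is the same as drawing it once per split.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """One-split tree: pick the (feature, threshold) with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split in this sample: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        feats = rng.sample(range(p), m)             # random subset of m < p predictors
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    # classification: majority vote over all stumps
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# toy data (made up for illustration): two well-separated groups in 2D
rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels, n_trees=25, m=1)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (10.0, 10.0)))
```

Individual stumps can be wrong when their bootstrap sample happens to be unrepresentative; the vote is what makes the ensemble prediction stable.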

Let us see how it performs on our artificial data:

# random forest
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

start = time.time()
# forest with default hyperparameters
clf_rf = RandomForestClassifier()
# 5-fold cross-validated accuracy
scores = cross_val_score(clf_rf, X, y, cv=5)
acc_rf = scores.mean()
end = time.time()
temp_rf = end - start

To evaluate model performance in terms of accuracy, I will use cross-validation on the full dataset, partitioning it into 5 folds. Note that the measured time therefore covers all five fits plus scoring, not a single training run. I’ve stored the time and accuracy results in variables which will be revealed at the end of this article.
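
To make the 5-fold procedure concrete, here is a minimal sketch of what a k-fold evaluation does, written in plain Python. It is a simplification under stated assumptions: the folds are shuffled but not stratified, and the helper names (k_fold_indices, cross_val_accuracy) and the toy threshold model are hypothetical, not part of scikit-learn.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, average the k accuracies."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# toy 1-D model: threshold halfway between the two class means
def fit(xs, ys):
    m0 = sum(x for x, c in zip(xs, ys) if c == 0) / ys.count(0)
    m1 = sum(x for x, c in zip(xs, ys) if c == 1) / ys.count(1)
    return (m0 + m1) / 2

def predict(threshold, x):
    return int(x > threshold)

X_toy = list(range(20))
y_toy = [int(x >= 10) for x in X_toy]
acc = cross_val_accuracy(fit, predict, X_toy, y_toy, k=5)
print(f"mean accuracy over 5 folds: {acc:.2f}")
```

The key property is that each observation is used for testing exactly once and never by a model that saw it during training.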

Gradient Boosting

The idea behind boosting is to build a series of trees sequentially, each one correcting the errors of the model built so far. Basically, at each iteration, a tree is fitted on the dataset (X, r) rather than (X,y), where “r” indicates the residuals left by the current ensemble. Then a shrunken version of this tree is added to the ensemble, and the procedure goes on until the end of the loop (or until a certain stopping condition is reached).
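
The loop described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the scikit-learn implementation: it does regression on made-up 1-D data, each tree is a one-split regression stump, and the names fit_stump, boost and the shrinkage value 0.1 are choices made for the example.

```python
def fit_stump(xs, rs):
    """Regression stump: the threshold on x that minimizes squared error on r."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, n_rounds=50, shrinkage=0.1):
    preds = [0.0] * len(xs)  # start from a model that predicts 0 everywhere
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # r: what is still unexplained
        t, lm, rm = fit_stump(xs, residuals)            # fit the tree on (X, r), not (X, y)
        stumps.append((t, lm, rm))
        # add a shrunken version of the new tree to the ensemble
        preds = [p + shrinkage * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
stumps, preds = boost(xs, ys)
mse_before = mse(ys, [0.0] * len(ys))
mse_after = mse(ys, preds)
print(f"training MSE before: {mse_before:.3f}, after boosting: {mse_after:.3f}")
```

A small shrinkage value means each tree contributes only a little, which is why boosting usually needs many rounds.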

With Gradient Boosting, the update of the model is framed as a gradient descent procedure: each new tree is fitted to the negative gradient of the loss function with respect to the current predictions, a quantity that generalizes the residuals to arbitrary differentiable losses (to learn more about how this algorithm works, the original post links to a very intuitive article).
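
For the squared-error loss, the link between residuals and gradients can be written out explicitly. The notation below is an assumption made for illustration and does not come from the article: $F_m$ is the ensemble after $m$ trees, $h_m$ is the $m$-th tree, $r_{i,m}$ is the residual of observation $i$ at iteration $m$, and $\nu$ is the shrinkage factor.

```latex
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{m-1}}
= y_i - F_{m-1}(x_i) = r_{i,m}

F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad 0 < \nu \le 1
```

So with this loss, fitting $h_m$ to the negative gradient is the same as fitting it to the residuals; other losses simply change what the tree is fitted to.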

So let’s initialize this algorithm as well and train it on our data:

from sklearn.ensemble import GradientBoostingClassifier

start = time.time()
clf_gb = GradientBoostingClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf_gb, X, y, cv=5)
acc_gb = scores.mean()
end = time.time()
temp_gb = end - start

XGBoost

XGBoost is an “extreme” version of Gradient Boosting, in the sense that it is designed to be more efficient, flexible, and portable. Among the features that make this algorithm so performant are its parallelization of tree construction (using all CPU cores) and the possibility of distributing training across different machines to handle very large datasets. In addition, it is often more accurate than standard Gradient Boosting.

All these features made it the favourite algorithm on Kaggle for a very long time, until newer implementations of Gradient Boosting arrived (among those, LightGBM).

So let’s train our XGBoost and store its results.

# installing the package for xgboost
!pip install xgboost
import xgboost as xgb

start = time.time()
clf_xgb = xgb.XGBClassifier()
scores = cross_val_score(clf_xgb, X, y, cv=5)
acc_xgb = scores.mean()
end = time.time()
temp_xgb = end - start

Note: unlike Random Forest and Gradient Boosting Classifier, which ship with scikit-learn, XGBoost and (later on) LightGBM are separate packages. We can easily install them via pip.

Now it’s time to train and evaluate the last ensemble algorithm and then compare all the obtained results.

LightGBM

LightGBM is yet another gradient boosting framework that uses a tree-based learning algorithm. Like its counterpart XGBoost, it focuses on computational efficiency and high predictive performance.

In recent times, LightGBM has had remarkable success in many Kaggle competitions, outperforming XGBoost in terms of both training speed and prediction accuracy.

Let’s see whether this also holds for our artificial dataset.

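!pip install lightgbm
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',  # the label has 3 classes, so a binary objective does not apply
    'num_class': 3,
    'metric': 'multi_error',
}
lgb_train = lgb.Dataset(X, y, free_raw_data=False)

start = time.time()
scores = lgb.cv(
        params,
        lgb_train,
        num_boost_round=100,
        nfold=5,
        stratified=False
        )
end = time.time()
temp_lgb = end - start

# lgb.cv returns a dict of per-round metric lists; the key prefix depends on the LightGBM version
error_key = [k for k in scores if k.endswith('multi_error-mean')][0]
# accuracy in percent after the last boosting round
acc_lgb = (1 - scores[error_key][-1]) * 100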

Conclusions

Now let us see the results of our algorithms:

import pandas as pd

# accuracy in percent and timing in seconds for each model
data = dict([('LightGBM', [acc_lgb, temp_lgb]),
             ('XGBoost', [acc_xgb*100, temp_xgb]),
             ('Random Forest', [acc_rf*100, temp_rf]),
             ('Gradient Boosting', [acc_gb*100, temp_gb])])
df = pd.DataFrame(data).T.rename(columns={0: 'Accuracy', 1: 'Training Time'})
df

import matplotlib.pyplot as plt

fig, axes = plt.subplots(figsize=(12,8), nrows=1, ncols=2, sharey=True)

# helper that expresses the training time as [minutes, seconds]
def ret_time(temp):
    minutes = round(temp//60, 0)
    seconds = round(temp - 60*minutes, 0)
    return [int(minutes), int(seconds)]

# left panel: accuracy; the label positions assume the order RF < GB < XGBoost < LightGBM
ax1 = df.sort_values(by='Accuracy', ascending=True)["Accuracy"].plot(
    ax=axes[0], kind='barh', logx=True, xlim=(0,102))
ax1.text(96.15, 0, str(round(acc_rf*100,2)) + '%', fontsize=15)
ax1.text(96.4, 1, str(round(acc_gb*100,2)) + '%', fontsize=15)
ax1.text(96.44, 2, str(round(acc_xgb*100,2)) + '%', fontsize=15)
ax1.text(100, 3, str(round(acc_lgb,2)) + '%', fontsize=15)
ax1.set_title('Accuracy on 5-fold CV')

# right panel: training time, in the same row order as the accuracy panel
ax2 = df.sort_values(by='Accuracy', ascending=True)["Training Time"].plot(
    ax=axes[1], kind='barh', xlim=(0,900), colormap='viridis')
ax2.text(400, 0, str(ret_time(temp_rf)[0]) + 'm:' + str(ret_time(temp_rf)[1]) + 's', fontsize=15)
ax2.text(700, 1, str(ret_time(temp_gb)[0]) + 'm:' + str(ret_time(temp_gb)[1]) + 's', fontsize=15)
ax2.text(400, 2, str(ret_time(temp_xgb)[0]) + 'm:' + str(ret_time(temp_xgb)[1]) + 's', fontsize=15)
ax2.text(100, 3, str(ret_time(temp_lgb)[0]) + 'm:' + str(ret_time(temp_lgb)[1]) + 's', fontsize=15)
ax2.set_title('Training Time')

As you can see, LightGBM is not only the model with the highest accuracy, but also the one with the lowest training time (by far).

Of course, one experiment is not enough to draw reliable conclusions. Moreover, picking the best algorithm really depends on the task you are carrying out, as well as on the size (and nature) of the dataset.
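
One simple way to go beyond a single run is to repeat each measurement and report its spread. The helper below is a hypothetical sketch (repeat_timing and toy_workload are names invented for this example); in the experiment above, the workload would be one cross_val_score(...) call for a given classifier.

```python
import statistics
import time

def repeat_timing(fn, repeats=5):
    """Call fn() `repeats` times; return (mean, stdev) of wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

# stand-in workload, used only so the sketch runs on its own
def toy_workload():
    return sum(i * i for i in range(100_000))

mean_s, std_s = repeat_timing(toy_workload, repeats=5)
print(f"{mean_s:.4f}s +/- {std_s:.4f}s over 5 runs")
```

The same pattern applies to accuracy: regenerating the dataset with several random seeds and comparing means and standard deviations gives a fairer picture than one draw.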

Nevertheless, LightGBM has performed very well in many Kaggle competitions and is today one of the preferred classifiers among practitioners.

I hope you enjoyed reading this! If you are interested in the topic, as well as in further “extreme” versions of Gradient Boosting, I suggest the references below:

Translated from: https://medium.com/dataseries/decision-tree-boosting-techniques-compared-5667bb2087ab
