feature_importances_ in sklearn

1. Random forest [original article: https://blog.csdn.net/zjuPeco/article/details/77371645]

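The random forest algorithm can be summarized in the following steps:

1. Use sampling with replacement (bootstrap) to draw n samples from the sample set as a training set.
2. Grow a decision tree from the sampled set. At each node of the tree:
   a. randomly select d features without repetition;
   b. split the samples on each of these d features in turn and keep the best split (judged by, for example, the Gini index, gain ratio, or information gain).
3. Repeat steps 1 and 2 a total of k times, where k is the number of decision trees in the forest.
4. Use the trained forest to predict the test samples, deciding each prediction by majority vote.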
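The steps above can be sketched by hand with plain decision trees. This is only a minimal illustration, not how sklearn implements RandomForestClassifier internally: the dataset (sklearn's bundled load_wine), the values of k and d, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
k = 25                                # number of trees (assumed value)
n = X_train.shape[0]                  # size of each bootstrap sample
d = int(np.sqrt(X_train.shape[1]))    # features tried at each node (assumed value)

trees = []
for _ in range(k):
    # Step 1: draw n samples with replacement (bootstrap).
    idx = rng.integers(0, n, size=n)
    # Step 2: grow a tree; max_features=d makes every split consider
    # d randomly chosen features and keep the best one by Gini impurity.
    tree = DecisionTreeClassifier(
        criterion="gini",
        max_features=d,
        random_state=int(rng.integers(1 << 31)),
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)  # Step 3: the loop repeats steps 1 and 2 k times.

# Step 4: majority vote over the k trees for every test sample.
votes = np.stack([t.predict(X_test) for t in trees]).astype(int)  # shape (k, n_test)
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])
accuracy = (y_pred == y_test).mean()
print(f"hand-rolled forest accuracy: {accuracy:.3f}")
```

Note that sklearn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes, so its predictions can differ slightly from this sketch.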
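A dataset often has hundreds or thousands of features. How to pick the few that influence the result most, and thereby reduce the number of features used to build the model, is a question we care about. There are many methods for this, such as principal component analysis and lasso. Here we introduce feature selection with a random forest.

Put simply, we measure how much each feature contributes in every tree of the forest, average those contributions, and then compare the averages across features.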
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation no longer exists
from sklearn.ensemble import RandomForestClassifier

# df: a DataFrame with the class label in column 0 and the features in the remaining columns
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

# Rank the features from most to least important
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
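The "average over the trees" idea described earlier can be checked directly against a fitted forest. The sketch below assumes sklearn's bundled load_wine data and a smaller forest than the one above so that it runs quickly; the names per_tree and manual are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances of each individual tree, one row per tree.
per_tree = np.array([
    tree.feature_importances_
    for tree in forest.estimators_
    if tree.tree_.node_count > 1   # skip degenerate single-node trees
])

# Average over the trees, then renormalise so the importances sum to 1.
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

# Compare the hand-computed average with the attribute exposed by the forest.
matches = np.allclose(manual, forest.feature_importances_)
print(matches)
```

Because these importances are derived from impurity decreases on the training data, they are best read as a relative ranking rather than as absolute effect sizes.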

To keep only the features with relatively high importance:

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape
(124, 3)
This selects the 3 features whose importance is greater than 0.15.
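The same thresholding can also be done with sklearn's SelectFromModel, which has the advantage that the identical column selection can be applied to the test set. This is an alternative sketch, not the method used in the original article; it assumes the bundled load_wine data and a smaller forest, so the set of selected features may differ from the result above. Note that SelectFromModel keeps features whose importance is greater than or equal to the threshold.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

threshold = 0.15
# prefit=True reuses the already fitted forest instead of refitting it.
selector = SelectFromModel(forest, threshold=threshold, prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)   # same columns as the training set

# Map the boolean mask back to feature names.
mask = selector.get_support()
selected_names = [name for name, keep in zip(data.feature_names, mask) if keep]
print(X_train_sel.shape, selected_names)
```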

Original article: "Evaluating feature importance with random forests (with a worked example and code walkthrough)", blog of 旅途中的宽~, CSDN

Load the data

import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
wine_data = pd.read_csv(url, header=None)  # the file has no header row

Add column names to wine_data

wine_data.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

import numpy as np
np.unique(wine_data['Class label'])
[a = np.unique(A): for a one-dimensional array or list, unique removes the duplicate elements and returns a new array of the distinct values sorted in ascending order.]
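A small illustration of np.unique on made-up labels (the array below is invented for the example); return_counts=True additionally reports how often each distinct value occurs, which is handy for checking class balance.

```python
import numpy as np

labels = np.array([3, 1, 2, 3, 1, 1])   # made-up class labels

# Distinct values come back sorted in ascending order as a NumPy array.
values, counts = np.unique(labels, return_counts=True)
print(values)   # [1 2 3]
print(counts)   # [3 1 2]
```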

Check whether the data contains missing values:

wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Class label                   178 non-null    int64
 1   Alcohol                       178 non-null    float64
 2   Malic acid                    178 non-null    float64
 3   Ash                           178 non-null    float64
 4   Alcalinity of ash             178 non-null    float64
 5   Magnesium                     178 non-null    int64
 6   Total phenols                 178 non-null    float64
 7   Flavanoids                    178 non-null    float64
 8   Nonflavanoid phenols          178 non-null    float64
 9   Proanthocyanins               178 non-null    float64
 10  Color intensity               178 non-null    float64
 11  Hue                           178 non-null    float64
 12  OD280/OD315 of diluted wines  178 non-null    float64
 13  Proline                       178 non-null    int64
dtypes: float64(11), int64(3)
Excluding the class label there are 13 features, and the dataset contains 178 samples.
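Besides reading the non-null counts from info(), missing values can be counted explicitly. The sketch below uses a tiny made-up DataFrame (toy and its values are invented for illustration); the same two calls would apply to wine_data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing value.
toy = pd.DataFrame({
    "Alcohol": [14.23, np.nan, 13.16],
    "Hue": [1.04, 1.05, 1.03],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```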

Training:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x, y = wine_data.iloc[:, 1:].values, wine_data.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
feat_labels = wine_data.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

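Results, ranked by importance (each feature is identified here by its column index in wine_data, so 10 is Color intensity, 13 is Proline, and 7 is Flavanoids):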

1) 10 0.182483
2) 13 0.158610
3) 7 0.150948
4) 12 0.131987
5) 1 0.106589
6) 11 0.078243
7) 6 0.060718
8) 4 0.032033
9) 2 0.025400
10) 9 0.022351
11) 5 0.022078
12) 8 0.014645
13) 3 0.013916
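A self-contained way to produce a ranking like the one above without downloading the file is sketched below. It assumes sklearn's bundled load_wine dataset (a copy of the UCI wine data, with lowercase feature names) and uses fewer trees to keep the run short, so the exact scores will not match the figures listed above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(x_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]   # most important feature first

# Pair every feature name with its score, in descending order of importance.
ranking = [(data.feature_names[i], float(importances[i])) for i in indices]
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank:2d}) {name:<30} {score:.6f}")
```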
Select the features with relatively high importance:

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected

Three columns are selected.

Reference: "Python Daily Note 42: feature importance (feature_importances_) in machine learning", blog of 教练我想学编程, CSDN

Use .to_csv to save the raw feature importances.
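One possible way to do this is to put the scores in a small table and write it out. The feature names, scores, and output path below are made up for illustration; in practice the scores would come from forest.feature_importances_.

```python
import os
import tempfile

import numpy as np
import pandas as pd

feature_names = ["Alcohol", "Hue", "Proline"]   # hypothetical subset of features
importances = np.array([0.2, 0.3, 0.5])         # made-up scores

# Build a two-column table sorted from most to least important.
importance_table = (
    pd.Series(importances, index=feature_names, name="importance")
    .sort_values(ascending=False)
    .rename_axis("feature")
    .reset_index()
)

# Write to a temporary directory so the example leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "feature_importances.csv")
importance_table.to_csv(out_path, index=False)

# Read the file back to confirm what was stored.
restored = pd.read_csv(out_path)
print(restored)
```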

Plotting the feature importances:
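A horizontal bar chart is one common way to draw such a plot. The sketch below is an assumption about what the plot might look like, not the figure from the referenced article; it uses the bundled load_wine data, a non-interactive matplotlib backend, and a temporary output path.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # render to a file, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

importances = forest.feature_importances_
order = np.argsort(importances)   # ascending, so the largest bar is drawn at the top
names = [data.feature_names[i] for i in order]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(names, importances[order])
ax.set_xlabel("Impurity-based importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()

plot_path = os.path.join(tempfile.mkdtemp(), "feature_importances.png")
fig.savefig(plot_path, dpi=150)
plt.close(fig)
```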

Reference: "Computing feature importance with random forests: three ways to compute feature importance in a random forest", blog of weixin_26752765, CSDN
