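The previous post built an anti-fraud model with a random forest. A random forest is an ensemble method and behaves like a black box, so to make the model easier to interpret, this post rebuilds the model on the same data with a decision tree. It outputs a visualization of the binary tree along with the rule text, then parses that rule text into SQL statements. This makes the customer-segment rules under each branch of the tree clear at a glance, and the parsed SQL can be run directly in a database.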

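For data loading, data cleaning and preprocessing, feature engineering, and sampling and train/test splitting, see the author's previous post:
Python random forest anti-fraud case: complete modeling workflow

... (continued)
Continuing from the model validation step and the earlier code in the previous post…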

Decision tree classification: unpruned

Model training

from sklearn import tree

def Model_Train(x_train, y_train):
    model = tree.DecisionTreeClassifier()
    # model = tree.DecisionTreeClassifier(criterion='entropy')  # use information entropy as the split criterion
    # Train the decision tree
    model.fit(x_train, y_train)
    return model

model_tree = Model_Train(X_train, y_train)

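Output feature importances
Each value reflects how much a feature contributes to the tree's splits (the importances are normalized and sum to 1). The larger the value, the greater the role that feature plays in classification.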

print(model_tree.feature_importances_)
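The raw array is hard to read without the column names next to it. The following is a minimal sketch of how the importances could be paired with feature names and ranked. It uses synthetic data from `make_classification` and hypothetical names (`feat_0` ... `feat_5`) as stand-ins, since the real dataset comes from the previous post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data (an assumption, not the post's dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
demo_names = [f"feat_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Pair every importance with its feature name and sort in descending order
ranked = sorted(
    zip(demo_names, demo_model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:<10s} {score:.4f}")
```

With the real model, the same pattern would apply to `model_tree.feature_importances_` and the list of training column names.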

Model evaluation

# Evaluate the model: compute precision and recall on the test set
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred_rf = model_tree.predict(X_test)
# Compute confusion matrix (rows: true labels, columns: predicted labels)
cnf_matrix = confusion_matrix(y_test, y_pred_rf)
np.set_printoptions(precision=2)
print(cnf_matrix)
print("Precision metric in the testing dataset: ", (cnf_matrix[1,1]/(cnf_matrix[0,1] + cnf_matrix[1,1])).round(4))
print("Recall metric in the testing dataset: ", (cnf_matrix[1,1]/(cnf_matrix[1,0] + cnf_matrix[1,1])).round(4))
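The two print statements divide directly, which fails or yields NaN when the model predicts no positives at all. As an illustration only, the same arithmetic can be wrapped in a small helper that guards against a zero denominator. The function name `precision_recall` and the counts below are made up for this sketch.

```python
def precision_recall(cnf):
    """Return (precision, recall) for the positive class of a 2x2 confusion matrix.

    Layout follows sklearn's confusion_matrix: rows are true labels,
    columns are predicted labels, so cnf[1][1] is the true-positive count.
    """
    tn, fp = cnf[0]
    fn, tp = cnf[1]
    # Fall back to 0.0 when there are no predicted (or no actual) positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return round(precision, 4), round(recall, 4)

# Made-up counts for illustration only
print(precision_recall([[90, 10], [5, 45]]))  # (0.8182, 0.9)
```

The helper accepts a nested list or the NumPy array returned by `confusion_matrix`.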

Visualizing the decision tree as a binary tree

import pydotplus  # rendering to PDF also requires the Graphviz binaries to be installed

play_feature_E = [
    'TOTOL_7_ZJ_CNT', 'H_MAX_CIRCLE', 'CHG_CELLS', 'ZJ_CNT_RATE', 'TOTOL_7_ZJ_DUR', 'ZHANBI',
    'CORP_USER_NAME_家庭客户', 'TERM_PRICE_未识别', 'WEEK_CNT', 'DIS_OPP_HOME_NUM', 'MIX_CDSC_FLG_0.0',
    'ALL_LL_DUR', 'TOTAL_DIS_BJ_NUM_RATE', 'ALL_LL_USE', 'MIX_CDSC_FLG_1.0', 'TOTOL_7_BJ_D_DUR',
    'BJ_LOCAL_CNT', 'ACT_DAY_RATE', 'ZJ_TOTAL_DURATION', 'CUST_ASSET_CNT', 'ZJ_DURATION_RATIO_0_15',
    'AMT', 'ZJ_AVG_DURATION', 'GENDER_未识别', 'ZJ_DURATION_30_60_CNT', 'DURATION_RATIO_0_15',
    'ZJ_DURATION_RATIO_15_30',
]
play_class = ['no', 'yes']
dot_data = tree.export_graphviz(model_tree, out_file=None,
                                feature_names=play_feature_E, class_names=play_class,
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf('tree_model_fsr_output101.pdf')

Full tree:

Detail of the full tree:

Decision tree classification: pre-pruned

Model training

from sklearn import tree

def Model_Train(x_train, y_train):
    model = tree.DecisionTreeClassifier(min_samples_split=30, min_samples_leaf=30)
    model.fit(x_train, y_train)
    return model

model_tree = Model_Train(X_train, y_train)

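Where:

min_samples_split: the minimum number of samples a node must contain before it can be split. The default is 2.
min_samples_leaf: the minimum number of samples required at a leaf node. The default is 1. A split is considered only if it leaves at least this many samples in both the left and the right branch. In other words, a candidate split that would produce a leaf with fewer samples than this value is rejected, so that leaf and its sibling are never created. With a large dataset, consider increasing this value to stop the tree from growing earlier.

This case mainly uses the two parameters min_samples_split and min_samples_leaf to prune the decision tree.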
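To make the effect of these two parameters concrete, here is a small sketch on synthetic data (not the post's dataset) that fits one unpruned and one pre-pruned tree and compares their size. The dataset settings are arbitrary assumptions chosen only so that the unpruned tree overgrows.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, a stand-in for the real training set
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
pruned_tree = DecisionTreeClassifier(
    min_samples_split=30, min_samples_leaf=30, random_state=0
).fit(X_demo, y_demo)

for name, model in [("unpruned", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, "depth =", model.get_depth(), "leaves =", model.get_n_leaves())

# Every leaf of the pre-pruned tree holds at least min_samples_leaf samples
leaf_mask = pruned_tree.tree_.children_left == -1  # -1 marks a leaf node
smallest_leaf = pruned_tree.tree_.n_node_samples[leaf_mask].min()
print("smallest leaf:", smallest_leaf)
```

Note that with `min_samples_leaf=30`, a node needs at least 60 samples before any split can satisfy the leaf constraint, so `min_samples_split=30` is the weaker of the two limits here.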

Model evaluation

# Evaluate the model: compute precision and recall on the test set
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred_rf = model_tree.predict(X_test)
# Compute confusion matrix (rows: true labels, columns: predicted labels)
cnf_matrix = confusion_matrix(y_test, y_pred_rf)
np.set_printoptions(precision=2)
print(cnf_matrix)
print("Precision metric in the testing dataset: ", (cnf_matrix[1,1]/(cnf_matrix[0,1] + cnf_matrix[1,1])).round(4))
print("Recall metric in the testing dataset: ", (cnf_matrix[1,1]/(cnf_matrix[1,0] + cnf_matrix[1,1])).round(4))

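Comparing the evaluation results of the pre-pruned and unpruned trees, precision improved after pre-pruning while recall dropped somewhat. This suggests that the pruned tree generalizes better to new data and is less prone to overfitting.

Output the rule text and the visualization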

tree.export_graphviz(model_tree, out_file='tree_rule102.txt',
                     feature_names=play_feature_E, class_names=['0', '1'],
                     filled=True, node_ids=True, rounded=True, special_characters=True)

import pydotplus

# play_feature_E is the same feature-name list defined in the unpruned section
play_class = ['no', 'yes']
dot_data = tree.export_graphviz(model_tree, out_file=None,
                                feature_names=play_feature_E, class_names=play_class,
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf('tree_model_fsr_output102.pdf')

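Full tree:

Detail of the full tree:

Compared with the unpruned tree, the pre-pruned tree is much thinner in both depth and branching. The degree of pruning can be tuned further by adjusting min_samples_split and min_samples_leaf.

Parsing the decision tree's output rules

Whether the decision tree is output as a binary tree diagram or as rule text, these are raw rules and not very intuitive. Below, the rules output by the pre-pruned tree are parsed into a form that is easier to read, such as SQL statements.

Step 1: read the file and fill the intermediate result tables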

import ast

filepath = "tree_rule102.txt"
dict_node = {}        # node id (str) -> raw label text of that node
label = []            # predicted class of each leaf, in file order
value_pool = []       # class counts of each leaf, in file order
list_direction = []   # edges as "parent,child" strings, in file order
dict_direction = {}   # parent id (int) -> "left_child,right_child"
f_w = open(filepath, 'r', encoding='UTF-8')
line = f_w.readline()
while line:
    # Skip the DOT header, global node/edge settings, and the closing brace
    if line.startswith('digraph'):
        line = f_w.readline()
        continue
    if line.startswith('node'):
        line = f_w.readline()
        continue
    if line.startswith('edge'):
        line = f_w.readline()
        continue
    if line.startswith('}'):
        line = f_w.readline()
        continue
    # Node definition line: store its label text under the node id
    if line.find('label=<node') > -1:
        pos = line.index('[')
        key = line[0:pos].strip()
        content = line[pos:].strip()
        dict_node[key] = content
    # Edge line: "parent -> child", optionally followed by [labeldistance=...]
    if line.find('->') > -1:
        if line.find('[labeldistance') > -1:
            pos = line.index('[')
            line = line[0:pos]
            temp = line.split('->')
        else:
            temp = line.replace(";", "").split('->')
        key = int(temp[0].strip())
        value = temp[1].strip()
        list_direction.append(str(key) + "," + str(value))
        if key in dict_direction:
            temp_value = dict_direction[key]
            value = temp_value + "," + value
        dict_direction[key] = value
    # Look ahead at the next line: a leaf has no "&le" split condition
    line = f_w.readline()
    try:
        if '&le' in line:
            continue
        else:
            lb = line[line.index('class =')+8: line.index('>, fillcolor')]
            label.append(lb)
    except ValueError:
        continue
    try:
        if '&le' in line:
            continue
        else:
            value_tmp = line[line.index('value =')+8: line.index('<br/>class')]
            # literal_eval parses the "[a, b]" list without executing arbitrary code
            value_tmp = ast.literal_eval(value_tmp)
            value_pool.append(value_tmp)
    except ValueError:
        continue
f_w.close()

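Step 2: build the full-path tree dictionary

Using the list of node directions (edges) collected above, in order, define a tree-path dictionary and assemble each full path from the root dynamically, following that order.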

dict_tree = {}
for j in list_direction:
    dict_tree[j] = j
    temp = j.split(',')
    for key in dict_tree.keys():  # e.g. "0,1" "0,2" "2,3"
        key_temp = key.split(',')
        if key_temp[0] != temp[0]:
            # The edge ending at this edge's parent gives the path down to the parent
            if key_temp[1] == temp[0]:
                temp_node = dict_tree.get(key) + "," + temp[1]
                dict_tree[j] = temp_node

# Take the values of the full-path dictionary and put them in list_result
list_result = []
for value in dict_tree.values():
    list_result.append(value)

# From the full-path list, find the paths that are contained in a longer path
list_del = []  # entries to delete
for v in range(len(list_result)):
    for x in range(len(list_result)):
        if list_result[v] == list_result[x]:
            continue
        # Match on a whole node id so that "0,1" is not treated as a prefix of "0,10"
        if list_result[x].startswith(list_result[v] + ","):
            list_del.append(list_result[v])
            break

# Filter the entries to delete out of list_result and write the rest to list_response
list_response = []  # final list of root-to-leaf paths
for item in list_result:
    if item not in list_del:
        list_response.append(item)

# Parse the node dictionary dict_node to get the content for each node id
node_dict = {}  # node id -> split condition
for keys, values in dict_node.items():
    value = values.split('<br/>')
    if len(value) > 1:
        if value[1].find('gini') == -1:  # non-leaf node
            node_dict[keys] = value[1]

# Use the path list and the node relation dictionary dict_direction to decide left or right
result = []
for item in list_response:
    temp = item.split(',')
    end = len(temp)
    start_pos = 0
    result_end = []  # conditions along this path
    while start_pos + 1 < end:
        # Pair up consecutive nodes and look up the comparison sign for each pair
        # (this scans the relation dictionary repeatedly and could be optimized)
        str_sub = temp[start_pos:start_pos+2]
        start_pos += 1
        node = int(str_sub[0])
        next_node = int(str_sub[1])
        if node in dict_direction:
            values = dict_direction.get(node).split(',')
            node_value = node_dict.get(str(node))
            # Binary tree: the relation entry holds the left child at [0] and the right child at [1]
            if next_node == int(values[0]):
                node_value = node_value.replace("&le;", "<=")
            if next_node == int(values[1]):
                node_value = node_value.replace("&le;", ">")
            result_end.append(node_value)
    result.append(result_end)
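As a cross-check on the text parsing above, the same root-to-leaf conditions can also be read straight from the fitted estimator's `tree_` arrays, without going through the DOT file. This is an alternative sketch, not the post's method. The helper name `leaf_rules` and the demo data and feature names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def leaf_rules(model, feature_names):
    """Return a list of (conditions, predicted_class_index), one entry per leaf."""
    t = model.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:  # -1 marks a leaf in sklearn's tree arrays
            rules.append((conditions, int(t.value[node][0].argmax())))
            return
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        # Left branch: feature <= threshold; right branch: feature > threshold
        walk(t.children_left[node], conditions + [f"{name} <= {thr:.4f}"])
        walk(t.children_right[node], conditions + [f"{name} > {thr:.4f}"])

    walk(0, [])
    return rules

# Small synthetic demo (stand-in data and names, not the post's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ["f0", "f1", "f2", "f3"]
demo_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

rules = leaf_rules(demo_model, demo_names)
for conds, cls in rules:
    print(" and ".join(conds), "->", demo_model.classes_[cls])
```

Leaves come out in the same depth-first, left-to-right order that the DOT export uses, so the two approaches can be compared rule by rule.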

Step 3: union processing

# Keep only the rules whose leaf predicts class '1'
a = [[i, j] for i, j in zip(label, result) if i == '1']
b = [' and '.join(i[1]) for i in a]
c = ' and '.join(b)
for i in range(len(b)):
    temp = b[i].split(' and ')
    remove_dup_col_dict = {}  # "column <=" or "column >" -> tightest threshold
    for item in temp:
        if item.find('<=') > 0:
            temp_str = item.split('<=')
            col_str = temp_str[0].strip()
            temp_key = col_str + " <="
            if temp_key in remove_dup_col_dict.keys():
                temp_value = remove_dup_col_dict[temp_key]
                # Compare as numbers, not strings; for "<=" keep the smallest bound
                if float(temp_value) > float(temp_str[1]):
                    remove_dup_col_dict[temp_key] = temp_str[1].strip()
            else:
                remove_dup_col_dict[temp_key] = temp_str[1].strip()
        if item.find('>') > 0:
            temp_str = item.split('>')
            col_str = temp_str[0].strip()
            temp_key = col_str + " >"
            if temp_key in remove_dup_col_dict.keys():
                temp_value = remove_dup_col_dict[temp_key]
                # For ">" keep the largest bound
                if float(temp_value) < float(temp_str[1]):
                    remove_dup_col_dict[temp_key] = temp_str[1].strip()
            else:
                remove_dup_col_dict[temp_key] = temp_str[1].strip()
    if len(remove_dup_col_dict) > 0:
        # temp_str holds the merged condition for the current rule only
        temp_str = ""
        for key, value in remove_dup_col_dict.items():
            temp = key + " " + value
            temp_str = temp_str + temp + ' and '
        temp_str = temp_str[:-5]  # drop the trailing ' and '
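The merging logic depends on comparing thresholds as numbers: as strings, "100.5" sorts before "9.5", which would keep the looser bound. The sketch below isolates that idea in one function. The name `simplify_rule` is hypothetical, and it assumes every condition has the form `column <= value` or `column > value`, as produced in step 2.

```python
def simplify_rule(conditions):
    """Collapse repeated bounds on the same column within one AND-rule.

    For "col <= v" keep the smallest v; for "col > v" keep the largest v.
    Thresholds are compared as floats, not as strings.
    """
    bounds = {}  # (column, operator) -> tightest threshold seen so far
    for cond in conditions:
        # Each condition contains exactly one of the two operators
        op = "<=" if "<=" in cond else ">"
        col, raw = cond.split(op)
        key, val = (col.strip(), op), float(raw)
        if key not in bounds:
            bounds[key] = val
        elif op == "<=":
            bounds[key] = min(bounds[key], val)
        else:
            bounds[key] = max(bounds[key], val)
    return " and ".join(f"{col} {op} {val}" for (col, op), val in bounds.items())

rule = ["AMT <= 100.5", "WEEK_CNT > 2.5", "AMT <= 9.5", "WEEK_CNT > 10.5"]
print(simplify_rule(rule))  # AMT <= 9.5 and WEEK_CNT > 10.5
```

Applying such a function to each inner list of `result` would give one simplified condition string per leaf, which is what a per-rule SQL clause needs.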

Step 4: assemble the SQL

# Step 3 leaves its merged condition in temp_str; the CASE statement below
# is built from result (conditions per leaf) and label (class per leaf)
end = '0'
sql_start = 'select case '
sql_end = 'else ' + end + ' end;'
result_sql = []
result_sql.append(sql_start)
for i in range(len(result)):
    sql_ste = 'when ' + ' and '.join(result[i]) + ' then ' + label[i] + '\n'
    result_sql.append(sql_ste)
result_sql.append(sql_end)
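The statement assembled above is a bare CASE expression with no FROM clause. As one possible way to package it, the sketch below wraps the same idea in a function and adds a column alias and a table name. The function name `build_case_sql`, the alias `fraud_flag`, and the table `customer_features` are all hypothetical and not taken from the post. Column names such as `MIX_CDSC_FLG_0.0` or those containing Chinese characters would likely need quoting in whatever dialect the target database uses, which this sketch does not attempt.

```python
def build_case_sql(rules, labels, default="0", table="customer_features"):
    """Assemble one CASE expression from per-leaf rules.

    rules  : list of lists of condition strings, one inner list per leaf
    labels : predicted class for each leaf, in the same order as rules
    table  : hypothetical table name, not taken from the post
    """
    lines = ["select case"]
    for conds, lab in zip(rules, labels):
        if conds:  # skip a leaf with no conditions (a tree that never split)
            lines.append(f"  when {' and '.join(conds)} then {lab}")
    lines.append(f"  else {default} end as fraud_flag")
    lines.append(f"from {table};")
    return "\n".join(lines)

# Made-up rules for illustration
demo_rules = [["AMT <= 9.5", "WEEK_CNT > 10.5"], ["AMT > 9.5"]]
demo_labels = ["1", "0"]
sql = build_case_sql(demo_rules, demo_labels)
print(sql)
```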

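Step 5: write the rules out to a txt file

# Feature names contain non-ASCII characters, so set the encoding explicitly
with open('out_sql102.txt', 'w', encoding='utf-8') as f:
    f.writelines(result_sql)

Excerpt of the output txt file: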
