Rule Learning with Machine Learning

In this article we are going to cover:

  • Interpretable machine learning, correlation vs. causation, use cases
  • A powerful Python package combining prediction and causal inference: an end-to-end Action Rules Discovery model

Links to my other articles

  1. Deep Kernels and Gaussian Processes
  2. Custom Loss Functions in TensorFlow
  3. Prediction and Inference with Boosted Trees
  4. Softmax classification
  5. Climate analysis

Introduction

Suppose you are a Data Magician working at a business and are tasked with building a churn prediction model to predict which customers are at risk of unsubscribing from the service your business offers. You quickly spin up a deep neural network on some P100 GPUs on Colab Pro, get some high prediction accuracies, have a kombucha and call it a day. The next afternoon your boss calls you over Zoom and says that although the prediction capabilities are useful, she needs to know which factors contribute to the churn probability and by how much.

You revise the model and instead of deep neural nets, go with ensembles of LightGBMs combined with Shapley values computed over the individual boosted trees to give factor inference outputs like the following:

[Figure: example SHAP output. Source: https://github.com/slundberg/shap, under license to Haihan Lan]

You have another kombucha, satisfied at having discovered (quite robustly too) what factors the model thinks contribute to churn, and call it a day.

Just before you log off on Friday afternoon, your boss calls you on Zoom again and tells you that although the factor inference insights are great, you still don't know whether these factors actually cause churn or retention. She reminds you that correlation does not imply causation, and that she can't tell which factors suggested by the model are actually causative and which are coincidences or spurious correlations.

[Figure: an example of a spurious correlation]

Being the obsessive Data Magician you are, you scour the interwebs and stumble upon a hidden gem that does everything your boss asked for.

There is a branch of supervised machine learning called uplift modelling that answers questions like “How much will intervention/action X affect outcome Y?” given a dataset of historical records containing data on intervention X and outcome Y. The effect of X on Y is a quantity called uplift (typically a difference in percentage points if Y is a probability). In this article we briefly cover a Python package called actionrules, created by the authors of [1], and show how to apply it to discover action rules and quantify their impact on an outcome.

Action Rules

We will briefly cover some definitions (classification rules, action rules, support and confidence) and then look at how uplift is estimated.

Classification rules r_n are defined as:

r_n = [(X_{1,n} ∧ X_{2,n} ∧ … ∧ X_{m,n}) → Y_n]

where the tuple (X_{1,n} ∧ X_{2,n} ∧ … ∧ X_{m,n}) is a conjunction of particular values, one from each of m input columns. This tuple is called the antecedent, or ant, and the outcome Y_n is called the consequent. For example:

[(Age = 55 ∧ Smoking = Yes ∧ Weight = 240 lbs) → Risk of heart disease = Yes]

is a classification rule. Classification rules are further quantified by two numbers called support and confidence. The support is defined as

sup(ant → Y_n) := number of records matching (ant → Y_n)

which is the number of records in the dataset that match both the antecedent and the consequent. The confidence is defined as

conf(ant → Y_n) = sup(ant → Y_n) / sup(ant)

or the support of (ant → Y_n) divided by the number of records that match the antecedent alone.
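As a minimal sketch of these two definitions, the snippet below counts support and confidence for one classification rule on a small made-up table. The six records and the helper names `support` and `confidence` are illustrative assumptions and are not part of the actionrules package.

```python
# Toy records, invented for illustration only.
records = [
    {"Smoking": "Yes", "Risk": "Yes"},
    {"Smoking": "Yes", "Risk": "Yes"},
    {"Smoking": "Yes", "Risk": "No"},
    {"Smoking": "No",  "Risk": "No"},
    {"Smoking": "No",  "Risk": "No"},
    {"Smoking": "Yes", "Risk": "Yes"},
]


def support(rows, conditions):
    """Count rows where every attribute in `conditions` has the given value."""
    return sum(
        all(row[attr] == value for attr, value in conditions.items())
        for row in rows
    )


def confidence(rows, antecedent, consequent):
    """sup(ant -> Y) / sup(ant); returns 0.0 when no row matches the antecedent."""
    sup_ant = support(rows, antecedent)
    if sup_ant == 0:
        return 0.0
    return support(rows, {**antecedent, **consequent}) / sup_ant


antecedent = {"Smoking": "Yes"}
consequent = {"Risk": "Yes"}

sup_rule = support(records, {**antecedent, **consequent})
conf_rule = confidence(records, antecedent, consequent)
print(sup_rule, conf_rule)
```

Here four records match the antecedent and three of those also match the consequent, so the support is 3 and the confidence is 3/4.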

Action rules are an extension of classification rules:

a_n = [f ∧ (X → X')] → (Y → Y')

where f is a set of fixed or unchangeable attributes. An action rule states that, given the fixed attributes f, changing the non-fixed (flexible) attributes from an initial set X to X' will change the outcome from Y to Y'. A concrete example is

[(Age = 55) ∧ (Smoking: Yes → No ∧ Weight: 240 lbs → 190 lbs)] → (Risk of heart disease: Yes → No)

In principle, this means that if our subject, with age held fixed, quit smoking and lost weight through diet and exercise, they would no longer be at risk of heart disease. We can again use support and confidence to quantify the quality of the action rules discovered by some method. The support of an action rule takes into account the two classification rules r_1 = (X → Y) and r_2 = (X' → Y') that constitute the action rule, and is defined as

sup(a_n) = min(sup(r_1), sup(r_2))

and the confidence of the action rule is defined as

conf(a_n) = conf(r_1) * conf(r_2).

Intuitively, the support of an action rule can be no larger than the smaller of the supports of its two classification rules. Because each confidence lies between 0 and 1, the confidence of the action rule is less than or equal to the confidence of either classification rule.
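The two formulas can be sketched in a few lines. The function names and the support and confidence values for r_1 and r_2 below are hypothetical, chosen only to show the arithmetic.

```python
def action_rule_support(sup_r1, sup_r2):
    """sup(a_n) = min(sup(r_1), sup(r_2))."""
    return min(sup_r1, sup_r2)


def action_rule_confidence(conf_r1, conf_r2):
    """conf(a_n) = conf(r_1) * conf(r_2)."""
    return conf_r1 * conf_r2


# Hypothetical values for r_1 = (X -> Y) and r_2 = (X' -> Y').
sup_r1, conf_r1 = 120, 0.80
sup_r2, conf_r2 = 45, 0.70

sup_a = action_rule_support(sup_r1, sup_r2)
conf_a = action_rule_confidence(conf_r1, conf_r2)
print(sup_a, round(conf_a, 2))
```

With these numbers the action rule inherits the smaller support (45), and its confidence of roughly 0.56 is below both 0.80 and 0.70.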

Finally, the uplift is defined as

uplift = P(outcome | treatment) - P(outcome | no treatment).

Any uplift model in general will attempt to estimate the two conditional probabilities above.
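As a rough illustration, the simplest possible estimate replaces each conditional probability with an observed frequency. The records and the helper `outcome_rate` below are made up for this sketch. Note that a plain frequency difference only reflects a causal effect if treatment was assigned independently of everything else that drives the outcome, which is an assumption here.

```python
# Toy records, invented for illustration: (treated, outcome) pairs.
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]


def outcome_rate(rows, treated):
    """Estimate P(outcome | treatment == treated) as a simple frequency."""
    group = [outcome for was_treated, outcome in rows if was_treated == treated]
    if not group:
        raise ValueError("no records in this group")
    return sum(group) / len(group)


p_treated = outcome_rate(records, True)     # 3 of 4 treated records
p_untreated = outcome_rate(records, False)  # 1 of 4 untreated records
uplift = p_treated - p_untreated
print(p_treated, p_untreated, uplift)
```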

Code Example

Details of the action rules discovery algorithms can be found in [1]. Briefly, the actionrules package incorporates a heuristic-based classification and action rule discovery algorithm that runs in a supervised manner (meaning we must specify the target or outcome labels). We will run the action rules model on a toy customer churn dataset from a telecom company, stored in a file called telco.csv:

https://github.com/hhl60492/actionrules/blob/master/notebooks/data/telco.csv

According to the Kaggle dataset page [2]:

The data set includes information about:

  • Customers who left within the last month: the column is called Churn
  • Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information: how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers: gender, age range, and if they have partners and dependents

First we install the actionrules package using the console:

    pip install actionrules-lukassykora

    # you can also call the following command in a Jupyter Notebook
    # !pip install actionrules-lukassykora

Next import the relevant packages

    import pandas as pd
    from actionrules.actionRulesDiscovery import ActionRulesDiscovery

Read in the dataset and check the head

    dataFrame = pd.read_csv("telco.csv", sep=";")
    dataFrame.head()

And now instantiate the action rules model and run a fit of the model on the data

    import time

    actionRulesDiscovery = ActionRulesDiscovery()
    actionRulesDiscovery.load_pandas(dataFrame)

    start = time.time()

    # define the stable and flexible attributes
    actionRulesDiscovery.fit(
        stable_attributes=["gender", "SeniorCitizen", "Partner"],
        flexible_attributes=[
            "PhoneService",
            "InternetService",
            "OnlineSecurity",
            "DeviceProtection",
            "TechSupport",
            "StreamingTV",
        ],
        consequent="Churn",        # outcome column
        conf=60,                   # minimum confidence (in %) for classification rules
        supp=4,                    # minimum support (in %) for classification rules
        desired_classes=["No"],    # outcome class
        is_nan=False,
        is_reduction=True,
        min_stable_attributes=1,   # min stable attributes in antecedent
        min_flexible_attributes=1  # min flexible attributes in antecedent
    )

    end = time.time()
    print("Time: " + str(end - start) + "s")

The run took approximately 9 seconds on a MacBook Pro with a 1.4 GHz Intel Core i5.

Next we count the number of action rules discovered:

    print(len(actionRulesDiscovery.get_action_rules()))

And it turns out there were 8 action rules discovered. Let’s now take a look at what the actual rules are:

    for rule in actionRulesDiscovery.get_action_rules_representation():
        print(rule)
        print(" ")

An example of one of the rules we discovered:

r = [(Partner: no) ∧ (InternetService: fiber optic → no) ∧ (OnlineSecurity: no → no internet service) ∧ (DeviceProtection: no → no internet service) ∧ (TechSupport: no → no internet service) ] ⇒ [Churn: Yes → No] with support: 0.06772682095697856, confidence: 0.5599898610564512 and uplift: 0.05620874238092184.
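To read the three numbers at the end of the rule more easily, we can convert them to percentages. This is only a small formatting sketch: the values are copied from the rule printed above, and the helper `as_percent` is a hypothetical name, not something the actionrules package provides.

```python
# Metrics copied from the rule printed above.
support = 0.06772682095697856
confidence = 0.5599898610564512
uplift = 0.05620874238092184


def as_percent(value, digits=1):
    """Format a fraction in [0, 1] as a percentage string."""
    return f"{value * 100:.{digits}f}%"


summary = {
    "support": as_percent(support),
    "confidence": as_percent(confidence),
    "uplift": as_percent(uplift),
}
print(summary)
```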

We've discovered an interesting phenomenon: telecom customers who are single (Partner: no) and go from fiber optic internet to no internet service are less likely to churn by approximately 5.6 percentage points, with a reasonably high support of about 6.8% and a confidence of about 56%.

From a business perspective, this suggests that we may need to tone down aggressive marketing of add-on internet services to single customers (Partner: no) in the short term to reduce churn. In the longer term we will need to design a better marketing strategy for that demographic, since as a telecom company we still want to sell as many additional internet services as possible.

The notebook with the code example and results above is here:

https://github.com/hhl60492/actionrules/blob/master/notebooks/Telco%20-%20Action%20Rules.ipynb

Feel free to play around with the flexible attributes and the confidence and support minimums, as modifying those hyperparameters can give different results.

Conclusion

We saw how classification rules and action rules are defined, how support and confidence are computed for each, the difference between fixed and flexible attributes, and an example of how action rules can be discovered using supervised learning and the actionrules Python package.

Finding action rules with high support, confidence and high uplift can give business stakeholders new insight on which actions to take in order to maximize a certain outcome.

[1] Sýkora, Lukáš, and Tomáš Kliegr. “Action Rules: Counterfactual Explanations in Python.” RuleML Challenge 2020. CEUR-WS. http://ceur-ws.org/Vol-2644/paper36.pdf

[2] https://www.kaggle.com/blastchar/telco-customer-churn

Translated from: https://towardsdatascience.com/action-rules-discovery-using-machine-learning-1cba6cd680d7
