机器学习——决策树(ID3)的实现

相关定义的补充：
熵
在信息论与概率统计中，熵（entropy）是表示随机变量不确定性的度量。设XXX是一个取有限个值的离散随机变量，其概率分布为：P(X=xi)=pi,i=1,2,⋯,nP(X=x_i)=p_i,i=1,2,\cdots ,nP(X=xi)=pi,i=1,2,⋯,n，则随机变量XXX的熵定义为：
H(X)=−∑i=1npilogpi，（其中，对数以2为底或e为底）H(X)=-\sum\limits_{i=1}^np_i\ log\ p_i，（其中，对数以2为底或e为底） H(X)=−i=1∑npi log pi，（其中，对数以2为底或e为底）

熵的单位分别称做比特(bit)或纳特(nat)。熵越大，随机变量的不确定性就越大。

条件熵
条件熵（conditional entropy）H(Y∣X)H(Y|X)H(Y∣X)表示在已知随机变量XXX的条件下随机变量YYY的不确定性，定义为XXX给定条件下YYY的条件概率分布的熵对XXX的数学期望：
H(Y∣X)=∑i=1npiH(Y∣X=xi),(其中，pi=P(X=xi),i=1,2,⋯,n.)H(Y|X)=\sum\limits_{i=1}^np_iH(Y|X=x_i),(其中，p_i=P(X=x_i),i=1,2,\cdots,n.) H(Y∣X)=i=1∑npiH(Y∣X=xi),(其中，pi=P(X=xi),i=1,2,⋯,n.)

信息增益
信息增益（information gain）表示得知特征XXX的信息而使得类YYY的信息的不确定性减少的程度。特征AAA对训练数据集DDD的信息增益g(D,A)g(D,A)g(D,A)，定义为集合DDD的信息熵H(D)H(D)H(D)与特征AAA给定条件下DDD的经验条件熵H(D∣A)H(D|A)H(D∣A)之差，即：
g(D,A)=H(D)−H(D∣A)g(D,A)=H(D)-H(D|A) g(D,A)=H(D)−H(D∣A)
对于数据集DDD而言，信息增益依赖于特征，不同的特征往往具有不同的信息增益。信息增益大的特征具有更强的分类能力。

信息增益准则选择特征的方法：对训练数据集DDD，计算其每个特征的信息增益，并比较他们的大小，选择信息增益最大的特征。

ID3算法的核心是在决策树各个结点上应用信息增益准则选择特征，递归的构建决策树。

构建过程：从根结点出发，对结点计算所有特征的信息增益，选择信息增益最大的特征作为结点的特征，由该特征的不同取值建立子结点；再对子结点递归调用以上方法，直到所有特征信息增益均很小或没有特征可以选择为止。

相关数据的定义：

 public class ID3 {/*** The data set.*/Instances dataset;/*** Is this data set pure (only one label)?*/boolean pure;/*** The number of classes. For binary classification it is 2.*/int numClasses;/*** Available instances. Other instances do not belong this branch.*/int[] availableInstances;/*** Available attributes. Other attributes have been selected in the path from* root.*/int[] availableAttributes;/*** The selected attribute.*/int splitAttribute;/*** The children nodes.*/ID3[] children;/*** The label. Inner nodes also have a label. For example, <outlook = sunny,* humidity = high> never appear in the training data, but <humidity = high> is* valid in other cases.*/int label;/*** The prediction,including queried and predicted labels.*/int[] predicts;/*** Small block cannot be split further.*/static int smallBlockThreshold = 3;

第一个构造函数，当读文件初始化数据的时候调用：

 /************************ The constructor.** @param paraFilename The given file.**********************/public ID3(String paraFilename) {dataset = null;try {FileReader fileReader = new FileReader(paraFilename);dataset = new Instances(fileReader);fileReader.close();} catch (Exception ee) {System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);System.exit(0);} // Of trydataset.setClassIndex(dataset.numAttributes() - 1);numClasses = dataset.numClasses();availableInstances = new int[dataset.numInstances()];for (int i = 0; i < availableInstances.length; i++) {availableInstances[i] = i;} // Of for iavailableAttributes = new int[dataset.numAttributes()];for (int i = 0; i < availableAttributes.length; i++) {availableAttributes[i] = i;} // Of for i// Initialize.children = null;// Determine the label by simple voting.label = getMajorityClass(availableAttributes);// Determine whether or not it is pure.pure = pureJudge(availableInstances);}// Of the constructor

第二个构造函数，当递归构建决策树时使用：

 /************************ The constructor.** @param paraDataset             The given data set.* @param paraAvailableInstances  The available instances.* @param paraAvailableAttributes The available attributes.**********************/public ID3(Instances paraDataset, int[] paraAvailableInstances, int[] paraAvailableAttributes) {// Copy its reference instead of clone the available instances.dataset = paraDataset;availableInstances = paraAvailableInstances;availableAttributes = paraAvailableAttributes;// Initializechildren = null;// Determine the label by simple voting.label = getMajorityClass(availableInstances);// Determine whether or not it is pure.pure = pureJudge(availableInstances);}// Of the second constructor

判断当前结点划分的数据集是否为同一类，如果是，那么就不用继续分支了。

 /*********************** Is the given block pure?** @param paraBlock The block.* @return True if pure.**********************/public boolean pureJudge(int[] paraBlock) {pure = true;for (int i = 0; i < paraBlock.length; i++) {if (dataset.instance(paraBlock[i]).classValue() != dataset.instance(paraBlock[0]).classValue()) {pure = false;break;} // Of if} // Of for ireturn pure;}// Of pureJudge

投票选取当前结点中个数最多的类来代表当前结点的类。

 /*********************** Compute the majority class of the given block for voting.** @param paraBlock The block.* @return The majority class.**********************/public int getMajorityClass(int[] paraBlock) {int[] tempClassCounts = new int[dataset.numClasses()];for (int i = 0; i < paraBlock.length; i++) {tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;} // Of for i;int resultMajorityClass = -1;int tempMaxCount = -1;for (int i = 0; i < tempClassCounts.length; i++) {if (tempClassCounts[i] > tempMaxCount) {resultMajorityClass = i;tempMaxCount = tempClassCounts[i];} // Of if} // Of for ireturn resultMajorityClass;}// Of getMajorityClass

选择信息增益最大(条件熵最小)的特征作为当前结点的特征：

 /*********************** Select the best attribute.** @return The best attribute index.**********************/public int selectBestAttribute() {splitAttribute = -1;double tempMinimalEntropy = 10000;double tempEntropy;for (int i = 0; i < availableAttributes.length; i++) {tempEntropy = conditionalEntropy(availableAttributes[i]);if (tempMinimalEntropy > tempEntropy) {tempMinimalEntropy = tempEntropy;splitAttribute = availableAttributes[i];} // Of if} // Of for ireturn splitAttribute;}// Of selectBestAttribute

计算特征下的条件熵：

 /*********************** Compute the conditional entropy of an attribute.* * @param paraAttribute The given attribute.* @return The entropy.**********************/public double conditionalEntropy(int paraAttribute) {// Step 1. Statistics.int tempNumClasses = dataset.numClasses();int tempNumValues = dataset.attribute(paraAttribute).numValues();int tempNumInstances = availableAttributes.length;double[] tempValueCounts = new double[tempNumValues];double[][] tempCountMatrix = new double[tempNumValues][tempNumClasses];int tempClass, tempValue;for (int i = 0; i < tempNumClasses; i++) {tempClass = (int) dataset.instance(availableInstances[i]).classValue();tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);tempValueCounts[tempValue]++;tempCountMatrix[tempValue][tempClass]++;} // Of for i// Step 2.double reslutEntropy = 0;double tempEntropy, tempFraction;for (int i = 0; i < tempNumValues; i++) {if (tempValueCounts[i] == 0) {continue;} // Of iftempEntropy = 0;for (int j = 0; j < tempNumClasses; j++) {tempFraction = tempCountMatrix[i][j] / tempValueCounts[i];if (tempFraction == 0) {continue;} // Of iftempEntropy += -tempFraction * Math.log(tempFraction);} // Of for jreslutEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy;} // Of for ireturn reslutEntropy;}// Of conditionalEntropy

选择特征后，需要将数据集根据特征值划分成多个子数据集，用于根据各个子数据集递归构建子结点：

 /*********************** Split the data according to the given attribute.** @return The blocks.**********************/public int[][] splitData(int paraAttribute) {int tempNumValues = dataset.attribute(paraAttribute).numValues();int[][] resultBlocks = new int[tempNumValues][];int[] tempSizes = new int[tempNumValues];// First scan to count the size of each block.int tempValue;for (int i = 0; i < availableInstances.length; i++) {tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);tempSizes[tempValue]++;} // Of for i// Allocate space.for (int i = 0; i < tempNumValues; i++) {resultBlocks[i] = new int[tempSizes[i]];} // Of for i// Second scan to fill;Arrays.fill(tempSizes, 0);for (int i = 0; i < availableInstances.length; i++) {tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);// Copy data.resultBlocks[tempValue][tempSizes[tempValue]] = availableInstances[i];tempSizes[tempValue]++;} // Of for ireturn resultBlocks;}// Of splitData

构建决策树：

 /*********************** Build the tree recursively.**********************/public void buildTree() {if (pureJudge(availableInstances)) {return;} // Of ifif (availableInstances.length <= smallBlockThreshold) {return;} // Of ifselectBestAttribute();int[][] tempSubBlocks = splitData(splitAttribute);children = new ID3[tempSubBlocks.length];// Construct the remaining attribute set.int[] tempRemainingAttributes = new int[availableAttributes.length - 1];for (int i = 0; i < availableAttributes.length; i++) {if (availableAttributes[i] < splitAttribute) {tempRemainingAttributes[i] = availableAttributes[i];} else if (availableAttributes[i] > splitAttribute) {tempRemainingAttributes[i - 1] = availableAttributes[i];} // Of if} // Of for i// Construct children.for (int i = 0; i < children.length; i++) {if ((tempSubBlocks[i] == null) || (tempSubBlocks[i].length == 0)) {children[i] = null;continue;} else {children[i] = new ID3(dataset, tempSubBlocks[i], tempRemainingAttributes);// Important code: do this recursively.children[i].buildTree();} // Of if} // Of for i}// Of buildTree

以实验数据weather.arff为例：

计算Outlook属性的条件熵为例：

用同样的方式计算其它属性下的条件熵，选择条件熵最小的特征：

然后递归建立子树结点：

再继续向下建立子树：

完成children[0]的子树建立后，回到函数调用的地方，开始建立children[1]：

创建根结点的最后一个子树：

和children[0]同样，需要建立子树的子树：

建树完成：

分类和测试：
分类的过程就是根据当前结点的特征向下层遍历，直到叶节点或结点为null。

1：children == null：说明当前结点为叶子结点，直接返回当前结点的标签；
2：当前特征的某个属性值指向的结点为null：返回当前结点的标签；

 /*********************** Classify an instance.** @param paraInstance The given instance.* @return The prediction.**********************/public int classify(Instance paraInstance) {if (children == null) {return label;} // Of ifID3 tempChild = children[(int) paraInstance.value(splitAttribute)];if (tempChild == null) {return label;} // Of ifreturn tempChild.classify(paraInstance);}// Of classify/*********************** Test on a testing set.** @param paraDataset The given testing data.* @return The accuracy.**********************/public double test(Instances paraDataset) {double tempCorrect = 0;for (int i = 0; i < paraDataset.numInstances(); i++) {if (classify(paraDataset.instance(i)) == (int) paraDataset.instance(i).classValue()) {tempCorrect++;} // Of if} // Of for ireturn tempCorrect / paraDataset.numInstances();}// Of test/*********************** Test on the training set.** @return The accuracy.**********************/public double selfTest() {return test(dataset);}// Of selfTest

以第一个实例的分类测试过程为例：
首先进入根结点，由于Outlook==Suny（代码中的体现：children[(int) paraInstance.value(0)]），会进入特征为Humidity的结点（children[0]），然后Humidity==High（代码中的体现：children[(int) paraInstance.value(2)]），最后到达叶子结点，返回叶子结点的label。

为了能更美观的打印决策树，我重新写了一个方法：

 /*********************** Print the tree.** @param paraString The given string.**********************/public void printDecisionTree(String paraString) {if (children == null) {System.out.println(paraString + "class: " + label + "\r\n");return;}for (int i = 0; i < children.length; i++) {String tempAttributeName = dataset.attribute(splitAttribute).name();String tempString = tempAttributeName + ": " + dataset.attribute(splitAttribute).value(i) + "----> ";if (children[i] == null) {System.out.println(paraString + tempString + "class: " + label + "\r\n");} else {children[i].printDecisionTree(paraString + tempString);} // Of if} // Of for i}// Of printDecisionTree

运行结果：

机器学习——决策树(ID3)的实现相关推荐

机器学习决策树ID3
一.决策树(ID3算法) 1. 决策树的基本认识决策树是一种依托决策而建立起来的一种树.在机器学习中,决策树是一种预测模型,代表的是一种对象属性与对象值之间的一种映射关系,每一个节点代表某个对象,树 ...
机器学习决策树 ID3
构造决策树时需要解决的第一个问题是:当前数据集中哪个特征在划分数据分类时起决定性作用. 划分数据集的原则:将无序的数据变得更加有序信息增益:在划分数据集前后信息发生的变化获得信息增益最大的特征就是 ...
机器学习算法—决策树(ID3)算法
机器学习--决策树(ID3)算法 1.决策树(ID3)算法 1.1 算法引入我们首先以一个分类问题开始,假设我们有这样一份样本数据: 我们的目标是想通过色泽.根蒂.敲声.纹理.脐部.触感来判断这是不 ...
机器学习算法 04 —— 决策树(ID3、C4.5、CART，剪枝，特征提取，回归决策树)
文章目录系列文章决策树 1 决策树算法简介 2 决策树分类的原理 2.1 信息熵 2.2 决策树划分依据-信息增益(ID3) 2.3 决策树划分依据-信息增益率(C4.5) 2.4 决策树划分依据 ...
python决策树id3算法_python实现决策树ID3算法
一.决策树概论决策树是根据训练数据集,按属性跟类型,构建一棵树形结构.可以按照这棵树的结构,对测试数据进行分类.同时决策树也可以用来处理预测问题(回归). 二.决策树ID3的原理有多种类型的决策树 ...
++代码实现感知机的原理_决策树ID3原理及R语言python代码实现（西瓜书）
决策树ID3原理及R语言python代码实现(西瓜书) 摘要: 决策树是机器学习中一种非常常见的分类与回归方法,可以认为是if-else结构的规则.分类决策树是由节点和有向边组成的树形结构,节点表示特 ...
3.1 机器学习 --- 决策树
3.1 机器学习 - 决策树一金融评分卡模型做好后尽量快的上线,以符合要求的模型,尽快上线. 超参数赋予权重,使得模型的参数不一致,而达到均衡样本数据二决策树原理 1. 找女朋友性别:男 ...
[机器学习]决策树选西瓜
文章目录一.决策树 1.画法 2.决策树的剪枝 3.挑西瓜决策树 3.1利用信息增益选择最优划分属性二.sk-learn库对西瓜数据集,分别进行ID3.C4.5和CART的算法代码实现 1.ID3 ...
决策树ID3代码（Python）
决策树ID3代码(Python) 上一节介绍了决策树的递归算法框架(create),这一节将用Python语言实现一个小程序.代码和数据来源于<机器学习实战>Peter Harringto ...
机器学习 | 决策树原理剪枝连续值缺失值处理
机器学习 | 决策树最近学习了西瓜书决策树一章,简单整理了笔记,后续会继续补充给文章目录机器学习 | 决策树一.基本信息论 1. 信息量 2. 信息熵二.决策树常用算法 1.ID3 2.C4 ...

机器学习——决策树(ID3)的实现

机器学习——决策树(ID3)的实现相关推荐

最新文章

热门文章