Model index

  • Common academic terms and translations
  • Naive Bayes
  • Decision tree
  • Generalisation and Evaluation
  • Linear regression
  • Logistic regression
  • Regularisation
  • SVM
  • PCA
  • K-means
  • GMM
  • KNN
  • Hierarchical Agglomerative Clustering

Common academic terms and translations

independent and identically distributed: 独立同分布
dataset: 数据集
denote: 表示
Principal Components Analysis: 主成分分析
Logistic Regression: 逻辑回归
k-Nearest Neighbour Method: k近邻算法
Clustering: 聚类
Linear regression: 线性回归
eigenvector: 特征向量
eigenvalue: 特征值
bias: 偏置常数
maximum likelihood: 极大似然
gradient descent: 梯度下降
prior probabilities: 先验概率
hyperplane: 超平面
Information entropy: 信息熵


Naive Bayes

prior probabilities: $p(c_1)$, $p(c_2)$.

equation: $p(c_1|x) = \frac{p(c_1)p(x|c_1)}{p(x)} = \frac{p(c_1)p(x|c_1)}{p(c_1)p(x|c_1)+p(c_2)p(x|c_2)}$
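
As a quick numeric illustration of this rule, here is a minimal Python sketch with made-up priors and class-conditional likelihoods for a single observation $x$:

```python
# Minimal sketch of the two-class Bayes rule above; all numbers are made up.
p_c1, p_c2 = 0.4, 0.6       # prior probabilities p(c1), p(c2)
p_x_given_c1 = 0.20         # likelihood p(x|c1) for some observation x
p_x_given_c2 = 0.05         # likelihood p(x|c2)

evidence = p_c1 * p_x_given_c1 + p_c2 * p_x_given_c2   # p(x)
p_c1_given_x = p_c1 * p_x_given_c1 / evidence          # posterior p(c1|x)
p_c2_given_x = p_c2 * p_x_given_c2 / evidence          # posterior p(c2|x)
print(p_c1_given_x, p_c2_given_x)                      # the two posteriors sum to 1
```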

Decision tree

Information entropy calculation:

Entropy equation: $E = -\sum\limits_i^C p_i\log_2 p_i$

Example: in one split we have 1 red, 2 blue and 3 green points:
$E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{2}{6}\log_2(\frac{2}{6})+\frac{3}{6}\log_2(\frac{3}{6})) \approx 1.46$

If we get 3 blues in one split (only blue):
$E = -(1\log_2(1)) = 0$

Information gain:

Consider the entropy before the split:
$E = -(\frac{5}{10}\log_2(\frac{5}{10})+\frac{5}{10}\log_2(\frac{5}{10})) = 1$

and the entropy after the split:
the left split: $E = 0$; the right split: $E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{5}{6}\log_2(\frac{5}{6})) \approx 0.65$
In the final calculation, we weight the two splits by their sizes:
$E_{split} = \frac{4}{10}\times 0+\frac{6}{10}\times 0.65 = 0.39$

So the $Gain = 1-0.39 = 0.61$ (equal to how much entropy we removed).

In other words: higher information gain means more entropy removed.
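
A small Python sketch that reproduces the numbers above (entropy from class counts, and the information gain of the 4/6 split); it assumes base-2 logarithms throughout, as in the entropy equation:

```python
import numpy as np

def entropy(counts):
    """Entropy of a split given class counts: E = -sum_i p_i * log2(p_i)."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()            # drop empty classes, normalise to probabilities
    return -np.sum(p * np.log2(p))

print(entropy([1, 2, 3]))             # ~1.46  (1 red, 2 blue, 3 green)
print(entropy([3]))                   # 0      (a pure split: only blue)

# Information gain of splitting 10 points (5 vs 5) into a pure 4-point split
# and a 6-point split with class counts 1 and 5:
e_before = entropy([5, 5])                                   # 1.0
e_split = 4/10 * entropy([4]) + 6/10 * entropy([1, 5])       # ~0.39
print(e_before - e_split)                                    # gain ~0.61
```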

Generalisation and Evaluation

Under-fitting:

predictor too simplistic (too rigid)
not powerful enough to capture salient patterns in data
can find another predictor $F'$ with smaller $E_{train}$ and $E_{gen}$

Over-fitting:

predictor too complex (flexible)
fits “noise” in the training data

ROC curve:

                   real Positive   real Negative
Predict Positive        TP              FP
Predict Negative        FN              TN

$TPR = \frac{TP}{P}=\frac{TP}{TP+FN}$

$FPR = \frac{FP}{N}=\frac{FP}{FP+TN}$

A better model has a smaller FPR and a larger TPR.

AUC (Area Under the Curve)

range: 0~1

The larger the AUC is, the better the model predicts.
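
A minimal sketch of how TPR and FPR are computed from labels and thresholded scores; the labels and scores below are made up, and sweeping the threshold traces out points on the ROC curve:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])                      # made-up ground truth
y_score = np.array([0.9, 0.8, 0.35, 0.6, 0.3, 0.7, 0.2, 0.55])   # made-up predicted probabilities

def tpr_fpr(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))          # TPR = TP/P, FPR = FP/N

# Sweep the decision threshold; each (TPR, FPR) pair is one point on the ROC curve.
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(t, tpr_fpr(y_true, (y_score >= t).astype(int)))
```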

Linear regression

process:

  1. Draw a line through the data.
  2. Measure the distance from the line to each data point, square each distance (these are the residuals), and add them up (record this sum of squared residuals).
  3. Rotate the line a little bit, measure the residuals, square them, and sum the squares (record it).
  4. Repeat steps 2 and 3 many times, then look through the recorded sums of squared residuals and choose the parameters with the smallest sum as the output (“Least Squares”); a minimal sketch of this procedure follows the list.
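
A literal sketch of this procedure: try many candidate slopes for the line, record the sum of squared residuals for each, and keep the smallest. The data and the slope grid are made up; in practice the closed-form least-squares solution is used instead of a search.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

best = None
for slope in np.linspace(-2, 2, 401):         # "rotate the line a little bit"
    intercept = y.mean() - slope * x.mean()   # the least-squares line passes through the mean point
    residuals = y - (slope * x + intercept)
    ssr = float(np.sum(residuals ** 2))       # sum of squared residuals
    if best is None or ssr < best[0]:
        best = (ssr, float(slope), float(intercept))

print(best)   # smallest sum of squared residuals and the corresponding slope/intercept
```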

Initial:

$SS(mean) = \sum(data-mean)^2$

Variance around the mean $= \frac{\sum(data-mean)^2}{n}$

$Var(mean) = \frac{SS(mean)}{n}$

Fit:

$SS(fit) = \sum(data-line)^2$

$Var(fit) = \frac{SS(fit)}{n}$

$R^2 = \frac{Var(mean)-Var(fit)}{Var(mean)} = \frac{SS(mean)-SS(fit)}{SS(mean)}$

$R^2$ tells us how much of the variance in the y axis can be explained by taking the x axis into account.
For example: if we get Var(mean) = 11.1 and Var(fit) = 4.4,
we get $R^2 = \frac{11.1-4.4}{11.1} \approx 0.6 = 60\%$

There is a $60\%$ reduction in variance when we take the x axis into account.
Alternatively, we can say that the x axis “explains” $60\%$ of the variance in the y axis.

In linear regression, equations with more parameters will never make SS(fit) worse than equations with fewer parameters.
Why? Because least squares will drive the coefficient of any term that makes SS(fit) worse to 0, so, in a sense, that term no longer exists.
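
A tiny sketch that reproduces the worked $R^2$ numbers above, assuming Var(mean) = 11.1 and Var(fit) = 4.4 are already known:

```python
var_mean = 11.1   # variance of the data around its mean
var_fit = 4.4     # variance of the data around the fitted line

r_squared = (var_mean - var_fit) / var_mean
print(r_squared)  # ~0.60: x "explains" about 60% of the variance in y
```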

In summary, linear regression:

  1. Quantifies the relationship in the data (this is $R^2$). This needs to be large.
  2. Determines how reliable that relationship is (this is the p-value, which we calculate with F, the F-distribution). This needs to be small.

Logistic regression

formula: $p = \mathrm{sigmoid}(y) = \frac{1}{1+e^{-y}} = \frac{1}{1+e^{-(w^Tx+w_0)}}$

Evaluation of the model:
Cross-Entropy Loss: $L = -\sum [y_{true}\log(p)+(1-y_{true})\log(1-p)]$
Target: minimize the loss function.
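
A minimal sketch of the sigmoid and the cross-entropy loss above; the weights and data are made up, and a real model would fit $w$ and $w_0$ by gradient descent on this loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p):
    # L = -sum[ y*log(p) + (1-y)*log(1-p) ]
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])   # made-up features
y = np.array([1, 0, 1])                                # made-up labels
w, w0 = np.array([0.8, -0.3]), 0.1                     # made-up parameters

p = sigmoid(X @ w + w0)         # predicted probabilities
print(p, cross_entropy(y, p))   # training would adjust w, w0 to minimise this loss
```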

Pros:

  1. generates a probability for unlabelled data

  2. reduces the influence of outliers, making the prediction more accurate.

Cons:

  1. only works well when the data is (approximately) linearly separable, since the decision boundary is linear

Regularisation

How to deal with overfitting:

  1. Reduce the number of features.
    – Manually select which features to keep.
    – Model selection algorithms (later in the course).
  2. Regularization.
    – Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
    – Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
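
A minimal sketch of the second option: adding an L2 penalty on the parameters to a squared-error loss. The penalty strength `lam` and all data below are made up.

```python
import numpy as np

def regularised_loss(w, X, y, lam):
    """Squared error plus an L2 penalty that shrinks the parameters."""
    errors = X @ w - y
    return float(np.sum(errors ** 2) + lam * np.sum(w ** 2))

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])   # made-up features
y = np.array([2.0, 2.5, 4.0])                        # made-up targets
w = np.array([1.0, 0.2])

print(regularised_loss(w, X, y, lam=0.0))   # plain squared error
print(regularised_loss(w, X, y, lam=1.0))   # larger weights are penalised more heavily
```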

SVM

How Linear SVM works:

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs.

target:

find the maximum-margin hyperplane:

$\left\{\begin{aligned}w^Tx_i+w_0 \ge 1, && \text{for } y_i=+1 \\ w^Tx_i+w_0 \le -1, && \text{for } y_i=-1 \end{aligned}\right.$

and we have two hyperplanes:

$\left\{\begin{aligned}w^Tx_i+w_0 &= 1 \\ w^Tx_i+w_0 &= -1 \end{aligned}\right.$

For the points that are support vectors:

the distance to the hyperplane is $\frac{1}{||w||}$,
so the width of the margin is $d_+ + d_- = \frac{2}{||w||}$.

As we can see, we want to maximize $\frac{2}{||w||}$.
Equivalently, we do this by minimizing $\frac{||w||^2}{2}\ (=\frac{w^Tw}{2})$.

constraints:

$y_i(w^Tx_i+w_0)-1 \ge 0$
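
A small numeric check of these quantities: for a made-up $w$ and $w_0$ that separate some made-up points, the margin width is $\frac{2}{||w||}$ and every point should satisfy $y_i(w^Tx_i+w_0)-1 \ge 0$:

```python
import numpy as np

w, w0 = np.array([2.0, 1.0]), -3.0                              # made-up separating hyperplane
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 1.0], [0.5, 0.0]])  # made-up points
y = np.array([1, 1, -1, -1])                                    # made-up labels

margin_width = 2.0 / np.linalg.norm(w)   # 2 / ||w||
constraints = y * (X @ w + w0) - 1       # each entry must be >= 0
print(margin_width, constraints)
```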

For Nonlinear case:

we use the kernel function $K(x,x') = \varphi(x)^T \varphi(x')$

Example: assume $x = [x_1,x_2]^T$; to transform the data set into a quadratic feature set,
$x \rightarrow \varphi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1]^T$

$K(x',x) = \varphi(x')^T\varphi(x) = x_1^2x_1'^2+x_2^2x_2'^2+2x_1x_2x_1'x_2'+2x_1x_1'+2x_2x_2'+1 = (x_1x_1'+x_2x_2'+1)^2 = (1+x^Tx')^2$

For kernel functions we have:

Linear kernel: $K(x,x') = x^Tx'$

Polynomial kernel: $K(x,x') = [1+x^Tx']^k$

Radial basis kernel: $K(x,x') = \exp[-\frac{1}{2}||x-x'||^2]$
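
A sketch of the three kernels above, with a numeric check that the degree-2 polynomial kernel matches the explicit quadratic feature map $\varphi$ from the earlier example (the test vectors are made up):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, k=2):
    return (1 + x @ xp) ** k

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.linalg.norm(x - xp) ** 2)

def phi(x):
    """Explicit quadratic feature map from the example above."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(polynomial_kernel(x, xp, k=2), phi(x) @ phi(xp))   # the two values agree
```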

PCA

Principal component analysis (PCA) is a commonly used unsupervised learning method. It uses an orthogonal transformation to convert observations of possibly correlated variables into a (usually smaller) set of values of linearly uncorrelated variables, which are called principal components.

How PCA works:

The target of PCA is to find a transformation matrix $w$ to transform high-dimensional vectors into low-dimensional vectors. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
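
A minimal PCA sketch along these lines: centre the data, eigendecompose the covariance matrix, and project onto the top principal components. The data below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up data: 100 samples, 3 features
X[:, 2] = X[:, 0] + 0.1 * X[:, 2]    # make one direction carry most of the variance

Xc = X - X.mean(axis=0)                   # centre the data
cov = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort components by variance, descending
W = eigvecs[:, order[:2]]                 # top-2 principal components

Z = Xc @ W                                # project onto the new coordinates
print(eigvals[order], Z.shape)
```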

K-means

process:
1. Select the number of clusters you want to identify in your data. This is the “K” in “K-means clustering”.
2. Randomly select K distinct data points (these are the initial clusters).
3. Measure the distance between the first point and the K initial clusters.
4. Assign the first point to the nearest cluster.
5. Repeat steps 3 and 4 for each of the remaining points.
6. Calculate the mean of each cluster.
7. Repeat steps 3 to 6 until the cluster assignments do not change between iterations, then stop.
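
A minimal sketch of this loop (the data, K and the random seed are made up; a production implementation would also handle clusters that lose all their points):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]     # step 2: K random data points
    for _ in range(n_iter):
        # steps 3-5: assign every point to its nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 6: recompute each cluster mean (assumes no cluster ends up empty)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                  # step 7: assignments stable
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in (0, 5)])   # two made-up blobs
print(kmeans(X, k=2)[1])                                           # centres near (0,0) and (5,5)
```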

The K-means algorithm is sensitive to the initial clusters (the randomly chosen starting points).

how to choose the best K:

We can run the K-means algorithm for K = 1 up to K = n (where n is the number of points), recording the reduction in variation each time. Plotting these values gives an “elbow plot”, and we can pick K by finding the “elbow” in the plot, i.e. the point after which the reduction in variation slows down sharply.

how is K-means clustering different from hierarchical clustering?

K-means clustering specifically tries to put the data into the number of clusters we tell it to.
Hierarchical clustering just tells us, pairwise, which two things are most similar.

GMM

process:
1. Guess the number of clusters (the number of Gaussian distributions).
2. For every Gaussian distribution, give its parameters (expectation, variance and weight) random initial values.
3. For every instance, calculate the probability that it belongs to each Gaussian distribution.
4. For each Gaussian distribution, the contribution of each instance can be represented by its probability; use these probabilities as weights to calculate a new expectation and variance that replace the old ones.
5. Repeat steps 3 and 4 until each Gaussian distribution’s expectation and variance converge.
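
A minimal 1-D sketch of this EM loop with two Gaussian components and made-up data; the per-instance probabilities in steps 3-4 appear as the responsibility matrix `resp`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])   # made-up 1-D data

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# steps 1-2: two components with rough initial parameters
mu, var, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # step 3 (E-step): probability of each instance under each component
    resp = w * gauss(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # step 4 (M-step): probability-weighted updates of weight, mean and variance
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(mu, var, w)   # the two means converge near 0 and 5
```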

KNN

How KNN works:

KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).
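
A minimal sketch of the classification case: compute the distance from the query to every training example, take the K nearest, and vote. The training data, query and K are made up.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training example
    nearest = np.argsort(dists)[:k]                   # indices of the K closest examples
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                # most frequent label wins

X_train = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                    [5.0, 5.0], [5.5, 4.5], [6.0, 5.0]])   # made-up training data
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([0.8, 0.2]), k=3))   # predicts class 0
```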

How to choose the best K:

The optimal value of K can be selected using a held-out validation dataset: we pick the value of K that gives the lowest classification error over the validation dataset.

How K impact the result:

Low values of K (like K=1 or K=2) can be noisy and subject to the effects of outliers.
Large values of K smooth things out, but we don’t want K to be so large that a category with only a few samples in it is always outvoted by other categories.

Hierarchical Agglomerative Clustering

process description:

Agglomerative clustering starts by assigning each instance to its own cluster, then iteratively merges pairs of clusters into a replacement cluster until we are left with a single cluster containing all the instances. This single cluster is the root of the hierarchy (called a dendrogram), the two clusters that were merged into it are its children, and so on recursively. The algorithm runs for n − 1 iterations, where n is the number of datapoints. At each iteration we merge the two clusters that have the smallest distance between them, according to some distance metric.

single linkage:

This is the distance between the closest members of the two clusters.

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 3 (the minimum distance between -1 and the group {2,3}, i.e. between -1 and 2).
4. Merge {-1,2,3} and {8,10}, since their distance is 5 (the minimum distance between {-1,2,3} and {8,10}, i.e. between 3 and 8).
5. Completed.

formula: $\min\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

complete linkage:

This is the distance between the members that are the farthest apart.

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 4 (the maximum distance between -1 and the group {2,3}, i.e. between -1 and 3).
4. Merge {-1,2,3} and {8,10}, since their distance is 11 (the maximum distance between {-1,2,3} and {8,10}, i.e. between -1 and 10).
5. Completed.

formula: $\max\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

average linkage:

This method involves looking at the distances between all pairs and averaging all of these distances. It is also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean).

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 3.5 (the average distance between -1 and the group {2,3}, i.e. ((distance from -1 to 2) + (distance from -1 to 3)) / 2 = (3+4)/2).
4. Merge {-1,2,3} and {8,10}, since their distance is 7.67 (the average distance between {-1,2,3} and {8,10}, i.e. the mean of all six pairwise distances).
5. Completed.

formula: $\frac{1}{n_1n_2} \sum\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$ (assuming $c_1$ has $n_1$ points and $c_2$ has $n_2$ points)

In plain terms: single linkage picks the smallest of the minimum pairwise distances, complete linkage picks the smallest of the maximum pairwise distances, and average linkage picks the smallest of the average (intermediate) pairwise distances.
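
A short sketch that reproduces the inter-cluster distances used in the worked examples above, for the clusters {-1, 2, 3} and {8, 10}:

```python
import numpy as np

def linkage_distances(c1, c2):
    """Single, complete and average linkage distances between two 1-D clusters."""
    d = np.abs(np.array(c1)[:, None] - np.array(c2)[None, :])   # all pairwise distances
    return float(d.min()), float(d.max()), float(d.mean())

print(linkage_distances([-1, 2, 3], [8, 10]))   # (5, 11, ~7.67): single, complete, average
print(linkage_distances([-1], [2, 3]))          # (3, 4, 3.5)
```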
