Model index

  • Common academic terms and translations
  • Naive Bayes
  • Decision tree
  • Generalisation and Evaluation
  • Linear regression
  • Logistic regression
  • Regularisation
  • SVM
  • PCA
  • K-means
  • GMM
  • KNN
  • Hierarchical Agglomerative Clustering

Common academic terms and translations

independent and identically distributed: 独立同分布
dataset: 数据集
denote: 表示
Principal Components Analysis: 主成分分析
Logistic Regression: 逻辑回归
k-Nearest Neighbour Method: k近邻算法
Clustering: 聚类
Linear regression: 线性回归
eigenvector: 特征向量
eigenvalue: 特征值
bias: 偏置常数
maximum likelihood: 极大似然
gradient descent: 梯度下降
prior probabilities: 先验概率
hyperplane: 超平面
Information entropy: 信息熵


Naive Bayes

prior probabilities: $p(c_1)$, $p(c_2)$.

equation: $p(c_1|x) = \frac{p(c_1)p(x|c_1)}{p(x)} = \frac{p(c_1)p(x|c_1)}{p(c_1)p(x|c_1)+p(c_2)p(x|c_2)}$
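
As a quick numeric illustration of this rule, here is a minimal Python sketch with made-up priors and class-conditional likelihoods for a single observation $x$:

```python
# Minimal sketch of the two-class Bayes rule above; all numbers are made up.
p_c1, p_c2 = 0.4, 0.6       # prior probabilities p(c1), p(c2)
p_x_given_c1 = 0.20         # likelihood p(x|c1) for some observation x
p_x_given_c2 = 0.05         # likelihood p(x|c2)

evidence = p_c1 * p_x_given_c1 + p_c2 * p_x_given_c2   # p(x)
p_c1_given_x = p_c1 * p_x_given_c1 / evidence          # posterior p(c1|x)
p_c2_given_x = p_c2 * p_x_given_c2 / evidence          # posterior p(c2|x)
print(p_c1_given_x, p_c2_given_x)                      # the two posteriors sum to 1
```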

Decision tree

Information entropy calculation:

Entropy equation: $E = -\sum\limits_i^C p_i\log_2 p_i$

Example: in one split we have 1 red, 2 blue and 3 green points:
$E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{2}{6}\log_2(\frac{2}{6})+\frac{3}{6}\log_2(\frac{3}{6})) \approx 1.46$

If we get 3 blues in one split (only blue):
$E = -(1\log_2(1)) = 0$

Information gain:

Consider the entropy before the split:
$E = -(\frac{5}{10}\log_2(\frac{5}{10})+\frac{5}{10}\log_2(\frac{5}{10})) = 1$

and the entropy after the split:
the left split: $E = 0$; the right split: $E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{5}{6}\log_2(\frac{5}{6})) \approx 0.65$
In the final calculation, we weight the two splits by their sizes:
$E_{split} = \frac{4}{10}\times 0+\frac{6}{10}\times 0.65 = 0.39$

So the $Gain = 1-0.39 = 0.61$ (equal to how much entropy we removed).

In other words: higher information gain means more entropy removed.
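
A small Python sketch that reproduces the numbers above (entropy from class counts, and the information gain of the 4/6 split); it assumes base-2 logarithms throughout, as in the entropy equation:

```python
import numpy as np

def entropy(counts):
    """Entropy of a split given class counts: E = -sum_i p_i * log2(p_i)."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()            # drop empty classes, normalise to probabilities
    return -np.sum(p * np.log2(p))

print(entropy([1, 2, 3]))             # ~1.46  (1 red, 2 blue, 3 green)
print(entropy([3]))                   # 0      (a pure split: only blue)

# Information gain of splitting 10 points (5 vs 5) into a pure 4-point split
# and a 6-point split with class counts 1 and 5:
e_before = entropy([5, 5])                                   # 1.0
e_split = 4/10 * entropy([4]) + 6/10 * entropy([1, 5])       # ~0.39
print(e_before - e_split)                                    # gain ~0.61
```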

Generalisation and Evaluation

Under-fitting:

predictor too simplistic (too rigid)
not powerful enough to capture salient patterns in data
can find another predictor $F'$ with smaller $E_{train}$ and $E_{gen}$

Over-fitting:

predictor too complex (flexible)
fits “noise” in the training data

ROC curve:

                   real Positive   real Negative
Predict Positive        TP              FP
Predict Negative        FN              TN

$TPR = \frac{TP}{P}=\frac{TP}{TP+FN}$

$FPR = \frac{FP}{N}=\frac{FP}{FP+TN}$

A better model has a smaller FPR and a larger TPR.

AUC (Area Under the Curve)

range: 0~1

The larger the AUC is, the better the model predicts.
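
A minimal sketch of how TPR and FPR are computed from labels and thresholded scores; the labels and scores below are made up, and sweeping the threshold traces out points on the ROC curve:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])                      # made-up ground truth
y_score = np.array([0.9, 0.8, 0.35, 0.6, 0.3, 0.7, 0.2, 0.55])   # made-up predicted probabilities

def tpr_fpr(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))          # TPR = TP/P, FPR = FP/N

# Sweep the decision threshold; each (TPR, FPR) pair is one point on the ROC curve.
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(t, tpr_fpr(y_true, (y_score >= t).astype(int)))
```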

Linear regression

process:

  1. Draw a line through the data.
  2. Measure the distance from the line to each data point, square each distance (these are the residuals), and add them up (record this sum of squared residuals).
  3. Rotate the line a little bit, measure the residuals, square them, and sum the squares (record it).
  4. Repeat steps 2 and 3 many times, then look through the recorded sums of squared residuals and choose the parameters with the smallest sum as the output (“Least Squares”); a minimal sketch of this procedure follows the list.
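
A literal sketch of this procedure: try many candidate slopes for the line, record the sum of squared residuals for each, and keep the smallest. The data and the slope grid are made up; in practice the closed-form least-squares solution is used instead of a search.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

best = None
for slope in np.linspace(-2, 2, 401):         # "rotate the line a little bit"
    intercept = y.mean() - slope * x.mean()   # the least-squares line passes through the mean point
    residuals = y - (slope * x + intercept)
    ssr = float(np.sum(residuals ** 2))       # sum of squared residuals
    if best is None or ssr < best[0]:
        best = (ssr, float(slope), float(intercept))

print(best)   # smallest sum of squared residuals and the corresponding slope/intercept
```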

Initial:

$SS(mean) = \sum(data-mean)^2$

Variance around the mean $= \frac{\sum(data-mean)^2}{n}$

$Var(mean) = \frac{SS(mean)}{n}$

Fit:

$SS(fit) = \sum(data-line)^2$

$Var(fit) = \frac{SS(fit)}{n}$

$R^2 = \frac{Var(mean)-Var(fit)}{Var(mean)} = \frac{SS(mean)-SS(fit)}{SS(mean)}$

$R^2$ tells us how much of the variance in the y axis can be explained by taking the x axis into account.
For example: if we get Var(mean) = 11.1 and Var(fit) = 4.4,
we get $R^2 = \frac{11.1-4.4}{11.1} \approx 0.6 = 60\%$

There is a $60\%$ reduction in variance when we take the x axis into account.
Alternatively, we can say that the x axis “explains” $60\%$ of the variance in the y axis.

In linear regression, equations with more parameters will never make SS(fit) worse than equations with fewer parameters.
Why? Because least squares will drive the coefficient of any term that makes SS(fit) worse to 0, so, in a sense, that term no longer exists.
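
A tiny sketch that reproduces the worked $R^2$ numbers above, assuming Var(mean) = 11.1 and Var(fit) = 4.4 are already known:

```python
var_mean = 11.1   # variance of the data around its mean
var_fit = 4.4     # variance of the data around the fitted line

r_squared = (var_mean - var_fit) / var_mean
print(r_squared)  # ~0.60: x "explains" about 60% of the variance in y
```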

In summary, linear regression:

  1. Quantifies the relationship in the data (this is $R^2$). This needs to be large.
  2. Determines how reliable that relationship is (this is the p-value, which we calculate with F, the F-distribution). This needs to be small.

Logistic regression

formula: $p = \mathrm{sigmoid}(y) = \frac{1}{1+e^{-y}} = \frac{1}{1+e^{-(w^Tx+w_0)}}$

Evaluation of the model:
Cross-Entropy Loss: $L = -\sum [y_{true}\log(p)+(1-y_{true})\log(1-p)]$
Target: minimize the loss function.
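
A minimal sketch of the sigmoid and the cross-entropy loss above; the weights and data are made up, and a real model would fit $w$ and $w_0$ by gradient descent on this loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p):
    # L = -sum[ y*log(p) + (1-y)*log(1-p) ]
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])   # made-up features
y = np.array([1, 0, 1])                                # made-up labels
w, w0 = np.array([0.8, -0.3]), 0.1                     # made-up parameters

p = sigmoid(X @ w + w0)         # predicted probabilities
print(p, cross_entropy(y, p))   # training would adjust w, w0 to minimise this loss
```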

Pros:

  1. generates a probability for unlabelled data

  2. reduces the influence of outliers, making the prediction more accurate.

Cons:

  1. only works well when the data is (approximately) linearly separable, since the decision boundary is linear

Regularisation

How to deal with overfitting:

  1. Reduce the number of features.
    – Manually select which features to keep.
    – Model selection algorithms (later in the course).
  2. Regularization.
    – Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
    – Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
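
A minimal sketch of the second option: adding an L2 penalty on the parameters to a squared-error loss. The penalty strength `lam` and all data below are made up.

```python
import numpy as np

def regularised_loss(w, X, y, lam):
    """Squared error plus an L2 penalty that shrinks the parameters."""
    errors = X @ w - y
    return float(np.sum(errors ** 2) + lam * np.sum(w ** 2))

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])   # made-up features
y = np.array([2.0, 2.5, 4.0])                        # made-up targets
w = np.array([1.0, 0.2])

print(regularised_loss(w, X, y, lam=0.0))   # plain squared error
print(regularised_loss(w, X, y, lam=1.0))   # larger weights are penalised more heavily
```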

SVM

How Linear SVM works:

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs.

target:

find the maximum-margin hyperplane:

$\left\{\begin{aligned}w^Tx_i+w_0 \ge 1, && \text{for } y_i=+1 \\ w^Tx_i+w_0 \le -1, && \text{for } y_i=-1 \end{aligned}\right.$

and we have two hyperplanes:

$\left\{\begin{aligned}w^Tx_i+w_0 &= 1 \\ w^Tx_i+w_0 &= -1 \end{aligned}\right.$

For the points that are support vectors:

the distance to the hyperplane is $\frac{1}{||w||}$,
so the width of the margin is $d_+ + d_- = \frac{2}{||w||}$.

As we can see, we want to maximize $\frac{2}{||w||}$.
Equivalently, we do this by minimizing $\frac{||w||^2}{2}\ (=\frac{w^Tw}{2})$.

constraints:

$y_i(w^Tx_i+w_0)-1 \ge 0$
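
A small numeric check of these quantities: for a made-up $w$ and $w_0$ that separate some made-up points, the margin width is $\frac{2}{||w||}$ and every point should satisfy $y_i(w^Tx_i+w_0)-1 \ge 0$:

```python
import numpy as np

w, w0 = np.array([2.0, 1.0]), -3.0                              # made-up separating hyperplane
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 1.0], [0.5, 0.0]])  # made-up points
y = np.array([1, 1, -1, -1])                                    # made-up labels

margin_width = 2.0 / np.linalg.norm(w)   # 2 / ||w||
constraints = y * (X @ w + w0) - 1       # each entry must be >= 0
print(margin_width, constraints)
```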

For Nonlinear case:

we use the kernel function $K(x,x') = \varphi(x)^T \varphi(x')$

Example: assume $x = [x_1,x_2]^T$; to transform the data set into a quadratic feature set,
$x \rightarrow \varphi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1]^T$

$K(x',x) = \varphi(x')^T\varphi(x) = x_1^2x_1'^2+x_2^2x_2'^2+2x_1x_2x_1'x_2'+2x_1x_1'+2x_2x_2'+1 = (x_1x_1'+x_2x_2'+1)^2 = (1+x^Tx')^2$

For kernel functions we have:

Linear kernel: $K(x,x') = x^Tx'$

Polynomial kernel: $K(x,x') = [1+x^Tx']^k$

Radial basis kernel: $K(x,x') = \exp[-\frac{1}{2}||x-x'||^2]$
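
A sketch of the three kernels above, with a numeric check that the degree-2 polynomial kernel matches the explicit quadratic feature map $\varphi$ from the earlier example (the test vectors are made up):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, k=2):
    return (1 + x @ xp) ** k

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.linalg.norm(x - xp) ** 2)

def phi(x):
    """Explicit quadratic feature map from the example above."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(polynomial_kernel(x, xp, k=2), phi(x) @ phi(xp))   # the two values agree
```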

PCA

Principal component analysis (PCA) is a commonly used unsupervised learning method. It uses an orthogonal transformation to convert observations of possibly correlated variables into a (usually smaller) set of values of linearly uncorrelated variables, which are called principal components.

How PCA works:

The target of PCA is to find a transformation matrix $w$ to transform high-dimensional vectors into low-dimensional vectors. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
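
A minimal PCA sketch along these lines: centre the data, eigendecompose the covariance matrix, and project onto the top principal components. The data below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up data: 100 samples, 3 features
X[:, 2] = X[:, 0] + 0.1 * X[:, 2]    # make one direction carry most of the variance

Xc = X - X.mean(axis=0)                   # centre the data
cov = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort components by variance, descending
W = eigvecs[:, order[:2]]                 # top-2 principal components

Z = Xc @ W                                # project onto the new coordinates
print(eigvals[order], Z.shape)
```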

K-means

process:
1. Select the number of clusters you want to identify in your data. This is the “K” in “K-means clustering”.
2. Randomly select K distinct data points (these are the initial clusters).
3. Measure the distance between the first point and the K initial clusters.
4. Assign the first point to the nearest cluster.
5. Repeat steps 3 and 4 for each of the remaining points.
6. Calculate the mean of each cluster.
7. Repeat steps 3 to 6 until the cluster assignments do not change between iterations, then stop.
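
A minimal sketch of this loop (the data, K and the random seed are made up; a production implementation would also handle clusters that lose all their points):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]     # step 2: K random data points
    for _ in range(n_iter):
        # steps 3-5: assign every point to its nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 6: recompute each cluster mean (assumes no cluster ends up empty)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                  # step 7: assignments stable
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in (0, 5)])   # two made-up blobs
print(kmeans(X, k=2)[1])                                           # centres near (0,0) and (5,5)
```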

The K-means algorithm is sensitive to the initial clusters (the randomly chosen starting points).

how to choose the best K:

We can run the K-means algorithm for K = 1 up to K = n (where n is the number of points), recording the reduction in variation each time. Plotting these values gives an “elbow plot”, and we can pick K by finding the “elbow” in the plot, i.e. the point after which the reduction in variation slows down sharply.

how is K-means clustering different from hierarchical clustering?

K-means clustering specifically tries to put the data into the number of clusters we tell it to.
Hierarchical clustering just tells us, pairwise, which two things are most similar.

GMM

process:
1. Guess the number of clusters (the number of Gaussian distributions).
2. For every Gaussian distribution, give its parameters (expectation, variance and weight) random initial values.
3. For every instance, calculate the probability that it belongs to each Gaussian distribution.
4. For each Gaussian distribution, the contribution of each instance can be represented by its probability; use these probabilities as weights to calculate a new expectation and variance that replace the old ones.
5. Repeat steps 3 and 4 until each Gaussian distribution’s expectation and variance converge.
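
A minimal 1-D sketch of this EM loop with two Gaussian components and made-up data; the per-instance probabilities in steps 3-4 appear as the responsibility matrix `resp`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])   # made-up 1-D data

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# steps 1-2: two components with rough initial parameters
mu, var, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # step 3 (E-step): probability of each instance under each component
    resp = w * gauss(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # step 4 (M-step): probability-weighted updates of weight, mean and variance
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(mu, var, w)   # the two means converge near 0 and 5
```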

KNN

How KNN works:

KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).
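
A minimal sketch of the classification case: compute the distance from the query to every training example, take the K nearest, and vote. The training data, query and K are made up.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training example
    nearest = np.argsort(dists)[:k]                   # indices of the K closest examples
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                # most frequent label wins

X_train = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                    [5.0, 5.0], [5.5, 4.5], [6.0, 5.0]])   # made-up training data
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([0.8, 0.2]), k=3))   # predicts class 0
```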

How to choose the best K:

The optimal value of K can be selected using a held-out validation dataset: we pick the value of K that gives the lowest classification error over the validation dataset.

How K impact the result:

Low values of K (like K=1 or K=2) can be noisy and subject to the effects of outliers.
Large values of K smooth things out, but we don’t want K to be so large that a category with only a few samples in it is always outvoted by other categories.

Hierarchical Agglomerative Clustering

process description:

Agglomerative clustering starts by assigning each instance to its own cluster, then iteratively merges pairs of clusters into a replacement cluster until we are left with a single cluster containing all the instances. This single cluster is the root of the hierarchy (called a dendrogram), the two clusters that were merged into it are its children, and so on recursively. The algorithm runs for n − 1 iterations, where n is the number of datapoints. At each iteration we merge the two clusters that have the smallest distance between them, according to some distance metric.

single linkage:

This is the distance between the closest members of the two clusters.

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 3 (the minimum distance between -1 and the group {2,3}, i.e. between -1 and 2).
4. Merge {-1,2,3} and {8,10}, since their distance is 5 (the minimum distance between {-1,2,3} and {8,10}, i.e. between 3 and 8).
5. Completed.

formula: $\min\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

complete linkage:

This is the distance between the members that are the farthest apart.

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 4 (the maximum distance between -1 and the group {2,3}, i.e. between -1 and 3).
4. Merge {-1,2,3} and {8,10}, since their distance is 11 (the maximum distance between {-1,2,3} and {8,10}, i.e. between -1 and 10).
5. Completed.

formula: $\max\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

average linkage:

This method involves looking at the distances between all pairs and averaging all of these distances. It is also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean).

Example: say we have the one-dimensional data -1, 2, 3, 8, 10.
The merge process is:
1. Merge 2 and 3, since their distance is 1.
2. Merge 8 and 10, since their distance is 2.
3. Merge -1 and {2,3}, since their distance is 3.5 (the average distance between -1 and the group {2,3}, i.e. ((distance from -1 to 2) + (distance from -1 to 3)) / 2 = (3+4)/2).
4. Merge {-1,2,3} and {8,10}, since their distance is 7.67 (the average distance between {-1,2,3} and {8,10}, i.e. the mean of all six pairwise distances).
5. Completed.

formula: $\frac{1}{n_1n_2} \sum\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$ (assuming $c_1$ has $n_1$ points and $c_2$ has $n_2$ points)

In plain terms: single linkage picks the smallest of the minimum pairwise distances, complete linkage picks the smallest of the maximum pairwise distances, and average linkage picks the smallest of the average (intermediate) pairwise distances.
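
A short sketch that reproduces the inter-cluster distances used in the worked examples above, for the clusters {-1, 2, 3} and {8, 10}:

```python
import numpy as np

def linkage_distances(c1, c2):
    """Single, complete and average linkage distances between two 1-D clusters."""
    d = np.abs(np.array(c1)[:, None] - np.array(c2)[None, :])   # all pairwise distances
    return float(d.min()), float(d.max()), float(d.mean())

print(linkage_distances([-1, 2, 3], [8, 10]))   # (5, 11, ~7.67): single, complete, average
print(linkage_distances([-1], [2, 3]))          # (3, 4, 3.5)
```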
