MATLAB Distance Functions and Their Usage

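Discriminant analysis usually requires computing the distance between two samples, and multivariate statistics offers many distance formulas for this. MATLAB already provides functions that implement them and can be called directly. The distance-related functions include: pdist, pdist2, mahal, squareform, mdscale, cmdscale

This article mainly covers pdist2; for the other functions, see the MATLAB help.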

D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)

Exercise:

Each distance is computed in two ways: directly with pdist2, and by hand from the formulas (see the theory at the end).

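% distance

clc; clear;
x = rand(4,3)
y = rand(1,3)

% Quantities that do not depend on the pair (i,j) are computed once
Vinv = diag(1./std(x).^2);   % inverse of V = diag(S.^2), S = column std of x
C = cov(x);                  % sample covariance matrix of x
p = 3;                       % Minkowski exponent (reused in the pdist2 call below)

% Preallocate the result matrices
[d1,d2,d3,d4,d5,d6,d7,d8] = deal(zeros(size(x,1),size(y,1)));

for i = 1:size(x,1)
    for j = 1:size(y,1)
        a = x(i,:); b = y(j,:);

        % Euclidean distance
        d1(i,j) = sqrt((a-b)*(a-b)');

        % Standardized Euclidean distance
        d2(i,j) = sqrt((a-b)*Vinv*(a-b)');

        % Mahalanobis distance (pinv is used in case C is ill-conditioned)
        d3(i,j) = sqrt((a-b)*pinv(C)*(a-b)');

        % City block metric
        d4(i,j) = sum(abs(a-b));

        % Minkowski metric
        d5(i,j) = (sum(abs(a-b).^p))^(1/p);

        % Chebychev distance
        d6(i,j) = max(abs(a-b));

        % Cosine distance
        d7(i,j) = 1 - (a*b')/sqrt(a*a'*b*b');

        % Correlation distance (cosine distance of the mean-centered vectors)
        ac = a-mean(a); bc = b-mean(b);
        d8(i,j) = 1 - ac*bc'/(sqrt(sum(ac.^2))*sqrt(sum(bc.^2)));
    end
end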

md1 = pdist2(x,y,'Euclidean');

md2 = pdist2(x,y,'seuclidean');

md3 = pdist2(x,y,'mahalanobis');

md4 = pdist2(x,y,'cityblock');

md5 = pdist2(x,y,'minkowski',p);

md6 = pdist2(x,y,'chebychev');

md7 = pdist2(x,y,'cosine');

md8 = pdist2(x,y,'correlation');

md9 = pdist2(x,y,'hamming');

md10 = pdist2(x,y,'jaccard');
md11 = pdist2(x,y,'spearman');

D1=[d1,md1],D2=[d2,md2],D3=[d3,md3]

D4=[d4,md4],D5=[d5,md5],D6=[d6,md6]

D7=[d7,md7],D8=[d8,md8]

md9,md10,md11

The output is as follows:

x =

    0.5225    0.6382    0.6837
    0.3972    0.5454    0.2888
    0.8135    0.0440    0.0690
    0.6608    0.5943    0.8384

y =

    0.5898    0.7848    0.4977

D1 =

    0.2462    0.2462
    0.3716    0.3716
    0.8848    0.8848
    0.3967    0.3967

D2 =

    0.8355    0.8355
    1.5003    1.5003
    3.1915    3.1915
    1.2483    1.2483

D3 =

  439.5074  439.5074
  437.5606  437.5606
  438.3339  438.3339
  437.2702  437.2702

D4 =

    0.3999    0.3999
    0.6410    0.6410
    1.3934    1.3934
    0.6021    0.6021

D5 =

    0.2147    0.2147
    0.3107    0.3107
    0.7919    0.7919
    0.3603    0.3603

D6 =

    0.1860    0.1860
    0.2395    0.2395
    0.7409    0.7409
    0.3406    0.3406

D7 =

    0.0253    0.0253
    0.0022    0.0022
    0.3904    0.3904
    0.0531    0.0531

D8 =

    1.0731    1.0731
    0.0066    0.0066
    1.2308    1.2308
    1.8954    1.8954

md9 =

     1
     1
     1
     1

md10 =

     1
     1
     1
     1

md11 =

    1.5000
    0.0000
    1.5000
    2.0000

The underlying formulas are given below:

Reposted from: http://blog.sina.com.cn/s/blog_57235cc70100jjf8.html

I. pdist

Pairwise distance between pairs of objects

Syntax

D = pdist(X)
D = pdist(X,distance)

Description

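D = pdist(X) computes the distance between every pair of row vectors in X (X is an m-by-n matrix). Pay particular attention to D: it is a row vector of length m(m-1)/2. One way to understand how D is built: first form the square distance matrix of X. That matrix is symmetric with zeros on the diagonal, so only its lower-triangular elements are kept. Following MATLAB's column-major storage order, those elements are arranged by index as (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1). The command squareform(D) converts this row vector back into the square distance matrix. (squareform exists for exactly this purpose, and its inverse transformation is also squareform.)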
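The layout of D described above can be checked with a short sketch. The four points in X are made-up illustrative data, and the script assumes the Statistics and Machine Learning Toolbox (for pdist and squareform) is available.

```matlab
% Sketch: how the pdist output relates to the square distance matrix.
X = [0 0; 3 4; 6 8; 0 1];          % m = 4 illustrative points in the plane
m = size(X,1);

D = pdist(X);                      % row vector, Euclidean distance by default
fprintf('numel(D) = %d, m(m-1)/2 = %d\n', numel(D), m*(m-1)/2);

A = squareform(D);                 % m-by-m symmetric matrix with zero diagonal

% D lists the lower triangle of A column by column:
% (2,1), (3,1), (4,1), (3,2), (4,2), (4,3)
lowerTri = A(tril(true(m),-1))';   % logical indexing walks A in column-major order

% squareform is its own inverse: square matrix -> row vector
D_back = squareform(A);
disp(isequal(lowerTri, D) && isequal(D_back, D));
```

Because squareform only rearranges the stored values, the comparison can be exact rather than tolerance-based.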

D = pdist(X,distance) uses the specified distance. distance can take any of the values shown in parentheses in the Metrics list below.

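II. pdist2

Pairwise distance between two sets of observations

Syntax

D = pdist2(X,Y)
D = pdist2(X,Y,distance)
[D,I] = pdist2(X,Y,distance,'Smallest',K)

Description

Here X is an mx-by-n matrix and Y is an my-by-n matrix, and the result D is an mx-by-my distance matrix.

[D,I] = pdist2(X,Y,distance,'Smallest',K) returns a K-by-my matrix D and a matrix I of the same size. Each column of D contains the K smallest elements of the corresponding column of the full distance matrix, sorted in ascending order, and the matching column of I holds their row indices. Note that the K smallest values are selected independently for each column.

For example, if A denotes the full mx-by-my distance matrix, then the K-by-my matrix D satisfies D(:,j) = A(I(:,j),j).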
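The identity D(:,j) = A(I(:,j),j) can be verified numerically. The following is a minimal sketch on random data; the sizes, the seed and the variable names Dref and dev are arbitrary choices for illustration.

```matlab
% Sketch: the 'Smallest' option compared with sorting the full distance matrix.
rng(0);                            % fixed seed so the run is repeatable
X = rand(6,3);                     % mx = 6 observations
Y = rand(2,3);                     % my = 2 observations
K = 3;

A = pdist2(X,Y,'euclidean');                    % full 6-by-2 distance matrix
[D,I] = pdist2(X,Y,'euclidean','Smallest',K);   % K-by-2 matrices

% Reference: sort every column of A independently and keep the first K rows
[As,~] = sort(A,1);
Dref = As(1:K,:);

% Check D(:,j) = A(I(:,j),j) column by column
dev = zeros(1,size(Y,1));
for j = 1:size(Y,1)
    dev(j) = max(abs(D(:,j) - A(I(:,j),j)));
    fprintf('column %d: max deviation %g\n', j, dev(j));
end
```

Each column of I indexes rows of X, so I(1,j) is the observation in X nearest to the j-th observation in Y.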

Metrics

Given an m-by-n data matrix X, which is treated as m (1-by-n) row vectors x₁, x₂, ..., x_m, the various distances between the vector x_s and x_t are defined as follows:

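Euclidean distance ('euclidean')

$$d_{s,t}^2 = \left( x_s - x_t \right)\left( x_s - x_t \right)'$$
Notice that the Euclidean distance is a special case of the Minkowski metric, where p = 2.
The Euclidean distance is very useful, but it also has clear drawbacks.
First, it treats differences in the different attributes of a sample (that is, the individual indicators or variables) as equally important, which sometimes does not meet practical requirements.
Second, it ignores the magnitude (units) of each variable, so variables with large values tend to swamp those with small values. It is therefore advisable to normalize the raw data before computing distances.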
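The scale problem and the normalization remedy can be illustrated with a small sketch. The data matrix is invented for the example, and zscore is used here as one possible normalization; the text does not prescribe a specific one.

```matlab
% Sketch: effect of variable scale on the Euclidean distance.
% Two variables on very different scales (illustrative data)
X = [1.0 1000;
     2.0 1010;
     1.5 5000;
     9.0 1005];

D_raw  = squareform(pdist(X));     % dominated by the second column
Z      = zscore(X);                % each column: zero mean, unit standard deviation
D_norm = squareform(pdist(Z));     % both columns now contribute

% Euclidean distance on z-scores coincides with the standardized Euclidean distance
D_seuc = squareform(pdist(X,'seuclidean'));
fprintf('max |D_norm - D_seuc| = %g\n', max(abs(D_norm(:) - D_seuc(:))));

% Which of rows 2 and 4 is closer to row 1 depends on the normalization
fprintf('raw:        d(1,2) = %.3f, d(1,4) = %.3f\n', D_raw(1,2),  D_raw(1,4));
fprintf('normalized: d(1,2) = %.3f, d(1,4) = %.3f\n', D_norm(1,2), D_norm(1,4));
```

On the raw data row 4 looks closer to row 1 than row 2 does, purely because the second column has larger numbers; after normalization the ordering reverses.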
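
Standardized Euclidean distance ('seuclidean')
$$d_{s,t}^2 = \left( x_s - x_t \right)V^{-1}\left( x_s - x_t \right)'$$
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)², where S is the vector of standard deviations.
Compared with the plain Euclidean distance, the standardized Euclidean distance effectively addresses the drawbacks above. Note that in many MATLAB functions V can be set by the user: it does not have to be built from the standard deviations, and different values can be chosen according to the importance of each variable, as with the Scale property of the knnsearch function.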
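A user-defined scale can also be passed to pdist2 as the fourth argument when the distance is 'seuclidean'. The sketch below assumes that calling form; the values in S_custom are hypothetical importance weights chosen only for illustration.

```matlab
% Sketch: standardized Euclidean distance with default and custom scaling.
rng(1);
X = rand(5,3);
Y = rand(2,3);

S_default = std(X);                          % scale used when none is supplied
D_default = pdist2(X,Y,'seuclidean');
D_same    = pdist2(X,Y,'seuclidean',S_default);

% Hypothetical importance-based scaling: a smaller scale gives a variable more weight
S_custom = [0.5 1 2];
D_custom = pdist2(X,Y,'seuclidean',S_custom);

% Hand computation of the scaled distance for comparison
D_hand = zeros(size(X,1), size(Y,1));
for i = 1:size(X,1)
    for j = 1:size(Y,1)
        D_hand(i,j) = sqrt(sum(((X(i,:) - Y(j,:))./S_custom).^2));
    end
end
fprintf('max |D_custom - D_hand| = %g\n', max(abs(D_custom(:) - D_hand(:))));
```

Dividing each coordinate difference by S(j) is the same as applying V^{-1} with V = diag(S.^2) in the formula above.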
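
Mahalanobis distance ('mahalanobis')
$$d_{s,t}^2 = \left( x_s - x_t \right)C^{-1}\left( x_s - x_t \right)'$$
where C is the covariance matrix.
The Mahalanobis distance was introduced by the Indian statistician P. C. Mahalanobis and represents the covariance distance of the data. It is an effective way to measure the similarity between two unknown sample sets. Unlike the Euclidean distance, it accounts for the correlations between features (for example, information about height also carries information about weight, because the two are correlated), and it is scale-invariant, that is, independent of the measurement scale.
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance; if the covariance matrix is diagonal, it is also called the normalized Euclidean distance.
Limitations of the Mahalanobis distance:

1) The Mahalanobis distance is built on the sample as a whole: C is estimated from all the samples, so the computed distance is unstable (it changes whenever the sample changes).

2) Computing the Mahalanobis distance requires the number of samples to exceed the dimension of the samples.

3) The inverse of the covariance matrix may not exist.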
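The properties listed above (reduction to the Euclidean distance, scale invariance, and the sample-size requirement) can be probed with a sketch like the following. The data are random, the column scale factors in s are hypothetical unit changes, and implicit expansion (R2016b or later) is assumed.

```matlab
% Sketch: three properties of the Mahalanobis distance.
rng(2);
X = rand(50,3);                    % many more samples than dimensions
Y = rand(4,3);

% 1) With C equal to the identity, the result matches the Euclidean distance
D_id  = pdist2(X,Y,'mahalanobis',eye(3));
D_euc = pdist2(X,Y,'euclidean');

% 2) Scale invariance: rescaling each variable leaves the distance unchanged,
%    because the covariance estimated from X rescales together with the data
s  = [1 100 0.01];                 % hypothetical unit change per column
Xs = X .* s;                       % implicit expansion, R2016b or later
Ys = Y .* s;
D_orig   = pdist2(X, Y, 'mahalanobis');   % C defaults to the covariance of X
D_scaled = pdist2(Xs,Ys,'mahalanobis');
fprintf('max change after rescaling = %g\n', max(abs(D_orig(:) - D_scaled(:))));

% 3) Too few samples: the covariance matrix is singular and has no inverse
Xfew = rand(3,5);                  % 3 samples, 5 dimensions
fprintf('rank of cov(Xfew) = %d, dimension = %d\n', rank(cov(Xfew)), size(Xfew,2));
```

In the third case the rank of the covariance matrix is at most the number of samples minus one, which is why limitation 2) asks for more samples than dimensions.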

City block (Manhattan) metric ('cityblock')
$$d_{s,t} = \sum\limits_{j = 1}^n \left| x_{sj} - x_{tj} \right|$$
Notice that the city block distance is a special case of the Minkowski metric, where p = 1.

Minkowski metric ('minkowski')
$$d_{s,t} = \sqrt[p]{\sum\limits_{j = 1}^n \left| x_{sj} - x_{tj} \right|^p}$$
Notice that for the special case of p = 1, the Minkowski metric gives the city block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.
Because the Minkowski distance is a generalization of the Euclidean distance, it shares largely the same drawbacks.
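The three special cases can be checked numerically. In this sketch a large finite exponent stands in for p = ∞, which is an assumption made to stay within plain positive exponents; the data are random.

```matlab
% Sketch: special cases of the Minkowski metric.
rng(3);
X = rand(5,4);
Y = rand(3,4);

% p = 1 and p = 2 reproduce the city block and Euclidean distances
err_p1 = max(max(abs(pdist2(X,Y,'minkowski',1) - pdist2(X,Y,'cityblock'))));
err_p2 = max(max(abs(pdist2(X,Y,'minkowski',2) - pdist2(X,Y,'euclidean'))));
fprintf('p = 1 vs cityblock: %g\n', err_p1);
fprintf('p = 2 vs euclidean: %g\n', err_p2);

% As p grows, the Minkowski distance approaches the Chebychev distance
D_cheb = pdist2(X,Y,'chebychev');
for p = [2 10 100]
    gap = max(max(abs(pdist2(X,Y,'minkowski',p) - D_cheb)));
    fprintf('p = %3d: largest gap to Chebychev = %.4f\n', p, gap);
end
```

The gap shrinks because the largest coordinate difference dominates the sum more and more as p increases.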

Chebychev distance ('chebychev')

$$d_{s,t} = \max_j \left| x_{sj} - x_{tj} \right|$$
Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.
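
Cosine distance ('cosine')

$$d_{s,t} = 1 - \frac{x_s x_t'}{\left\| x_s \right\|_2 \cdot \left\| x_t \right\|_2}$$
Compared with the Jaccard distance, the cosine distance not only ignores 0-0 matches but also handles non-binary vectors, that is, it takes the magnitudes of the variable values into account.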
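
Correlation distance ('correlation')

$$d_{s,t} = 1 - \frac{\left( x_s - \overline{x_s} \right)\left( x_t - \overline{x_t} \right)'}{\sqrt{\left( x_s - \overline{x_s} \right)\left( x_s - \overline{x_s} \right)'} \cdot \sqrt{\left( x_t - \overline{x_t} \right)\left( x_t - \overline{x_t} \right)'}}$$
where $\overline{x_s} = \frac{1}{n}\sum\limits_j x_{sj}$ and $\overline{x_t} = \frac{1}{n}\sum\limits_j x_{tj}$.
The correlation distance mainly measures the degree of linear correlation between two vectors.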
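The correlation distance is the cosine distance applied to mean-centered vectors, which the following sketch checks on random data. The centering step uses implicit expansion (R2016b or later), and corrcoef from base MATLAB serves as an independent cross-check for one pair.

```matlab
% Sketch: correlation distance as cosine distance of centered vectors.
rng(4);
X = rand(4,6);
Y = rand(2,6);

% Subtract each row's own mean (implicit expansion, R2016b or later)
Xc = X - mean(X,2);
Yc = Y - mean(Y,2);

D_corr     = pdist2(X, Y, 'correlation');
D_cos_cent = pdist2(Xc,Yc,'cosine');
fprintf('max |D_corr - D_cos_cent| = %g\n', max(abs(D_corr(:) - D_cos_cent(:))));

% Cross-check the pair (1,1) with the correlation coefficient
R    = corrcoef(X(1,:), Y(1,:));   % 2-by-2 correlation matrix
d_11 = 1 - R(1,2);
fprintf('pdist2: %.6f, 1 - corrcoef: %.6f\n', D_corr(1,1), d_11);
```

A distance near 0 means strong positive linear correlation, 1 means no linear correlation, and 2 means perfect negative correlation.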

Hamming distance ('hamming')

$$d_{s,t} = \frac{\# \left( x_{sj} \ne x_{tj} \right)}{n}$$
The Hamming distance between two vectors is defined as the fraction of coordinates in which the two vectors differ.

Jaccard distance ('jaccard')

$$d_{s,t} = \frac{\# \left[ \left( x_{sj} \ne x_{tj} \right) \cap \left( \left( x_{sj} \ne 0 \right) \cup \left( x_{tj} \ne 0 \right) \right) \right]}{\# \left[ \left( x_{sj} \ne 0 \right) \cup \left( x_{tj} \ne 0 \right) \right]}$$
The Jaccard distance is commonly used for objects described only by asymmetric binary (0-1) attributes. Clearly, the Jaccard distance ignores 0-0 matches, whereas the Hamming distance counts them.
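The different treatment of 0-0 matches shows up clearly on a small binary example. The vectors below are invented for illustration; padding both with extra zeros adds only 0-0 matches.

```matlab
% Sketch: Hamming versus Jaccard distance on binary vectors.
x = [1 0 0 0 1];
y = [1 1 0 0 0];

d_ham = pdist2(x,y,'hamming');     % 2 of 5 coordinates differ: 2/5
d_jac = pdist2(x,y,'jaccard');     % 2 differ among the 3 coordinates that are not 0-0: 2/3

% Appending 0-0 matches changes the Hamming distance but not the Jaccard distance
x2 = [x 0 0 0 0 0];
y2 = [y 0 0 0 0 0];
d_ham2 = pdist2(x2,y2,'hamming');  % 2 of 10 coordinates differ: 2/10
d_jac2 = pdist2(x2,y2,'jaccard');  % still 2/3

fprintf('hamming: %.4f -> %.4f\n', d_ham, d_ham2);
fprintf('jaccard: %.4f -> %.4f\n', d_jac, d_jac2);
```

This is why the Jaccard distance suits sparse binary data, where shared absences carry little information.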

Spearman distance ('spearman')

$$d_{s,t} = 1 - \frac{\left( r_s - \overline{r_s} \right)\left( r_t - \overline{r_t} \right)'}{\sqrt{\left( r_s - \overline{r_s} \right)\left( r_s - \overline{r_s} \right)'} \sqrt{\left( r_t - \overline{r_t} \right)\left( r_t - \overline{r_t} \right)'}}$$
where
- r_sj is the rank of x_sj taken over x_s1, x_s2, ..., x_sn (the coordinates of x_s), as computed by tiedrank
- r_s and r_t are the coordinate-wise rank vectors of x_s and x_t, i.e., r_s = (r_s1, r_s2, ... r_sn)
- $\overline{r_s} = \frac{1}{n}\sum\limits_j r_{sj} = \frac{n + 1}{2}$
- $\overline{r_t} = \frac{1}{n}\sum\limits_j r_{tj} = \frac{n + 1}{2}$
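The Spearman distance can be read as the correlation distance applied to rank vectors. The sketch below assumes that ranks are taken across the coordinates of each observation, which is consistent with the mean rank (n+1)/2 stated above; tiedrank ranks along columns of a matrix, hence the transposes.

```matlab
% Sketch: Spearman distance as correlation distance of rank vectors.
rng(5);
X = rand(4,6);
Y = rand(2,6);

% Rank the coordinates within each row (tiedrank works column-wise on matrices)
RX = tiedrank(X')';
RY = tiedrank(Y')';

D_spear     = pdist2(X, Y, 'spearman');
D_rank_corr = pdist2(RX,RY,'correlation');
fprintf('max |D_spear - D_rank_corr| = %g\n', max(abs(D_spear(:) - D_rank_corr(:))));

% Each rank vector has mean (n+1)/2
n = size(X,2);
rankMeanDev = max(abs(mean(RX,2) - (n+1)/2));
fprintf('max deviation of mean rank from (n+1)/2 = %g\n', rankMeanDev);
```

Because only ranks enter the formula, any strictly increasing transformation of the coordinates leaves the Spearman distance unchanged.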

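To repost this article, please contact the original author for permission, and state that it comes from the ScienceNet blog of Zhu Xinyu (朱新宇).
Link: http://blog.sciencenet.cn/blog-531885-589056.html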
