Reference for this post: Python Data Science Handbook.
The source code for this post has been uploaded to Gitee.

Packages used in this post:

%matplotlib inline
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from mpl_toolkits import mplot3d
from matplotlib.image import imread
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS, LocallyLinearEmbedding, Isomap

sns.set()
plt.rc('font', family='SimHei')  # font with CJK glyphs; only needed for Chinese plot labels
plt.rc('axes', unicode_minus=False)  # keep minus signs rendering correctly with that font

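Manifold Learning

Manifold learning is a class of unsupervised estimators that describe a dataset as a low-dimensional manifold embedded in a high-dimensional space.

The manifold learning algorithms covered in this chapter are multidimensional scaling (MDS), locally linear embedding (LLE), and isometric mapping (Isomap).

First, we construct a function that produces a scatter plot in the shape of the word HELLO: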

def make_hello(n: int = 1000, seed: int = 42) -> np.ndarray:
    # Render the word HELLO to a temporary PNG, then sample points that fall on the letters.
    fig, ax = plt.subplots(figsize=(4, 1))  # type: plt.Figure, plt.Axes
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    ax.axis('off')
    ax.text(0.5, 0.4, 'HELLO', va='center', ha='center', weight='bold', size=85)
    fig.savefig('./HELLO.png')
    plt.close(fig)
    data = imread('./HELLO.png')[::-1, :, 0].T
    rng = np.random.RandomState(seed)
    X = rng.rand(4 * n, 2)  # type: np.ndarray
    i, j = (X * data.shape).astype(int).T
    mask = (data[i, j] < 1)  # keep only points that land on dark (text) pixels
    X = X[mask]
    X[:, 0] *= (data.shape[0] / data.shape[1])  # restore the image aspect ratio
    X = X[:n]
    os.remove('./HELLO.png')
    return X[np.argsort(X[:, 0])]


def rotate_array(a: np.ndarray, angle: float) -> np.ndarray:
    theta = np.deg2rad(angle)
    R = [
        [np.cos(theta), np.sin(theta)],
        [-np.sin(theta), np.cos(theta)],
    ]
    return np.dot(a, R)


hello = make_hello()
hello_r = rotate_array(hello, 30)
plt.figure(figsize=(10, 10))
# plt.get_cmap is used instead of plt.cm.get_cmap, which was removed in Matplotlib 3.9.
plt.scatter(x=hello_r[:, 0], y=hello_r[:, 1], c=hello[:, 0], cmap=plt.get_cmap('rainbow', 5))
plt.axis('equal')
plt.title('Hello Manifold Learning')

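What is multidimensional scaling?

In the HELLO data we constructed, the x and y coordinates of each point are clearly not what fundamentally describes the relationships in the data: rotating or translating the points changes every coordinate but not the shape. The truly fundamental feature in this example is the distance between each point and every other point.

We use a distance matrix to represent the pairwise distances in the data; in sklearn, the pairwise_distances function computes this matrix.

Computing the distance matrix from xy coordinates is straightforward, but recovering xy coordinates from a distance matrix is much harder. This is exactly the problem that multidimensional scaling solves.
In sklearn, multidimensional scaling is implemented by the MDS class, as in the following example: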

dis = pairwise_distances(hello)
plt.figure(figsize=(8, 8))
plt.imshow(dis, cmap='Blues')
plt.colorbar()
plt.title('Distance matrix of the HELLO data')

# Recover 2D coordinates from the precomputed distance matrix.
model = MDS(n_components=2, random_state=42, dissimilarity='precomputed')
res = model.fit_transform(dis)
plt.figure(figsize=(8, 8))
plt.scatter(res[:, 0], res[:, 1], c=hello[:, 0], cmap=plt.get_cmap('rainbow', 5))
plt.axis('equal')
plt.title('2D coordinates recovered from the distance matrix with MDS')
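
To make the step from distance matrix to coordinates more concrete, here is a minimal sketch of classical (Torgerson) MDS in plain NumPy. This is a swapped-in technique for illustration only: sklearn's MDS class uses an iterative stress-minimization procedure (SMACOF) instead. The function name classical_mds and the toy random points are assumptions, not part of the original code.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Recover coordinates from a Euclidean distance matrix (hypothetical helper)."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain the Gram matrix of the centered points.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


rng = np.random.RandomState(0)
points = rng.rand(50, 2)                       # toy 2D data, not the HELLO set
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances

coords = classical_mds(dist)
diff_rec = coords[:, None, :] - coords[None, :, :]
dist_rec = np.sqrt((diff_rec ** 2).sum(axis=-1))
print(np.abs(dist - dist_rec).max())           # largest error in any recovered distance
```

The recovered coordinates match the originals only up to rotation, reflection, and translation, which is why the check compares distance matrices rather than the coordinates themselves.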

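Multidimensional scaling for manifold learning

Since a distance matrix can be computed from data of any dimensionality, we next embed the earlier HELLO data in three-dimensional space and then use MDS to reduce it from three dimensions back to a two-dimensional plane.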

fig = plt.figure(figsize=(10, 10))  # type: plt.Figure
ax = plt.axes(projection='3d')  # type: mplot3d.Axes3D
# z coordinate: a linear ramp from 0 to 0.5 plus a little (unseeded) random noise.
hello_3d = np.hstack((
    hello_r,
    (np.linspace(0, 0.5, hello_r.size // 2) + np.random.rand(hello_r.size // 2) / 10)[:, np.newaxis],
))
ax.scatter3D(
    hello_3d[:, 0],
    hello_3d[:, 1],
    hello_3d[:, 2],
    c=hello[:, 0],
    cmap=plt.get_cmap('rainbow', 5),
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.view_init(50, 120)
ax.set_title('HELLO data embedded in 3D space')

dis_3d = pairwise_distances(hello_3d)
plt.figure(figsize=(8, 8))
plt.imshow(dis_3d, cmap='Blues')
plt.colorbar()
plt.title('Distance matrix of the 3D HELLO data')

model = MDS(n_components=2, random_state=1, dissimilarity='precomputed')
res = model.fit_transform(dis_3d)

plt.figure(figsize=(8, 8))
plt.scatter(res[:, 0], res[:, 1], c=hello[:, 0], cmap=plt.get_cmap('rainbow', 5))
plt.axis('equal')
plt.title('3D HELLO distance matrix mapped to the 2D plane with MDS')
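
As a quick sanity check of the idea that a distance matrix does not depend on how the data is oriented or which space it sits in, the sketch below pads toy 2D points with a zero column, applies a random orthogonal 3x3 matrix, and translates the result. The toy points and the QR-based construction of the orthogonal matrix are assumptions for illustration. The HELLO example above instead adds a noisy z coordinate, so its distances are not exactly preserved.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
flat = rng.rand(100, 2)                      # toy 2D points (assumption, not the HELLO data)

# Pad with a zero z coordinate, then apply a random orthogonal 3x3 matrix.
padded = np.hstack((flat, np.zeros((flat.shape[0], 1))))
q, _ = np.linalg.qr(rng.randn(3, 3))         # orthogonal: a rotation, possibly with a reflection
embedded = padded @ q + np.array([1.0, -2.0, 0.5])   # also translate the points

dist_2d = pairwise_distances(flat)
dist_3d = pairwise_distances(embedded)
print(np.abs(dist_2d - dist_3d).max())       # largest change in any pairwise distance
```

Because the two distance matrices agree up to floating-point error, any method that works only from distances, such as MDS, sees the flat and the embedded data as the same input.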

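Nonlinear manifold learning

In the examples above, the data only underwent linear transformations (rotation, translation, scaling). When the data has been transformed nonlinearly, MDS is no longer able to recover it.

MDS cannot handle nonlinear transformations because it tries to preserve the distances between all pairs of points, and under a nonlinear transformation the distances between far-apart points change much more than those between nearby points.

Locally linear embedding (LLE) addresses this: it only preserves distances between nearby points, so it copes much better with nonlinear manifolds.
In sklearn, locally linear embedding is implemented by LocallyLinearEmbedding.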

def nonlinear_transform(a: np.ndarray) -> np.ndarray:
    # Bend the plane into an S-shaped surface in 3D.
    t = (a[:, 0] - 2) * 0.75 * np.pi
    x = np.sin(t)
    y = a[:, 1]
    z = np.sign(t) * (np.cos(t) - 1)
    return np.vstack((x, y, z)).T


hello_nonlinear = nonlinear_transform(hello)
plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d')  # type: mplot3d.Axes3D
ax.scatter3D(
    xs=hello_nonlinear[:, 0],
    ys=hello_nonlinear[:, 1],
    zs=hello_nonlinear[:, 2],
    c=hello[:, 0],
    cmap=plt.get_cmap('rainbow', 5),
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('HELLO data after a nonlinear transformation')

model = LocallyLinearEmbedding(
    n_neighbors=100,
    n_components=2,
    method='modified',
    eigen_solver='dense',
)
model.fit(hello_nonlinear)
res = model.transform(hello_nonlinear)

plt.figure(figsize=(8, 8))
plt.scatter(
    x=res[:, 0],
    y=res[:, 1],
    c=hello[:, 0],
    cmap=plt.get_cmap('rainbow', 5),
)
plt.axis('equal')
plt.title('HELLO data recovered with locally linear embedding')
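
The argument that far-apart distances are distorted much more than nearby ones can be checked numerically. The sketch below traces the same S-shaped curve as nonlinear_transform, parameterized directly by t, and compares the straight-line (chord) distance in the bent space with the distance measured along the curve. The helper name chord_over_arc and the sampling of 601 points are assumptions for illustration.

```python
import numpy as np

# Position along the curve; the curve has unit speed, so arc length equals |t_i - t_j|.
t = np.linspace(-1.5 * np.pi, 1.5 * np.pi, 601)
x = np.sin(t)
z = np.sign(t) * (np.cos(t) - 1)
curve = np.column_stack((x, z))


def chord_over_arc(i, j):
    """Ratio of straight-line distance to along-the-curve distance (hypothetical helper)."""
    chord = np.linalg.norm(curve[i] - curve[j])
    arc = abs(t[i] - t[j])
    return chord / arc


near = chord_over_arc(0, 1)      # two neighboring samples
far = chord_over_arc(0, 600)     # the two ends of the S
print(f"neighboring points: {near:.4f}")
print(f"opposite ends:      {far:.4f}")
```

A ratio close to 1 means the bending left that distance essentially unchanged, while a ratio well below 1 means the straight-line distance badly underestimates the separation along the manifold. This is the reason a method that trusts all pairwise distances struggles here, and one that trusts only local distances does not.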

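Some thoughts on manifold methods

Manifold learning is demanding to apply in practice, so it is rarely used for anything beyond simple qualitative visualization of high-dimensional data.
Manifold learning has no good way of handling missing values, whereas PCA has well-established methods for them.
In manifold learning, noise strongly affects the result, whereas PCA handles noise quite naturally.
The result of locally linear embedding depends heavily on the number of neighbors chosen, and there is no good way to determine the best number.
In manifold learning, the globally optimal number of output dimensions is hard to determine.
Manifold learning typically has a complexity of $\Theta(N^2)$ or $\Theta(N^3)$, whereas PCA is usually faster thanks to randomized methods.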
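
The dependence on the number of neighbors is easy to observe directly. The following sketch, which assumes sklearn's built-in make_s_curve toy dataset and an arbitrary set of candidate values (5, 15, 50), fits modified LLE several times and records the reconstruction_error_ attribute of each fitted model. It is only a way to look at the sensitivity, not a recipe for choosing the best value.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# Toy 3D S-curve data (assumption: used here instead of the HELLO data).
X, _ = make_s_curve(n_samples=300, random_state=0)

errors = {}
for k in (5, 15, 50):
    lle = LocallyLinearEmbedding(
        n_neighbors=k,
        n_components=2,
        method='modified',
        eigen_solver='dense',
    )
    embedding = lle.fit_transform(X)
    errors[k] = lle.reconstruction_error_
    print(k, embedding.shape, errors[k])
```

Plotting each embedding side by side is usually more informative than the error values alone, since the error is not comparable in a simple way across different neighborhood sizes.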

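Case study: applying Isomap to face data

In the previous section's experiment on reducing the face data with PCA, about 100 dimensions had to be kept to preserve more than 90% of the variance. In the Isomap experiment here, just two dimensions describe how the image features vary reasonably well (as the visualization shows), which suggests that Isomap is a very powerful dimensionality reduction algorithm (the author's claim, not mine).

(The book only gives this one example here. What happened to the promised in-depth treatment???)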

def plot_components(data, model, images=None, ax=None, thumb_frac=0.05, cmap='gray'):
    ax = ax or plt.gca()
    proj = model.fit_transform(data)
    ax.plot(proj[:, 0], proj[:, 1], '.k')
    if images is not None:
        min_dist_2 = (thumb_frac * max(proj.max(0) - proj.min(0))) ** 2
        shown_images = np.array([2 * proj.max(0)])
        for i in range(data.shape[0]):
            dist = np.sum((proj[i] - shown_images) ** 2, 1)
            if np.min(dist) < min_dist_2:
                # Skip points that are too close to an already shown thumbnail.
                continue
            shown_images = np.vstack([shown_images, proj[i]])
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(images[i], cmap=cmap), proj[i])
            ax.add_artist(imagebox)


faces = fetch_lfw_people(min_faces_per_person=30)
plt.figure(figsize=(15, 15))
ax = plt.axes()
plot_components(
    data=faces.data,
    model=Isomap(n_components=2),
    images=faces.images[:, ::2, ::2],
    ax=ax,
)
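
For readers who want to try Isomap without downloading the face dataset, here is a minimal sketch on sklearn's bundled digits data. The choice of dataset, the 500-sample subset, and n_neighbors=10 are all assumptions made to keep the example small and offline; they are not taken from the book.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
X = digits.data[:500]            # 500 images of 8x8 pixels, flattened to 64 features

iso = Isomap(n_neighbors=10, n_components=2)
embedding = iso.fit_transform(X)

print(embedding.shape)
print(iso.reconstruction_error())
```

The resulting two-column array can be passed to a scatter plot colored by digits.target[:500] to see whether the two Isomap dimensions separate the classes.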


[Python] Machine Learning Notes 08: Manifold Learning