[Support Vector Machine (SVM) Tutorial Series, Part 4] SVM in Practice

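Contents

4 Example: Face Classification with SVM
- 4.1 Downloading the dataset
- 4.2 Extracting and inspecting the dataset
- 4.3 Dimensionality reduction with PCA
- 4.4 Defining and fitting the model
- 4.5 Splitting into training and test sets
- 4.6 Hyperparameter tuning
- 4.7 Visualizing the predictions
- 4.8 Plotting the eigenfaces
- 4.9 Evaluation metrics
- 4.10 Plotting the confusion matrix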

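4 Example: Face Classification with SVM

SVMs are versatile: they handle both linear and non-linear classification, and both binary and multi-class problems. This part applies an SVM to a fairly complex multi-class task, classifying faces from the LFW dataset. It walks through the workflow step by step and also covers several of the model evaluation tools in sklearn.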

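4.1 Downloading the dataset

This example uses the Labeled Faces in the Wild (LFW) dataset. Because of network issues, loading the dataset directly from within the program is slow, so it is downloaded to local disk first.

Download link: http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz
After downloading, extract the archive to get a directory named lfw_funneled.
Use the following command to count the subdirectories in that directory:

ls -lR | grep "^d" | wc -l

This reports 11507 subdirectories, each of which holds the face images of one person.
The total number of face images is 26485, so on average each person has very few images; a handful of people nevertheless have many, which makes the dataset imbalanced.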
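
The same counts can also be obtained from Python. The following is a minimal sketch and not part of the original workflow: the helper name `count_dirs_and_images` is hypothetical, and it assumes that every image is a `.jpg` file somewhere under the dataset root. It is demonstrated on a throwaway directory tree so that it runs without the dataset.

```python
from pathlib import Path
import tempfile


def count_dirs_and_images(root):
    """Return (number of subdirectories, number of .jpg files) under root, recursively."""
    root = Path(root)
    n_dirs = sum(1 for p in root.rglob("*") if p.is_dir())
    n_images = sum(1 for p in root.rglob("*.jpg") if p.is_file())
    return n_dirs, n_images


# Demonstration on a temporary tree: two people, three images in total
with tempfile.TemporaryDirectory() as tmp:
    for person, n in [("Person_A", 2), ("Person_B", 1)]:
        person_dir = Path(tmp) / person
        person_dir.mkdir()
        for k in range(n):
            (person_dir / f"{person}_{k:04d}.jpg").touch()
    demo_counts = count_dirs_and_images(tmp)

print(demo_counts)  # (2, 3)
```

To apply it to the real data, pass the path of the extracted lfw_funneled directory instead of the temporary one.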

The following sections implement the SVM face classification task step by step.

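4.2 Extracting and inspecting the dataset

The data used here are the labeled face images from the LFW dataset.
sklearn provides the built-in function fetch_lfw_people, which loads the LFW dataset and filters it.
The code is as follows:

from sklearn.datasets import fetch_lfw_people

# Load the dataset from local disk. data_home is the local cache directory for the data;
# if the files are not found there, sklearn downloads them, which is slow.
faces = fetch_lfw_people(
    data_home='C:/Users/Administrator/MLDP/sklearn-examples/SVM/lfw_funneled',
    # Keep only the people who have at least 60 face images
    min_faces_per_person=60,
)
# Print the names of the selected people and the shape of the dataset
# (number of images × image height × image width)
print(faces.target_names)
print(faces.images.shape)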

Output:

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)

Print some basic information about the dataset:

n_samples, h, w = faces.images.shape
# Feature matrix (one flattened image per row)
X = faces.data
# Number of features (pixels per image)
n_features = X.shape[1]
# Labels
y = faces.target
# Label names
target_names = faces.target_names
n_classes = target_names.shape[0]

print("The dataset parameters are as follows:")
print("Number of samples: %d" % n_samples)
print("Number of features (pixels per image): %d" % n_features)
print("Number of classes: %d" % n_classes)

Output:

The dataset parameters are as follows:
Number of samples: 1348
Number of features (pixels per image): 2914
Number of classes: 8
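
The feature count is simply the number of pixels per image: 62 × 47 = 2914. The short sketch below uses a random array as a stand-in for `faces.images` (an assumption, so that it runs without the dataset) and assumes the feature matrix is the row-by-row flattening of the image stack, which is how the shapes above line up.

```python
import numpy as np

# Stand-in for faces.images: 5 random "images" of size 62 x 47 (not real LFW data)
rng = np.random.default_rng(0)
images = rng.random((5, 62, 47))

# Each image is flattened row by row into one feature vector
data = images.reshape(len(images), -1)
print(data.shape)  # (5, 2914)

# Reshaping a row back to (h, w) recovers the original image
h_img, w_img = images.shape[1:]
restored = data[0].reshape(h_img, w_img)
print(np.array_equal(restored, images[0]))  # True
```

This is also why the later plotting code can call `reshape((h, w))` on a row of the feature matrix to display it as a picture.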

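Take a look at the first 12 face images in the dataset:

import matplotlib.pyplot as plt

# Create a single 3 × 4 grid of axes
fig, ax = plt.subplots(3, 4, figsize=(6, 4))
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

Output:

[Figure: the first 12 face images in a 3 × 4 grid, each labeled with the person's name]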

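4.3 Dimensionality reduction with PCA

Each image contains close to 3000 pixels, and the faces are sharp and easy to tell apart. Using every pixel as a feature, however, makes the computation expensive and slow. Principal component analysis (PCA) is therefore used to compress each image into a much smaller number of components, which are then passed to the SVM classifier. How many components should be kept? The analysis below addresses this question.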

from sklearn.decomposition import PCA
import numpy as np

pca = PCA()
pca.fit(X)
# 0.95 is the target explained variance ratio: find the number of dimensions at which
# the retained components account for 95% of the total variance.
# cumsum >= 0.95 is a boolean array; np.argmax returns the index of its first maximum,
# i.e. the first True.
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print("Keeping %d dimensions per image preserves 95%% of the total variance." % d)

Output:

Keeping 160 dimensions per image preserves 95% of the total variance.
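
As an alternative to computing the cumulative sum by hand, scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and then keeps just enough components to exceed that fraction of the variance. The sketch below uses synthetic low-rank data in place of the face images (an assumption, so that it runs without the dataset; all variable names are illustrative) and checks that both approaches pick the same number of dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X: 200 samples, 50 features, most variance in a few directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5)) * np.array([10.0, 6.0, 3.0, 2.0, 1.0])
X_demo = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Manual approach: first index where the cumulative ratio reaches 0.95
pca_full = PCA(svd_solver="full").fit(X_demo)
cumsum_demo = np.cumsum(pca_full.explained_variance_ratio_)
d_manual = int(np.argmax(cumsum_demo >= 0.95)) + 1

# Built-in approach: let PCA choose the number of components
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo)
d_auto = pca_95.n_components_

print(d_manual, d_auto)
```

The manual route is still useful when the whole curve is needed, for example to plot it.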

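Reducing each image from nearly 3000 pixels to 160 dimensions already reaches an explained variance ratio of 95%, so 160 is a reasonable first candidate. The 95% threshold was chosen arbitrarily, though, so it helps to plot the curve and look for a suitable number of dimensions more directly.
The code below plots the cumulative explained variance ratio against the number of dimensions kept: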

plt.figure(figsize=(6, 4))
plt.plot(cumsum, 'r',linewidth=3)
plt.axis([0, 400, 0, 1])
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance Ratio")
# Mark the point where the explained variance ratio reaches 0.95
plt.plot([d, d], [0, 0.95], "k:")
plt.plot([0, d], [0.95, 0.95], "k:")
plt.plot(d, 0.95, "ko")
# Mark the elbow (the turning point where the curve starts to flatten)
plt.annotate("Elbow", xy=(50, 0.85), xytext=(70, 0.7), arrowprops=dict(arrowstyle="->"), fontsize=14)
plt.grid(True)
plt.show()

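Output:

[Figure: cumulative explained variance ratio versus number of dimensions, with the 0.95 point and the elbow marked]

The plot shows that beyond 160 dimensions the explained variance ratio grows very slowly: the dimension count rises much faster than the variance ratio does. As a trade-off, 160 dimensions would be a sensible choice. Going up to 200 dimensions still yields a slight gain in explained variance, so 200 is used here for a little extra accuracy.
With the number of dimensions fixed, the model can now be defined.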

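4.4 Defining and fitting the model

The dimensionality reduction step and the SVM classifier are wrapped together in a pipeline. The code is as follows: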

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
# Use PCA to reduce each face image to 200 features
pca = PCA(n_components=200, whiten=True, random_state=42)
# Define an SVM with a Gaussian (RBF) kernel
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

One detail is worth noting: svc is defined with the parameter class_weight='balanced'. The official documentation describes this parameter as follows:

class_weight: dict or 'balanced', default=None
Set the parameter C of class i to class_weight[i]*C for SVC.
If not given, all classes are supposed to have weight one.
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

In short, this parameter works together with the parameter C.

When class_weight is not specified it defaults to None: every class has a weight of 1, so C for each class is class_weight × C, that is, C itself.
When class_weight is set to 'balanced', the weight of each class is:
$$\text{class\_weight} = \frac{\text{n\_samples}}{\text{n\_classes} \times \text{np.bincount}(y)}$$
That is:
$$\text{class\_weight of each class} = \frac{\text{total number of samples}}{\text{number of classes} \times \text{np.bincount}(\text{labels of all samples})}$$
For class $i$, the value of $C$ is set to $\text{class\_weight}[i] \times C$.

The following code illustrates this:

n_samples = y.shape[0]
n_classes = 8
print("n_samples=", n_samples)
print("n_classes=", n_classes)
class_weights = n_samples / (n_classes * np.bincount(y))
print("class_weight=", class_weights)
for i in range(n_classes):
    print("class_weight of class %d is %.8f" % (i + 1, class_weights[i]))

Output:

n_samples= 1348
n_classes= 8
class_weight= [2.18831169 0.71398305 1.39256198 0.31792453 1.54587156 2.37323944
 2.80833333 1.17013889]
class_weight of class 1 is 2.18831169
class_weight of class 2 is 0.71398305
class_weight of class 3 is 1.39256198
class_weight of class 4 is 0.31792453
class_weight of class 5 is 1.54587156
class_weight of class 6 is 2.37323944
class_weight of class 7 is 2.80833333
class_weight of class 8 is 1.17013889

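The class_weight values differ from class to class and depend on how many samples each class has. The more samples a class has, the smaller its class_weight and hence the smaller its effective C; classes with few samples get a larger C, so misclassifying them is penalized more heavily. For example, George W Bush (the fourth class) has by far the most images and receives the smallest weight, 0.318. This weighting is particularly useful when the data are imbalanced, and tends to work better than using the same C for every class. When the classes have very different numbers of samples, setting class_weight='balanced' is therefore a reasonable choice.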
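
scikit-learn also exposes this computation as `sklearn.utils.class_weight.compute_class_weight`. The sketch below uses a made-up label vector (not the LFW labels) to show that the helper agrees with the formula from the documentation and that the rarest class receives the largest weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 0 has 6 samples, class 1 has 3, class 2 has 1
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_toy)

# Formula from the documentation: n_samples / (n_classes * np.bincount(y))
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# The same weights computed by scikit-learn's helper
helper = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(manual)
print(helper)
```

Here the weights work out to 10/18, 10/9 and 10/3, so the class with a single sample is weighted six times as heavily as the class with six samples.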

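4.5 Splitting into training and test sets

To evaluate the trained model, the dataset is split into a training set and a test set:

from sklearn.model_selection import train_test_split

# When no split ratio is specified, the test set defaults to 25% of the data
Xtrain, Xtest, ytrain, ytest = train_test_split(
    faces.data, faces.target,
    # Fix the random seed so that the split is the same on every run (reproducible)
    random_state=42,
)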
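
With the default test ratio of 0.25, the 1348 samples are split into 1011 training samples and 337 test samples. A quick check with placeholder arrays (zeros and random labels standing in for `faces.data` and `faces.target`, an assumption so that the snippet runs without the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of samples as the filtered LFW subset
X_stub = np.zeros((1348, 1))
y_stub = np.random.default_rng(0).integers(0, 8, size=1348)

X_tr, X_te, y_tr, y_te = train_test_split(X_stub, y_stub, random_state=42)
print(len(X_tr), len(X_te))  # 1011 337
```

`train_test_split` also accepts a `stratify` argument for keeping class proportions the same in both subsets; the split in this tutorial does not use it.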

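4.6 Hyperparameter tuning

After the split, a grid search is run on the training data (Xtrain, ytrain) to find the best parameter combination:

from sklearn.model_selection import GridSearchCV

# Use grid search to find the best parameter combination
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid, n_jobs=-1)
# %time is an IPython/Jupyter magic command that reports how long the statement takes
%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)

Output:

Wall time: 28.6 s

{'svc__C': 5, 'svc__gamma': 0.001}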
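
The `svc__` prefix in the parameter grid comes from how `make_pipeline` names its steps: each step is named after its lowercased class name, and a step's parameters are addressed as `<step>__<parameter>`. The sketch below illustrates this on a small synthetic dataset from `make_classification` (an assumption standing in for the face data; the grid values and the `demo_` names are arbitrary as well).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Small synthetic 3-class problem (not the LFW data)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

demo_model = make_pipeline(
    PCA(n_components=10, whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced"),
)
step_names = [name for name, _ in demo_model.steps]
print(step_names)  # ['pca', 'svc']

# Parameters of a step are addressed as '<step name>__<parameter name>'
demo_grid = GridSearchCV(
    demo_model,
    {"svc__C": [1, 10], "svc__gamma": [0.01, 0.1]},
    cv=3,
)
demo_grid.fit(X_syn, y_syn)
print(demo_grid.best_params_)
```

Because the PCA step sits inside the pipeline, it is refitted on the training portion of every cross-validation fold rather than on the full data.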

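The best parameter combination defines the best model, which is then used to predict the test set:

model_best = grid.best_estimator_
# Predictions of this model on the test set
y_pred = model_best.predict(Xtest)

With the predictions y_pred and the true test labels ytest, the model's performance on the test set can now be assessed.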

4.7 Visualizing the predictions

The code below displays the model's predictions for the first 30 faces in the test set:

# Define a PCA model for dimensionality reduction (note that it must be fitted on Xtrain)
pca = PCA(n_components=200, whiten=True, random_state=42).fit(Xtrain)
# Eigenfaces: the 200 principal components with the largest explained variance,
# reshaped into images so that they can be displayed
eigenfaces = pca.components_.reshape((200, h, w))

# Plot a gallery of images with their titles
def plot_gallery(images, titles, h, w, n_row=5, n_col=6):
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=0.1, right=0.99, top=0.99, hspace=0.4)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

# Build the title of one face image from its predicted and true labels (surnames only)
def title(y_pred, ytest, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[ytest[i]].rsplit(' ', 1)[-1]
    return 'Predicted: %s\n Label:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, ytest, target_names, i) for i in range(y_pred.shape[0])]
print("Here are the faces with predicted labels and actual labels: ")
plot_gallery(Xtest, prediction_titles, h, w)
plt.show()

Output:

Here are the faces with predicted labels and actual labels:

[Figure: 5 × 6 gallery of test faces, each titled with its predicted and actual label]

Of these 30 faces, only the one in the third row, third column is misclassified; all the others are predicted correctly. The SVM classifier appears to perform quite well.

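4.8 Plotting the eigenfaces

The previous section defined a PCA model that reduces each face image to 200 dimensions. Put simply, the 200 components it extracts are the directions along which the face images vary the most. Visualizing them shows which aspects of a face the PCA model has picked up. The code is as follows: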

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
print("Here are the eigenfaces sorted by explained variance ratio: ")
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()

Output:

Here are the eigenfaces sorted by explained variance ratio:

[Figure: gallery of the first 30 eigenfaces]

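Most of the components extracted by the PCA model focus on the facial features and the outline of the face, which is what makes different faces distinguishable. This helps the SVM classifier perform well.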
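
Each eigenface is one row of `pca.components_`, a vector with one entry per pixel, reshaped back into image form. A sketch with random data in place of the training images (an assumption, as are the small image size, the component count and the `_demo` names) makes the shapes explicit.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for Xtrain: 100 "images" of 8 x 6 pixels, flattened to 48 features
h_demo, w_demo, k = 8, 6, 10
rng = np.random.default_rng(0)
X_fake = rng.random((100, h_demo * w_demo))

pca_demo = PCA(n_components=k, whiten=True, random_state=42).fit(X_fake)

# components_ has shape (n_components, n_features); each row becomes one "eigenface"
eigen_demo = pca_demo.components_.reshape((k, h_demo, w_demo))
print(eigen_demo.shape)  # (10, 8, 6)

# Components are ordered by decreasing explained variance
ratios = pca_demo.explained_variance_ratio_
print(bool(np.all(np.diff(ratios) <= 0)))  # True
```

The ordering is why the first eigenfaces in the gallery capture the broadest variations and later ones capture finer detail.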

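4.9 Evaluation metrics

A visual impression alone is not enough to judge a model; objective numbers are needed as well. The code below reports the model's metrics for each class of face images: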

from sklearn.metrics import classification_report
print(classification_report(ytest, y_pred, target_names=faces.target_names))

Output:

                   precision    recall  f1-score   support

     Ariel Sharon       0.68      0.87      0.76        15
     Colin Powell       0.80      0.93      0.86        68
  Donald Rumsfeld       0.77      0.74      0.75        31
    George W Bush       0.93      0.83      0.88       126
Gerhard Schroeder       0.72      0.78      0.75        23
      Hugo Chavez       0.93      0.70      0.80        20
Junichiro Koizumi       0.85      0.92      0.88        12
       Tony Blair       0.88      0.90      0.89        42

         accuracy                           0.85       337
        macro avg       0.82      0.83      0.82       337
     weighted avg       0.85      0.85      0.85       337

On the test set the model reaches an accuracy of 0.85 and a weighted-average F1 score of 0.85.
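
The two summary rows can be reproduced from the per-class rows: the macro average is the plain mean of the per-class scores, and the weighted average weights each class by its support. The sketch below recomputes both for the F1 column, using the rounded values from the report above, so the results are only accurate to about two decimals.

```python
# Per-class F1 scores and supports, copied from the report above
f1 = [0.76, 0.86, 0.75, 0.88, 0.75, 0.80, 0.88, 0.89]
support = [15, 68, 31, 126, 23, 20, 12, 42]

total = sum(support)
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print(total)                  # 337
print(round(macro_f1, 2))     # 0.82
print(round(weighted_f1, 2))  # 0.85
```

The weighted average is higher than the macro average here because the largest class, George W Bush, is also one of the best classified.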

4.10 Plotting the confusion matrix

The confusion matrix below gives an intuitive view of how well the model classifies each class:

from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(ytest, y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', xticklabels=faces.target_names,yticklabels=faces.target_names)
plt.xlabel('Actual Label')
plt.ylabel('Predicted Label');

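Output:

[Figure: confusion matrix heatmap, with actual labels on the horizontal axis and predicted labels on the vertical axis]

The horizontal axis shows the actual labels and the vertical axis the predicted labels. The cells on the diagonal give the number of correctly classified faces in each class, and the other cells in the same column give the number of faces of that class that were misclassified as each of the other classes. The plot makes it easy to see how many samples each class has and how each class is classified.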
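
Note that `confusion_matrix` returns rows indexed by the true label and columns by the predicted label, which is why the code above plots `mat.T` to put the actual labels on the horizontal axis. A toy example with made-up labels (not taken from the LFW experiment):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes
y_true_toy = [0, 0, 1, 1, 1, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2]

mat_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(mat_toy)
# [[1 1 0]
#  [0 2 1]
#  [0 0 1]]

# Row i, column j counts samples of true class i predicted as class j,
# so the diagonal holds the correct predictions
accuracy_toy = np.trace(mat_toy) / mat_toy.sum()
print(accuracy_toy)
```

Summing a row gives the number of samples that truly belong to that class, which corresponds to the support column of the classification report.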