Case 1: For a range of candidate cluster counts, retrieve every sample's silhouette coefficient and compare it visually against the overall mean silhouette score to find the best number of clusters.
Case 2: An image with ~270,000 pixels covering ~90,000 distinct colors. Use KMeans to pick 64 colors as centroids, assign each pixel the index (0-63) of its nearest centroid, then overwrite a copy of the data with each centroid's color — vector quantization, i.e. data reduction.
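Case 2 can be sketched roughly as follows. This is a hedged illustration using random numbers as a stand-in for real pixels — for an actual image you would reshape an (H, W, 3) array to (H*W, 3) first; all variable names here are my own, not from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an image's pixels: each row is one (R, G, B) value in [0, 1].
# A real photo would instead be np.reshape(image, (-1, 3)) / 255.0
rng = np.random.RandomState(0)
pixels = rng.rand(27_000, 3)

# Learn 64 "palette" colors (centroids) from a subsample, then give every
# pixel the index (0-63) of its nearest centroid
km = KMeans(n_clusters=64, random_state=0, n_init=4).fit(pixels[:5000])
codes = km.predict(pixels)

# Overwrite each pixel with its centroid's color: at most 64 distinct colors remain
quantized = km.cluster_centers_[codes]
print(quantized.shape)  # (27000, 3)
```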

from sklearn.datasets import make_blobs  # creates a dataset containing several clusters
import matplotlib.pyplot as plt
# Create a synthetic dataset ourselves
X, y = make_blobs(n_samples=500,n_features=2,centers=4,random_state=1)
# random_state fixes the dataset so results are reproducible
X.shape
(500, 2)
y.shape
(500,)
fig, ax1 = plt.subplots(1)  # one figure with one subplot
ax1.scatter(X[:, 0], X[:, 1],  # scatter draws a scatter plot; the first two args are the x and y coordinates
            marker='o',        # marker shape
            s=8)               # marker size
plt.show()

# What if we want to see how the points are distributed by their true labels?
color = ["red","pink","orange","gray"]
fig, ax1 = plt.subplots(1)
for i in range(4):
    ax1.scatter(X[y==i, 0], X[y==i, 1],
                marker='o',     # marker shape
                s=8,            # marker size
                c=color[i])
plt.show()

from sklearn.cluster import KMeans

n_clusters = 3
cluster = KMeans(n_clusters=n_clusters,random_state=0).fit(X)
# n_clusters: the number of clusters
# The attribute labels_ gives each sample's cluster assignment
y_pred = cluster.labels_
y_pred
array([2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 0,
       ...])  # 500 labels, each in {0, 1, 2}
# KMeans does not need a separate prediction step to cluster its training data: fit alone yields the clustering
# KMeans also has a predict(X) interface; fit_predict(X) fits on X and returns X's cluster labels in one call
pre = cluster.fit_predict(X)
pre
array([2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 0,0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 1, 0,0, 2, 0, 0, 1, 1, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,0, 0, 1, 2, 0, 0, 1, 2, 2, 0, 2, 1, 1, 2, 1, 0, 1, 0, 0, 1, 1, 0,0, 2, 1, 0, 1, 0, 1, 0, 1, 0, 0, 2, 2, 0, 0, 0, 1, 2, 2, 0, 1, 0,0, 0, 0, 2, 1, 0, 1, 1, 0, 2, 0, 1, 1, 1, 0, 0, 2, 2, 0, 0, 1, 2,1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 1, 2, 0, 0, 2, 1, 0,0, 0, 0, 2, 0, 0, 1, 2, 2, 0, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 0, 2,2, 1, 2, 0, 1, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1,0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 2, 0, 2, 0, 1, 1, 0, 2, 1, 2, 0, 0,2, 2, 2, 2, 0, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,1, 0, 2, 2, 0, 0, 0, 0, 1, 1, 0, 1, 0, 2, 1, 2, 1, 2, 2, 1, 2, 1,1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 2, 0, 2,2, 0, 1, 2, 0, 0, 1, 1, 0, 2, 1, 1, 0, 2, 1, 1, 0, 0, 1, 0, 0, 2,2, 1, 0, 2, 0, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 0,1, 0, 0, 2, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2, 2, 2,2, 0, 2, 1, 2, 1, 1, 0, 1, 0, 0, 0, 2, 1, 0, 1, 0, 2, 0, 0, 2, 0,0, 1, 1, 2, 0, 0, 1, 0, 0, 2, 2, 0, 2, 0, 0, 2, 0, 2, 0, 1, 2, 1,0, 0, 1, 0, 0, 1, 2, 0, 1, 1, 0, 0, 0, 0, 2, 1, 2, 0, 1, 2, 2, 2,0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 1, 0,1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 0, 1, 0, 2, 1, 2, 1, 2, 0, 1, 1,2, 0, 0, 2, 0, 0, 0, 2, 0, 1, 0, 0, 2, 2, 2, 0])
pre == y_pred  # all True
array([ True,  True,  True, ...,  True,  True,  True])  # all 500 entries are True
# When do we actually need predict? When the dataset is very large!
# With a huge dataset, a small subset is enough to locate the centroids;
# the remaining samples can then be assigned to clusters with predict
cluster_smallsub = KMeans(n_clusters=n_clusters, random_state=0).fit(X[:200])
D:\py1.1\lib\site-packages\sklearn\cluster\_kmeans.py:1334: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
y_pred_ = cluster_smallsub.predict(X)
y_pred_
array([1, 1, 2, 0, 2, 0, 2, 2, 2, 2, 1, 1, 2, 0, 2, 1, 2, 1, 0, 2, 2, 2,
       ...])  # 500 labels, each in {0, 1, 2}
y_pred == y_pred_  # with a very large dataset this approximation works well
# The result will not match fitting on the full data exactly. Note too that cluster IDs are
# arbitrary: the all-False comparison below partly reflects a renumbering of similar clusters.
# When exactness matters less, or the data is simply too large, fit on a subset and use predict;
# if the data is manageable, just fit on all of it and read the results off .labels_
array([False, False, False, ..., False, False, False])  # all 500 entries are False — the cluster numbering differs
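Because KMeans numbers its clusters arbitrarily, an elementwise `==` can come out all False even when the two partitions carve up the data in essentially the same way. A permutation-invariant measure such as the adjusted Rand index is the safer comparison; a minimal sketch assuming the same data setup as above (variable names are mine):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
small = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X[:200])

# 1.0 means the partitions agree perfectly, regardless of how labels are numbered
ari = adjusted_rand_score(full.labels_, small.predict(X))
print(ari)
```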
# Important attribute cluster_centers_: the coordinates of the centroids
centroid = cluster.cluster_centers_
centroid
array([[-8.0807047 , -3.50729701],
       [-1.54234022,  4.43517599],
       [-7.11207261, -8.09458846]])
centroid.shape
(3, 2)
# Important attribute inertia_: the total sum of squared distances from samples to their nearest centroid
inertia = cluster.inertia_
inertia
1903.560766461176
color = ["red","pink","orange","gray"]
fig, ax1 = plt.subplots(1)  # one canvas
for i in range(n_clusters):
    ax1.scatter(X[y_pred==i, 0], X[y_pred==i, 1],  # points assigned to cluster i
                marker='o',     # marker shape
                s=8,            # marker size
                c=color[i])
ax1.scatter(centroid[:,0], centroid[:,1], marker="x", s=15, c="black")
plt.show()

# What happens to inertia_ if we change the guessed cluster count to 4?
n_clusters = 4
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
908.3855684760616
n_clusters = 5
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
811.0841324482415
n_clusters = 6
cluster_ = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
inertia_ = cluster_.inertia_
inertia_
# inertia_ keeps shrinking as n_clusters grows, so it is not a usable evaluation metric on its own
733.1538350083076
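The three runs above can be folded into a loop. Since inertia can only decrease as k grows (it reaches 0 when every sample is its own centroid), "smaller inertia" cannot by itself choose k. A minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

inertias = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))  # inertia only ever goes down as k increases
```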
# 3.1.2 Model evaluation metrics for clustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
X.shape#(500, 2)
y_pred  # i.e. the .labels_ from the 3-cluster fit
array([2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 0,
       ...])  # 500 labels, each in {0, 1, 2} (same as above)
silhouette_score(X,y_pred)
0.5882004012129721
silhouette_score(X, cluster_.labels_)
# Careful: cluster_ still holds the last fit from above (n_clusters=6), so calling this
# repeatedly does not evaluate 4 or 5 clusters — refit for each k before comparing scores
0.5150064498560357
silhouette_samples(X,y_pred)
array([ 0.62982017,  0.5034877 ,  0.56148795, ...,  0.57403918,  0.69733646,
        0.52992071])  # one silhouette coefficient per sample; a few are negative (poorly assigned points)
silhouette_samples(X,y_pred).shape#(500,)
(500,)
silhouette_samples(X,y_pred).mean()
0.5882004012129721
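To make the definition concrete, the per-sample value returned by silhouette_samples can be recomputed by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points in the sample's own cluster and b is the smallest mean distance to the points of any other cluster. A hedged sketch (variable names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

i = 0
own = labels[i]
dists = np.linalg.norm(X - X[i], axis=1)  # distance from sample i to every point

# a: mean distance to the *other* points in i's own cluster
a = dists[(labels == own) & (np.arange(len(X)) != i)].mean()
# b: smallest mean distance to the points of any other cluster
b = min(dists[labels == c].mean() for c in set(labels) - {own})

s = (b - a) / max(a, b)
print(np.isclose(s, silhouette_samples(X, labels)[0]))  # True
```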
from sklearn.metrics import calinski_harabasz_score  # far faster to compute than the silhouette coefficient
# Calinski-Harabasz idea: the ratio of between-group dispersion (trace of the between-group scatter matrix)
# to within-group dispersion (trace of the within-group scatter matrix); a larger trace means more spread
X
y_pred
array([2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 0,
       ...])  # same 500 labels as above
calinski_harabasz_score(X, y_pred)
1809.991966958033
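The "trace ratio" description above can be checked directly: the Calinski-Harabasz score is (tr(B_k)/(k-1)) / (tr(W_k)/(n-k)), where B_k is the between-group scatter and W_k the within-group scatter. A minimal sketch reproducing sklearn's value (names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

n, k = len(X), 3
overall_mean = X.mean(axis=0)
tr_B = 0.0  # trace of the between-group scatter matrix
tr_W = 0.0  # trace of the within-group scatter matrix
for i in range(k):
    Xi = X[labels == i]
    ci = Xi.mean(axis=0)
    tr_B += len(Xi) * ((ci - overall_mean) ** 2).sum()
    tr_W += ((Xi - ci) ** 2).sum()

ch = (tr_B / (k - 1)) / (tr_W / (n - k))
print(np.isclose(ch, calinski_harabasz_score(X, labels)))  # True
```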
from time import time
# time() returns the timestamp at the moment that line executes
# a timestamp is a single number recording the current instant (seconds since the epoch)
t0 = time()
calinski_harabasz_score(X, y_pred)
time() - t0  # 0.0009999275207519531 on an earlier run; fast enough to round to 0.0
0.0
t0 = time()
silhouette_score(X,y_pred)
time() - t0  # 0.007976055145263672 on an earlier run; the silhouette is noticeably slower
0.006726503372192383
t0
1663512591.8903003
# fromtimestamp in the datetime module converts the numeric timestamp into a real date-time
import datetime
datetime.datetime.fromtimestamp(t0).strftime("%Y-%m-%d %H:%M:%S")
'2022-09-18 22:49:51'
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm  # colormap: functions that map a float in [0, 1] to a color
import numpy as np
import pandas as pd
# Use two plots to choose the best n_clusters:
# 1) each cluster's per-sample silhouette coefficients, compared against the overall mean
# 2) the clustered data colored by cluster
# Set the number of clusters
n_clusters = 4
# One canvas, two subplots, arranged as one row and two columns
fig, (ax1, ax2) = plt.subplots(1, 2)
# Canvas size
fig.set_size_inches(18,7)
# Only show silhouette coefficients above -0.1, to make the comparison easier to read
ax1.set_xlim([-0.1, 1])
# Leave vertical gaps between the per-cluster bar groups
ax1.set_ylim([0, X.shape[0] + (n_clusters + 1) * 10])
# Fit the model and pull out the labels
clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(X)
cluster_labels = clusterer.labels_
# silhouette_score returns the mean silhouette coefficient over all samples
# inputs: (feature matrix, cluster labels)
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,"The average silhouette_score is :", silhouette_avg)
# silhouette_samples returns each sample's silhouette coefficient
sample_silhouette_values = silhouette_samples(X, cluster_labels)
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437

# Lowest y coordinate where the first bar group starts
y_lower = 10
# Loop over the clusters
for i in range(n_clusters):
    # Pull out the silhouette coefficients of the samples in cluster i
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    # Sort them
    ith_cluster_silhouette_values.sort()
    # Number of samples in this cluster
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    # Top of this cluster's bar group
    y_upper = y_lower + size_cluster_i
    # cm.nipy_spectral maps a float in [0, 1] to a color
    color = cm.nipy_spectral(float(i)/n_clusters)
    # fill_betweenx fills horizontally between x=0 (the default) and the given x values,
    # over this cluster's y-range; facecolor sets the fill color, alpha the opacity
    ax1.fill_betweenx(np.arange(y_lower, y_upper),
                      ith_cluster_silhouette_values,
                      facecolor=color,
                      alpha=0.7)
    # Label the cluster with its number; text(x, y, string)
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# Draw the dataset's mean silhouette score as a dashed vertical line
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
# Hide the y-axis ticks
ax1.set_yticks([])
# Show only the specified x-axis ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
[<matplotlib.axis.XTick at 0x1cde985a8e0>, <matplotlib.axis.XTick at 0x1cde985a8b0>, ...]  # 7 XTick objects
cluster_labels.astype(float)
array([2., 2., 3., 1., 0., 1., 0., 0., 0., 0., 2., 2., 0., 1., 0., 2., 0.,
       ...])  # the labels cast to floats, ready for the colormap
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)  # map each label to a color (one color per sample)
# Scatter plot of the clustered data
ax2.scatter(X[:, 0], X[:, 1],
            marker='o',    # marker shape
            s=8,           # marker size
            c=colors)      # per-point colors
centers = clusterer.cluster_centers_  # the centroids of each cluster
# Mark the cluster centers with red crosses
ax2.scatter(centers[:, 0], centers[:, 1], marker='x',
            c="red", alpha=1, s=200)  # alpha is opacity; 1 means fully opaque
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")plt.suptitle(("Silhouette analysis for KMeans clustering on sample data"#对样本数据进行 KMeans 聚类的轮廓分析"with n_clusters = %d" % n_clusters),fontsize=14, fontweight='bold')#bold苍劲的粗体
plt.show()
<Figure size 432x288 with 0 Axes>
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

for n_clusters in [2, 3, 4, 5, 6, 7]:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, X.shape[0] + (n_clusters + 1) * 10])
    clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(X)
    cluster_labels = clusterer.labels_
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          ith_cluster_silhouette_values,
                          facecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='o', s=8, c=colors)
    centers = clusterer.cluster_centers_
    # Mark the cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x',
                c="red", alpha=1, s=200)
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
# In a business setting such as precision marketing, the cluster count can follow the budget:
# target four segments when funds are ample, two when funds are tight
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262

For n_clusters = 3 The average silhouette_score is : 0.5882004012129721

For n_clusters = 4 The average silhouette_score is : 0.6505186632729437

For n_clusters = 5 The average silhouette_score is : 0.56376469026194

For n_clusters = 6 The average silhouette_score is : 0.4504666294372765

For n_clusters = 7 The average silhouette_score is : 0.39092211029930857

X.shape
(500, 2)
y.shape
(500,)
plus = KMeans(n_clusters=10).fit(X)  # init defaults to "k-means++": seeding the centroids costs extra time, but fewer iterations are needed than with random init
plus.n_iter_  # number of iterations run
11
random = KMeans(n_clusters=10, init="random", random_state=420).fit(X)
random.n_iter_  # number of iterations run
19
random = KMeans(n_clusters=10, init="random", max_iter=10, random_state=420).fit(X)
y_pred_max10 = random.labels_
silhouette_score(X, y_pred_max10)
# Sometimes stopping at fewer iterations actually gives a better model
0.3952586444034157
random = KMeans(n_clusters = 10,init="random",max_iter=20,random_state=420).fit(X)
y_pred_max20 = random.labels_
silhouette_score(X,y_pred_max20)
0.3401504537571701
from sklearn.cluster import k_means
# The functional API: pass in the data and k, and it directly returns the centroids,
# the labels, and inertia (the total within-cluster sum of squared distances);
# setting return_n_iter=True additionally returns the iteration count of the best run
k_means(X, 4, return_n_iter=False)
(array([[-10.00969056,  -3.84944007],[ -1.54234022,   4.43517599],[ -6.08459039,  -3.17305983],[ -7.09306648,  -8.10994454]]),array([3, 3, 0, 1, 2, 1, 2, 2, 2, 2, 3, 3, 2, 1, 2, 3, 2, 3, 1, 2, 0, 0,2, 1, 2, 2, 1, 1, 0, 2, 3, 1, 2, 3, 2, 3, 0, 0, 3, 0, 2, 0, 1, 2,2, 3, 0, 2, 1, 1, 1, 0, 0, 2, 3, 0, 0, 0, 0, 2, 1, 1, 0, 2, 1, 2,3, 2, 0, 0, 3, 0, 2, 3, 2, 2, 3, 2, 2, 0, 1, 1, 0, 1, 1, 0, 0, 1,0, 0, 1, 3, 0, 2, 1, 3, 3, 2, 3, 1, 1, 3, 1, 0, 1, 2, 2, 1, 1, 0,2, 3, 1, 0, 1, 0, 1, 2, 1, 2, 0, 3, 3, 0, 2, 0, 1, 3, 3, 2, 1, 0,0, 0, 0, 3, 1, 2, 1, 1, 2, 3, 2, 1, 1, 1, 2, 2, 3, 3, 0, 0, 1, 3,1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 3, 3, 2, 1, 3, 0, 2, 3, 1, 0,0, 0, 0, 3, 2, 0, 1, 3, 3, 0, 2, 3, 3, 2, 1, 1, 3, 3, 2, 1, 2, 3,3, 1, 3, 0, 1, 2, 2, 3, 2, 0, 3, 2, 0, 2, 0, 3, 2, 2, 2, 1, 0, 1,2, 3, 0, 2, 0, 0, 0, 1, 0, 1, 3, 0, 3, 0, 1, 1, 0, 3, 1, 3, 2, 0,3, 3, 3, 3, 2, 0, 3, 0, 2, 1, 1, 2, 2, 1, 0, 2, 0, 1, 2, 1, 0, 0,1, 2, 3, 3, 0, 0, 0, 2, 1, 1, 2, 1, 0, 3, 1, 3, 1, 3, 3, 1, 3, 1,1, 2, 0, 0, 0, 2, 2, 0, 3, 1, 3, 3, 3, 2, 0, 2, 3, 0, 3, 3, 0, 3,3, 0, 1, 3, 2, 2, 1, 1, 0, 3, 1, 1, 2, 3, 1, 1, 2, 0, 1, 0, 2, 3,3, 1, 0, 3, 2, 1, 1, 2, 2, 2, 3, 2, 1, 1, 0, 1, 1, 1, 1, 3, 3, 2,1, 0, 2, 3, 1, 0, 1, 2, 1, 0, 2, 0, 1, 2, 2, 3, 1, 3, 3, 3, 3, 3,3, 0, 3, 1, 3, 1, 1, 0, 1, 2, 0, 0, 3, 1, 0, 1, 2, 3, 0, 0, 3, 0,0, 1, 1, 3, 0, 2, 1, 2, 2, 3, 3, 2, 3, 0, 0, 3, 0, 3, 0, 1, 3, 1,0, 2, 1, 0, 2, 1, 3, 2, 1, 1, 0, 2, 0, 2, 3, 1, 3, 2, 1, 3, 3, 3,0, 1, 2, 3, 2, 2, 0, 0, 3, 2, 2, 2, 2, 2, 2, 3, 2, 0, 3, 2, 1, 2,1, 2, 0, 0, 1, 1, 1, 0, 2, 0, 3, 0, 1, 2, 3, 1, 3, 1, 3, 2, 1, 1,3, 0, 2, 3, 0, 0, 0, 3, 2, 1, 0, 2, 3, 3, 3, 2]),908.3855684760616)
# 4. Case study: clustering for dimensionality reduction, i.e. vector quantization with KMeans
# Vector quantization reduces data size by compressing the amount of information carried per sample
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one array to its nearest point in another
from sklearn.datasets import load_sample_image  # loader for the sample images bundled with sklearn
from sklearn.utils import shuffle  # randomizes the order of an array, list, etc.
# Load the Summer Palace photo
china = load_sample_image("china.jpg")
china
array([[[174, 201, 231],[174, 201, 231],[174, 201, 231],...,[250, 251, 255],[250, 251, 255],[250, 251, 255]],[[172, 199, 229],[173, 200, 230],[173, 200, 230],...,[251, 252, 255],[251, 252, 255],[251, 252, 255]],[[174, 201, 231],[174, 201, 231],[174, 201, 231],...,[252, 253, 255],[252, 253, 255],[252, 253, 255]],...,[[ 88,  80,   7],[147, 138,  69],[122, 116,  38],...,[ 39,  42,  33],[  8,  14,   2],[  6,  12,   0]],[[122, 112,  41],[129, 120,  53],[118, 112,  36],...,[  9,  12,   3],[  9,  15,   3],[ 16,  24,   9]],[[116, 103,  35],[104,  93,  31],[108, 102,  28],...,[ 43,  49,  39],[ 13,  21,   6],[ 15,  24,   7]]], dtype=uint8)
# Check the data type
china.dtype  # uint8, the typical image dtype
dtype('uint8')
china.shape
# height x width x channels -> each pixel carries three features (R, G, B)
(427, 640, 3)
china[0][0]  # three values jointly determine one color
array([174, 201, 231], dtype=uint8)
# How many distinct colors does the image contain?
newimage = china.reshape((427 * 640, 3))  # flatten to a 2-D array of pixels
newimage.shape
(273280, 3)
import pandas as pd
pd.DataFrame(newimage).drop_duplicates().shape  # drop_duplicates removes duplicate rows: about 96,000 distinct colors
(96615, 3)
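The plan (per the case description) is to compress these ~96,615 colors down to 64. Rough back-of-the-envelope storage arithmetic shows why this counts as compression; the figures below are illustrative assumptions (1 byte per uint8 channel, 1 byte per cluster label), not measured sizes:

```python
# Rough storage arithmetic for vector quantization (illustrative only)
n_pixels = 427 * 640                  # 273,280 pixels
raw_bytes = n_pixels * 3              # 3 channels per pixel
vq_bytes = n_pixels * 1 + 64 * 3      # one label per pixel + a 64-color palette
print(raw_bytes, vq_bytes, round(raw_bytes / vq_bytes, 2))
```

Even in this crude accounting, storing labels plus a palette is roughly a third the size of the raw pixel data.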
# Visualize the image
plt.figure(figsize=(15,15))
plt.imshow(china)  # render the 3-D array as an image
<matplotlib.image.AxesImage at 0x1cdeaf4da30>

# View the other sample image in the module
flower = load_sample_image("flower.jpg")
plt.figure(figsize=(15,15))
plt.imshow(flower)
<matplotlib.image.AxesImage at 0x1cdeafbacd0>

n_clusters = 64
# plt.imshow works best with floats; convert china to float64 and scale it into [0, 1]
china = np.array(china, dtype=np.float64) / china.max()
(china < 0).sum()
0
(china > 1).sum()
0
# Convert china from image format (3-D) into a 2-D matrix
w, h, d = original_shape = tuple(china.shape)
w
427
h
640
d
3
assert d == 3
# assert is shorthand for "raise an error if not": the statement fails unless its condition is True
# here it requires d to equal 3; otherwise an AssertionError is raised
# Demonstrate assert
d_ = 3
assert d_ == 3, "number of features per cell is not 3"
image_array = np.reshape(china, (w * h, d))  # reshape restructures the array into a 2-D matrix of pixels
image_array
array([[0.68235294, 0.78823529, 0.90588235],[0.68235294, 0.78823529, 0.90588235],[0.68235294, 0.78823529, 0.90588235],...,[0.16862745, 0.19215686, 0.15294118],[0.05098039, 0.08235294, 0.02352941],[0.05882353, 0.09411765, 0.02745098]])
image_array.shape
(273280, 3)
# np.reshape(a, newshape, order='C'): the first argument is the array to restructure, the second is the new shape
# Demonstrate np.reshape
a = np.random.random((2,4))
a.shape
(2, 4)
a.reshape((4,2)) == np.reshape(a,(4,2))
array([[ True,  True],[ True,  True],[ True,  True],[ True,  True]])
np.reshape(a,(2,2,2)).shape
(2, 2, 2)
np.reshape(a,(8,1))
array([[0.02952851],[0.07764712],[0.98057566],[0.85412758],[0.80587119],[0.09098496],[0.77735822],[0.89251407]])
# np.reshape(a, (1, 4))  # would raise: 4 slots cannot hold the original 8 elements
# As long as the total number of elements is unchanged, the dimensions can be rearranged freely
a.shape
(2, 4)
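Conversely, a reshape whose element count does not match raises an error, which is easy to check directly:

```python
import numpy as np

a = np.random.random((2, 4))   # 8 elements in total
try:
    np.reshape(a, (3, 3))      # 9 slots -- cannot hold 8 elements
except ValueError:
    print("incompatible reshape raises ValueError")
```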
# First, use only 1,000 randomly sampled pixels to find the centroids
image_array_sample = shuffle(image_array, random_state=0)[:1000]  # shuffle randomizes the order
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(image_array_sample)  # cluster into 64 centroids (colors)
kmeans.cluster_centers_.shape
(64, 3)
# With the centroids found, assign all the data to the existing centroids
labels = kmeans.predict(image_array)
labels.shape
(273280,)
set(labels)  # a set keeps only the unique labels
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63}
# Replace every sample with its centroid
image_kmeans = image_array.copy()
image_kmeans  # 273,280 pixels, still ~96,000 distinct colors at this point
array([[0.68235294, 0.78823529, 0.90588235],[0.68235294, 0.78823529, 0.90588235],[0.68235294, 0.78823529, 0.90588235],...,[0.16862745, 0.19215686, 0.15294118],[0.05098039, 0.08235294, 0.02352941],[0.05882353, 0.09411765, 0.02745098]])
labels  # for each of the 273,280 pixels, the index of its cluster centroid
array([62, 62, 62, ...,  1,  6,  6])
kmeans.cluster_centers_[labels[0]]  # the centroid that the index points to
array([0.73524384, 0.82021116, 0.91925591])
for i in range(w*h):
    image_kmeans[i] = kmeans.cluster_centers_[labels[i]]  # overwrite pixel i with the color of its cluster centroid
# Inspect the quantized data: every pixel now holds one of the 64 centroid colors
image_kmeans.shape
(273280, 3)
pd.DataFrame(image_kmeans).drop_duplicates().shape
(64, 3)
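The per-pixel Python loop above gets the job done, but NumPy fancy indexing performs the same replacement in one vectorized step. A toy sketch with a hypothetical 4-color palette:

```python
import numpy as np

# hypothetical palette of 4 "centroid" colors and labels for 6 "pixels"
palette = np.array([[0., 0., 0.],
                    [1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.]])
labels = np.array([3, 0, 0, 2, 1, 3])

# palette[labels] gathers row labels[i] of the palette for every i,
# equivalent to the explicit for-loop over all pixels
quantized = palette[labels]
print(quantized.shape)  # (6, 3)
```

On the real data this would read `kmeans.cluster_centers_[labels]`, avoiding a 273,280-iteration Python loop.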
# Restore the image shape, otherwise it cannot be plotted
image_kmeans = image_kmeans.reshape(w,h,d)
image_kmeans.shape
(427, 640, 3)
centroid_random = shuffle(image_array, random_state=0)[:n_clusters]  # shuffle, then take 64 random pixels as centroids
centroid_random.shape  # 64 randomly chosen centroids
(64, 3)
labels_random = pairwise_distances_argmin(centroid_random, image_array, axis=0)
# pairwise_distances_argmin(x1, x2, axis) takes two arrays of points and matches them by distance:
# with axis=0 it computes, for every sample in x2, its distance to every point in x1 and returns
# an array of the same length as x2 holding the index of the nearest point in x1
# (i.e. each pixel's nearest centroid index)
labels_random.shape
(273280,)
labels_random
array([55, 55, 55, ..., 52, 60, 60], dtype=int64)
len(set(labels_random))
64
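The axis=0 behavior of pairwise_distances_argmin is easy to verify on a toy example: with two centroids and three samples, it returns one nearest-centroid index per sample.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
samples = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0]])

# axis=0: for each row of the second array, the index of the nearest
# row in the first array -- the output has the same length as `samples`
nearest = pairwise_distances_argmin(centroids, samples, axis=0)
print(nearest)  # [0 1 0]
```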
# Use the indices to replace every pixel's data with that of its random centroid
image_random = image_array.copy()
for i in range(w*h):
    image_random[i] = centroid_random[labels_random[i]]
# Restore the image shape
image_random = image_random.reshape(w,h,d)
image_random.shape
(427, 640, 3)
plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(image_kmeans)

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(image_random)
plt.show()
