


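I like to use the language of trees to describe cluster growth in DBSCAN. It starts with an arbitrary seed point that has at least MinPts points within a distance (or "radius") of ε. We then do a breadth-first search along these nearby points. For a given nearby point, we check how many points lie within its radius. If it has fewer than MinPts neighbors, the point becomes a leaf: we do not continue growing the cluster from it. If it does have at least MinPts neighbors, however, it is a branch, and we add all of its neighbors to the FIFO queue of our breadth-first search.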



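Noise points are identified as part of the process of selecting a new seed: if a particular seed point does not have enough neighbors, it is labeled as a noise point. This label is often temporary, though, because such noise points are frequently picked up later by some cluster as leaf nodes.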
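The branch, leaf, and noise roles described above can be checked directly from neighbor counts. The following is only a minimal sketch, not part of the implementation in this post: the helper name `classify_points` and the toy one-dimensional dataset are assumptions made for illustration. It counts a point as its own neighbor and uses a strict `< eps` comparison.

```python
import numpy as np

def classify_points(D, eps, min_pts):
    """Label each point 'branch', 'leaf', or 'noise' from its eps-neighborhood.

    Hypothetical helper for illustration only.
    """
    D = np.asarray(D, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # within[i, j] is True when point j lies strictly inside radius eps of i.
    # Each point counts as its own neighbor (distance 0).
    within = dists < eps
    is_branch = within.sum(axis=1) >= min_pts

    roles = []
    for i in range(len(D)):
        if is_branch[i]:
            roles.append("branch")
        elif np.any(within[i] & is_branch):
            # Too few neighbors to expand from, but inside a branch's radius.
            roles.append("leaf")
        else:
            roles.append("noise")
    return roles

points = [[0.0], [1.0], [1.5], [2.0], [10.0]]
roles = classify_points(points, eps=1.1, min_pts=3)
print(roles)  # ['leaf', 'branch', 'branch', 'branch', 'noise']
```

The point at 0.0 has only two neighbors, so if it were visited first as a seed it would be marked as noise. It sits within ε of the branch point at 1.0, though, so that cluster would later claim it as a leaf. The point at 10.0 is near no branch point and stays noise.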


Naftali Harris has created a web-based visualization (https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/) of running DBSCAN on a two-dimensional dataset.




import numpy

def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.

    MyDBSCAN takes a dataset `D` (a list of vectors), a threshold distance
    `eps`, and a required number of points `MinPts`.

    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """

    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #    -1 - Indicates a noise point
    #     0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.
    labels = [0]*len(D)

    # C is the ID of the current cluster.
    C = 0

    # This outer loop is just responsible for picking new seed points--a point
    # from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the
    # cluster growth is all handled by the 'growCluster' routine.

    # For each point P in the Dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):

        # Only points that have not already been claimed can be picked as new
        # seed points.
        # If the point's label is not 0, continue to the next point.
        if not (labels[P] == 0):
            continue

        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)

        # If the number is below MinPts, this point is noise.
        # This is the only condition under which a point is labeled
        # NOISE--when it's not a valid seed point. A NOISE point may later
        # be picked up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change--from NOISE to
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the
        # seed for a new cluster.
        else:
            # Get the next cluster label.
            C += 1

            # Assign the label to our seed point.
            labels[P] = C

            # Grow the cluster from the seed point.
            growCluster(D, labels, P, C, eps, MinPts)

    # All data has been clustered!
    return labels

def growCluster(D, labels, P, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.

    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.

    Parameters:
      `D`      - The dataset (a list of vectors)
      `labels` - List storing the cluster labels for all dataset points
      `P`      - Index of the seed point for this new cluster
      `C`      - The label for this new cluster.
      `eps`    - Threshold distance
      `MinPts` - Minimum required number of neighbors
    """

    # SearchQueue is a FIFO queue of points to evaluate. It will only ever
    # contain points which belong to cluster C (and have already been labeled
    # as such).
    #
    # The points are represented by their index values (not the actual vector).
    #
    # The FIFO queue behavior is accomplished by appending new points to the
    # end of the list, and using a while-loop rather than a for-loop.
    SearchQueue = [P]

    # For each point in the queue:
    #   1. Determine whether it is a branch or a leaf
    #   2. For branch points, add their unclaimed neighbors to the search queue
    i = 0
    while i < len(SearchQueue):

        # Get the next point from the queue.
        P = SearchQueue[i]

        # Find all the neighbors of P
        NeighborPts = regionQuery(D, P, eps)

        # If the number of neighbors is below the minimum, then this is a leaf
        # point and we move to the next point in the queue.
        if len(NeighborPts) < MinPts:
            i += 1
            continue

        # Otherwise, we have the minimum number of neighbors, and this is a
        # branch point.

        # For each of the neighbors...
        for Pn in NeighborPts:

            # If Pn was labelled NOISE during the seed search, then we
            # know it's not a branch point (it doesn't have enough
            # neighbors), so make it a leaf point of cluster C and move on.
            if labels[Pn] == -1:
                labels[Pn] = C
            # Otherwise, if Pn isn't already claimed, claim it as part of
            # C and add it to the search queue.
            elif labels[Pn] == 0:
                # Add Pn to cluster C.
                labels[Pn] = C

                # Add Pn to the SearchQueue.
                SearchQueue.append(Pn)

        # Advance to the next point in the FIFO queue.
        i += 1

    # We've finished growing cluster C!

def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.

    This function calculates the distance between a point P and every other
    point in the dataset, and then returns only those points which are within a
    threshold distance `eps`.
    """
    neighbors = []

    # For each point in the dataset...
    for Pn in range(0, len(D)):

        # If the distance is below the threshold, add it to the neighbors list.
        # (D[P] and D[Pn] must support subtraction, e.g. NumPy arrays.)
        if numpy.linalg.norm(D[P] - D[Pn]) < eps:
            neighbors.append(Pn)

    return neighbors
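The region query above scans the dataset one point at a time. The same distance check can be written as a single NumPy operation. This is only a sketch of an alternative, not part of the original listing: the name `region_query_vectorized` is hypothetical, and it assumes `D` can be converted to a two-dimensional array with one point per row.

```python
import numpy as np

def region_query_vectorized(D, P, eps):
    """Return the indices of all points in D strictly within eps of point P.

    Hypothetical vectorized variant for illustration only.
    """
    D = np.asarray(D, dtype=float)
    # Euclidean distance from point P to every row of D, in one call.
    dists = np.linalg.norm(D - D[P], axis=1)
    # flatnonzero gives the indices where the condition is True.
    return np.flatnonzero(dists < eps).tolist()

data = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.4], [5.0, 5.0]])
neighbors = region_query_vectorized(data, 0, eps=0.5)
print(neighbors)  # [0, 1, 2]
```

As in the loop version, the query point is returned as its own neighbor, so it counts toward MinPts. Each query still costs time proportional to the size of the dataset. A spatial index such as a k-d tree would be the usual next step for large datasets.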



