【算法】FLANN中kd树构建和查询的简明分析

flann源码参考：flann: https://github.com/flann-lib/flannsudo apt install libflann-dev

K-最近邻搜索（K-Nearest Neighbour,KNN）

什么是kd树

FLANN中kd树的构建（超平面的确定）

维度划分方式对查询性能的影响

大规模数据swap的方式

FLANN中超平面确定方法

其他超平面确定方法

FLANN中kd树的查询

最近距离表&搜索半径

近似最近邻查询（Approximate Nearest Neighbor Search,ANNS）

FLANN中kd树查询策略

多线程并行查询

K-最近邻搜索（K-Nearest Neighbour,KNN）

K-最近邻(K-Nearest Neighbour, KNN)算法是一种基本分类与回归方法，属于监督学习方法，其工作机制非常简单：给定测试样本，基于某种距离度量找出训练集中与其最靠近的k个训练样本。算法的输入输出如下：

在特征匹配的应用中，我们通常找到最接近2个最近邻点（k=2），并在这两个关键点中，若最近的距离除以次近的距离小于指定阈值ratio，则接受这一对匹配点。Low推荐ratio的取值为0.8，但Low对大量存在尺度、旋转和亮度变化的两幅图片进行匹配，实验结果表明ratio取值在0. 4~0. 6为最佳，小于0. 4会造成非常少的匹配点，大于0. 6会造成大量的错误匹配点。

Low D G . Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004.

knn的实现：可以简单分为：索引构建和搜索策略

比如说，kd树就是其中一个比较主流索引构建的方法，当索引构建完毕剩下的就是确定搜索策略。不同的索引构建方法会带来匹配效率和匹配效果的不同。

常见的索引构建方法：有基于hash函数的局部敏感哈希LSH、基于kmeans的kd树（kmeans作为超平面确定方法）、层次graph等。

【大咖讲堂】杨晓春：高维数据的近似最近邻搜索及其在跨模态检索中的应用_哔哩哔哩_bilibili

什么是kd树

kd树的全称是k-dimensional tree，是一种分割k维数据空间的数据结构。

kd树是一种二叉树。在每一层上kd树沿着按照划分维度将数据分为两组，两组数据依次进行分割形成子树。分割的对象称之为超平面(hyperplane)，超平面垂直于对应维度的轴。理想的超平面是对应维度的中位数(median)，这样可以保证树的平衡(balance)，从而降低树的深度。

下图所示，是k=2时的一颗kd树。需要提醒的是进行划分(split)的维度的顺序可以是任意的，不一定按照x,y,z,x,y,z…的顺序进行。每一个节点都会记录划分的维度。FLANN中有划分维度选择（或者叫做，超平面确定）的算法（比如随机最大方差，或者，kmeans聚类）。

参考博文：KD树的主要算法以及FLANN(PCL)的实现分析 - Fun With GeometryFun With Geometry

在FLANN中，将kd树的节点定义如下：

../flann/src/cpp/flann/algorithms/kdtree_index.h

struct Node{/*** Dimension used for subdivision.* 如果是中间节点，则为划分维度* 如果是叶节点的话，则为该元素在数据集中的索引*/int divfeat;/*** The values used for subdivision.* 划分维度对应的值*/DistanceType divval;/*** Point data* 保存节点对应的原始元素，只有叶节点才会保存*/ElementType* point;/*** The child nodes.* child1为左子树和child2为右子树* 左子树节点及其子节点在划分维度的对应值均小于当前节点* 右子树节点及其子节点在划分维度的对应值均大于当前节点*/Node* child1, *child2;}

FLANN中kd树的构建（超平面的确定）

维度划分方式对查询性能的影响

网上很多其他实现会将kd树的维度划分就保持x,y,z,…,x,y,z…的顺序。实际上，并不一定要保持这样的顺序。而且，不同的划分顺序，往往能给kd树的效率带来影响。

因为点数与维数是固定的，kd树的深度(depth)也是固定的，如果采用均匀划分的方式，在分散程度比较小的维度上过度划分，势必会造成分散维度较大的维度划分不充分，更容易导致“长条形”区域出现。在各维度查询概率相同的情况下，长条形肯定更容易被命中，导致被查询次数过多而效率不佳。

以上两图为例，同样的输入数据(二维)，采用y-x-y的划分方式与x-x-y的方式得到的效果，可以看到，后一种划分比较充分，从而查询涉及无关分支的情况比较小，查询效率可能会比较高(具体高不高要看查询范围是不是趋向于正方形)。

大规模数据swap的方式

在kd树的构建中，涉及到了频繁的依据超平面进行分割的问题。因为kd树是面向高维数据的，高维数据的点的尺寸通常比较大，如果直接针对数据本身进行swap，那么会多次调用点数据本身的拷贝赋值函数，开销比较大。有没有比较合适的的方法？

其一就是采用指针作为数组元素，而不是对象本身，这样指针的swap就要廉价很多了。但是这样的话，相当于对输入数据就有要求了，恐怕适用性不强。

还有一种方法就是FLANN使用的方式，比较提倡使用。具体方法是，待构建的kd树持有原始数据points，然后我们构建一个和原始数据等长的索引数组indices，indices[i]标识数据元素在原始数据中的索引值，因此调整后的p_i=points[indices[i]]，当进行分割的时候，调整的只是indices的数值，开销就要小很多了，而且原始数据的顺序并没有发生改变。

比如，上图中将原始数据{12,45,67,89,10,72,1，-9，87}进行中值排序的时候，并不需要swap原始数据中元素本身，而是swap元素在原始数据中对应的索引。

FLANN中超平面确定方法

FLANN在每一次划分数据partition的时候，会随机挑选出100个元素，计算各个维度的均值和方差。找到方差最大的5个维度，并随机选出1个作为“划分维度divfeat”，该维度对应的均值作为“划分维度对应值divval”。

利用divfeat和divval，将原始数据在划分维度divfeat上划分为，并且在索引数组ind上重新排序（flann中使用快速排序）。

     *  dataset[ind[0 , ... , lim1-1]][divfeat] < divval*  dataset[ind[lim1 , ... , lim2-1]][divfeat] == divval*  dataset[ind[lim2 , ... , count]][divfeat] > divval

对于重新排序后的ind数组，ind[0 , ... , median]索引的原始数据在divfeat维度上的值小于等于divval；ind[median , ... , count]索引的原始数据在divfeat维度上的值大于等于divval；是否等于按照下述条件进行判断：

当 lim1 > count/2 时，median = lim1;
否则当 lim2 < count2 时，median = lim2;
否则median = count/2;

其他超平面确定方法

最理想的超平面是中位数，而确定中位数的算法还有其他的方法。比如说，因为而中位数并不要求数据完全排序，而是要求在中位数左侧的数据不大于中位数，中位数右侧的数据不小于中位数而已。因此还可以使用std::nth_element, 其算法复杂度为O(n)，系数为2左右。有技术博客表明当输入数据分布不均匀时，采用FLANN中超平面确定方法进行划分的方式很容易导致树不均衡从而使得树的深度增加，查询效率降低。

KD树的主要算法以及FLANN(PCL)的实现分析 - Fun With GeometryFun With Geometry

按照其描述的方法，我对flann kd树的超平面分割方法进行改进，结果表示效率提升仅2%，且匹配准确率下降。

其他超平面确定方法，还有k-means、Hierachical Clustering等，暂时挂起，后续会继续探索。

FLANN中kd树的查询

最近距离表&搜索半径

在kd树的查询过程中，会维护一个“最近距离表”，这个表中会记录k个最近节点及其距离。并且将表中的最大距离作为搜索距离。

划分距离是指：搜索点和KD树节点在划分维度上的差的绝对值。可以设想一下，如果划分距离小于搜索半径，则存在潜在最近节点，那么后续可以在相对应的子树上继续查询。

FLANN中采用KNNSimpleResultSet来维护“最近距离表”；addPoint的实现是一个on-line的冒泡排序。worstDist()返回表中的最远距离。

../flann/src/cpp/flann/util/result_set.h

template <typename DistanceType>
class KNNSimpleResultSet : public ResultSet<DistanceType>
{
public:typedef DistanceIndex<DistanceType> DistIndex;/*** Add a point to result set* @param dist distance to point* @param index index of point*/void addPoint(DistanceType dist, size_t index);DistanceType worstDist() const{return worst_distance_;}
private:size_t capacity_;size_t count_;DistanceType worst_distance_;std::vector<DistIndex> dist_index_;
};

近似最近邻查询（Approximate Nearest Neighbor Search,ANNS）

FLANN中的ANNS实现如下：

../flann/src/cpp/flann/algorithms/kdtree_index.h

  /***  Search starting from a given node of the tree.  Based on any mismatches at*  higher levels, all exemplars below this level must have a distance of*  at least "mindistsq".*/template<bool with_removed>void searchLevel(ResultSet<DistanceType>& result_set, const ElementType* vec, NodePtr node, DistanceType mindist, int& checkCount, int maxCheck, float epsError, Heap<BranchSt>* heap, DynamicBitset& checked) const{if (result_set.worstDist()<mindist) {//           printf("Ignoring branch, too far\n");return;}/* If this is a leaf node, then do check and return. */if ((node->child1 == NULL)&&(node->child2 == NULL)) {int index = node->divfeat;if (with_removed) {if (removed_points_.test(index)) return;}/*  Do not check same node more than once when searching multiple trees. */if ( checked.test(index) || ((checkCount>=maxCheck)&& result_set.full()) ) return;checked.set(index);checkCount++;DistanceType dist = distance_(node->point, vec, veclen_);result_set.addPoint(dist,index);return;}/* Which child branch should be taken first? */ElementType val = vec[node->divfeat];DistanceType diff = val - node->divval;NodePtr bestChild = (diff < 0) ? node->child1 : node->child2;NodePtr otherChild = (diff < 0) ? node->child2 : node->child1;/* Create a branch record for the branch not taken.  Add distanceof this feature boundary (we don't attempt to correct for anyuse of this feature in a parent node, which is unlikely tohappen and would have only a small effect).  Don't botheradding more branches to heap after halfway point, as cost ofadding exceeds their value.*///计算划分距离DistanceType new_distsq = mindist + distance_.accum_dist(val, node->divval, node->divfeat);//这里进行近似判断：//NN过滤Node的条件是 “搜索点的划分距离d” 大于 “当前搜索半径r” ，//ANN通过将 r除以“大于1的系数α”，使得Node 更容易被过滤，从而加快了查询速度，//α = 1+eps;这里epsError就是α//或者说，ANN通过将d*α作为“搜索点到 Node外接矩形的距离”，这样就更容易让 “搜索点的划分距离d” 大于 “当前搜索半径r” //但是得到的结果也是近似的最近点。//     if (2 * checkCount < maxCheck  ||  !result.full()) {if ((new_distsq*epsError < result_set.worstDist())||  !result_set.full()) {heap->insert( BranchSt(otherChild, new_distsq) );}/* Call recursively to search next level down. */searchLevel<with_removed>(result_set, vec, bestChild, mindist, checkCount, maxCheck, epsError, heap, checked);}

FLANN中kd树查询策略

0. 构建n个kd树：根据上文提到的随机最大方差的方法，构建n个kd树，默认为4。

1. 对于每一个搜索点到会在n个kd树中进行查询，查询过程中维护同一个“最近距离表”

2. 采用子树堆的方式将“搜索点到kd树节点的划分距离d” 小于 "当前搜索半径r"的对应子树分支进行缓存，在遍历过当前所有kd树之后，会对子树集进行查询

3. 设置总的节点访问数量，用于防止过度遍历子树导致的耗时问题

FLANN中对于kd树的数量参数设置在：../flann/src/cpp/flann/algorithms/kdtree_index.h

struct KDTreeIndexParams : public IndexParams
{KDTreeIndexParams(int trees = 4,int use_nthelement = 0){(*this)["algorithm"] = FLANN_INDEX_KDTREE;(*this)["trees"] = trees;//kd树的数量}
};

多线程并行查询

flann1.6.10没有包含多线程查询方法，flann1.8.4和flann1.9.1采用多线程，对于每个搜索点都会开启一个查询线程：

../flann/src/cpp/flann/algorithms/nn_index.h

    /*** @brief Perform k-nearest neighbor search* @param[in] queries The query points for which to find the nearest neighbors* @param[out] indices The indices of the nearest neighbors found* @param[out] dists Distances to the nearest neighbors found* @param[in] knn Number of nearest neighbors to return* @param[in] params Search parameters*/virtual int knnSearch(const Matrix<ElementType>& queries,Matrix<size_t>& indices,Matrix<DistanceType>& dists,size_t knn,const SearchParams& params) const{assert(queries.cols == veclen());assert(indices.rows >= queries.rows);assert(dists.rows >= queries.rows);assert(indices.cols >= knn);assert(dists.cols >= knn);bool use_heap;if (params.use_heap==FLANN_Undefined) {use_heap = (knn>KNN_HEAP_THRESHOLD)?true:false;}else {use_heap = (params.use_heap==FLANN_True)?true:false;}int count = 0;if (use_heap) {
#pragma omp parallel num_threads(params.cores){KNNResultSet2<DistanceType> resultSet(knn);
#pragma omp for schedule(static) reduction(+:count)for (int i = 0; i < (int)queries.rows; i++) {resultSet.clear();findNeighbors(resultSet, queries[i], params);size_t n = std::min(resultSet.size(), knn);resultSet.copy(indices[i], dists[i], n, params.sorted);indices_to_ids(indices[i], indices[i], n);count += n;}}}else {
#pragma omp parallel num_threads(params.cores){KNNSimpleResultSet<DistanceType> resultSet(knn);
#pragma omp for schedule(static) reduction(+:count)for (int i = 0; i < (int)queries.rows; i++) {resultSet.clear();findNeighbors(resultSet, queries[i], params);size_t n = std::min(resultSet.size(), knn);resultSet.copy(indices[i], dists[i], n, params.sorted);indices_to_ids(indices[i], indices[i], n);count += n;}}}return count;}

FLANN中利用SearchParams对于查询过程的参数进行管理

通过对参数cores进行设置线程数

../flann/src/cpp/flann/util/params.h

struct SearchParams
{SearchParams(int checks_ = 32, float eps_ = 0.0, bool sorted_ = true ) :checks(checks_), eps(eps_), sorted(sorted_){max_neighbors = -1;use_heap = FLANN_Undefined;cores = 1;matrices_in_gpu_ram = false;}// how many leafs to visit when searching for neighbours (-1 for unlimited)int checks;// search for eps-approximate neighbours (default: 0)float eps;// only for radius search, require neighbours sorted by distance (default: true)bool sorted;// maximum number of neighbors radius search should return (-1 for unlimited)int max_neighbors;// use a heap to manage the result set (default: FLANN_Undefined)tri_type use_heap;// how many cores to assign to the search (used only if compiled with OpenMP capable compiler) (0 for auto)int cores;// for GPU search indicates if matrices are already in GPU rambool matrices_in_gpu_ram;};

小结：

对比多种索引方法，综合考虑效率和匹配准确率，优先选择kd树。（后续可以研究一下基于“结构化graph”的索引方法，据说效率更高，但是flann没有相关实现）