• Introduction
    • Concurrency vs. parallelism
• Steps for solving the problem
    • Understand the current state
    • Internal representation
    • Find alternative approaches
    • Choose among the alternatives

Please cite the source when reposting: http://blog.csdn.net/c602273091/article/details/55004823

Introduction

The previous hardware post, http://blog.csdn.net/c602273091/article/details/54728511, covered the platform-level basics of code acceleration. This post uses the k-means algorithm as an example to look at code acceleration at the software level, which involves three important parts: data structures, algorithms, and software architecture.

On the software side, the key task is to find the concurrency in the problem and then map it onto a parallel platform.

Concurrency vs. Parallelism

First, be clear about the difference between concurrency and parallelism. Concurrency is a property of the application: the program is decomposed into parts that can execute independently. Parallelism is a property of the platform: a parallel platform is one that can execute multiple tasks at the same time. The small sketch below illustrates the distinction.
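As a minimal sketch (not from the original post), the loop below is the concurrency: every iteration can make progress independently. The #pragma omp parallel for line is the mapping of that concurrency onto a parallel platform; how many iterations actually run at the same time depends on the hardware underneath.

#include <omp.h>

/* Hypothetical example: the concurrency is "each element can be scaled
   independently"; OpenMP maps that concurrency onto the available cores. */
void scale(float *x, int n, float a)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] *= a;          /* independent iterations -> concurrent work */
}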

Steps for Solving the Problem

  • Understand the current state
  • Examine the internal representation
  • Find alternative approaches
  • Choose among them

Take the k-means algorithm as an example. k-means finds k centers; data points that share the same nearest center belong to the same cluster. The k center values are updated repeatedly until a stopping condition is met.

The k-means algorithm runs as follows:

  • Initialize K centers
  • Assign each data point to one of the K clusters according to its distance to each center
  • Compute the mean of each cluster as its new center
  • Check the stopping condition; if it is not satisfied, go back to step 2

A sequential C implementation of k-means is shown below:

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/*   File:         seq_kmeans.c  (sequential version)                        */
/*   Description:  Implementation of simple k-means clustering algorithm     */
/*                 This program takes an array of N data objects, each with  */
/*                 M coordinates and performs a k-means clustering given a   */
/*                 user-provided value of the number of clusters (K). The    */
/*                 clustering results are saved in 2 arrays:                 */
/*                 1. a returned array of size [K][N] indicating the center  */
/*                    coordinates of K clusters                              */
/*                 2. membership[N] stores the cluster center ids, each      */
/*                    corresponding to the cluster a data object is assigned */
/*                                                                           */
/*   Author:  Wei-keng Liao                                                  */
/*            ECE Department, Northwestern University                        */
/*            email: wkliao@ece.northwestern.edu                             */
/*   Copyright, 2005, Wei-keng Liao                                          */
/*                                                                           */
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

// Copyright (c) 2005 Wei-keng Liao
// Copyright (c) 2011 Serban Giuroiu
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.

// -----------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>   /* for assert(); may also be pulled in via kmeans.h */

#include "kmeans.h"

/*----< euclid_dist_2() >----------------------------------------------------*/
/* square of Euclid distance between two multi-dimensional points            */
/* i.e. the squared distance between two vectors                             */
__inline static
float euclid_dist_2(int    numdims,  /* no. dimensions */
                    float *coord1,   /* [numdims] */
                    float *coord2)   /* [numdims] */
{
    int   i;
    float ans = 0.0;

    for (i = 0; i < numdims; i++)
        ans += (coord1[i] - coord2[i]) * (coord1[i] - coord2[i]);

    return (ans);
}

/*----< find_nearest_cluster() >---------------------------------------------*/
/* find the nearest cluster center for one object                            */
__inline static
int find_nearest_cluster(int     numClusters, /* no. clusters */
                         int     numCoords,   /* no. coordinates */
                         float  *object,      /* [numCoords] */
                         float **clusters)    /* [numClusters][numCoords] */
{
    int   index, i;
    float dist, min_dist;

    /* find the cluster id that has min distance to object */
    index    = 0;
    min_dist = euclid_dist_2(numCoords, object, clusters[0]);

    for (i = 1; i < numClusters; i++) {
        dist = euclid_dist_2(numCoords, object, clusters[i]);
        /* no need square root */
        if (dist < min_dist) { /* find the min and its array index */
            min_dist = dist;
            index    = i;
        }
    }
    return (index);
}

/*----< seq_kmeans() >-------------------------------------------------------*/
/* return an array of cluster centers of size [numClusters][numCoords]       */
float** seq_kmeans(float **objects,      /* in: [numObjs][numCoords] */
                   int     numCoords,    /* no. features */
                   int     numObjs,      /* no. objects */
                   int     numClusters,  /* no. clusters */
                   float   threshold,    /* % objects change membership */
                   int    *membership,   /* out: [numObjs] */
                   int    *loop_iterations)
{
    int      i, j, index, loop = 0;
    int     *newClusterSize; /* [numClusters]: no. objects assigned to each
                                new cluster */
    float    delta;          /* fraction of objects that change cluster in one iteration */
    float  **clusters;       /* out: [numClusters][numCoords], the K cluster centers */
    float  **newClusters;    /* [numClusters][numCoords] */

    /* allocate a 2D space for the returned clusters[] (coordinates of the
       cluster centers); note: one contiguous block plus row pointers */
    clusters    = (float**) malloc(numClusters * sizeof(float*));
    assert(clusters != NULL);
    clusters[0] = (float*)  malloc(numClusters * numCoords * sizeof(float));
    assert(clusters[0] != NULL);
    for (i = 1; i < numClusters; i++)
        clusters[i] = clusters[i-1] + numCoords;

    /* pick first numClusters elements of objects[] as initial cluster centers */
    for (i = 0; i < numClusters; i++)
        for (j = 0; j < numCoords; j++)
            clusters[i][j] = objects[i][j];

    /* initialize membership[] */
    for (i = 0; i < numObjs; i++) membership[i] = -1;

    /* need to initialize newClusterSize and newClusters[0] to all 0 */
    newClusterSize = (int*) calloc(numClusters, sizeof(int));
    assert(newClusterSize != NULL);

    newClusters    = (float**) malloc(numClusters * sizeof(float*));
    assert(newClusters != NULL);
    newClusters[0] = (float*)  calloc(numClusters * numCoords, sizeof(float));
    assert(newClusters[0] != NULL);
    for (i = 1; i < numClusters; i++)
        newClusters[i] = newClusters[i-1] + numCoords;

    do {
        delta = 0.0;
        /* assign each object to its nearest center and accumulate per-cluster
           coordinate sums, used later to compute the new centers */
        for (i = 0; i < numObjs; i++) {
            /* find the array index of the nearest cluster center */
            index = find_nearest_cluster(numClusters, numCoords, objects[i],
                                         clusters);

            /* if membership changes, increase delta by 1 */
            if (membership[i] != index) delta += 1.0;

            /* assign the membership to object i */
            membership[i] = index;

            /* update new cluster centers: sum of objects located within */
            newClusterSize[index]++;
            for (j = 0; j < numCoords; j++)
                newClusters[index][j] += objects[i][j];
        }

        /* average the sum and replace old cluster centers with newClusters */
        for (i = 0; i < numClusters; i++) {
            for (j = 0; j < numCoords; j++) {
                if (newClusterSize[i] > 0)
                    clusters[i][j] = newClusters[i][j] / newClusterSize[i];
                newClusters[i][j] = 0.0;   /* set back to 0 */
            }
            newClusterSize[i] = 0;   /* set back to 0 */
        }

        delta /= numObjs;
        /* stop when few objects change cluster or after 500 iterations */
    } while (delta > threshold && loop++ < 500);

    *loop_iterations = loop + 1;

    free(newClusters[0]);
    free(newClusters);
    free(newClusterSize);

    return clusters;
}

Understand the Current State

Understand the platform, the available resources, the goal (the result must be correct), and the requirement to satisfy (it must be fast).

Platform: Linux + GCC, x86 multi-core processor
Resources: 2~8 cores, 2~8-wide SIMD
Memory: 32 KB L1 cache, 256 KB L2 cache (the sketch below shows one way to confirm these numbers on the target machine)
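A small check I am adding (glibc-specific, not from the post) for querying these figures:

#include <stdio.h>
#include <unistd.h>

/* query core count and cache sizes; the _SC_LEVEL*_CACHE_SIZE names are
   glibc extensions and may return 0 or -1 on other systems */
int main(void)
{
    printf("online cores:  %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    return 0;
}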

In k-means, steps 2 and 3 take most of the time. The variables to watch here are the feature dimension D, the number of samples N, and the number of centers K, all of which affect speed; a rough per-iteration cost estimate follows.
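As an illustrative estimate (the example numbers are assumptions, not from the post): the assignment step computes one distance between every sample and every center, roughly N × K × D multiply-add operations per iteration, while recomputing the centers costs about N × D additions plus K × D divisions. With N = 1,000,000, K = 100 and D = 32, the assignment step alone is on the order of 3 × 10^9 floating-point operations per iteration, which is why it is the first target for SIMD and multi-threading.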

Internal Representation

Write down the important data structures used during the algorithm.

At the start there are N samples of dimension D (an N×D matrix) and K centers (K×D); what we want is the cluster label of each of the N samples (an N×1 vector). A small declaration sketch follows.
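A minimal sketch of these structures in C (alloc_2d is a hypothetical helper name, but the contiguous layout mirrors how clusters is built in seq_kmeans above):

#include <stdlib.h>

/* one contiguous block plus row pointers, so p[i][j] stays cache-friendly */
static float **alloc_2d(int rows, int cols)
{
    float **p = (float**) malloc(rows * sizeof(float*));
    p[0] = (float*) malloc((size_t)rows * cols * sizeof(float));
    for (int i = 1; i < rows; i++)
        p[i] = p[i-1] + cols;
    return p;
}

/* objects    = alloc_2d(N, D);             N x D input samples    */
/* clusters   = alloc_2d(K, D);             K x D cluster centers  */
/* membership = malloc(N * sizeof(int));    N x 1 output labels    */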

Find Alternative Approaches

An alternative approach maps the concurrent parts onto the parallel platform, or parallelizes particular operations.

Ways of mapping onto the platform include SIMD and OpenMP.

In k-means the candidates include, for example, the distance computation and the element-wise additions and subtractions inside it, as sketched below.
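As one hedged sketch of such an alternative (this variant is mine, not part of the original code), the inner loop of the distance computation can be handed to the compiler's vectorizer with OpenMP's simd directive; the reduction clause tells the compiler that the order of the additions may change:

/* SIMD variant of euclid_dist_2(): same arithmetic, vectorized inner loop.
   Compile with -fopenmp (or -fopenmp-simd) so the pragma takes effect. */
__inline static
float euclid_dist_2_simd(int numdims, const float *coord1, const float *coord2)
{
    float ans = 0.0f;
    int   i;
    #pragma omp simd reduction(+:ans)
    for (i = 0; i < numdims; i++) {
        float d = coord1[i] - coord2[i];
        ans += d * d;
    }
    return ans;
}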

Choose Among the Alternatives

The first question is whether the result is correct; only then does efficiency matter.
Concrete criteria include:

  • Efficiency: Runs quickly, makes good use of computational resources
  • Simplicity: Easy to understand code is easier to develop, debug, verify and modify
  • Portability: Should run on widest range of parallel computers
  • Scalability: Should be effective on a wide range of processing elements
  • Other considerations: practicality, hardware, engineering cost

This step can look rather abstract, so it needs practice. Concretely, the k-means code above is rewritten as OpenMP-accelerated code. The code below only achieves about a 1x speedup, which is of course not enough; in practice there is still a lot more that can be done.

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/*   File:         kmeans_clustering.c  (OpenMP version)                     */
/*   Description:  Implementation of simple k-means clustering algorithm     */
/*                 This program takes an array of N data objects, each with  */
/*                 M coordinates and performs a k-means clustering given a   */
/*                 user-provided value of the number of clusters (K). The    */
/*                 clustering results are saved in 2 arrays:                 */
/*                 1. a returned array of size [K][N] indicating the center  */
/*                    coordinates of K clusters                              */
/*                 2. membership[N] stores the cluster center ids, each      */
/*                    corresponding to the cluster a data object is assigned */
/*                                                                           */
/*   Author:  Wei-keng Liao                                                  */
/*            ECE Department, Northwestern University                        */
/*            email: wkliao@ece.northwestern.edu                             */
/*   Copyright, 2005, Wei-keng Liao                                          */
/*                                                                           */
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#include "kmeans.h"

/*----< euclid_dist_2() >----------------------------------------------------*/
/* square of Euclid distance between two multi-dimensional points            */
__inline static
float euclid_dist_2(int    numdims,  /* no. dimensions */
                    float *coord1,   /* [numdims] */
                    float *coord2)   /* [numdims] */
{
    int   i;
    float ans = 0.0;

    for (i = 0; i < numdims; i++)
        ans += (coord1[i] - coord2[i]) * (coord1[i] - coord2[i]);

    return (ans);
}

/*----< find_nearest_cluster() >---------------------------------------------*/
__inline static
int find_nearest_cluster(int     numClusters, /* no. clusters */
                         int     numCoords,   /* no. coordinates */
                         float  *object,      /* [numCoords] */
                         float **clusters)    /* [numClusters][numCoords] */
{
    int   index, i;
    float dist, min_dist;

    /* find the cluster id that has min distance to object */
    index    = 0;
    min_dist = euclid_dist_2(numCoords, object, clusters[0]);

    for (i = 1; i < numClusters; i++) {
        dist = euclid_dist_2(numCoords, object, clusters[i]);
        /* no need square root */
        if (dist < min_dist) { /* find the min and its array index */
            min_dist = dist;
            index    = i;
        }
    }
    return (index);
}

/*----< kmeans_clustering() >------------------------------------------------*/
/* return an array of cluster centers of size [numClusters][numCoords]       */
float** omp_kmeans(int     is_perform_atomic, /* in: */
                   float **objects,           /* in: [numObjs][numCoords] */
                   int     numCoords,         /* no. coordinates */
                   int     numObjs,           /* no. objects */
                   int     numClusters,       /* no. clusters */
                   float   threshold,         /* % objects change membership */
                   int    *membership)        /* out: [numObjs] */
{
    int      i, j, k, index, loop = 0;
    int     *newClusterSize; /* [numClusters]: no. objects assigned to each
                                new cluster */
    float    delta;          /* % of objects change their clusters */
    float  **clusters;       /* out: [numClusters][numCoords] */
    float  **newClusters;    /* [numClusters][numCoords] */
    double   timing;

    int      nthreads;             /* no. threads */
    int    **local_newClusterSize; /* [nthreads][numClusters] */
    float ***local_newClusters;    /* [nthreads][numClusters][numCoords] */

    nthreads = omp_get_max_threads();

    /* allocate a 2D space for the returned clusters[] (coordinates of the
       cluster centers) */
    clusters    = (float**) malloc(numClusters * sizeof(float*));
    assert(clusters != NULL);
    clusters[0] = (float*)  malloc(numClusters * numCoords * sizeof(float));
    assert(clusters[0] != NULL);
    for (i = 1; i < numClusters; i++)
        clusters[i] = clusters[i-1] + numCoords;

    /* pick first numClusters elements of objects[] as initial cluster centers */
    for (i = 0; i < numClusters; i++)
        for (j = 0; j < numCoords; j++)
            clusters[i][j] = objects[i][j];

    /* initialize membership[] */
    for (i = 0; i < numObjs; i++) membership[i] = -1;

    /* need to initialize newClusterSize and newClusters[0] to all 0 */
    newClusterSize = (int*) calloc(numClusters, sizeof(int));
    assert(newClusterSize != NULL);

    newClusters    = (float**) malloc(numClusters * sizeof(float*));
    assert(newClusters != NULL);
    newClusters[0] = (float*)  calloc(numClusters * numCoords, sizeof(float));
    assert(newClusters[0] != NULL);
    for (i = 1; i < numClusters; i++)
        newClusters[i] = newClusters[i-1] + numCoords;

    if (!is_perform_atomic) {
        /* each thread calculates new centers using a private space,
           then thread 0 does an array reduction on them. This approach
           should be faster */
        local_newClusterSize    = (int**) malloc(nthreads * sizeof(int*));
        assert(local_newClusterSize != NULL);
        local_newClusterSize[0] = (int*)  calloc(nthreads * numClusters,
                                                 sizeof(int));
        assert(local_newClusterSize[0] != NULL);
        for (i = 1; i < nthreads; i++)
            local_newClusterSize[i] = local_newClusterSize[i-1] + numClusters;

        /* local_newClusters is a 3D array */
        local_newClusters    = (float***)malloc(nthreads * sizeof(float**));
        assert(local_newClusters != NULL);
        local_newClusters[0] = (float**) malloc(nthreads * numClusters *
                                                sizeof(float*));
        assert(local_newClusters[0] != NULL);
        for (i = 1; i < nthreads; i++)
            local_newClusters[i] = local_newClusters[i-1] + numClusters;
        for (i = 0; i < nthreads; i++) {
            for (j = 0; j < numClusters; j++) {
                local_newClusters[i][j] = (float*) calloc(numCoords,
                                                          sizeof(float));
                assert(local_newClusters[i][j] != NULL);
            }
        }
    }

    if (_debug) timing = omp_get_wtime();
    do {
        delta = 0.0;

        if (is_perform_atomic) {
            #pragma omp parallel for \
                    private(i,j,index) \
                    firstprivate(numObjs,numClusters,numCoords) \
                    shared(objects,clusters,membership,newClusters,newClusterSize) \
                    schedule(static) \
                    reduction(+:delta)
            for (i = 0; i < numObjs; i++) {
                /* find the array index of the nearest cluster center */
                index = find_nearest_cluster(numClusters, numCoords, objects[i],
                                             clusters);

                /* if membership changes, increase delta by 1 */
                if (membership[i] != index) delta += 1.0;

                /* assign the membership to object i */
                membership[i] = index;

                /* update new cluster centers: sum of objects located within */
                #pragma omp atomic
                newClusterSize[index]++;
                for (j = 0; j < numCoords; j++) {
                    #pragma omp atomic
                    newClusters[index][j] += objects[i][j];
                }
            }
        }
        else {
            #pragma omp parallel \
                    shared(objects,clusters,membership,local_newClusters,local_newClusterSize)
            {
                int tid = omp_get_thread_num();
                #pragma omp for \
                        private(i,j,index) \
                        firstprivate(numObjs,numClusters,numCoords) \
                        schedule(static) \
                        reduction(+:delta)
                for (i = 0; i < numObjs; i++) {
                    /* find the array index of the nearest cluster center */
                    index = find_nearest_cluster(numClusters, numCoords,
                                                 objects[i], clusters);

                    /* if membership changes, increase delta by 1 */
                    if (membership[i] != index) delta += 1.0;

                    /* assign the membership to object i */
                    membership[i] = index;

                    /* update new cluster centers: sum of all objects located
                       within (average will be performed later) */
                    local_newClusterSize[tid][index]++;
                    for (j = 0; j < numCoords; j++)
                        local_newClusters[tid][index][j] += objects[i][j];
                }
            } /* end of #pragma omp parallel */

            /* let the main thread perform the array reduction */
            for (i = 0; i < numClusters; i++) {
                for (j = 0; j < nthreads; j++) {
                    newClusterSize[i] += local_newClusterSize[j][i];
                    local_newClusterSize[j][i] = 0;
                    for (k = 0; k < numCoords; k++) {
                        newClusters[i][k] += local_newClusters[j][i][k];
                        local_newClusters[j][i][k] = 0.0;
                    }
                }
            }
        }

        /* average the sum and replace old cluster centers with newClusters */
        for (i = 0; i < numClusters; i++) {
            for (j = 0; j < numCoords; j++) {
                if (newClusterSize[i] > 1)
                    clusters[i][j] = newClusters[i][j] / newClusterSize[i];
                newClusters[i][j] = 0.0;   /* set back to 0 */
            }
            newClusterSize[i] = 0;   /* set back to 0 */
        }

        delta /= numObjs;
    } while (delta > threshold && loop++ < 500);

    if (_debug) {
        timing = omp_get_wtime() - timing;
        printf("nloops = %2d (T = %7.4f)", loop, timing);
    }

    if (!is_perform_atomic) {
        free(local_newClusterSize[0]);
        free(local_newClusterSize);

        for (i = 0; i < nthreads; i++)
            for (j = 0; j < numClusters; j++)
                free(local_newClusters[i][j]);
        free(local_newClusters[0]);
        free(local_newClusters);
    }
    free(newClusters[0]);
    free(newClusters);
    free(newClusterSize);

    return clusters;
}
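To see how omp_kmeans() is meant to be called, here is a hedged driver sketch (this driver is mine, not part of the original package; it assumes kmeans.h declares omp_kmeans and that the _debug flag is defined somewhere in the build):

#include <stdio.h>
#include <stdlib.h>
#include "kmeans.h"

int main(void)
{
    int numObjs = 10000, numCoords = 2, numClusters = 3, i;

    /* random 2-D points in [0,1), stored as one contiguous block */
    float **objects = (float**) malloc(numObjs * sizeof(float*));
    objects[0] = (float*) malloc((size_t)numObjs * numCoords * sizeof(float));
    for (i = 1; i < numObjs; i++) objects[i] = objects[i-1] + numCoords;
    for (i = 0; i < numObjs * numCoords; i++)
        objects[0][i] = rand() / (float) RAND_MAX;

    int *membership = (int*) malloc(numObjs * sizeof(int));

    /* is_perform_atomic = 0 selects the per-thread copies + reduction path */
    float **clusters = omp_kmeans(0, objects, numCoords, numObjs,
                                  numClusters, 0.001f, membership);

    printf("first center: (%f, %f)\n", clusters[0][0], clusters[0][1]);

    free(clusters[0]); free(clusters);
    free(membership);
    free(objects[0]); free(objects);
    return 0;
}

Compile with something like gcc -O2 -fopenmp driver.c omp_kmeans.c (the file names are assumptions), and control the thread count with the OMP_NUM_THREADS environment variable.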
