Table of Contents

  • OMP parallel
    • Installing OpenMP
    • OpenMP examples
      • 1) OMP Hello World
      • 2) OMP parallel for
      • 3) Official OMP example
      • 4) Traversing a map with OMP
  • Installing and using TBB
    • Installing GCC 9
    • Installing TBB
    • Using TBB
  • OpenCV parallel_for_ map traversal test
  • OpenCV Mat.forEach traversal test

In image processing and similar applications we often need to iterate over matrices or large numbers of STL objects, so parallelizing these traversals matters for speeding up an algorithm.
OpenCV's parallel_for_ function can already parallelize traversal of ordinary STL containers such as vector;
see https://blog.csdn.net/weixin_41469272/article/details/126617752
This post introduces two other options: TBB and OMP.

OMP parallel

Installing OpenMP

sudo apt install libomp-dev

OpenMP examples

1) OMP Hello World

OMP is a comparatively lightweight parallel tool: placing #pragma omp parallel before a statement is enough to run it in parallel.

      #pragma omp parallel
      { /* every thread executes the code inside the braces */ }

Note: the C++ samples below occasionally use C-style code.
References: https://blog.csdn.net/ab0902cd/article/details/108770396
https://blog.csdn.net/zhongkejingwang/article/details/40350027
omp_test.cpp

#include <stdio.h>
#include <omp.h>

int main()
{
    printf("The output:\n");
    #pragma omp parallel  /* define multi-thread section */
    {
        printf("Hello World\n");
    }
    /* Resume serial section */
    printf("Done\n");
}

g++ omp_test.cpp -fopenmp -o omptest
./omptest

Result:

The output:
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Done

2) OMP parallel for

To parallelize a for loop, place #pragma omp parallel for before the for statement:

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int length = 6;
    float *buf = new float[length];
    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < length; i++) {
        int tid = omp_get_thread_num();
        printf("i:%d is handled on thread %d\n", i, tid);
        buf[i] = i;
    }
    delete[] buf;
}

The num_threads clause sets the number of threads.
Result

i:0 is handled on thread 0
i:1 is handled on thread 0
i:4 is handled on thread 2
i:5 is handled on thread 2
i:2 is handled on thread 1
i:3 is handled on thread 1

3) Official OMP example

#include <stdlib.h>   // malloc and free
#include <stdio.h>    // printf
#include <omp.h>      // OpenMP

// Very small values for this simple illustrative example
#define ARRAY_SIZE 8     // Size of arrays whose elements will be added together.
#define NUM_THREADS 4    // Number of threads to use for vector addition.

/*
 * Classic vector addition using OpenMP default data decomposition.
 *
 * Compile using gcc like this:
 *     gcc -o va-omp-simple VA-OMP-simple.c -fopenmp
 *
 * Execute:
 *     ./va-omp-simple
 */
int main(int argc, char *argv[])
{
    // elements of arrays a and b will be added
    // and placed in array c
    int *a;
    int *b;
    int *c;

    int n = ARRAY_SIZE;                 // number of array elements
    int n_per_thread;                   // elements per thread
    int total_threads = NUM_THREADS;    // number of threads to use
    int i;                              // loop index

    // allocate space for the arrays
    a = (int *) malloc(sizeof(int)*n);
    b = (int *) malloc(sizeof(int)*n);
    c = (int *) malloc(sizeof(int)*n);

    // initialize arrays a and b with consecutive integer values
    // as a simple example
    for (i = 0; i < n; i++) { a[i] = i; }
    for (i = 0; i < n; i++) { b[i] = i; }

    // Additional work to set the number of threads.
    // We hard-code to 4 for illustration purposes only.
    omp_set_num_threads(total_threads);

    // determine how many elements each thread will work on
    n_per_thread = n/total_threads;

    // Compute the vector addition.
    // Here is where the 4 threads are specifically 'forked' to
    // execute in parallel. This is directed by the pragma and
    // thread forking is compiled into the resulting executable.
    // Here we use a 'static schedule' so each thread works on
    // a 2-element chunk of the original 8-element arrays.
    #pragma omp parallel for shared(a, b, c) private(i) schedule(static, n_per_thread)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        // Which thread am I? Show who works on what for this small example
        printf("Thread %d works on element %d\n", omp_get_thread_num(), i);
    }

    // Check for correctness (only plausible for small vector size)
    // A test we would eventually leave out
    printf("i\ta[i]\t+\tb[i]\t=\tc[i]\n");
    for (i = 0; i < n; i++) {
        printf("%d\t%d\t\t%d\t\t%d\n", i, a[i], b[i], c[i]);
    }

    // clean up memory
    free(a); free(b); free(c);

    return 0;
}


In the pragma, variables listed in shared() are shared by all threads, while variables listed in private() are private to each thread.
schedule() selects the policy for distributing loop iterations across threads; the default is static.
The syntax is:
schedule(kind [, chunk_size])

kind:
• static: Iterations are divided into chunks of size chunk_size. Chunks are assigned to threads in the team in round-robin fashion in order of thread number.
• dynamic: Each thread executes a chunk of iterations then requests another chunk until no chunks remain to be distributed.
• guided: Each thread executes a chunk of iterations then requests another chunk until no chunks remain to be assigned. The chunk sizes start large and shrink to the indicated chunk_size as chunks are scheduled.
• auto: The decision regarding scheduling is delegated to the compiler and/or runtime system.
• runtime: The schedule and chunk size are taken from the run-sched-var ICV

static: OpenMP assigns chunks of chunk_size iterations to the threads in round-robin fashion. The assignment is fixed up front and follows the loop's iteration order.
dynamic: chunks are handed out at run time; whenever a thread finishes its chunk it requests the next one, so faster threads simply pick up more work and no new threads are spawned.
guided: the chunk size is proportional to the remaining iterations divided by the number of threads, so chunks start large and shrink toward chunk_size (default 1) as the loop progresses.
runtime: the scheduling decision is deferred to run time (taken from the run-sched-var ICV); a chunk size cannot be specified with this option (not tested here).

References:
https://blog.csdn.net/gengshenghong/article/details/7000979
https://blog.csdn.net/yiguagua/article/details/107053043

4) Traversing a map with OMP

About the "invalid controlling predicate" error:
OMP cannot parallelize a for loop whose condition uses "!=" or "==", because the number of iterations cannot be determined.

#include <iostream>
#include <map>
#include <string>
#include <mutex>
#include <cstdlib>
#include <ctime>
#include <omp.h>

using namespace std;

int main()
{
    map<int, int> mii;
    map<int, string> mis;
    for (int i = 0; i < 50; i++) {
        mis[i] = to_string(i);
    }

    clock_t start, end;
    start = clock();
#if 1
    mutex m;
    auto it = mis.begin();
    // Error: "!=" can not be used in omp: invalid controlling predicate
    #pragma omp parallel for num_threads(3) shared(it)
    for (int i = 0; i < mis.size(); i++) {
        int tid = omp_get_thread_num();
        m.lock();
        mii[it->first] = atoi(it->second.c_str());
        cout << "Thread " << tid << " handle " << it->first << endl;
        it++;   // advance the shared iterator inside the lock to avoid a race
        m.unlock();
    }
#else
    for (auto it : mis) {
        int tid = omp_get_thread_num();
        mii[it.first] = atoi(it.second.c_str());
        cout << "Thread " << tid << " handle " << it.first << endl;
    }
#endif
    end = clock();
    cout << "time = " << double(end - start)/CLOCKS_PER_SEC << "s" << endl;

    for (auto it = mii.begin(); it != mii.end(); it++) {
        cout << "it->first: " << it->first << " it->second: " << it->second << endl;
    }
}

Result:

With OMP: time = 0.000877s
Without OMP: time = 0.001862s

Installing and using TBB

Intel's oneTBB and the g++ version constrain each other, which makes installation somewhat involved.
Versions chosen for the tests below:
TBB: v2020.0
GCC: 9.4

Installing GCC 9

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9

Installing TBB

wget https://github.com/oneapi-src/oneTBB/archive/refs/tags/v2020.0.tar.gz
tar zxvf v2020.0.tar.gz
cd oneTBB
cp build/linux.gcc.inc build/linux.gcc-9.inc

Edit lines 15-16 of build/linux.gcc-9.inc:

CPLUS ?= g++-9
CONLY ?= gcc-9

# build
make compiler=gcc-9 stdver=c++17 -j20 DESTDIR=install tbb_build_prefix=build

# build test code:
g++-9 -std=c++17 -I ~/Download/softpackages/oneTBB/install/include/ -L/home/niebaozhen/Download/softpackages/oneTBB/install/lib/ std_for_each.cpp -ltbb -Wl,-rpath=/home/niebaozhen/Download/softpackages/oneTBB/install/lib/

Reference: https://blog.csdn.net/weixin_32207065/article/details/112270765

Tips:
TBB releases after v2021.1.1 switch to CMake, but those releases do not support gcc 9/10.
Meanwhile, gcc >= 9 is required to use TBB from the STL parallel algorithms, and compiling with the C++17 standard is recommended.

Build commands for v2021.1.1:

# tbb version >= v2021.1.1: cmake employed; however,
# libc++ 9 & 10 are incompatible with TBB versions > 2021.xx
mkdir build install
cd build
cmake -DCMAKE_INSTALL_PREFIX=../install/ -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_COMPILER=/usr/bin/g++-9 -DTBB_TEST=OFF ..
# or, with the default compiler:
cmake -DCMAKE_INSTALL_PREFIX=../install/ -DTBB_TEST=OFF ..
make -j30
make install

Using TBB

std_for_each.cpp

#include <iostream>
#include <unistd.h>
#include <map>
#include <algorithm>
#include <chrono>

#define __MUTEX__ 0
#if __MUTEX__
#include <mutex>
#endif

#if __GNUC__ >= 9
#include <execution>
#endif

using namespace std;

int main()
{
    //cout << "gnu version: " << __GNUC__ << endl;
    int a[] = {0,1,3,4,5,6,7,8,9};
    map<int, int> mii;
#if __MUTEX__
    mutex m;
#endif

    auto tt1 = chrono::steady_clock::now();
#if __GNUC__ >= 9
    for_each(execution::par, begin(a), std::end(a), [&](int i) {
#else
    for_each(begin(a), std::end(a), [&](int i) {
#endif
#if __MUTEX__
        lock_guard<mutex> guard(m);
#endif
        mii[i] = i*2+1;
        //sleep(1);
        //cout << "Sleep one second" << endl;
    });
    auto tt2 = chrono::steady_clock::now();
    auto dt = chrono::duration_cast<chrono::duration<double>>(tt2 - tt1);
    cout << "time = " << dt.count() << "s" << endl;

    for (auto it = mii.begin(); it != mii.end(); it++) {
        cout << "mii[" << it->first << "] = " << it->second << endl;
    }
}

build:

g++ std_for_each.cpp
or:
g++-9 -std=c++17  -I ~/Download/softpackages/oneTBB/install/include/ -L/home/niebaozhen/Download/softpackages/oneTBB/install/lib/ std_for_each.cpp -ltbb -Wl,-rpath=/home/niebaozhen/Download/softpackages/oneTBB/install/lib/

Result:

As the results show, when the per-element work is small, parallelism actually adds overhead and costs more time.
When each element involves more work (simulated here with sleep), the parallel version shows its advantage.

OpenCV parallel_for_ map traversal test

#include <iostream>
#include <opencv2/core.hpp>
#include <mutex>
#include <map>
#include <ctime>

using namespace std;
using namespace cv;

map<int, string> mis;
map<int, string>::iterator it;
mutex m;

void fun(const Range range)
{
    //cout << "test*******" << endl;
    for (int i = range.start; i < range.end; i++) {
        m.lock();
        cout << "it->first: " << it->first << " it->second: " << it->second << endl;
        it++;   // advance the shared iterator inside the lock to avoid a race
        m.unlock();
    }
}

void parallel()
{
    parallel_for_(cv::Range(0, mis.size()), &fun);
}

void oneline()
{
    for (auto it : mis) {
        m.lock();
        cout << "it.first: " << it.first << " it.second: " << it.second << endl;
        m.unlock();
    }
}

int main()
{
    for (int i = 0; i < 50; i++) {
        mis[i] = to_string(i);
    }
    it = mis.begin();

    clock_t start, end;
    start = clock();
#if 0
    parallel();
#else
    oneline();
#endif
    end = clock();
    cout << "time = " << double(end - start)/CLOCKS_PER_SEC << "s" << endl;
    return 0;
}

build:

g++ parallel_for_.cpp `pkg-config --libs --cflags opencv`

Result:
Parallel: time = 0.002147s
Serial: time = 0.000168s

Conclusion: the shared iterator `it` likely serializes the work and makes the parallel version slower.

OpenCV Mat.forEach traversal test

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/pcl_base.h>

#include <opencv2/imgproc.hpp>
#include <opencv2/opencv.hpp>

#include <omp.h>
#include <mutex>

using namespace pcl;
using namespace std;
using namespace cv;

#define __NOTHING__

PointCloud<PointXYZI>::Ptr dcp(new PointCloud<PointXYZI>());

int main()
{
    Mat img = imread("img.png", IMREAD_ANYDEPTH);
    clock_t start, end;

    // 1) nested OpenMP loops
    start = clock();
    pcl::PointXYZI point;
    #pragma omp parallel for private(point)
    for (int i = 0; i < img.rows; i++) {
        #pragma omp parallel for private(point)
        for (int j = 0; j < img.cols; j++) {
#ifndef __NOTHING__
            float val = img.at<uchar>(i, j) / 5000;
            if (val <= 0 || isnan(val)) { /* cout << "val is unavailable" */; continue; }
            point.x = (320 - j) / 500 / val * 10;
            point.y = (240 - i) / 500 / val * 10;
            point.z = 10;
            point.intensity = val;
            dcp->push_back(point);
#endif
        }
    }
    end = clock();
    cout << "0000000000time = " << double(end - start)/CLOCKS_PER_SEC << "s" << endl;
    dcp->clear();

    // 2) plain serial loops
    start = clock();
    for (int i = 0; i < img.rows; i++) {
        for (int j = 0; j < img.cols; j++) {
#ifndef __NOTHING__
            float val = img.at<uchar>(i, j) / 5000;
            if (val <= 0 || isnan(val)) { /* cout << "val is unavailable" */; continue; }
            pcl::PointXYZI point;
            point.x = (320 - j) / 500 / val * 10;
            point.y = (240 - i) / 500 / val * 10;
            point.z = 10;
            point.intensity = val;
            dcp->push_back(point);
#endif
        }
    }
    end = clock();
    cout << "1111111111time = " << double(end - start)/CLOCKS_PER_SEC << "s" << endl;
    dcp->clear();

    // 3) Mat::forEach
    start = clock();
    mutex m;
    img.forEach<float>([&m] (float &val, const int *position) {
#ifndef __NOTHING__
        pcl::PointXYZI point;
        // return in forEach skips this element and continues with the next
        val /= 5000;
        if (val <= 0 || isnan(val)) { /* cout << "val is unavailable" */; return; }
        point.x = (320 - position[0]) / 500 / val * 10;
        point.y = (240 - position[1]) / 500 / val * 10;
        point.z = 10;
        point.intensity = val;
        m.lock();
        dcp->push_back(point);
        m.unlock();
#endif
    });
    end = clock();
    cout << "222222222time = " << double(end - start)/CLOCKS_PER_SEC << "s" << endl;

    //int n = dcp->points.size();
    //cout << "points num: " << n << endl;
    //for (int i = 0; i < n; i++) {
    //    cout << dcp->points[i].x << " " << dcp->points[i].y << " " << dcp->points[i].z << endl;
    //}
}

build:
g++ pcl_new_test.cpp `pkg-config --cflags pcl_ros` `pkg-config --libs --cflags opencv`
result:

0000000000time = 0.000607s
1111111111time = 0.000622s
222222222time = 0.001397s
