Recently I have been reading about multi-core programming. To put it simply: computer CPUs now usually have at least two cores, and 4-core and 8-core CPUs have gradually entered ordinary homes, yet the traditional single-threaded programming style cannot exploit this power, so multi-core programming has emerged. As I understand it, multi-core programming can be seen as a layer of abstraction over multi-threaded programming: it provides a few simple APIs so that users can harness multiple cores without spending too much effort on the underlying threading details, which improves programming efficiency. The two multi-core programming tools I have focused on over the past two days are OpenMP and TBB. Judging by current discussion on the internet, TBB is set to supersede OpenMP; for example, OpenCV used OpenMP in the past but abandoned it in favor of TBB starting with version 2.3. However, TBB is still rather complicated, whereas OpenMP is very easy to use. With limited time and energy, I cannot afford to dig deep into TBB, so here I will share some of the OpenMP knowledge I have picked up over the past two days and discuss it with you.
OpenMP supports the C, C++, and Fortran programming languages, and compilers with OpenMP support include Sun Studio, the Intel compiler, Microsoft Visual Studio, and GCC. I am using Microsoft Visual Studio 2008 with a quad-core Intel i5 CPU. First, let's go over configuring OpenMP in Microsoft Visual Studio 2008. It is very simple, just two steps:
(1) Create a project. I will not elaborate on this.
(2) After creating the project, click "Project" > "Properties" in the menu bar, then choose "Configuration Properties" > "C/C++" > "Language" > "OpenMP Support" and select "Yes" from the drop-down menu.
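Incidentally (this is standard GCC usage rather than something from the original article), the GCC compiler listed above needs no IDE setting at all; OpenMP is enabled with a single compiler flag:

g++ -fopenmp main.cpp -o main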
Now the configuration is complete. Below is a small example illustrating how easy OpenMP is to use. In this example, there is a simple test() function, and main() runs test() eight times in a for loop.
#include <iostream>
#include <time.h>

void test()
{
    int a = 0;
    for (int i = 0; i < 100000000; i++)
        a++;
}

int main()
{
    clock_t t1 = clock();
    for (int i = 0; i < 8; i++)
        test();
    clock_t t2 = clock();
    std::cout << "time: " << t2 - t1 << std::endl;
}
After compiling and running, the measured time is 1.971 seconds (strictly, the program prints the raw difference in clock ticks, which are milliseconds with the Microsoft compiler, so the output reads 1971). Next, with a single line, we turn the code above into multi-core code.
#include <iostream>
#include <time.h>

void test()
{
    int a = 0;
    for (int i = 0; i < 100000000; i++)
        a++;
}

int main()
{
    clock_t t1 = clock();
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        test();
    clock_t t2 = clock();
    std::cout << "time: " << t2 - t1 << std::endl;
}
After compiling and running, the measured time is 0.546 seconds, almost a quarter of the time above. (One caveat I should add: MSVC's clock() measures wall-clock time, but on POSIX systems clock() measures total CPU time across all threads, so omp_get_wtime(), used in the later examples, is the more portable choice for timing parallel code.)
As you can see, OpenMP is easy to use. In the code above we included no additional header files and linked no additional libraries; we merely added the line #pragma omp parallel for before the for loop. Better still, the same code also compiles on a single-core machine, or on a machine where the compiler's OpenMP support is not set to "Yes": the compiler simply ignores the #pragma line and compiles and runs the program in the traditional single-core, serial way! The only extra step we need (when distributing Visual Studio 2008 builds) is to copy vcomp90.dll from C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\x86\Microsoft.VC90.OpenMP and vcomp90d.dll from C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\Debug_NonRedist\x86\Microsoft.VC90.DebugOpenMP into the project's working directory.
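Relatedly, here is a small addition of my own (the _OPENMP macro is required by the OpenMP standard, though this example is not from the original article): you can detect at compile time whether OpenMP is enabled, which is handy if you also want to call OpenMP runtime functions in code that must stay portable.

#include <iostream>
// <omp.h> is only guaranteed to be available when OpenMP support is enabled.
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    std::cout << "OpenMP enabled, " << omp_get_num_procs() << " processors" << std::endl;
#else
    std::cout << "compiled without OpenMP; running serially" << std::endl;
#endif
    return 0;
}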
Let me offer a simple analysis of the code above, based on my understanding.
When the compiler encounters #pragma omp parallel for, it automatically divides the iterations of the following for loop into N parts (where N is, by default, the number of CPU cores), assigns each part to one core, and the cores execute their parts in parallel. The following code verifies this analysis.
#include <iostream>

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        std::cout << i << std::endl;
    return 0;
}
The console prints 0 3 4 5 8 9 6 7 1 2. Note: because the cores execute in parallel, the output order may differ from run to run.
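To make the division of labor visible, here is a small sketch of my own (not from the original article) that records which thread executed each iteration; with the default static schedule, each thread typically receives one contiguous chunk of the iteration range.

#include <iostream>
#include <omp.h>

int main()
{
    int owner[10]; // owner[i] records which thread executed iteration i
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        owner[i] = omp_get_thread_num(); // each iteration writes a distinct element, so no race
    for (int i = 0; i < 10; i++)
        std::cout << "iteration " << i << " ran on thread " << owner[i] << std::endl;
    return 0;
}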
Next let's talk about race conditions, the thorniest issue in all multi-threaded programming. The problem can be stated as follows: when multiple threads execute in parallel, they may read and write the same variable at the same time, producing unpredictable results. For example, in the code below, given an array a of 10 integer elements, we use a for loop to compute the sum of its elements and store the result in the variable sum.
#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        sum = sum + a[i];
    std::cout << "sum: " << sum << std::endl;
    return 0;
}
If we comment out #pragma omp parallel for and let the program run in the traditional serial way, then obviously sum = 55. After running in parallel mode, however, sum becomes some other value; in one run, for example, sum = 49. The reason is that while thread A is executing sum = sum + a[i], thread B may be updating sum at the same time, so A accumulates onto a stale value of sum and the result is wrong.
So how do we sum an array in parallel with OpenMP? Here is a basic solution first. The idea is to create an array sumarray whose length equals the number of threads executing in parallel (by default, that number equals the number of CPU cores); inside the for loop, each thread updates only the sumarray element corresponding to its own thread ID, and afterwards the elements of sumarray are accumulated into sum. The code is as follows:
#include <iostream>
#include <omp.h>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    int corenum = omp_get_num_procs();   // obtain the number of processors
    int* sumarray = new int[corenum];    // one partial sum per processor
    for (int i = 0; i < corenum; i++)    // initialize every element to 0
        sumarray[i] = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        int k = omp_get_thread_num();    // obtain the ID of the current thread
        sumarray[k] = sumarray[k] + a[i];
    }
    for (int i = 0; i < corenum; i++)
        sum = sum + sumarray[i];
    std::cout << "sum: " << sum << std::endl;
    delete[] sumarray;
    return 0;
}
Note that in the code above we use the omp_get_num_procs() function to obtain the number of processors and the omp_get_thread_num() function to obtain each thread's ID; to use these two functions we need to include <omp.h>.
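For reference, here is a minimal demo of my own (not from the original article) showing these two functions at work, together with omp_get_num_threads(), which reports how many threads are in the current team:

#include <iostream>
#include <omp.h>

int main()
{
    std::cout << "processors: " << omp_get_num_procs() << std::endl;
    #pragma omp parallel
    {
        // Output from concurrent threads may interleave; this is only a demo.
        std::cout << "hello from thread " << omp_get_thread_num()
                  << " of " << omp_get_num_threads() << std::endl;
    }
    return 0;
}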
Although the code above achieves the goal, it introduces quite a bit of extra work: building the array sumarray and then accumulating all of its elements with another for loop. (It also has a subtle performance hazard: adjacent sumarray elements sit on the same cache line, so the threads can contend through false sharing.) Is there a simpler way? There is: OpenMP provides us with another tool, reduction. See the following code:
#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10; i++)
        sum = sum + a[i];
    std::cout << "sum: " << sum << std::endl;
    return 0;
}
In the code above, we append reduction(+:sum) to #pragma omp parallel for, which tells the compiler: run the following for loop with multiple threads, but give each thread its own private copy of the sum variable; after the loop ends, add every thread's copy into the final sum. In effect, reduction does automatically what our sumarray code did by hand.
reduction is convenient, but it supports only basic operations, such as +, -, *, &, |, && and ||. Sometimes we need to avoid a race condition, but the operation involved is beyond what reduction can express. What then? We need yet another OpenMP tool, critical. Consider the following example, in which we compute the maximum value of array a and store the result in max. (Newer OpenMP versions, 3.1 and later, added min and max reductions, but the OpenMP 2.0 implementation in Visual Studio does not have them, which is why critical is needed here.)
#include <iostream>

int main()
{
    int max = 0;
    int a[10] = {11,2,33,49,113,20,321,250,689,16};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        int temp = a[i];
        #pragma omp critical
        {
            if (temp > max)
                max = temp;
        }
    }
    std::cout << "max: " << max << std::endl;
    return 0;
}
In the example above, the iterations of the for loop are again divided into N parts and executed in parallel, but we wrap if (temp > max) max = temp; in #pragma omp critical. This means the threads still execute the loop body in parallel, but when a thread reaches the statements inside critical, it first checks whether another thread is currently executing them; if so, it waits until that thread finishes before entering. This avoids the race condition, but it obviously reduces execution speed, because threads may have to wait for one another.
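As a side note of my own (not from the original article): for a simple single-variable update such as the earlier sum example, OpenMP also offers the lighter-weight atomic directive, which protects just one memory update rather than an arbitrary block of statements:

#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        #pragma omp atomic // protects only this single read-modify-write
        sum += a[i];
    }
    std::cout << "sum: " << sum << std::endl;
    return 0;
}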
With the basic knowledge above, I can already do many things. Now let's look at a concrete application: we read two images from the hard disk, extract feature points from each, match the feature points, and finally draw the images with the matched feature points connected. Understanding this example requires some basic knowledge of image processing, which I will not detail here. In addition, compiling this example requires OpenCV; the version I used is 2.3.1, and I will not describe OpenCV's installation and configuration either. First, let's look at the traditional serial approach.
1 # include "opencv2/highgui. HPP"
2 # include "opencv2/features2d/features2d. HPP"
3 # include <iostream>
4 # include <OMP. h>
5 Int main (){
6 CV: surffeaturedetector detector (400 );
7 CV: surfdescriptorextractor extractor;
8 CV: bruteforcematcher <CV: L2 <float> matcher;
9 STD: vector <CV: dmatch> matches;
10 CV: mat im0, im1;
11 STD: vector <CV: keypoint> keypoints0, keypoints1;
12 CV: mat descriptors0, descriptors1;
13 double T1 = omp_get_wtime ();
14 // process the first image first
15 im0 = CV: imread ("rgb0.jpg", cv_load_image_grayscale );
16 detector. Detect (im0, keypoints0 );
17 extractor. Compute (im0, keypoints0, descriptors0 );
18 STD: cout <"find" <keypoints0.size () <"keypoints in im0" <STD: Endl;
19 // re-process the second image
20 im1 = CV: imread ("rgb1.jpg", cv_load_image_grayscale );
21 detector. Detect (im1, keypoints1 );
22 extractor. Compute (im1, keypoints1, descriptors1 );
23 STD: cout <"find" <keypoints1.size () <"keypoints in im1" <STD: Endl;
24 Double t2 = omp_get_wtime ();
25 STD: cout <"time:" <t2-t1 <STD: Endl;
26 matcher. Match (descriptors0, descriptors1, matches );
27 CV: mat img_matches;
28 CV: drawmatches (im0, keypoints0, im1, keypoints1, matches, img_matches );
29 CV: namedwindow ("matches", cv_window_autosize );
30 CV: imshow ("matches", img_matches );
31 CV: waitkey (0 );
32 return 1;
33}
Obviously, the two images can be read, and their feature points and descriptors extracted, in parallel. The modified code follows:
1 #include "opencv2/highgui/highgui.hpp"
2 #include "opencv2/features2d/features2d.hpp"
3 #include <iostream>
4 #include <vector>
5 #include <omp.h>
6 int main( ){
7 int imNum = 2;
8 std::vector<cv::Mat> imVec(imNum);
9 std::vector<std::vector<cv::KeyPoint>>keypointVec(imNum);
10 std::vector<cv::Mat> descriptorsVec(imNum);
11 cv::SurfFeatureDetector detector( 400 ); cv::SurfDescriptorExtractor extractor;
12 cv::BruteForceMatcher<cv::L2<float> > matcher;
13 std::vector< cv::DMatch > matches;
14 char filename[100];
15 double t1 = omp_get_wtime( );
16 #pragma omp parallel for
17 for (int i=0;i<imNum;i++){
18 sprintf(filename,"rgb%d.jpg",i);
19 imVec[i] = cv::imread( filename, CV_LOAD_IMAGE_GRAYSCALE );
20 detector.detect( imVec[i], keypointVec[i] );
21 extractor.compute( imVec[i],keypointVec[i],descriptorsVec[i]);
22 std::cout<<"find "<<keypointVec[i].size()<<"keypoints in im"<<i<<std::endl;
23 }
24 double t2 = omp_get_wtime( );
25 std::cout<<"time: "<<t2-t1<<std::endl;
26 matcher.match( descriptorsVec[0], descriptorsVec[1], matches );
27 cv::Mat img_matches;
28 cv::drawMatches( imVec[0], keypointVec[0], imVec[1], keypointVec[1], matches, img_matches );
29 cv::namedWindow("Matches",CV_WINDOW_AUTOSIZE);
30 cv::imshow( "Matches", img_matches );
31 cv::waitKey(0);
32 return 1;
33 }
Comparing the two approaches, the times are 2.343 seconds vs. 1.2441 seconds.
In the code above, we used STL vectors to store the two images, the feature points, and the feature descriptors so that the work fits a #pragma omp parallel for loop. In some situations, however, the variables cannot reasonably be organized into vectors. What then? We need yet another OpenMP tool, sections. The code is as follows:
1 #include "opencv2/highgui/highgui.hpp"
2 #include "opencv2/features2d/features2d.hpp"
3 #include <iostream>
4 #include <omp.h>
5 int main( ){
6 cv::SurfFeatureDetector detector( 400 ); cv::SurfDescriptorExtractor extractor;
7 cv::BruteForceMatcher<cv::L2<float> > matcher;
8 std::vector< cv::DMatch > matches;
9 cv::Mat im0,im1;
10 std::vector<cv::KeyPoint> keypoints0,keypoints1;
11 cv::Mat descriptors0, descriptors1;
12 double t1 = omp_get_wtime( );
13 #pragma omp parallel sections
14 {
15 #pragma omp section
16 {
17 std::cout<<"processing im0"<<std::endl;
18 im0 = cv::imread("rgb0.jpg", CV_LOAD_IMAGE_GRAYSCALE );
19 detector.detect( im0, keypoints0);
20 extractor.compute( im0,keypoints0,descriptors0);
21 std::cout<<"find "<<keypoints0.size()<<"keypoints in im0"<<std::endl;
22 }
23 #pragma omp section
24 {
25 std::cout<<"processing im1"<<std::endl;
26 im1 = cv::imread("rgb1.jpg", CV_LOAD_IMAGE_GRAYSCALE );
27 detector.detect( im1, keypoints1);
28 extractor.compute( im1,keypoints1,descriptors1);
29 std::cout<<"find "<<keypoints1.size()<<"keypoints in im1"<<std::endl;
30 }
31 }
32 double t2 = omp_get_wtime( );
33 std::cout<<"time: "<<t2-t1<<std::endl;
34 matcher.match( descriptors0, descriptors1, matches );
35 cv::Mat img_matches;
36 cv::drawMatches( im0, keypoints0, im1, keypoints1, matches, img_matches );
37 cv::namedWindow("Matches",CV_WINDOW_AUTOSIZE);
38 cv::imshow( "Matches", img_matches );
39 cv::waitKey(0);
40 return 1;
41 }
In the code above, we first wrap the content to be executed in parallel in #pragma omp parallel sections; inside it, two #pragma omp section blocks each contain the reading of one image and the extraction of its feature points and descriptors. Simplified to pseudo-code:
#pragma omp parallel sections
{
    #pragma omp section
    {
        function1();
    }
    #pragma omp section
    {
        function2();
    }
}
It means: the sections inside parallel sections are to be executed in parallel, with the work divided so that each thread executes one section; if there are more sections than threads, a thread that finishes its section goes on to execute one of the remaining sections. In running time, this approach is similar to manually packing the work into a vector and using a parallel for loop, but it is undoubtedly more convenient, and on a single-core machine, or with the compiler's OpenMP support disabled, the code compiles without any changes and runs in single-core serial mode.
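To illustrate that division of labor, here is a toy sketch of my own (not from the original article): with four sections and only two threads, each thread picks up another section after finishing its first.

#include <iostream>
#include <omp.h>

int main()
{
    int owner[4]; // owner[k] records which thread ran section k
    omp_set_num_threads(2); // force two threads so the sections outnumber the threads
    #pragma omp parallel sections
    {
        #pragma omp section
        owner[0] = omp_get_thread_num();
        #pragma omp section
        owner[1] = omp_get_thread_num();
        #pragma omp section
        owner[2] = omp_get_thread_num();
        #pragma omp section
        owner[3] = omp_get_thread_num();
    }
    for (int k = 0; k < 4; k++)
        std::cout << "section " << k << " ran on thread " << owner[k] << std::endl;
    return 0;
}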
That is my two days' worth of OpenMP experience; errors are inevitable, and corrections are welcome. One remaining question: the various OpenMP tutorials frequently qualify variables with private, shared, and so on. I understand these qualifiers' meanings and functions, but in all of my examples above, omitting them did not seem to affect the results, and I am not sure why.
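One plausible explanation (my own guess, please verify): the loop index of a parallel for is automatically private, and every scratch variable in the examples above is declared inside the loop body, which also makes it private to each thread; variables declared outside the parallel region default to shared, and that is exactly where the clauses matter. (The filename buffer in the OpenCV parallel-for example is such a case: it must be declared inside the loop, or equivalently qualified with private(filename), otherwise the threads race on it.) A hypothetical sketch:

#include <iostream>
#include <omp.h>

int main()
{
    int temp = 0; // declared outside the loop: shared by default
    // Without private(temp), all threads would race on the one shared temp.
    #pragma omp parallel for private(temp)
    for (int i = 0; i < 10; i++)
    {
        temp = i * i; // with private(temp), each thread writes its own copy
        std::cout << temp << std::endl; // output order may vary between runs
    }
    return 0;
}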
While writing the above, I consulted resources from several places, including the two URLs below, which I will not list exhaustively. My thanks to their authors.
http://blog.csdn.net/drzhouweiming/article/details/4093624
http://software.intel.com/zh-cn/articles/more-work-sharing-with-openmp