I have recently been looking at multi-core programming. Simply put, now that dual-core CPUs are the norm and 4-core and even 8-core CPUs are gradually entering ordinary homes, the traditional single-threaded style of programming cannot exploit the power of a multi-core CPU, and so multi-core programming has arisen. As I understand it, multi-core programming can be thought of as an abstraction over multithreaded programming: it provides a few simple APIs so that users do not have to spend much time learning the low-level details of multithreading, which improves programming efficiency. The multi-core programming tools I focused on over the last two days are OpenMP and TBB. According to current discussion on the Internet, TBB is overtaking OpenMP; for example, OpenCV used to use OpenMP but abandoned it in favor of TBB starting with version 2.3. In my own attempts, however, TBB is somewhat complicated, whereas OpenMP is very easy to get started with. Since my energy and time are limited, I could not spend much time on TBB, so here I share the little OpenMP knowledge gained over these two days, for discussion.
The programming languages supported by OpenMP include C, C++, and Fortran, and the compilers that support it include Sun Studio, the Intel compiler, Microsoft Visual Studio, and GCC. I am using Microsoft Visual Studio 2008 on an Intel i5 quad-core CPU, so let me first describe how to configure OpenMP in Microsoft Visual Studio 2008. It is very simple, just two steps:
(1) Create a new project. Nothing special here.
(2) After creating the project, open the menu bar -> Project -> Properties; in the dialog that pops up, click Configuration Properties -> C/C++ -> Language -> OpenMP Support, and select Yes from the drop-down menu.
That is the entire configuration. Now a small example to show how easy OpenMP is to use. The example has a simple test() function, and main() runs test() eight times in a for loop.
#include <iostream>
#include <time.h>

void test()
{
    int a = 0;
    for (int i = 0; i < 100000000; i++)
        a++;
}

int main()
{
    clock_t t1 = clock();
    for (int i = 0; i < 8; i++)
        test();
    clock_t t2 = clock();
    // clock() returns ticks; divide by CLOCKS_PER_SEC to get seconds
    std::cout << "time: " << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
    return 0;
}
After compiling and running, the printed time is 1.971 seconds. Now let us turn the above code into a multi-core version with one added line.
#include <iostream>
#include <time.h>

void test()
{
    int a = 0;
    for (int i = 0; i < 100000000; i++)
        a++;
}

int main()
{
    clock_t t1 = clock();
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        test();
    clock_t t2 = clock();
    // clock() returns ticks; divide by CLOCKS_PER_SEC to get seconds
    std::cout << "time: " << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
    return 0;
}
After compiling and running, the printed time is 0.546 seconds, almost a quarter of the previous time.
This shows how simple and easy to use OpenMP is. In the code above, we included no extra header file (note: you may add #include <omp.h>), linked no extra library file, and only added one line, #pragma omp parallel for, before the for loop. Moreover, on a single-core machine, or on a machine whose compiler does not have OpenMP Support set to Yes, the code still compiles without error: the compiler simply ignores the #pragma line and compiles and runs the program in the traditional single-core serial way! The only extra step is to copy Vcomp90d.dll and Vcomp90.dll from the directories C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\x86\Microsoft.VC90.OpenMP and C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\Debug_NonRedist\x86\Microsoft.VC90.DebugOpenMP into the project's working directory. (Note: look these up under your own compiler's path. Since this was only a quick test and not a real project, I did not actually perform this copy step.)
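Incidentally, every OpenMP-enabled compiler defines the standard macro _OPENMP, so code can detect at compile time whether the pragmas are active. A minimal sketch of my own (not part of the original example):

#include <iostream>

int main()
{
#ifdef _OPENMP
    // _OPENMP holds the date of the supported spec, e.g. 200203 for OpenMP 2.0
    std::cout << "OpenMP enabled, version macro: " << _OPENMP << std::endl;
#else
    std::cout << "OpenMP not enabled; the pragmas are ignored" << std::endl;
#endif
    return 0;
}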
Here is a simple analysis of the above code, as I understand it.
When the compiler sees #pragma omp parallel for, it automatically divides the iterations of the following for loop into n parts (where n is the number of CPU cores), assigns each part to one core, and executes them in parallel. The following code verifies this analysis.
#include <iostream>

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        std::cout << i << std::endl;
    return 0;
}
You will find that the console prints something like 0 3 4 5 8 9 6 7 1 2. Note: because the iterations run in parallel, the printed order may differ on each run.
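As a side note, the number of threads need not equal the core count; it can be fixed with the num_threads clause. A small sketch of my own showing which thread handles each iteration (output lines may interleave, since all threads write to std::cout):

#include <iostream>
#include <omp.h>

int main()
{
    // request exactly 4 threads for this loop, regardless of core count
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 10; i++)
        std::cout << "iteration " << i << " on thread " << omp_get_thread_num() << std::endl;
    return 0;
}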
Next, let's talk about the race condition, the thorniest problem in all of multithreaded programming. It can be stated as follows: when multiple threads execute in parallel, several of them may read and write the same variable at the same time, leading to unpredictable results. For example, for an array a of 10 integer elements, we use a for loop to compute the sum of its elements and store the result in the variable sum.
#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        sum = sum + a[i];
    std::cout << "sum: " << sum << std::endl;
    return 0;
}
If we comment out #pragma omp parallel for and let the program run in the traditional serial way, then clearly sum = 55. Executed in parallel, however, sum becomes some other value; in one run, for example, sum = 49. The reason is that while thread A executes sum = sum + a[i], another thread B is updating sum at the same time, so A accumulates onto a stale value of sum and the result is wrong.
So how do we implement the parallel array sum with OpenMP? Let us start with a basic solution. The idea is to first create an array sumArray whose length is the number of threads executing in parallel (by default, that number equals the number of CPU cores); in the for loop, each thread updates only its own thread's element of sumArray; finally, the elements of sumArray are added into sum. The code is as follows:
#include <iostream>
#include <omp.h>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    int coreNum = omp_get_num_procs();       // get the number of processors
    int* sumArray = new int[coreNum];        // one slot per processor
    for (int i = 0; i < coreNum; i++)        // initialize the elements to 0
        sumArray[i] = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        int k = omp_get_thread_num();        // get the current thread's id
        sumArray[k] = sumArray[k] + a[i];    // each thread accumulates into its own slot
    }
    for (int i = 0; i < coreNum; i++)
        sum = sum + sumArray[i];
    std::cout << "sum: " << sum << std::endl;
    delete[] sumArray;                       // free the helper array
    return 0;
}
Note that in the code above we use the omp_get_num_procs() function to get the number of processors and the omp_get_thread_num() function to get the id of each thread; to use these two functions we need to #include <omp.h>.
Although the code above achieves the goal, it introduces extra work, such as allocating the sumArray and adding up its elements with a final for loop. Is there a more convenient way? Yes: OpenMP provides another tool, reduction, shown in the following code:
#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10; i++)
        sum = sum + a[i];
    std::cout << "sum: " << sum << std::endl;
    return 0;
}
In the code above, we add reduction(+:sum) to #pragma omp parallel for. This tells the compiler: run the following for loop in parallel, but let each thread keep its own private copy of the variable sum; after the loop ends, add all the threads' copies together and store the result in sum.
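The same pattern works for the other supported operators. For example, here is a sketch of my own computing the product of the elements with reduction(*:prod), where each thread's private copy starts at 1, the identity of multiplication:

#include <iostream>

int main()
{
    long long prod = 1;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    // each thread multiplies into its own copy of prod; the copies
    // are multiplied together after the loop
    #pragma omp parallel for reduction(*:prod)
    for (int i = 0; i < 10; i++)
        prod = prod * a[i];
    std::cout << "prod: " << prod << std::endl; // 10! = 3628800
    return 0;
}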
reduction is convenient, but it only supports a handful of basic operations, such as +, -, *, &, |, &&, and ||. In some situations we need to avoid a race condition, but the operation involved is beyond what reduction can do. What then? That calls for another OpenMP tool, critical. Consider the following example, in which we find the maximum value of array a and store the result in max.
#include <iostream>

int main()
{
    int max = 0;
    int a[10] = {11,2,33,49,113,20,321,250,689,16};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        int temp = a[i];
        #pragma omp critical
        {
            if (temp > max)
                max = temp;
        }
    }
    std::cout << "max: " << max << std::endl;
    return 0;
}
In the example above, the for loop is still automatically divided into n parts that run in parallel, but we wrap if (temp > max) max = temp with #pragma omp critical. This means that although each thread executes the loop body in parallel, when a thread reaches the critical section it must first check that no other thread is currently executing inside it; if one is, it waits until that thread has finished. This avoids the race condition, but clearly execution will be slower because of the possible waiting. (Note: this serves a purpose similar to the __syncthreads() barrier synchronization function in CUDA.)
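When the protected operation is a single simple update such as sum = sum + a[i], OpenMP also offers #pragma omp atomic, which is usually cheaper than critical but, unlike critical, cannot guard an arbitrary block. A sketch of my own, redoing the array sum with atomic:

#include <iostream>

int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        // atomic protects exactly this one read-modify-write
        #pragma omp atomic
        sum += a[i];
    }
    std::cout << "sum: " << sum << std::endl; // 55
    return 0;
}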
With the basics above, I can already do quite a lot. Now let us look at a concrete application: read two images from disk, extract feature points from both, match the features, and finally draw the images together with the matched feature points. Understanding this example requires some basic image-processing knowledge, which I will not cover in detail here. In addition, compiling it requires OpenCV; I am using version 2.3.1, and I will not describe installing and configuring OpenCV here either. First, the traditional serial version:
1 #include "opencv2/highgui/highgui.hpp"
2 #include "opencv2/features2d/features2d.hpp"
3 #include <iostream>
4 #include <omp.h>
5 int main () {
6 Cv::surffeaturedetector Detector (400);
7 Cv::surfdescriptorextractor Extractor;
8 cv::bruteforcematcher<cv::l2<float> > Matcher;
9 std::vector< CV::D match > matches;
Ten Cv::mat im0,im1;
One by one std::vector<cv::keypoint> keypoints0,keypoints1;
Cv::mat Descriptors0, descriptors1;
Double T1 = omp_get_wtime ();
14//Process first Image first
IM0 = Cv::imread ("rgb0.jpg", Cv_load_image_grayscale);
Detector.detect (IM0, KEYPOINTS0);
Extractor.compute (IM0,KEYPOINTS0,DESCRIPTORS0);
std::cout<< "Find" <<keypoints0.size () << "keypoints in Im0" <<std::endl;
19//re-processing the second image
IM1 = Cv::imread ("rgb1.jpg", Cv_load_image_grayscale);
Detector.detect (IM1, keypoints1);
Extractor.compute (IM1,KEYPOINTS1,DESCRIPTORS1);
std::cout<< "Find" <<keypoints1.size () << "keypoints in Im1" <<std::endl;
Double t2 = Omp_get_wtime ();
std::cout<< "Time:" <<t2-t1<<std::endl;
Matcher.match (Descriptors0, descriptors1, matches);
Cv::mat img_matches;
-CV::d rawmatches (im0, Keypoints0, IM1, keypoints1, matches, img_matches);
Cv::namedwindow ("Matches", cv_window_autosize);
Cv::imshow ("Matches", img_matches);
Cv::waitkey (0);
return 1;
33}
Clearly, the parts that read an image and extract its feature points and feature descriptors can run in parallel. Modified as follows:
1 #include "opencv2/highgui/highgui.hpp"
2 #include "opencv2/features2d/features2d.hpp"
3 #include <iostream>
4 #include <vector>
5 #include <omp.h>
6 int main () {
7 int imnum = 2;
8 std::vector<cv::mat> Imvec (imnum);
9 Std::vector<std::vector<cv::keypoint>>keypointvec (Imnum);
Ten std::vector<cv::mat> Descriptorsvec (imnum);
Cv::surffeaturedetector detector (400); Cv::surfdescriptorextractor extractor;
cv::bruteforcematcher<cv::l2<float> > Matcher;
std::vector< CV::D match > matches;
Char filename[100];//picture Path
Double T1 = omp_get_wtime ();
#pragma omp parallel for
+ for (int i=0;i<imnum;i++) {
sprintf (filename, "rgb%d.jpg", I);//Set the full name of the first picture
Imvec[i] = cv::imread (filename, cv_load_image_grayscale);
Detector.detect (Imvec[i], keypointvec[i]);
Extractor.compute (Imvec[i],keypointvec[i],descriptorsvec[i]);
std::cout<< "Find" <<keypointvec[i].size () << "keypoints in IM" <<i<<std::endl;
23}
Double t2 = Omp_get_wtime ();
std::cout<< "Time:" <<t2-t1<<std::endl;
Matcher.match (Descriptorsvec[0], descriptorsvec[1], matches);
Cv::mat img_matches;
-CV::d rawmatches (imvec[0], keypointvec[0], imvec[1], keypointvec[1], matches, img_matches);
Cv::namedwindow ("Matches", cv_window_autosize);
Cv::imshow ("Matches", img_matches);
Cv::waitkey (0);
return 1;
33}
Comparing the two versions, the times are 2.343 seconds vs. 1.2441 seconds.
In the code above, to fit the #pragma omp parallel for pattern, we used STL vectors to hold the two images, their feature points, and their feature descriptors. In some situations, however, the variables may not fit neatly into a vector. What then? That calls for yet another OpenMP tool, sections, used as follows:
1 #include "opencv2/highgui/highgui.hpp"
2 #include "opencv2/features2d/features2d.hpp"
3 #include <iostream>
4 #include <omp.h>
5 int main () {
6 Cv::surffeaturedetector Detector (400); Cv::surfdescriptorextractor extractor;
7 cv::bruteforcematcher<cv::l2<float> > Matcher;
8 std::vector< CV::D match > matches;
9 Cv::mat im0,im1;
Ten std::vector<cv::keypoint> Keypoints0,keypoints1;
One by one cv::mat Descriptors0, descriptors1;
Double T1 = omp_get_wtime ();
#pragma omp parallel sections
14 {
#pragma omp section
16 {
std::cout<< "Processing im0" <<std::endl;
IM0 = Cv::imread ("rgb0.jpg", Cv_load_image_grayscale);
Detector.detect (IM0, KEYPOINTS0);
Extractor.compute (IM0,KEYPOINTS0,DESCRIPTORS0);
std::cout<< "Find" <<keypoints0.size () << "keypoints in Im0" <<std::endl;
22}
#pragma omp section
24 {
std::cout<< "Processing im1" <<std::endl;
IM1 = Cv::imread ("rgb1.jpg", Cv_load_image_grayscale);
Detector.detect (IM1, keypoints1);
Extractor.compute (IM1,KEYPOINTS1,DESCRIPTORS1);
std::cout<< "Find" <<keypoints1.size () << "keypoints in Im1" <<std::endl;
30}
31}
Double t2 = Omp_get_wtime ();
std::cout<< "Time:" <<t2-t1<<std::endl;
Matcher.match (Descriptors0, descriptors1, matches);
Cv::mat img_matches;
Rawmatches CV::d (IM0, Keypoints0, IM1, keypoints1, matches, img_matches);
Panax Notoginseng Cv::namedwindow ("Matches", cv_window_autosize);
Cv::imshow ("Matches", img_matches);
Cv::waitkey (0);
return 1;
41}
In the code above, the content of #pragma omp parallel sections is executed in parallel; inside it, each of the two #pragma omp section blocks reads one image and extracts its feature points and feature descriptors. Simplified to pseudocode, the structure is:
#pragma omp parallel sections
{
    #pragma omp section
    {
        function1();
    }
    #pragma omp section
    {
        function2();
    }
}
The contents of parallel sections are executed in parallel; as for the division of labor, each thread executes one section, and if there are more sections than threads, a thread that finishes its own section goes on to execute one of the remaining ones. In running time, this approach is about the same as manually building a for loop over a vector as above, but it is undoubtedly more convenient, and on a single-core machine, or with a compiler whose OpenMP support is not enabled, it compiles correctly without any changes and runs in single-core serial fashion.
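To see this division of labor, here is a small sketch of my own with four sections but only two threads (forced with num_threads(2)): each thread takes a section, and whichever finishes first picks up one of the remaining ones.

#include <iostream>
#include <omp.h>

int main()
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        { std::cout << "section 1 on thread " << omp_get_thread_num() << std::endl; }
        #pragma omp section
        { std::cout << "section 2 on thread " << omp_get_thread_num() << std::endl; }
        #pragma omp section
        { std::cout << "section 3 on thread " << omp_get_thread_num() << std::endl; }
        #pragma omp section
        { std::cout << "section 4 on thread " << omp_get_thread_num() << std::endl; }
    }
    return 0;
}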
The above is the little experience with OpenMP I gained over these two days; mistakes are inevitable, and corrections are welcome. One remaining doubt: the various OpenMP tutorials often decorate variables with private, shared, and so on. I roughly understand the meaning and role of these modifiers, but in all of my examples above, leaving them out did not seem to affect the results, so I am not sure what the subtleties are.
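My guess at the answer: in the examples above, the loop index of a parallel for is private by default, variables declared inside the loop body (such as temp) are automatically private to each thread, and everything else was either read-only or meant to be shared, so the default rules happened to do the right thing. The clauses matter when a variable declared outside the loop is used as per-thread scratch space. A sketch of my own illustrating private:

#include <iostream>
#include <omp.h>

int main()
{
    int scratch = 0; // declared outside the loop, so shared by default
    // private(scratch) gives every thread its own uninitialized copy,
    // removing the race on the single shared variable
    #pragma omp parallel for private(scratch)
    for (int i = 0; i < 8; i++)
    {
        scratch = i * i; // writes this thread's own copy
        std::cout << scratch << std::endl;
    }
    return 0;
}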
While writing this, I consulted resources from many places; apart from the following two URLs, they are not listed one by one.
http://blog.csdn.net/drzhouweiming/article/details/4093624
http://software.intel.com/zh-cn/articles/more-work-sharing-with-openmp
There were several valuable follow-up questions on this post:
[1] When timing with clock(), the parallel version takes longer than the serial one. I experimented with other timing methods, which work fine; see "Run-time and running-performance analysis of OpenMP on CentOS 6":
http://www.cnblogs.com/diploma/p/openmpclock_gettime.html
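Regarding [1]: this is expected when timing with clock() on Linux, because there clock() measures CPU time summed over all threads, so a parallel loop reports roughly the same (or even more) total CPU time even though the wall-clock time drops; omp_get_wtime() measures wall time. A sketch of my own contrasting the two:

#include <iostream>
#include <time.h>
#include <omp.h>

void work()
{
    int a = 0;
    for (int i = 0; i < 100000000; i++)
        a++;
}

int main()
{
    clock_t c1 = clock();
    double w1 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        work();
    double w2 = omp_get_wtime();
    clock_t c2 = clock();
    // on Linux the cpu figure stays roughly constant as cores are added,
    // while the wall figure shrinks
    std::cout << "cpu : " << (double)(c2 - c1) / CLOCKS_PER_SEC << " s" << std::endl;
    std::cout << "wall: " << w2 - w1 << " s" << std::endl;
    return 0;
}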