OpenMP Parallel Programming: For Loop Parallelization


When reprinting, please credit the original source: http://blog.csdn.net/zhongkejingwang/article/details/40018735

Using OpenMP to optimize C/C++ code is convenient and simple. The part of a program worth parallelizing is usually a time-consuming for loop, so this article focuses on for loop parallelization with OpenMP. In my experience the material covered here is enough for everyday use; if you want to go deeper into OpenMP, plenty of information is available online.

To do a good job, one must first sharpen one's tools. If you have not set up an OpenMP development environment yet, see OpenMP Parallel Programming: Setting Up the Eclipse Development Environment.

First, how do we make a piece of code run in parallel? In OpenMP, the parallel directive marks a parallel region in the code, in this form:

#pragma omp parallel
{
    // each thread executes the code inside the braces
}

For example, the following code:

#include <iostream>
#include "omp.h"
using namespace std;

int main(int argc, char **argv) {
    // Set the number of threads. In general the thread count should not exceed
    // the number of CPU cores; here four threads execute the parallel region.
    omp_set_num_threads(4);
#pragma omp parallel
    {
        cout << "Hello" << ", I am Thread " << omp_get_thread_num() << endl;
    }
}
omp_get_thread_num() is used to obtain the ID number of the current thread.

The code execution result is:

Hello, I am Thread 1
Hello, I am Thread 0
Hello, I am Thread 2
Hello, I am Thread 3
We can see that all four threads execute the code in the braces, and the order is nondeterministic. This is a parallel region.
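Besides omp_get_thread_num(), the runtime also provides omp_get_num_threads(), which returns the size of the current thread team. A minimal sketch (not from the original article) combining the two:

#include <stdio.h>
#include "omp.h"

int main() {
    omp_set_num_threads(4);
#pragma omp parallel
    {
        // omp_get_num_threads() returns the team size, omp_get_thread_num() the caller's ID
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
}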


The for directive:

The for directive splits the iterations of a for loop among the threads for execution. The loop iterations must be free of data dependencies.
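As an illustrative sketch (not from the original article), the first loop below is safe to hand to the for directive because each iteration touches only its own element, while the second is not, because iteration i reads the value written by iteration i - 1:

int a[6] = {1, 1, 1, 1, 1, 1};

// independent iterations: safe to parallelize
#pragma omp parallel for
for (int i = 0; i < 6; i++)
    a[i] *= 2;

// dependent iterations: must NOT be handed to the for directive
for (int i = 1; i < 6; i++)
    a[i] += a[i - 1];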

Usage:

(1) #pragma omp parallel for
    for (...)

(2) #pragma omp parallel
    {   // note: the opening brace must start on a new line
        #pragma omp for
        for (...)
    }

Note: in the second form, the parallel directive must not appear again inside the parallel region. For example, it cannot be written as:

#pragma omp parallel
{
    #pragma omp parallel for
    for (...)
}

The first form applies only to the for loop that immediately follows it, while the second form can contain several for directives within one parallel region. The following example program walks through the points to watch out for when parallelizing for loops.


First, suppose we do not use the for directive and simply place the parallel directive before the for loop (printf is used instead of cout to keep the output from interleaving):

#include <iostream>
#include <stdio.h>
#include "omp.h"
using namespace std;

int main(int argc, char **argv) {
    // Set the number of threads; here four threads execute the parallel region.
    omp_set_num_threads(4);
#pragma omp parallel
    for (int i = 0; i < 2; i++)
        // cout << "i = " << i << ", I am Thread " << omp_get_thread_num() << endl;
        printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
}

Output result:

i = 0, I am Thread 0
i = 0, I am Thread 1
i = 1, I am Thread 0
i = 1, I am Thread 1
i = 0, I am Thread 2
i = 1, I am Thread 2
i = 0, I am Thread 3
i = 1, I am Thread 3

The output shows that without the for directive, every thread executes the entire loop. The for directive is what splits the loop and distributes the iterations among the threads as evenly as possible. Change the parallel code to the following:

#pragma omp parallel for
for (int i = 0; i < 6; i++)
    printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
Output result:

i = 4, I am Thread 2
i = 2, I am Thread 1
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 3, I am Thread 1
i = 5, I am Thread 3
We can see that Thread 0 executes i = 0 and 1, Thread 1 executes i = 2 and 3, Thread 2 executes i = 4, and Thread 3 executes i = 5. Thread 0 is the master thread.

In this way the whole for loop is split up and executed in parallel. In the code above, parallel and for are combined into one directive, which applies only to the for loop that follows; when the loop ends, the parallel region ends as well.

The above code can be changed to the following:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 6; i++)
        printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
}
This form has exactly the same effect as the one above. Now, what happens if parallel appears again inside the parallel region? The best way to answer that is to run the code, so change it to the following:

#pragma omp parallel
{
    #pragma omp parallel for
    for (int i = 0; i < 6; i++)
        printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
}
Output result:

i = 0, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 5, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 0, I am Thread 0
i = 2, I am Thread 0
i = 1, I am Thread 0
i = 3, I am Thread 0
i = 2, I am Thread 0
i = 4, I am Thread 0
i = 3, I am Thread 0
i = 5, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
Every output line reports Thread 0, and the whole loop is printed four times: each of the four threads runs the nested parallel for on its own (seeing itself as thread 0 of its own team) and executes the entire loop. So this is not what we want.


The two forms of the for directive do differ, though. Suppose, for example, that some code between two for loops must be executed by only one thread. With the first form you simply write:

#pragma omp parallel for
for (int i = 0; i < 6; i++)
    printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
// code between the two loops; only the master thread (Thread 0) executes it
printf("I am Thread %d\n", omp_get_thread_num());
#pragma omp parallel for
for (int i = 0; i < 6; i++)
    printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
When the first parallel for ends, only the master thread remains, so the code between the two loops is executed by Thread 0. The output is:

i = 0, I am Thread 0
i = 2, I am Thread 1
i = 1, I am Thread 0
i = 3, I am Thread 1
i = 4, I am Thread 2
i = 5, I am Thread 3
I am Thread 0
i = 4, I am Thread 2
i = 2, I am Thread 1
i = 5, I am Thread 3
i = 0, I am Thread 0
i = 3, I am Thread 1
i = 1, I am Thread 0
But if you use the second form and put the for loops inside a parallel region, you need to be careful!

Every line of code inside a region marked with parallel is executed by all threads, so if you want the code between the two for loops to run on just one thread, you must mark it with a master or single directive. Code under master is executed by the master thread; single picks some thread to execute it, and which one is chosen is unspecified. The code above can then be written as follows:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 6; i++)
        printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
    #pragma omp master
    {
        // this code is executed by the master thread only
        printf("I am Thread %d\n", omp_get_thread_num());
    }
    #pragma omp for
    for (int i = 0; i < 6; i++)
        printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
}
The effect is the same as before. If it does not have to be the master thread that runs this code, replace master with single.
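For reference, a minimal sketch of just that middle section using single (unlike master, a single block is followed by an implicit barrier, so the other threads wait for it):

#pragma omp single
{
    // executed by exactly one, unspecified, thread
    printf("I am Thread %d\n", omp_get_thread_num());
}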

With that, the usage of parallel and for should be clear. Next we turn to data synchronization during parallel processing, a classic problem in multi-threaded programming.


To illustrate the data synchronization problem, let us start with an example:

#include <iostream>
#include "omp.h"
using namespace std;

int main(int argc, char **argv) {
    int n = 100000;
    int sum = 0;
    omp_set_num_threads(4);
#pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            sum += 1;
        }
    }
    cout << " sum = " << sum << endl;
}
The expected result is 100000, but the program does not produce it. Look at the code: by default the variable sum is shared by all threads, so when several threads update sum at the same time the result is wrong because the updates are not synchronized. The output differs from run to run and is unpredictable, for example:

First run:  sum = 58544
Second run: sum = 77015
Third run:  sum = 78423


So how do we solve this data synchronization problem? Here are a few ways:

Method 1: protect the code that updates the shared variable with a synchronization directive.

The code is modified as follows:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        sum += 1;
    }
}
cout << " sum = " << sum << endl;
The critical directive makes the next statement, or a block enclosed in braces, a critical section that only one thread may execute at a time. The program now prints 100000.
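A quick sketch of the braced form protecting more than one statement (the variable count is only for illustration and is not in the original code):

int count = 0;  // illustrative second shared variable

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    #pragma omp critical
    {
        // the whole block is entered by at most one thread at a time
        sum += 1;
        count += 1;
    }
}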

Method 2: give each thread its own copy of sum and add the per-thread copies together when leaving the parallel region.

The parallel code is modified as follows:

#pragma omp parallel
{
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 1;
    }
}
The reduction(+:sum) clause gives each thread a private copy of sum and adds the copies back into the shared sum when the loop finishes, so the output is 100000.
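The same clause can be attached to the combined directive as well; an equivalent sketch:

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
    sum += 1;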

Method 3: This method looks less elegant

The code is modified as follows:

int n = 100000;
int sum[4] = { 0 };
omp_set_num_threads(4);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++) {
        sum[omp_get_thread_num()] += 1;
    }
}
cout << " sum = " << sum[0] + sum[1] + sum[2] + sum[3] << endl;
Each thread updates only the array element indexed by its own thread ID, so the result is correct. (Adjacent array elements usually share a cache line, so this approach can suffer from false sharing; reduction is generally the better choice.)
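If the thread count is not hard-coded, the per-thread slots can be sized at run time. A sketch using std::vector and omp_get_max_threads(), neither of which is in the original code:

#include <vector>

int n = 100000;
std::vector<int> partial(omp_get_max_threads(), 0);  // one slot per possible thread
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        partial[omp_get_thread_num()] += 1;
}
int total = 0;
for (size_t t = 0; t < partial.size(); t++)
    total += partial[t];
cout << " sum = " << total << endl;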

That covers data synchronization. In the code above, the loop iterations are split into one contiguous block per thread. What if you want to control the size of the chunks handed to each thread? That is what the schedule clause is for. The following code demonstrates its usage:

#include <iostream>
#include "omp.h"
#include <stdio.h>
using namespace std;

int main(int argc, char **argv) {
    int n = 12;
    omp_set_num_threads(4);
#pragma omp parallel
    {
        #pragma omp for schedule(static, 3)
        for (int i = 0; i < n; i++) {
            printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
        }
    }
}
With schedule(static, 3), the loop iterations are divided into chunks of 3 and the chunks are dealt out to the threads in order.

The output result is as follows:

i = 6, I am Thread 2
i = 3, I am Thread 1
i = 7, I am Thread 2
i = 4, I am Thread 1
i = 8, I am Thread 2
i = 5, I am Thread 1
i = 0, I am Thread 0
i = 9, I am Thread 3
i = 1, I am Thread 0
i = 10, I am Thread 3
i = 2, I am Thread 0
i = 11, I am Thread 3
The output shows that Thread 0 executes i = 0, 1, 2; Thread 1 executes i = 3, 4, 5; Thread 2 executes i = 6, 7, 8; and Thread 3 executes i = 9, 10, 11. If there were more chunks than threads, assignment would wrap around and continue from Thread 0.
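For comparison, a schedule(dynamic, 3) clause hands out chunks of 3 to whichever thread asks next rather than in a fixed order; a sketch (which chunks land on which thread then varies from run to run):

#pragma omp parallel for schedule(dynamic, 3)
for (int i = 0; i < n; i++)
    printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());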


That is essentially all there is to for loop parallelization. One more useful directive is barrier: it sets up a checkpoint inside the parallel region, and no thread may pass it until every thread has reached it. It is typically used when the work before and after it in the parallel region has a dependency.
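A minimal sketch of barrier (an illustrative example, not from the original article): each thread fills its own slot in phase 1, and only after the barrier may it safely read a slot written by another thread:

#include <stdio.h>
#include "omp.h"

int main() {
    int data[4] = { 0 };
    omp_set_num_threads(4);
#pragma omp parallel
    {
        int tid = omp_get_thread_num();

        // phase 1: each thread writes only its own slot
        data[tid] = tid * 10;

        // no thread continues until all threads have finished phase 1
        #pragma omp barrier

        // phase 2: reading a neighbour's slot is now safe
        printf("Thread %d sees data[%d] = %d\n", tid, (tid + 1) % 4, data[(tid + 1) % 4]);
    }
}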

Is it easy?