Two tips for OpenMP Programming


1. Dynamically setting the number of threads in a parallel loop

In practice, a program may run in different machine environments: some machines are dual-core, others have four or more cores, and future hardware upgrades will keep increasing the core count. It is therefore important to set an appropriate number of threads automatically, based on the hardware the program actually runs on; otherwise the program has to be modified after every hardware upgrade. For example, if software developed on a dual-core system sets its thread count to 2 by default, that number no longer fits once the machine is upgraded to 4 or 8 cores, unless the program is changed. Besides scaling with hardware upgrades, the thread count must also remain appropriate when the computational workload of the program grows or shrinks. Clearly, a statically fixed number of threads cannot satisfy these requirements.

When calculating the number of threads required, consider the following two points:

1) When the number of loop iterations is small, too many threads may make the total running time longer than using a few threads or a single thread, and it also increases energy consumption.

2) If the number of threads exceeds the number of CPU cores, the overhead of task switching and scheduling reduces overall efficiency.

So how can the number of threads be set dynamically according to the iteration count and the number of CPU cores? The following example illustrates one such algorithm. Assume the requirements for dynamically setting the number of threads are:

1. Each thread must execute no fewer than 4 loop iterations.

2. The total number of threads must not exceed 2 times the number of CPU cores.

The following code meets these requirements:

    const int MIN_ITERATOR_NUM = 4;
    int ncore = omp_get_num_procs();                    // get the number of execution cores
    int max_tn = n / MIN_ITERATOR_NUM;
    int tn = max_tn > 2 * ncore ? 2 * ncore : max_tn;   // tn is the number of threads to use
    #pragma omp parallel for if(tn > 1) num_threads(tn)
    for (i = 0; i < n; i++)
    {
        printf("Thread Id = %d\n", omp_get_thread_num());
        // Do some work here
    }

In the code above, the maximum number of threads max_tn is computed so that each thread runs no fewer than 4 iterations, and the actual thread count tn is the smaller of max_tn and 2 times the number of CPU cores. The if clause in the parallel for construct then checks whether tn is greater than 1: if it is, the loop runs with tn threads; otherwise the loop runs serially in a single thread. In this way the thread count always meets the requirements. For example, on a dual-core CPU with n = 64, the loop runs with 2 times the number of CPU cores (4 threads) instead of max_tn = 64/4 = 16 threads.
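To see the calculation in context, the following is a minimal, self-contained sketch of how it might be compiled and run (for example with gcc -fopenmp). The array name data, the loop body, and the fixed value of N are illustrative assumptions, not part of the original example.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int MIN_ITERATOR_NUM = 4;                /* minimum iterations per thread */
        enum { N = 64 };
        int data[N];                                   /* placeholder output array */
        int i;

        int ncore = omp_get_num_procs();               /* number of execution cores */
        int max_tn = N / MIN_ITERATOR_NUM;             /* so each thread gets >= 4 iterations */
        int tn = max_tn > 2 * ncore ? 2 * ncore : max_tn;

        /* run in parallel only when more than one thread is worthwhile */
        #pragma omp parallel for if(tn > 1) num_threads(tn)
        for (i = 0; i < N; i++)
        {
            data[i] = i * i;                           /* stand-in for real work */
        }

        printf("data[%d] = %d, cores = %d, threads requested = %d\n",
               N - 1, data[N - 1], ncore, tn > 1 ? tn : 1);
        return 0;
    }

On a dual-core machine this should report 2 cores and 4 requested threads, matching the calculation described above.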
In actual situations, of course, you would not write these few lines of thread-count computation before every loop; the calculation can be wrapped in a separate function:

    // file-scope initialization with a function call requires C++; in C, assign this in main()
    const int g_ncore = omp_get_num_procs();   // get the number of execution cores

    /** Calculate the number of threads needed for a parallel loop.
        The result is based on the number of loop iterations, the number of CPU cores,
        and the minimum number of iterations required per thread; the calculated value
        never exceeds the number of CPU cores.
        @param int n     - number of loop iterations
        @param int min_n - minimum number of iterations per thread
        @return int      - number of threads
    */
    int dtn(int n, int min_n)
    {
        int max_tn = n / min_n;
        int tn = max_tn > g_ncore ? g_ncore : max_tn;   // tn is the number of threads to use
        if (tn < 1)
        {
            tn = 1;
        }
        return tn;
    }

With the function dtn(), an appropriate thread count can be obtained for each parallel loop, and the preceding code can be abbreviated as follows:

    #pragma omp parallel for num_threads(dtn(n, MIN_ITERATOR_NUM))
    for (i = 0; i < n; i++)
    {
        printf("Thread Id = %d\n", omp_get_thread_num());
        // Do some work here
    }

Of course, the specific number of threads to use depends on the situation. In general, a thread count equal to the number of CPU cores gives good performance: each core executes one task and there is no task-switching overhead.
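As a quick sanity check of the helper, this hypothetical driver prints the thread count chosen for a few iteration counts; it assumes dtn() and g_ncore from the listing above are in the same source file, and the expected values in the comment assume a 4-core machine.

    #include <stdio.h>

    /* Expected output on a hypothetical 4-core machine (g_ncore == 4):
       n =    2 -> dtn = 1   (2/4 = 0, clamped up to 1: the loop stays serial)
       n =    6 -> dtn = 1   (6/4 = 1 thread)
       n =   64 -> dtn = 4   (64/4 = 16, capped at the core count)
       n = 1000 -> dtn = 4   (250, capped at the core count)                  */
    int main(void)
    {
        const int MIN_ITERATOR_NUM = 4;
        const int sizes[] = { 2, 6, 64, 1000 };
        int s;
        for (s = 0; s < 4; s++)
        {
            printf("n = %4d -> dtn = %d\n", sizes[s], dtn(sizes[s], MIN_ITERATOR_NUM));
        }
        return 0;
    }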
2. Parallelization of nested loops

In a nested loop, if the number of iterations of the outer loop is small, then once the number of CPU cores grows large enough, the number of threads created from the outer loop may be smaller than the number of CPU cores. In addition, if the work in the inner loop is unbalanced, scheduling only the outer loop makes it difficult to achieve load balance. The following uses matrix multiplication as an example to show how a nested loop can be parallelized to meet these scalability and load-balancing requirements. The serial matrix multiplication function is:

    /** Serial matrix multiplication
        @param int *a     - pointer to the first matrix to be multiplied
        @param int row_a  - number of rows in matrix a
        @param int col_a  - number of columns in matrix a
        @param int *b     - pointer to the second matrix to be multiplied
        @param int row_b  - number of rows in matrix b
        @param int col_b  - number of columns in matrix b
        @param int *c     - pointer to the result matrix
        @param int c_size - size of matrix c (total number of elements)
        @return void      - none
    */
    void Matrix_Multiply(int *a, int row_a, int col_a,
                         int *b, int row_b, int col_b,
                         int *c, int c_size)
    {
        if (col_a != row_b || c_size < row_a * col_b)
        {
            return;
        }
        int i, j, k;
        //#pragma omp for private(i, j, k)
        for (i = 0; i < row_a; i++)
        {
            int row_i = i * col_a;
            int row_c = i * col_b;
            for (j = 0; j < col_b; j++)
            {
                c[row_c + j] = 0;
                for (k = 0; k < row_b; k++)
                {
                    c[row_c + j] += a[row_i + k] * b[k * col_b + j];
                }
            }
        }
    }

Adding an OpenMP for directive before the outer loop turns this into a parallel matrix multiplication function, but simply parallelizing the outer loop obviously cannot meet the scalability requirements described above. In fact, a simple technique solves this: merge the outermost loop and the second-level loop into a single loop. The parallel implementation after merging the loops is:

    void Parallel_Matrix_Multiply(int *a, int row_a, int col_a,
                                  int *b, int row_b, int col_b,
                                  int *c, int c_size)
    {
        if (col_a != row_b)
        {
            return;
        }
        int i, j, k;
        int index;
        int border = row_a * col_b;
        i = 0;
        j = 0;
    #pragma omp parallel for private(i, j, k) num_threads(dtn(border, 1))
        for (index = 0; index < border; index++)
        {
            i = index / col_b;
            j = index % col_b;
            int row_i = i * col_a;
            int row_c = i * col_b;
            c[row_c + j] = 0;
            for (k = 0; k < row_b; k++)
            {
                c[row_c + j] += a[row_i + k] * b[k * col_b + j];
            }
        }
    }

As shown above, the merged loop boundary border = row_a * col_b, that is, the product of the two original loop boundaries. The iteration variables i and j of the original outer and second-level loops are then recovered inside the loop by division and remainder. Note that the way i and j are computed must preserve the independence of the loop iterations, that is, there must be no dependence between iterations. The computation of i and j cannot be optimized into the following form:

    if (j == col_b)
    {
        j = 0;
        i++;
    }
    // ... the actual matrix multiplication code ...
    j++;

This optimization saves the division and is therefore more efficient, but it can only be used in serial code: because it carries a dependence from one loop iteration to the next, it cannot be correctly parallelized.
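As a final usage sketch, the following driver multiplies two small matrices with both versions and checks that the results agree. The matrix sizes and test data are arbitrary choices, and it assumes Matrix_Multiply(), Parallel_Matrix_Multiply(), dtn(), and g_ncore from the listings above are in the same source file.

    #include <stdio.h>

    int main(void)
    {
        enum { ROW_A = 3, COL_A = 4, ROW_B = 4, COL_B = 5 };   /* col_a must equal row_b */
        int a[ROW_A * COL_A];
        int b[ROW_B * COL_B];
        int c_serial[ROW_A * COL_B];
        int c_parallel[ROW_A * COL_B];
        int i, ok = 1;

        for (i = 0; i < ROW_A * COL_A; i++) a[i] = i + 1;      /* arbitrary test data */
        for (i = 0; i < ROW_B * COL_B; i++) b[i] = i % 7;

        Matrix_Multiply(a, ROW_A, COL_A, b, ROW_B, COL_B, c_serial, ROW_A * COL_B);
        Parallel_Matrix_Multiply(a, ROW_A, COL_A, b, ROW_B, COL_B, c_parallel, ROW_A * COL_B);

        for (i = 0; i < ROW_A * COL_B; i++)
        {
            if (c_serial[i] != c_parallel[i])
            {
                ok = 0;
            }
        }
        printf("results %s\n", ok ? "match" : "differ");
        return 0;
    }

Since each value of index writes exactly one element of c, the merged-loop version should produce the same result array as the serial version.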