As a software developer working with multi-core processors, you face several questions: whether threading will improve performance, whether it is worth the effort, and whether it can be implemented at all.
The Intel compilers with OpenMP* support and the Intel threading tools (Intel Thread Profiler and Intel Thread Checker) help you quickly estimate how a threaded application would perform on two, four, or more processors, and determine which data in the threaded code needs to be protected. All of these evaluations are driven by the simple OpenMP directives (pragmas) supported by the compiler.
With these tools, you can run the code single-threaded and still evaluate how it would behave on a multi-core or multi-processor system, without having to implement real threaded code first. This evaluation method, which combines OpenMP with the Intel Thread Profiler and Intel Thread Checker, is called "thread-count-independent mode". It is a fast and powerful technique for evaluating threading performance and weighing its trade-offs.
In addition, parallel code can be developed on a laptop or another system that has fewer cores than the target system, while still assessing how the code will scale on the multi-core target. This article describes how to use these tools to perform that analysis.
Thread-count-independent mode of the Intel Thread Profiler and Intel Thread Checker
Thread-count-independent mode is triggered when a program is compiled with the /Qopenmp /Qtcheck options and then analyzed with the Intel Thread Checker. For this mode to work, it is essential that the program does not explicitly control (or depend on) the number of threads it runs with. Thread-count-independent mode with the Intel Thread Profiler is slightly more involved: when working on a multi-processor or multi-core system, you must limit the number of OpenMP* threads to 1 for it to operate correctly. The following sections describe this in detail.
With the /Qopenmp /Qtcheck compiler options, the Thread Checker runs the code serially and automatically steps through the OpenMP parallel directives to identify potential data conflicts, as if the program were running in parallel. No actual data race, deadlock, or other parallel problem occurs during this simulated parallel run, yet such conditions are detected and reported just as they would be in a truly parallel run. In general, this approach works for parallel for and other data-decomposition directives and for the parallel sections directive used for functional decomposition, but not for taskq or nested parallelism. For more information about these directives, see the OpenMP documentation that accompanies the Intel compiler. Because the program runs serially anyway, the Intel Thread Checker can be used in thread-count-independent mode without setting the OMP_NUM_THREADS environment variable to 1. In fact, you can verify that thread-count-independent mode has been triggered by observing that the code runs on only a single thread or core of a parallel machine.
When you use the Intel Thread Profiler in thread-count-independent mode with the /Qopenmp_profile option, you can simulate code that uses OpenMP parallel directives. Although the application actually runs serially, it is analyzed as if the code were running in parallel. One important note: you must run the Intel Thread Profiler with a single OpenMP thread, set either in the configuration dialog box or explicitly in the code with the omp_set_num_threads() function. As with the Intel Thread Checker, simple OpenMP constructs are the best guarantee that the code runs in thread-count-independent mode. In particular, thread-count-independent mode does not support taskq or nested parallelism. Calling functions such as omp_set_num_threads(), omp_get_num_threads(), omp_get_max_threads(), omp_get_thread_num(), and callback() may also knock the code being evaluated out of thread-count-independent mode, which again indicates that the code implicitly depends on a specific number of threads. We recommend keeping OpenMP usage simple in this mode, because the mode is intended mainly for evaluating how threading would scale, not for the actual threading. The following two sections describe how to perform this analysis.
Using OpenMP* and the Intel Thread Profiler to evaluate the scalability and performance of threaded applications
First, we use a simple example to show how OpenMP* and the Intel Thread Profiler can be used to evaluate how well an application would scale when threaded. A powerful feature of this method is that the code being evaluated does not have to be threaded yet, or even be thread safe. Before any real threading work begins, you can use this technique to evaluate different threading approaches and models simply, quickly, and effectively. The method measures the parallel and serial portions of the code and then uses Amdahl's law to calculate the potential scalability of the code when run in parallel. Note that the code does not have to run in parallel. You can ensure serial execution by setting the number of threads to 1 in the "Advanced Activity Configuration" dialog box on a multi-processor system, or by evaluating the code on a system with a single logical processor.
To build for the Thread Profiler in thread-count-independent mode, you must use Intel compiler version 8.0 or later. First, place OpenMP parallel directives at the appropriate positions to mark the portions of the code that could potentially run in parallel. Put simply, for data decomposition of basic for loops, use the #pragma omp parallel for directive. For functional decomposition, use the parallel sections directive. For information about these directives and other general documentation on OpenMP programming with the Intel compiler, see the compiler documentation or the other reference articles at the end of this article. Next, build the application with the /fixed:no linker option and the /Qopenmp_profile compiler command-line option.
Now let us turn to a concrete example. The following simple code counts the prime numbers in a given integer range (supplied as input). The example is taken from an earlier article on OpenMP programming by Clay Breshears. It is a simple program used only to illustrate the basic concepts in this article.
 1  #include <math.h>
 2  #include <stdlib.h>
 3  #include <stdio.h>
 4
 5  int main(int argc, char* argv[])
 6  {
 7      int i, j;
 8      int start, end;              /* number search range */
 9      int number_of_primes = 0;    /* number of prime numbers found */
10      int number_of_41primes = 0;  /* number of 4n+1 prime numbers found */
11      int number_of_43primes = 0;  /* number of 4n-1 prime numbers found */
12      int prime, limit;            /* is this number a prime number? */
13      int print_primes = 0;        /* should each prime number be output? */
14
15      start = atoi(argv[1]);
16      end = atoi(argv[2]);
17      if (!(start % 2)) start++;
18
19      if (argc == 4 && atoi(argv[3]) != 0) print_primes = 1;
20      printf("Range to check for Primes:%d - %d\n\n", start, end);
21
22      for (i = start; i <= end; i += 2) {
23          limit = (int) sqrt((float)i) + 1;
24          prime = 1;  /* assume that the number is a prime number */
25          j = 3;
26          while (prime && (j <= limit)) {
27              if (i%j == 0) prime = 0;
28              j += 2;
29          }
30
31          if (prime) {
32              if (print_primes) printf("%5d is prime\n", i);
33              number_of_primes++;
34              if (i%4 == 1) number_of_41primes++;
35              if (i%4 == 3) number_of_43primes++;
36          }
37      }
38
39      printf("\nProgram Done.\n %d primes found\n", number_of_primes);
40      printf("\nNumber of 4n+1 primes found:%d\n", number_of_41primes);
41      printf("\nNumber of 4n-1 primes found:%d\n", number_of_43primes);
42      return 0;
43  }
Because this code contains a for loop, we only need to add a simple OpenMP loop directive to evaluate the potential performance of the code when it is threaded and run in parallel. Specifically, we only need to add the #pragma omp parallel for statement at line 21, immediately before the for loop.
21      #pragma omp parallel for
22      for (i = start; i <= end; i += 2) {
After adding this line and compiling with the settings listed above, we can use the Intel Thread Profiler to evaluate the potential scalability. Create an Intel Thread Profiler project in Intel VTune for the example application: in the configuration wizard, enter 1 and 500000 under Command line arguments, and enter 1 under Number of threads to ensure that the code does not run in parallel. Since we have not yet made this program thread safe, we need to ensure that the code does not actually run in parallel. On a multi-processor system (HT, dual-core, or dual-processor), it is very important to set Number of threads to 1 so that the code runs serially. Note that the scalability graph is limited to twice the Number of threads value, and because we explicitly set Number of threads to 1, the Whole Program Estimated Speedups graph is limited to the estimated scalability for two threads. Run the application under the Intel VTune Thread Profiler and click the Summary tab. The output is similar to the following screen capture.
Note that the panel on the right is titled Whole Program Estimated Speedups. This window shows the potential scalability when the code is run with two threads on a multi-core system. In this example, the green speedup curve shows the scalability with two threads, and it looks close to ideal, which indicates that the threading approach we selected is effective. The scalability target of your application may differ from that of this small example, but the example serves its purpose: evaluating the potential scalability before the threading work actually begins. Using the Thread Profiler in thread-count-independent mode can greatly help in estimating the performance gain of a threaded application.
Using OpenMP* and the Intel Thread Checker to identify threading data errors and trade-offs
Once you have determined that an implementation has good potential scalability, OpenMP* and the Intel Thread Checker can help you fix all potential threading errors before you run any code in parallel with multiple threads. First, remove the /Qopenmp_profile option and replace it with /Qopenmp /Qtcheck, making sure that the /fixed:no linker option is still selected. As before, create a new project with the example code, this time in the Intel VTune Thread Checker, but choose a smaller input data set (for example, 1 to 50000), because the code runs much more slowly than normal when it is instrumented by the Thread Checker. Running the example application this way produces the following results:
[Screen capture: Intel Thread Checker diagnostics for the example application]
The Intel VTune output window lists several issues that need to be addressed. They involve the following variables: limit, prime, j, number_of_primes, number_of_43primes, and number_of_41primes. These problems are easy to fix. Some are resolved by moving the variable declarations inside the for loop. The others, the totals accumulated for reporting at the end of the computation, are simply added to an OpenMP reduction clause. For more information about these changes and the reasons for them, see the original article that contains this code example. The final code listing at the end of this article contains all the changes needed to make the code thread safe. With these changes, the code is thread safe and can be run for testing on a system that supports multi-core, multi-processor, or Hyper-Threading Technology. The real advantage of this method is that it does not require a multi-processor system (or a system with as many processors as the application's target thread count): threading problems can be identified independently of the underlying platform. By simulating parallel execution of the code, the Intel Thread Checker used in this way helps you identify potential data races and other parallel programming problems at run time. The Intel Thread Checker can make threading your applications more effective and efficient. By applying the techniques above repeatedly to explore alternative threading approaches, and by using the Intel tools to identify potential threading-related data errors, your threading work will benefit considerably.
This article discussed how to use OpenMP* and the Intel Thread Profiler in thread-count-independent mode to evaluate the performance of a threaded application and weigh the advantages and disadvantages of threading.
This article also discussed how to use thread-count-independent mode with the Intel Thread Checker to determine which data must be protected when the code is threaded. None of this requires real parallel execution or actual threading work. The tools discussed in this article reduce the effort of evaluating the potential benefits of threading and the data protection that a threaded implementation would need. They will help you take better advantage of the parallelism of current and future Intel platforms.
prime.c: a program that counts all prime numbers within a given input range, with the corrections needed for correct threaded operation
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int i;
    int start, end;              /* number search range */
    int number_of_primes = 0;    /* number of prime numbers found */
    int number_of_41primes = 0;  /* number of 4n+1 prime numbers found */
    int number_of_43primes = 0;  /* number of 4n-1 prime numbers found */
    int print_primes = 0;        /* should each prime number be output? */

    start = atoi(argv[1]);
    end = atoi(argv[2]);
    if (!(start % 2)) start++;

    if (argc == 4 && atoi(argv[3]) != 0) print_primes = 1;
    printf("Range to check for Primes:%d - %d\n\n", start, end);

    /* The three counters are combined across threads by the reduction clause. */
    #pragma omp parallel for schedule(dynamic,100) \
        reduction(+:number_of_primes,number_of_41primes,number_of_43primes)
    for (i = start; i <= end; i += 2) {
        int prime, limit, j;  /* declared inside the loop, so private to each iteration */
        limit = (int) sqrt((float)i) + 1;
        prime = 1;  /* assume that the number is a prime number */
        j = 3;
        while (prime && (j <= limit)) {
            if (i%j == 0) prime = 0;
            j += 2;
        }

        if (prime) {
            if (print_primes) printf("%5d is prime\n", i);
            number_of_primes++;
            if (i%4 == 1) number_of_41primes++;
            if (i%4 == 3) number_of_43primes++;
        }
    }

    printf("\nProgram Done.\n %d primes found\n", number_of_primes);
    printf("\nNumber of 4n+1 primes found:%d\n", number_of_41primes);
    printf("\nNumber of 4n-1 primes found:%d\n", number_of_43primes);
    return 0;
}