http://blog.csdn.net/real_myth/article/details/51556313
From:http://ee.ofweek.com/2014-11/art-11001-2808-28902672.html
Embedded multi-core processors are now widely used in embedded devices, but embedded software is still largely written in the traditional single-core style, so the performance of multi-core processors is not fully exploited. Program parallelization has seen some use on the PC platform but is still rare on embedded platforms. Moreover, embedded multi-core processors differ considerably from PC multi-core processors, so parallel optimization methods developed for the PC cannot be applied directly to embedded platforms. This paper studies parallel optimization from two angles, task parallelism and cache optimization, and explores how to parallelize programs on an embedded multi-core processor.
1 Embedded multi-core processor architecture
Embedded multi-core processors are either homogeneous (symmetric) or heterogeneous (asymmetric). In a homogeneous structure all cores are identical; this structure is widely used in PC multi-core processors. In a heterogeneous structure the cores differ; this structure is common in the embedded field, typically as an embedded processor plus a DSP core. The embedded multi-core processor studied in this paper is homogeneous, so the same code can be executed in parallel on different cores.
Figure 1 ARM SMP processor architecture
The ARM processor is currently the most widely used in the embedded field, so the ARM dual-core processor OMAP4430 is taken as the research object. The ARM symmetric multiprocessing (SMP) structure is shown in Figure 1. Following the principle of program locality, each processor has a private memory (local memory), usually a level-one cache (L1 cache). Because the processors also need to communicate with each other, ARM processors commonly use a shared level-two cache (L2 cache) for this purpose. In a symmetric multiprocessing architecture, all processors (usually a multiple of 2) are identical in hardware structure and have equal access to system resources. More importantly, because all processors can access the same memory space, any process or thread in the shared memory area can run on any processor, which is what makes program parallelization possible.
When performing parallel optimization on an embedded multi-core platform, the following issues need to be considered:
① The performance of a parallelized program is limited by its serial portion, so performance does not keep increasing as parallel threads are added;
② Compared with a PC processor, an embedded multi-core processor has a slower bus and a smaller cache, so large amounts of data are repeatedly copied between memory and cache; parallel optimization should therefore be cache friendly;
③ The number of threads executing in parallel should be less than or equal to the number of physical processors; too many threads compete for processor resources, which degrades parallel performance.
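Issues ① and ③ can be made concrete with a small sketch. The first function uses Amdahl's law, the standard formalization of issue ① (the source does not name it); the second simply caps a requested thread count at the number of physical cores. The function names and parameters are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Upper bound on speedup when a fraction of the run time is serial
 * (Amdahl's law). serial_fraction is in [0, 1]; threads is >= 1. */
double amdahl_speedup(double serial_fraction, int threads)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}

/* Issue 3: do not request more threads than there are physical cores. */
int clamp_threads(int requested, int physical_cores)
{
    assert(requested >= 1 && physical_cores >= 1);
    return requested < physical_cores ? requested : physical_cores;
}
```

With half of the run time serial, two threads can reach a speedup of at most about 1.33, and adding more threads only moves the bound toward 2.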
2 OpenMP parallelization optimization
2.1 Introduction to the working principle of OpenMP
OpenMP is a cross-platform multithreaded parallel programming interface based on the shared-memory model. The main thread creates a series of child threads and maps tasks onto them; the child threads execute in parallel, and the runtime environment assigns them to different physical processors. By default, each thread independently executes the code of the parallel region. Work-sharing constructs can be used to divide a task so that each thread executes only its assigned portion of the code. In this way OpenMP supports both task parallelism and data parallelism.
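The difference between a plain parallel region and a work-sharing construct can be sketched as follows. This is a minimal illustration, not code from the original article; the function names are hypothetical. The first function lets every thread of the team run the region body once, and the second uses `#pragma omp for` so that the loop iterations are divided among the threads.

```c
#include <assert.h>

/* Every thread in the team executes the region body once, so the
 * result equals the team size (1 when compiled without OpenMP). */
int count_region_executions(void)
{
    int executions = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        executions++;
    }
    return executions;
}

/* Work-sharing: the iterations of the loop are split across the team,
 * so each element is written by exactly one thread. */
void scale_by_two(const int *in, int *out, int n)
{
    assert(in != 0 && out != 0 && n >= 0);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            out[i] = 2 * in[i];
        }
    }   /* implicit barrier: the threads join here */
}
```

With GCC the pragmas take effect when the file is compiled with `-fopenmp`; without that flag they are ignored and the code runs serially.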
Figure 2 Task Parallel model
The task parallel model creates a series of independent threads, each running one task, as shown in Figure 2. OpenMP provides the sections directive and the task directive for assigning tasks; each task can run a separate code region independently, and tasks can be nested and recursive. Once a task is created, it may be executed by any idle thread in the thread pool, whose size equals the number of physical threads.
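A minimal sketch of the sections directive, assuming two independent pieces of work on the same read-only data: one section finds the minimum and the other finds the maximum. The function name and the choice of work are hypothetical and only illustrate how separate code regions are assigned to different threads.

```c
#include <assert.h>

/* Two independent code regions, each wrapped in its own section.
 * lo is written only by the first section and hi only by the second,
 * so no synchronization between the sections is needed. */
void min_and_max(const int *data, int n, int *min_out, int *max_out)
{
    assert(data != 0 && n > 0);
    int lo = data[0];
    int hi = data[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                if (data[i] < lo) lo = data[i];
        }
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                if (data[i] > hi) hi = data[i];
        }
    }

    *min_out = lo;
    *max_out = hi;
}
```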
Data parallelism is parallelism at the data level: the data processed by a task is divided and processed in parallel, as shown in Figure 3. In C, the for loop is the construct best suited to data parallelism.
Figure 3 Data parallel model
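As a small illustration of data parallelism on a C for loop, the sketch below sums an array with `#pragma omp parallel for` and a reduction clause. The function name is hypothetical; each thread sums its own share of the elements and OpenMP combines the partial sums at the end of the loop.

```c
#include <assert.h>

/* Data parallelism: the same operation is applied to different
 * parts of the array by different threads. */
long sum_array(const int *data, int n)
{
    assert(data != 0 && n >= 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```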
2.2 Principle of the quicksort algorithm
Quicksort is a recursive divide-and-conquer algorithm whose key step is choosing the pivot element. Elements of the sequence smaller than the pivot are placed to its left, and elements larger than the pivot are placed to its right. When the scan is complete, quicksort is called recursively on the two parts on either side of the pivot.
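The scheme just described can be sketched as a serial quicksort. This is a minimal version that takes the last element of the interval as the pivot (a Lomuto-style partition); the article does not show its own partition code, so the function names and the pivot choice here are assumptions.

```c
#include <assert.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Rearrange data[lo..hi] around the pivot data[hi]: smaller elements
 * end up on its left, the rest on its right. Returns the pivot's
 * final index. */
int partition_simple(int *data, int lo, int hi)
{
    int pivot = data[hi];
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (data[i] < pivot) {
            swap_int(&data[i], &data[store]);
            store++;
        }
    }
    swap_int(&data[store], &data[hi]);
    return store;
}

/* Sort data[lo..hi] (inclusive bounds) by recursing on both sides
 * of the pivot. */
void quicksort_serial(int *data, int lo, int hi)
{
    assert(data != 0);
    if (lo >= hi) {
        return;
    }
    int p = partition_simple(data, lo, hi);
    quicksort_serial(data, lo, p - 1);
    quicksort_serial(data, p + 1, hi);
}
```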
The recursive calls in quicksort generate a large number of mutually independent tasks, which makes the algorithm well suited to OpenMP task parallelism. In addition, the pivot element determines the sizes of the left and right subintervals. Because the cache of an embedded platform is small, pivot selection needs to be optimized so that the left and right subintervals are as balanced as possible, which satisfies the requirement of load balancing.
2.3 Task parallelization optimization
Quicksort is a recursive algorithm that generates a large number of repeated function calls during execution, and these calls are independent of each other. In a single scan, the algorithm first determines the pivot element, rearranges the data sequence around it, and then recursively processes the left and right intervals of the pivot.
To parallelize at the task level, the work of each scan is abstracted into one task, and the tasks are executed in parallel with the OpenMP task directive #pragma omp task, which yields a task-parallel quicksort.
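The article's own listing is not preserved here, so the following is only a sketch of how a task-parallel quicksort along these lines might look. Each recursive call on a subinterval becomes an OpenMP task; the function names, the last-element pivot, and the `num_threads` parameter are assumptions made for illustration.

```c
#include <assert.h>

static void swap_vals(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Partition data[lo..hi] around data[hi]; returns the pivot index. */
static int partition_last(int *data, int lo, int hi)
{
    int pivot = data[hi];
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (data[i] < pivot) {
            swap_vals(&data[i], &data[store]);
            store++;
        }
    }
    swap_vals(&data[store], &data[hi]);
    return store;
}

/* One scan is one task: partition, then spawn a task per side. */
static void quicksort_task_rec(int *data, int lo, int hi)
{
    if (lo >= hi) {
        return;
    }
    int p = partition_last(data, lo, hi);

    #pragma omp task firstprivate(data, lo, p)
    quicksort_task_rec(data, lo, p - 1);

    #pragma omp task firstprivate(data, p, hi)
    quicksort_task_rec(data, p + 1, hi);
}

/* Sort data[0..n-1]. One thread creates the root task; the implicit
 * barrier at the end of the parallel region waits for all tasks. */
void quicksort_parallel(int *data, int n, int num_threads)
{
    assert(data != 0 && n >= 0 && num_threads >= 1);
    (void)num_threads;  /* only used when OpenMP is enabled */

    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp single nowait
        quicksort_task_rec(data, 0, n - 1);
    }
}
```

In practice a cutoff that switches to serial recursion for small intervals is often added, because creating a task for a very short interval costs more than sorting it directly.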
The amount of data in each task depends on the pivot element, so the partitioning algorithm (partition) should divide the data sequence as evenly as possible. This paper tests both a simple partitioning algorithm and the median-of-three method.
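A common form of the median-of-three method looks at the first, middle, and last elements of the interval and uses their median as the pivot. The sketch below is one such variant, not necessarily the one used in the article; the function names are hypothetical.

```c
#include <assert.h>

static void swap_elems(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Order data[lo], data[mid], data[hi] so that
 * data[lo] <= data[mid] <= data[hi]; returns mid, the index that
 * now holds the median of the three. */
int median_of_three(int *data, int lo, int hi)
{
    assert(data != 0 && lo <= hi);
    int mid = lo + (hi - lo) / 2;
    if (data[mid] < data[lo]) swap_elems(&data[mid], &data[lo]);
    if (data[hi] < data[lo])  swap_elems(&data[hi], &data[lo]);
    if (data[hi] < data[mid]) swap_elems(&data[hi], &data[mid]);
    return mid;
}

/* Partition data[lo..hi] with the median-of-three value as pivot.
 * Returns the pivot's final index. */
int partition_median3(int *data, int lo, int hi)
{
    int m = median_of_three(data, lo, hi);
    swap_elems(&data[m], &data[hi]);      /* park the pivot at the end */
    int pivot = data[hi];
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (data[i] < pivot) {
            swap_elems(&data[i], &data[store]);
            store++;
        }
    }
    swap_elems(&data[store], &data[hi]);
    return store;
}
```

On already sorted or reverse-sorted input, a last-element pivot produces maximally unbalanced intervals, while the median of three picks the middle value and splits the interval evenly.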
2.4 Cache Optimization
The goal of being cache friendly is to reduce the amount of data copied between memory and cache. For 2^20 integers the data size is 4 MB, while the OMAP4430 test platform used in this paper has a level-two cache of 1 MB, so the data needs to be divided into 4 parts.
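The arithmetic behind the 4 parts can be written as a small helper. It assumes 4-byte integers (which the 4 MB figure implies) and takes 1 MB as 2^20 bytes; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of parts needed so that each part fits in the cache:
 * total data size divided by cache size, rounded up. */
size_t chunks_for_cache(size_t element_count, size_t element_size,
                        size_t cache_bytes)
{
    assert(cache_bytes > 0);
    size_t total_bytes = element_count * element_size;
    return (total_bytes + cache_bytes - 1) / cache_bytes;
}
```

For 2^20 elements of 4 bytes and a 2^20-byte cache this gives (4 * 2^20) / 2^20 = 4 parts.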
The algorithm assigns the 4 parts of the data to 4 quicksort tasks that execute in parallel. When each part has been sorted, the 4 parts still have to be merged into one complete sorted sequence, so a merge step must follow the parallel tasks.
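The original listing is missing, so the following is only a sketch of the idea under stated assumptions: the data is cut into equal-sized chunks, each chunk is sorted in parallel, and the sorted chunks are then merged one after another. The standard library `qsort` stands in for the article's quicksort task, and the function names, the `chunks` parameter, and the sequential merge order are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge the sorted runs data[0..mid) and data[mid..end) via tmp. */
static void merge_runs(int *data, size_t mid, size_t end, int *tmp)
{
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < end) {
        tmp[k++] = (data[j] < data[i]) ? data[j++] : data[i++];
    }
    while (i < mid) tmp[k++] = data[i++];
    while (j < end) tmp[k++] = data[j++];
    memcpy(data, tmp, end * sizeof(int));
}

/* Sort data[0..n) as `chunks` cache-sized parts, then merge them.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sort_in_chunks(int *data, size_t n, int chunks)
{
    assert(data != NULL && chunks >= 1);
    int *tmp = malloc((n ? n : 1) * sizeof(int));
    if (tmp == NULL) {
        return -1;
    }
    size_t chunk_len = (n + (size_t)chunks - 1) / (size_t)chunks;

    /* Parallel phase: one chunk per loop iteration. */
    #pragma omp parallel for schedule(static, 1)
    for (int c = 0; c < chunks; c++) {
        size_t start = (size_t)c * chunk_len;
        if (start < n) {
            size_t len = (n - start < chunk_len) ? n - start : chunk_len;
            qsort(data + start, len, sizeof(int), compare_ints);
        }
    }

    /* Serial phase: fold each sorted chunk into the sorted prefix. */
    size_t end = chunk_len;
    while (end < n) {
        size_t next = (n - end < chunk_len) ? n : end + chunk_len;
        merge_runs(data, end, next, tmp);
        end = next;
    }

    free(tmp);
    return 0;
}
```

The merge phase runs after the parallel phase and is serial in this sketch, which is the extra cost the text attributes to the cache-optimized version.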
3 Analysis of parallel performance
3.1 Experimental environment
This paper uses the OMAP4430 embedded development platform from Texas Instruments. The OMAP4430 is an embedded multi-core processor with a symmetric multiprocessing dual-core ARM processor (dual-core ARM Cortex-A), a level-one cache, and a 1 MB level-two cache. The embedded operating system is Ubuntu 12.04, the compiler is arm-linux-gnueabihf-gcc, and GNU gprof is used to measure algorithm execution time.
3.2 Performance Test
Parallel performance is evaluated by speedup, the ratio of serial execution time to parallel execution time: the higher the speedup, the better the parallel performance of the algorithm, and a value of 1 means no improvement over the serial version. The performance test uses 4 versions of the algorithm, namely the serial version, the 2-thread parallel version, the 4-thread parallel version, and the cache-optimized version, to analyze performance from different angles.
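The article's formula is not preserved in this copy. The usual definition of speedup, which is assumed here, can be written as follows, where the symbols $S$, $T_{\text{serial}}$ and $T_{\text{parallel}}$ are notation introduced for this note:

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}
```

Under this definition, a speedup of 1.30 means the parallel version needs about $1/1.30 \approx 0.77$ of the serial execution time.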
As the line chart in Figure 4 shows, all 3 parallel versions perform clearly better than the serial version; as shown in table 1, their speedups are 1.30, 1.29 and 1.21. The task-parallel scheme was tested with 2 threads and with 4 threads, and the speedups show that the 2-thread version is slightly better than the 4-thread version. In theory, more parallel threads should give better performance. However, the OMAP4430 used in this paper has only two symmetric multiprocessing cores, so even when the algorithm creates 4 parallel threads, only 2 can actually execute at a time, and the 4 threads compete for the 2 physical processors, which lowers performance relative to the 2-thread version.
Figure 4 Algorithm Execution time
Evaluating a parallel algorithm also requires considering its load balance. As shown in table 1 and table 2, the standard deviation of the cache optimization scheme is much smaller than that of the task parallelization scheme. The reason is that, for task parallelization, different test data and different partitioning algorithms (partition) strongly affect how the intervals are divided, so task execution time varies greatly. Cache optimization is in essence data parallelism: the tasks are divided according to the cache size, so each task processes roughly the same amount of data and its execution time is more predictable. However, the data has to be merged after the parallel tasks finish, which costs some performance.
Conclusion
Based on an analysis of the hardware structure of the embedded multi-core processor, this paper parallelizes the serial quicksort algorithm from the perspective of symmetric multiprocessing and obtains good results.
With the ARM dual-core processor (OMAP4430) as the test platform, parallel optimization is implemented through task parallelism and cache optimization. The performance tests show that task parallelism gives a good speedup but poor load balance; the number of parallel threads should not exceed the number of physical processor cores, because too many parallel threads compete for processor resources and degrade performance. Cache optimization gives good load balance, but it requires a subsequent merge operation, which reduces performance.
In short, parallel optimization on an embedded multi-core processor should on the one hand fully exploit the parallel performance of the processor and increase the parallelism of the program; on the other hand, it should consider the load balance of the algorithm, so that program performance remains consistent across different application environments.