Parallel Optimization Methods for Embedded ARM Multi-Core Processors

Embedded multi-core processors are now widely used in embedded devices, but embedded system software development largely remains in the traditional single-core mode and does not exploit the full performance of multi-core processors. Program parallelization is already common on PC platforms but still rare on embedded platforms; moreover, embedded multi-core processors differ substantially from PC multi-core processors, so the parallel optimization methods of the PC platform cannot be applied directly to embedded platforms. This paper studies parallel optimization from two aspects, task parallelism and cache optimization, and explores methods for parallelizing programs on embedded multi-core processors.

1 Embedded Multi-Core Processor Architecture

Embedded multi-core processors come in homogeneous (symmetric) and heterogeneous (asymmetric) architectures. In a homogeneous processor all cores have the same structure; this design is widely used in PC multi-core processors. In a heterogeneous processor the cores differ; this design is common in the embedded domain, a typical example being a general-purpose embedded core combined with a DSP core. In this paper the embedded multi-core processor uses a homogeneous architecture, allowing the same code to execute in parallel on different cores.

Figure 1 ARM SMP processor architecture

ARM processors are the most widely used in the embedded field, so the dual-core ARM processor OMAP4430 is taken as the research object. The ARM symmetric multi-processing (Symmetric Multi-Processing, SMP) architecture is shown in Figure 1. Following the principle of locality, each processor has private local memory, most commonly a first-level cache (L1 cache). Communication among the processors then becomes a problem, which common ARM processors solve with a shared second-level cache (L2 cache). In this symmetric multi-core architecture, all processors (usually a multiple of 2) are identical in hardware structure and equal in their use of system resources. More importantly, since all processors can access the same memory space, any process or thread can run on any processor in the shared-memory region, which makes program parallelization possible. The following issues need to be considered when parallelizing on an embedded multi-core platform:

① The performance of a parallelized program is bounded by its serial portion (Amdahl's law; see the formula after this list), so performance does not scale linearly with the number of parallel threads;

② An embedded multi-core processor has a slower bus and smaller caches than a PC processor, which causes large amounts of data to be copied between memory and cache, so cache friendliness (cache friendly) should be considered during parallelization optimization;

③ The number of execution threads in a parallelized program should be less than or equal to the number of physical processor cores; too many threads will compete for processor resources and degrade parallel performance.
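Point ① is Amdahl's law. As a minimal statement (our notation, not from the original text), with p the parallelizable fraction of the program and n the number of processors:

```latex
S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

Even with unlimited processors, the achievable speedup is capped by the serial fraction 1 - p.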

2 OpenMP Parallelization Optimization

2.1 Introduction to the OpenMP Working Principle

OpenMP is a cross-platform multi-threaded parallel programming interface based on the shared-memory model. A main thread spawns a series of child threads and maps tasks onto them; the child threads execute in parallel and are assigned to different physical processors by the runtime environment. By default, each thread executes the code of the parallel region independently. Work-sharing constructs can be used to divide the work so that each thread executes only its assigned portion of the code. In this way, OpenMP supports both task parallelism and data parallelism.
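As a minimal illustration (a sketch of standard OpenMP usage, not code from the original article), the following C fragment shows a parallel region that every thread executes, plus a work-sharing construct that divides loop iterations among the threads:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel          /* every thread runs this region */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        #pragma omp for           /* work-sharing: iterations are split */
        for (int i = 0; i < 8; i++)
            printf("thread %d handles iteration %d\n",
                   omp_get_thread_num(), i);
    }
    return 0;
}
```

Compiled with gcc -fopenmp, the loop iterations are distributed across the threads of the region rather than repeated by each of them.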

Figure 2 Task Parallel model

The task-parallel model creates a series of separate threads, each running one task, with the threads independent of one another, as shown in Figure 2. OpenMP implements task assignment with the sections directive and the task directive; each task can run a different region of code independently, and tasks may be nested and recursive. Once created, a task may execute on any idle thread in the thread pool (whose size equals the number of physical threads).
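A minimal task sketch (task_a and task_b are hypothetical stand-ins for any two independent pieces of work):

```c
#include <stdio.h>
#include <omp.h>

static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single nowait   /* one thread creates the tasks ...   */
        {
            #pragma omp task        /* ... and any idle thread in the     */
            task_a();               /* pool may pick them up and run them */

            #pragma omp task
            task_b();
        }
    }
    return 0;
}
```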

Data parallelism is data-level parallelism: the data processed within a task are handled in parallel, as shown in Figure 3. The C for loop is the construct best suited to data parallelism.
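A minimal data-parallel sketch (the function and names are ours for illustration): the iterations of a C for loop are split across the threads, each handling a different slice of the array:

```c
#define N 1024

/* Scale every element of a[]; the iterations are divided among the
   threads, and each element is touched by exactly one thread. */
void scale(double a[N], double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] *= factor;
}
```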

Figure 3 Data parallel model


2.2 Principle of the Quicksort Algorithm

Quicksort is a recursive divide-and-conquer algorithm whose most critical step is choosing the sentinel element (the pivot). Elements of the sequence smaller than the pivot are placed to its left, and elements greater than the pivot are placed to its right. When the scan is complete, quicksort is called recursively on the two sub-sequences on either side of the pivot.
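For reference, a standard serial formulation (our sketch, using the Lomuto partition scheme; the paper's exact code is not reproduced here):

```c
/* Partition a[lo..hi] around the pivot a[hi]; return the pivot's final index. */
static int partition(int a[], int lo, int hi)
{
    int pivot = a[hi];                 /* sentinel (pivot) element        */
    int i = lo - 1;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {            /* smaller elements go to the left */
            i++;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }
    int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
    return i + 1;
}

static void quicksort(int a[], int lo, int hi)
{
    if (lo < hi) {
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);       /* recurse on the left sub-interval  */
        quicksort(a, p + 1, hi);       /* recurse on the right sub-interval */
    }
}
```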

The recursive calls in quicksort generate many tasks, and these tasks are independent of one another, which makes the algorithm well suited to OpenMP's task-parallel pattern. In addition, the pivot element determines the sizes of the left and right sub-intervals. Given the small cache space of the embedded platform, the pivot selection algorithm should be optimized so that the left and right sub-intervals are more balanced, meeting the requirement of load balancing.

 2.3 Task Parallelism Optimization

Analysis of quicksort shows that it is a recursive algorithm that produces a large number of repeated function calls during execution, and that these calls are independent of one another. In one scan, the algorithm first determines the pivot element, adjusts the data sequence once, and then recursively applies itself to the intervals on the left and right of the pivot.

As shown below, task-parallel optimization abstracts the operation on each sub-interval produced by a scan into a task and uses the OpenMP task directive #pragma omp task to execute the tasks in parallel, thereby achieving task-parallel optimization of quicksort.
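The original listing is not reproduced in this copy of the article; a minimal sketch of the idea, reusing partition() from the serial version above and adding a cutoff of our own choosing so that very small sub-intervals stay serial, might look like this:

```c
#define CUTOFF 1000   /* below this size, task creation costs more than it saves */

static void quicksort_task(int a[], int lo, int hi)
{
    if (lo >= hi)
        return;
    int p = partition(a, lo, hi);
    if (hi - lo < CUTOFF) {            /* small range: plain recursion */
        quicksort_task(a, lo, p - 1);
        quicksort_task(a, p + 1, hi);
    } else {                           /* each sub-interval becomes a task */
        #pragma omp task
        quicksort_task(a, lo, p - 1);
        #pragma omp task
        quicksort_task(a, p + 1, hi);
        #pragma omp taskwait           /* wait for both sub-sorts */
    }
}

void parallel_quicksort(int a[], int n)
{
    #pragma omp parallel
    #pragma omp single nowait          /* one thread seeds the recursion */
    quicksort_task(a, 0, n - 1);
}
```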

The amount of data in each task depends on the pivot element, so the chosen partitioning algorithm (partition algorithm) should split the data sequence as evenly as possible; both a simple partitioning scheme and the median-of-three method are tested.
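A common median-of-three formulation (again our sketch, not the paper's exact code): the first, middle, and last elements are compared, and their median is moved into the pivot slot used by partition(), which tends to balance the two sub-intervals:

```c
/* Order a[lo], a[mid], a[hi], then place the median at a[hi],
   where partition() expects the pivot. */
static void median_of_three(int a[], int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    int t;
    if (a[mid] < a[lo]) { t = a[mid]; a[mid] = a[lo]; a[lo] = t; }
    if (a[hi]  < a[lo]) { t = a[hi];  a[hi]  = a[lo]; a[lo] = t; }
    if (a[hi]  < a[mid]) { t = a[hi]; a[hi] = a[mid]; a[mid] = t; }
    /* now a[lo] <= a[mid] <= a[hi]; swap the median into the pivot slot */
    t = a[mid]; a[mid] = a[hi]; a[hi] = t;
}
```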

 2.4 Cache Optimization

The goal of cache friendliness is to reduce the amount of data copied between memory and cache. For 2^20 integers the data size is 4 MB, while the OMAP4430 test platform used in this paper has a 1 MB second-level cache, so the data must be divided into 4 parts.

As shown below, the algorithm divides the data into 4 parts and creates 4 quicksort tasks that execute in parallel. After each part has been sorted, the 4 parts must be combined into a complete sequence, so a merge step is required once the parallel tasks finish.
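That listing is likewise missing from this copy; below is a sketch of the scheme under the stated assumptions (2^20 ints, 1 MB L2 cache, 4 chunks), with the C library qsort standing in for the per-chunk quicksort:

```c
#include <stdlib.h>
#include <string.h>

#define PARTS 4   /* 4 MB of data / 1 MB of L2 cache */

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Merge the sorted runs a[lo..mid) and a[mid..hi) through a temp buffer. */
static void merge(int a[], int tmp[], int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

void cache_friendly_sort(int a[], int n)
{
    int *tmp = malloc((size_t)n * sizeof(int));
    int step = n / PARTS;

    /* Each chunk fits in the L2 cache and is sorted as an independent task. */
    #pragma omp parallel for
    for (int p = 0; p < PARTS; p++) {
        int lo = p * step;
        int hi = (p == PARTS - 1) ? n : lo + step;
        qsort(a + lo, (size_t)(hi - lo), sizeof(int), cmp_int);
    }

    /* Pairwise merge the sorted chunks back into one complete sequence. */
    for (int width = step; width < n; width *= 2)
        for (int lo = 0; lo + width < n; lo += 2 * width) {
            int mid = lo + width;
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge(a, tmp, lo, mid, hi);
        }

    free(tmp);
}
```

The final merge passes are serial, which is the extra cost that section 3 attributes to this scheme.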

3 Parallelization Performance Analysis

3.1 Introduction to the Experimental Environment

This paper uses the OMAP4430 embedded development platform from Texas Instruments. The OMAP4430 is an embedded multi-core processor with a symmetric multiprocessing dual-core ARM processor (dual-core ARM Cortex-A9) that has per-core first-level caches and a 1 MB second-level cache. The embedded operating system is Ubuntu 12.04, the compiler is arm-linux-gnueabihf-gcc, and GNU gprof is used to obtain algorithm execution times.

 3.2 Performance Testing

As shown in the table below, the performance of the parallel optimizations is analyzed by computing the speedup: the higher the speedup, the higher the parallelism of the algorithm, with 1 as the lower bound. The performance test uses 4 versions of the algorithm, a serial version, a 2-thread parallel version, a 4-thread parallel version, and a cache-optimized version, so that performance can be analyzed from different angles.
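Speedup here is the usual ratio of serial to parallel execution time:

```latex
\text{speedup} = \frac{T_{\text{serial}}}{T_{\text{parallel}}}
```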

As the line graph in Figure 4 shows, the three parallel-optimized versions of the algorithm all perform substantially better than the serial version; as shown in Table 1, their speedups are 1.30, 1.29, and 1.21. For the task-parallel scheme, 2-thread and 4-thread versions were tested, and the speedup analysis shows the 2-thread version to be slightly better than the 4-thread version. In theory, more parallel threads should perform better, but the OMAP4430 used here has only two symmetric multiprocessing cores: even if the algorithm creates 4 parallel threads, only 2 can actually execute at a time, and the 4 threads compete for the 2 physical cores, so performance drops compared with the 2-thread version.

Figure 4 Algorithm Execution time

Evaluation of a parallel algorithm should also consider its load balance. As shown in Tables 1 and 2, the standard error of the cache-optimization scheme is much smaller than that of the task-parallel scheme. The reason is that in the task-parallel scheme, the test data and the partition algorithm strongly affect how the interval is divided, so task execution times vary over a wide range. The cache-optimization scheme is in essence data parallelism: each task is sized according to the cache, so the tasks process roughly equal amounts of data and their execution times are more uniform. However, because the data must be merged after the parallel tasks complete, some performance is lost.

 Conclusion

Based on an analysis of the hardware architecture of embedded multi-core processors, the serial quicksort algorithm was optimized from the perspective of symmetric multiprocessing, with good results.

With the dual-core ARM processor (OMAP4430) as the test platform, parallel optimization was carried out through task parallelism and cache optimization. The performance tests show that task parallelism achieves a good speedup but poorer load balance, and that the number of parallel threads should not exceed the number of physical processor cores, since excess threads compete for processor resources and degrade performance. Cache optimization achieves good load balance, but the subsequent merge step costs some performance.

In short, parallel optimization on an embedded multi-core processor should, on the one hand, fully exploit the parallel capability of the processor to increase the parallelism of the program, and on the other hand, take the program's load balance into account so that performance remains consistent across different application environments.
