What is Code Modernization?
Modern high-performance computers are built from a combination of resources: multi-core processors, many-core processors, large caches, high-bandwidth inter-processor communication fabrics, and high-speed I/O capabilities. High-performance software must be designed to take full advantage of this wealth of resources. Whether you are re-architecting and/or tuning an existing application for maximum performance, or building a new application for existing or future machines, understanding the interplay between programming models and the efficient use of these resources is critical. Consider this a starting point for a comprehensive understanding of code modernization. When it comes to performance, your code matters!
Building a parallel version of software enables an application to process a given data set in less time, process multiple data sets in a fixed amount of time, or process large-scale data sets that would be prohibitive with unoptimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. In addition to that comparison, it is also useful to compare the achieved speedup with the upper limit of the potential speedup, which can be gauged using Amdahl's Law and Gustafson's Law.
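In their usual formulations, with p the fraction of the work that can be parallelized and n the number of processors, the two laws bound the achievable speedup as follows:

$$S_{\text{Amdahl}}(n) = \frac{1}{(1-p) + p/n}, \qquad S_{\text{Gustafson}}(n) = (1-p) + p\,n$$

Amdahl's Law bounds the speedup of a fixed-size problem, while Gustafson's Law describes the scaled speedup when the problem size grows with the processor count.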
Excellent code design takes several different levels of parallelism into account.
- The first level of parallelism is vector parallelism (within a core), where the same computational instruction is performed on large chunks of data. Both the scalar and parallel portions of code benefit from efficient vector computing.
- The second level of parallelism is thread parallelism, characterized by multiple threads of a single process that communicate through shared memory and cooperate on a given task.
- The third level of parallelism arises when many codes are developed as independent cooperating processes that communicate with one another through a message-passing system. This is called distributed-memory rank parallelism, so named because each process is assigned a unique rank number.
Developing code that uses all three levels of parallelism effectively, efficiently, and with high performance is the ideal approach to modernizing your code. A minimal sketch combining the three levels appears below.
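As an illustration of the three levels working together, here is a minimal hybrid sketch (not from the original article; the array size, the kernel, and the even division of work among ranks are simplifying assumptions):

```cpp
// Hypothetical sketch: the three levels of parallelism in one kernel.
// Level 3: MPI ranks split the data across processes (distributed memory).
// Level 2: OpenMP threads split each rank's slice (shared memory).
// Level 1: SIMD lanes process chunks of data within each thread.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;             // total elements (placeholder size)
    const int chunk = n / nranks;      // assumes nranks divides n evenly
    std::vector<float> a(chunk, 1.0f), b(chunk, 2.0f);

    #pragma omp parallel for simd      // levels 2 and 1 combined
    for (int i = 0; i < chunk; ++i)
        a[i] += 3.0f * b[i];

    MPI_Finalize();
    return 0;
}
```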
Factoring into these considerations is the memory model of the machine: the capacity and speed of main memory, memory access times with respect to the location of data, the sizes and number of caches, and the requirements for memory coherence.
Poor data alignment severely impacts performance when code is vectorized. Data should be organized in a cache-friendly manner; if it is not, performance suffers whenever the application requests data that is not in the cache. Memory access is fastest when the needed data is already in cache. Data transfers to and from cache occur in whole cache lines, so if the next piece of data is not within the current cache line, or is scattered across multiple cache lines, the application's cache efficiency drops.
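As a brief sketch of cache-friendly layout (the 64-byte line size is an assumption typical of current Intel processors, and the buffer size is illustrative):

```cpp
#include <cstddef>

// Hypothetical sketch: align storage to a 64-byte cache line (a typical
// size on current Intel hardware) so vector loads start on an aligned
// boundary and each fetched line is fully consumed before the next.
constexpr std::size_t kCacheLine = 64;

struct alignas(kCacheLine) AlignedBuffer {
    float data[1024];                 // contiguous, unit-stride storage
};

float sum(const AlignedBuffer& buf) {
    float s = 0.0f;
    for (float x : buf.data)          // unit-stride: no scattered lines
        s += x;
    return s;
}
```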
Division and transcendental math functions are expensive even when the instruction set supports them directly. If your application performs many division and square-root operations in run-time code, performance may degrade because the hardware provides only a limited number of functional units, and the pipelines into those units can become saturated. Because these instructions are expensive, it pays to cache frequently used values to improve performance.
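A minimal sketch of that caching idea (the function and names are illustrative): hoist a loop-invariant division out of the hot loop and reuse the cached reciprocal.

```cpp
// Hypothetical sketch: replace a per-iteration division with one cached
// reciprocal and a cheap multiply. (Results may differ from true division
// in the last bit, so verify the precision is acceptable.)
void normalize(float* out, const float* in, int n, float scale) {
    const float inv = 1.0f / scale;   // one expensive division, done once
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * inv;         // multiply instead of divide
}
```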
"One-size-fits-all" technology does not exist. People are too dependent on a problem being addressed and the long-term requirements for code, but good developers will focus on different levels of optimization, not only to meet current needs, but also to meet future needs.
Intel has built a complete suite of tools to assist with code modernization, including compilers, libraries, debuggers, performance analyzers, parallel optimization tools, and more. In addition, as a leader in parallel computing, Intel offers webinars, documentation, training samples, best practices, and case studies based on over 30 years of experience.
The Code Modernization 5-Stage Framework for Multi-Level Parallelism
The code modernization framework optimizes application performance in a systematic manner. It takes an application through five optimization stages, each building iteratively on the last to improve application performance. Before you begin the optimization process, however, consider whether your application needs to be re-architected (per the guidelines below) to achieve the highest performance; then optimize it following the code modernization framework.
Using this framework, an application can achieve the highest possible performance on Intel® architecture. The stepwise approach helps developers reach the best application performance in the shortest possible time; in other words, it enables the program to maximize its use of all the parallel hardware resources in the execution environment. The five stages are:
- Leverage optimization tools and libraries: profile the workload with Intel® VTune™ Amplifier to identify hotspots, and use Intel® Advisor XE to identify vectorization and threading opportunities. Use the Intel compilers to generate optimal code and, where appropriate, apply optimized libraries such as the Intel® Math Kernel Library, Intel® TBB, and OpenMP*.
- Scalar, serial optimization: maintain the correct precision, type constants explicitly, and use the appropriate functions and precision flags.
- Vectorization: exploit SIMD features together with data-layout optimizations. Use cache-aligned data structures, convert arrays of structures to structures of arrays (see the sketch after this list), and minimize conditional logic.
- Thread parallelism: profile thread scaling and affinitize threads to cores. Scaling problems are typically caused by thread synchronization or inefficient memory utilization.
- Scale the application from multi-core to many-core (distributed-memory rank parallelism): scaling is extremely important for highly parallel applications. Minimize changes and maximize performance as the execution target moves from one flavor of Intel architecture (Intel® Xeon® processors) to another (Intel® Xeon Phi™ coprocessors).
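To make the data-layout point in the vectorization stage concrete, here is a minimal sketch (the point fields and sizes are illustrative) of converting an array of structures (AoS) to a structure of arrays (SoA), which gives each field unit-stride, vector-friendly access:

```cpp
// Hypothetical sketch: AoS vs. SoA data layout.
// AoS: x, y, z are interleaved, so a loop over x strides through memory.
struct PointAoS { float x, y, z; };   // e.g., PointAoS points[1024];

// SoA: each field is contiguous, so a loop over x is unit-stride and
// maps naturally onto SIMD loads.
struct PointsSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};

float sum_x(const PointsSoA& p) {     // unit-stride loop vectorizes cleanly
    float s = 0.0f;
    for (int i = 0; i < 1024; ++i)
        s += p.x[i];
    return s;
}
```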
Code Modernization – The 5 Stages in Practice
Stage 1: When you begin an optimization project, you need to select an optimizing development environment. This choice has important implications for the steps that follow: not only does it affect the results you get, it can also drastically reduce your workload. The right optimizing development environment provides excellent compiler tools, ready-to-use optimized libraries, and debugging and performance-profiling tools that show you exactly what your code is doing at runtime. See the webinars on the Intel® Advisor XE tool for identifying vectorization and threading opportunities.
Stage 2: Once you have exhausted the ready-made optimized solutions available to your application, you need to begin the optimization process on the application source code to extract further performance. Before you embark on active parallel programming, make sure the application produces the correct results before you vectorize and parallelize it. Equally important, make sure it performs the minimum number of operations to get that correct result. You should consider data- and algorithm-related issues such as:
- Choosing the right floating-point precision
- Choosing the right approximation method and accuracy: polynomial vs. rational
- Avoiding jump algorithms
- Reducing loop operation strength by using iterative calculations
- Avoiding or minimizing conditional branches in your algorithms
- Avoiding repetitive calculations by reusing previously computed results (see the sketch after this list)
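As a small illustration of the strength-reduction and reuse points above (the cubic polynomial is a hypothetical example), Horner's rule removes the repeated pow() calls and reuses each intermediate result:

```cpp
// Hypothetical sketch: evaluate a*x^3 + b*x^2 + c*x + d.
// Naive form repeats work with expensive calls:
//   double y = a*pow(x, 3) + b*pow(x, 2) + c*x + d;
// Horner's rule reuses each intermediate result and needs no pow():
double cubic(double a, double b, double c, double d, double x) {
    return ((a * x + b) * x + c) * x + d;
}
```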
You must also deal with language-related performance issues. If you are using C/C++, the language-related issues include:
- Use explicit typing for all constants to avoid auto-promotion
- Choose the correct C run-time function class for your precision, e.g. doubles vs. floats: exp() vs. expf(); abs() vs. fabs()
- Explicitly tell the compiler about pointer aliasing
- Explicitly inline function calls to avoid call overhead (a combined sketch follows this list)
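A brief sketch pulling these C/C++ points together (the function and names are illustrative; __restrict is a common compiler extension, not standard C++):

```cpp
#include <cmath>

// Hypothetical sketch of the language-level points above:
// - 0.5f is an explicitly typed float constant: no auto-promotion.
// - expf()/fabsf() are the float variants of the run-time functions.
// - __restrict (a common compiler extension) promises the compiler
//   that out and in do not alias, enabling better optimization.
// - inline avoids call overhead for this small function.
inline void decay(float* __restrict out, const float* __restrict in, int n) {
    const float k = 0.5f;                     // explicitly typed constant
    const float damp = expf(-k);              // float version, not exp()
    for (int i = 0; i < n; ++i)
        out[i] = fabsf(in[i]) * damp;         // float version, not abs()
}
```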
Stage 3: Try vector-level parallelism. First try to vectorize the innermost loop. For efficient vector loops, make sure that control-flow divergence is minimal and that memory accesses are coherent. Outer-loop vectorization is a technique for enhancing performance: by default, the compiler vectorizes the innermost loop of a nested loop structure, but in some cases the innermost loop has too few iterations for vectorization to be worthwhile. If there is more work in the outer loop, a combination of elemental functions, strip-mining, and the SIMD pragma/directive can force vectorization at this outer, more profitable level (a sketch follows the notes below).
- SIMD performs best on "packed" and aligned input data, but by its nature it is penalized by control divergence. In addition, modern hardware delivers excellent SIMD and threading performance when the application implementation emphasizes data proximity.
- If the innermost loop does not have enough work (for example, the trip count is very low; measure the performance benefit of vectorization), or data dependencies prevent vectorizing the innermost loop, try to vectorize the outer loop instead. The outer loop is likely to exhibit control-flow divergence, especially when the trip count of the inner loop differs for each iteration of the outer loop; this limits the performance gains from vectorization. The memory access pattern of the outer loop is also more likely to be non-unit-stride than that of the inner loop, which results in gather/scatter instructions (rather than vector loads and stores) and greatly limits the scaling achieved through vectorization. Data transformations, such as transposing a two-dimensional array or converting arrays of structures to structures of arrays, can mitigate these problems.
- When the loop hierarchy is shallow, the guidelines above may result in a loop that must be both parallelized and vectorized. In that case, the loop must not only provide enough parallel work to compensate for the overhead, but also maintain control-flow uniformity and memory-access coherence.
- For more details, see Vectorization Essentials.
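The outer-loop technique described in this stage can be sketched as follows (a hypothetical kernel; the trip count of 3 and the pragma placement are illustrative, and actual profitability should be measured):

```cpp
// Hypothetical sketch: outer-loop vectorization with the SIMD directive.
// The inner loop has a trip count of only 3 (e.g., x/y/z components),
// too short to vectorize profitably, so we ask the compiler to spread
// vector lanes across iterations of the outer loop instead.
void scale_points(float (*p)[3], const float* w, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {      // vectorized: lanes run over i
        for (int k = 0; k < 3; ++k)    // short inner loop
            p[i][k] *= w[i];
    }
}
```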
Stage 4: Now we come to thread-level parallelism. Identify the outermost loop level and try to parallelize it. Obviously, this requires taking care of potential data races and moving data declarations inside the loop where necessary. It also requires that data be maintained in a cache-efficient way to reduce the overhead of maintaining it across multiple parallel paths. The rationale for choosing the outermost level is to give each individual thread as much work as possible. Amdahl's Law states that the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work must compensate for the overhead of parallelization, it helps for each thread to have as much parallel work as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize the next-outermost level that can be parallelized correctly (a minimal sketch follows the notes below).
- If the amount of parallel work at the outermost level meets the needs of the target hardware and can scale with a reasonable increase in parallel resources, you have achieved your parallelization goal. Do not add further parallelization, as the extra overhead would be significant (the thread-control overhead would negate any performance improvement) and no gains would be realized.
- If the parallel work is still insufficient (for example, core-scaling tests show scaling up to only a few cores rather than the actual core count), try to parallelize an additional layer, as far toward the outermost as possible. Note that you do not necessarily need to scale the loop hierarchy across all available cores, because other loop levels may be executing in parallel.
- If you still cannot produce scalable code in stage 2, the reason may be that the algorithm simply lacks enough parallel work. That means dividing a fixed amount of work among many threads gives each thread too little to do, so the overhead of starting and terminating threads swamps the useful work. Perhaps the algorithm can be scaled up to do more work, for example by trying a larger problem size.
- Make sure your parallel algorithm uses the cache efficiently. If it does not, rework it into a cache-efficient algorithm, because cache-inefficient algorithms do not scale with parallelism.
- For more details, see the Intel Guide for Developing Multithreaded Applications series.
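A minimal sketch of this stage (the matrix-vector kernel is illustrative): parallelize the outermost loop, declare per-iteration temporaries inside the loop so they are thread-private, and keep memory access unit-stride so each thread's work is cache friendly.

```cpp
// Hypothetical sketch: thread-level parallelism on the outermost loop.
// Each thread gets whole rows of work; sum is declared inside the loop,
// so it is private to each iteration and there is no data race.
void matvec(const float* A, const float* x, float* y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {                  // outermost loop
        float sum = 0.0f;                          // thread-private
        for (int j = 0; j < n; ++j)
            sum += A[(long)i * n + j] * x[j];      // unit-stride access
        y[i] = sum;
    }
}
```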
Stage 5: Finally, we come to multi-node (rank) parallelism. Many developers treat the Message Passing Interface (MPI) as a black box that "just works" behind the scenes to transfer data from one MPI task (process) to another. The beauty of MPI for developers is that the algorithmic code is hardware independent. The concern developers have is that on many-core architectures with more than 60 cores, communication between tasks can create a communication storm either within a node or across nodes. To mitigate these communication bottlenecks, applications can employ hybrid techniques, mixing a few MPI tasks with many OpenMP threads (a hybrid sketch follows below).
- For more details, see Parallelization Using Intel® MPI.
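A minimal sketch of the hybrid few-ranks-many-threads approach (the reduction kernel is illustrative; MPI_THREAD_FUNNELED is one common choice of thread-support level):

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical sketch: a few MPI ranks (e.g., one per node), many OpenMP
// threads per rank, so most parallelism stays in shared memory and the
// inter-task communication volume remains small.
int main(int argc, char** argv) {
    int provided;
    // FUNNELED: only the main thread of each rank makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)  // many threads per rank
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (1.0 + i);

    double total = 0.0;        // one collective per rank, not per thread
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```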
A well-optimized application should handle vector parallelization, multithreaded parallelization, and multi-node (rank) parallelization. To achieve this efficiently, however, a standard stepwise approach helps ensure that every stage's considerations are addressed. Depending on the specific needs of an individual application, the stages above can be (and often are) reordered, and you can iterate within a stage more than twice to reach the desired performance.
In our experience, all stages must be implemented to ensure that an application not only delivers outstanding performance on today's scalable hardware, but also scales effectively on future hardware.
Give it a try!