Increase the efficiency of SIMD usage in animations with SIMD data layout templates and data preprocessing


Brief introduction

To get the most out of SIMD [1], vectorization [2] alone is not enough. You can try adding #pragma omp simd [3] to a loop, check whether the compiler vectorizes it successfully, and be satisfied if performance improves. However, performance may not improve at all, or may even degrade. In either case, getting the full benefit of SIMD execution usually requires redesigning the algorithm and data layout so that the generated SIMD code is as efficient as possible. A welcome side effect is that the scalar (non-vectorized) version of the code usually gets faster as well.

In this article, we walk through a 3D animation algorithm example to show what can be done beyond adding a pragma; the techniques and methods introduced along the way can also help with your next vectorization effort. We also integrate the algorithm with SIMD Data Layout Templates (SDLT), a feature of the Intel® C++ compiler, to improve the data layout and SIMD efficiency. The complete source code for this article is available for download and contains additional details not covered here.

Background knowledge and problem statement

Sometimes, vectorizing a loop is not enough to improve an algorithm's performance. The Intel® C++ compiler may report that a loop "can be vectorized but may be inefficient." Just because a loop can be vectorized does not mean the generated code will be faster than the non-vectorized loop. If vectorization does not improve performance, it is worth pinpointing the reason. In general, obtaining efficient SIMD code requires redesigning the data layout and the algorithm. In many cases these SIMD-friendly changes are the main source of the performance improvement, whether or not the loop is vectorized; on top of that, SIMD performance improves significantly because the algorithm itself has become more efficient.

This article presents the sample source code and four additional versions of its loop, explaining the changes made to improve SIMD efficiency. Figure 1 serves as the reference for both this article and the downloadable source code. The sections on versions 0-3 are the core of this article. The additional version 4 section describes an advanced SDLT feature that eliminates SIMD conversion overhead.

Figure 1: Legend of the version numbers and the corresponding code modifications available in the downloadable source code. The version numbers also indicate the order of the modifications.

Algorithms that require gathering and scattering data hurt both scalar and SIMD performance, and a chain of gathers (or scatters) reduces performance further. If a loop contains indirect accesses (or non-unit-stride memory accesses [4]), Figure 2 shows that the compiler may generate gather instructions (either explicit gather instructions, or several instructions that emulate a gather). And because the indirect access must be issued per data member, the number of gather instructions grows with the number of data members in the structure. For example, if structure "A" contains 4 doubles, an indirect access to it generates 4 gathers. In some algorithms indirect access is unavoidable; whenever possible, however, you should look for a way to work around it. Avoiding inefficient operations such as gathers (and scatters) can significantly improve SIMD performance.
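
To make the cost of indirect access concrete, here is a minimal sketch (not taken from the article's source; the structure and function names are hypothetical) of a loop whose index is used to look up another index. When such a loop is vectorized, each structure member typically becomes an indexed load, i.e. a gather.

    struct Element4 { double a, b, c, d; };          // one element holds 4 doubles

    // Hypothetical illustration: indices[i] varies with the loop index, so each of
    // the 4 members is loaded with a gather (indexed load) when the loop vectorizes.
    void sumIndirect(double *out, const Element4 *data, const int *indices, int n)
    {
    #pragma omp simd
        for (int i = 0; i < n; ++i) {
            const Element4 &e = data[indices[i]];    // indirect access: 4 gathers
            out[i] = e.a + e.b + e.c + e.d;
        }
        // By contrast, accessing data[i] (unit stride) lets the compiler issue
        // contiguous vector loads instead of gathers.
    }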

In addition, data alignment also affects SIMD performance. If a loop operates on data that is not aligned to the SIMD lanes, performance suffers.

Figure 2: Indirect memory addressing can be a gather or a scatter, where the loop index is used to look up another index. A gather is an indexed load; a scatter is an indexed store.

We illustrate several techniques for improving the efficiency of the generated code, benefiting both scalar and SIMD execution, using a simple 3D mesh deformation algorithm as an example. In Figure 3, each vertex of the 3D mesh has an attachment containing data that affects how the vertex deforms. Each attachment indirectly references 4 joints. Attachments and joints are each stored in a 1D array.

Figure 3: Example algorithm for 3D mesh deformation.

Version 0: algorithm

In Figure 4, the algorithm loops over the attachment array. Each attachment holds 4 joint indices that are used to access the joint array indirectly, and each joint contains a 3x4 transformation matrix of 12 doubles. Each loop iteration therefore gathers 48 doubles (12 doubles x 4 joints). Gathering that many doubles degrades SIMD performance, so reducing or avoiding these gathers should improve SIMD performance significantly.

Figure 4: Version 0: the example algorithm, which performs 48 gathers per loop iteration.
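
Figure 4 is not reproduced here; the sketch below only approximates the shape of the version 0 loop under assumed names and members (Joint, Attachment, weights, a single output component), so it is a hedged illustration rather than the article's actual source. It shows where the 48 gathers come from: each iteration indirectly reads four 3x4 joint matrices.

    // Hypothetical types approximating the data described in the article.
    struct Joint      { double m[3][4]; };                     // 12 doubles
    struct Attachment { int jointIndex[4]; double weight[4]; double pos[3]; };

    // Version 0 (approximate shape): each iteration indirectly reads 4 joints,
    // i.e. 4 x 12 = 48 doubles gathered per iteration once the loop is vectorized.
    void deformV0(double *outX, const Attachment *attachments, int attachmentCount,
                  const Joint *joints)
    {
        for (int i = 0; i < attachmentCount; ++i) {
            const Attachment &att = attachments[i];
            double x = 0.0;
            for (int k = 0; k < 4; ++k) {
                const Joint &j = joints[att.jointIndex[k]];    // indirect access
                x += att.weight[k] * (j.m[0][0] * att.pos[0] + j.m[0][1] * att.pos[1]
                                    + j.m[0][2] * att.pos[2] + j.m[0][3]);
            }
            outX[i] = x;    // only one output component, to keep the sketch short
        }
    }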

Version 1: SIMD

For version 1, we vectorize the loop. In our example, the loop vectorizes successfully when "#pragma omp simd" is added (see Figure 5) because it meets the vectorization criteria (for example: no function calls, single entry and single exit, straight-line code [5]). It also follows the SDLT vectorization strategy of placing restrictions on objects to help the compiler privatize them successfully [6]. Note, however, that in many cases simply adding the pragma produces a compile error or incorrect code [7]; code refactoring is often required to bring a loop into a vectorizable state.

Figure 5: Version 1: line 8 of version 0 (see Figure 4) is modified to vectorize the loop.
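
In terms of the hypothetical sketch shown after Figure 4 (not the article's source), the version 1 change amounts to adding the pragma to the attachment loop:

    // Version 1 (approximate): same hypothetical types and body as the version 0
    // sketch; the only change is the pragma requesting vectorization.
    void deformV1(double *outX, const Attachment *attachments, int attachmentCount,
                  const Joint *joints)
    {
    #pragma omp simd
        for (int i = 0; i < attachmentCount; ++i) {
            const Attachment &att = attachments[i];
            double x = 0.0;
            for (int k = 0; k < 4; ++k) {
                const Joint &j = joints[att.jointIndex[k]];    // still indirect: gathers
                x += att.weight[k] * (j.m[0][0] * att.pos[0] + j.m[0][1] * att.pos[1]
                                    + j.m[0][2] * att.pos[2] + j.m[0][3]);
            }
            outX[i] = x;
        }
    }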

Figure 6 shows the Intel® C++ compiler (ICC) opt-report [8] for the version 1 loop. Targeting Intel® Advanced Vector Extensions (Intel® AVX) [9], the opt-report estimates only a 5% performance gain even though the loop is vectorized. In our case, version 1 actually performed about 15% worse than version 0. Whatever improvement the opt-report predicts, always measure the actual performance.

In addition, Figure 6 shows 48 "indirect access" masked indexed loads, one for each double of the transformation matrices of all 4 joints. The opt-report emits 48 corresponding "indirect access" remarks; one of them is listed in Figure 7. We should not ignore these remarks, but rather investigate their cause and try to resolve them.

Figure 6: Version 1: Intel® C++ compiler opt-report for the loop.

Figure 7: Version 1: Intel® C++ compiler opt-report remark about an indirect access.

Even though the loop is vectorized, the large number of gathers caused by the indirect accesses still prevents SIMD from improving performance.

Solutions

After successful vectorization, performance may or may not improve. Either way, vectorizing the loop should be the starting point of the optimization process, not the end. From here, tools such as the opt-report, the generated assembly, Intel® VTune™ Amplifier XE, and Intel® Advisor XE help investigate the causes of inefficiency, so that solutions can be implemented to improve the SIMD code.

Version 2 (part 1): Preprocessing the data to create uniform data

In our example, the opt-report lists 48 gathers and the corresponding "indirect access" remarks, which make up most of the report, so they deserve special attention. On further investigation, we find that they correspond to the 3x4 matrix values of the 4 joints (accessed indirectly inside the vectorized loop), 48 gathers in total. We know that gathers (and scatters) hurt performance. So how do we address this? Are these gathers unavoidable, or can the algorithm be arranged to avoid them?

For example, we ask ourselves: "Is there uniform data that is accessed inside the loop body and could be hoisted outside the loop?" The initial answer is no. We then ask: "Can the algorithm be redesigned so that this data becomes uniform (loop invariant)?"

Figure 8: Sorting the algorithm's data. On the left, the loop iterates over attachments whose joint indices all differ. On the right, the attachments are sorted so that each (inner) sub-loop works on the same uniform set of joint data.

As shown in Figure 8, many separate attachments share the same joint index values. By sorting the attachments, all attachments that share the same indices can be grouped together, creating the opportunity to iterate over a subset of the attachments for which the joint data is uniform (loop invariant) across the iteration space of the sub-loop. This allows the joint data to be hoisted outside the vectorized inner loop, so the inner loop no longer contains any gathers.

Figure 9: Version 2: the algorithm redesigned to create uniform (loop-invariant) data.

Figure 9 shows the code after the data array has been sorted to group elements that share uniform data; the original loop becomes an outer loop plus an inner (vectorized) loop that avoids the gathers. An array of index sets (mIndicesSetArray) tracks the start and end indices within the sorted array, giving us the outer loop and the inner loop. Also, because the data has been reordered, a workIdIndex is needed to track each element's original location so the results can be written out.
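
The sketch below approximates the reworked loop structure using hypothetical types modeled on the names mentioned above (mIndicesSetArray, mSortedAttachments, workIdIndex); it is not the article's downloadable source. The joint data is read once per group, outside the vectorized inner loop.

    // Hypothetical types for the sorted layout.
    struct Joint            { double m[3][4]; };
    struct IndicesSet       { int begin, end; int jointIndex[4]; };
    struct AttachmentSorted { double weight[4]; double pos[3]; int workIdIndex; };

    // Version 2 (approximate shape): outer loop over groups of attachments that
    // share the same 4 joints; the uniform joint data is hoisted out of the inner
    // loop, so the vectorized inner loop needs no gathers.
    void deformV2(double *outX,
                  const IndicesSet *indicesSetArray, int setCount,
                  const AttachmentSorted *sortedAttachments, const Joint *joints)
    {
        for (int s = 0; s < setCount; ++s) {
            const IndicesSet &set = indicesSetArray[s];

            // Uniform (loop-invariant) data for this sub-loop: read once per group.
            const Joint j0 = joints[set.jointIndex[0]];
            const Joint j1 = joints[set.jointIndex[1]];
            const Joint j2 = joints[set.jointIndex[2]];
            const Joint j3 = joints[set.jointIndex[3]];

    #pragma omp simd
            for (int i = set.begin; i < set.end; ++i) {
                const AttachmentSorted &att = sortedAttachments[i];
                double x = att.weight[0] * (j0.m[0][0] * att.pos[0] + j0.m[0][3])
                         + att.weight[1] * (j1.m[0][0] * att.pos[0] + j1.m[0][3])
                         + att.weight[2] * (j2.m[0][0] * att.pos[0] + j2.m[0][3])
                         + att.weight[3] * (j3.m[0][0] * att.pos[0] + j3.m[0][3]);
                outX[att.workIdIndex] = x;   // scatter: results return to original order
            }
        }
    }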

The opt-report (see Figure 10) no longer reports the 48 masked indexed loads (gathers) for the joint data, and it estimates a 2.35x Intel® AVX speedup. In our case, the measured speedup was about 2.3x.

Figure 10: Version 2: Intel® C++ compiler opt-report for the reworked loops that use uniform joint data.

In Figure 10, note that the opt-report still reports 8 "masked strided loads". These come from the array-of-structures (AOS) memory layout of the mSortedAttachments array. Ideally, we want "unmasked aligned unit-stride" loads; we will see later how to achieve this with SDLT. Also note that the opt-report (see Figure 10) shows 3 scatters. They occur because we reordered the input data, so the results must be written back to the output in the correct order (line 29 of Figure 9). But scattering 3 values is far better than gathering 48: a small overhead that avoids a much larger cost.

Version 2 (part 2): Padding the data

At this point, the opt-report predicts a significant performance improvement. However, because we have reorganized the one large attachment loop into many small sub-loops, we notice that performance is not optimal when the actual loop executions have short trip counts. With short trip counts, a large share of the execution time is spent in the peel or remainder loops, which are not fully vectorized. The example in Figure 11 shows how unaligned data causes execution in a peel loop, a main loop, and a remainder loop; this happens whenever the starting or ending index of the iteration space (or both) is not a multiple of the number of SIMD lanes. Ideally, we want all of the execution time to be spent in the main SIMD loop.

Figure 11: Anatomy of a SIMD loop. When the compiler vectorizes a loop, it generates code for three kinds of loops: the main SIMD loop, the peel loop, and the remainder loop. The figure shows an example with 4 SIMD lanes and an iteration space of 3-18. The peel loop processes element 3, the main loop processes 4 elements at a time (starting at the lane boundary 4, up to 15), and the remainder loop handles 16-18.
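
The split described in Figure 11 can be reproduced with a little index arithmetic. This small sketch (not from the article) prints which indices land in the peel, main, and remainder loops for 4 lanes and the iteration space 3-18.

    #include <cstdio>

    int main()
    {
        const int lanes = 4, begin = 3, end = 19;                       // elements 3..18
        const int mainBegin = ((begin + lanes - 1) / lanes) * lanes;    // first lane-aligned index: 4
        const int mainEnd   = mainBegin + ((end - mainBegin) / lanes) * lanes;  // 16
        std::printf("peel:      %d..%d\n", begin, mainBegin - 1);       // 3..3
        std::printf("main SIMD: %d..%d\n", mainBegin, mainEnd - 1);     // 4..15
        std::printf("remainder: %d..%d\n", mainEnd, end - 1);           // 16..18
        return 0;
    }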

Figure 12: Intel® VTune™ Amplifier XE (2016) can be used to see where time is spent in the corresponding assembly code. When viewing the executed assembly in Intel VTune Amplifier XE, the blue bars in its scroll bar indicate execution time. By identifying the peel, main, and remainder loops in the assembly, you can determine how much time (if any) is spent outside the vectorized main loop.

Therefore, in addition to sorting the attachments, padding the attachment data so that each group is a multiple of the number of SIMD lanes can also improve SIMD performance. Figure 13 illustrates how padding the data array lets all execution take place in the main SIMD loop, which is the ideal state. Results may vary, but padding the data is often a very useful technique.

Figure 13: Padding the data array. In this 4-lane example, the attachments are sorted into two sub-loop groups. (Left) In sub-loop 1, attachments 0-3 are processed in the main loop and the remaining two elements (4 and 5) in the remainder loop; sub-loop 2 has a trip count of only 3, all of which is handled by the peel loop. (Right) Each sub-loop is padded to a multiple of the 4 SIMD lanes, so all attachments are handled by the vectorized main loop.
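
A hedged sketch of the padding step, reusing the hypothetical IndicesSet and AttachmentSorted types from the version 2 sketch; the dummy work-id slot and the choice to replicate a real element are assumptions, not the article's implementation.

    #include <vector>

    // Pad each sorted sub-group so its size is a multiple of the SIMD lane count.
    // Padded entries copy a real element and write their result to a dummy slot,
    // so the vectorized main loop can process every group without peel/remainder work.
    void padGroups(std::vector<AttachmentSorted> &sorted,
                   std::vector<IndicesSet> &sets, int lanes, int dummyWorkId)
    {
        std::vector<AttachmentSorted> padded;
        for (IndicesSet &set : sets) {
            const int count    = set.end - set.begin;
            const int rounded  = ((count + lanes - 1) / lanes) * lanes;
            const int newBegin = static_cast<int>(padded.size());
            for (int i = set.begin; i < set.end; ++i)
                padded.push_back(sorted[i]);
            for (int i = count; i < rounded; ++i) {
                AttachmentSorted pad = sorted[set.begin];   // replicate a valid element
                pad.workIdIndex = dummyWorkId;              // its result is discarded
                padded.push_back(pad);
            }
            set.begin = newBegin;
            set.end   = newBegin + rounded;
        }
        sorted.swap(padded);
    }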

Version 3: SDLT containers

Having redesigned the algorithm to avoid the gathers and significantly improved SIMD performance, we can now use SDLT to make the SIMD code even more efficient. Up to this point, all loads have been masked and unaligned; ideally, we want unmasked, aligned, unit-stride loads. We use SDLT primitives and containers to achieve this. SDLT helps the compiler privatize local variables in the SIMD loop, so that each SIMD lane has its own instance of the variable, and the SDLT containers and accessors handle data layout conversion and alignment automatically.

In Figure 14, the source code shows the modifications required to integrate SDLT. The main changes are to declare an SDLT_PRIMITIVE for the AttachmentSorted structure, and to change the input container for the attachment array from a std::vector (an array-of-structures (AOS) layout) to an SDLT container. Programmers can use operator [] on SDLT accessors just as they would on C arrays or std::vector. We initially used the SDLT structure-of-arrays (SOA) container (sdlt::soa1d_container), but the array-of-structures-of-arrays (ASA) container (sdlt::asa1d_container) also improves performance significantly. Switching the SDLT container type (via a typedef) to test which performs best is straightforward, so we recommend trying both. In Figure 14 we also use the SDLT_SIMD_LOOP macro, a "preview" feature in ICC 16.2 (SDLT v2) that is compatible with both the ASA and SOA container types.

Figure 14: Version 3: integrating SDLT containers (lines 1-3 and 7) and accessors (lines 8 and 19), as well as the preview SDLT_SIMD_LOOP macro (lines 17 and 23). Only the differences from version 2 are shown.
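
Figure 14 is not reproduced here. The following is a minimal sketch of the same idea, assuming the SDLT API roughly as documented (SDLT_PRIMITIVE, sdlt::soa1d_container, accessors); header names and details may differ between compiler versions, the struct members are hypothetical, and the SDLT_SIMD_LOOP preview macro is not shown (a plain #pragma omp simd is used instead).

    #include <sdlt/primitive.h>          // header names per the SDLT documentation;
    #include <sdlt/soa1d_container.h>    // they may vary by SDLT/compiler version

    struct Joint { double m[3][4]; };    // as in the earlier sketches

    // SDLT primitives must be POD with no nested arrays, so the members are scalars.
    struct AttachmentSorted {
        double w0, w1, w2, w3;           // joint weights
        double px, py, pz;               // vertex position
        int    workIdIndex;              // original location for the result
    };
    SDLT_PRIMITIVE(AttachmentSorted, w0, w1, w2, w3, px, py, pz, workIdIndex)

    // SOA container; switching this typedef to sdlt::asa1d_container is the
    // one-line experiment the article recommends.
    typedef sdlt::soa1d_container<AttachmentSorted> AttachmentContainer;

    void deformV3(const AttachmentContainer &attachments, int begin, int end,
                  const Joint &j0, double *outX)
    {
        auto att = attachments.const_access();    // accessor, indexable like an array
    #pragma omp simd
        for (int i = begin; i < end; ++i) {
            AttachmentSorted a = att[i];          // unmasked, aligned, unit-stride loads
            outX[a.workIdIndex] = a.w0 * (j0.m[0][0] * a.px + j0.m[0][3]);  // simplified
        }
    }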

Figure 15: Version 3: Intel® C++ compiler opt-report when using SDLT primitives and containers.

In Figure 15, the opt-report estimates a 1.88x speedup for version 3. Keep in mind that this is only an estimate, not a measured result; in our case, version 3 achieved an actual 3.17x speedup. Also recall that the opt-report for version 2 (Figure 10) reported "masked strided" loads. Now (Figure 15) the loads are "unmasked", "aligned", and "unit stride". This is the ideal state, achieved by improving the data layout and memory access efficiency with the SDLT container.

Version 4: sdlt::uniform_soa_over1d

In the version 4 algorithm, we find further opportunities for improvement. Note that, from one sub-loop to the next, 3 of the 4 joints' data can be reused as uniform data inside the inner loop. It is also important to understand that preparing the uniform data each time we enter the SIMD loop has a cost, and in version 3 we pay that cost on every outer-loop iteration.

For SIMD loops, preparing the uniform data before the iterations begin can incur overhead [10], depending on the SIMD instruction set. For each uniform value, the compiler may 1) load the scalar into a register, 2) broadcast the scalar value to all lanes of a SIMD register, and 3) store that SIMD register to a new location on the stack for use by the SIMD loop body. This overhead is easily amortized over a long trip count, but with short trip counts it can hurt performance. In version 3, every outer-loop iteration incurs this overhead for 4 joints, 48 doubles in total (12 doubles per joint).

Figure 16: Finding the trip count: Intel® Advisor XE (2016) has a useful feature that reports the trip counts of executed loops, making it easy to identify short and long trip counts.

In this scenario, SDLT lets you manage the SIMD data conversion explicitly, deciding when the overhead is incurred instead of paying it automatically. The advanced SDLT feature for this is sdlt::uniform_soa_over1d. It helps eliminate the SIMD conversion overhead inside the loop and gives the user control over when the overhead is paid. The idea is to keep the loop-invariant data in a SIMD-ready format so that the SIMD loop can access it directly, without conversion. It also allows the uniform data to be partially updated and reused, further improving performance.

To illustrate when the SIMD data conversion overhead occurs, and how SDLT helps eliminate it, Figures 17 and 18 give a pseudo-code example. Figure 17 shows the overhead incurred on every outer-loop iteration (line 18), once per double of uniform data accessed (12 doubles). Figure 18 shows how sdlt::uniform_soa_over1d lets this overhead occur only once (line 6), reducing the overall cost. This advanced feature can be a big win in specific scenarios; users should experiment, and actual results may vary.

Figure 17: Before entering the SIMD loop on line 12, for each uniform value the compiler may 1) load the scalar into a register, 2) broadcast the scalar value to all lanes of a SIMD register, and 3) store that SIMD register to a new location on the stack for use by the SIMD loop body. This overhead is easily amortized over a long trip count, but with short trip counts it can hurt performance.

Figure 18: By using sdlt::uniform_soa_over1d to decide when the overhead is incurred, the SIMD data conversion can be managed explicitly instead of being paid automatically. This advanced SDLT feature helps eliminate SIMD conversion overhead inside loops and gives you control over when the overhead is paid. The idea is to keep the loop-invariant data in a SIMD-ready format so that the SIMD loop can access it directly, without conversion.
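
Figures 17 and 18 are pseudo-code in the original article and are not reproduced here; the hedged sketch below (reusing the hypothetical IndicesSet and Joint types from the version 2 sketch) conveys the contrast they draw. The exact sdlt::uniform_soa_over1d API is deliberately not shown.

    // Approximate idea of Figure 17: the uniform joint values are made SIMD-ready
    // on every outer-loop iteration, just before entering the inner SIMD loop.
    void perIterationConversion(const IndicesSet *sets, int setCount, const Joint *joints)
    {
        for (int s = 0; s < setCount; ++s) {
            const Joint j0 = joints[sets[s].jointIndex[0]];   // ... j1, j2, j3 likewise
            // Entering the SIMD loop below, the compiler may load, broadcast, and
            // spill a SIMD register for each of the 12 uniform doubles per joint,
            // and it does so again on every outer-loop iteration.
    #pragma omp simd
            for (int i = sets[s].begin; i < sets[s].end; ++i) {
                (void)j0;   // ... deform using the uniform joints ...
            }
        }
    }

    // Approximate idea of Figure 18: sdlt::uniform_soa_over1d keeps the uniform data
    // in a SIMD-ready layout that is filled (or partially updated) outside the loop
    // nest, so the conversion cost is paid only when the data actually changes,
    // rather than on every outer-loop iteration.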

So, to further improve performance for short trip counts, the first step is to redesign the algorithm to reuse 3 of the 4 joints from outer-loop index i to i+1, as shown in Figure 19. We then use the SDLT feature to help eliminate the overhead of preparing the SIMD data for each sub-loop.

Figure 19: Version 4: the uniform data for 3 of the 4 joints can be reused in the next sub-loop, so the uniform data is only partially updated, minimizing the loads (gathers in the outer loop). It also uses sdlt::uniform_soa_over1d to keep the uniform data in a SIMD-ready format, minimizing the SIMD conversion overhead across all sub-loops.

By refactoring the loops to reuse the uniform data from one sub-loop to the next and only partially update it, we need to update only 1/4 of the uniform data on average. The overhead of setting up the uniform data for the SIMD loop is therefore reduced by about 75%.
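
A hedged sketch of the version 4 restructuring (hypothetical names; the actual source also keeps the uniform joints in sdlt::uniform_soa_over1d, which is not shown here): consecutive sub-loops are ordered so that they share 3 of their 4 joints, so only the changed joint is reloaded.

    void deformV4(const IndicesSet *sets, int setCount, const Joint *joints)
    {
        Joint j[4];                                            // uniform joints, reused
        for (int k = 0; k < 4; ++k)
            j[k] = joints[sets[0].jointIndex[k]];

        for (int s = 0; s < setCount; ++s) {
            if (s > 0) {
                // Partial update: reload only the joints whose index changed
                // (on average 1 of the 4, by construction of the sort order).
                for (int k = 0; k < 4; ++k)
                    if (sets[s].jointIndex[k] != sets[s - 1].jointIndex[k])
                        j[k] = joints[sets[s].jointIndex[k]];
            }
    #pragma omp simd
            for (int i = sets[s].begin; i < sets[s].end; ++i) {
                (void)j;   // ... deform using the mostly reused uniform joints j[0..3] ...
            }
        }
    }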

Conclusion

Figure 20: Performance improvements achieved with Intel® Advanced Vector Extensions (Intel® AVX) on an Intel® Xeon® processor E5-2699 v3 (code-named Haswell).

Vectorizing code is only the beginning of achieving SIMD performance gains, not the end. After vectorization, use the available resources and tools (such as the optimization report, Intel VTune Amplifier XE, and Intel Advisor XE) to understand how efficient the generated code is. The analysis often reveals optimization opportunities that help both the scalar and the SIMD code. Then apply and measure the relevant techniques, both general ones and those described in this article. You may need to rethink the algorithm and the data layout to maximize the efficiency of your code, and in particular of the generated assembly.

As the example shows, version 2 of the algorithm, which preprocesses the data to eliminate all indirect accesses (gathers), delivered the largest gain. Version 3 uses SDLT to improve memory access with unmasked, aligned, unit-stride loads and pads the data to the SIMD lane boundaries, further improving performance significantly. For the short-trip-count scenario, we used the advanced SDLT feature to minimize the overall cost of preparing the uniform data.

Resources
    1. Download an example that contains source code:
      Https://software.intel.com/sites/default/files/managed/de/3f/animation-simd-sdlt-whitepaper.tar.gz
    2. SDLT documentation (with partial code examples):
      Https://software.intel.com/zh-cn/code-samples/intel-compiler/intel-compiler-features/intel-sdlt
    3. SIGGRAPH 2015: DreamWorks Animation (DWA): how skin deformation performance was increased 4x with SIMD:
      Http://www.slideshare.net/IntelSoftware/dreamwork-animation-dwa
    4. "Try before you buy" Intel® Parallel Studio XE evaluation version:
      Http://software.intel.com/intel-parallel-studio-xe/try-buy
    5. Free Intel® Parallel Studio XE for qualified students, educators, academic researchers, and open-source contributors:
      Https://software.intel.com/zh-cn/qualify-for-free-software
    6. Intel® VTune™ Amplifier 2016:
      Https://software.intel.com/zh-cn/intel-vtune-amplifier-xe
    7. Intel® Advisor:
      Https://software.intel.com/zh-cn/intel-advisor-xe
Footnotes

1 Single instruction, multiple data (SIMD) refers to exploiting data-level parallelism: one instruction processes multiple data elements at the same time. It contrasts with traditional scalar operation, in which one instruction processes a single data element.

2 Vectorization refers to converting a computer program from a scalar implementation to a vector (SIMD) implementation.

3 pragma simd: https://software.intel.com/zh-cn/node/583427. pragma omp simd: https://software.intel.com/zh-cn/node/583456

4 Non-unit-stride memory access means that consecutive loop iterations access data at non-contiguous memory locations, which can seriously hurt performance. Conversely, unit-stride (sequential) memory access is significantly more efficient.

5 For reference: Https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf

6 SDLT primitives place restrictions on objects that help the compiler privatize local variables in the SIMD loop, so that each SIMD lane has a private instance of the variable. To meet these restrictions, the object must be plain old data (POD), have inline object members, have no nested arrays, and have no virtual functions.

7 During vectorization, developers should experiment with the various pragmas (such as simd, omp simd, ivdep, and vector {always [assert]}) and consult the opt-report.

8 For the Linux*-based Intel® C++ compiler 16.0 (2016), we added the command-line options "-qopt-report=5 -qopt-report-phase=vec" to generate the opt-report (*.optrpt).

9 To generate Intel® Advanced Vector Extensions (Intel® AVX) instructions with the Intel® C++ compiler 16.0, add the option "-xAVX" to the compile command line.

10 The AVX-512 instruction set includes broadcast load instructions, which reduce the SIMD overhead of preparing the uniform data before the iterations begin.

Software and workloads used in the performance tests may have been optimized for performance only on Intel microprocessors.

Tests such as SYSmark* and MobileMark* are measured using specific computer systems, hardware, software, operating systems, and functions; any change to these elements can change the results. You should consult other information and performance tests, including the performance of the product when combined with other products, to fully evaluate the target product. For more information, please visit http://www.intel.com/content/www/cn/zh/benchmarks/intel-product-performance.html.

Configuration: Intel® Xeon® processor E5-2699 v3 (45M cache, 2.30 GHz). CPU: 2 x 18 cores, 2.3 GHz. Uncore: 2.3 GHz. Intel® QuickPath Interconnect: 9.6 GT/s. RAM: 128 GB, DDR4-2133 MHz (x 8 GB). Disk: 7200-RPM SATA disk plus a solid-state drive. Hyper-Threading Technology off, Turbo Boost off. Red Hat Enterprise Linux* Server 7.0, kernel 3.10.0-123.el7.x86_64.

