Intel® Xeon Phi™ Processor Optimization Tutorial


1. Introduction

This tutorial covers a variety of optimization techniques for applications that run on the Intel® Xeon Phi™ processor. The optimization process in this tutorial is divided into three parts:

    • The first section describes general optimization techniques for vectorizing (data parallelism) the code.
    • The second section describes how to add thread-level parallelism to take advantage of all the available cores in the processor.
    • The third section optimizes the code by enabling memory optimization on the Intel Xeon Phi processor.

The final conclusion section will demonstrate the performance improvements achieved by each optimization step in a graphical way.

The optimization process is as follows: we start from a serial sample code as the performance baseline. We then apply general optimization techniques to obtain a vectorized version of the code, and further add thread-level parallelism to the vectorized code to obtain a parallel version. Finally, we use Intel® VTune™ Amplifier to analyze the memory bandwidth of the parallel code and further improve performance with high-bandwidth memory. This tutorial provides these three versions of the code (mySerialApp.c, myVectorizedApp.c, and myParallelApp.c) as attachments.

The sample code is a streaming application with two buffers holding the input and the output. The first (input) dataset contains the coefficients of quadratic equations. The second (output) dataset is used to store the roots of each quadratic equation. For simplicity, the coefficients are chosen so that each quadratic equation always has two real roots.

Consider the quadratic equation:

    a*x^2 + b*x + c = 0

Its two roots are given by the well-known formula:

    x1 = (-b + sqrt(b^2 - 4*a*c)) / (2*a)
    x2 = (-b - sqrt(b^2 - 4*a*c)) / (2*a)

The conditions for obtaining two real roots are b^2 - 4*a*c > 0 and a != 0.

2. Hardware and Software

The program runs on a pre-production Intel® Xeon Phi™ processor (model 7250, 68 cores, clock speed 1.4 GHz, 96 GB DDR4 RAM, 16 GB of multi-channel dynamic random access memory (MCDRAM)). With 4 hardware threads per core, the system has a total of 272 hardware threads. The system runs Red Hat Enterprise Linux* 7.2, Intel® Xeon Phi™ processor software version 1.3.1, and Intel® Parallel Studio XE 2016 update 3.

To see the processor type and the number of logical CPUs in the system, you can display the contents of /proc/cpuinfo. For example:
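
The original screenshot of the output is not reproduced in this copy of the article. A couple of illustrative commands (the exact output depends on the system; on the test system described below, the first command reports the 272 logical CPUs):

$ grep -c processor /proc/cpuinfo
272
$ grep flags /proc/cpuinfo | head -1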

The full output on the test system shows 272 CPUs, or hardware threads. Note that the flags field lists the instruction extensions avx512f, avx512pf, avx512er, and avx512cd, which are all instruction extensions supported by Intel Xeon Phi processors.

You can also run lscpu to display information about the CPU:

The above command shows that the system consists of 1 socket, 68 cores, and 272 CPUs. It also shows that the system has 2 NUMA nodes and that all 272 CPUs belong to NUMA node 0. For more information on NUMA, see the Introduction to MCDRAM (High Bandwidth Memory) on Knights Landing article.

Before analyzing and optimizing the sample program, compile it and run the binary to obtain the baseline performance.

3. Benchmark Code Evaluation

A straightforward implementation is shown in the accompanying program mySerialApp.c. The coefficients a, b, and c are grouped into a Coefficients structure, and the roots x1 and x2 are grouped into a Roots structure. Coefficients and roots are single-precision floating-point numbers. Each coefficient tuple corresponds to one root tuple. The program allocates n coefficient tuples and n root tuples, where n is a large number (n = 512M elements, that is, 512*1024*1024 = 536,870,912 elements). The coefficient structure and root structure are as follows:
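
The structure definitions were not reproduced in this copy of the article. A minimal sketch consistent with the description above (single-precision fields a, b, c and x1, x2; the exact declarations in mySerialApp.c may differ):

/* Hypothetical reconstruction of the AoS layout described above:
 * one coefficient tuple (a, b, c) and one root tuple (x1, x2)
 * per quadratic equation, all in single precision. */
typedef struct {
    float a;
    float b;
    float c;
} Coefficients;

typedef struct {
    float x1;
    float x2;
} Roots;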

The simple program computes the real roots x1 and x2 according to the formula above. We use standard system timers to measure the computation time; the buffer allocation and initialization times are not measured. The simple program repeats the calculation 10 times.
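
The compute loop itself is not reproduced here. A minimal sketch of what the serial kernel might look like, using the structures above (function and variable names are illustrative, not the actual mySerialApp.c source):

#include <math.h>
#include <stddef.h>

/* Hypothetical serial kernel: for each coefficient tuple, evaluate the
 * quadratic formula and store both real roots. */
void compute_roots_serial(const Coefficients *coef, Roots *roots, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float a = coef[i].a, b = coef[i].b, c = coef[i].c;
        /* sqrt() takes a double, so the float argument is converted up
         * and the result converted back - see section 4.2. */
        float d = sqrt(b * b - 4 * a * c);
        roots[i].x1 = (-b + d) / (2 * a);
        roots[i].x2 = (-b - d) / (2 * a);
    }
}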

To begin, compile the baseline code with the Intel® C++ Compiler to benchmark the application's baseline performance:

$ icc mySerialApp.c

By default, the compiler compiles with the switch -O2 (optimized for maximum speed). Then run the app:

$ ./a.out
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
SERIAL
Elapsed time in msec: 461,222 (after 10 iterations)

The output shows that the system took 461,222 milliseconds over 10 iterations to stream the data, compute the roots, and store the results for the large number of entries (N = 512M elements). The program computes one root tuple for each coefficient tuple. Note that the benchmark code does not take advantage of the large number of available cores or the SIMD instructions in the system, because it runs in serial, scalar mode (one tuple element is processed at a time by a single thread). Therefore, only one hardware thread (CPU) is running while the other CPUs are idle. You can use the compiler options -qopt-report=5 -qopt-report-phase:vec to generate a vectorization report (*.optrpt) to verify this.

$ icc mySerialApp.c -qopt-report=5 -qopt-report-phase:vec

After measuring benchmark code performance, we began to vectorize the code.

4. Code Vectorization

4.1. Change the Array of Structures to a Structure of Arrays; Do Not Use Multiple Levels of Indirection in Buffer Allocation

The first way to improve performance is to change the array of structures (AoS) into a structure of arrays (SoA). SoA increases the amount of data accessed with unit stride. Instead of defining a large number of coefficient tuples (a, b, c) and root tuples (x1, x2), we rearrange the data structures into 5 large arrays: A, B, C, X1, and X2 (see the program myVectorizedApp.c, sketched below). Also, instead of allocating memory with malloc, we use _mm_malloc to align the data on 64-byte boundaries (see the next section).
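
A minimal sketch of the rearranged SoA allocation, assuming n elements and 64-byte alignment as described below (the array names follow the text, but the exact code in myVectorizedApp.c may differ):

#include <stddef.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free with the Intel compiler */

/* Hypothetical SoA buffers: five separate arrays instead of two arrays
 * of structs, each allocated on a 64-byte boundary. */
float *A, *B, *C, *X1, *X2;

static void allocate_soa(size_t n)
{
    A  = (float *)_mm_malloc(n * sizeof(float), 64);
    B  = (float *)_mm_malloc(n * sizeof(float), 64);
    C  = (float *)_mm_malloc(n * sizeof(float), 64);
    X1 = (float *)_mm_malloc(n * sizeof(float), 64);
    X2 = (float *)_mm_malloc(n * sizeof(float), 64);
}

static void free_soa(void)
{
    _mm_free(A); _mm_free(B); _mm_free(C); _mm_free(X1); _mm_free(X2);
}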

4.2. Other Performance Improvements: Eliminate Type Conversions, Align Data

The next step is to eliminate unnecessary type conversions. For example, the sqrt() function takes a double as input. Since this program uses single precision, the compiler has to convert the single-precision value to double precision and back. To eliminate the unnecessary conversion, we use sqrtf() instead of sqrt(). Similarly, we use single-precision constants rather than integer ones: for example, 4.0f instead of 4. Note that 4.0 (without the f suffix) is a double-precision floating-point constant, while 4.0f is a single-precision one.

Data alignment helps data move efficiently in and out of memory. For the Intel Xeon Phi processor, memory data movement is optimal when the data start address lies on a 64-byte boundary, just as for the Intel® Xeon Phi™ coprocessor. To help the compiler vectorize, you need to allocate memory with 64-byte alignment and use pragmas/directives where the data are used to tell the compiler that memory accesses are aligned. Vectorization works best on properly aligned data. Vectorization in this article refers to the ability to process multiple data elements with a single instruction (SIMD).

In the example above, we use _mm_malloc() and _mm_free() to allocate and free the dynamically allocated arrays. Note that _mm_malloc() is equivalent to malloc(), but it takes an alignment parameter (in bytes) as the second argument, which is 64 bytes for the Intel Xeon Phi processor. We insert the clause __assume_aligned(A, 64) before the code that uses the data to tell the compiler that the array A is aligned. To tell the compiler that all arrays accessed in a particular loop are aligned, add #pragma vector aligned before the loop.
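
Putting sections 4.1 and 4.2 together, here is a hedged sketch of what the vectorized kernel might look like with the Intel compiler (the actual myVectorizedApp.c source may differ in details; __assume_aligned and #pragma vector aligned are Intel compiler constructs):

#include <math.h>
#include <stddef.h>

/* Hypothetical vectorized kernel: the compiler is told that all five
 * arrays are 64-byte aligned, and sqrtf()/4.0f keep the whole
 * computation in single precision. */
static void compute_roots_vec(const float *A, const float *B, const float *C,
                              float *X1, float *X2, size_t n)
{
    __assume_aligned(A, 64);
    __assume_aligned(B, 64);
    __assume_aligned(C, 64);
    __assume_aligned(X1, 64);
    __assume_aligned(X2, 64);
#pragma vector aligned
    for (size_t i = 0; i < n; i++) {
        float d = sqrtf(B[i] * B[i] - 4.0f * A[i] * C[i]);
        X1[i] = (-B[i] + d) / (2.0f * A[i]);
        X2[i] = (-B[i] - d) / (2.0f * A[i]);
    }
}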

4.3. Use automatic vectorization, run compiler reports, and disable vectorization with compiler switches

Vectorization refers to the programming technique of using a vector processing unit (VPU) to operate on multiple values at the same time. Automatic vectorization means that the compiler is able to identify opportunities in a loop and perform the vectorization itself. You can take advantage of the automatic vectorization capabilities of the Intel compilers, because automatic vectorization is enabled by default at optimization level -O2 and higher.

For example, when compiling the mySerialApp.c sample code with the Intel compiler icc, the compiler looks for vectorization opportunities in loops by default. However, the compiler must follow certain rules (the loop trip count must be known on entry, single entry and single exit, straight-line code, the innermost loop of a nest, and so on) in order to vectorize these loops. You can provide additional information to help the compiler vectorize the loops.

To determine whether the code has been vectorized, generate a vectorization report by specifying the options -qopt-report=5 -qopt-report-phase:vec. The compiler then produces a vectorization report (*.optrpt). The report tells you whether each loop was vectorized and gives a brief explanation. Note that the vectorization report option is -qopt-report=<n>, where n specifies the level of detail.

4.4. Compile with the -O3 Optimization Level

Now we compile with optimization level -O3. This level optimizes for maximum speed and enables more aggressive optimizations than the default level -O2.

With auto-vectorization, the compiler packs 16 single-precision floating-point numbers into a vector register and performs the operation on the whole vector, rather than processing one element per loop iteration.

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp

The compiler generates the following output files: the binary myVectorizedApp and the vectorization report myVectorizedApp.optrpt. Run the binary:

$ ./myVectorizedApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 30496 (after 10 iterations)

The binary still runs with only one thread, but vectorization is used. The myVectorizedApp.optrpt report should confirm that all the innermost loops were vectorized.

For comparison, also compile the program with the -no-vec option:

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp-noVEC -no-vec
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location

Now run the myVectorizedApp-noVEC binary:

$ ./myVectorizedApp-noVEC
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 180375 (after 10 iterations)

This time the myVectorizedApp.optrpt report shows that the loops were not vectorized, as expected, because automatic vectorization was disabled.

We can now see two performance gains. The improvement from the original version (461,222 milliseconds) to the no-vec version (180,375 milliseconds) comes mainly from the general optimization techniques. The improvement from the non-vectorized version (180,375 milliseconds) to the vectorized version (30,496 milliseconds) comes mainly from automatic vectorization.

Even with these performance gains, only one thread is doing the work. Next, we enable multiple threads to run in parallel so the code can take advantage of the multicore architecture.

5. Enable Multithreading

5.1. Thread-Level Parallelism: OpenMP*

To take full advantage of the large number of cores in the Intel Xeon Phi processor (68 cores in this system), you can scale the application by running OpenMP* threads in parallel. OpenMP is a standard API and programming model for shared memory.

When using OpenMP threads, you need to include the header file omp.h and compile and link the code with the -qopenmp flag. In the myParallelApp.c program, the following directive is added before the for loop:

#pragma omp parallel for simd

The directive added before the for loop tells the compiler to create a team of threads and split the work of the loop into chunks. Each thread executes a number of chunks according to the OpenMP runtime schedule. The simd construct indicates that multiple iterations of the loop can be executed concurrently with SIMD instructions; it tells the compiler to ignore assumed vector dependencies in the loop, so use it with caution.

In this program, thread parallelism and vectorization are applied to the same loop. Each thread starts at its own lower bound of the iteration space. To keep each thread's chunk aligned under OpenMP static scheduling, we limit the number of iterations in the parallel loop (N1 in the output below) and process the remaining iterations serially.

In addition, the function that computes the roots becomes:
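
The updated function is not reproduced in this copy of the article. A hedged sketch consistent with the description above, where n1 is the number of iterations kept in the parallel loop and the remainder runs serially (names are illustrative, not the actual myParallelApp.c source):

#include <math.h>
#include <stddef.h>
#include <omp.h>

/* Hypothetical parallel kernel: thread-level parallelism plus SIMD on
 * the first n1 iterations (n1 chosen so each thread's static chunk
 * stays aligned); the small remainder runs serially. */
static void compute_roots_par(const float *A, const float *B, const float *C,
                              float *X1, float *X2, size_t n, size_t n1)
{
#pragma omp parallel for simd
    for (size_t i = 0; i < n1; i++) {
        float d = sqrtf(B[i] * B[i] - 4.0f * A[i] * C[i]);
        X1[i] = (-B[i] + d) / (2.0f * A[i]);
        X2[i] = (-B[i] - d) / (2.0f * A[i]);
    }
    /* Remainder loop ("num-iters in remainder serial loop" in the output). */
    for (size_t i = n1; i < n; i++) {
        float d = sqrtf(B[i] * B[i] - 4.0f * A[i] * C[i]);
        X1[i] = (-B[i] + d) / (2.0f * A[i]);
        X2[i] = (-B[i] - d) / (2.0f * A[i]);
    }
}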

Now compile the program and link it with -qopenmp:

$ icc myParallelApp.c -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o myParallelApp -qopenmp

Review the myParallelApp.optrpt report to verify that the loop was both vectorized and parallelized with OpenMP.

5.2. Set the Number of Threads and Thread Affinity Using Environment Variables

An OpenMP implementation can start many threads at once. By default, the number of threads is set to the maximum number of hardware threads in the system; in this case, 272 OpenMP threads would run by default. However, we can use the OMP_NUM_THREADS environment variable to set the number of OpenMP threads. For example, the following command starts 68 OpenMP threads:

$ export OMP_NUM_THREADS=68

Use the KMP_AFFINITY environment variable to set the thread affinity, that is, to bind OpenMP threads to CPUs. To distribute the threads evenly across the system, set the variable to scatter:

$ export KMP_AFFINITY=scatter

You can now run the program using all the cores in the system while varying the number of threads per core. Below is the test output comparing the performance of running 1, 2, 3, and 4 threads per core.

The test system running 1 thread per core:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1722 (after 10 iterations)

Run 2 threads per core:

$ export OMP_NUM_THREADS=136
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 136, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 136 threads
Elapsed time in msec: 1781 (after 10 iterations)

Run 3 threads per core:

$ export OMP_NUM_THREADS=204
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 204, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 204 threads
Elapsed time in msec: 1878 (after 10 iterations)

Run 4 threads per core:

$ export OMP_NUM_THREADS=272
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 272, N = 536870912, N1 = 536867072, num-iters in remainder serial loop = 3840, parallel-pct = 99.999285
Starting Compute on 272 threads
Elapsed time in msec: 1940 (after 10 iterations)

As the results above show, performance is best when running 1 thread per core and using all 68 cores.

6. Optimizations for the Intel Xeon Phi Processor

6.1. Memory Bandwidth Optimization

There are two types of memory in the system: up to 16 GB of on-package MCDRAM, and the traditional platform memory of up to 384 GB of DDR4 RAM. The peak bandwidth of MCDRAM is several times higher than the peak bandwidth of DDR4.

There are three configuration modes for MCDRAM: flat mode, cache mode, and hybrid (mixed) mode. If MCDRAM is configured as addressable memory (flat mode), the user can explicitly allocate memory in MCDRAM. If MCDRAM is configured in cache mode, the entire MCDRAM is used as a last-level cache between the level-2 cache and DDR4 memory. If MCDRAM is configured in hybrid mode, part of the MCDRAM is used as cache and the remainder as addressable memory. The following table lists the advantages and disadvantages of these configuration modes:

Memory mode: advantages and disadvantages

Flat mode
  • Advantage: User controllable; MCDRAM is used explicitly as high-bandwidth memory
  • Disadvantage: Users need to use numactl or modify the code

Cache mode
  • Advantages: Transparent to the user; extends the cache hierarchy
  • Disadvantage: Latency may increase when a memory load/store has to go to DDR4

Mixed mode
  • Advantage: Applications can take advantage of both flat mode and cache mode
  • Disadvantage: Shares the disadvantages of both flat mode and cache mode

With respect to the non-uniform memory access (NUMA) architecture, the Intel Xeon Phi processor appears as one or two nodes, depending on how the MCDRAM is configured. If MCDRAM is configured in cache mode, the Intel Xeon Phi processor appears as 1 NUMA node. If MCDRAM is configured in flat or hybrid mode, the processor appears as 2 NUMA nodes. Note that cluster modes can further divide the Intel Xeon Phi processor into up to 8 NUMA nodes, but cluster modes are not covered in this tutorial.

Use the numactl utility to display the NUMA nodes in the system. For example, running "numactl -H" on the system where MCDRAM is configured in flat mode shows 2 NUMA nodes: node 0 contains the 272 CPUs and 96 GB of DDR4 memory, and node 1 contains the 16 GB of MCDRAM.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92888 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

With MCDRAM in flat mode, you can use the numactl utility to control where memory is allocated. In this example, node 0 contains all the CPUs and the DDR4 platform memory, and node 1 contains the on-package MCDRAM. You can use the -m or --membind switch to force a program to allocate its memory on a given NUMA node.

To force the application to allocate its memory in DDR4 (node 0), run the following command:
$ numactl -m 0 ./myParallelApp

This is equivalent to:
$ ./myParallelApp

Now run the app with 68 threads:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 0 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1730 (after 10 iterations)

For another view of the NUMA nodes, run the lstopo command. This command displays not only the NUMA nodes, but also the L1 and L2 caches associated with them.
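
As noted in the table above, flat mode requires either numactl or code changes. As an alternative to numactl, the code itself can request MCDRAM. Below is a minimal sketch assuming the memkind library's hbwmalloc interface is available on the system (this is not part of the attached sample code):

#include <stdlib.h>
#include <hbwmalloc.h>   /* hbwmalloc interface from the memkind library; link with -lmemkind */

/* Hypothetical MCDRAM allocation: request the buffer from high-bandwidth
 * memory and fall back to regular DDR4 if none is available.
 * Free with hbw_free() or free(), matching the allocator used. */
static float *alloc_hbw_or_ddr(size_t n)
{
    float *p = (float *)hbw_malloc(n * sizeof(float));
    if (p == NULL)
        p = (float *)malloc(n * sizeof(float));   /* DDR4 fallback */
    return p;
}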

6.2. Analyze Memory utilization

Is the application limited by memory bandwidth? Let's analyze the memory accesses using Intel VTune Amplifier. The peak bandwidth of DDR4 DRAM is much lower than the peak bandwidth of MCDRAM.

Install Intel VTune amplifier on your system, and then run the following Intel VTune Amplifier command to collect memory access information when the application allocates DDR memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 0 ./myParallelApp

You can see the bandwidth utilization of your app by looking at the bandwidth utilization histogram field. This histogram shows high DDR bandwidth utilization.

The memory access analysis shows that the measured DDR4 bandwidth approaches the DDR4 peak bandwidth. This indicates that the application is bandwidth bound.

Looking at the application's memory allocation, we see that 5 large arrays of N elements each (that is, 512 * 1024 * 1024 elements) are allocated. Each element is a single-precision floating-point number (4 bytes), so each array is about 4 bytes * 512M = 2 GB, and the total allocation is 2 GB * 5 = 10 GB. This fits comfortably in MCDRAM (16 GB capacity), so allocating the memory in MCDRAM (flat mode) should benefit the application.

To allocate memory in MCDRAM (node 1), pass the parameter 1 to numactl with the -m switch, as follows:

$ numactl -m 1 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 498 (after 10 iterations)

Clearly, performance improves significantly when the application's memory is allocated in MCDRAM.

For comparison, we run the Intel VTune Amplifier command to collect memory access information when the app allocates MCDRAM memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 1 ./myParallelApp

This histogram shows that DDR bandwidth utilization is low, while MCDRAM utilization is high:

By looking at the memory access analysis we found that the DDR4 peak bandwidth was 2.3 GB/s, while the MCDRAM peak bandwidth reached 437 GB/s.

6.3. Compile with the -xMIC-AVX512 Compiler Flag

Intel Xeon Phi processors support the x87, Intel® Streaming SIMD Extensions (Intel® SSE), Intel® SSE2, Intel® SSE3, Supplemental Streaming SIMD Extensions 3 (SSSE3), Intel® SSE4.1, Intel® SSE4.2, Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architectures (ISAs), but they do not support Intel® Transactional Synchronization Extensions.

Intel AVX-512 is implemented in the Intel Xeon Phi processor. Intel Xeon Phi processors support the following groups: Intel® AVX-512F, Intel® AVX-512CD, Intel® AVX-512ER, and Intel® AVX-512PF. Intel AVX-512F (Intel AVX-512 Foundation instructions) extends the Intel AVX and Intel AVX2 instructions to 512-bit registers; Intel AVX-512CD (Intel AVX-512 Conflict Detection) helps detect conflicts efficiently so that more loops can be vectorized; Intel AVX-512ER (Intel AVX-512 Exponential and Reciprocal instructions) provides instructions for base-2 exponential functions, reciprocals, and reciprocal square roots; and Intel® AVX-512PF (Intel® AVX-512 Prefetch instructions) helps reduce memory operation latency.

To get the most out of Intel® AVX-512, compile the program with the compiler flag -xMIC-AVX512:

$ icc myParallelApp.c -o myParallelApp-AVX512 -qopenmp -O3 -xMIC-AVX512

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 316 (after 10 iterations)

Note that you can also run the following command to generate the assembly file, named myParallelApp.s:

$ icc -O3 myParallelApp.c -qopenmp -xMIC-AVX512 -S -fsource-asm

By examining the assembly file, you can confirm that Intel AVX-512 instructions were generated.
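
One quick, illustrative way to check (not from the original article) is to count references to the 512-bit zmm registers, which only appear when AVX-512 code is generated:

$ grep -c zmm myParallelApp.s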

6.4. Use the -no-prec-div -fp-model fast=2 Optimization Flags

If high precision is not required, we can compile with -fp-model fast=2 to use a more aggressive (but less safe) floating-point model. The compiler then implements faster but less precise square root and division operations. For example:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 310 (after 10 iterations)

6.5. Configure MCDRAM as a cache

In the BIOS setup, configure the MCDRAM as cache and reboot the system. The numactl utility confirms that there is now only one NUMA node, since the MCDRAM is configured as cache and is therefore transparent to the software:

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 94409 MB
node distances:
node   0
  0:  10

Recompile the program:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2

and run the program:

$ export OMP_NUM_THREADS=68
$ ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 325 (after 10 iterations)

We see that using MCDRAM as a cache brings no additional benefit for this application.

7. Summary and conclusions

The following topics are covered in this tutorial:

    • Memory alignment
    • Vectorization
    • Generate compiler reports to assist with Code analysis
    • Using the command-line utilities cpuinfo, lscpu, numactl, and lstopo
    • Adding thread-level parallelism using OpenMP
    • Setting environment variables
    • Analyzing bandwidth utilization with Intel VTune amplifier
    • Allocating MCDRAM memory using numactl
    • Compiling with the Intel® AVX-512 flag for improved performance

The following table shows the performance gains achieved by each optimization step relative to the baseline code: general optimizations with data alignment, vectorization, adding thread-level parallelism, allocating MCDRAM memory in flat mode, compiling with Intel AVX-512, compiling with the reduced-precision flags, and using MCDRAM as a cache. The elapsed times are taken from the runs shown earlier in this tutorial (10 iterations each):

Optimization step: Elapsed time (ms)
Baseline serial code: 461,222
General optimizations, vectorization disabled (-no-vec): 180,375
Auto-vectorized code (-O3): 30,496
OpenMP, 68 threads, DDR4 memory: 1,722
68 threads, MCDRAM (flat mode): 498
68 threads, MCDRAM, -xMIC-AVX512: 316
68 threads, MCDRAM, -xMIC-AVX512, -no-prec-div -fp-model fast=2: 310
68 threads, MCDRAM as cache: 325

By using all available cores, intel® AVX-512 vectorization, and MCDRAM bandwidth, we can significantly reduce execution time.

Resources:
    • Compiling for the Intel® Xeon Phi™ processor and the Intel® AVX-512 ISA
    • Introduction to MCDRAM (high bandwidth memory) on Knights Landing
    • Intel® C++ Compiler vectorization guide
    • Optimizing memory bandwidth on Knights Landing with the STREAM Triad
    • Stream processing on the Intel® Xeon Phi™ coprocessor
    • Data alignment to assist vectorization
    • Memory management for optimal performance on Intel® Xeon Phi™ coprocessors: alignment and prefetching
    • Advanced optimizations for the Intel® MIC architecture
    • Enabling SIMD in your program using OpenMP 4.0
    • Intel® Xeon Phi™ processor memory modes and cluster modes: configuration and use cases
About the author

Loc Q Nguyen holds an MBA from the University of Dallas, a master's degree in electrical engineering from McGill University, and a bachelor's degree in electrical engineering from École Polytechnique de Montréal. He is currently a software engineer in Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Tests such as SYSmark and MobileMark are based on specific computer systems, hardware, software, operating systems, and functions; any change to those elements may cause the results to vary. You should consult other information and performance tests, including the performance of the product when combined with other products, to fully evaluate your contemplated purchase. For more information, visit http://www.intel.com/performance.

Intel sample source code license agreement
