MIC Performance Optimization

MIC optimization methods:

-- Parallelism optimization

-- Memory management optimization

-- Data transfer optimization

-- Memory access optimization

-- Vectorization optimization

-- Load balancing optimization

-- MIC thread scalability optimization

One: Parallelism optimization

Make sure the program exposes enough parallelism (data parallelism, task parallelism); only then does the MIC perform well.

Optimization steps:

1. Write an OpenMP program

2. Test its scalability, for example with 2 threads, then with 4, 6, and 8 threads

3. Then port it to MIC
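The scalability test in step 2 can be sketched as follows. This is a minimal sketch, not the author's benchmark: `sum_squares`, `time_sum_squares`, and `wall_seconds` are hypothetical names, and the workload is a placeholder. The code assumes it is built with OpenMP enabled; without it the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds (OpenMP timer when available). */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Placeholder workload: sum of i*i for i in [0, n), shared among `threads` threads. */
double sum_squares(size_t n, int threads)
{
    double sum = 0.0;
    #pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += (double)i * (double)i;
    return sum;
}

/* Runs the workload once and returns the elapsed wall-clock time in seconds. */
double time_sum_squares(size_t n, int threads, double *result)
{
    double t0 = wall_seconds();
    *result = sum_squares(n, threads);
    return wall_seconds() - t0;
}
```

Calling `time_sum_squares` with 2, 4, 6, and 8 threads on the same `n` and comparing the returned times gives a first impression of how well the loop scales before porting it.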

MIC optimization rule: parallelize the outer loop, vectorize the inner loop

Example one:

for (i=0;i<m;i++)
{
    for (j=0;j<n;j++) {
        ......
    }
}

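Two ways to parallelize it:

First:

#pragma omp parallel for private(j) num_threads(thread_num)  // one worksharing loop: the m outer iterations are divided among the threads
for (i=0;i<m;i++) {
    for (j=0;j<n;j++)
    {......}
}

Second:

#pragma omp parallel private(i) num_threads(thread_num)  // every thread runs the whole outer loop
for (i=0;i<m;i++) {
    #pragma omp for  // the worksharing loop (with its implicit barrier) is entered m times, so overhead is higher
    for (j=0;j<n;j++)
    {......}
}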

Example two (nested parallelism):

for (i=0;i<m;i++)
{
    for (j=0;j<n;j++)  // problem: each loop level has too few iterations to keep the MIC card busy
    {......}
}

It can be rewritten in either of two ways:

First: merge the two loops into one

#pragma omp parallel for private(i,j) num_threads(thread_num)
for (k=0;k<m*n;k++) {
    i=k/n;  // row index: each row holds n elements
    j=k%n;  // column index
}

Second: nested parallel regions

omp_set_nested(1);  // allow nested parallelism
#pragma omp parallel for num_threads(thread_num1)
for (i=0;i<m;i++) {
    #pragma omp parallel for num_threads(thread_num2)
    for (j=0;j<n;j++) {...}
}
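The loop-merging variant above depends on recovering the row and column from the flat index correctly: with n iterations in the inner loop, the row is k/n and the column is k%n. A minimal sketch follows; `fill_flattened` and its loop body are hypothetical placeholders, not part of the original program.

```c
#include <stddef.h>

/* Fill an m x n row-major matrix through one flattened loop of m*n iterations,
   so that all iterations can be shared among the threads at once. */
void fill_flattened(int m, int n, int *out)
{
    long total = (long)m * n;
    #pragma omp parallel for
    for (long k = 0; k < total; k++) {
        int i = (int)(k / n);    /* row: n elements per row */
        int j = (int)(k % n);    /* column within the row */
        out[k] = i * 1000 + j;   /* placeholder loop body */
    }
}
```

OpenMP 3.0 and later can also do this merging automatically with the `collapse(2)` clause on the original two-level loop, provided the loops are perfectly nested.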

Two: Memory management optimization

MIC memory size: 6GB~16GB

Block processing: for example, if the program needs 28GB and the MIC card has 7GB of usable memory, the computation has to be split into 4 passes
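The block-processing arithmetic is a ceiling division, and the processing loop simply walks over the data one block at a time. The sketch below assumes hypothetical names (`num_passes`, `process_in_blocks`, `process_block`); `process_block` stands in for "transfer one block to the card and compute on it" and here only sums the block.

```c
#include <stddef.h>

/* Number of passes needed when `total` units are processed `capacity` units at a time. */
size_t num_passes(size_t total, size_t capacity)
{
    return (total + capacity - 1) / capacity;   /* ceiling division */
}

/* Hypothetical stand-in for the per-block work done on the card. */
static double process_block(const float *block, size_t len)
{
    double sum = 0.0;
    for (size_t k = 0; k < len; k++)
        sum += block[k];
    return sum;
}

/* Process `total` elements in blocks of at most `capacity` elements. */
double process_in_blocks(const float *data, size_t total, size_t capacity)
{
    double acc = 0.0;
    for (size_t start = 0; start < total; start += capacity) {
        size_t remaining = total - start;
        size_t len = remaining < capacity ? remaining : capacity;  /* last block may be short */
        acc += process_block(data + start, len);
    }
    return acc;
}
```

With the numbers from the text, `num_passes(28, 7)` evaluates to 4.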

Change the level of parallelism:

-- Parallelize the inner loop: move from outer-loop parallelism to inner-loop parallelism, so that the threads share data

-- Move from task-level parallelism to data-level parallelism

Reduce the number of memory allocations

Key: move allocation operations outside the loop
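Moving the allocation out of the loop might look like the sketch below. The function name `run_iterations` and the work done on the scratch buffer are hypothetical; the point is only that `malloc` and `free` are each called once rather than once per iteration.

```c
#include <stdlib.h>

/* Runs `iters` iterations that each need a scratch buffer of `len` floats.
   The buffer is allocated once before the loop and reused in every iteration.
   Returns 0 on success, -1 if the allocation fails. */
int run_iterations(int iters, size_t len, double *result)
{
    float *scratch = malloc(len * sizeof *scratch);   /* single allocation */
    if (scratch == NULL)
        return -1;

    double total = 0.0;
    for (int it = 0; it < iters; it++) {
        for (size_t k = 0; k < len; k++)
            scratch[k] = (float)(it + 1);             /* placeholder work */
        for (size_t k = 0; k < len; k++)
            total += scratch[k];
    }

    free(scratch);                                    /* single release */
    *result = total;
    return 0;
}
```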

Three: Data transfer optimization

Nocopy (see the previous section, "MIC C Programming")

Asynchronous transfer: while the MIC is computing, the CPU can carry on with other work, such as starting the next transfer.

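Example:

// double buffering: in1/in2 and out1/out2 are already allocated on the card (alloc_if(0) free_if(0))
#pragma offload_transfer target(mic:0) in(in1:length(count) alloc_if(0) free_if(0)) signal(in1)
for (i=0;i<iter;i++)
{
    if (i%2==0) {
        // start sending in2 for the next iteration while in1 is processed
        #pragma offload_transfer target(mic:0) if(i!=iter-1) in(in2:length(count) alloc_if(0) free_if(0)) signal(in2)
        #pragma offload target(mic:0) nocopy(in1) wait(in1) out(out1:length(count) alloc_if(0) free_if(0))
        compute(in1,out1);
    }
    else {
        // start sending in1 for the next iteration while in2 is processed
        #pragma offload_transfer target(mic:0) if(i!=iter-1) in(in1:length(count) alloc_if(0) free_if(0)) signal(in1)
        #pragma offload target(mic:0) nocopy(in2) wait(in2) out(out2:length(count) alloc_if(0) free_if(0))
        compute(in2,out2);
    }
}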

Asynchronous compute: the CPU and the MIC run asynchronously, and compute overlaps with I/O

int counter;
float *in1;  // used here only as the signal tag
counter=10000;
__attribute__((target(mic))) void mic_compute();
while (counter>0)
{
    #pragma offload target(mic:0) signal(in1)
    {
        mic_compute();
    }
    cpu_compute();  // this function runs in parallel with the MIC function above
    #pragma offload_wait target(mic:0) wait(in1)
    counter--;
}

SCIF (vs. offload)

SCIF is good at small data transfers:

-- For transfers of 1K, 2K, or 3K, scif/offload = 80%

-- For transfers larger than 4K, the two are about equally efficient

-- For transfers larger than 6K, SCIF efficiency drops substantially

Five: Memory access optimization

MIC memory access optimization strategies

-- Hide memory access latency

Multithreading

Prefetching

-- Make good use of the cache

Temporal locality

Spatial locality

-- Alignment
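For the alignment item, one possible sketch is to allocate buffers on a 64-byte boundary, which is the size of one 512-bit vector. The 64-byte figure and the helper name `alloc_aligned_floats` are assumptions for illustration; the sketch uses the C11 `aligned_alloc` function rather than any MIC-specific allocator.

```c
#include <stdlib.h>
#include <stdint.h>

#define VECTOR_ALIGN 64   /* bytes; assumed to match one 512-bit vector */

/* Allocate `count` floats on a VECTOR_ALIGN-byte boundary.
   The caller releases the buffer with free(). Returns NULL on failure. */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment */
    size_t padded = (bytes + VECTOR_ALIGN - 1) / VECTOR_ALIGN * VECTOR_ALIGN;
    return aligned_alloc(VECTOR_ALIGN, padded);
}
```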

Cache optimization methods:

Code transformations

-- Loop fusion

-- Loop fission (splitting)

-- Loop tiling (blocking)

-- Loop interchange

Data transformations

-- Data placement

-- Data reorganization

Loop fusion

Original loops:

for (i=0;i<n;i++)
    a[i]=b[i]+1;
for (i=0;i<n;i++)
    c[i]=a[i]/2;

Fused loop:

for (i=0;i<n;i++)
{
    a[i]=b[i]+1;
    c[i]=a[i]/2;
}

Loop fission (splitting)

Original loop:

for (i=1;i<n;i++)
{
    a[i]=a[i]+b[i-1];
    b[i]=c[i-1]*x*y;
    c[i]=1/b[i];
    d[i]=sqrt(c[i]);
}

Split loops:

for (i=1;i<n;i++)
{
    b[i]=c[i-1]*x*y;
    c[i]=1/b[i];
}
for (i=1;i<n;i++)
    a[i]=a[i]+b[i-1];
for (i=1;i<n;i++)
    d[i]=sqrt(c[i]);

Loop tiling (blocking)

Original loop:

for (i=0;i<n;i++)
    for (j=0;j<m;j++)
        x[i][j]=y[i]+z[j];

Tiled loop (nb and mb are the tile sizes):

for (it=0;it<n;it+=nb)
    for (jt=0;jt<m;jt+=mb)
        for (i=it;i<min(it+nb,n);i++)
            for (j=jt;j<min(jt+mb,m);j++)
                x[i][j]=y[i]+z[j];

Loop interchange

Original loop:

for (j=0;j<m;j++)
    for (i=0;i<n;i++)
        c[i][j]=a[i][j]+b[i][j];

This pattern is common in matrix operations. Problem: the inner loop walks down a column, so memory accesses are not contiguous.

Interchanged loop:

for (i=0;i<n;i++)
    for (j=0;j<m;j++)
        c[i][j]=a[i][j]+b[i][j];

Six: Vectorization optimization (the VPU performs batch operations)

Intel auto-vectorization: a 512-bit vector holds 512/32 = 16 single-precision values, which are processed at once
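The 512/32 figure is just the vector width divided by the element width. As a small worked example (the macro `VPU_WIDTH_BITS` and the function `vector_lanes` are hypothetical names), the same arithmetic gives the lane count for other element types:

```c
#include <stddef.h>

#define VPU_WIDTH_BITS 512   /* vector width taken from the text above */

/* Number of elements of `element_bytes` bytes that fit in one vector. */
size_t vector_lanes(size_t element_bytes)
{
    return VPU_WIDTH_BITS / (element_bytes * 8);
}
```

For a 4-byte `float` this yields 16 lanes, and for an 8-byte `double` it yields 8.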

Auto-vectorization:

Which loops can be auto-vectorized?

1. The compiler can establish that the statements in the loop body do not depend on each other and that there is no loop-carried dependency

2. The loop is the innermost loop

3. The data types are as consistent as possible

Loops that generally are not auto-vectorized:

1. for (int i=1;i<n;i++)
       a[i]=a[i-1]+b[i];  // loop-carried dependency

2. for (int i=0;i<n;i++)
       a[c[i]]=b[d[i]];  // indirect addressing

3. for (int i=0;i<n;i++)
       a[i]=foo(b[i]);  // function call in the loop body

4. Loops whose iteration count cannot be determined

Check whether a loop was really vectorized:

-qopt-report[=n]

-qopt-report[=n] | Meaning
n=0 | Do not display diagnostic information
n=1 | Show only vectorized loops (default)
n=2 | Show vectorized and non-vectorized loops
n=3 | Show vectorized and non-vectorized loops, plus data dependency information
n=4 | Show only non-vectorized loops
n=5 | Show non-vectorized loops and data dependency information

Guided vectorization strategies

Insert directives for auto-vectorization: leave the program structure unchanged and only insert compiler directives (pragmas) so that the loop can be auto-vectorized.

Adjust the loop structure, then insert directives: make some structural changes to the source program, such as interchanging nested loops, and then insert directives so that the loop can be auto-vectorized.

Write SIMD instructions by hand: hand-written SIMD instructions can perform better than auto-vectorization, but they differ between hardware platforms and are hard to read, so use them selectively.

Advantages of auto-vectorization

Improved performance: with vectorization, a single instruction processes multiple data elements at once.

A single version of the code, with less assembly and simpler coding: less assembly means significantly less system-specific programming work, and the code can be moved to the latest mainstream systems without rewriting assembly code.

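#pragma ivdep  // suggests vectorization: the compiler may ignore assumed data dependencies

#pragma simd  // forces vectorization

C99 restrict keyword, enabled with the -restrict compiler option

#pragma vector always  // use if the hints above do not succeed

#pragma vector specifies how the loop is vectorized, so that operations are not left unvectorized because of, for example, unaligned memory accesses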
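Of the hints listed above, `restrict` is the only one that is standard C rather than a compiler-specific pragma, so it is the easiest to show in a portable form. In the sketch below (the function name `add_arrays` is hypothetical), `restrict` promises the compiler that the three arrays never overlap, which removes the assumed dependency that would otherwise block vectorization of the loop.

```c
#include <stddef.h>

/* dst, a and b are declared restrict: the caller guarantees they do not overlap,
   so each iteration is independent and the loop is a candidate for vectorization. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether the loop is actually vectorized still depends on the compiler and its options; the vectorization report is the way to confirm it.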

Seven: Load balancing optimization

For multiple nodes with multiple cards

Parallel framework across the nodes of a cluster

Main function:

#include <mpi.h>
#include <stdlib.h>

#define N (10000)

__global__ void kernel();  // device kernel declaration

int main(int argc, char **argv)
{
    int calc_num,calc_len,rank,size;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&size);  // total number of processes (one per node)
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);  // number of this process (node)
    calc_num= ...; calc_len=......;  // number of elements to compute
    int *a=(int *)malloc(calc_len*sizeof(int));
    if (rank==0) MPI_Send();  // master node distributes the data
    else MPI_Recv();  // child nodes receive the data
    main_calc();
    MPI_Finalize(); free(a);
    return 0;
}
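The framework above leaves open how many elements each node receives. One simple possibility, shown here as an assumption and not as the original program's method, is an even split in which the first few ranks take one extra element when the total does not divide evenly. `rank_chunk` is a hypothetical helper; it needs no MPI calls itself and only takes the rank and size that `MPI_Comm_rank` and `MPI_Comm_size` return.

```c
#include <stddef.h>

/* Even split of `total` elements over `size` ranks.
   On return, rank `rank` owns elements [*start, *start + *len). */
void rank_chunk(size_t total, int size, int rank, size_t *start, size_t *len)
{
    size_t base  = total / (size_t)size;   /* every rank gets at least this many */
    size_t extra = total % (size_t)size;   /* the first `extra` ranks get one more */
    size_t r     = (size_t)rank;

    *len   = base + (r < extra ? 1 : 0);
    *start = r * base + (r < extra ? r : extra);
}
```

For 10 elements on 4 ranks this gives chunk lengths 3, 3, 2, 2 starting at offsets 0, 3, 6, 8.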

Programming framework for a single node with multiple cards

Outer framework pseudo-code

int device_num=m+1;  // m is the number of GPU (MIC) cards on a single node
omp_set_nested(1);  // allow OpenMP nesting
#pragma omp parallel for private(...) num_threads(device_num)  // the cards and the CPU compute simultaneously
for (i=0;i<device_num;i++)
{
    if (i==0)
    {
        // CPU-side computation
        int j;
        #pragma omp parallel for private(...)
        for (j=0;j<device_num;j++)
            a[j]=j+6;
    }
    else {
        mic_kernel();
    }
}

Load balancing within a device

Handled through the OpenMP scheduling policies (static, dynamic, guided, etc.); these were covered earlier and are skipped here.
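As a brief reminder of what the scheduling policies do, the sketch below uses `schedule(dynamic, 4)` on a loop whose iterations get steadily more expensive. With a static schedule the thread holding the last iterations would finish late; with a dynamic schedule idle threads pick up the next chunk of 4 iterations. The function `triangular_work`, its workload, and the chunk size are hypothetical choices for illustration.

```c
/* Iteration i performs i + 1 inner steps, so the work per iteration is unbalanced.
   Returns the sum over i of (0 + 1 + ... + i). */
long triangular_work(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long row = 0;
        for (int j = 0; j <= i; j++)
            row += j;          /* placeholder work that grows with i */
        total += row;
    }
    return total;
}
```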

The remaining methods are omitted.
