MIC Performance Optimization
MIC optimization methods:
-- Parallelism optimization
-- Memory management optimization
-- Data transfer optimization
-- Memory access optimization
-- Vectorization optimization
-- Load balancing optimization
-- MIC thread scalability optimization
One: Parallelism optimization
Ensure there is enough parallelism (data parallelism, task parallelism); only then are the results good.
Optimization steps:
1. Write an OpenMP program.
2. Test its scalability, e.g., run it with 2, 4, 6, and 8 threads (a sketch follows below).
3. Port it to MIC.
MIC optimization guideline: parallelize the outer loop, vectorize the inner loop.
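A minimal sketch of step 2, timing the same OpenMP kernel at several thread counts on the CPU before porting; the work() kernel and the array sizes are illustrative assumptions:

#include <stdio.h>
#include <omp.h>

#define N 10000000

/* Hypothetical workload: a simple element-wise kernel. */
static void work(float *a, const float *b, int n, int threads)
{
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0f + 1.0f;
}

int main(void)
{
    static float a[N], b[N];
    int counts[] = {2, 4, 6, 8};          /* the thread counts from step 2 */
    for (int t = 0; t < 4; t++) {
        double t0 = omp_get_wtime();
        work(a, b, N, counts[t]);
        double t1 = omp_get_wtime();
        printf("%d threads: %.3f s\n", counts[t], t1 - t0);
    }
    return 0;
}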
Example one:
for (i = 0; i < m; i++)
{
    for (j = 0; j < n; j++) {
        ......
    }
}
Two parallel schemes:
First:
#pragma omp parallel for num_threads(thread_num) // the thread team is created once
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++)
    { ...... }
}
Second:
#pragma omp parallel num_threads(thread_num)
for (i = 0; i < m; i++) {
    #pragma omp for // the work-sharing construct is entered m times, so the overhead is higher
    for (j = 0; j < n; j++)
    { ...... }
}
Example two (nested loops):
for (i = 0; i < m; i++)
{
    for (j = 0; j < n; j++) // problem: each loop level has too few iterations to make good use of the MIC card
    { ...... }
}
Can be rewritten in the following two ways:
First (collapse the loops):
#pragma omp parallel for num_threads(thread_num)
for (k = 0; k < m*n; k++) {
    i = k / n; // recover the row index
    j = k % n; // recover the column index
}
Second (nested parallelism):
omp_set_nested(1); // declare that nested parallelism is allowed
#pragma omp parallel for num_threads(THREAD_NUM1)
for (i = 0; i < m; i++) {
    #pragma omp parallel for num_threads(THREAD_NUM2)
    for (j = 0; j < n; j++) { ... }
}
Two: Memory management optimization
MIC memory size: 6GB~16GB
Block processing: for example, if the program needs 28GB and the MIC card can use 7GB of memory, the computation must be split into 4 passes.
Change the parallel hierarchy:
-- parallelize the inner loop
   outer loop -- inner loop (data is shared between threads)
-- task-level parallelism -- data-level parallelism
Reduce the number of memory allocations
Key: allocation operations should be hoisted outside the loop, as sketched below.
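A minimal sketch of both ideas, block processing plus allocation hoisted out of the loop; the sizes, the buffer name, and the process_block() kernel are illustrative assumptions:

#include <stdlib.h>

#define TOTAL   (28L * 1024 * 1024 * 1024 / sizeof(float)) /* 28GB of input data */
#define NBLOCKS 4                                          /* 7GB fits on the card per pass */

__attribute__((target(mic))) void process_block(float *buf, long n); /* hypothetical kernel */

void run(void)
{
    long block = TOTAL / NBLOCKS;
    /* Allocate once, outside the loop -- not once per iteration. */
    float *buf = (float *)malloc(block * sizeof(float));
    for (int b = 0; b < NBLOCKS; b++) {
        /* ... load block b of the input into buf ... */
        #pragma offload target(mic:0) inout(buf:length(block))
        process_block(buf, block); /* compute on one 7GB slice at a time */
    }
    free(buf);
}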
Three: Data transfer optimization: nocopy (see the previous section, "MIC C Programming")
Asynchronous transfer: while the MIC computes, the CPU can keep working, e.g., transferring the next batch of data.
Example:
#pragma offload_transfer target(mic:0) in(in1:length(count) alloc_if(0) free_if(0)) signal(in1)
for (i = 0; i < iter; i++)
{
    if (i % 2 == 0) {
        #pragma offload target(mic:0) nocopy(in1) wait(in1) out(out1:length(count) alloc_if(0) free_if(0))
        compute(in1, out1);
    }
    else {
        #pragma offload_transfer target(mic:0) if(i != iter-1) in(in1:length(count) alloc_if(0) free_if(0)) signal(in1)
        #pragma offload target(mic:0) nocopy(in2) wait(in2) out(out2:length(count) alloc_if(0) free_if(0))
        compute(in2, out2);
    }
}
Asynchronous compute: the CPU and the MIC run asynchronously; compute and I/O run asynchronously.
int counter;
float *in1;
counter = 10000;
__attribute__((target(mic))) void mic_compute();
while (counter > 0)
{
    #pragma offload target(mic:0) signal(in1)
    {
        mic_compute();
    }
    cpu_compute(); // this function executes in parallel with the MIC function above
    #pragma offload_wait target(mic:0) wait(in1)
    counter--;
}
SCIF (vs. offload)
SCIF is good at small data transfers (a sketch of a small SCIF send follows below):
-- for 1KB, 2KB, or 3KB transfers, the SCIF time is about 80% of the offload time;
-- for transfers above 4KB, the efficiency of the two is about the same;
-- for transfers above 6KB, SCIF's advantage drops off substantially.
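A minimal sketch of a small host-side SCIF send, assuming the scif.h API from Intel MPSS; the node number and ports are placeholders, and a matching listener on the card is assumed:

#include <stdio.h>
#include <scif.h>

int main(void)
{
    struct scif_portID dst = { 1, 2050 };      /* node 1 = first MIC card; arbitrary port */
    char msg[2048];                            /* a small 2KB payload, where SCIF shines */
    scif_epd_t ep = scif_open();               /* create an endpoint */
    if (ep == SCIF_OPEN_FAILED) return 1;
    if (scif_bind(ep, 2049) < 0) return 1;     /* bind to a local port */
    if (scif_connect(ep, &dst) < 0) return 1;  /* connect to the listener on the card */
    scif_send(ep, msg, sizeof(msg), SCIF_SEND_BLOCK); /* blocking small-message send */
    scif_close(ep);
    return 0;
}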
Five: Memory access optimization
MIC memory access optimization strategies:
-- hide memory access latency
   multithreading
   prefetching
-- exploit the cache
   temporal locality
   spatial locality
-- alignment (see the sketch below)
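A minimal sketch of alignment, assuming the Intel compiler; 64-byte alignment matches the MIC's 512-bit vector registers, and the array names are illustrative:

#include <malloc.h>   /* _mm_malloc / _mm_free (with the Intel compiler) */

#define N 1024

/* Statically aligned array: the compiler can issue aligned vector loads. */
__attribute__((aligned(64))) float a[N];

void scale(void)
{
    /* Dynamically allocated buffer aligned to a 64-byte boundary. */
    float *b = (float *)_mm_malloc(N * sizeof(float), 64);
    /* ... fill b with input data ... */
    #pragma vector aligned   /* promise the compiler all accesses in the loop are aligned */
    for (int i = 0; i < N; i++)
        a[i] = b[i] * 0.5f;
    _mm_free(b);
}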
Cache optimization methods:
Code transformations
-- loop fusion
-- loop fission
-- loop blocking (tiling)
-- loop interchange
Data transformations
-- data placement
-- data reorganization (see the sketch below)
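A minimal sketch of data reorganization, converting an array of structures (AoS) into a structure of arrays (SoA) so the inner loop walks contiguous memory; the point structures are illustrative assumptions:

#define N 100000

/* AoS: each element's x and y sit together, so a loop over all x values strides through memory. */
struct point_aos { float x, y; };
struct point_aos p[N];

/* SoA: all x values are contiguous, giving unit-stride, cache- and vector-friendly access. */
struct point_soa { float x[N]; float y[N]; };
struct point_soa q;

void shift_x(float dx)
{
    for (int i = 0; i < N; i++)
        q.x[i] += dx;   /* contiguous accesses: good spatial locality */
}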
Loop fusion:
Original loops:
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;
for (i = 0; i < n; i++)
    c[i] = a[i] / 2;
Fused loop:
for (i = 0; i < n; i++)
{
    a[i] = b[i] + 1;
    c[i] = a[i] / 2;
}
Loop fission:
Original loop:
for (i = 1; i < n; i++)
{
    a[i] = a[i] + b[i-1];
    b[i] = c[i-1] * x * y;
    c[i] = 1 / b[i];
    d[i] = sqrt(c[i]);
}
Split loops:
for (i = 1; i < n; i++)
{
    b[i] = c[i-1] * x * y;
    c[i] = 1 / b[i];
}
for (i = 1; i < n; i++)
    a[i] = a[i] + b[i-1];
for (i = 1; i < n; i++)
    d[i] = sqrt(c[i]);
Loop blocking (tiling):
Original loop:
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = y[i] + z[j];
Tiled loop:
for (it = 0; it < n; it += NB)
    for (jt = 0; jt < m; jt += MB)
        for (i = it; i < min(it+NB, n); i++)
            for (j = jt; j < min(jt+MB, m); j++)
                x[i][j] = y[i] + z[j];
Loop interchange:
Original loop:
for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
        c[i][j] = a[i][j] + b[j][i];
Generally used for matrix operations; problem: the accesses to c and a are non-contiguous.
Interchanged loop:
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        c[i][j] = a[i][j] + b[j][i];
Six: Vectorization optimization (the VPU performs batched operations)
Intel automatic vectorization: the 512-bit VPU processes 512/32 = 16 single-precision values per instruction.
Automatic vectorization:
What kinds of loops can be auto-vectorized?
1. The compiler can establish that there are no dependences between the statements in the loop and no loop-carried dependences.
2. Innermost loops.
3. Data types that are as consistent as possible.
Loops that generally will not auto-vectorize:
1. for (int i = 0; i < n; i++)
       a[i] = a[i-1] + b[i]; // loop-carried dependence
2. for (int i = 0; i < n; i++)
       a[c[i]] = b[d[i]]; // indirect addressing
3. for (int i = 0; i < n; i++)
       a[i] = foo(b[i]); // function call in the loop body
4. Loops whose iteration count is indeterminate.
Check whether loops were actually vectorized with the compiler's vectorization report:
-qopt-report[=n]
n=0: display no diagnostic information
n=1: report only vectorized loops (the default)
n=2: report vectorized and non-vectorized loops
n=3: report vectorized and non-vectorized loops, plus data dependence information
n=4: report only non-vectorized loops
n=5: report non-vectorized loops, plus data dependence information
Guided vectorization strategies:
Insert directives for auto-vectorization: without changing the program structure, insert compiler directives (pragmas) so loops can be auto-vectorized.
Adjust the loop structure, then insert directives: make structural adjustments to the source, such as interchanging nested loops, and then insert directives so the loops can be auto-vectorized.
Write SIMD instructions: hand-written SIMD can achieve better performance than auto-vectorization, but SIMD instructions differ across hardware platforms and are hard to read, so use them selectively.
Advantages of auto-vectorization:
Better performance: a single instruction processes multiple data elements at once.
Simpler coding with a single code version: less assembly means far less platform-specific work, and the program can be moved to the latest mainstream systems without rewriting assembly code.
#pragma ivdep // suggests vectorization by telling the compiler to ignore assumed dependences
#pragma simd // forces vectorization
C99: declare pointers restrict (compile with -restrict)
#pragma vector always // use when the hints above are unsuccessful:
specifies how the loop is vectorized, vectorizing even loops whose memory accesses the compiler's heuristics would reject (see the sketch below)
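A minimal sketch combining these hints on one loop; the kernel and its array names are illustrative assumptions:

/* The restrict qualifiers (C99) tell the compiler a, b, and c do not alias. */
void scale(float *restrict a, const float *restrict b,
           const float *restrict c, int n)
{
    #pragma ivdep         /* ignore assumed (unproven) dependences */
    #pragma vector always /* vectorize even if the heuristics would decline */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}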
Seven: Load balancing optimization
For multiple nodes with multiple cards.
Parallel framework across the nodes of a cluster
Main function:
#define N (10000)
__global__ void kernel(); // kernel declaration (GPU version)
int main(int argc, char *argv[])
{
    int calc_num, calc_len, rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of nodes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // node number
    calc_num = ......; calc_len = ......; // number of elements to compute
    int *a = (int *)malloc(calc_len * sizeof(int));
    if (rank == 0) MPI_Send(......); // master node distributes the data
    else MPI_Recv(......); // child nodes receive the data
    main_calc();
    MPI_Finalize(); free(a);
    return 0;
}
Programming framework for a single node with multiple cards
Outer framework pseudo-code:
int device_num = m + 1; // m is the number of GPU (MIC) cards on the node
omp_set_nested(1); // allow OpenMP nesting
#pragma omp parallel for private(...) num_threads(device_num) // cards and CPU compute simultaneously
for (i = 0; i < device_num; i++)
{
    if (i == 0)
    {
        // CPU-side computation
        #pragma omp parallel for private(...)
        for (int j = 0; j < device_num; j++)
            a[j] = j + 6;
    }
    else {
        mic_kernel(); // offload to card i
    }
}
In-device load balancing
Handled through the OpenMP scheduling algorithms (static, dynamic, guided, etc., covered earlier), as sketched below.
The remaining methods are omitted.
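A minimal sketch of in-device load balancing via the schedule clause; uneven_work() is a hypothetical stand-in for iterations of varying cost:

#include <omp.h>

double uneven_work(int i); /* hypothetical: cost varies with i, so static chunks would be unbalanced */

double balance(int n)
{
    double sum = 0.0;
    /* dynamic scheduling hands out small chunks on demand, evening out the load */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += uneven_work(i);
    return sum;
}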