Translated from OSChina: http://www.oschina.net/news/57481/what-is-new-for-x86-in-upcoming-gcc-50
Part 1. Vectorization of load/store groups
GCC 5.0 significantly improves code quality for vectorized load and store groups, i.e. sets of loads or stores at sequential addresses within an iteration. For example:
x = a[i], y = a[i + 1], z = a[i + 2] — iterating over i, this is a load group of size 3.
The group size is determined by the maximum and minimum addresses of the loads or stores, e.g. (i + 2) - (i) + 1 = 3.
The number of loads or stores in a group can be less than or equal to the group size. For example:
x = a[i], z = a[i + 2] — iterating over i, there are only 2 loads, but the load group still has size 3.
GCC 4.9 vectorized groups of size 2; GCC 5.0 also vectorizes groups of size 3 (other group sizes are rarely used).
The most common scenarios for load and store groups are arrays of structures:
Image conversion, e.g. converting an RGB structure to another format (test scenario: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252)
Multidimensional coordinates (test scenario: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61403)
Multiplication of vectors by a constant matrix:
a[i][0] = 7 * b[i][0] - 3 * b[i][1];
a[i][1] = 2 * b[i][0] + b[i][1];
In short, GCC 5.0 brings us:
Vectorization of load and store groups of size 3
Improved vectorization for the other supported group sizes
Maximized load and store group performance through code tuned for specific x86 CPUs
Here is the code used to compare GCC 4.9 and GCC 5.0 performance (the number of elements per vector is maximized):
int i, j, k;
byte *in = a, *out = b;
for (i = 0; i < 1024; i++)
{
    for (k = 0; k < STGSIZE; k++)
    {
        byte s = 0;
        for (j = 0; j < LDGSIZE; j++)
            s += in[j] * c[j][k];
        out[k] = s;
    }
    in += LDGSIZE;
    out += STGSIZE;
}
where "c" is a constant matrix:
const byte c[8][8] = {1,-1, 1,-1, 1,-1, 1,-1,
                      1, 1,-1,-1, 1, 1,-1,-1,
                      1, 1, 1, 1,-1,-1,-1,-1,
                     -1, 1,-1, 1,-1, 1,-1, 1,
                     -1,-1, 1, 1,-1,-1, 1, 1,
                     -1,-1,-1,-1, 1, 1, 1, 1,
                     -1,-1,-1, 1, 1, 1,-1, 1,
                      1,-1, 1, 1, 1,-1,-1,-1};
The calculations inside the loop are simple (add, subtract, etc.), so they are very fast.
The "in" and "out" pointers point to the global arrays "a[1024 * LDGSIZE]" and "b[1024 * STGSIZE]".
"byte" is an unsigned char.
LDGSIZE and STGSIZE are macros defining the load and store group sizes.
Compilation options: "-Ofast", plus "-march=slm" for Silvermont or "-march=core-avx2" for Haswell, for all combinations of -DLDGSIZE={1,2,3,4,8} -DSTGSIZE={1,2,3,4,8}.
Gain of GCC 5.0 over 4.9 (in execution time; bigger is better):
Silvermont: Intel(R) Atom(TM) CPU C2750 @ 2.41GHz
A 6.5x performance improvement!
We can see that when the group size is 3 the result is not as good. That is because a group of size 3 requires 8 PSHUFB instructions, each taking about 5 ticks, on Silvermont. The loop is still vectorized, though, so if there were more computation inside the loop the gain would be much better (as the other group sizes show).
Haswell: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
A 3x performance improvement!
On Haswell, PSHUFB takes only 1 tick, so the biggest gain is exactly when the group size is 3.
You can reproduce the experiment above with the compilers from the following addresses:
GCC 4.9: https://gcc.gnu.org/gcc-4.9
GCC 5.0 trunk built at revision 217914: https://gcc.gnu.org/viewcvs/gcc/trunk/?pathrev=218160
Download Matrix.zip
Via Intel
[Translated] The upcoming GCC 5.0 brings new optimizations for x86