[Repost] The upcoming GCC 5.0 brings new optimizations to x86

Source: Internet
Author: User

Reposted from: Open Source China (OSChina) http://www.oschina.net/news/57481/what-is-new-for-x86-in-upcoming-gcc-50

Part 1. Vectorization of load/store groups

GCC 5.0 significantly improves the quality of vector code generated for load and store groups. By a group I mean a set of accesses to neighboring addresses that is repeated on every iteration, for example:

x = a[i], y = a[i + 1], z = a[i + 2], iterated over i, is a load group of size 3

The group size is the distance between the highest and lowest address loaded or stored, plus one, for example (i + 2) - (i) + 1 = 3

The number of loads or stores in a group is less than or equal to the group size, for example:

x = a[i], z = a[i + 2], iterated over i, has only 2 loads, but the load group size is still 3

GCC 4.9 already vectorizes groups whose size is a power of 2 (2, 4, 8, ...). GCC 5.0 vectorizes groups whose size is 3 or a power of 2. Other group sizes are rarely used.
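The group-size rule above can be sketched in a few lines of C. This is only an illustration: the helper names group_size and sum_x_and_z are hypothetical, and the loop assumes that consecutive groups sit 3 elements apart, which the text above does not spell out.

```c
#include <stddef.h>

/* Hypothetical helper: group size = (max offset - min offset) + 1.
   offsets holds the constant offsets accessed in one iteration (n >= 1). */
int group_size(const int *offsets, size_t n)
{
    int lo = offsets[0], hi = offsets[0];
    for (size_t k = 1; k < n; k++) {
        if (offsets[k] < lo) lo = offsets[k];
        if (offsets[k] > hi) hi = offsets[k];
    }
    return hi - lo + 1;
}

/* A load group of size 3 with a gap: only offsets 0 and 2 are read.
   Assumption: each iteration handles one group of 3 consecutive bytes. */
int sum_x_and_z(const unsigned char *a, size_t n_groups)
{
    int sum = 0;
    for (size_t i = 0; i < n_groups; i++) {
        int x = a[3 * i];       /* offset 0 */
        int z = a[3 * i + 2];   /* offset 2; offset 1 is skipped */
        sum += x + z;
    }
    return sum;
}
```

Passing the offsets {0, 1, 2} or {0, 2} to group_size gives 3 in both cases, which matches the two examples in the text.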

The most common case for load and store groups is an array of structures:

Image conversion, for example converting an RGB structure to some other format (test case: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252)

Multidimensional coordinates (test case: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61403)

Multiplication of vectors by a constant matrix:

a[i][0] = 7 * b[i][0] - 3 * b[i][1];

a[i][1] = 2 * b[i][0] + b[i][1];
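As a concrete picture of the array-of-structures case, here is a minimal sketch of an image conversion loop. The types rgb_t and bgr_t and the function rgb_to_bgr are hypothetical names chosen for illustration, not taken from the linked bug reports. Each iteration reads three adjacent bytes and writes three adjacent bytes, so it forms a load group and a store group of size 3. Whether a particular compiler actually vectorizes it depends on the version and flags used.

```c
#include <stddef.h>

/* Hypothetical pixel layouts, 3 bytes each. */
typedef struct { unsigned char r, g, b; } rgb_t;
typedef struct { unsigned char b, g, r; } bgr_t;

/* Swap the channel order of n pixels.
   restrict tells the compiler that src and dst do not overlap. */
void rgb_to_bgr(const rgb_t *restrict src, bgr_t *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i].b = src[i].b;
        dst[i].g = src[i].g;
        dst[i].r = src[i].r;
    }
}
```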

In short, GCC 5.0 brings:

Vectorization of load and store groups of size 3

Improvements for the group sizes that were already supported

Better load and store group performance through code tuned for the specific x86 CPU

Here is the loop used to compare GCC 4.9 and GCC 5.0 performance (it works on byte elements, which maximizes the number of elements per vector):


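int i, j, k;
byte *in = a, *out = b;

for (i = 0; i < 1024; i++)
{
    /* compute one store group of STGSIZE output bytes */
    for (k = 0; k < STGSIZE; k++)
    {
        byte s = 0;
        /* dot product of one load group with column k of c */
        for (j = 0; j < LDGSIZE; j++)
            s += in[j] * c[j][k];
        out[k] = s;
    }
    in += LDGSIZE;   /* advance to the next load group */
    out += STGSIZE;  /* advance to the next store group */
}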

where "c" is a constant matrix:


const byte c[8][8] = { 1, -1,  1, -1,  1, -1,  1, -1,
                       1,  1, -1, -1,  1,  1, -1, -1,
                       1,  1,  1,  1, -1, -1, -1, -1,
                      -1,  1, -1,  1, -1,  1, -1,  1,
                      -1, -1,  1,  1, -1, -1,  1,  1,
                      -1, -1, -1, -1,  1,  1,  1,  1,
                      -1, -1, -1,  1,  1,  1, -1,  1,
                       1, -1,  1,  1,  1, -1, -1, -1};

This matrix keeps the computation inside the loop down to simple, relatively fast operations such as additions and subtractions.

The "in" and "out" pointers point to the global arrays "a[1024 * LDGSIZE]" and "b[1024 * STGSIZE]"

byte is unsigned char

LDGSIZE and STGSIZE are macros that define the load group size and the store group size

Compile options: "-Ofast" plus "-march=slm" for Silvermont or "-march=core-avx2" for Haswell, with all combinations of -DLDGSIZE={1,2,3,4,8} -DSTGSIZE={1,2,3,4,8}
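For readers who want to try this, the fragments above can be assembled into a single translation unit roughly as follows. This is a sketch under assumptions, not the contents of the original test archive: the file name matrix.c, the function name run_kernel, and the default group sizes of 3 are all hypothetical, and the loop runs over 1024 groups to match the array sizes given above.

```c
/* Hypothetical file name: matrix.c. Example build lines, following the
   options quoted in the text:
     gcc -Ofast -march=slm       -DLDGSIZE=3 -DSTGSIZE=3 -c matrix.c
     gcc -Ofast -march=core-avx2 -DLDGSIZE=3 -DSTGSIZE=3 -c matrix.c */
#ifndef LDGSIZE
#define LDGSIZE 3   /* assumed default load group size */
#endif
#ifndef STGSIZE
#define STGSIZE 3   /* assumed default store group size */
#endif

typedef unsigned char byte;

byte a[1024 * LDGSIZE];   /* input: 1024 load groups */
byte b[1024 * STGSIZE];   /* output: 1024 store groups */

/* -1 is stored as 255; because s is a byte, sums wrap modulo 256,
   so multiplying by 255 has the same effect as subtracting. */
static const byte c[8][8] = {
    { 1, -1,  1, -1,  1, -1,  1, -1},
    { 1,  1, -1, -1,  1,  1, -1, -1},
    { 1,  1,  1,  1, -1, -1, -1, -1},
    {-1,  1, -1,  1, -1,  1, -1,  1},
    {-1, -1,  1,  1, -1, -1,  1,  1},
    {-1, -1, -1, -1,  1,  1,  1,  1},
    {-1, -1, -1,  1,  1,  1, -1,  1},
    { 1, -1,  1,  1,  1, -1, -1, -1}
};

void run_kernel(void)
{
    int i, j, k;
    byte *in = a, *out = b;

    for (i = 0; i < 1024; i++) {
        for (k = 0; k < STGSIZE; k++) {
            byte s = 0;
            for (j = 0; j < LDGSIZE; j++)
                s += in[j] * c[j][k];
            out[k] = s;
        }
        in += LDGSIZE;
        out += STGSIZE;
    }
}
```

Guarding the macros with #ifndef lets the same file be built for every combination of group sizes from the command line while still compiling with no -D options at all.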

Performance of GCC 5.0 relative to GCC 4.9 (ratio of run times; higher is better)

Silvermont: Intel(R) Atom(TM) CPU C2750 @ 2.41GHz

Up to 6.5 times performance improvement

We can see that the results for group size 3 are not as good. This is because a group of size 3 requires 8 pshufb instructions, each of which takes approximately 5 ticks on Silvermont. The loop is still vectorized, however, and if the loop body contains more expensive computation the gain will be larger (as we see for the other group sizes).

Haswell: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Up to 3 times faster!

On Haswell, pshufb takes only 1 tick, so group size 3 fares much better. In fact, the biggest gain is for group size 3.

The compilers used in the experiments above are available at the following addresses:

GCC 4.9: https://gcc.gnu.org/gcc-4.9

GCC 5.0 trunk built at revision 217914: https://gcc.gnu.org/viewcvs/gcc/trunk/?pathrev=218160

Download Matrix.zip

Via Intel

