Cache and Efficiency


These days, making use of multiple CPU cores has become a challenge every programmer faces. For a moderately experienced programmer, OpenMP is without doubt the fastest shortcut, and the payoff is high: on a dual-core CPU it can raise efficiency by roughly 1.8x.
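
For anyone who has not touched it, the basic pattern really is just one pragma in front of a loop. A minimal sketch (the function and data here are purely illustrative, not from this article):

    #include <omp.h>

    // Square every element of a buffer; OpenMP divides the loop iterations
    // among the available cores (two, on the dual-core CPUs discussed here).
    void square_all(float* data, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] = data[i] * data[i];
    }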

Unable to resist, I read the introductory OpenMP article and put it into practice in a 3D application, with remarkable results. It worked nicely for bone computation, skinning, vertex coordinates, texture coordinates, and normal interpolation; on top of that, after the key computations were rewritten with SSE instructions, overall efficiency improved by roughly 4x.
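
To give an idea of what such an SSE rewrite looks like, here is a minimal sketch of interpolating vertex data four floats at a time with SSE intrinsics. The function name and data layout are my own illustration, not the article's actual skinning code:

    #include <xmmintrin.h>

    // out[i] = a[i] + t * (b[i] - a[i]) for count floats, four at a time.
    // Assumes count is a multiple of 4 and the buffers do not overlap.
    void sse_lerp(float* out, const float* a, const float* b, float t, int count)
    {
        __m128 vt = _mm_set1_ps(t);
        for (int i = 0; i < count; i += 4)
        {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            __m128 vr = _mm_add_ps(va, _mm_mul_ps(vt, _mm_sub_ps(vb, va)));
            _mm_storeu_ps(out + i, vr);
        }
    }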

Some problems turned up along the way:

1. With SSE I wrote both a function that processes a single element and a function that processes a batch of elements. From a single-threaded point of view, reducing function-call overhead obviously helps, so each thread was handed one contiguous chunk of the data and the batch function was used:

    // num_threads: number of OpenMP threads available
    int num_threads = omp_get_max_threads();
    // spt: number of elements handed to each thread
    int spt = ucount / num_threads;
    if (spt > 0)
    {
        // Each thread processes one contiguous segment of the data
        #pragma omp parallel for
        for (int i = 0; i < num_threads; ++i)
        {
            int nstart = i * spt;
            sse_processdatas(buff + nstart, spt);
        }
    }
    // Handle whatever is left over, one element at a time
    for (int i = num_threads * spt; i < ucount; ++i)
        sse_processdata(buff[i]);

The test result all but dislocated my jaw: this version actually ran slower than the single-threaded code. The single-threaded code is as follows:

    for (int i = 0; i < ucount; i += 4)
    {
        sse_processdata(buff[i]);
    }

Yet simply putting the OpenMP pragma in front of the single-threaded loop improved efficiency by about 0.9x (AMD 3800+, dual-core CPU):

    #pragma omp parallel for
    for (int i = 0; i < ucount; i += 4)
    {
        sse_processdata(buff[i]);
    }

Incredible.

I am not one to dig for first principles; everything starts from practice. So I took away a rule: OpenMP's own handling is good, and seemingly clever hand-rolled tricks often are not, so be honest and down-to-earth with OpenMP. Guided by that idea the other optimizations went smoothly, and I got rather carried away: I started explaining to my wife how OpenMP relates to MPI, drank several packets of coffee during the day, and lay awake half the night.

As the saying goes, if you keep drifting about the rivers and lakes, how can you avoid the knife; if you keep walking along the river, how can you keep your shoes dry? OpenMP taught me another lesson when intersecting a ray against a model, triangle by triangle. Following the earlier experience, this code was written very conservatively and by the book:

Single-threaded code:

    float fmaxdistance = FLT_MAX;
    int ntriangle = -1;
    for (int i = 0; i < ncount; i += 3)
    {
        // Distance at which the ray intersects the triangle
        // (pvertex[pindex[i + 0]], pvertex[pindex[i + 1]], pvertex[pindex[i + 2]])
        float temp;
        if (false == intersecttriangleline(pvertex[pindex[i + 0]],
                                           pvertex[pindex[i + 1]],
                                           pvertex[pindex[i + 2]],
                                           ray, NULL, &temp))
            continue;
        if (temp < 0)
            continue;
        if (temp < fmaxdistance)
        {
            fmaxdistance = temp;
            ntriangle = i;
        }
    }

Multi-threaded code, a typical OpenMP pattern for taking a minimum with per-thread partial results:

    int ntt[2] = { -1, -1 };
    float f[2] = { FLT_MAX, FLT_MAX };
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < ncount; i += 3)
    {
        // Distance at which the ray intersects the triangle
        // (pvertex[pindex[i + 0]], pvertex[pindex[i + 1]], pvertex[pindex[i + 2]])
        float temp;
        if (false == intersecttriangleline(pvertex[pindex[i + 0]],
                                           pvertex[pindex[i + 1]],
                                           pvertex[pindex[i + 2]],
                                           ray, NULL, &temp))
            continue;
        if (temp < 0)
            continue;
        int threadid = omp_get_thread_num();
        if (temp < f[threadid])
        {
            f[threadid] = temp;
            ntt[threadid] = i;
        }
    }
    if (f[0] < f[1])
    {
        fmaxdistance = f[0];
        ntriangle = ntt[0];
    }
    else
    {
        fmaxdistance = f[1];
        ntriangle = ntt[1];
    }

I happily ran the test (my unit-testing habits are good), and the result nearly dislocated my jaw again: efficiency had dropped by about 1/8!

After that I tried modification after modification. Should some OpenMP synchronization be removed? Were the variables correctly declared shared, private, or firstprivate? I checked references, searched the web, read MSDN... all eighteen martial arts were exhausted and I had nearly given up. Finally I reached for the last weapon: the assembly output. Because these were performance tests, all previous builds had been release builds, where the OpenMP-generated assembly is hard to follow and many variables cannot be inspected, so this time I examined a debug build instead. The answer soon appeared:
The first case:

    #pragma omp parallel for
    for (int i = 0; i < ucount; i += 4)
    {
        sse_processdata(buff[i]);
    }

is split by OpenMP between the two threads roughly like this:

    // Thread 0:
    for (int i = 0; i < ucount; i += 8)
    {
        sse_processdata(buff[i]);
    }
    // Thread 1:
    for (int i = 4; i < ucount; i += 8)
    {
        sse_processdata(buff[i]);
    }

In the second case, OpenMP splits the work like this:

    // Thread 0:
    for (int i = 0; i < ncount / 2; i += 3)
    {
        ...
    }
    // Thread 1:
    for (int i = ncount / 2; i < ncount; i += 3)
    {
        ...
    }

Obviously, the difference lies in how the loop iterations are partitioned. What rules does the OpenMP implementation follow when it chooses between these two schemes? Presumably it depends on the loop's context or on how pointers are used. If anyone knows the details, please share them.
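
One thing that can be said with reasonable confidence is that the split does not have to be left to the implementation's discretion: the schedule clause pins it down explicitly. A sketch against the two loops above (the chunk sizes are illustrative):

    // Chunks of one iteration handed out round-robin: with two threads this
    // reproduces the interleaved split of the first case.
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < ucount; i += 4)
        sse_processdata(buff[i]);

    // Plain static schedule: each thread gets one large contiguous block,
    // as in the second case.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ncount; i += 3)
    {
        // ... intersection test as above ...
    }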

With a skeptical attitude, I modified the code of the second case as follows:

    int ntt[2] = { -1, -1 };
    float f[2] = { FLT_MAX, FLT_MAX };
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            int threadid = omp_get_thread_num();
            for (int i = 0; i < ncount; i += 6)
            {
                float temp;
                if (false == intersecttriangleline(pvertex[pindex[i + 0]],
                                                   pvertex[pindex[i + 1]],
                                                   pvertex[pindex[i + 2]],
                                                   ray, NULL, &temp))
                    continue;
                if (temp < 0)
                    continue;
                if (temp < f[threadid])
                {
                    f[threadid] = temp;
                    ntt[threadid] = i;
                }
            }
        }
        #pragma omp section
        {
            int threadid = omp_get_thread_num();
            for (int i = 3; i < ncount; i += 6)
            {
                float temp;
                if (false == intersecttriangleline(pvertex[pindex[i + 0]],
                                                   pvertex[pindex[i + 1]],
                                                   pvertex[pindex[i + 2]],
                                                   ray, NULL, &temp))
                    continue;
                if (temp < 0)
                    continue;
                if (temp < f[threadid])
                {
                    f[threadid] = temp;
                    ntt[threadid] = i;
                }
            }
        }
    }
    if (f[0] < f[1])
    {
        fmaxdistance = f[0];
        ntriangle = ntt[0];
    }
    else
    {
        fmaxdistance = f[1];
        ntriangle = ntt[1];
    }

(Supplement: the code above is not the final version. The final version distributes the triangles cyclically according to the current number of CPU cores. That code is not at hand, but it is not complicated, so it is not shown here.)
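
Purely as an illustration of the idea (this is a sketch I am adding, not the final code mentioned above), a cyclic split over however many cores are available might look roughly like this:

    // Needs <omp.h>, <vector>, <cfloat>. Thread t handles triangles
    // t, t + nt, t + 2*nt, ... where nt is the team size.
    int nthreads = omp_get_max_threads();
    std::vector<float> f(nthreads, FLT_MAX);
    std::vector<int>   ntt(nthreads, -1);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nt  = omp_get_num_threads();   // actual team size
        for (int i = 3 * tid; i < ncount; i += 3 * nt)
        {
            float temp;
            if (false == intersecttriangleline(pvertex[pindex[i + 0]],
                                               pvertex[pindex[i + 1]],
                                               pvertex[pindex[i + 2]],
                                               ray, NULL, &temp))
                continue;
            if (temp >= 0 && temp < f[tid])
            {
                f[tid] = temp;
                ntt[tid] = i;
            }
        }
    }

    // Merge the per-thread partial results.
    fmaxdistance = FLT_MAX;
    ntriangle = -1;
    for (int t = 0; t < nthreads; ++t)
        if (f[t] < fmaxdistance)
        {
            fmaxdistance = f[t];
            ntriangle = ntt[t];
        }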

 

That is, with two threads, thread 0 processes the even-numbered triangles and thread 1 processes the odd-numbered triangles.
The test shows that efficiency improved by more than 0.8x.

"Hey, hey, buddy, you have already passed the article. Why are you still not having to answer the question ?" -- The Reader roared and looked around for pebbles or rotten eggs.
"WUSA... quiet, quiet, I have been hanging on the title !" ---- I raised an umbrella that I had prepared long ago.

Why? Why? Why?
One thing I did not mention earlier: to take care of GPU cache optimization, all of my model data had already been reordered for the GeForce2-class post-transform cache of 16 vertices, and that ordering is also an advantage when the CPU walks the same vertices. When two threads process two triangles at a time at roughly the same speed, they are essentially working on adjacent triangles, and those adjacent triangles, both their indices and their vertices, are very likely to be in the CPU cache already, so little is lost to cache misses. But when the data is split into two contiguous halves, the CPU has to keep four separate regions of memory hot at once (two index ranges and two vertex ranges). When the cache is not large enough (the AMD part's cache is small), the losses from cache misses directly cancel the gains of multithreading and can even make things slower!

To verify this, I randomly permuted the triangle order at model-load time to destroy the cache-friendly ordering of the vertex accesses. Testing again showed that efficiency dropped significantly.
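
A shuffle of that kind can be sketched like this (my own reconstruction, not the original code; the 16-bit index type is an assumption, adjust it to whatever the index buffer actually holds):

    #include <random>
    #include <utility>

    // Randomly reorder whole triangles (groups of three indices) so that
    // consecutive triangles no longer reference nearby vertices, deliberately
    // defeating the cache-friendly ordering.
    void shuffle_triangle_order(unsigned short* pindex, int ncount)
    {
        std::mt19937 rng(12345);                    // fixed seed for repeatable tests
        int ntriangles = ncount / 3;
        for (int t = ntriangles - 1; t > 0; --t)    // Fisher-Yates over triangles
        {
            int other = std::uniform_int_distribution<int>(0, t)(rng);
            for (int k = 0; k < 3; ++k)
                std::swap(pindex[3 * t + k], pindex[3 * other + k]);
        }
    }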

 

No further questions!
