Today, making use of multiple CPU cores has become a challenge every programmer faces. For a moderately experienced programmer, OpenMP (OMP) is undoubtedly the fastest shortcut, and the payoff is substantial: on a dual-core CPU it can typically improve efficiency by a factor of about 1.8.
After reading an introductory OMP article, I put it into practice in a 3D application, and the results were remarkable. Bone computation, skinning, and the interpolation of vertex coordinates, texture coordinates, and normals all parallelized nicely. Building on that, once the key computational parts were rewritten with SSE instructions, overall efficiency improved by roughly 4 times.
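For readers who have not met SSE: one SSE instruction operates on four packed floats at once. A minimal, hypothetical sketch of the idea (the function name and the operation are illustrative only, not the project's actual code):
- #include <xmmintrin.h> // SSE intrinsics
- // Illustrative only: scale four floats with one SSE multiply.
- void sse_scale4(float* p, float s)
- {
-     __m128 v = _mm_loadu_ps(p);        // load 4 packed floats
-     v = _mm_mul_ps(v, _mm_set1_ps(s)); // multiply all 4 lanes by s
-     _mm_storeu_ps(p, v);               // store the 4 results back
- }
One call does the work of four scalar multiplies, which is where the extra headroom comes from.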
Some problems surfaced along the way:
1. With SSE I wrote one function that processes a single piece of data and another that processes a batch. From a single-threaded point of view, reducing function-call overhead obviously improves efficiency, so the following code was used to process a whole block of data:
- // num_threads: number of OMP threads to start
- int num_threads = omp_get_max_threads();
- // spt: amount of data assigned to each thread
- int spt = ucount / num_threads;
- if (spt > 0)
- {
-     // each thread processes one contiguous segment of the data
-     #pragma omp parallel for
-     for (int i = 0; i < num_threads; ++i)
-     {
-         int nstart = i * spt;
-         sse_processdatas(buff + nstart, spt);
-     }
- }
- // handle whatever remains after the even division
- for (int i = num_threads * spt; i < ucount; ++i)
-     sse_processdata(buff[i]);
The test result nearly dislocated my jaw: this version actually runs slower than the single-threaded code, which looks like this:
- for (int i = 0; i < ucount; i += 4)
- {
-     sse_processdata(buff[i]);
- }
Yet simply putting the OMP directive in front of the single-threaded loop improved efficiency by about 90% (AMD 3800+, dual-core CPU):
- #pragma omp parallel for
- for (int i = 0; i < ucount; i += 4)
- {
-     sse_processdata(buff[i]);
- }
Incredible.
I am not one to dig to the bottom of every mystery; everything starts from practice. I simply took away a rule for next time: OMP's own scheduling is good, and seemingly clever hand-written tricks are not, so be honest and straightforward with OMP. Guided by this idea, the other optimizations went smoothly and gratifyingly. I got carried away enough to lecture my wife on OMP and MPI, drink several packets of coffee a day, and lie awake with excitement in the middle of the night.
But as the saying goes: drift about the rivers and lakes long enough, and how can you never take a blade? Keep walking along the river, and how can your shoes stay dry? OMP taught me another lesson when I was intersecting a ray with a model, triangle by triangle. In light of the earlier experience, this code was written very conservatively and by the book:
Single-threaded code:
- float fmaxdistance = FLT_MAX;
- int ntriangle = -1;
- for (int i = 0; i < ncount; i += 3)
- {
-     // distance at which the ray hits triangle
-     // (pvertex[pindex[i + 0]], pvertex[pindex[i + 1]], pvertex[pindex[i + 2]])
-     float temp;
-     if (false == intersecttriangleline(pvertex[pindex[i + 0]],
-                                        pvertex[pindex[i + 1]],
-                                        pvertex[pindex[i + 2]],
-                                        ray, NULL, &temp))
-         continue;
-     if (temp < 0)
-         continue;
-     if (temp < fmaxdistance)
-     {
-         fmaxdistance = temp;
-         ntriangle = i;
-     }
- }
Multi-threaded code, written as a typical OMP minimum-finding pattern:
- int ntt[2] = { -1, -1 };
- float f[2] = { FLT_MAX, FLT_MAX };
- #pragma omp parallel for num_threads(2)
- for (int i = 0; i < ncount; i += 3)
- {
-     // distance at which the ray hits triangle
-     // (pvertex[pindex[i + 0]], pvertex[pindex[i + 1]], pvertex[pindex[i + 2]])
-     float temp;
-     if (false == intersecttriangleline(pvertex[pindex[i + 0]],
-                                        pvertex[pindex[i + 1]],
-                                        pvertex[pindex[i + 2]],
-                                        ray, NULL, &temp))
-         continue;
-     if (temp < 0)
-         continue;
-     int threadid = omp_get_thread_num();
-     if (temp < f[threadid])
-     {
-         f[threadid] = temp;
-         ntt[threadid] = i;
-     }
- }
- if (f[0] < f[1])
- {
-     fmaxdistance = f[0];
-     ntriangle = ntt[0];
- }
- else
- {
-     fmaxdistance = f[1];
-     ntriangle = ntt[1];
- }
I happily ran the test (my unit-testing habits are good), and the result nearly dislocated my jaw again: efficiency had dropped by about 1/8!
After that I tried modification after modification. Was OMP synchronizing where it shouldn't? Were the variables sensibly declared shared, private, or firstprivate? I consulted references, searched the web, read MSDN... all eighteen martial arts were exhausted, and I had basically given up. Finally I thought of the weapon of last resort: reading the assembly. Because I was measuring performance, all the earlier builds and tests had been done in release mode, but once OMP code is compiled in, the release assembly is hard to follow and the values of many variables cannot be inspected, so this time I switched to a debug build for the investigation. The answer soon emerged:
The first case:
- #pragma omp parallel for
- for (int i = 0; i < ucount; i += 4)
- {
-     sse_processdata(buff[i]);
- }
OMP divides the loop between the threads as follows:
- // thread 0:
- for (int i = 0; i < ucount; i += 8)
- {
-     sse_processdata(buff[i]);
- }
- // thread 1:
- for (int i = 4; i < ucount; i += 8)
- {
-     sse_processdata(buff[i]);
- }
In the second case, OMP divides the loop as follows:
- // thread 0:
- for (int i = 0; i < ncount / 2; i += 3)
- {
-     ...
- }
- // thread 1:
- for (int i = ncount / 2; i < ncount; i += 3)
- {
-     ...
- }
Clearly, the difference lies in how the loop iterations are divided. What rules determine which scheme the OMP compiler chooses? They evidently have something to do with the surrounding context or with how pointers are used; if anyone knows the details, please share them.
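One thing worth noting here as a pointer rather than as the fix I actually used: OMP's schedule clause can take this decision out of the compiler's hands. A minimal sketch:
- // round-robin, one iteration at a time per thread (like the first case):
- #pragma omp parallel for schedule(static, 1)
- for (int i = 0; i < ucount; i += 4)
-     sse_processdata(buff[i]);
- // one contiguous block per thread (like the second case):
- #pragma omp parallel for schedule(static)
- for (int i = 0; i < ucount; i += 4)
-     sse_processdata(buff[i]);
I have not gone back to re-verify which default my compiler picks in each situation, so treat this as a hint, not a conclusion.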
Holding on to my suspicion, I modified the code of the second case as follows:
- int ntt[2] = { -1, -1 };
- float f[2] = { FLT_MAX, FLT_MAX };
- #pragma omp parallel sections num_threads(2)
- {
-     #pragma omp section
-     {
-         int threadid = omp_get_thread_num();
-         for (int i = 0; i < ncount; i += 6)
-         {
-             float temp;
-             if (false == intersecttriangleline(pvertex[pindex[i + 0]],
-                                                pvertex[pindex[i + 1]],
-                                                pvertex[pindex[i + 2]],
-                                                ray, NULL, &temp))
-                 continue;
-             if (temp < 0)
-                 continue;
-             if (temp < f[threadid])
-             {
-                 f[threadid] = temp;
-                 ntt[threadid] = i;
-             }
-         }
-     }
-     #pragma omp section
-     {
-         int threadid = omp_get_thread_num();
-         for (int i = 3; i < ncount; i += 6)
-         {
-             float temp;
-             if (false == intersecttriangleline(pvertex[pindex[i + 0]],
-                                                pvertex[pindex[i + 1]],
-                                                pvertex[pindex[i + 2]],
-                                                ray, NULL, &temp))
-                 continue;
-             if (temp < 0)
-                 continue;
-             if (temp < f[threadid])
-             {
-                 f[threadid] = temp;
-                 ntt[threadid] = i;
-             }
-         }
-     }
- }
- if (f[0] < f[1])
- {
-     fmaxdistance = f[0];
-     ntriangle = ntt[0];
- }
- else
- {
-     fmaxdistance = f[1];
-     ntriangle = ntt[1];
- }
(A supplement: the above is not the final code. The final version distributes the triangles round-robin according to the actual number of CPU cores. I don't have that code at hand, but it is not complicated, so it is omitted.)
In other words, with two threads, thread 0 processes the odd-numbered triangles and thread 1 the even-numbered ones; with n threads, each thread takes every n-th triangle.
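Reconstructed from memory, that generalization looks roughly like this (a sketch, not the actual final code; merging the per-thread results with a critical section instead of the fixed two-element arrays is my own simplification):
- // fmaxdistance = FLT_MAX and ntriangle = -1 beforehand, as before
- #pragma omp parallel
- {
-     int threadid = omp_get_thread_num();
-     int nthreads = omp_get_num_threads();
-     float flocal = FLT_MAX;
-     int nlocal = -1;
-     // thread t handles triangles t, t+n, t+2n, ... (index stride 3*n)
-     for (int i = 3 * threadid; i < ncount; i += 3 * nthreads)
-     {
-         float temp;
-         if (false == intersecttriangleline(pvertex[pindex[i + 0]],
-                                            pvertex[pindex[i + 1]],
-                                            pvertex[pindex[i + 2]],
-                                            ray, NULL, &temp))
-             continue;
-         if (temp < 0)
-             continue;
-         if (temp < flocal)
-         {
-             flocal = temp;
-             nlocal = i;
-         }
-     }
-     // merge this thread's local minimum into the shared result
-     #pragma omp critical
-     if (flocal < fmaxdistance)
-     {
-         fmaxdistance = flocal;
-         ntriangle = nlocal;
-     }
- }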
Testing showed an efficiency improvement of more than 80%.
"Hey, hey, buddy, you have already passed the article. Why are you still not having to answer the question ?" -- The Reader roared and looked around for pebbles or rotten eggs.
"WUSA... quiet, quiet, I have been hanging on the title !" ---- I raised an umbrella that I had prepared long ago.
Why? Why? Why?
There is something I did not mention earlier: to accommodate GPU cache optimization, all of my model data is reordered for the 16-entry vertex cache of the GeForce2, and this layout also benefits the CPU when it processes the same vertices. With two threads working on two triangles at once, if they proceed at roughly the same speed they are essentially processing adjacent triangles, and for adjacent triangles both the indices and the vertices have a high probability of already being in the CPU's cache, so little efficiency is lost to cache misses. But when the data is split into two contiguous halves, the CPU has to keep four separate regions of memory cached at the same time (two index regions and two vertex regions). When the cache is not large enough (AMD's CPU cache is comparatively small), the efficiency lost to cache misses directly cancels out the gain from multithreading, and can even make things slower!
To verify this, I randomly shuffled the triangle indices at model-load time so that the vertex data would no longer be CPU-cache friendly. Testing again showed that efficiency dropped significantly.
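The shuffling itself is trivial; it was something along these lines (a sketch, swapping whole index triples so that neighboring triangles no longer share cached vertices; uses rand() from <cstdlib> and std::swap from <algorithm>):
- // Fisher-Yates shuffle over triangles, i.e. over triples of indices
- for (int i = ncount - 3; i > 0; i -= 3)
- {
-     int j = (rand() % (i / 3 + 1)) * 3; // random triangle in [0, i/3]
-     std::swap(pindex[i + 0], pindex[j + 0]);
-     std::swap(pindex[i + 1], pindex[j + 1]);
-     std::swap(pindex[i + 2], pindex[j + 2]);
- }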
No more questions!