Developing parallel programs is different from developing single-core programs, and the algorithm is the top priority. Different parallel algorithms are designed for different workloads, and that choice directly determines the program's efficiency. Designing the parallel algorithm therefore seems to be the greatest difficulty in parallel programming. Looking at existing algorithms, including the CUDA SDK samples and examples posted by experts online, most of them deal with matrix and vector processing; the more in-depth ones involve mathematical kernels such as FFT and Julia sets, and the more advanced ones move on to graphics processing. Studying the ideas behind these algorithms is unavoidable.
I have studied OpenMP programming before. Combined with CUDA, I think that to understand parallel programming we should first understand two concepts: partitioning and reduction. With a solid algorithms background this may already be familiar. Partitioning is an important idea in algorithm design: a problem or task is broken down into smaller and smaller subtasks, and the partial results are then merged. Reduction is an important entry-level idea introduced in the CUDA ** book; it is widely used for generic operations such as computing sums, products, and maxima. The number of threads participating in the operation is halved in each iteration.
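To see the halving concretely, here is a minimal host-side sketch in plain C (the data values are made up for illustration, and there is no GPU code yet): each round sums pairs, so the number of active elements is cut in half until a single value, the total sum, remains.

    #include <stdio.h>

    int main(void) {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        // Each round halves the number of active elements: 4, then 2, then 1.
        for (int active = 8 / 2; active > 0; active /= 2) {
            for (int i = 0; i < active; i++)   // on the GPU, each i would be one thread
                data[i] = data[i] + data[i + active];
        }
        printf("sum = %f\n", data[0]);         // prints 36.000000
        return 0;
    }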
Whatever pattern an algorithm follows, the principle is to break a large task into a set of small tasks. The decomposition rules are simple: the granularity should be as fine as possible, and the data dependence between subtasks should be as small as possible. That is all. Since we are using the GPU to accelerate the task, we must raise the degree of parallelism! With this in mind, we analyze the task at hand as thoroughly as we can and decompose, decompose, decompose!
Here we take reduction as the example: much like getting comfortable with the 9*9 multiplication table first, once these basics are familiar, the 99*99 problems will take care of themselves.
Example: vector summation. We need to sum a vector of length N = 64*256, and we assume each addition a + b takes time t.
CPU computing:
Obviously, a single core needs roughly 64*256 * t (N - 1 additions performed one after another). We cannot tolerate that.
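For comparison, a minimal serial CPU version (the array name and its contents are assumed here just for illustration) performs N - 1 = 16383 dependent additions, one after another, which is where the 64*256 * t estimate comes from:

    #include <stdio.h>

    #define N (64 * 256)

    int main(void) {
        static float a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0f;   // dummy data

        float sum = a[0];
        for (int i = 1; i < N; i++)                // N - 1 = 16383 serial additions
            sum += a[i];

        printf("sum = %f\n", sum);                 // prints 16384.000000
        return 0;
    }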
GPU computing:
First, if we have a GPU that can run N/2 threads at the same time, we launch N/2 threads, each adding one pair of numbers; reducing N numbers to N/2 partial sums then takes just time t. This round costs t;
Next, notice that this process can be applied recursively: in the second round we run N/4 threads at the same time, reducing the remaining N/2 numbers to N/4. This round also costs t;
Keep going like this. In the last round a single thread runs, and a single value remains. That is our result!!!
Each round takes time t. In the ideal case, how many such rounds are needed?
Number of rounds = log2(N) = log2(64*256) = 6 + 8 = 14. That's right, it takes only 14 rounds. In other words, we spent only 14 * t!
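To double-check that round count, a few lines of host code (just counting the halvings, nothing GPU-specific) confirm that 64*256 = 16384 elements shrink to a single value in 14 rounds:

    #include <stdio.h>

    int main(void) {
        int remaining = 64 * 256;          // 16384 values to start with
        int rounds = 0;
        while (remaining > 1) {            // each round halves the remaining count
            remaining /= 2;
            rounds++;
        }
        printf("rounds = %d\n", rounds);   // prints 14
        return 0;
    }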
This is reduction. It is simple and easy to use. Even before any later optimization of the program, the algorithmic analysis alone has already cut the time complexity from N down to log N, an improvement that is simply unattainable with a conventional serial algorithm.
This is an exponential gain in efficiency! Here you see the GPU's unique advantage, one the CPU cannot match: a huge number of processing units!
The kernel function for the reduction sum is as follows:
__global__ void rowsum(float *a, float *b) {
    int bid = blockIdx.x;
    int tid = threadIdx.x;
    __shared__ float s_data[128];
    // Read data into shared memory
    s_data[tid] = a[bid * 128 + tid];
    __syncthreads();  // synchronize
    for (int i = 64; i > 0; i /= 2) {
        if (tid < i)
            s_data[tid] = s_data[tid] + s_data[tid + i];
        __syncthreads();
    }
    if (tid == 0)
        b[bid] = s_data[0];
}
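A host-side launch might look like the sketch below. It assumes the vector length is 64*256 with 128 threads per block (matching s_data[128]), so rowsum is launched with 128 blocks and leaves 128 per-block partial sums in b, which are then summed in one final pass on the CPU. The buffer names, initialization, and the final CPU pass are assumptions for illustration, not part of the original post.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define N (64 * 256)   // 16384 elements
    #define BLOCK 128      // threads per block, matches s_data[128]

    int main() {
        float *h_a = new float[N];
        float h_b[N / BLOCK];                      // 128 per-block partial sums
        for (int i = 0; i < N; i++) h_a[i] = 1.0f; // dummy data

        float *d_a, *d_b;
        cudaMalloc((void**)&d_a, N * sizeof(float));
        cudaMalloc((void**)&d_b, (N / BLOCK) * sizeof(float));
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

        rowsum<<<N / BLOCK, BLOCK>>>(d_a, d_b);    // 128 blocks x 128 threads

        cudaMemcpy(h_b, d_b, (N / BLOCK) * sizeof(float), cudaMemcpyDeviceToHost);

        float sum = 0.0f;                          // final pass over the 128 partials
        for (int i = 0; i < N / BLOCK; i++) sum += h_b[i];
        printf("sum = %f\n", sum);                 // prints 16384.000000

        cudaFree(d_a); cudaFree(d_b); delete[] h_a;
        return 0;
    }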