6. Concepts of Reduction and Synchronization
Reduction is a basic idea in parallel computing. It solves many common, practical problems, such as summation and finding extreme values. Because it is so basic and important, I think it is worth learning systematically.
So I think it is worth copying over the introduction from my earlier post:
http://www.cnblogs.com/viviman/archive/2012/11/21/2780286.html
Developing a parallel program is different from developing a single-core program: the algorithm is the top priority. Different parallel algorithms are designed for different problems, and they directly determine program efficiency, so designing the parallel algorithm is arguably the hardest part of the job. For algorithms to study, there are the CUDA SDK examples and examples posted by experts online, mostly matrix and vector processing; deeper ones involve mathematical kernels such as FFT and Julia sets, and more advanced examples handle graphs. Studying these algorithms is unavoidable. I had studied OpenMP programming before, and combining that with CUDA, I believe that to understand parallel programming you should first understand the two concepts of decomposition and reduction; with those, your grasp of the algorithms will be more solid.
Decomposition is an important idea in algorithms: break a large problem or task into small problems or tasks, conquer each one, and finally merge the results. Reduction is an important entry-level idea introduced in the CUDA ** book. It is widely used to compute sums, products, and maximums; in each round, the number of threads participating in the operation is halved. Whatever pattern an algorithm follows, the principle is to break a large task into a set of small ones, and the decomposition principle is: make the granularity as fine as possible, and the data dependence between pieces as small as possible. That is all. Since we use the GPU for acceleration, to accelerate we must increase the parallelism of the running tasks! Once you understand this, you will do your best to analyze the task at hand and decompose, decompose, decompose!
Here I take reduction as the example, because, much as memorizing the 9*9 multiplication table prepares you for 99*99 problems, getting familiar with this small case covers the basics. Example: vector summation. We must sum a vector of length N = 64*256, and we assume one addition a + b costs time t.
CPU computing: on a single core this takes 64*256 * t. We cannot tolerate that.
GPU computing: the naive idea is this. If we have a GPU that can run n/2 threads at the same time, then reducing n numbers to n/2 numbers by pairwise addition takes just time t, right? Yes: this first round costs t. Then do you see the recursion? In the second round we run n/2/2 threads at the same time and reduce the remaining n/2 numbers to n/4; this round also costs t. Keep going like this: in the last round a single thread runs, and one number remains. That is our result! Every round costs t. How many rounds does this take in the ideal case? Number of rounds = log2(n) = log2(64*256) = 6 + 8 = 14. That's right, only 14 rounds; that is, we spend only 14 * t!
This is reduction. It is simple and easy to use. Even before any later program optimization, the time complexity already drops from O(n) to O(log n) at the algorithm-analysis stage, which, compared with the conventional algorithm, is a huge gain in efficiency! From this you can see the GPU's unique advantage, one the CPU cannot match: a huge number of processing units!
The kernel function for the reduction sum looks like this:
__global__ void rowsum(float *a, float *b)
{
    int bid = blockIdx.x;
    int tid = threadIdx.x;
    __shared__ float s_data[128];          // read data into shared memory
    s_data[tid] = a[bid * 128 + tid];
    __syncthreads();                       // sync: wait until the whole block has loaded
    for (int i = 64; i > 0; i /= 2) {
        if (tid < i)
            s_data[tid] = s_data[tid] + s_data[tid + i];
        __syncthreads();                   // sync after every halving round
    }
    if (tid == 0)
        b[bid] = s_data[0];                // thread 0 writes this block's partial sum
}
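To make the data flow concrete, here is a minimal host-side sketch of how this kernel might be driven. The original post does not show the host code, so this is my own assumption: with 128 threads per block, the N = 64*256 vector needs 128 blocks, and the 128 partial sums are finished off on the CPU.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void rowsum(float *a, float *b);   // the kernel shown above

int main(void)
{
    const int N = 64 * 256;                   // 16384 elements
    const int BLOCK = 128;                    // matches s_data[128] in the kernel
    const int GRID = N / BLOCK;               // 128 blocks, one partial sum each

    float *h_a = (float *)malloc(N * sizeof(float));
    float h_b[GRID];
    for (int i = 0; i < N; ++i) h_a[i] = 1.0f;   // the sum should come out to 16384

    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, GRID * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

    rowsum<<<GRID, BLOCK>>>(d_a, d_b);

    cudaMemcpy(h_b, d_b, GRID * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < GRID; ++i) sum += h_b[i];   // finish the reduction on the CPU
    printf("sum = %f\n", sum);

    cudaFree(d_a); cudaFree(d_b); free(h_a);
    return 0;
}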
This example also taught me another thing: synchronization! Let me tell a story about synchronization. We dispatch 10 squads to travel from Nanjing to fight in Japan. The agreement is: each squad moves on its own, all 10 squads meet at the Shanghai airport on the third day, and then fly to Japan together. This has to be coordinated; the first squad cannot fly off to Japan as soon as it reaches Shanghai. The only thing the first squad can do is: wait! When tasks must be managed in a unified way, you must synchronize them. At the Shanghai rendezvous the pace is unified; the fast squads wait for the slow ones, and then everyone sets out on the next leg together.
Easy to understand, right? This is an everyday example of synchronization. It is fair to say that many computer mechanisms and algorithms come from daily life, which makes them easier to understand.
So in CUDA, is the synchronization mechanism useful, and how is it used? Let me tell you: in any project of normal scale, data generally has ordering dependencies. One thread's computed result may be the input of another thread. Wherever such a dependency exists, a need for synchronization arises.
__syncthreads() is a barrier for all threads in a block: every thread waits at that point, and only when all threads in the block have reached it does execution continue with the statements that follow.
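As a minimal illustration (my own sketch, not from the original post), here is a kernel where each thread reads a value written by a neighboring thread; without the barrier between the write and the read, the read could see stale data.
__global__ void shift_left(float *data)   // launch with one block of 128 threads
{
    __shared__ float buf[128];
    int tid = threadIdx.x;

    buf[tid] = data[tid];                 // every thread writes one slot
    __syncthreads();                      // barrier: all writes must finish...
    data[tid] = buf[(tid + 1) % 128];     // ...before any thread reads a neighbor's slot
}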
7. Locks in Programming
Where threads compete for data, locks are indispensable. They are a good thing, just as in daily life: if your home has no lock, it is not safe. If there were no locks on the GPU, data would get "stolen".
For data that threads compete over, CUDA provides atomic operations.
The following is an example:
__global__ void kernelFun()
{
    __shared__ int i;                 // shared variables cannot be initialized at declaration
    if (threadIdx.x == 0) i = 0;
    __syncthreads();
    atomicAdd(&i, 1);                 // each thread adds 1, one at a time
}
Without a mutual-exclusion mechanism, threads (even within the same half-warp) would interleave their read-modify-write operations on i and the result would be wrong; the atomic operation guarantees each increment completes as a whole.
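A quick way to convince yourself is my own test sketch below, which uses a counter in global memory so the host can check the final value: every one of the 64*128 threads bumps it exactly once.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void count_all(int *counter)
{
    atomicAdd(counter, 1);            // every thread bumps the shared counter once
}

int main(void)
{
    int *d_counter, h_counter = 0;
    cudaMalloc((void **)&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    count_all<<<64, 128>>>(d_counter);    // 64*128 = 8192 threads

    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expect 8192)\n", h_counter);
    cudaFree(d_counter);
    return 0;
}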
With atomic operation functions you can easily write your own lock. The lock struct given in the SDK looks like this:
#ifndef __LOCK_H__
#define __LOCK_H__

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// HANDLE_ERROR is the usual error-checking macro from the sample code

struct Lock {
    int *mutex;

    Lock(void) {
        HANDLE_ERROR(cudaMalloc((void **)&mutex, sizeof(int)));
        HANDLE_ERROR(cudaMemset(mutex, 0, sizeof(int)));
    }

    ~Lock(void) {
        cudaFree(mutex);
    }

    __device__ void lock(void) {
        while (atomicCAS(mutex, 0, 1) != 0);   // spin until we swap 0 -> 1
    }

    __device__ void unlock(void) {
        atomicExch(mutex, 0);                  // release: write 0 back
    }
};

#endif
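Here is a sketch of how such a lock might be used, for example letting one thread per block fold its block's partial result into a single global total. This is my own example, not from the post; only thread 0 of each block contends for the lock, so ordinary (non-atomic) arithmetic inside the critical section is safe.
__global__ void add_partials(Lock lock, float *partial, float *total)
{
    if (threadIdx.x == 0) {
        lock.lock();                            // enter the critical section
        *total = *total + partial[blockIdx.x];  // plain read-modify-write, now safe
        lock.unlock();                          // leave the critical section
    }
}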
8. CUDA Software Architecture
9. Make Good Use of Existing Resources
If every operation had to be implemented from scratch in our own programs, I believe programmers' careers would shorten, and nobody would be willing to do that kind of work. Programmers need to learn to be "lazy", that is, to use existing resources efficiently. When the STL library appeared in C++, the development efficiency of C++ programmers doubled, and program stability became higher.
What does CUDA give us? Quite a lot, actually.
First, let me introduce several libraries: CUFFT, CUBLAS, and CUDPP.
I will not detail the functions in these libraries here, but you need to know them in broad strokes, or you will not know where to look when you need them. CUFFT is the Fourier-transform library; CUBLAS provides basic matrix and vector operations; CUDPP provides common parallel primitives such as sorting and searching.
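As a taste of how little code these libraries demand, here is a minimal SAXPY (y = alpha*x + y) sketch against the handle-based CUBLAS v2 API; that interface is newer than the original post, so treat the exact calls as my assumption rather than what the author used.
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    // ... fill d_x and d_y, e.g. with cudaMemcpy from host arrays ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // d_y = alpha*d_x + d_y

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}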
CUDA 4.0 and above provide a template library similar to STL (the Thrust library); so far I have only looked at it briefly, and it mainly offers a vector-like template container. Is there a map? A map is essentially a hash, and you can implement that mechanism yourself with a hash table.
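This template library turns the reduction from section 6 into a one-liner. A minimal sketch, assuming the library in question is Thrust:
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(void)
{
    // the same N = 64*256 vector-summation example as in section 6
    thrust::device_vector<float> v(64 * 256, 1.0f);   // vector-like container on the GPU

    // reduce runs the whole tree-style reduction on the device
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f);

    printf("sum = %f\n", sum);   // expect 16384
    return 0;
}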
There are also many examples in the SDK covering common basic operations; things like initCuda can be solidified into function components for new programs to call, as sketched below.
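For instance, a device-initialization helper along these lines (my own sketch of what such a reusable initCuda component typically looks like) can be dropped into every new project:
#include <stdio.h>
#include <cuda_runtime.h>

// A reusable initialization component: pick the first CUDA device, or fail loudly.
bool initCuda(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count == 0) {
        fprintf(stderr, "initCuda: no CUDA-capable device found.\n");
        return false;
    }
    cudaSetDevice(0);          // bind this host thread to device 0
    return true;
}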
Going forward, I will keep collecting concrete pieces that can be solidified this way, to enrich my own CUDA library!