Preface
This article describes how to implement parallel computing from the perspective of GPU programming.
Three important issues to be considered in parallel computing
1. Synchronization
In an operating systems course, we learn about deadlocks between processes and about the critical-resource problems caused by resource sharing; the same issues arise between the computing units of a parallel program.
2. Concurrency
Some problems are "embarrassingly parallel", such as matrix multiplication. In this type of problem, the output of each computing unit is independent of all the others, so the problem is easy to solve (often by doing nothing more than calling a few libraries).
However, if there are dependencies between the computing units, the problem becomes complicated. In CUDA, intra-block communication is implemented through shared memory, while inter-block communication can only go through global memory.
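The sketch below shows the intra-block case: threads stage data in shared memory, synchronize, and then read values written by other threads in the same block. It assumes the input length is a multiple of the block size; the kernel name is illustrative, not from the article.

```cpp
#define TILE 256   // threads per block (illustrative)

// Each block reverses its own tile of the input. The __shared__ array is
// visible to every thread in the block, which is what makes the
// communication possible.
__global__ void reverseTile(const int *in, int *out)
{
    __shared__ int tile[TILE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];                     // stage one element
    __syncthreads();                                 // make all writes visible
    out[gid] = tile[blockDim.x - 1 - threadIdx.x];   // read another thread's write
}
```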
The CUDA parallel programming architecture can be described through the grid: a grid is like an army. The grid is divided into multiple blocks, which are like the departments of the army (the logistics department, the headquarters, the communications department, and so on). Each block is in turn divided into multiple warps, which are like the teams inside a department. The analogy may help you picture the hierarchy.
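In code, the hierarchy looks like this (a minimal sketch; the kernel name and the sizes are illustrative, not from the article):

```cpp
// Every thread can locate itself inside the grid/block/warp hierarchy.
__global__ void whoAmI(int *out, int n)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x; // position in the whole "army"
    int warp = threadIdx.x / warpSize;                // "team" (warp) within the block
    if (gid < n)
        out[gid] = warp;
}

// Host side: 256 threads per block ("department"), enough blocks to cover n.
// whoAmI<<<(n + 255) / 256, 256>>>(d_out, n);
```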
3. Locality
In operating system principles, locality receives a key introduction. Simply put, recently accessed data is kept in the cache because it is likely to be accessed again soon (temporal locality), together with the data located near it (spatial locality).
In GPU programming, locality is just as important. It shows up as follows: the data to be computed should be sent to the video memory before computation starts, and during iteration the transfers between host memory and device memory must be minimized. This is very important in actual projects.
For GPU programming, the program must manage memory by itself; in other words, it must implement locality by itself.
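A hedged sketch of that rule: upload once, keep the data resident on the GPU across all iterations, and download only the final result. The kernel `step` is a hypothetical stand-in for one iteration of the computation.

```cpp
#include <cuda_runtime.h>

__global__ void step(float *data, int n);   // hypothetical per-iteration kernel

void runIterations(float *h_data, int n, int iters)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // one upload

    for (int i = 0; i < iters; ++i)
        step<<<(n + 255) / 256, 256>>>(d_data, n);   // data never leaves the GPU

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // one download
    cudaFree(d_data);
}
```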
Two Types of Parallel Computing
1. Task-Based Parallel Processing
In this parallel mode, a computing task is divided into several small but different subtasks. For example, some computing units are responsible for fetching data, some for computing, some for ... and so on; together, the large task forms a pipeline.
Note that the throughput of the pipeline is limited by its least efficient stage.
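One way to express such a pipeline in CUDA is with streams, so that one chunk's transfer overlaps with another chunk's computation. A sketch under assumptions: `process` is a hypothetical kernel, the chunking is illustrative, and the host buffer must be pinned (allocated with `cudaHostAlloc`) for the asynchronous copies to actually overlap.

```cpp
#include <cuda_runtime.h>

__global__ void process(float *data, int n);   // hypothetical compute stage

void pipeline(float *h_in, float *d_buf, int nChunks, int chunkElems)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = s[c % 2];                    // alternate the two streams
        float *dst = d_buf + c * chunkElems;
        cudaMemcpyAsync(dst, h_in + c * chunkElems,    // stage 1: fetch data
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunkElems + 255) / 256, 256, 0, st>>>(dst, chunkElems); // stage 2
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```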
2. Data-Based Parallel Processing
In this parallel mode, the data is divided into multiple pieces so that multiple computing units can each work on a small piece separately; the results are then combined.
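The canonical example is vector addition: each thread handles one element, and the combined output array is the summary. A minimal sketch:

```cpp
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch with enough threads to cover all n elements:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```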
In general, CPU multithreaded programming leans toward the first mode, while GPU parallel programming leans toward the second.
Common Targets of Parallel Optimization
1. Loops
This is also the most common pattern: each thread processes one element, or one group of elements, of the loop.
With this type of optimization, be careful about dependencies between computing units, and about iterations that depend on their own previous results (loop-carried dependencies).
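A common way to map a loop onto threads is the grid-stride pattern sketched below; it is safe here only because each iteration reads and writes nothing but its own element, i.e. there are no loop-carried dependencies. The kernel name is illustrative.

```cpp
__global__ void scale(const float *x, float *y, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)   // each thread takes a strided group
        y[i] = s * x[i];                // independent of every other iteration
}
```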
2. Fork/Join Pattern
In this mode, most of the code is serial, but certain sections can be processed in parallel.
A typical scenario: when serial processing reaches a certain point, different parts of an input queue need to be handled differently, so the queue is split across multiple computing units for processing (the fork), and the results are then gathered back together (the join, or aggregation).
This mode is often used when the amount of concurrency is not known in advance; it exhibits "dynamic concurrency".
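A hedged host-side sketch of the pattern: serial code runs until the queue is ready, the work is forked across GPU threads, and `cudaDeviceSynchronize()` performs the join before serial processing resumes. `Item` and `handleItem` are hypothetical names.

```cpp
struct Item { int key; float value; };            // hypothetical queue entry

__global__ void handleItem(Item *queue, int n);   // hypothetical per-item kernel

void processQueue(Item *d_queue, int n)
{
    // ... serial preprocessing on the CPU up to this point ...

    handleItem<<<(n + 255) / 256, 256>>>(d_queue, n); // fork: one thread per item
    cudaDeviceSynchronize();                          // join: wait for all items

    // ... serial aggregation of the results continues here ...
}
```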
3. Tiling/Blocking Pattern
For extremely large data sets (such as climate models), the data can be divided into blocks that are then computed in parallel.
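When the whole data set does not even fit in device memory, the blocks can be staged one at a time: copy a tile in, compute it, copy it back. A sketch under that assumption; `tileKernel` is hypothetical.

```cpp
#include <cuda_runtime.h>

__global__ void tileKernel(float *tile, int n);   // hypothetical per-tile kernel

void processHugeArray(float *h_data, size_t total, size_t tileElems)
{
    float *d_tile;
    cudaMalloc(&d_tile, tileElems * sizeof(float));

    for (size_t off = 0; off < total; off += tileElems) {
        size_t n = (total - off < tileElems) ? total - off : tileElems;
        cudaMemcpy(d_tile, h_data + off, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        tileKernel<<<((int)n + 255) / 256, 256>>>(d_tile, (int)n);
        cudaMemcpy(h_data + off, d_tile, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(d_tile);
}
```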
4. Divide and Conquer
The vast majority of recursive algorithms, such as quicksort, can be converted into iterative form, and the iterative model can in turn be mapped onto the GPU programming model.
In particular, although GPUs of both the Fermi and the Kepler architecture support a call stack, so a recursive model can be run directly on the GPU, for the sake of efficiency it is better to convert it into an iterative model when development time permits.
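Parallel reduction is a classic example of divide and conquer written iteratively, as recommended above: each step merges two halves of the data instead of recursing. A minimal sketch, assuming the block size is a power of two:

```cpp
#define BLOCK 256   // threads per block (illustrative, power of two)

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int half = blockDim.x / 2; half > 0; half /= 2) { // halve the problem
        if (threadIdx.x < half)
            s[threadIdx.x] += s[threadIdx.x + half];       // merge two halves
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];   // one partial sum per block
}
```

The per-block partial sums can then be reduced by a second launch, or summed on the host.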