Second article: Understanding Parallel Computing from the Perspective of the GPU

Source: Internet
Author: User

Preface

This article approaches the ideas behind parallel computing from the perspective of GPU programming.

Three important issues to be considered in parallel computing

1. Synchronization issues

In operating systems courses, we learned about deadlock between processes and the critical-resource problems caused by resource sharing. The same problems appear between threads in GPU programming, where thousands of threads may touch the same data.
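
Below is a minimal CUDA sketch of how the critical-resource problem shows up on the GPU (the kernel name and data are illustrative): many threads increment one shared counter, so the update must be made atomic.

    // Count the positive elements of an array. All threads share one
    // counter, so a plain "(*counter)++" would be a race condition;
    // atomicAdd serializes the conflicting updates.
    __global__ void count_positive(const float *data, int n, int *counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0.0f)
            atomicAdd(counter, 1);
    }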

2. Concurrency level

Some problems are "embarrassingly parallel", such as matrix multiplication. In this type of problem, the output of each compute unit is independent of all the others, so the problem is easy to solve (often by simply calling an existing library).
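
As a sketch (a naive kernel, not a tuned one): in matrix multiplication every output element can be computed by its own thread, with no thread depending on another thread's output.

    // Naive N x N matrix multiplication: one thread per output element.
    // Each C[row][col] is computed independently of all the others.
    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }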

However, if there are dependencies between the units, the problem becomes complicated. In CUDA, communication within a block is achieved through shared memory, while communication between blocks can only go through global memory.
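
The following hedged sketch shows both communication paths at once, assuming blocks of 256 threads: the threads of a block cooperate through shared memory (with __syncthreads() barriers), and each block then publishes its partial result through global memory.

    // Per-block sum reduction. buf lives in shared memory and is visible
    // only within the block; block_results lives in global memory and is
    // the only way the blocks can communicate with each other.
    __global__ void block_sum(const float *in, float *block_results, int n)
    {
        __shared__ float buf[256];                    // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction inside the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)                         // one write per block
            block_results[blockIdx.x] = buf[0];
    }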

The CUDA parallel programming model can be described with a grid analogy: a grid is like an army. A grid is divided into blocks, which are like the departments of the army (logistics, command, communications, and so on). Each block is divided into multiple threads, which are like the teams within a department. The analogy may help in understanding the hierarchy.
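
In code, the hierarchy looks like this (a toy kernel; d_out is assumed to be a device buffer of at least 4096 floats):

    // Each thread works out its global position from its block ("department")
    // and its index within the block ("team member").
    __global__ void hello_grid(float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = (float)tid;
    }

    // Launch an "army" of 16 blocks x 256 threads = 4096 threads:
    // hello_grid<<<16, 256>>>(d_out);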

3. Locality

The principle of locality is a key topic in operating systems courses. In a nutshell, recently accessed data (temporal locality) and data near recently accessed data (spatial locality) are kept in the cache.

In GPU programming, locality is just as important. It shows up in two ways: the data to be processed should be moved into device (video) memory before the computation starts, and during iterative computation the amount of data transferred between host RAM and device memory must be kept as small as possible. Practical projects show that this matters a great deal.

In GPU programming, the programmer must manage memory explicitly; in other words, the programmer must implement locality by hand.
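
A hedged sketch of this advice (step_kernel is a placeholder for the real per-iteration computation): the working set is uploaded to device memory once, all iterations run on the device, and the result is downloaded once.

    #include <cuda_runtime.h>

    // Placeholder per-iteration kernel.
    __global__ void step_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;                   // stand-in for the real update
    }

    void iterate_on_device(float *h_data, int n, int iters)
    {
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);           // one upload

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        for (int it = 0; it < iters; ++it)
            step_kernel<<<blocks, threads>>>(d_data, n);  // no RAM<->device traffic

        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);           // one download
        cudaFree(d_data);
    }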

Two types of parallel computing

1. Task-based parallel processing

This mode splits a computation into several small but different tasks: some compute units are responsible for one part of the calculation, others for another, and so on. Chained together, these subtasks form a pipeline (an assembly line).

It is important to note that the throughput of a pipeline is limited by its least efficient (slowest) stage.
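
One way to build such a pipeline on the GPU is with CUDA streams, sketched below under some assumptions: process_chunk is a placeholder kernel, d_buf is a preallocated device buffer of chunks * chunk_len floats, and h_in should be pinned host memory (allocated with cudaMallocHost) for the copy and the compute to actually overlap.

    // Two-stage pipeline: while one chunk is being copied to the device
    // in one stream, the previous chunk is being processed in the other.
    __global__ void process_chunk(float *chunk, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) chunk[i] *= 2.0f;                // stand-in for the real work
    }

    void pipelined(const float *h_in, float *d_buf, int chunks, int chunk_len)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < chunks; ++c) {
            cudaStream_t st = s[c % 2];               // alternate between streams
            float *d_chunk = d_buf + c * chunk_len;
            cudaMemcpyAsync(d_chunk, h_in + c * chunk_len,
                            chunk_len * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            process_chunk<<<(chunk_len + 255) / 256, 256, 0, st>>>(d_chunk,
                                                                   chunk_len);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }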

2. Data-based parallel processing

This mode decomposes the data into multiple pieces, lets multiple compute units process the pieces in parallel, and then gathers the results together.

In general, multithreaded CPU programming leans toward the first mode, while GPU parallel programming leans toward the second.
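
The canonical data-parallel kernel is vector addition, sketched below: the data is split so that each thread owns exactly one element, and no gathering step is even needed because the outputs are independent.

    // c[i] = a[i] + b[i]: each thread handles one piece of the data.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }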

Common targets for parallel optimization

1. Loops

This is the most common pattern: each thread processes one element, or one group of elements, of a loop.

With this kind of optimization, be careful about dependencies between compute units, and in particular about each unit's dependence on its own result from the previous iteration.
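
A common idiom for "one thread, a set of elements" is the grid-stride loop, sketched below; it is safe here only because the iterations are independent of one another.

    // Each thread starts at its global index and strides by the total
    // number of threads in the grid, so any n is covered by any launch size.
    __global__ void scale(float *data, int n, float k)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            data[i] *= k;   // no iteration reads another iteration's result
        }
    }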

2. Fork/join pattern

In this pattern, most of the code is serial, but certain segments can be processed in parallel.

The typical case: an input queue is processed serially until it reaches a point where the items must be treated differently; the work is then split across multiple compute units (the fork), and the results are collected afterwards (the join).

This pattern is typically used when the amount of concurrency is not known in advance; it is closely related to "dynamic parallelism".
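
CUDA's dynamic parallelism (compute capability 3.5 and later, compiled with nvcc -rdc=true) fits this fork/join shape: a parent kernel forks child grids whose sizes are only known at run time. The kernel names and chunk layout below are illustrative.

    __global__ void child(float *chunk, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) chunk[i] += 1.0f;                // stand-in for per-item work
    }

    __global__ void parent(float *data, const int *chunk_len, int chunks)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int offset = 0;
            for (int c = 0; c < chunks; ++c) {        // fork: one child grid per chunk
                int len = chunk_len[c];
                child<<<(len + 255) / 256, 256>>>(data + offset, len);
                offset += len;
            }
            // join: all child grids are guaranteed to complete before
            // this parent grid finishes.
        }
    }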

3. Strip/block (tiling) pattern

For particularly large data sets, such as climate models, the data can be divided into strips or blocks that are computed in parallel.
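
A sketch of the idea, assuming a 2D field stored row-major on the device: the domain is cut into 16 x 16 tiles, one per thread block, and the 2D grid of blocks covers the whole field.

    // Apply a (placeholder) update to every cell of the field.
    __global__ void update_field(float *field, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            field[y * width + x] *= 0.99f;            // stand-in for the real stencil
    }

    // Launch with a 2D grid covering the whole domain:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // update_field<<<grid, block>>>(d_field, width, height);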

4. Divide and conquer

Most recursive algorithms, such as quicksort, can be transformed into an iterative form, which maps naturally onto the GPU programming model.

In particular, since GPUs of the Fermi and Kepler architectures support a call stack, a recursive algorithm can be ported to the GPU directly. For the sake of efficiency, however, it is better to convert it to an iterative form when development time permits.
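
Below is a hedged sketch of the recursion-to-iteration conversion for quicksort: the recursive calls are replaced by an explicit stack of (lo, hi) ranges. It is marked __host__ __device__ so the same code could run on the CPU or inside a kernel.

    #define STACK_MAX 64   // ~log2(n) entries suffice with the push order below

    __host__ __device__ void quicksort_iterative(int *a, int n)
    {
        int lo_stack[STACK_MAX], hi_stack[STACK_MAX];
        int top = 0;
        lo_stack[0] = 0;
        hi_stack[0] = n - 1;

        while (top >= 0) {
            int lo = lo_stack[top], hi = hi_stack[top]; --top;  // pop a range
            if (lo >= hi) continue;

            int pivot = a[hi], i = lo;           // Lomuto partition around a[hi]
            for (int j = lo; j < hi; ++j)
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; ++i; }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;

            // Push the larger half first so the smaller half is sorted next;
            // this keeps the explicit stack about log2(n) entries deep.
            if ((i - 1 - lo) > (hi - i - 1)) {
                ++top; lo_stack[top] = lo;    hi_stack[top] = i - 1;
                ++top; lo_stack[top] = i + 1; hi_stack[top] = hi;
            } else {
                ++top; lo_stack[top] = i + 1; hi_stack[top] = hi;
                ++top; lo_stack[top] = lo;    hi_stack[top] = i - 1;
            }
        }
    }

On the CPU this sorts an array in place; on the GPU, each thread could sort its own segment of a larger array.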
