Second article: Understanding Parallel Computing from the Perspective of the GPU

Source: Internet
Author: User

Preface

This article approaches the ideas behind parallel computing from the perspective of GPU programming.

Three important issues to be considered in parallel computing

1. Synchronization issues

In operating systems courses, we learned about deadlock between processes and the critical-resource problems caused by resource sharing. The same problems appear between threads in GPU programming, where thousands of threads may touch the same data.
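
Below is a minimal CUDA sketch of how the critical-resource problem shows up on the GPU (the kernel name and data are illustrative): many threads increment one shared counter, so the update must be made atomic.

    // Count the positive elements of an array. All threads share one
    // counter, so a plain "(*counter)++" would be a race condition;
    // atomicAdd serializes the conflicting updates.
    __global__ void count_positive(const float *data, int n, int *counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0.0f)
            atomicAdd(counter, 1);
    }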

2. Concurrency level

Some problems are "embarrassingly parallel", such as matrix multiplication. In this type of problem, the output of each compute unit is independent of all the others, so the problem is easy to solve (often by simply calling an existing library).
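
As a sketch (a naive kernel, not a tuned one): in matrix multiplication every output element can be computed by its own thread, with no thread depending on another thread's output.

    // Naive N x N matrix multiplication: one thread per output element.
    // Each C[row][col] is computed independently of all the others.
    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }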

However, if there are dependencies between the units, the problem becomes complicated. In CUDA, communication within a block is achieved through shared memory, while communication between blocks can only go through global memory.
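
The following hedged sketch shows both communication paths at once, assuming blocks of 256 threads: the threads of a block cooperate through shared memory (with __syncthreads() barriers), and each block then publishes its partial result through global memory.

    // Per-block sum reduction. buf lives in shared memory and is visible
    // only within the block; block_results lives in global memory and is
    // the only way the blocks can communicate with each other.
    __global__ void block_sum(const float *in, float *block_results, int n)
    {
        __shared__ float buf[256];                    // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction inside the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)                         // one write per block
            block_results[blockIdx.x] = buf[0];
    }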

The CUDA parallel programming model can be described with a grid analogy: a grid is like an army. A grid is divided into blocks, which are like the departments of the army (logistics, command, communications, and so on). Each block is divided into multiple threads, which are like the teams within a department. The analogy may help in understanding the hierarchy.
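
In code, the hierarchy looks like this (a toy kernel; d_out is assumed to be a device buffer of at least 4096 floats):

    // Each thread works out its global position from its block ("department")
    // and its index within the block ("team member").
    __global__ void hello_grid(float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = (float)tid;
    }

    // Launch an "army" of 16 blocks x 256 threads = 4096 threads:
    // hello_grid<<<16, 256>>>(d_out);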

3. Locality

The principle of locality is a key topic in operating systems courses. In a nutshell, recently accessed data (temporal locality) and data near recently accessed data (spatial locality) are kept in the cache.

In GPU programming, locality is just as important. It shows up in two ways: the data to be processed should be moved into device (video) memory before the computation starts, and during iterative computation the amount of data transferred between host RAM and device memory must be kept as small as possible. Practical projects show that this matters a great deal.

In GPU programming, the programmer must manage memory explicitly; in other words, the programmer must implement locality by hand.
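
A hedged sketch of this advice (step_kernel is a placeholder for the real per-iteration computation): the working set is uploaded to device memory once, all iterations run on the device, and the result is downloaded once.

    #include <cuda_runtime.h>

    // Placeholder per-iteration kernel.
    __global__ void step_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;                   // stand-in for the real update
    }

    void iterate_on_device(float *h_data, int n, int iters)
    {
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);           // one upload

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        for (int it = 0; it < iters; ++it)
            step_kernel<<<blocks, threads>>>(d_data, n);  // no RAM<->device traffic

        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);           // one download
        cudaFree(d_data);
    }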

Two types of parallel computing

1. Task-based parallel processing

This mode splits a computation into several small but different tasks: some compute units are responsible for one part of the calculation, others for another, and so on. Chained together, these subtasks form a pipeline (an assembly line).

It is important to note that the throughput of a pipeline is limited by its least efficient (slowest) stage.
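
One way to build such a pipeline on the GPU is with CUDA streams, sketched below under some assumptions: process_chunk is a placeholder kernel, d_buf is a preallocated device buffer of chunks * chunk_len floats, and h_in should be pinned host memory (allocated with cudaMallocHost) for the copy and the compute to actually overlap.

    // Two-stage pipeline: while one chunk is being copied to the device
    // in one stream, the previous chunk is being processed in the other.
    __global__ void process_chunk(float *chunk, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) chunk[i] *= 2.0f;                // stand-in for the real work
    }

    void pipelined(const float *h_in, float *d_buf, int chunks, int chunk_len)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < chunks; ++c) {
            cudaStream_t st = s[c % 2];               // alternate between streams
            float *d_chunk = d_buf + c * chunk_len;
            cudaMemcpyAsync(d_chunk, h_in + c * chunk_len,
                            chunk_len * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            process_chunk<<<(chunk_len + 255) / 256, 256, 0, st>>>(d_chunk,
                                                                   chunk_len);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }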

2. Data-based parallel processing

This mode decomposes the data into multiple pieces, lets multiple compute units process the pieces in parallel, and then gathers the results together.

In general, multithreaded CPU programming leans toward the first mode, while GPU parallel programming leans toward the second.
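
The canonical data-parallel kernel is vector addition, sketched below: the data is split so that each thread owns exactly one element, and no gathering step is even needed because the outputs are independent.

    // c[i] = a[i] + b[i]: each thread handles one piece of the data.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }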

Common targets for parallel optimization

1. Loops

This is the most common pattern: each thread processes one element, or one group of elements, of a loop.

With this kind of optimization, be careful about dependencies between compute units, and in particular about each unit's dependence on its own result from the previous iteration.
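
A common idiom for "one thread, a set of elements" is the grid-stride loop, sketched below; it is safe here only because the iterations are independent of one another.

    // Each thread starts at its global index and strides by the total
    // number of threads in the grid, so any n is covered by any launch size.
    __global__ void scale(float *data, int n, float k)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            data[i] *= k;   // no iteration reads another iteration's result
        }
    }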

2. Fork/join pattern

In this pattern, most of the code is serial, but certain segments can be processed in parallel.

The typical case: an input queue is processed serially until it reaches a point where the items must be treated differently; the work is then split across multiple compute units (the fork), and the results are collected afterwards (the join).

This pattern is typically used when the amount of concurrency is not known in advance; it is closely related to "dynamic parallelism".
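
CUDA's dynamic parallelism (compute capability 3.5 and later, compiled with nvcc -rdc=true) fits this fork/join shape: a parent kernel forks child grids whose sizes are only known at run time. The kernel names and chunk layout below are illustrative.

    __global__ void child(float *chunk, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) chunk[i] += 1.0f;                // stand-in for per-item work
    }

    __global__ void parent(float *data, const int *chunk_len, int chunks)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int offset = 0;
            for (int c = 0; c < chunks; ++c) {        // fork: one child grid per chunk
                int len = chunk_len[c];
                child<<<(len + 255) / 256, 256>>>(data + offset, len);
                offset += len;
            }
            // join: all child grids are guaranteed to complete before
            // this parent grid finishes.
        }
    }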

3. Strip/block (tiling) pattern

For particularly large data sets, such as climate models, the data can be divided into strips or blocks that are computed in parallel.
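
A sketch of the idea, assuming a 2D field stored row-major on the device: the domain is cut into 16 x 16 tiles, one per thread block, and the 2D grid of blocks covers the whole field.

    // Apply a (placeholder) update to every cell of the field.
    __global__ void update_field(float *field, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            field[y * width + x] *= 0.99f;            // stand-in for the real stencil
    }

    // Launch with a 2D grid covering the whole domain:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // update_field<<<grid, block>>>(d_field, width, height);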

4. Divide and conquer

Most recursive algorithms, such as quicksort, can be transformed into an iterative form, which maps naturally onto the GPU programming model.

In particular, since GPUs of the Fermi and Kepler architectures support a call stack, a recursive algorithm can be ported to the GPU directly. For the sake of efficiency, however, it is better to convert it to an iterative form when development time permits.
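
Below is a hedged sketch of the recursion-to-iteration conversion for quicksort: the recursive calls are replaced by an explicit stack of (lo, hi) ranges. It is marked __host__ __device__ so the same code could run on the CPU or inside a kernel.

    #define STACK_MAX 64   // ~log2(n) entries suffice with the push order below

    __host__ __device__ void quicksort_iterative(int *a, int n)
    {
        int lo_stack[STACK_MAX], hi_stack[STACK_MAX];
        int top = 0;
        lo_stack[0] = 0;
        hi_stack[0] = n - 1;

        while (top >= 0) {
            int lo = lo_stack[top], hi = hi_stack[top]; --top;  // pop a range
            if (lo >= hi) continue;

            int pivot = a[hi], i = lo;           // Lomuto partition around a[hi]
            for (int j = lo; j < hi; ++j)
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; ++i; }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;

            // Push the larger half first so the smaller half is sorted next;
            // this keeps the explicit stack about log2(n) entries deep.
            if ((i - 1 - lo) > (hi - i - 1)) {
                ++top; lo_stack[top] = lo;    hi_stack[top] = i - 1;
                ++top; lo_stack[top] = i + 1; hi_stack[top] = hi;
            } else {
                ++top; lo_stack[top] = i + 1; hi_stack[top] = hi;
                ++top; lo_stack[top] = lo;    hi_stack[top] = i - 1;
            }
        }
    }

On the CPU this sorts an array in place; on the GPU, each thread could sort its own segment of a larger array.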
