2. Understanding parallel computing from the perspective of the GPU


Preface

This article looks at how parallel computing is implemented from the perspective of GPU programming.

Three important issues to be considered in parallel computing

1. Synchronization problems

In the operating systems course we learned about deadlocks between processes and about critical-resource problems caused by resource sharing. The same issues arise on the GPU: thousands of threads may update shared data at once, so conflicting accesses must be synchronized.
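
As a minimal sketch of the problem (the kernel and launch sizes here are illustrative, and unified-memory support is assumed), many threads updating one counter race unless the update is atomic:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread increments the same counter. A plain "*counter += 1"
    // is a read-modify-write race; atomicAdd serializes the updates.
    __global__ void count_kernel(int *counter)
    {
        atomicAdd(counter, 1);   // safe: the hardware orders conflicting updates
    }

    int main()
    {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int));   // assumes unified memory
        *counter = 0;

        count_kernel<<<64, 256>>>(counter);
        cudaDeviceSynchronize();

        printf("count = %d (expected %d)\n", *counter, 64 * 256);
        cudaFree(counter);
        return 0;
    }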

2. Concurrency

Some problems are "embarrassingly parallel", such as matrix multiplication. In this type of problem the output of each computing unit is independent of all the others, and such problems are easy to solve (often just by calling an existing library).
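
A minimal sketch of why matrix multiplication parallelizes so easily (sizes and names are illustrative): each output element depends only on one row of A and one column of B, so one thread can compute one element with no coordination at all:

    // Naive matrix multiply: one thread per output element.
    // Each C[row][col] is independent of every other output element.
    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;   // no other thread writes this element
        }
    }
    // Launch (illustrative): dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    //                        matmul<<<grid, block>>>(A, B, C, N);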

However, if there are dependencies between the computing units, the problem becomes complicated. In CUDA, intra-block communication is implemented through shared memory, while inter-block communication can only go through global memory.
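
A sketch of intra-block communication (names and the 256-thread block size are illustrative): each thread stages a value in shared memory, a __syncthreads() barrier makes the writes visible to the whole block, and then neighbors can read each other's values. The same exchange between blocks would need a round trip through global memory, typically across separate kernel launches:

    // Each thread publishes its value to the block through shared memory,
    // then reads its left neighbor's value. __syncthreads() is the barrier
    // that makes the staged writes visible block-wide.
    __global__ void neighbor_shift(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                  // one slot per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage in shared memory
        __syncthreads();                             // block-wide barrier

        int left = (threadIdx.x == 0) ? blockDim.x - 1 : threadIdx.x - 1;
        if (i < n)
            out[i] = tile[left];                     // read the neighbor's value
    }
    // Launch with 256 threads per block to match the tile size.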

The CUDA parallel programming architecture can be described with an analogy: a grid is like an army. The grid is divided into multiple blocks, which are like the departments of the army (such as the logistics department, the headquarters, and the communications department). Each block is in turn divided into multiple warps of threads, which are like the teams inside a department. The analogy makes the hierarchy easier to picture.
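
In code, a thread locates itself in this hierarchy through built-in index variables (a minimal sketch; device-side printf assumes compute capability 2.0 or later):

    #include <cstdio>

    // Grid = army, block = department, warp = team, thread = soldier.
    __global__ void who_am_i()
    {
        int global_id = blockIdx.x * blockDim.x + threadIdx.x; // unique in grid
        int warp_id   = threadIdx.x / warpSize;                // warp within block
        int lane_id   = threadIdx.x % warpSize;                // position in warp
        if (global_id == 0)
            printf("block %d, warp %d, lane %d\n", blockIdx.x, warp_id, lane_id);
    }
    // Launch (illustrative): who_am_i<<<4, 128>>>();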

3. Locality

Operating system courses also put heavy emphasis on locality. Simply put, recently accessed data is kept in the cache because it is likely to be accessed again soon (temporal locality), and data adjacent to it is kept as well because nearby addresses are likely to be accessed next (spatial locality).

In GPU programming, locality is just as important. It shows up in two ways: the data to be computed should be moved into video memory before the computation starts, and during iteration the transfers between host memory and device memory must be minimized. This is very important in real projects.

In GPU programming the program has to manage memory itself; in other words, it has to implement locality by hand.
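
A minimal sketch of the pattern (the kernel, names, and sizes are illustrative): one copy in, all iterations on the device, one copy out, instead of a round trip per iteration:

    #include <cuda_runtime.h>

    // Illustrative iteration step; the point is the transfer pattern around it.
    __global__ void step_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 0.5f * (data[i] + 1.0f);   // dummy update
    }

    int main()
    {
        const int n = 1 << 20, iterations = 100;
        float *h_data = new float[n]();
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // One transfer in ...
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // ... every iteration works on data that stays in video memory ...
        for (int it = 0; it < iterations; ++it)
            step_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // ... and one transfer out.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        delete[] h_data;
        return 0;
    }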

Two types of parallel computing

1. Task-based parallel processing

In this parallel mode, the computation is divided into several small but different tasks. For example, some computing units are responsible for acquiring the data, some for the computation itself, some for further stages, and so on; together, such a large task forms a pipeline.

Note that the throughput of a pipeline is bounded by its slowest stage.
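
On the GPU this style is commonly expressed with CUDA streams, which let the transfer stages of one chunk overlap with the compute stage of another. A minimal sketch, with an illustrative kernel and sizes (pinned host memory is what allows the copies to overlap):

    #include <cuda_runtime.h>

    __global__ void process(float *x, int n)           // illustrative compute stage
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main()
    {
        const int chunk = 1 << 20, num_chunks = 8;
        float *h_data, *d_buf[2];
        cudaMallocHost(&h_data, (size_t)chunk * num_chunks * sizeof(float)); // pinned
        for (int s = 0; s < 2; ++s) cudaMalloc(&d_buf[s], chunk * sizeof(float));

        cudaStream_t streams[2];
        for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

        // Two streams form a simple pipeline: while one chunk is being
        // computed, the other chunk's copies can proceed.
        for (int c = 0; c < num_chunks; ++c) {
            cudaStream_t s = streams[c % 2];
            float *d = d_buf[c % 2];
            cudaMemcpyAsync(d, h_data + (size_t)c * chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s);            // stage 1: load
            process<<<(chunk + 255) / 256, 256, 0, s>>>(d, chunk); // stage 2: compute
            cudaMemcpyAsync(h_data + (size_t)c * chunk, d, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s);            // stage 3: store
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; ++s) { cudaStreamDestroy(streams[s]); cudaFree(d_buf[s]); }
        cudaFreeHost(h_data);
        return 0;
    }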

2. Data-based parallel processing

In this parallel mode, the data is divided into multiple parts, multiple computing units compute on these pieces separately, and the partial results are then combined.
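
A sketch of the style (names illustrative): every unit runs the same code, and the data is split implicitly by thread index:

    // Data parallelism: the same operation applied to different pieces
    // of the data, one element per thread.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's piece
        if (i < n)
            c[i] = a[i] + b[i];
    }
    // Launch (illustrative): vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);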

In general, multi-threaded CPU programming tends toward the first mode, while GPU parallel programming tends toward the second.

Common targets of parallel optimization

1. Loops

This is the most common pattern: each thread processes one element, or one group of elements, of a loop.

With this kind of optimization, be careful about dependencies between computing units and about each unit depending on its own results from previous iterations.
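
As a sketch, an iteration-independent loop maps directly onto threads, while a loop that reads its own previous result does not (the grid-stride form below is a common CUDA idiom):

    // Parallelizes cleanly: iteration i touches only element i.
    //     for (i = 0; i < n; ++i) out[i] = 2.0f * in[i];
    __global__ void scale(const float *in, float *out, int n)
    {
        // Grid-stride loop: each thread handles every (gridDim*blockDim)-th element.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            out[i] = 2.0f * in[i];
    }

    // Does NOT parallelize as-is: iteration i depends on iteration i-1.
    //     for (i = 1; i < n; ++i) out[i] = out[i - 1] + in[i];
    // (this is a prefix sum, which needs a restructured parallel algorithm)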

2. Fork/gather mode

In this mode, most of the code is serial, but certain sections of it can be processed in parallel.

A typical scenario: when the serial processing reaches a certain point, different parts of an input queue need to be handled differently. The queue can then be split across multiple computing units for processing (the fork), after which the results are combined (the gather).

This mode is often used when the amount of concurrency is not known in advance; it exhibits "dynamic concurrency".
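
A sketch of the gather side (the predicate and names are illustrative): forked threads inspect the queue in parallel, and because the number of qualifying items is not known in advance, an atomic counter hands out output slots as results are gathered:

    // Fork: every thread inspects one queue entry in parallel.
    // Gather: qualifying entries are collected into a compact output
    // array; an atomic counter hands out output slots dynamically.
    __global__ void filter(const int *queue, int n, int *out, int *out_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && queue[i] > 0) {                // illustrative predicate
            int slot = atomicAdd(out_count, 1);     // claim the next free slot
            out[slot] = queue[i];
        }
    }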

3. Sharding/blocking mode

For extremely large data sets (such as climate models), the data can be divided into blocks that are computed in parallel.
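
When a data set also exceeds video memory, the host can stream it block by block through one reusable device buffer. A fragment-style sketch, assuming d_block, h_data, total, block_elems, and a process kernel are defined elsewhere:

    // The data set is larger than video memory, so it is pushed through
    // the device one block at a time via a single reusable buffer.
    for (size_t off = 0; off < total; off += block_elems) {
        size_t count = (total - off < block_elems) ? total - off : block_elems;
        cudaMemcpy(d_block, h_data + off, count * sizeof(float),
                   cudaMemcpyHostToDevice);
        process<<<(count + 255) / 256, 256>>>(d_block, (int)count);
        cudaMemcpy(h_data + off, d_block, count * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }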

4. Divide and conquer

The vast majority of recursive algorithms, such as quicksort, can be converted to iterative form, and iterative models can in turn be mapped onto the GPU programming model.

In particular, although GPUs of the Fermi and Kepler architectures support a device-side call stack, so a recursive model can be carried over to the GPU directly, for the sake of efficiency it is better to convert it into an iterative model when development time permits.
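
Parallel reduction is the classic example of divide and conquer written iteratively on the GPU: instead of recursing, each pass folds the data in half (a minimal sketch, assuming a power-of-two block size of 256):

    // Divide and conquer as iteration: the block's data is repeatedly
    // folded in half until one partial sum per block remains.
    __global__ void block_sum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Iterative halving replaces the recursive "divide" step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];   // one partial result per block
    }
    // Launch with 256 threads per block; repeat on the partial sums
    // (or finish on the CPU) until a single value remains.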

 

      
