Rapid Development of CUDA Programs


Based on years of CUDA development experience, this article briefly introduces the general steps for developing a CUDA program. The guiding principle is to first modify the CPU serial program and only then port it to the GPU platform: as much as possible of the work destined for the GPU is prepared in advance on the CPU platform, which reduces the difficulty of development and of debugging. This fast and effective method improves the efficiency of developing CUDA parallel programs and reduces both the development cycle and the difficulty of CUDA parallel programming.

(1) CPU Serial Program Analysis

For a CPU serial program, first identify the hotspot functions in the program through testing, and then analyze the parallelism of those functions:

A) Hotspot Test

Hotspot functions are identified from the timing results; they become the key code modules to be ported later.
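
As a minimal sketch of such a test (the stage functions here are hypothetical stand-ins for real program modules), each candidate function can simply be timed on the CPU and the slowest one taken as the hotspot:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical hotspot candidates; replace with the real functions. */
    static void stage1(float *a, int n) { for (int i = 0; i < n; i++) a[i] *= 2.0f; }
    static void stage2(float *a, int n) { for (int i = 0; i < n; i++) a[i] += 1.0f; }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    int main(void) {
        enum { N = 1 << 20 };
        static float data[N];
        double t0 = seconds();
        stage1(data, N);
        double t1 = seconds();
        stage2(data, N);
        double t2 = seconds();
        printf("stage1: %.3f ms\n", (t1 - t0) * 1e3);   /* the larger time marks the hotspot */
        printf("stage2: %.3f ms\n", (t2 - t1) * 1e3);
        return 0;
    }

Profilers such as gprof serve the same purpose on larger programs.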

B) Parallelism Analysis

After finding the hotspot code, analyze the algorithm and the data characteristics of the hotspot, and determine from them whether the algorithm can be parallelized and whether it is suitable for fine-grained parallel processing.
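
For instance (a schematic sketch, not code from any particular program), a loop whose iterations are independent maps naturally onto one GPU thread per iteration, while a loop that carries a dependency from one iteration to the next does not:

    /* Independent iterations: each i can become one GPU thread. */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Loop-carried dependency: iteration i reads the result of i-1, so
       this form cannot be mapped one-thread-per-iteration as-is; it
       needs a different parallel algorithm (e.g., a parallel scan). */
    for (int i = 1; i < n; i++)
        s[i] = s[i - 1] + a[i];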

C) Determine the Arrays Used by the CUDA Kernel

Based on the analysis of the serial program, determine which modules need to be ported to the GPU platform, and analyze the data used by the code that will run on the GPU: determine which arrays the CUDA kernel needs, the direction in which each array is transferred (CPU-to-GPU or GPU-to-CPU), and the amount of data moved in each transfer. Then design the definition and size of these arrays.
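
A minimal host-side sketch of the outcome of this analysis, assuming one input array h_a that flows CPU-to-GPU and one result array h_b that flows GPU-to-CPU (all names are hypothetical):

    float *d_a, *d_b;                      /* device-side copies of the arrays */
    size_t bytes = n * sizeof(float);      /* data size per transfer */
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   /* CPU-to-GPU input */
    /* ... kernel reads d_a and writes d_b ... */
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);   /* GPU-to-CPU result */
    cudaFree(d_a);
    cudaFree(d_b);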

(2) A CUDA-like Serial CPU Program

CUDA programs are more complex than CPU programs, and when a bug occurs, debugging is much more difficult. To reduce the difficulty and shorten the cycle of CUDA program development, part of the GPU porting work can be done in advance on the CPU platform. This involves the following aspects:

A) Modification to a Parallel Algorithm

In a CPU serial program, some code is parallelizable in theory, but after CPU-oriented optimization it can no longer be parallelized directly. In this case, modify the original program according to the requirements of the parallel algorithm so that it returns to a parallelizable form. Other modules are parallelizable in theory, but their serial algorithms cannot be parallelized directly; for these, a parallel algorithm has to be redesigned.
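
One common illustration of such a redesign (a Jacobi-style rewrite, offered here as an assumed example rather than a rule): an in-place update whose iterations depend on one another is replaced by a two-buffer update whose iterations are all independent:

    /* Serial form: each element depends on the one just updated,
       so the iterations cannot run concurrently. */
    for (int i = 1; i < n - 1; i++)
        x[i] = 0.5f * (x[i - 1] + x[i + 1]);

    /* Parallel-friendly form: read the old buffer, write a new one,
       so every i is independent and can become one GPU thread. */
    for (int i = 1; i < n - 1; i++)
        x_new[i] = 0.5f * (x_old[i - 1] + x_old[i + 1]);

Note that the two forms compute different (though related) iterations, which is exactly why such a change should be validated on the CPU first.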

B) Array Modification

The array definitions used in the CPU serial program may not be directly usable in the CUDA kernel, and in that case they must be modified. For example, in a C program, a pointer inside a struct must be changed into a separate pointer/array before the data can be transferred between the CPU and GPU. In addition, to obtain coalesced access to global memory, it is sometimes necessary to change the access direction of an array and therefore its definition (such as swapping rows and columns). In short, wherever CUDA's array usage differs from that of the CPU serial program, modify the arrays in advance to make the program easier to debug.
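
A small sketch of both kinds of change (the struct and array names are hypothetical):

    /* Before (CPU version): the payload hides behind a pointer inside a
       struct, which cannot be moved to the GPU with a single cudaMemcpy. */
    struct mesh {
        int n;
        float *values;    /* separately allocated */
    };

    /* After (CUDA-like version): the payload is a separate flat array,
       passed alongside its size. */
    float *values;        /* length n */

    /* Access-direction change: storing a matrix transposed so that
       consecutive threads read consecutive addresses (coalescing).
       a[row * ncols + col]   ->   a[col * nrows + row]            */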

With these modifications, the original CPU program becomes a CUDA-like serial CPU program, which completes much of the preparation for, and greatly eases, the subsequent porting to CUDA.

(3) GPU Array Design and Parallel Model Design

GPU array design: design the size, type, dimension, and other attributes of each GPU array;

design the communication pattern of each array between the CPU and the GPU.

Parallel model design: choose the block and grid configuration so that it matches the data characteristics of the algorithm.
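
A typical sketch of the block/grid part of this design (a 1-D layout with 256 threads per block; both numbers are assumptions, not rules):

    int n = 1000000;                          /* problem size */
    dim3 block(256);                          /* threads per block */
    dim3 grid((n + block.x - 1) / block.x);   /* ceiling division: enough blocks to cover n */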

(4) Basic Version of the CUDA Parallel Program

Based on the analysis of the original program's arrays, the CPU serial program is ported to the GPU platform, and the CUDA kernel code is implemented from the algorithm and code of the hotspot module.

A) Design the Call Statement

kernel<<<grid, block, …>>>(...);

B) Design the CUDA Kernel

Based on the parallelism analysis of the algorithm, design the kernel: divide the computing tasks among the threads, and use synchronization statements where needed to guarantee the logical correctness of the kernel.
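
A minimal but complete sketch combining A) and B), using vector addition as a stand-in hotspot (all names are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each thread computes one element; the bounds check handles the
       last, partially filled block. */
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        vec_add<<<grid, block>>>(d_a, d_b, d_c, n);   /* the call statement from A) */

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);                /* expect 3.0 */

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }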

(5) Optimized Version of the CUDA Parallel Program

Starting from the basic version of the CUDA parallel program implemented in step (4), CUDA optimization techniques are applied to further improve performance. Optimization falls mainly into two areas: communication optimization and kernel optimization.

A) CPU-GPU Communication Optimization

GPU computing requires data transfers between the CPU and the GPU, and reasonable use of communication optimization techniques can improve the performance of GPU parallel programs; for example, CUDA streams can be used to overlap CPU-GPU communication with computation.
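
A sketch of the stream technique (my_kernel, grid, and block are placeholders; asynchronous copies also require page-locked host memory, allocated here with cudaMallocHost):

    /* Split the data into two chunks and use two streams so that the
       host-to-device copy of one chunk overlaps the kernel working
       on the other chunk. */
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);   /* pinned memory, needed for async copies */
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int m = n / 2;                            /* elements per chunk */
    for (int k = 0; k < 2; k++) {
        cudaMemcpyAsync(d_buf + k * m, h_buf + k * m, m * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        my_kernel<<<grid, block, 0, s[k]>>>(d_buf + k * m, m);
    }
    cudaDeviceSynchronize();                  /* wait for both streams to finish */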

B) CUDA Kernel Optimization

Optimization of the CUDA kernel matters most for its performance, and mainly involves memory access optimization and instruction-level optimization. Memory access optimization includes coalesced access to global memory, and the use of shared memory, constant memory, and texture memory in place of global memory to increase access speed. Instruction-level optimization means replacing inefficient instructions with efficient ones, such as CUDA's fast math functions.
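
Two small illustrations (both kernels are hypothetical examples; the first assumes a block size of 256). The first stages data in shared memory so that each global value is read only once per block; the second replaces expf with the faster, lower-precision __expf intrinsic:

    /* Shared-memory staging for a 3-point average: neighbouring threads
       reuse each other's loads instead of re-reading global memory. */
    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float tile[258];            /* blockDim.x (assumed 256) + 2 halo cells */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = (i < n) ? in[i] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();                       /* whole tile visible to all threads */
        if (i < n)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

    /* Instruction-level change: __expf trades a little precision for speed. */
    __global__ void activate(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = 1.0f / (1.0f + __expf(-x[i]));   /* fast sigmoid */
    }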


The above is only a rough introduction, and some points are inevitably incomplete. I hope it is helpful for your development.
