Six Typical GPU Parallel Optimization Strategies


Preface

How to optimize an existing program for parallel execution is the most important practical issue in GPU parallel programming. This article offers several optimization ideas to point the way for parallel program optimization.

Preparation before optimization

First, we need to clarify the goal of optimization: does the program need to run 2x faster? 10x? 100x? You may not have given this much thought, assuming that the bigger the speedup, the better.

However, optimization has a cost. On the same hardware, a 2x speedup may take only an afternoon of work, while a 10x speedup involves many more considerations and perhaps a week of work; 100x or 1000x costs correspondingly more time.

Then, we need to decompose the problem. Generally, the data set is decomposed first and the task second. Here we analyze the data set (a matrix, for example), find the correspondence between the input and output sets, and assign the work to each block and each thread.
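As a minimal sketch of such a decomposition, consider an N x N matrix where each output element depends only on the corresponding input element. The kernel and variable names below (scale_matrix, d_in, d_out) are illustrative, not from the original article:

    // One thread per output element: the block/thread indices encode
    // the (row, column) correspondence between input and output sets.
    __global__ void scale_matrix(float *out, const float *in, float s, int n)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < n && col < n)
            out[row * n + col] = s * in[row * n + col];
    }

    // Host side: one 16x16 thread block per 16x16 tile of the matrix.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    scale_matrix<<<grid, block>>>(d_out, d_in, 2.0f, n);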

Strategy 1: Identify the bottleneck in the code

One way to find a program's efficiency bottleneck is to analyze the code yourself. This works well for programs with a simple structure, but for the complex projects encountered in practice, human analysis often reaches the wrong conclusion: you may go to great lengths to identify the bottleneck and optimize it, only to improve efficiency by 1%.

Therefore, a more effective method is to let a profiling tool identify the bottleneck. You can use the CUDA profiler or Parallel Nsight.

For details about how to use Parallel Nsight to analyze parallel programs, refer to my article (in preparation...).

Another thing to note: while the GPU is processing data, the CPU can do something else, such as fetching data from the server. Combining CPU parallelism with GPU parallelism in this way naturally improves program efficiency considerably.
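A sketch of what this overlap can look like: a kernel launch returns control to the CPU immediately, so the host is free until it explicitly synchronizes. Here my_kernel and fetch_next_batch are hypothetical stand-ins for the GPU task and the server fetch:

    my_kernel<<<grid, block>>>(d_data, n);  // asynchronous: returns to the CPU at once

    fetch_next_batch(h_next);               // CPU-side work (e.g., pull data from the server)

    cudaDeviceSynchronize();                // block only when the GPU result is needed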

Strategy 2: Make proper use of memory

First, you must flexibly use the various memory types on the video card, such as shared memory and constant memory. Pay special attention to shared memory, whose speed is close to that of a first-level cache.
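One common shared-memory pattern, sketched below under the assumption of 256 threads per block: each block stages its slice of global data in shared memory once, then performs many fast on-chip accesses (here, a block-wide tree reduction):

    __global__ void block_sum(const float *in, float *block_out, int n)
    {
        __shared__ float tile[256];                  // one slot per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // single global read per thread
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];  // on-chip, near-L1 speed
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_out[blockIdx.x] = tile[0];         // one partial sum per block
    }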

In addition, fuse multiple kernel functions when appropriate. This avoids the data-transfer overhead of launching a new kernel and lets you reuse useful data left behind by the previous stage. However, if you fuse kernels written by others, pay attention to the implicit synchronization you are removing: normally, the next kernel only starts executing after the previous kernel's code has fully completed.
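A small illustration of the trade-off, using two hypothetical element-wise stages. Unfused, stage_b cannot start until stage_a has fully finished, and the intermediate array makes a round trip through global memory; fused, one launch suffices and the intermediate stays in a register:

    __global__ void stage_a(float *y, const float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = x[i] * x[i];
    }
    __global__ void stage_b(float *z, const float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = y[i] + 1.0f;
    }

    // Fused version: one launch, no intermediate traffic to global memory.
    __global__ void stage_ab(float *z, const float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float y = x[i] * x[i];   // never leaves the register file
            z[i] = y + 1.0f;
        }
    }

Fusion is only safe here because each element depends on nothing but itself; if stage_b read other blocks' outputs of stage_a, the implicit synchronization between launches would be doing real work.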

Then, memory accesses should be coalesced. Allocate with the cudaMalloc function whenever possible, and arrange the data so that each access by a warp covers a contiguous 128-byte segment; this way the video card's memory bandwidth is fully utilized.
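The difference is easiest to see in a sketch. In the first kernel, the 32 threads of a warp read 32 consecutive floats (128 bytes), which the hardware can service as one wide transaction; in the second, a stride scatters the warp across many segments:

    __global__ void copy_coalesced(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];      // warp touches one contiguous 128-byte segment
    }

    // Anti-pattern: each lane of the warp lands in a different segment.
    __global__ void copy_strided(float *out, const float *in, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }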

Strategy 3: Optimize the transfer process

As mentioned in the previous article, data exchange between host memory and video card memory is very time-consuming.

To solve this problem, we can use page-locked (pinned) host memory. Page-locked memory is a host region whose transfers to and from the video card require no CPU intervention. If a region is not declared page-locked, then before data can be transferred between it and the card's memory, locking operations must first take place (marking the region's memory as in transfer so the CPU does not disturb it).

Page-locked memory is allocated by calling the cudaHostAlloc function. This function does more than simply declare page-locked memory: through its flag parameters it enables several very practical features. I personally recommend it.
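A minimal sketch of the call, assuming a device buffer d_buf, sizes bytes and n, and a kernel my_kernel already exist. Pinned buffers are also what makes cudaMemcpyAsync genuinely asynchronous, so the copy can be queued on a stream ahead of the kernel:

    float *h_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault); // page-locked host memory

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    my_kernel<<<grid, block, 0, stream>>>(d_buf, n);             // queued behind the copy

    cudaStreamSynchronize(stream);
    cudaFreeHost(h_buf);                                         // pinned memory needs cudaFreeHost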

Zero-copy memory also deserves a recommendation. It is a special kind of page-locked memory with a special memory mapping: it lets you map host memory into the GPU's address space. If your program is compute-intensive, this mechanism is very useful, because it automatically overlaps data transfer with computation. For more information, see my article.
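A sketch of setting it up (names illustrative; note that cudaSetDeviceFlags must be called before the CUDA context is created):

    float *h_data, *d_alias;
    cudaSetDeviceFlags(cudaDeviceMapHost);                  // enable mapped memory first
    cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0); // device-side view of h_data

    my_kernel<<<grid, block>>>(d_alias, n);  // reads stream over the bus during the kernel
    cudaDeviceSynchronize();
    cudaFreeHost(h_data);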

Strategy 4: Optimize the thread structure layout

Design a sensible computation grid: choose appropriate dimensions, block counts, and threads per block so that memory accesses coalesce as far as possible, ensuring maximum memory bandwidth.

You must learn to use multi-dimensional computation grids flexibly, instead of sticking to one dimension. For more information about how to use a multi-dimensional computation grid, see my article.

In particular, when the number of blocks along a single dimension is limited, a multi-dimensional grid must be considered, as the sketch below shows.
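For instance, on hardware that caps a grid dimension at 65,535 blocks, a one-dimensional problem larger than that can be folded onto a 2-D grid; the kernel below (illustrative names) flattens the two block indices back into a single global index:

    __global__ void big_kernel(float *data, size_t n)
    {
        size_t block_id = (size_t)blockIdx.y * gridDim.x + blockIdx.x; // flatten the 2-D grid
        size_t i = block_id * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    dim3 grid(65535, 16);                 // ~1M blocks, far beyond a single dimension's limit
    big_kernel<<<grid, 256>>>(d_data, n);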

Strategy 5: Perform task-level decomposition on the algorithm itself

Separate the algorithm's steps into unrelated parts: use GPU parallelism within each step, and CPU parallelism across the steps, as sketched below.
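One way to express this, using two hypothetical independent stages: each runs in its own CUDA stream so the GPU may overlap them, while the host thread stays free for a CPU-parallel step:

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    stage_one<<<grid, block, 0, s0>>>(d_a, n);   // independent part 1 on the GPU
    stage_two<<<grid, block, 0, s1>>>(d_b, n);   // independent part 2, may overlap part 1

    cpu_side_work();                             // unrelated step handled by the CPU

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);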

Strategy 6: Make flexible use of CUDA C's libraries and APIs

CUDA C provides many practical APIs and a considerable amount of C++ support (though not all of it), which can greatly improve development efficiency. The atomic operation functions are one example.
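A sketch of one such API in use: atomic operations make concurrent updates to the same address safe, as in this histogram where many threads may hit the same bin simultaneously:

    __global__ void histogram(unsigned int *bins, const unsigned char *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // atomic read-modify-write on global memory
    }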

CUDA also ships many practical libraries, such as cuBLAS and cuSPARSE, which are not described here. The Thrust library in particular is essentially a parallel implementation of the STL and is very convenient to use directly.
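As a taste of how STL-like Thrust feels (h_vec stands in for a hypothetical host-side std::vector<float>):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    thrust::device_vector<float> v(h_vec.begin(), h_vec.end()); // copy host data to the GPU
    thrust::sort(v.begin(), v.end());                           // parallel sort on the device
    float total = thrust::reduce(v.begin(), v.end(), 0.0f);     // parallel sum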

Summary

These optimization ideas can fairly be called the core and key of CUDA parallel programming.

This article has only presented the overall optimization strategies and ideas; for concrete implementation methods, please consult the relevant materials.
