10 major considerations for high-performance GPGPU OpenCL/CUDA programming

Source: Internet
Author: User

1. Unroll loops

If the number of loop iterations is known in advance, you can unroll the loop, which reduces how many times the loop condition is evaluated. Be careful, though, not to let unrolling make the kernel code too large.

Loop-unrolling example:

#include <iostream>
using namespace std;

int main () {
    // Plain loop: the condition i <= 100 is evaluated 100 times.
    int sum = 0;
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }

    // Unrolled by a factor of 5: the condition is evaluated only 20 times.
    sum = 0;
    for (int i = 1; i <= 100; i = i + 5) {
        sum += i;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
        sum += i + 4;
    }
    return 0;
}

2. Avoid denormalized (subnormal) numbers

Denormalized (subnormal) numbers are nonzero floating-point values smaller than the smallest value representable with the minimum normalized exponent. Because a computer has only a limited number of bits, the range and precision of floating-point data cannot be unlimited. (See the IEEE 754 standard: http://zh.wikipedia.org/zh-cn/IEEE_754)

Operating on denormalized numbers in OpenCL can be very time-consuming, because the hardware may fall back to a much slower path to handle them.

If flushing denormals to zero (and any divide-by-zero behavior that may result) does not affect your kernel, you can add -cl-denorms-are-zero to the build options, for example:

clBuildProgram(program, 0, NULL, "-cl-denorms-are-zero", NULL, NULL);

3. Pass constant basic-type data to the kernel through compiler options instead of private memory

If your program needs to pass constant basic-type data to a kernel, it is best to use compiler options such as macro definitions, rather than having each work-item define a private-memory variable. This lets the compiler substitute the value directly at compile time, so no new variable is defined and space is saved.

As shown in the following code (DMacro.cpp):

#include <stdio.h>

int main ()
{
    int a = size;    // 'size' is defined by the compiler option -Dsize=128
    printf ("a=%d, size=%d\n", a, size);
    return 0;
}

Compile:

g++ -Dsize=128 -o a DMacro.cpp

4. When sharing is not needed, keep data in private memory instead of local memory

Work-items access private memory faster than local memory, so data that does not need to be shared can be kept in private memory. Of course, when private memory (registers) runs out, the GPU hardware automatically spills the overflow to slower memory.
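A hypothetical OpenCL C kernel fragment (the kernel names and signatures are illustrative, not from the original article) contrasting the two choices:

```
// Private memory: fastest, per work-item; use it when no sharing is needed.
__kernel void sum_private(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    float acc = 0.0f;                  // 'acc' lives in private memory
    for (int i = 0; i < 4; ++i)
        acc += in[gid * 4 + i];
    out[gid] = acc;
}

// Local memory: use it only when work-items in a group must share data.
__kernel void sum_local(__global const float *in, __global float *out,
                        __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];   // shared within the work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0) {
        float s = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); ++i)
            s += scratch[i];
        out[get_group_id(0)] = s;
    }
}
```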

5. Avoid bank conflicts when accessing local memory

Local memory is organized into banks that can be accessed independently of each other, with successive 32-bit words stored in successive banks.

(1) If multiple work-items access consecutive local-memory data, reads and writes achieve maximum parallelism.

(2) If multiple work-items access different data in the same bank, the accesses must execute sequentially, which seriously reduces memory parallelism. The layout of data in local memory should therefore be arranged with this in mind.

(3) As a special case, if the threads in a wavefront/warp read the same local-memory address at the same time, the value is broadcast; this does not count as a bank conflict.
