10 major considerations for high-performance GPGPU OpenCL/CUDA programming

Source: Internet
Author: User

1. Unroll loops

If the number of loop iterations is known in advance, you can unroll the loop, which reduces how many times the loop condition is evaluated. Be careful, though, not to let unrolling make the kernel code too large.

Loop-unrolling example:

#include <iostream>
using namespace std;

int main () {
    // Plain loop: the condition i <= 100 is evaluated 100 times.
    int sum = 0;
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }

    // Unrolled by a factor of 5: the condition is evaluated only 20 times.
    sum = 0;
    for (int i = 1; i <= 100; i = i + 5) {
        sum += i;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
        sum += i + 4;
    }
    return 0;
}

2. Avoid denormalized (subnormal) numbers

Denormalized (subnormal) numbers are nonzero floating-point values smaller than the smallest value representable with the minimum normalized exponent. Because a computer has only a limited number of bits, the range and precision of floating-point data cannot be unlimited. (See the IEEE 754 standard: http://zh.wikipedia.org/zh-cn/IEEE_754)

Operating on denormalized numbers in OpenCL can be very time-consuming, because the hardware may fall back to a much slower path to handle them.

If flushing denormals to zero (and any divide-by-zero behavior that may result) does not affect your kernel, you can add -cl-denorms-are-zero to the build options, for example:

clBuildProgram(program, 0, NULL, "-cl-denorms-are-zero", NULL, NULL);

3. Pass constant basic-type data to the kernel through compiler options instead of private memory

If your program needs to pass constant basic-type data to a kernel, it is best to use compiler options such as macro definitions, rather than having each work-item define a private-memory variable. This lets the compiler substitute the value directly at compile time, so no new variable is defined and space is saved.

As shown in the following code (DMacro.cpp):

#include <stdio.h>

int main ()
{
    int a = size;    // 'size' is defined by the compiler option -Dsize=128
    printf ("a=%d, size=%d\n", a, size);
    return 0;
}

Compile:

g++ -Dsize=128 -o a DMacro.cpp

4. When sharing is not needed, keep data in private memory instead of local memory

Work-items access private memory faster than local memory, so data that does not need to be shared can be kept in private memory. Of course, when private memory (registers) runs out, the GPU hardware automatically spills the overflow to slower memory.
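A hypothetical OpenCL C kernel fragment (the kernel names and signatures are illustrative, not from the original article) contrasting the two choices:

```
// Private memory: fastest, per work-item; use it when no sharing is needed.
__kernel void sum_private(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    float acc = 0.0f;                  // 'acc' lives in private memory
    for (int i = 0; i < 4; ++i)
        acc += in[gid * 4 + i];
    out[gid] = acc;
}

// Local memory: use it only when work-items in a group must share data.
__kernel void sum_local(__global const float *in, __global float *out,
                        __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];   // shared within the work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0) {
        float s = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); ++i)
            s += scratch[i];
        out[get_group_id(0)] = s;
    }
}
```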

5. Avoid bank conflicts when accessing local memory

Local memory is organized into banks that can be accessed independently of each other, with successive 32-bit words stored in successive banks.

(1) If multiple work-items access consecutive local-memory data, reads and writes achieve maximum parallelism.

(2) If multiple work-items access different data in the same bank, the accesses must execute sequentially, which seriously reduces memory parallelism. The layout of data in local memory should therefore be arranged with this in mind.

(3) As a special case, if the threads in a wavefront/warp read the same local-memory address at the same time, the value is broadcast; this does not count as a bank conflict.
