CUDA programming FAQs

Source: Internet
Author: User
http://blog.csdn.net/yutianzuijin/article/details/8147912

Recently I tried CUDA programming for the first time. As a newcomer I ran into all kinds of problems and spent a lot of time tracking down these baffling errors. To save others from repeating the same mistakes, I summarize the problems I encountered below.

(1) cudaMalloc

The first time I used this function it seemed straightforward, much like malloc in C. However, in one particular application it caused a hard-to-find error that cost a lot of time. Note that this function allocates memory in units of bytes, so you must multiply the element count by the sizeof of the element type to allocate the right amount of space. Example:

    cudaMalloc((void**)&gpu_data, sizeof(float) * 1024);
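A minimal allocation sketch with error checking, assuming a hypothetical float array of 1024 elements (the names gpu_data and host_data are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1024;
        float host_data[1024] = {0};
        float *gpu_data = NULL;

        // The size argument is in bytes, so multiply the element count by sizeof(float).
        cudaError_t err = cudaMalloc((void**)&gpu_data, sizeof(float) * n);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Copy host data to the GPU and back (see section (2) on memory spaces).
        cudaMemcpy(gpu_data, host_data, sizeof(float) * n, cudaMemcpyHostToDevice);
        cudaMemcpy(host_data, gpu_data, sizeof(float) * n, cudaMemcpyDeviceToHost);

        cudaFree(gpu_data); // see section (7): always release GPU memory
        return 0;
    }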

(2) Function execution location

A defining feature of CUDA is that the core of a program executes on the GPU, so CUDA functions fall into three categories: host, global, and device. When writing a function you must be clear about which category it belongs to and which functions it may call (a sketch follows the list below).

 

  • Host function: called on the CPU and executed on the CPU. It can call global functions (and runtime API functions such as cudaMalloc), but it cannot call device functions;
  • Global function (kernel): can only be called from host code, but executes on the GPU. It can call device functions;
  • Device function: called and executed on the GPU; it can only be called from global or device functions.
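A minimal sketch of the three categories and their legal call directions (the function names here are illustrative, not from the original post):

    #include <cuda_runtime.h>

    // Device function: runs on the GPU, callable only from __global__ or __device__ code.
    __device__ float square(float x)
    {
        return x * x;
    }

    // Global function (kernel): launched from the host, executes on the GPU,
    // and may call device functions.
    __global__ void square_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    // Host function: runs on the CPU; it launches the kernel and uses the
    // runtime API (cudaMalloc/cudaMemcpy), but cannot call square() directly.
    void run(float *host_data, int n)
    {
        float *gpu_data;
        cudaMalloc((void**)&gpu_data, sizeof(float) * n);
        cudaMemcpy(gpu_data, host_data, sizeof(float) * n, cudaMemcpyHostToDevice);
        square_kernel<<<(n + 255) / 256, 256>>>(gpu_data, n);
        cudaMemcpy(host_data, gpu_data, sizeof(float) * n, cudaMemcpyDeviceToHost);
        cudaFree(gpu_data);
    }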
The most common category error is confusing CPU and GPU memory. Remember: the memory that can be used directly inside a host function is CPU memory, and data residing on the GPU must be brought into CPU memory with cudaMemcpy before the host can read it; the memory used inside global and device functions lives in GPU memory space and must be allocated (e.g. with cudaMalloc) before use.

(3) Shared memory

Shared memory is an important tool for improving program performance, and using it well is a key part of mastering CUDA programming. Here I only want to stress one point: shared memory is not initialized! Below is an array-summation kernel I wrote that uses shared memory:
    __device__ int count = 0;

    __global__ static void sum(int *data_gpu, int *block_gpu, int *sum_gpu, int length)
    {
        // Dynamically sized shared memory; the size is supplied at kernel launch.
        extern __shared__ int blocksum[];
        __shared__ int islast;
        int offset;

        const int tid = threadIdx.x;
        const int bid = blockIdx.x;

        // Shared memory is NOT zero-initialized: do it explicitly.
        blocksum[tid] = 0;
        for (int i = bid * thread_num + tid; i < length; i += block_num * thread_num)
        {
            blocksum[tid] += data_gpu[i];
        }
        __syncthreads();

        // Parallel reduction within the block.
        offset = thread_num / 2;
        while (offset > 0)
        {
            if (tid < offset)
            {
                blocksum[tid] += blocksum[tid + offset];
            }
            offset >>= 1;
            __syncthreads();
        }

        if (tid == 0)
        {
            block_gpu[bid] = blocksum[0];
            __threadfence();
            int value = atomicAdd(&count, 1);
            // The last block to finish adds up the per-block results.
            islast = (value == gridDim.x - 1);
        }
        __syncthreads();

        if (islast)
        {
            if (tid == 0)
            {
                int s = 0;
                for (int i = 0; i < block_num; i++)
                {
                    s += block_gpu[i];
                }
                *sum_gpu = s;
            }
        }
    }
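Because blocksum is declared extern __shared__, its size must be supplied as the third parameter of the launch configuration. A minimal host-side sketch, assuming thread_num and block_num are macros defined before the kernel (the original post does not show their values):

    #define thread_num 256   // assumed value; must be defined before the kernel
    #define block_num 64     // assumed value

    // The third launch parameter gives the size of the extern __shared__ array.
    sum<<<block_num, thread_num, thread_num * sizeof(int)>>>(data_gpu, block_gpu, sum_gpu, length);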
Pay special attention to the initialization blocksum[tid] = 0; at the top of the kernel: if you do not initialize the shared memory you are about to accumulate into, you will not get the correct result.

(4) Atomic functions

When calling an atomic function you must tell the compiler the compute capability of the current video card; otherwise an error such as "atomicAdd is undefined" is reported. On Linux the solution is to pass a compute-capability option to the nvcc compiler when compiling the source code. For example, if the compute capability is 1.3, adding the parameter -arch sm_13 lets the program compile.
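For example, assuming the summation kernel above is saved in a file named sum.cu (a hypothetical file name), it could be compiled like this:

    nvcc -arch sm_13 sum.cu -o sum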
(5) CUDA is a mix of C and C++

Many CUDA reference books describe CUDA as C with extensions, so it is easy to assume at first that C syntax is all there is. That is a misunderstanding: CUDA actually mixes C and C++, and sometimes the C++ syntax is more convenient (a sketch follows this list):

  • Variables can be defined inside a for loop, which standard C89 does not allow, so we can write for (int i = 0; i < length; i++) directly; this can save a register;
  • Variable definitions are not restricted to the top of a block; a variable can be defined at any position;
  • CUDA supports function overloading, so we can define several functions with the same name but different parameters without any problem;
  • Sometimes templates can be used to merge duplicated code and simplify programming, as in the sketch below.
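A brief illustration of these C++ features in a CUDA kernel (a minimal sketch; the kernel and names are illustrative, not from the original post):

    // Overloading: same name, different parameter types.
    __device__ float clampval(float x) { return x < 0.0f ? 0.0f : x; }
    __device__ int   clampval(int x)   { return x < 0 ? 0 : x; }

    // A template merges near-identical kernels for different element types.
    template <typename T>
    __global__ void clamp_kernel(T *data, int length)
    {
        // Loop variable declared inside the for statement (C++ style).
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < length;
             i += gridDim.x * blockDim.x)
        {
            data[i] = clampval(data[i]);
        }
    }

    // Host-side instantiation for two types:
    // clamp_kernel<float><<<64, 256>>>(gpu_floats, n);
    // clamp_kernel<int><<<64, 256>>>(gpu_ints, n);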

(6) Correct use of block and thread indices

To distribute work across threads we usually use the built-in variables threadIdx and blockIdx to drive loop increments. But be careful to use the right variable inside the loop; getting this wrong cost me two days of debugging! Here is an example:
    __global__ static void saliencefunc(float *peaks_gpu, int *index_gpu, float *saliencebins_gpu, int framenumber)
    {
        __shared__ float peaks[half_peak_num];
        __shared__ int index[half_peak_num];

        int tid = threadIdx.x;
        int bid = blockIdx.x;

        for (int i = bid; i < framenumber; i += block_num)
        {
            if (tid < half_peak_num)
            {
                // Index with the loop variable i, not with bid (see below).
                peaks[tid] = peaks_gpu[half_peak_num * i + tid];
                index[tid] = index_gpu[half_peak_num * i + tid];
            }
            __syncthreads();
        }
    }

Pay attention to the indexing expression half_peak_num * i + tid in the two assignments. I originally wrote half_peak_num * bid + tid, and it took two days to locate the problem: bid stays fixed while the loop variable i advances by block_num, so indexing with bid keeps reading the same frames. The lesson: inside a loop, index with the loop variable (i, j, ...) where it can replace a built-in variable, and use the built-in variables as sparingly as possible.

(7) Release the space allocated on the GPU

Release GPU memory promptly after use. For a program that runs once this hardly matters, since all space is released automatically when the program ends. But when the program runs continuously or is invoked many times, unreleased space causes a serious GPU memory leak. The first consequence is that GPU memory is gradually exhausted as the program runs, so later allocations fail; the second is that the program slows down. So we must develop the habit of releasing space promptly, as in the sketch below.
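A minimal sketch of pairing every cudaMalloc with a cudaFree in a routine that is called many times (the function and names are illustrative):

    // Called repeatedly; without the cudaFree at the end, each call would
    // leak sizeof(float) * n bytes of GPU memory.
    void process_once(const float *host_in, float *host_out, int n)
    {
        float *gpu_buf = NULL;
        cudaMalloc((void**)&gpu_buf, sizeof(float) * n);

        cudaMemcpy(gpu_buf, host_in, sizeof(float) * n, cudaMemcpyHostToDevice);
        // ... launch kernels on gpu_buf ...
        cudaMemcpy(host_out, gpu_buf, sizeof(float) * n, cudaMemcpyDeviceToHost);

        cudaFree(gpu_buf); // pair every cudaMalloc with a cudaFree
    }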
