Debugging an MMP (matrix multiplication) kernel in CUDA

Source: Internet
Author: User

After the first draft of the program was finished, it had several bugs:

1. Out-of-range memory accesses.

2. The scratch register tmp was not reset to zero after each loop iteration.

3. The data in the first iteration after copying to shared memory was incorrect. The results differed from run to run, but always came from a finite set of values.

For the first bug, the problem was confusing pointer usage. The MMP kernel uses quite a few pointers, and they are intertwined with each other, so if the logic is even slightly unclear it is easy to go wrong. In general a pointer is used in one of two ways: either the pointer stays fixed and an index changes, or the pointer advances with the loop and the index stays fixed. CUDA adds a further complication: a pointer's value may be the same in every thread or different in each thread. These cases need careful analysis; if you are not confident in your reasoning, keep the pointer fixed and let only the index change. My pointer went out of range because the computation and the data transfer proceed in different orders, yet I used the same pointer for both, which made access errors easy to introduce. So I gave the computation and the transfer separate pointers; with each pointer used for one purpose only, the logic is much harder to get wrong. P.S. Memory access problems can be detected with the CUDA Memory Checker.
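The "fixed pointer, changing index" advice can be sketched as follows. This is a minimal illustration, not the author's kernel: the names `matmul_indexed` and `check_matmul_indexed`, the row-major square layout, and the 16x16 block size are all assumptions made for the example. The base pointers are never modified, each matrix gets its own index, and a bounds guard keeps threads past the matrix edge from touching memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: C = A * B for row-major n x n matrices.
// A, B and C are never modified as pointers; only integer indices change.
__global__ void matmul_indexed(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;   // threads past the matrix edge do nothing

    float acc = 0.0f;
    for (int k = 0; k < n; ++k) {
        int aIdx = row * n + k;         // index used only for reading A
        int bIdx = k * n + col;         // separate index used only for reading B
        acc += A[aIdx] * B[bIdx];
    }
    C[row * n + col] = acc;
}

// Hypothetical host driver: runs the kernel and compares with a CPU loop.
bool check_matmul_indexed(int n)
{
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }

    size_t bytes = hA.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul_indexed<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess)                          // copy back (also waits for the kernel)
        err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) return false;  // small integers are exact in float
        }
    return true;
}
```

Calling `check_matmul_indexed(37)` from a `main` function exercises the guard, because 37 is not a multiple of the block size. Running the same binary under a memory checker is a reasonable way to confirm that no thread reads or writes out of range.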

The second bug is actually easy to find; you will usually notice it while writing the program.
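As an illustration of this kind of bug, here is a small sketch that assumes `tmp` is used as an accumulator and that each thread produces several output elements in a loop. The kernel name `matmul_strided`, the driver `check_matmul_strided`, and the launch configuration are hypothetical. The point is the position of the reset: `tmp` is set to zero at the start of every outer iteration, so sums from one output element cannot leak into the next.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes several elements of C = A * B
// (row-major n x n) in a grid-stride loop.
__global__ void matmul_strided(const float* A, const float* B, float* C, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n * n; idx += stride) {
        int row = idx / n;
        int col = idx % n;
        float tmp = 0.0f;   // reset for every output element; hoisting this
                            // line above the outer loop reproduces the bug
        for (int k = 0; k < n; ++k)
            tmp += A[row * n + k] * B[k * n + col];
        C[idx] = tmp;
    }
}

// Hypothetical host driver: launches far fewer threads than outputs, so
// every thread runs the outer loop more than once.
bool check_matmul_strided(int n)
{
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }

    size_t bytes = hA.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    matmul_strided<<<2, 64>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) return false;  // small integers are exact in float
        }
    return true;
}
```

Declaring the accumulator inside the loop body, rather than resetting a variable declared outside it, makes the mistake hard to reintroduce.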

For the third bug, I initially thought the memory behind the result pointer had not been initialized, so that the accumulation across iterations went wrong. I therefore split off the first loop iteration to assign the result (=) and accumulated (+=) in the remaining iterations. This did not solve the problem. For one thing, I initialize the result buffer in the host code and copy it to the device. For another, in my runs uninitialized device memory appeared to hold all zeros or all ones rather than random garbage (CUDA does not actually guarantee this; memory returned by cudaMalloc is uninitialized). So even if the result had not been initialized, the output should have been the same on every run. When the results differ from run to run but always fall within a finite set of cases, the cause is missing synchronization 99% of the time. The Programming Guide points out that omitting synchronization where it is required can lead to unexpected results, though these still come from a finite set of outcomes. See the Guide for details. In general, after data is written, a synchronization is required before the data is read, to guarantee the write-then-read dependency.
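The write-then-read dependency on shared memory can be sketched with the classic tiled matrix multiplication below. This is a generic textbook pattern under assumed names (`TILE`, `matmul_tiled`, `check_matmul_tiled`), not the author's kernel. The first `__syncthreads()` ensures every thread's write to the shared tiles is complete before any thread reads them; leaving it out is the kind of omission that yields results that vary from run to run. The second one keeps a fast thread from overwriting a tile that slower threads are still reading.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;   // assumed tile width; blocks must be TILE x TILE

// Hypothetical kernel: C = A * B for row-major n x n matrices.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;   // index used for the load from A
        int bRow = t * TILE + threadIdx.y;   // index used for the load from B
        // Out-of-range positions are padded with zeros.
        sA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();   // writes to shared memory finish before any read

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();   // reads finish before the next iteration overwrites the tiles
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host driver: runs the kernel and compares with a CPU loop.
bool check_matmul_tiled(int n)
{
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }

    size_t bytes = hA.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) return false;  // small integers are exact in float
        }
    return true;
}
```

Note that no thread returns before the loop: `__syncthreads()` must be reached by every thread of the block, so the bounds checks are applied to the loads and the final store instead of an early exit.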

In addition, the current MMP kernel reaches only 20% of peak performance, which is 1/3 of cuBLAS performance. There are two main problems: (1) the addresses accessed in matrix B are not contiguous, so the accesses are not coalesced; (2) matrix C, which lives in global memory, is still accessed somewhat too often, since the number of accesses has only been reduced by a factor of 8. For now I am considering tackling the second problem first: compute each element of C completely before writing it out, which reduces the number of global memory accesses and addresses the main bottleneck.
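To put a rough number on the second problem, here is a back-of-the-envelope count. It rests on assumptions that are not in the original: square N x N matrices, and a kernel that adds a partial sum to C in global memory once for every 8 steps of the inner k loop (one possible reading of the "factor of 8" above).

```latex
\begin{align*}
\text{updates of } C \text{ (partial sum every 8 steps)} &= N^2 \cdot \frac{N}{8} = \frac{N^3}{8} \\
\text{writes of } C \text{ (full sum kept in a register)} &= N^2 \\
\text{ratio} &= \frac{N^3/8}{N^2} = \frac{N}{8} \\
N = 1024:\quad \frac{1024^3}{8} &= 134{,}217{,}728
  \quad\text{vs.}\quad 1024^2 = 1{,}048{,}576 \quad (\text{ratio } 128)
\end{align*}
```

Under these assumptions, the saving from finishing each element of C before writing it grows linearly with the matrix size.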

A later post will cover more debugging techniques with Nsight.
