CUDA matrix multiplication using shared memory


The method described in the previous article kept all data in global memory, which slows down data access and hurts performance. Threads on the device can read a block's shared memory much faster, and many array elements are read repeatedly from global memory. So instead we first load the array data to be computed into shared memory, and then do the computation from shared memory; this improves performance.

However, the shared memory available to each block is usually very small; on my GeForce 8400M G, for example, it is only 16 KB. So when a block needs more data than that, the data must be partitioned and computed in several passes.
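If you are not sure what your card provides, the limit can be queried at run time. Here is a minimal sketch (not part of the original article) using the standard cudaGetDeviceProperties call:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // sharedMemPerBlock is reported in bytes
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    return 0;
}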

1. Tiling Policy

Take the configuration from the previous article:

Grid dimensions: (Width/TILE_WIDTH, Width/TILE_WIDTH) = (64, 64)

Block dimensions: (TILE_WIDTH, TILE_WIDTH) = (16, 16)

so TILE_WIDTH = 16 and Width = 1024.
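For reference, here is a hedged sketch of the host-side launch these dimensions imply (the kernel name matches the listing below; Md, Nd, and Pd are assumed to be device pointers allocated and filled elsewhere):

#define TILE_WIDTH 16
const int Width = 1024;

// One thread per element of Pd
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // (64, 64)
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // (16, 16)
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);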

During the computation, one block of Pd needs the strip of Md and the strip of Nd outlined by dotted lines in the original figure. The data needed by one block of Pd is TILE_WIDTH * Width * 2 * 4 / 1024 KB (the 2 because both Md and Nd are read, the 4 because the elements are floats). That works out to 16 * 1024 * 2 * 4 / 1024 = 128 KB, clearly more than 16 KB. So we have to compute in tiles, processing the data in several passes. (Note: tiled computation is a common technique in big-data processing.)

Tile size: as a first attempt, use a TILE_WIDTH * TILE_WIDTH square as the tile. The shared memory this requires is 16 * 16 * 2 * 4 / 1024 = 2 KB, comfortably within the 16 KB limit (this is the key point).

2. Source Program

The __shared__ qualifier places a variable in shared memory, and data in shared memory is shared by all threads in the block. A variable declared inside the kernel without this qualifier is private to each thread.

Each block contains TILE_WIDTH * TILE_WIDTH threads, and each tile of Md and Nd likewise holds TILE_WIDTH * TILE_WIDTH elements, so every thread in the block loads exactly one element of Md and one element of Nd into shared memory: thread (tx, ty) performs the two cooperative loads in the kernel below.

Using this method, the kernel function is changed as follows:

__global__ static void MatrixMulKernel(const float* Md, const float* Nd, float* Pd, int Width)
{
    // Shared memory for the tiles loaded from global memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    // Block and thread indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Row and column of this thread's element of Pd
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0.0;
    // Loop over the k-th pair of tiles
    for (int k = 0; k < Width / TILE_WIDTH; ++k)
    {
        // Cooperatively load one tile of Md and one tile of Nd into shared memory
        Mds[ty][tx] = Md[Row * Width + k * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(k * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // wait until all threads in the block have loaded

        for (int m = 0; m < TILE_WIDTH; ++m)
            Pvalue += Mds[ty][m] * Nds[m][tx];
        __syncthreads();  // wait until everyone is done, since the next tile overwrites Mds/Nds
    }
    // Each thread writes the one element of Pd it is responsible for
    Pd[Row * Width + Col] = Pvalue;
}

It is important to synchronize the threads of a block with __syncthreads(): a thread may only start computing once all threads have finished loading the required data into shared memory, and the tile may only be overwritten once every thread has finished using it.

3. Test Results

The CPU version takes too long at this size, so it is not timed. The results show that this method performs much better than the kernel in the previous article.

4. Comparison of Different Tiling Policies

The kernel above uses only 2 KB of shared memory, so we can try loading more data per pass, for example by merging four of the small tiles of Md and Nd into one larger tile:

Md: (TILE_WIDTH, TILE_WIDTH * 4)

Nd: (TILE_WIDTH * 4, TILE_WIDTH)

The code above then needs a few changes, mainly the sizes of the shared-memory allocations and the bounds of the loops; a sketch follows.
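Here is a minimal sketch of what those changes might look like. This is a reconstruction, not the original author's code, and it assumes Width is a multiple of 4 * TILE_WIDTH:

#define TILE_WIDTH 16

// Four small tiles merged into one larger tile per block:
// the Md tile is TILE_WIDTH x 4*TILE_WIDTH and the Nd tile is
// 4*TILE_WIDTH x TILE_WIDTH, for 2 * 4 KB = 8 KB of shared memory.
__global__ static void MatrixMulKernel4(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][4 * TILE_WIDTH];
    __shared__ float Nds[4 * TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0.0;
    // Each pass now covers 4*TILE_WIDTH columns of Md / rows of Nd,
    // so the loop runs a quarter as many times
    for (int k = 0; k < Width / (4 * TILE_WIDTH); ++k)
    {
        // Each thread loads four elements of Md and four of Nd
        for (int i = 0; i < 4; ++i)
        {
            Mds[ty][i * TILE_WIDTH + tx] =
                Md[Row * Width + k * 4 * TILE_WIDTH + i * TILE_WIDTH + tx];
            Nds[i * TILE_WIDTH + ty][tx] =
                Nd[(k * 4 * TILE_WIDTH + i * TILE_WIDTH + ty) * Width + Col];
        }
        __syncthreads();  // wait until the whole tile is loaded

        for (int m = 0; m < 4 * TILE_WIDTH; ++m)
            Pvalue += Mds[ty][m] * Nds[m][tx];
        __syncthreads();  // finish with the tile before the next load overwrites it
    }
    Pd[Row * Width + Col] = Pvalue;
}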

In the tests, this variant turns out to be slower than the previous one. Other tiling policies can be tried to find the most suitable one; the choice also has to take physical memory limits and the memory-access pattern into account.
