The method described in the previous article keeps all of its data in global memory, which slows down data access and hurts performance. Threads on the device can read data from their block's shared memory much faster, and in the previous method many array elements are read from global memory repeatedly. Therefore, loading the array data needed for the computation into shared memory first, and then computing from shared memory, improves performance.
However, the shared memory available to each block is usually very small. On the 8400MG that I use, for example, it is only 16 KB. So when a block needs a large amount of data, the data has to be partitioned and the computation carried out in several parts.
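The limit differs from device to device, so it can be worth checking it at run time instead of assuming 16 KB. A minimal sketch using the CUDA runtime's device-properties query follows; the helper name `sharedMemPerBlockBytes` is made up for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the article): returns the shared memory
// available to one block on `device`, in bytes, or -1 if the query fails
// (for example when no CUDA device is present).
long long sharedMemPerBlockBytes(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return static_cast<long long>(prop.sharedMemPerBlock);
}
```

Calling `sharedMemPerBlockBytes(0)` queries the first device; on a card like the one described above it should report 16384.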
1. Tiling Strategy
For example, in the previous article:
Grid dimension: (width/TILE_WIDTH, width/TILE_WIDTH) = (64, 64)
Block dimension: (TILE_WIDTH, TILE_WIDTH) = (16, 16)
To compute one block of Pd, we need the data enclosed by the dotted lines in Md and Nd, that is, TILE_WIDTH full rows of Md and TILE_WIDTH full columns of Nd. The data required by one block of Pd is therefore TILE_WIDTH * width * 2 * 4 / 1024 KB: the factor 2 is for the two arrays Md and Nd, and the factor 4 is the size of a float in bytes. With TILE_WIDTH = 16 and width = 64 * 16 = 1024, this comes to 128 KB, which is obviously larger than 16 KB. In this case we need to compute tile by tile. (Note: blocking is a common technique when processing large data sets.)
Tile size: as a first attempt, use a TILE_WIDTH * TILE_WIDTH square as the small tile and carry out the computation in phases, one pair of tiles at a time. The shared memory required per phase is 16 * 16 * 2 * 4 / 1024 = 2 KB, which fits comfortably within 16 KB (this is the key point).
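The arithmetic above can be checked at compile time. The following is only a sketch: the constant and function names are invented for this example, and the numbers are the ones from the text (TILE_WIDTH = 16, width = 64 * 16 = 1024, 16 KB of shared memory per block).

```cuda
#include <cstddef>

// Values taken from the example in the text; the names are hypothetical.
constexpr std::size_t kTileWidth   = 16;         // TILE_WIDTH
constexpr std::size_t kWidth       = 64 * 16;    // matrix width, 1024
constexpr std::size_t kSharedLimit = 16 * 1024;  // 16 KB per block

// Bytes needed to hold a rows x cols piece of Md plus a piece of Nd
// of the same size (factor 2), stored as floats.
constexpr std::size_t footprintBytes(std::size_t rows, std::size_t cols)
{
    return rows * cols * 2 * sizeof(float);
}

// Everything one block of Pd needs: full strips of Md and Nd.
constexpr std::size_t kStripBytes = footprintBytes(kTileWidth, kWidth);
// One phase of the tiled computation: one tile of Md and one of Nd.
constexpr std::size_t kTileBytes  = footprintBytes(kTileWidth, kTileWidth);

static_assert(kStripBytes == 128 * 1024, "full strips need 128 KB");
static_assert(kStripBytes > kSharedLimit, "full strips do not fit in shared memory");
static_assert(kTileBytes == 2 * 1024, "one pair of tiles needs 2 KB");
static_assert(kTileBytes <= kSharedLimit, "one pair of tiles fits in shared memory");
```

If any of the figures were wrong, the corresponding `static_assert` would stop the build, which makes this a cheap guard when experimenting with other tile sizes.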
2. Source Code
The __shared__ keyword indicates that a variable is stored in shared memory, where its data is shared by all threads in the block. The scope of a variable declared in a thread is only that thread.
Each block contains TILE_WIDTH * TILE_WIDTH threads, and each small tile of Md and Nd contains TILE_WIDTH * TILE_WIDTH elements, so each thread in the block loads exactly one element of Md and one element of Nd into shared memory. In the kernel below, these are the two assignments to Mds[ty][tx] and Nds[ty][tx] at the top of the loop over k: thread (tx, ty) loads one element of Md and one element of Nd.
Using this method, the kernel function is changed as follows:

```cuda
#define TILE_WIDTH 16  // tile edge length, matching the (16, 16) block dimension

__global__ static void matrixMulKernel(const float *Md, const float *Nd, float *Pd, int width)
{
    // Shared memory holds the tiles loaded from global memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Row and column of the Pd element computed by this thread
    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    float pValue = 0.0f;
    // Loop over the width / TILE_WIDTH pairs of small tiles
    for (int k = 0; k < width / TILE_WIDTH; k++)
    {
        // The threads of the block cooperatively load one tile of Md
        // and one tile of Nd into shared memory
        Mds[ty][tx] = Md[row * width + k * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(k * TILE_WIDTH + ty) * width + col];
        __syncthreads();  // wait until every thread in the block has finished loading

        for (int m = 0; m < TILE_WIDTH; m++)
            pValue += Mds[ty][m] * Nds[m][tx];
        __syncthreads();  // wait until every thread is done with these tiles,
                          // because the next iteration overwrites them
    }
    // Each thread is responsible for one element of Pd
    Pd[row * width + col] = pValue;
}
```
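It is important to call __syncthreads() to synchronize the threads in a block. Each thread computes with elements that other threads loaded, so all the required data must be in shared memory before the computation starts, and no thread may overwrite a tile while other threads are still using it.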
3. Test Results
The CPU computation takes a long time, so it is not measured here. The results show that this method performs much better than the one in the previous article.
4. Comparison of Different Tiling Strategies
The version above uses only 2 KB of shared memory, so we can consider loading a larger part of the data in each phase, for example by merging four of the small tiles of Md and Nd above into one tile:
Md: (TILE_WIDTH, TILE_WIDTH * 4)
Nd: (TILE_WIDTH * 4, TILE_WIDTH)
The code above needs a few changes, mainly the sizes of the shared memory arrays and the bounds of the loops.
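One way those changes might look is sketched below. This is an illustration under stated assumptions, not the code that was measured for this article: the names `TILE_DEPTH`, `matrixMulWideTileKernel` and `matrixMulWideTile` are made up, the block stays at TILE_WIDTH * TILE_WIDTH threads (so each thread now loads four elements of Md and four of Nd per phase), and width is assumed to be a multiple of TILE_DEPTH. The two shared arrays take 2 * 16 * 64 * 4 bytes = 8 KB.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16
#define TILE_DEPTH (4 * TILE_WIDTH)  // hypothetical name: four small tiles merged along k

// Sketch only. Assumes a TILE_WIDTH x TILE_WIDTH block and
// width % TILE_DEPTH == 0.
__global__ void matrixMulWideTileKernel(const float *Md, const float *Nd,
                                        float *Pd, int width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_DEPTH];  // 16 rows x 64 columns of Md
    __shared__ float Nds[TILE_DEPTH][TILE_WIDTH];  // 64 rows x 16 columns of Nd

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float pValue = 0.0f;
    for (int k = 0; k < width / TILE_DEPTH; k++)
    {
        // 256 threads fill 1024 elements per array: four loads per thread
        for (int i = 0; i < TILE_DEPTH / TILE_WIDTH; i++)
        {
            int off = i * TILE_WIDTH;
            Mds[ty][off + tx] = Md[row * width + k * TILE_DEPTH + off + tx];
            Nds[off + ty][tx] = Nd[(k * TILE_DEPTH + off + ty) * width + col];
        }
        __syncthreads();  // both tiles must be complete before they are used

        for (int m = 0; m < TILE_DEPTH; m++)
            pValue += Mds[ty][m] * Nds[m][tx];
        __syncthreads();  // the next iteration overwrites the tiles
    }
    Pd[row * width + col] = pValue;
}

// Hypothetical host helper: multiplies two width x width row-major matrices
// on the GPU. Returns false on unsupported sizes or on any CUDA error.
bool matrixMulWideTile(const float *M, const float *N, float *P, int width)
{
    if (width <= 0 || width % TILE_DEPTH != 0)
        return false;

    size_t bytes = (size_t)width * width * sizeof(float);
    float *Md = nullptr, *Nd = nullptr, *Pd = nullptr;
    bool ok = cudaMalloc(&Md, bytes) == cudaSuccess
           && cudaMalloc(&Nd, bytes) == cudaSuccess
           && cudaMalloc(&Pd, bytes) == cudaSuccess
           && cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok)
    {
        dim3 block(TILE_WIDTH, TILE_WIDTH);
        dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
        matrixMulWideTileKernel<<<grid, block>>>(Md, Nd, Pd, width);
        // The device-to-host copy waits for the kernel to finish
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return ok;
}
```

Compared with the 2 KB version, each phase has fewer synchronization points per element but more shared memory per block, which can reduce how many blocks are resident at once; whether that trade pays off has to be measured on the target device.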
The test results are as follows:
As can be seen, this is slower than the previous version. Other tiling strategies can be tested in the same way to find the most suitable one. The choice of tiling strategy must also take into account the limits of the physical memory and the memory access pattern.