GPU coarse-grained parallel implementation and testing of convolution operations
1. Basic idea of the algorithm (a minimal kernel sketch follows this list):
1. Each GPU thread produces one convolution result, and the number of blocks used is determined by the number of results.
2. The matrix and the convolution kernel are staged in shared memory; the convolution results are stored in global memory.
3. Either dimension of the two-dimensional matrix may be up to 10000; convolution kernels up to 16x16 are supported.
4. Batch processing of any number of images is supported.
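The scheme above can be illustrated with a minimal CUDA sketch. This is my own illustration, not the report's code: the kernel name, the "valid" output size (H - kh + 1) x (W - kw + 1), and the zero fill at the tile borders are assumptions. Each 16x16 block stages its input tile (plus halo) and the up-to-16x16 convolution kernel in shared memory, and each thread serially accumulates one output element.

```cuda
#define TILE 16   /* 16x16 threads per block, one output element per thread */
#define KMAX 16   /* maximum supported convolution kernel size              */

__global__ void conv2d_valid(const float *img, const float *krn, float *out,
                             int H, int W, int kh, int kw)
{
    /* Shared copies of the input tile (with halo) and of the kernel. */
    __shared__ float s_img[TILE + KMAX - 1][TILE + KMAX - 1];
    __shared__ float s_krn[KMAX][KMAX];

    int outH = H - kh + 1, outW = W - kw + 1;
    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;          /* output column of this thread */
    int row = blockIdx.y * TILE + ty;          /* output row of this thread    */

    /* Stage the convolution kernel once per block. */
    if (ty < kh && tx < kw)
        s_krn[ty][tx] = krn[ty * kw + tx];

    /* Stage the input tile plus halo; each thread may load several elements. */
    for (int y = ty; y < TILE + kh - 1; y += TILE)
        for (int x = tx; x < TILE + kw - 1; x += TILE) {
            int gy = blockIdx.y * TILE + y;
            int gx = blockIdx.x * TILE + x;
            s_img[y][x] = (gy < H && gx < W) ? img[gy * W + gx] : 0.0f;
        }
    __syncthreads();

    if (row < outH && col < outW) {
        float acc = 0.0f;
        for (int i = 0; i < kh; ++i)           /* serial 2D loop inside one thread */
            for (int j = 0; j < kw; ++j)
                acc += s_img[ty + i][tx + j] * s_krn[i][j];
        out[row * outW + col] = acc;
    }
}
```

A matching launch would use dim3 block(16, 16) and a grid of ceil(outW/16) x ceil(outH/16) blocks, i.e. one block per 16x16 tile of results.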
2. Experimental platform:
CPU: Intel(R) Xeon(R) E5-2650 0 @ 2.00 GHz, 16 cores / 32 threads
GPU: NVIDIA Tesla C2070 (see the table below)
RAM: 64 GB
Operating system: 64-bit Windows 7
| Item | Specification |
| --- | --- |
| Form factor | 9.75-inch PCIe x16 board |
| Number of Tesla GPUs | 1 |
| CUDA cores | 448 |
| CUDA core frequency | 1.15 GHz |
| Double-precision floating-point performance (peak) | 515 Gflops |
| Single-precision floating-point performance (peak) | 1.03 Tflops |
| Dedicated memory | 3 GB GDDR5 (Tesla C2050) / 6 GB GDDR5 (Tesla C2070) |
| Memory frequency | 1.5 GHz |
| Memory interface | 384-bit |
| Memory bandwidth | 144 GB/s |
| Power consumption (TDP) | 238 W |
| System interface | PCIe x16 Gen2 |
| Cooling solution | Active fan heatsink |
| Display support | One dual-link DVI-I output, maximum resolution 2560x1600 @ 60 Hz |
| Software development tools | CUDA C/C++/Fortran, OpenCL, and DirectCompute toolkits; NVIDIA Parallel Nsight for Visual Studio |
3. CPU and GPU timing comparison
Description: The CPU implementation is not optimized; the convolution uses the basic four-level loop form. For the sake of generality, the GPU code performs many conditional tests internally, which adds to the running time. This is an initial version and can be optimized further, for example with byte alignment.
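For reference, the unoptimized CPU baseline described above is essentially a four-level loop. The sketch below is my reconstruction under that description (function and variable names are assumptions), computing a "valid" convolution of size (H - kh + 1) x (W - kw + 1):

```c
/* Plain four-level-loop convolution: loops over output rows and columns,
 * then over the kernel rows and columns. No blocking or vectorization. */
void conv2d_cpu(const float *img, const float *krn, float *out,
                int H, int W, int kh, int kw)
{
    int outH = H - kh + 1, outW = W - kw + 1;
    for (int r = 0; r < outH; ++r)
        for (int c = 0; c < outW; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < kh; ++i)
                for (int j = 0; j < kw; ++j)
                    acc += img[(r + i) * W + (c + j)] * krn[i * kw + j];
            out[r * outW + c] = acc;
        }
}
```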
Timing description: Times are measured with clock(), which has a resolution of 0.001 s (1 ms). The total GPU time includes function-call overhead, GPU start-up cost, device-memory allocation and release, data transfer, and computation; the time spent on data pre- and post-processing is also measured. The CPU time includes only function-call overhead and computation.
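The host-side sketch below shows one way the three GPU columns in the tables (CPU→GPU, GPU kernel, GPU→CPU) could be separated with clock(). It assumes the conv2d_valid kernel sketched in section 1, the concrete matrix size is just one test case taken from the tables, and error handling is omitted; all names are assumptions for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>

__global__ void conv2d_valid(const float *, const float *, float *,
                             int, int, int, int);   /* sketched in section 1 */

static double secs(clock_t a, clock_t b) { return (double)(b - a) / CLOCKS_PER_SEC; }

int main(void)
{
    const int H = 1128, W = 1159, kh = 5, kw = 4;       /* one case from the table */
    const int outH = H - kh + 1, outW = W - kw + 1;
    size_t inB  = (size_t)H * W * sizeof(float);
    size_t kB   = (size_t)kh * kw * sizeof(float);
    size_t outB = (size_t)outH * outW * sizeof(float);

    float *h_img = (float *)malloc(inB), *h_krn = (float *)malloc(kB);
    float *h_out = (float *)malloc(outB);
    for (size_t i = 0; i < (size_t)H * W; ++i)   h_img[i] = rand() / (float)RAND_MAX;
    for (size_t i = 0; i < (size_t)kh * kw; ++i) h_krn[i] = rand() / (float)RAND_MAX;

    float *d_img, *d_krn, *d_out;
    clock_t t0 = clock();
    cudaMalloc((void **)&d_img, inB);                   /* allocation + upload,      */
    cudaMalloc((void **)&d_krn, kB);                    /* counted here in the       */
    cudaMalloc((void **)&d_out, outB);                  /* CPU-to-GPU figure         */
    cudaMemcpy(d_img, h_img, inB, cudaMemcpyHostToDevice);
    cudaMemcpy(d_krn, h_krn, kB, cudaMemcpyHostToDevice);
    clock_t t1 = clock();

    dim3 block(16, 16);
    dim3 grid((outW + 15) / 16, (outH + 15) / 16);      /* one block per output tile */
    conv2d_valid<<<grid, block>>>(d_img, d_krn, d_out, H, W, kh, kw);
    cudaDeviceSynchronize();                            /* so the kernel time is real */
    clock_t t2 = clock();

    cudaMemcpy(h_out, d_out, outB, cudaMemcpyDeviceToHost);
    clock_t t3 = clock();

    printf("cpu2gpu %.3f s, kernel %.3f s, gpu2cpu %.3f s\n",
           secs(t0, t1), secs(t1, t2), secs(t2, t3));
    cudaFree(d_img); cudaFree(d_krn); cudaFree(d_out);
    free(h_img); free(h_krn); free(h_out);
    return 0;
}
```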
Error description: Both the CPU and the GPU store the data in float arrays. The matrix and convolution-kernel data are initialized randomly, and the CPU and GPU convolution results are compared to verify that the convolution is correct.
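A minimal sketch of such a check (my own illustration, not the report's code): compare the two float result arrays element by element and report the maximum absolute difference.

```c
#include <math.h>
#include <stddef.h>

/* Maximum absolute difference between the CPU and GPU result arrays.
 * The report observes an error of 0 for all tested cases. */
float max_abs_error(const float *cpu, const float *gpu, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = fabsf(cpu[i] - gpu[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```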
Preliminary analysis:
A. The error between the CPU and GPU convolution results is 0, which indicates that the GPU implementation is correct.
B. When the matrix and the convolution kernel are small, the CPU performs better. When they are large, the GPU performs better, reflecting its parallelism and its ability to process large-scale, high-throughput data quickly. When the amount of data is small, the GPU time is dominated by start-up cost, chiefly the device-memory allocation overhead.
C. On the GPU, each thread computes one convolution result serially, so the larger the convolution kernel, the longer a single result takes.
To reduce the number of accesses to global memory, the matrix and the convolution kernel are copied into shared memory inside each block before the computation.
Two-dimensional blocks of 16x16 threads are used, with as many blocks as there are convolution results. With the convolution kernel fixed, the running time therefore stays essentially unchanged as the matrix grows, until the matrix, and hence the number of blocks, becomes so large that the running time is limited by the number of SMs and the number of blocks supported by the GPU hardware. (To keep the convolution program general, the data copy uses many conditional tests and branches, which affects performance somewhat.)
D. The CPU performs a simple serial computation, so its time is affected by both the matrix size and the kernel size.
E. When both matrix dimensions exceed 10000, the CPU run hits a pointer exception, perhaps because the matrix is large (around 550 MB), the storage is no longer contiguous, and the pointer overflows; these cases therefore could not be tested.
Kernel 5x4, 1 image:

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 12x9 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 18x19 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 158x159 | 1 | 5x4 | 0.003 | <1 ms | 0.001 | <1 ms |
| 558x559 | 1 | 5x4 | 0.044 | 0.001 | 0.001 | <1 ms |
| 1128x1159 | 1 | 5x4 | 0.157 | 0.002 | 0.004 | 0.001 |
| 2128x2159 | 1 | 5x4 | 0.442 | 0.007 | 0.012 | 0.007 |
| 5128x5159 | 1 | 5x4 | 2.394 | 0.038 | 0.068 | 0.035 |
| 18128x4159 | 1 | 5x4 | 6.866 | 0.111 | 0.193 | 0.114 |
| 10128x11159 | 1 | 5x4 | 10.074 | 0.160 | 0.288 | 0.142 |

Maximum effective compute performance: 15.54 Gflops; maximum throughput: 1.427 GB/s.
Kernel 14x15, 1 image (rows marked "~": the matrix is smaller than the convolution kernel, so there is no valid result):

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 14x15 | <1 ms | 0.001 | <1 ms | <1 ms |
| 158x159 | 1 | 14x15 | 0.024 | <1 ms | <1 ms | <1 ms |
| 558x559 | 1 | 14x15 | 0.354 | <1 ms | 0.006 | 0.001 |
| 1128x1159 | 1 | 14x15 | 1.400 | 0.002 | 0.023 | 0.002 |
| 2128x2159 | 1 | 14x15 | 3.839 | 0.007 | 0.082 | 0.007 |
| 5128x5159 | 1 | 14x15 | 22.856 | 0.042 | 0.475 | 0.035 |
| 11128x4159 | 1 | 14x15 | 38.172 | 0.079 | 0.833 | 0.061 |
| 10128x11159 | 1 | 14x15 | 122.679 | 0.203 | 2.614 | 0.358 |

Maximum effective compute performance: 23.23 Gflops; maximum throughput: 382.6 MB/s.
Kernel 14x15, 15 images:

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 15 | 14x15 | 0.001 | <1 ms | <1 ms | <1 ms |
| 118x29 | 15 | 14x15 | 0.041 | <1 ms | 0.001 | <1 ms |
| 138x59 | 15 | 14x15 | 0.097 | <1 ms | 0.002 | <1 ms |
| 158x159 | 15 | 14x15 | 0.372 | 0.001 | 0.007 | <1 ms |
| 558x559 | 15 | 14x15 | 4.943 | 0.006 | 0.084 | 0.006 |
| 1128x1159 | 15 | 14x15 | 15.851 | 0.030 | 0.353 | 0+0x8 |
| 2128x2159 | 15 | 14x15 | 57.699 | 0.097 | 1.247 | 0.084 |
| 3158x3059 | 15 | 14x15 | 121.152 | 0.201 | 2.624 | 0.192 |
| 5128x5159 | 15 | 14x15 | pointer overflow | | | |
| 11128x4159 | 15 | 14x15 | | | | |
| 10128x11159 | 15 | 14x15 | | | | |

Maximum effective compute performance: 23.01 Gflops; maximum throughput: 362.9 MB/s.
Further analysis:
From the tables above, the maximum throughput is 1.427 GB/s, while the PCIe bus bandwidth is about 5 GB/s, so there is still some headroom. The highest effective single-precision floating-point multiplication performance is 23.23 Gflops, far below the roughly 1 Tflops peak single-precision performance quoted in the official documentation, for the following reasons:
A. The data transferred from the CPU to the GPU is a one-dimensional array, while the GPU computes in two dimensions, so accessing the data requires a great deal of address arithmetic.
B. Because of the nature of convolution, the data overlaps heavily: the large image is divided into many small blocks and the boundaries must be handled. To keep the program general (arbitrary image and kernel sizes, and batches of any number of images), copying the data into shared memory on the GPU involves many conditional tests, which is relatively expensive. GPU performance could be improved further by CPU-side preprocessing such as boundary padding; see the sketch after this list.
C. A two-dimensional loop is used to compute a single convolution result within a thread, so the computation time grows with the kernel area, i.e. quadratically in the kernel's side length.
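As one concrete form of the preprocessing suggested in item B (my own sketch, with assumed names and padding scheme): zero-pad the image on the CPU so that the padded size is a whole number of 16-wide output tiles plus the kernel halo. Every block's shared-memory tile then lies entirely inside the padded buffer, so the per-element boundary tests in the kernel can be dropped (the kernel would then index with the padded stride).

```c
#include <stdlib.h>
#include <string.h>

/* Zero-pad img (H x W) so the padded buffer holds a whole number of 16x16
 * output tiles plus the (kh-1) x (kw-1) halo. Returns a newly allocated
 * buffer of size (*padH) x (*padW); the caller frees it. */
float *pad_for_tiles(const float *img, int H, int W, int kh, int kw,
                     int *padH, int *padW)
{
    int outH = H - kh + 1, outW = W - kw + 1;       /* "valid" output size */
    *padH = ((outH + 15) / 16) * 16 + kh - 1;       /* whole tiles + halo  */
    *padW = ((outW + 15) / 16) * 16 + kw - 1;
    float *p = (float *)calloc((size_t)(*padH) * (*padW), sizeof(float));
    if (!p) return NULL;
    for (int r = 0; r < H; ++r)                     /* copy rows; the rest stays 0 */
        memcpy(p + (size_t)r * (*padW), img + (size_t)r * W,
               (size_t)W * sizeof(float));
    return p;
}
```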
Summary:
Writing a GPU program is easy; writing an efficient GPU program is hard; and writing an efficient, general GPU program is harder still, because every aspect has to be considered. The current version is only modestly general: because it uses basic data structures to hold the information, problems appear easily once the data reaches about 500 MB, and the computational efficiency is not very high.
Work plan for next week:
(1) Do some data preprocessing on the CPU, such as byte alignment and boundary padding, to minimize the conditional tests inside the GPU kernel. Consider splitting a single convolution across threads to achieve fine-grained parallelism and speed up the computation of individual convolution results.
(2) Learn to use a three-dimensional thread layout: map one image onto the first two dimensions and the batch of images onto the third, reducing the post-processing work and improving batch performance (see the sketch after this list).
(3) Consider parallelizing the deconvolution operation. Consider combining the CUDA convolution with the MATLAB version of CNN through mixed compilation to improve the efficiency of CNN.
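As a sketch of item (2), here is my own illustration of the intended layout (not the report's code): the grid's x and y dimensions tile the output of one image and blockIdx.z selects the image, so the per-image results come out already separated. Shared-memory staging is omitted for brevity.

```cuda
/* Batched "valid" convolution: a 2D tile of threads per output tile,
 * with the third grid dimension indexing the image within the batch. */
__global__ void conv2d_batch(const float *imgs, const float *krn, float *outs,
                             int H, int W, int kh, int kw)
{
    int outH = H - kh + 1, outW = W - kw + 1;
    const float *img = imgs + (size_t)blockIdx.z * H * W;         /* this image  */
    float       *out = outs + (size_t)blockIdx.z * outH * outW;   /* its results */
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < outH && col < outW) {
        float acc = 0.0f;
        for (int i = 0; i < kh; ++i)
            for (int j = 0; j < kw; ++j)
                acc += img[(row + i) * W + (col + j)] * krn[i * kw + j];
        out[row * outW + col] = acc;
    }
}
/* Launch: dim3 block(16, 16);
 *         dim3 grid((outW + 15) / 16, (outH + 15) / 16, batch); */
```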