GPU coarse-grained parallel implementation and testing of convolution operations
1. Basic idea of the algorithm (a minimal kernel sketch follows this list):
1. Each GPU thread produces one convolution result, and the number of blocks used is determined by the number of results.
2. The matrix and the convolution kernel are staged in shared memory; the convolution results are stored in global memory.
3. Either dimension of the two-dimensional matrix may be up to 10000; convolution kernels up to 16x16 are supported.
4. Batch processing of any number of images is supported.
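The scheme above can be illustrated with a minimal CUDA sketch. This is my own illustration, not the report's code: the kernel name, the "valid" output size (H - kh + 1) x (W - kw + 1), and the zero fill at the tile borders are assumptions. Each 16x16 block stages its input tile (plus halo) and the up-to-16x16 convolution kernel in shared memory, and each thread serially accumulates one output element.

```cuda
#define TILE 16   /* 16x16 threads per block, one output element per thread */
#define KMAX 16   /* maximum supported convolution kernel size              */

__global__ void conv2d_valid(const float *img, const float *krn, float *out,
                             int H, int W, int kh, int kw)
{
    /* Shared copies of the input tile (with halo) and of the kernel. */
    __shared__ float s_img[TILE + KMAX - 1][TILE + KMAX - 1];
    __shared__ float s_krn[KMAX][KMAX];

    int outH = H - kh + 1, outW = W - kw + 1;
    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;          /* output column of this thread */
    int row = blockIdx.y * TILE + ty;          /* output row of this thread    */

    /* Stage the convolution kernel once per block. */
    if (ty < kh && tx < kw)
        s_krn[ty][tx] = krn[ty * kw + tx];

    /* Stage the input tile plus halo; each thread may load several elements. */
    for (int y = ty; y < TILE + kh - 1; y += TILE)
        for (int x = tx; x < TILE + kw - 1; x += TILE) {
            int gy = blockIdx.y * TILE + y;
            int gx = blockIdx.x * TILE + x;
            s_img[y][x] = (gy < H && gx < W) ? img[gy * W + gx] : 0.0f;
        }
    __syncthreads();

    if (row < outH && col < outW) {
        float acc = 0.0f;
        for (int i = 0; i < kh; ++i)           /* serial 2D loop inside one thread */
            for (int j = 0; j < kw; ++j)
                acc += s_img[ty + i][tx + j] * s_krn[i][j];
        out[row * outW + col] = acc;
    }
}
```

A matching launch would use dim3 block(16, 16) and a grid of ceil(outW/16) x ceil(outH/16) blocks, i.e. one block per 16x16 tile of results.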
2. Experimental platform:
CPU: Intel(R) Xeon(R) E5-2650 0 @ 2.00 GHz, 16 cores / 32 threads
GPU: NVIDIA Tesla C2070 (see the table below)
RAM: 64 GB
Operating system: 64-bit Windows 7
| Item | Specification |
| --- | --- |
| Form factor | 9.75-inch PCIe x16 board |
| Number of Tesla GPUs | 1 |
| CUDA cores | 448 |
| CUDA core frequency | 1.15 GHz |
| Double-precision floating-point performance (peak) | 515 Gflops |
| Single-precision floating-point performance (peak) | 1.03 Tflops |
| Dedicated memory | 3 GB GDDR5 (Tesla C2050) / 6 GB GDDR5 (Tesla C2070) |
| Memory frequency | 1.5 GHz |
| Memory interface | 384-bit |
| Memory bandwidth | 144 GB/s |
| Power consumption (TDP) | 238 W |
| System interface | PCIe x16 Gen2 |
| Cooling solution | Active fan heatsink |
| Display support | One dual-link DVI-I output, maximum resolution 2560x1600 @ 60 Hz |
| Software development tools | CUDA C/C++/Fortran, OpenCL, and DirectCompute toolkits; NVIDIA Parallel Nsight for Visual Studio |
3. CPU and GPU timing comparison
Description: The CPU implementation is not optimized; the convolution uses the basic four-level loop form. For the sake of generality, the GPU code performs many conditional tests internally, which adds to the running time. This is an initial version and can be optimized further, for example with byte alignment.
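For reference, the unoptimized CPU baseline described above is essentially a four-level loop. The sketch below is my reconstruction under that description (function and variable names are assumptions), computing a "valid" convolution of size (H - kh + 1) x (W - kw + 1):

```c
/* Plain four-level-loop convolution: loops over output rows and columns,
 * then over the kernel rows and columns. No blocking or vectorization. */
void conv2d_cpu(const float *img, const float *krn, float *out,
                int H, int W, int kh, int kw)
{
    int outH = H - kh + 1, outW = W - kw + 1;
    for (int r = 0; r < outH; ++r)
        for (int c = 0; c < outW; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < kh; ++i)
                for (int j = 0; j < kw; ++j)
                    acc += img[(r + i) * W + (c + j)] * krn[i * kw + j];
            out[r * outW + c] = acc;
        }
}
```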
Timing description: Times are measured with clock(), which has a resolution of 0.001 s (1 ms). The total GPU time includes function-call overhead, GPU start-up cost, device-memory allocation and release, data transfer, and computation; the time spent on data pre- and post-processing is also measured. The CPU time includes only function-call overhead and computation.
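The host-side sketch below shows one way the three GPU columns in the tables (CPU→GPU, GPU kernel, GPU→CPU) could be separated with clock(). It assumes the conv2d_valid kernel sketched in section 1, the concrete matrix size is just one test case taken from the tables, and error handling is omitted; all names are assumptions for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>

__global__ void conv2d_valid(const float *, const float *, float *,
                             int, int, int, int);   /* sketched in section 1 */

static double secs(clock_t a, clock_t b) { return (double)(b - a) / CLOCKS_PER_SEC; }

int main(void)
{
    const int H = 1128, W = 1159, kh = 5, kw = 4;       /* one case from the table */
    const int outH = H - kh + 1, outW = W - kw + 1;
    size_t inB  = (size_t)H * W * sizeof(float);
    size_t kB   = (size_t)kh * kw * sizeof(float);
    size_t outB = (size_t)outH * outW * sizeof(float);

    float *h_img = (float *)malloc(inB), *h_krn = (float *)malloc(kB);
    float *h_out = (float *)malloc(outB);
    for (size_t i = 0; i < (size_t)H * W; ++i)   h_img[i] = rand() / (float)RAND_MAX;
    for (size_t i = 0; i < (size_t)kh * kw; ++i) h_krn[i] = rand() / (float)RAND_MAX;

    float *d_img, *d_krn, *d_out;
    clock_t t0 = clock();
    cudaMalloc((void **)&d_img, inB);                   /* allocation + upload,      */
    cudaMalloc((void **)&d_krn, kB);                    /* counted here in the       */
    cudaMalloc((void **)&d_out, outB);                  /* CPU-to-GPU figure         */
    cudaMemcpy(d_img, h_img, inB, cudaMemcpyHostToDevice);
    cudaMemcpy(d_krn, h_krn, kB, cudaMemcpyHostToDevice);
    clock_t t1 = clock();

    dim3 block(16, 16);
    dim3 grid((outW + 15) / 16, (outH + 15) / 16);      /* one block per output tile */
    conv2d_valid<<<grid, block>>>(d_img, d_krn, d_out, H, W, kh, kw);
    cudaDeviceSynchronize();                            /* so the kernel time is real */
    clock_t t2 = clock();

    cudaMemcpy(h_out, d_out, outB, cudaMemcpyDeviceToHost);
    clock_t t3 = clock();

    printf("cpu2gpu %.3f s, kernel %.3f s, gpu2cpu %.3f s\n",
           secs(t0, t1), secs(t1, t2), secs(t2, t3));
    cudaFree(d_img); cudaFree(d_krn); cudaFree(d_out);
    free(h_img); free(h_krn); free(h_out);
    return 0;
}
```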
Error description: Both the CPU and the GPU store the data in float arrays. The matrix and convolution-kernel data are initialized randomly, and the CPU and GPU convolution results are compared to verify that the convolution is correct.
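A minimal sketch of such a check (my own illustration, not the report's code): compare the two float result arrays element by element and report the maximum absolute difference.

```c
#include <math.h>
#include <stddef.h>

/* Maximum absolute difference between the CPU and GPU result arrays.
 * The report observes an error of 0 for all tested cases. */
float max_abs_error(const float *cpu, const float *gpu, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = fabsf(cpu[i] - gpu[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```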
Preliminary analysis:
A. The error between the CPU and GPU convolution results is 0, which indicates that the GPU implementation is correct.
B. When the matrix and the convolution kernel are small, the CPU performs better. When they are large, the GPU performs better, reflecting its parallelism and its ability to process large-scale, high-throughput data quickly. When the amount of data is small, the GPU time is dominated by start-up cost, chiefly the device-memory allocation overhead.
C. On the GPU, each thread computes one convolution result serially, so the larger the convolution kernel, the longer a single result takes.
To reduce the number of accesses to global memory, the matrix and the convolution kernel are copied into shared memory inside each block before the computation.
Two-dimensional blocks of 16x16 threads are used, with as many blocks as there are convolution results. With the convolution kernel fixed, the running time therefore stays essentially unchanged as the matrix grows, until the matrix, and hence the number of blocks, becomes so large that the running time is limited by the number of SMs and the number of blocks supported by the GPU hardware. (To keep the convolution program general, the data copy uses many conditional tests and branches, which affects performance somewhat.)
D. The CPU performs a simple serial computation, so its time is affected by both the matrix size and the kernel size.
E. When both matrix dimensions exceed 10000, the CPU run hits a pointer exception, perhaps because the matrix is large (around 550 MB), the storage is no longer contiguous, and the pointer overflows; these cases therefore could not be tested.
Kernel 5x4, 1 image:

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 12x9 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 18x19 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 158x159 | 1 | 5x4 | 0.003 | <1 ms | 0.001 | <1 ms |
| 558x559 | 1 | 5x4 | 0.044 | 0.001 | 0.001 | <1 ms |
| 1128x1159 | 1 | 5x4 | 0.157 | 0.002 | 0.004 | 0.001 |
| 2128x2159 | 1 | 5x4 | 0.442 | 0.007 | 0.012 | 0.007 |
| 5128x5159 | 1 | 5x4 | 2.394 | 0.038 | 0.068 | 0.035 |
| 18128x4159 | 1 | 5x4 | 6.866 | 0.111 | 0.193 | 0.114 |
| 10128x11159 | 1 | 5x4 | 10.074 | 0.160 | 0.288 | 0.142 |

Maximum effective compute performance: 15.54 Gflops; maximum throughput: 1.427 GB/s.
Kernel 14x15, 1 image (rows marked "~": the matrix is smaller than the convolution kernel, so there is no valid result):

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 14x15 | <1 ms | 0.001 | <1 ms | <1 ms |
| 158x159 | 1 | 14x15 | 0.024 | <1 ms | <1 ms | <1 ms |
| 558x559 | 1 | 14x15 | 0.354 | <1 ms | 0.006 | 0.001 |
| 1128x1159 | 1 | 14x15 | 1.400 | 0.002 | 0.023 | 0.002 |
| 2128x2159 | 1 | 14x15 | 3.839 | 0.007 | 0.082 | 0.007 |
| 5128x5159 | 1 | 14x15 | 22.856 | 0.042 | 0.475 | 0.035 |
| 11128x4159 | 1 | 14x15 | 38.172 | 0.079 | 0.833 | 0.061 |
| 10128x11159 | 1 | 14x15 | 122.679 | 0.203 | 2.614 | 0.358 |

Maximum effective compute performance: 23.23 Gflops; maximum throughput: 382.6 MB/s.
Kernel 14x15, 15 images:

| Matrix size | Images | Kernel | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 15 | 14x15 | 0.001 | <1 ms | <1 ms | <1 ms |
| 118x29 | 15 | 14x15 | 0.041 | <1 ms | 0.001 | <1 ms |
| 138x59 | 15 | 14x15 | 0.097 | <1 ms | 0.002 | <1 ms |
| 158x159 | 15 | 14x15 | 0.372 | 0.001 | 0.007 | <1 ms |
| 558x559 | 15 | 14x15 | 4.943 | 0.006 | 0.084 | 0.006 |
| 1128x1159 | 15 | 14x15 | 15.851 | 0.030 | 0.353 | 0+0x8 |
| 2128x2159 | 15 | 14x15 | 57.699 | 0.097 | 1.247 | 0.084 |
| 3158x3059 | 15 | 14x15 | 121.152 | 0.201 | 2.624 | 0.192 |
| 5128x5159 | 15 | 14x15 | pointer overflow | | | |
| 11128x4159 | 15 | 14x15 | | | | |
| 10128x11159 | 15 | 14x15 | | | | |

Maximum effective compute performance: 23.01 Gflops; maximum throughput: 362.9 MB/s.
Further analysis:
From the tables above, the maximum throughput is 1.427 GB/s, while the PCIe bus bandwidth is about 5 GB/s, so there is still some headroom. The highest effective single-precision floating-point multiplication performance is 23.23 Gflops, far below the roughly 1 Tflops peak single-precision performance quoted in the official documentation, for the following reasons:
A. The data transferred from the CPU to the GPU is a one-dimensional array, while the GPU computes in two dimensions, so accessing the data requires a great deal of address arithmetic.
B. Because of the nature of convolution, the data overlaps heavily: the large image is divided into many small blocks and the boundaries must be handled. To keep the program general (arbitrary image and kernel sizes, and batches of any number of images), copying the data into shared memory on the GPU involves many conditional tests, which is relatively expensive. GPU performance could be improved further by CPU-side preprocessing such as boundary padding; see the sketch after this list.
C. A two-dimensional loop is used to compute a single convolution result within a thread, so the computation time grows with the kernel area, i.e. quadratically in the kernel's side length.
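As one concrete form of the preprocessing suggested in item B (my own sketch, with assumed names and padding scheme): zero-pad the image on the CPU so that the padded size is a whole number of 16-wide output tiles plus the kernel halo. Every block's shared-memory tile then lies entirely inside the padded buffer, so the per-element boundary tests in the kernel can be dropped (the kernel would then index with the padded stride).

```c
#include <stdlib.h>
#include <string.h>

/* Zero-pad img (H x W) so the padded buffer holds a whole number of 16x16
 * output tiles plus the (kh-1) x (kw-1) halo. Returns a newly allocated
 * buffer of size (*padH) x (*padW); the caller frees it. */
float *pad_for_tiles(const float *img, int H, int W, int kh, int kw,
                     int *padH, int *padW)
{
    int outH = H - kh + 1, outW = W - kw + 1;       /* "valid" output size */
    *padH = ((outH + 15) / 16) * 16 + kh - 1;       /* whole tiles + halo  */
    *padW = ((outW + 15) / 16) * 16 + kw - 1;
    float *p = (float *)calloc((size_t)(*padH) * (*padW), sizeof(float));
    if (!p) return NULL;
    for (int r = 0; r < H; ++r)                     /* copy rows; the rest stays 0 */
        memcpy(p + (size_t)r * (*padW), img + (size_t)r * W,
               (size_t)W * sizeof(float));
    return p;
}
```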
Summary:
Writing a GPU program is easy; writing an efficient GPU program is hard; and writing an efficient, general GPU program is harder still, because every aspect has to be considered. The current version is only modestly general: because it uses basic data structures to hold the information, problems appear easily once the data reaches about 500 MB, and the computational efficiency is not very high.
Work plan for next week:
(1) Do some data preprocessing on the CPU, such as byte alignment and boundary padding, to minimize the conditional tests inside the GPU kernel. Consider splitting a single convolution across threads to achieve fine-grained parallelism and speed up the computation of individual convolution results.
(2) Learn to use a three-dimensional thread layout: map one image onto the first two dimensions and the batch of images onto the third, reducing the post-processing work and improving batch performance (see the sketch after this list).
(3) Consider parallelizing the deconvolution operation. Consider combining the CUDA convolution with the MATLAB version of CNN through mixed compilation to improve the efficiency of CNN.
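As a sketch of item (2), here is my own illustration of the intended layout (not the report's code): the grid's x and y dimensions tile the output of one image and blockIdx.z selects the image, so the per-image results come out already separated. Shared-memory staging is omitted for brevity.

```cuda
/* Batched "valid" convolution: a 2D tile of threads per output tile,
 * with the third grid dimension indexing the image within the batch. */
__global__ void conv2d_batch(const float *imgs, const float *krn, float *outs,
                             int H, int W, int kh, int kw)
{
    int outH = H - kh + 1, outW = W - kw + 1;
    const float *img = imgs + (size_t)blockIdx.z * H * W;         /* this image  */
    float       *out = outs + (size_t)blockIdx.z * outH * outW;   /* its results */
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < outH && col < outW) {
        float acc = 0.0f;
        for (int i = 0; i < kh; ++i)
            for (int j = 0; j < kw; ++j)
                acc += img[(row + i) * W + (col + j)] * krn[i * kw + j];
        out[row * outW + col] = acc;
    }
}
/* Launch: dim3 block(16, 16);
 *         dim3 grid((outW + 15) / 16, (outH + 15) / 16, batch); */
```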