GPU coarse-grained parallel implementation and testing for convolution operations


First, the basic idea of the algorithm:

1. Each GPU thread produces one convolution result; as many blocks are launched as are needed to cover the number of results.

2. The matrix and convolution kernel are staged in shared memory; convolution results are written to global memory.

3. Two-dimensional matrices with up to 10000 elements in either dimension are supported; convolution kernels up to 16x16 are supported.

4. Batch processing of any number of images is supported.

Second, the experimental platform:

CPU: Intel(R) Xeon(R) E5-2650 0 @ 2.00 GHz, 16 cores / 32 threads

GPU: NVIDIA Tesla C2070 (see the table below)

RAM: 64 GB

Operating system: 64-bit Windows 7

Board specifications:

| Specification | Value |
| --- | --- |
| Form factor | 9.75-inch PCIe x16 |
| Number of Tesla GPUs | 1 |
| CUDA cores | 448 |
| CUDA core frequency | 1.15 GHz |
| Peak double-precision floating-point performance | 515 Gflops |
| Peak single-precision floating-point performance | 1.03 Tflops |
| Dedicated memory | 3 GB GDDR5 (Tesla C2050) / 6 GB GDDR5 (Tesla C2070) |
| Memory frequency | 1.5 GHz |
| Memory interface | 384-bit |
| Memory bandwidth | 144 GB/s |
| Power | 238 W TDP (Tesla C2050) |
| System interface | PCIe x16 Gen2 |
| Cooling solution | Active fan heatsink |
| Display support | One dual-link DVI-I; maximum resolution 2560x1600 @ 60 Hz |
| Software development tools | CUDA C/C++/Fortran, OpenCL, and DirectCompute toolkits; NVIDIA Parallel Nsight for Visual Studio |

Third, CPU and GPU timing comparison:

Implementation notes: the CPU implementation is unoptimized; the convolution uses the basic four-nested-loop form. For the sake of generality, the GPU code performs many conditional tests inside the kernel, which costs extra time. This is an initial version and can be optimized further, for example with byte alignment.

Timing notes: clock() is used, with a resolution of 0.001 seconds (1 ms). The total GPU time includes function-call overhead, GPU startup cost, device-memory allocation and release, data-transfer cost, and computation, plus data pre- and post-processing. The CPU figure includes only function-call overhead and computation.

Error check: both the CPU and GPU store the data in float arrays; the matrix and kernel data are randomly initialized, and the CPU and GPU convolution results are compared to verify the correctness of the convolution.

Preliminary Analysis:

A. The error between the CPU and GPU convolution results is 0, indicating that the GPU implementation is correct.

B. When the matrix and kernel are small, the CPU performs better. When they are large, the GPU performs better, reflecting the GPU's parallelism and its ability to process large-scale, high-throughput data quickly. When the amount of data is small, the GPU time is dominated by startup cost, chiefly the device-memory allocation overhead.

C. On the GPU, each thread produces one convolution result serially, so the larger the kernel, the longer a single result takes.

To reduce the number of accesses to global memory, the matrix and kernel are copied into each block's shared memory before the computation.

Two-dimensional blocks of 16x16 threads are used, with as many blocks as there are convolution results. With a fixed kernel, the run time therefore stays roughly constant as the matrix grows, until the matrix is so large that the number of blocks exceeds what the GPU's SMs can run concurrently, at which point the run time is limited by the hardware. (To keep the convolution program general, the data copy uses many conditional tests and branches, which hurts performance somewhat.)

D. The CPU version is a simple serial computation, so its time grows with both the matrix and the kernel size.

E. When both matrix dimensions exceed 10000, the CPU run hits a pointer exception; possibly because the matrix is large (around 550 MB), the data storage is no longer handled correctly and the pointer arithmetic overflows, so that case could not be tested.

Kernel 5x4, single image:

| Matrix size | Images | Kernel size | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 12x9 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 18x19 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 5x4 | <1 ms | <1 ms | <1 ms | <1 ms |
| 158x159 | 1 | 5x4 | 0.003 | <1 ms | 0.001 | <1 ms |
| 558x559 | 1 | 5x4 | 0.044 | 0.001 | 0.001 | <1 ms |
| 1128x1159 | 1 | 5x4 | 0.157 | 0.002 | 0.004 | 0.001 |
| 2128x2159 | 1 | 5x4 | 0.442 | 0.007 | 0.012 | 0.007 |
| 5128x5159 | 1 | 5x4 | 2.394 | 0.038 | 0.068 | 0.035 |
| 18128x4159 | 1 | 5x4 | 6.866 | 0.111 | 0.193 | 0.114 |
| 10128x11159 | 1 | 5x4 | 10.074 | 0.160 | 0.288 | 0.142 |

Peak effective compute: 15.54 Gflops; peak transfer throughput: 1.427 GB/s.

Kernel 14x15, single image:

| Matrix size | Images | Kernel size | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 1 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 118x29 | 1 | 14x15 | <1 ms | <1 ms | <1 ms | <1 ms |
| 138x59 | 1 | 14x15 | <1 ms | 0.001 | <1 ms | <1 ms |
| 158x159 | 1 | 14x15 | 0.024 | <1 ms | <1 ms | <1 ms |
| 558x559 | 1 | 14x15 | 0.354 | <1 ms | 0.006 | 0.001 |
| 1128x1159 | 1 | 14x15 | 1.400 | 0.002 | 0.023 | 0.002 |
| 2128x2159 | 1 | 14x15 | 3.839 | 0.007 | 0.082 | 0.007 |
| 5128x5159 | 1 | 14x15 | 22.856 | 0.042 | 0.475 | 0.035 |
| 11128x4159 | 1 | 14x15 | 38.172 | 0.079 | 0.833 | 0.061 |
| 10128x11159 | 1 | 14x15 | 122.679 | 0.203 | 2.614 | 0.358 |

Peak effective compute: 23.23 Gflops; peak transfer throughput: 382.6 MB/s.

Kernel 14x15, batch of 15 images:

| Matrix size | Images | Kernel size | CPU (s) | CPU→GPU (s) | GPU kernel (s) | GPU→CPU (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 5x4 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 12x9 | 15 | 14x15 | ~ | ~ | ~ | ~ |
| 18x19 | 15 | 14x15 | 0.001 | <1 ms | <1 ms | <1 ms |
| 118x29 | 15 | 14x15 | 0.041 | <1 ms | 0.001 | <1 ms |
| 138x59 | 15 | 14x15 | 0.097 | <1 ms | 0.002 | <1 ms |
| 158x159 | 15 | 14x15 | 0.372 | 0.001 | 0.007 | <1 ms |
| 558x559 | 15 | 14x15 | 4.943 | 0.006 | 0.084 | 0.006 |
| 1128x1159 | 15 | 14x15 | 15.851 | 0.030 | 0.353 | 0+0x8 |
| 2128x2159 | 15 | 14x15 | 57.699 | 0.097 | 1.247 | 0.084 |
| 3158x3059 | 15 | 14x15 | 121.152 | 0.201 | 2.624 | 0.192 |
| 5128x5159 | 15 | 14x15 | pointer overflow | — | — | — |
| 11128x4159 | 15 | 14x15 | pointer overflow | — | — | — |
| 10128x11159 | 15 | 14x15 | pointer overflow | — | — | — |

Peak effective compute: 23.01 Gflops; peak transfer throughput: 362.9 MB/s.

Further analysis:

From the tables above, the maximum transfer throughput is 1.427 GB/s against a PCIe bus bandwidth of about 5 GB/s, so there is still headroom. The highest effective single-precision multiply performance is 23.23 Gflops, far below the 1.03 Tflops single-precision peak stated in the official documentation, for the following reasons:

A. The data that the CPU transmits to the GPU is a one-dimensional array, but the GPU computes in two dimensions, so accessing the data requires a lot of address computation.

B. Because of the nature of convolution, the data is highly redundant: the large image is divided into many small blocks, and boundary handling is required. To keep the program general (arbitrary image and kernel sizes, and batches of any number of images), the copy of data into shared memory performs many conditional tests, which is relatively expensive. Performance could be improved further by preprocessing on the CPU, such as boundary padding.

C. Each thread computes a single convolution result with a two-dimensional loop, so the per-result time grows with the kernel area and becomes significant for large kernels.

Summary:

Writing a GPU program is easy; writing an efficient GPU program is hard; writing an efficient and general GPU program is harder still, and every detail has to be considered. The current version of the program is only moderately general: because it uses basic data structures to store the data, problems appear once the data reaches around 500 MB, and the computational efficiency is not very high.

Work plan for next week:

(1) Do some data preprocessing on the CPU, such as byte alignment and boundary padding, to minimize conditional tests inside the GPU kernel. Consider splitting a single convolution across threads to achieve fine-grained parallelism and improve the efficiency of individual convolution results.

(2) Learn three-dimensional blocks: map an image onto the first and second dimensions and the batch onto the third, reducing post-processing computation and improving batch convolution performance.

(3) Consider parallelizing the deconvolution operation. Consider combining the CUDA convolution with the MATLAB version of CNN through mixed compilation to improve the efficiency of the CNN.

