C++ AMP: Parallel Computing on the GPU


 

Written by Allen Lee

 

I see all the young believers, your target audience. I see all the old deceivers; we all just sing their song.
-Marilyn Manson, Target Audience (Narcissus Narcosis)

 

From CPU to GPU

In "Meet C++ PPL: Parallelism and Asynchrony in C++", we looked at how to use C++ PPL for parallel computing on the CPU. This time, we move the stage to the GPU and look at how to use C++ AMP for parallel computing.

Why do parallel computing on the GPU? Today's multi-core CPUs are generally dual-core or quad-core; counting Hyper-Threading, that amounts to four or eight logical cores. A current GPU, on the other hand, has hundreds of cores or more: the mid-range NVIDIA GTX 560 SE has 288 cores, and the top-end NVIDIA GTX 690 has as many as 3072. GPUs with this many cores are ideal for large-scale parallel computing.

Next, we will take the code that computes sine values in parallel from "Meet C++ PPL: Parallelism and Asynchrony in C++" and modify it so that it runs on the GPU. If you have not read that article, I suggest you read it first. This article also assumes some familiarity with C++ lambdas; otherwise, I suggest reading "Meet C++ Lambda" first.

 

Computing sine values in parallel

First, include the relevant header files and namespaces, as shown in Code 1. amp.h is the main C++ AMP header; it contains the related functions and classes, which live in the concurrency namespace. amp_math.h contains common mathematical functions, such as sin. The functions in the concurrency::fast_math namespace support only single-precision floating-point numbers, while the functions in the concurrency::precise_math namespace support both single-precision and double-precision floating-point numbers.

Code 1
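The original listing is not reproduced here; a minimal sketch of what it likely contains, based on the description above:

    #include <amp.h>       // C++ AMP classes and functions: array_view, extent, index, parallel_for_each, ...
    #include <amp_math.h>  // common math functions such as sin (fast_math / precise_math)

    using namespace concurrency;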

Next, change the floating-point type from double to float, as shown in Code 2, because not all GPUs support double-precision floating-point calculation. In addition, both the std and concurrency namespaces have an array class; to remove the ambiguity, we add the "std::" prefix to array to tell the compiler that we mean the STL array class.

Code 2
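A sketch of the change, assuming the same ten-element data set as in the PPL article; the variable name a and the sample values are placeholders:

    const int size = 10;

    // float instead of double, because not every GPU supports double precision;
    // the "std::" prefix distinguishes this array from concurrency::array.
    std::array<float, size> a = { 0.0f, 0.1f, 0.2f, 0.3f, 0.4f,
                                  0.5f, 0.6f, 0.7f, 0.8f, 0.9f };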

Create an array_view object to wrap the array object we created earlier, as shown in Code 3. An array_view is only a wrapper and does not hold any data itself; it must be used together with a real container, such as a C-style array, an STL array object, or a vector object. When creating an array_view, you specify its element type and number of dimensions through the template parameters, and the length of each dimension and the container holding the actual data through the constructor arguments.

Code 3
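A sketch, assuming the std::array from the previous step is named a:

    // One-dimensional array_view of float, length 10, wrapping the std::array.
    array_view<float, 1> av(size, a);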

Code 3 creates a one-dimensional array_view whose length matches that of the array object. This wrapping may look redundant, so why is it needed? Code running on the GPU cannot directly access data in system memory; the array_view acts as a bridge that lets GPU code access that data indirectly. Strictly speaking, the GPU code does not touch system memory at all: the data is copied into video memory, and the array_view is responsible for copying it from system memory to video memory. This happens automatically and requires no intervention on our part.

With these preparations in place, we can write the code that runs on the GPU, as shown in Code 4. The parallel_for_each function can be regarded as the entry point of C++ AMP: an extent object tells it how many GPU threads to create, and a lambda tells it what code those GPU threads run. This code is usually called the kernel.

Code 4
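A sketch of such a kernel, continuing the previous snippets; it computes the sine of each element in place using the single-precision fast_math::sin:

    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
        // Each GPU thread handles the element identified by its index object.
        av[idx] = fast_math::sin(av[idx]);
    });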

We want each GPU thread to perform the group of operations corresponding to one element of the result set. For example, to calculate the sine of 10 floating-point numbers, we want to create 10 GPU threads, each of which reads one floating-point number, calculates its sine, and stores the result. But every GPU thread runs the same code, so how do we tell the threads apart and locate the data each one should process?

This is where the index object comes in. Our array_view is one-dimensional, so the index type is index<1>, and because the length of that dimension is 10, ten index objects from 0 to 9 are generated, one for each GPU thread. The index object is passed to us through the lambda's parameter, and inside the kernel we use it to locate the data that the current GPU thread should process.

Since the lambda's parameter only conveys the index object, how does the kernel exchange data with the outside world? Through the closure: we capture variables from the surrounding context, which lets us operate flexibly on multiple data sources and result sets, so there is no need for a return value. In this respect, C++ AMP's parallel_for_each is used much like C++ PPL's parallel_for, as shown in Code 5; the extent object we pass to the former takes the place of the start and end index values we pass to the latter.

Code 5
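For comparison, a sketch of the CPU-side PPL version described above (it would additionally require <ppl.h> and <cmath>; the variable names mirror the earlier snippets):

    parallel_for(0, size, [&](int i) {
        a[i] = std::sin(a[i]);  // start and end index values instead of an extent
    });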

So what is the restrict(amp) modifier to the right of the kernel? The kernel ultimately runs on the GPU, in whatever form, and the restrict(amp) modifier tells the compiler exactly that. When the compiler sees restrict(amp), it checks whether the kernel uses any unsupported language features; if so, compilation is aborted and the errors are listed, and if not, the kernel is compiled into HLSL and handed to DirectCompute to run. The kernel can call other functions, but those functions must also carry the restrict(amp) modifier, like the sin function in Code 4.

Once the calculation is done, we can output the array_view's data with a for loop, as shown in Code 6. The first time we access the array_view through its indexer on the CPU, it copies the data from video memory back to system memory. Again, this happens automatically and requires no intervention on our part.

Code 6
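A sketch of the output loop (it would additionally require <iostream>); the first access through the indexer triggers the copy back to system memory:

    for (int i = 0; i < size; ++i) {
        std::cout << av[i] << std::endl;
    }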

That was quite a lot of explanation, but in fact, using C++ AMP generally involves only the following three steps:

  1. Create an array_view object.
  2. Call the parallel_for_each function.
  3. Access the computing result through the array_view object.

Everything else, such as allocating and releasing video memory and scheduling and managing GPU threads, is handled by C++ AMP.

 

Computing the sum of matrices in parallel

In the previous section we used a simple example to learn the basics of C++ AMP. Next, we will use another example to learn how array_view, extent, and index are used in two-dimensional scenarios.

Suppose we want to calculate the sum of two 100 x 100 matrices. First we define the numbers of rows and columns of the matrix, then create two vector objects through the create_matrix function, and finally create a third vector object to store the sum of the matrices, as shown in Code 7.

Code 7
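A sketch of this setup; the names rows, columns, matrix_a, matrix_b, and matrix_sum are placeholders:

    const int rows = 100;
    const int columns = 100;

    std::vector<int> matrix_a = create_matrix(rows * columns);
    std::vector<int> matrix_b = create_matrix(rows * columns);
    std::vector<int> matrix_sum(rows * columns);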

The implementation of the create_matrix function is very simple: it accepts the total capacity of the matrix (the product of rows and columns) as its parameter, then creates and returns a vector object filled with random numbers less than 100, as shown in Code 8.

Code 8
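A sketch of such a function (it would additionally require <vector>, <algorithm>, and <cstdlib>):

    std::vector<int> create_matrix(int size) {
        std::vector<int> matrix(size);
        // Fill the vector with random numbers less than 100.
        std::generate(matrix.begin(), matrix.end(), [] { return rand() % 100; });
        return matrix;  // eligible for Named Return Value Optimization
    }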

It is worth noting that when create_matrix executes "return matrix;", the vector object is nominally copied into a temporary that is returned to the caller, while the original vector is destroyed as it goes out of scope. The compiler's Named Return Value Optimization elides this copy, so there is no need to worry about the performance of returning by value.

Although we describe the matrix in two-dimensional terms such as rows and columns, it is actually simulated with a one-dimensional vector object, so we need to perform an index transformation when using it: the element at row m, column n of the matrix corresponds to index m * columns + n of the vector (m and n counted from 0). Suppose we use a vector object to simulate a 3 x 3 matrix, as shown in Figure 1; to access the element at row 2, column 0 of the matrix, we use index 6 (2 * 3 + 0) on the vector object.

Figure 1

Next, we create three array_view objects, each wrapping one of the three vector objects created earlier. When creating them, we specify the number of rows first and then the number of columns, as shown in Code 9.

Code 9
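A sketch, wrapping the three vectors from Code 7; the row count comes first, then the column count (the names a, b, and sum are placeholders):

    array_view<int, 2> a(rows, columns, matrix_a);
    array_view<int, 2> b(rows, columns, matrix_b);
    array_view<int, 2> sum(rows, columns, matrix_sum);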

Because the array_view objects are two-dimensional, we can access the matrix elements directly with a two-dimensional index, without calculating the corresponding one-dimensional index as before. Take the 3 x 3 matrix as an example: as shown in Figure 2, the vector object is divided into three sections of three elements each; the first section corresponds to the first row of the array_view, the second section to the second row, and so on. To access the element at row 2, column 0 of the matrix, we can use the index (2, 0) directly on the array_view object, which corresponds to index 6 of the vector object.

Figure 2

Since data only flows from system memory to video memory for the first two array_view objects, we can change their element type parameter to const int, as shown in Code 10; this marks them as read-only in the kernel, so they will not modify the vector objects they wrap. As for the third array_view, it is only used to output the calculation result, so we can call its discard_data member function before calling parallel_for_each; this indicates that we are not interested in the existing data of the vector object it wraps, so that data does not need to be copied from system memory to video memory.

Code 10
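A sketch of the adjusted declarations, continuing the previous snippet:

    array_view<const int, 2> a(rows, columns, matrix_a);  // read-only in the kernel
    array_view<const int, 2> b(rows, columns, matrix_b);  // read-only in the kernel
    array_view<int, 2> sum(rows, columns, matrix_sum);
    sum.discard_data();  // no need to copy matrix_sum's current contents to video memory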

With these preparations, we can write the kernel, as shown in Code 11. We pass the extent of the third array_view to the parallel_for_each function; because the matrix is 100 x 100, parallel_for_each creates 10,000 GPU threads, and each GPU thread computes one element of the result matrix. Since the array_view objects we access are two-dimensional, the index type also changes to the corresponding index<2>.

Code 11
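A sketch of the kernel, assuming the array_view names used above:

    parallel_for_each(sum.extent, [=](index<2> idx) restrict(amp) {
        // One GPU thread per element of the 100 x 100 result matrix.
        sum[idx] = a[idx] + b[idx];
    });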

At this point you may ask: can a GPU really create that many threads? It depends on the specific GPU. The NVIDIA GTX 690, for example, has 16 multiprocessors (Kepler architecture, 192 CUDA cores per multiprocessor), and each multiprocessor supports a maximum of 2,048 threads, so it can host up to 32,768 threads at the same time. The NVIDIA GTX 560 SE has 9 multiprocessors (Fermi architecture, 32 CUDA cores per multiprocessor) with a maximum of 1,536 threads each, so it can host up to 13,824 threads at the same time.

After the computation is complete, we can access the results on the CPU through the indexer. Code 12 outputs the element at row 14, column 12 of the result matrix to the console.

Code 12
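A sketch of that access (it would additionally require <iostream>):

    // The first CPU access copies the data back from video memory to system memory.
    std::cout << sum(14, 12) << std::endl;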

 

Async + continuation

After mastering the basic usage of C++ AMP, we naturally want to know whether the parallel_for_each function blocks the current CPU thread. The parallel_for_each function is responsible for launching the kernel, but it does not wait for the kernel to finish before returning. Take Code 13 as an example: once parallel_for_each returns, the code at checkpoint 1 runs as usual even if the kernel has not finished. From this perspective, parallel_for_each is asynchronous. However, when we access the computation result through the array_view object, if the kernel is still running, the code at checkpoint 2 blocks until the kernel finishes and the array_view copies the data from video memory back to system memory.

Code 13
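A sketch of the two checkpoints described above, reusing the matrix-sum kernel:

    parallel_for_each(sum.extent, [=](index<2> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx];
    });

    // Checkpoint 1: reached as soon as parallel_for_each returns,
    // even if the kernel has not finished running.

    // Checkpoint 2: this access blocks until the kernel finishes and the data
    // has been copied from video memory back to system memory.
    std::cout << sum(14, 12) << std::endl;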

Since the kernel runs asynchronously, we naturally want C++ AMP to provide continuations similar to those in C++ PPL. Fortunately, the array_view object provides a synchronize_async member function, which returns a concurrency::completion_future object; we can use that object's then member function to implement a continuation, as shown in Code 14. In fact, this then member function is implemented on top of the C++ PPL task object.

Code 14
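A sketch of such a continuation, assuming the same array_view named sum:

    sum.synchronize_async().then([=] {
        // Runs after the data has been copied back to system memory.
        std::cout << sum(14, 12) << std::endl;
    });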

 

Questions you may ask

1. What do I need to develop C++ AMP programs?

You need Visual Studio 2012 and a graphics card that supports DirectX 11; Visual C++ 2012 Express also works. If you want to do GPU debugging, you also need the Windows 8 operating system. To run a C++ AMP program, you need Windows 7 or Windows 8 and a graphics card that supports DirectX 11. When deploying, you need to place the C++ AMP runtime (vcamp110.dll) in a directory the program can find, or install the Visual C++ 2012 Redistributable Package on the target machine.

2. Does C++ AMP support other languages?

C++ AMP can only be used from C++; other languages can call your C++ AMP code indirectly through the usual interop mechanisms:

  • How to use C++ AMP from C#
  • How to use C++ AMP from C# using WinRT
  • How to use C++ AMP from C++ CLR app
  • Using C++ AMP code in a C++ CLR project

3. Does C++ AMP support other platforms?

Currently, C++ AMP only supports the Windows platform. However, Microsoft has released the C++ AMP open specification, which allows anyone to implement it on any platform. If you want to do GPU parallel computing on other platforms, you can consider other technologies, such as NVIDIA CUDA (which supports only NVIDIA graphics cards) or OpenCL, both of which support multiple platforms.

4. Can you recommend some C++ AMP learning materials?

There is currently no book on C++ AMP. Kate Gregory and Ade Miller are writing one, and I hope to see it soon. Some online learning materials are recommended below:

  • C++ AMP open specification
  • Parallel Programming in Native Code (team blog)
  • C++ AMP (C++ Accelerated Massive Parallelism)
  • C++ AMP Videos

 

* Statement: This article, "Meet C++ AMP: Parallel Computing on the GPU", was first published on the InfoQ China site. All rights reserved. If you reprint it, please include this statement. Thank you.
