Introduction to C++ AMP (II)


Last updated on : 2014-05-02

Prerequisite reading: "Introduction to C++ AMP (I)"

Environment: Windows 8.1 64-bit (English), Visual Studio Update 1 (English), NVIDIA Quadro K600 graphics card

Overview

This article describes C++ AMP's array, array_view, extent, and related types, as well as tiling.

Body

Moving data

The two data containers array and array_view (both template classes) move data from the CPU to the accelerator (a graphics card or a general-purpose compute card). Constructing an array makes a deep copy of the data and copies it to the accelerator (GPU) immediately. The array_view class, in contrast, is a wrapper that copies the source data to the accelerator only when a kernel function is about to use it.

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

// Demonstrates how the array class is used
void Test_array()
{
    // Test data
    std::vector<int> data(5);
    for (int count = 0; count < 5; count++) {
        data[count] = count;
    }

    // Construct an array instance (deep copy of data)
    array<int, 1> a(5, data.begin(), data.end());

    parallel_for_each(a.extent, [=, &a](index<1> idx) restrict(amp) {
        a[idx] = a[idx] * 10;
    });

    // The array instance a needs no synchronize call,
    // but its contents must be assigned back to data
    data = a;

    // Output: 0, 10, 20, 30, 40
    for (int i = 0; i < 5; i++) {
        std::cout << data[i] << "\n";
    }
}


array_view has almost the same members as array, but the two behave differently underneath. When you create two array_view instances over the same data source, they refer to the same memory, and the data is copied to the accelerator only when needed, so you must pay attention to data synchronization. The main advantage of array_view is that data is moved only when the accelerator actually uses it.

Shared memory is memory that both the CPU and the GPU can access. The array class lets you control how shared memory is accessed, but you must first test whether the accelerator supports shared memory. The following sample code uses shared memory with array.
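The difference between the deep copy made by array and the aliasing behaviour of array_view can be sketched in standard C++ without amp.h. This is only an analogy under stated assumptions: IntView and demo_copy_vs_view are hypothetical names invented for illustration, and the sketch does not model the transfer to the accelerator or array_view's synchronize call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for array_view: a non-owning wrapper over existing storage.
struct IntView {
    int* data;
    std::size_t size;
    int& operator[](std::size_t i) { assert(i < size); return data[i]; }
};

// Returns the source vector after modifying a deep copy of it
// and then modifying it through two views.
std::vector<int> demo_copy_vs_view() {
    std::vector<int> source = {0, 1, 2, 3, 4};

    // array-like behaviour: a deep copy, so changes stay in the copy.
    std::vector<int> copy = source;
    for (int& v : copy) v *= 10;
    // source is still {0, 1, 2, 3, 4}; it would change only if copy were assigned back.

    // array_view-like behaviour: two views over the same storage alias each other.
    IntView first{source.data(), source.size()};
    IntView second{source.data(), source.size()};
    first[4] = 99;          // visible through second and in source
    second[0] = second[4];  // reads the 99 written through first
    return source;
}
```

Calling demo_copy_vs_view() returns {99, 1, 2, 3, 99}: the writes made through either view land in the one shared buffer, while the multiplied deep copy never touches the source.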

int Test_sharedmemory()
{
    // A computer may have several accelerators; take the default one
    accelerator acc = accelerator(accelerator::default_accelerator);

    // Test whether the default accelerator supports shared memory
    if (!acc.supports_cpu_shared_memory) {
        std::cout << "The default accelerator does not support shared memory" << std::endl;
        return 1;
    }

    // Set the default CPU access type
    acc.set_default_cpu_access_type(access_type_read_write);

    // Get the accelerator_view for acc; its access type defaults to
    // the accelerator's default_cpu_access_type property
    accelerator_view acc_v = acc.default_view;

    // extent<1> describes a one-dimensional array with 10 elements
    extent<1> ex(10);

    // Input array on the given accelerator view: write-only on the CPU
    array<int, 1> arr_w(ex, acc_v, access_type_write);

    // Output array on the given accelerator view: read-only on the CPU
    array<int, 1> arr_r(ex, acc_v, access_type_read);

    // Array on the given accelerator view that the CPU can read and write
    array<int, 1> arr_rw(ex, acc_v, access_type_read_write);

    return 0;
}


Index class

The index class specifies the position of an element in an array or array_view object. The following example code uses the index class.

void Test_indexclass()
{
    int acpp[] = {1, 2, 3, 4, 5, 6};

    // Create a two-dimensional (two rows, three columns) array_view wrapper
    array_view<int, 2> a(2, 3, acpp);

    index<2> idx(1, 2);

    // Output: 6
    std::cout << a[idx] << "\n";
}
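Why index (1, 2) selects the value 6 becomes clearer once the two-dimensional position is mapped onto the flat source buffer. The following is a minimal sketch in standard C++ (no amp.h), assuming the row-major layout that the example implies; flat_offset_2d and element_at are hypothetical helper names, not part of C++ AMP.

```cpp
#include <cassert>

// Row-major offset of element (row, col) in a matrix with 'cols' columns.
int flat_offset_2d(int row, int col, int cols) {
    assert(row >= 0 && col >= 0 && col < cols);
    return row * cols + col;
}

// Looks up (row, col) in a flat buffer viewed as a matrix with 'cols' columns.
int element_at(const int* buffer, int cols, int row, int col) {
    return buffer[flat_offset_2d(row, col, cols)];
}
```

For the buffer {1, 2, 3, 4, 5, 6} viewed as two rows of three columns, position (1, 2) maps to offset 1 * 3 + 2 = 5, which holds the value 6.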


Extent class

Although the extent class is not strictly necessary in many situations, Microsoft's sample code uses it, so it is worth introducing here.

The extent class specifies the number of elements in each dimension of an array or array_view object. You can create an extent and use it to construct an array or array_view object, or you can retrieve the extent of an existing array or array_view object. The following example shows the use of the extent class.

void Test_extentclass()
{
    int acpp[] = {111, 112, 113, 114, 121, 122, 123, 124, 131, 132, 133, 134,
                  211, 212, 213, 214, 221, 222, 223, 224, 231, 232, 233, 234};
    extent<3> e(2, 3, 4);
    array_view<int, 3> a(e, acpp);

    // Assert that extent[0], [1], [2] are 2, 3, 4 (assert requires <cassert>)
    assert(2 == a.extent[0]);
    assert(3 == a.extent[1]);
    assert(4 == a.extent[2]);
}
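What an extent of (2, 3, 4) implies for the underlying buffer can be sketched in standard C++ without amp.h: the element count is the product of the dimensions, and a three-part position maps to one flat offset. The helpers element_count, nd_offset, and make_positional_data are hypothetical names for illustration, and the row-major order (dimension 0 most significant) is an assumption consistent with the examples in this article.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Total number of elements described by an N-dimensional extent.
template <std::size_t N>
int element_count(const std::array<int, N>& extent) {
    int total = 1;
    for (int dim : extent) total *= dim;
    return total;
}

// Row-major offset of an N-dimensional index (dimension 0 is the most significant).
template <std::size_t N>
int nd_offset(const std::array<int, N>& extent, const std::array<int, N>& index) {
    int offset = 0;
    for (std::size_t d = 0; d < N; ++d) {
        assert(index[d] >= 0 && index[d] < extent[d]);
        offset = offset * extent[d] + index[d];
    }
    return offset;
}

// Fills a 2 x 3 x 4 buffer so that each value encodes its own 1-based position;
// for example, zero-based position (1, 2, 3) holds 234.
std::vector<int> make_positional_data() {
    std::vector<int> data;
    for (int i = 1; i <= 2; ++i)
        for (int j = 1; j <= 3; ++j)
            for (int k = 1; k <= 4; ++k)
                data.push_back(100 * i + 10 * j + k);
    return data;
}
```

With extent {2, 3, 4}, element_count gives 24, and the zero-based index {1, 2, 3} maps to offset (1 * 3 + 2) * 4 + 3 = 23, the last element of the buffer.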


parallel_for_each function

We already called the parallel_for_each function in the previous article. It takes two parameters. The first is the compute domain, an extent or tiled_extent object that defines the set of threads to run concurrently on the accelerator; one thread is created for each element of the compute domain. The second parameter is a lambda expression that defines the code to run on each thread.
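The "one thread per element of the compute domain" model can be illustrated with a sequential stand-in in standard C++. This is only a sketch: for_each_index and times_ten are hypothetical names, the loop runs on the CPU in index order, and a real accelerator would run the kernel invocations concurrently and in no guaranteed order.

```cpp
#include <cassert>
#include <vector>

// Sequential stand-in for a one-dimensional compute domain:
// calls 'kernel' once for every index in [0, extent).
template <typename Kernel>
void for_each_index(int extent, Kernel kernel) {
    assert(extent >= 0);
    for (int idx = 0; idx < extent; ++idx) {
        kernel(idx);
    }
}

// Multiplies every element by 10, one logical "thread" per element.
std::vector<int> times_ten(std::vector<int> values) {
    for_each_index(static_cast<int>(values.size()),
                   [&values](int idx) { values[idx] = values[idx] * 10; });
    return values;
}
```

Because each invocation touches only the element at its own index, the result does not depend on the order in which the invocations run, which is what makes such a kernel safe to execute in parallel.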

Accelerating code: tiles and barriers

The full set of threads is divided into equal rectangular subsets of M*N threads each. Each subset is called a tile, and dividing the threads this way is called tiling.

To use tiling, call the extent::tile method on the compute domain passed to parallel_for_each, and use a tiled_index object in the lambda expression.

The following two diagrams from Microsoft's official website show how the elements are indexed.

In the first figure, idx is the index object and sample is the global space (an array or array_view object).

In the second figure, t_idx is the index object and descriptions is the global space (an array or array_view object).
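How a thread's global position relates to its tile and its position inside that tile can be written out as plain arithmetic. The sketch below is standard C++ without amp.h; TiledPosition and locate are hypothetical names, and the member names global, tile, and local are chosen only to echo the fields used with tiled_index in this article.

```cpp
#include <cassert>

// Position of one thread in a tiled two-dimensional compute domain (row, column order).
struct TiledPosition {
    int global[2];  // position in the whole domain
    int tile[2];    // which tile the thread belongs to
    int local[2];   // position inside that tile
};

// Splits a global position into tile and local coordinates
// for tiles of tile_rows x tile_cols threads.
TiledPosition locate(int row, int col, int tile_rows, int tile_cols) {
    assert(tile_rows > 0 && tile_cols > 0 && row >= 0 && col >= 0);
    TiledPosition p;
    p.global[0] = row;
    p.global[1] = col;
    p.tile[0] = row / tile_rows;
    p.tile[1] = col / tile_cols;
    p.local[0] = row % tile_rows;
    p.local[1] = col % tile_cols;
    return p;
}
```

For 2 x 2 tiles, the thread at global position (3, 4) lies in tile (1, 2) at local position (1, 0), and in general global = tile * tile size + local in each dimension.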



Here is an example from Microsoft in which every 2*2 = 4 threads form a tile, and each tile computes the average of its elements.

void Test_tile()
{
    // Test sample:
    int sampledata[] = {
        2, 2, 9, 7, 1, 4,
        4, 4, 8, 8, 3, 4,
        1, 5, 1, 2, 5, 2,
        6, 8, 3, 2, 7, 2};

    // The tiles (six tiles in total):
    // 2 2    9 7    1 4
    // 4 4    8 8    3 4
    //
    // 1 5    1 2    5 2
    // 6 8    3 2    7 2

    // averagedata holds the results of the computation:
    int averagedata[] = {
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
    };

    // Every four elements (four threads) form a tile, so there are six tiles
    array_view<int, 2> sample(4, 6, sampledata);
    array_view<int, 2> average(4, 6, averagedata);

    // Tiling is enabled by (1) using extent.tile instead of extent
    // and (2) using tiled_index instead of index
    parallel_for_each(
        // Divide the extent into tiles of 2 x 2
        sample.extent.tile<2, 2>(),
        [=](tiled_index<2, 2> idx) restrict(amp)
        {
            // A tile_static variable is scoped to the whole tile, so each tile
            // (2*2 = 4 threads) instantiates only one tile_static array
            tile_static int nums[2][2];

            // Every thread in the tile runs the following line, copying its value
            // into nums, so the same nums is assigned 2*2 = 4 times
            nums[idx.local[1]][idx.local[0]] = sample[idx.global];

            // Wait until all threads in the tile have run the code above
            idx.barrier.wait();

            // All 2*2 = 4 elements of nums now hold valid values.
            // Every thread in the tile runs the following code to compute the average
            int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];

            // Copy the result into the array_view object
            average[idx.global] = sum / 4;
        });

    // Print the results
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 6; j++) {
            std::cout << average(i, j) << " ";
        }
        std::cout << "\n";
    }

    // Output:
    // 3 3 8 8 3 3
    // 3 3 8 8 3 3
    // 5 5 2 2 4 4
    // 5 5 2 2 4 4
}
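The expected output of the tiled example can be checked on the CPU with a sequential emulation in standard C++ (no amp.h). This is a sketch under assumptions, not the accelerator code itself: tile_averages is a hypothetical name, the tiles are processed one after another, and the point where the barrier would sit is marked only by a comment.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of the tiled average over a rows x cols matrix with 2 x 2 tiles.
std::vector<int> tile_averages(const std::vector<int>& sample, int rows, int cols) {
    constexpr int T = 2;  // tile edge length
    assert(rows % T == 0 && cols % T == 0);
    assert(static_cast<int>(sample.size()) == rows * cols);

    std::vector<int> average(sample.size(), 0);
    for (int tr = 0; tr < rows / T; ++tr) {
        for (int tc = 0; tc < cols / T; ++tc) {
            // Phase 1: each thread of the tile copies its element into the tile-local buffer.
            int nums[T][T];
            for (int lr = 0; lr < T; ++lr)
                for (int lc = 0; lc < T; ++lc)
                    nums[lr][lc] = sample[(tr * T + lr) * cols + (tc * T + lc)];

            // The barrier would sit here: no thread continues until nums is complete.

            // Phase 2: each thread computes the same integer average and stores it.
            int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];
            for (int lr = 0; lr < T; ++lr)
                for (int lc = 0; lc < T; ++lc)
                    average[(tr * T + lr) * cols + (tc * T + lc)] = sum / 4;
        }
    }
    return average;
}
```

For the 4 x 6 sample matrix above, the top-left tile holds 2, 2, 4, 4, whose sum 12 divided by 4 gives 3, matching the first two entries of the first two output rows.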


The advantage of tiling is that accessing data in tile_static variables is faster than accessing it in the global space (array and array_view objects). To benefit from tiling, the algorithm must split the compute domain into tiles and then copy the data into tile_static variables to speed up data access.

Be careful not to accumulate data from a tile with code like the following:

tile_static float total;

total += matrix[t_idx];

Reason 1: the initial value of total is indeterminate, so the addition on the second line is meaningless.

Reason 2: multiple threads in the tile compete for the same tile_static variable, so the result is nondeterministic.
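One common way around both problems is to give every thread its own slot and let a single thread add the slots up after a barrier, starting from an explicit zero. The sketch below shows that two-phase pattern sequentially in standard C++ (no amp.h); tile_total is a hypothetical name, and the loops merely stand in for the threads of one tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of a race-free sum over the values of one tile.
float tile_total(const std::vector<float>& tile_values) {
    assert(!tile_values.empty());

    // Phase 1: each logical thread writes only its own slot,
    // so no two threads ever write to the same location.
    std::vector<float> slots(tile_values.size());
    for (std::size_t thread = 0; thread < tile_values.size(); ++thread) {
        slots[thread] = tile_values[thread];
    }

    // A barrier would sit here so that every slot is filled before anyone reads it.

    // Phase 2: a single thread starts from an explicit 0 and adds up every slot.
    float total = 0.0f;
    for (float v : slots) total += v;
    return total;
}
```

The accumulator now has a defined starting value, and only one thread ever writes to it, which removes both of the causes listed above.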

Memory fences

In restrict(amp) code, there are two kinds of memory whose accesses must be synchronized:

Global memory: array or array_view instances

tile_static memory: memory local to a tile

A memory fence ensures that accesses to these two kinds of memory are synchronized across the threads of a tile. The following three methods invoke a memory fence:

tile_barrier::wait (or tile_barrier::wait_with_all_memory_fence) method: creates a fence around both global memory and tile_static memory.

tile_barrier::wait_with_global_memory_fence method: creates a fence around global memory only.

tile_barrier::wait_with_tile_static_memory_fence method: creates a fence around tile_static memory only.

Calling the specific type of fence you need can improve the performance of your application. In the following example, calling tile_barrier::wait_with_tile_static_memory_fence instead of tile_barrier::wait improves performance.

// Use a tile_static memory fence
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
    [=, &averages](tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
{
    // Copy the data from global memory into tile_static memory.
    tile_static float tilevalues[SAMPLESIZE][SAMPLESIZE];
    tilevalues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the copy into tile_static memory to complete.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, every thread in the tile runs the code
    // below and makes the same assignment to averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0], t_idx.tile[1]) += tilevalues[trow][tcol];
            }
        }
        averages(t_idx.tile[0], t_idx.tile[1]) /= (float)(SAMPLESIZE * SAMPLESIZE);
    }
});


Code marked with restrict(amp) runs on the accelerator (GPU). By default, breakpoints set inside such code are not hit. To debug it, select the project name in the [Solution Explorer] window and press [Alt]+[Enter] to open the project's property pages. Under [Configuration Properties]->[Debugging], [Debugger Type] defaults to "Auto"; change it to "GPU Only" to debug the code of the current project that runs on the accelerator (GPU).

According to Microsoft's documentation, unsigned integers are processed faster than signed integers (modulus and division in particular), so use unsigned integers wherever possible.

Resources

"Using Tiles" (English)

http://msdn.microsoft.com/en-us/library/vstudio/hh873135.aspx

"Using Tiles" (Chinese)

http://msdn.microsoft.com/zh-cn/library/vstudio/hh873135.aspx

