Introduction to C++ AMP (Part Two)
Last updated: 2014-05-02
Prerequisite reading: "Introduction to C++ AMP (Part One)"
Environment: Windows 8.1 64-bit English, Visual Studio Update 1 English, NVIDIA Quadro K600 graphics card
Overview
Describes C++ AMP's array, array_view, extent, and other types, as well as tiling.
Body
Movement of data
C++ AMP uses two data containers (template classes), array and array_view, to move data between the runtime (CPU) and the accelerator (a video card or dedicated compute card). The array class makes a deep copy of the data at construction time, copying the data onto the accelerator (GPU). The array_view class is a wrapper class: the source data is copied to the accelerator only when a kernel function uses it.
#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

// Demonstrates how the array class is used
void test_array()
{
    // Test data
    std::vector<int> data(5);
    for (int count = 0; count < 5; count++)
    {
        data[count] = count;
    }

    // Construct an array instance
    array<int, 1> a(5, data.begin(), data.end());

    parallel_for_each(a.extent, [=, &a](index<1> idx) restrict(amp)
    {
        a[idx] = a[idx] * 10;
    });

    // The array instance a does not need a synchronize call,
    // but its contents must be assigned back into data
    data = a;

    // Outputs 0,10,20,30,40
    for (int i = 0; i < 5; i++)
    {
        std::cout << data[i] << "\n";
    }
}
array_view has nearly the same members as array, but the two behave differently underneath. When you construct two array_view instances that point to the same data source, they actually refer to the same memory, and the data is copied to the accelerator only when it is needed, so you have to pay attention to data synchronization. The main advantage of array_view is that data is moved only when the accelerator actually uses it.
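As a counterpart to the array example above, here is a minimal array_view sketch of my own (the function and variable names are illustrative, not from the original post); it shows that the copy to the accelerator is deferred until the kernel runs, and that synchronize (or reading the view on the CPU) brings the results back:

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

// Minimal array_view sketch; test_array_view and its variable names are illustrative
void test_array_view()
{
    std::vector<int> data(5);
    for (int count = 0; count < 5; count++)
    {
        data[count] = count;
    }

    // array_view wraps the existing CPU memory; nothing is copied yet
    array_view<int, 1> av(5, data);

    // The data is copied to the accelerator only when the kernel uses it;
    // array_view is captured by value, unlike array, which must be captured by reference
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] = av[idx] * 10;
    });

    // synchronize copies the modified data back to the CPU source
    // (reading av on the CPU would also trigger an implicit synchronization)
    av.synchronize();

    // Outputs 0,10,20,30,40
    for (int i = 0; i < 5; i++)
    {
        std::cout << data[i] << "\n";
    }
}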
Shared memory is memory that both the CPU and the GPU can access, and the array class can control how shared memory is accessed. First, however, we need to test whether the accelerator supports shared memory. Below is a sample that uses shared memory with the array class.
int test_sharedmemory()
{
    // A computer may have multiple accelerators; take the default accelerator
    accelerator acc = accelerator(accelerator::default_accelerator);

    // Test whether the default accelerator supports shared memory
    if (!acc.supports_cpu_shared_memory)
    {
        std::cout << "The default accelerator does not support shared memory" << std::endl;
        return 1;
    }

    // Set the default CPU access mode
    acc.set_default_cpu_access_type(access_type_read_write);

    // Create an accelerator_view (accelerator view) instance for acc;
    // it picks up the read/write mode set by default_cpu_access_type
    accelerator_view acc_v = acc.default_view;

    // The extent tells the array instances to create one-dimensional arrays with 10 elements
    extent<1> ex(10);

    // Specify the accelerator view; this array is write-only on the CPU
    array<int, 1> arr_w(ex, acc_v, access_type_write);

    // Specify the accelerator view; this array is read-only on the CPU
    array<int, 1> arr_r(ex, acc_v, access_type_read);

    // Specify the accelerator view; this array is readable and writable on the CPU
    array<int, 1> arr_rw(ex, acc_v, access_type_read_write);

    return 0;
}
Index class
The index class specifies the position of an element within an array or array_view object. Below is a sample that uses the index class.
void test_indexclass()
{
    int aCPP[] = {1, 2, 3, 4, 5, 6};

    // Wrap the data in a new two-dimensional (two rows, three columns) array_view
    array_view<int, 2> a(2, 3, aCPP);

    index<2> idx(1, 2);

    // Outputs 6
    std::cout << a[idx] << "\n";
}
Extent class
Although the extent class is not necessary in many situations, some of Microsoft's sample code uses it, so it is worth a brief introduction.
The extent class specifies the number of elements in each dimension of an array or array_view. You can use an extent to create an array or array_view object, and you can also retrieve the extent of an existing array or array_view. The following example demonstrates the extent class.
void test_extentclass()
{
    int aCPP[] = {111, 112, 113, 114, 121, 122, 123, 124, 131, 132, 133, 134,
                  211, 212, 213, 214, 221, 222, 223, 224, 231, 232, 233, 234};

    extent<3> e(2, 3, 4);
    array_view<int, 3> a(e, aCPP);

    // Assert that extent[0], [1], [2] are 2, 3, 4
    assert(2 == a.extent[0]);
    assert(3 == a.extent[1]);
    assert(4 == a.extent[2]);
}
Parallel_for_each function
We already called the parallel_for_each function in the previous article. It takes two parameters. The first is the compute domain, an extent or tiled_extent object that defines the set of threads to run concurrently on the accelerator; one thread is generated per element of the compute domain. The second parameter is a lambda expression that defines the code to run on each thread.
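To make the two parameters concrete, here is a minimal sketch of my own (test_parallel_for_each and squares are illustrative names), annotating which argument is the compute domain and which is the per-thread lambda:

#include <amp.h>
#include <vector>
using namespace concurrency;

// Minimal sketch of the two parallel_for_each parameters; names are illustrative
void test_parallel_for_each()
{
    std::vector<int> squares(16);
    array_view<int, 1> av(16, squares);
    av.discard_data();    // the old contents are not needed on the accelerator

    parallel_for_each(
        av.extent,                        // first parameter: the compute domain (an extent);
                                          // one thread is generated per element
        [=](index<1> idx) restrict(amp)   // second parameter: the lambda each thread executes
        {
            av[idx] = idx[0] * idx[0];
        });

    av.synchronize();    // squares now holds 0, 1, 4, 9, ...
}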
Accelerating code: tiles and barriers
Tiling divides the full set of threads into equal rectangular groups of m*n threads. Each group is called a tile, and the tiles together make up the full set of threads; this subdivision is called tiling.
To use tiling, call the extent::tile method on the compute domain passed to parallel_for_each, and use a tiled_index object in the lambda expression.
Below are two diagrams from the Microsoft documentation that show how elements are indexed.
In the first diagram, idx is an index instance and sample is the global space (an array or array_view object).
In the second diagram, t_idx is a tiled_index instance, also indexing into the global space (an array or array_view object).
Here is a sample from the Microsoft documentation: each tile consists of 2*2 = 4 threads, and the sample computes the average of the elements within each tile.
void test_tile()
{
    // Test sample:
    int sampledata[] = {
        2, 2, 9, 7, 1, 4,
        4, 4, 8, 8, 3, 4,
        1, 5, 1, 2, 5, 2,
        6, 8, 3, 2, 7, 2};

    // The tiles (there are 6 tiles below):
    // 2 2    9 7    1 4
    // 4 4    8 8    3 4
    //
    // 1 5    1 2    5 2
    // 6 8    3 2    7 2

    // averagedata holds the results of the computation:
    int averagedata[] = {
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
    };

    // Every four elements (four threads) make up one tile, so there are six tiles in total
    array_view<int, 2> sample(4, 6, sampledata);
    array_view<int, 2> average(4, 6, averagedata);

    parallel_for_each(
        // [1] extent.tile replaces extent: cut the extent into 2x2 tiles
        sample.extent.tile<2, 2>(),
        // [2] tiled_index replaces index, enabling tiled mode
        [=](tiled_index<2, 2> idx) restrict(amp)
        {
            // The scope of a tile_static variable is the whole tile,
            // so each tile (2*2 = 4 threads) instantiates only one nums
            tile_static int nums[2][2];

            // Every thread in the tile executes the following code,
            // copying its value into the tile_static instance nums,
            // so the same nums is assigned 2*2 = 4 times
            nums[idx.local[1]][idx.local[0]] = sample[idx.global];

            // Wait for all threads in the tile to finish executing the code above
            idx.barrier.wait();

            // Now all 2*2 = 4 elements of nums hold valid values.
            // Every thread in the tile executes the following code,
            // each computing the average
            int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];

            // Copy the result of the computation into the array_view object
            average[idx.global] = sum / 4;
        });

    // Print the result
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 6; j++)
        {
            std::cout << average(i, j) << " ";
        }
        std::cout << "\n";
    }
    // Output:
    // 3 3 8 8 3 3
    // 3 3 8 8 3 3
    // 5 5 2 2 4 4
    // 5 5 2 2 4 4
}
The advantage of tiling is that accessing data in tile_static variables is faster than accessing the global space (array and array_view objects).
To gain a performance benefit from tiling, the algorithm must split the compute domain into tiles and then copy the data into tile_static variables to speed up data access.
Be careful not to accumulate data within a tile using code like the following:
tile_static float total;
total += matrix[t_idx];
Reason 1: the initial value of total is indeterminate, so the result of the second line is meaningless.
Reason 2: multiple threads in the tile race on the same tile_static variable, so the result of the computation is indeterminate.
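One race-free alternative, sketched by me rather than taken from the Microsoft sample: let a single thread initialize the tile_static accumulator, insert a tile_static memory fence, and serialize the updates with atomic_fetch_add. C++ AMP atomic functions operate on int/unsigned int, so the accumulator here is an int; the function name, its parameters, and the 2*2 tile size are illustrative, and the sketch assumes the matrix extent divides evenly by the tile size:

#include <amp.h>
using namespace concurrency;

// Sketch of a race-free in-tile accumulation (illustrative, not the official sample).
// sums is assumed to have one element per tile.
void test_tile_sum(array_view<const int, 2> matrix, array_view<int, 2> sums)
{
    parallel_for_each(
        matrix.extent.tile<2, 2>(),
        [=](tiled_index<2, 2> t_idx) restrict(amp)
        {
            tile_static int total;
            if (t_idx.local[0] == 0 && t_idx.local[1] == 0)
            {
                total = 0;    // a single thread initializes the accumulator
            }

            // Make the initialization visible to every thread in the tile
            t_idx.barrier.wait_with_tile_static_memory_fence();

            // atomic_fetch_add serializes the updates, so there is no data race
            atomic_fetch_add(&total, matrix[t_idx.global]);

            // Wait until every thread has contributed its value
            t_idx.barrier.wait_with_tile_static_memory_fence();

            if (t_idx.local[0] == 0 && t_idx.local[1] == 0)
            {
                sums(t_idx.tile[0], t_idx.tile[1]) = total;
            }
        });
}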
Memory fences
Under the restrict(amp) restriction there are two kinds of memory that must be synchronized:
Global memory: array or array_view instances
tile_static memory: memory local to a tile
A memory fence ensures that the threads' accesses to these two kinds of memory are synchronized. The following three methods create memory fences:
The tile_barrier::wait method (or tile_barrier::wait_with_all_memory_fence): creates a fence on both global memory and tile_static memory.
The tile_barrier::wait_with_global_memory_fence method: creates a fence on global memory only.
The tile_barrier::wait_with_tile_static_memory_fence method: creates a fence on tile_static memory only.
Calling the specific kind of fence you need can improve the performance of your application. In the following example, replacing the call to tile_barrier::wait with tile_barrier::wait_with_tile_static_memory_fence improves performance.
// Use a tile_static memory fence
parallel_for_each(
    matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
    [=, &averages](tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
    {
        // Copy the data from global memory into tile_static memory
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait until the data in tile_static memory is complete
        t_idx.barrier.wait_with_tile_static_memory_fence();

        // If the if statement were removed, the code would run on every thread
        // in the tile and each element of the tile would be assigned the same average
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0)
        {
            for (int trow = 0; trow < SAMPLESIZE; trow++)
            {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++)
                {
                    averages(t_idx.tile[0], t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0], t_idx.tile[1]) /= (float)(SAMPLESIZE * SAMPLESIZE);
        }
    });
Code decorated with restrict(amp) executes on the accelerator (GPU). By default, breakpoints set inside such code are not hit. Click the project name in the Solution Explorer window, press Alt+Enter to open the project property page, and under Configuration Properties -> Debugging -> Debugger Type change the default "Auto" to "GPU Only"; you can then debug the code that executes on the accelerator (GPU) in the current project.
According to the official documentation, unsigned integers are processed faster than signed integers, so use unsigned integers whenever possible.
References
"Using Tiles"
Http://msdn.microsoft.com/en-us/library/vstudio/hh873135.aspx
Using tiles
Http://msdn.microsoft.com/zh-cn/library/vstudio/hh873135.aspx
Copyright notice: this is an original blog article and may not be reproduced without the author's consent.