In this article, the basic concepts of CUDA parallel programming are illustrated with the vector summation operation. Vector summation adds the corresponding elements of two arrays pairwise and saves the results in a third array, as shown in the following:
1. CPU-based vector summation:
The code is simple:
#include <iostream>
using namespace std;

const int N = 10;

void add(int *a, int *b, int *c)
{
    int tid = 0;
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}

int main()
{
    int a[N], b[N], c[N];

    // Assign the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++)
    {
        a[i] = -1;
        b[i] = i * i;
    }

    add(a, b, c);

    // Print results
    for (int i = 0; i < N; i++)
    {
        cout << a[i] << " + " << b[i] << " = " << c[i] << endl;
    }

    return 0;
}
The use of the while loop above may seem overly complex, but it is intended to allow the code to run in parallel on systems with multiple CPUs or CPU cores. For example, on a dual-core processor you can change the increment to 2, so that one core starts the loop from tid = 0 and the other core starts from tid = 1. The first core then adds the elements at even indices, while the second core adds the elements at odd indices. This is equivalent to executing the following code on each CPU core:
First CPU core:

void add(int *a, int *b, int *c)
{
    int tid = 0;   // this core starts at index 0
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 2;  // stride of 2: even indices
    }
}

Second CPU core:

void add(int *a, int *b, int *c)
{
    int tid = 1;   // this core starts at index 1
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 2;  // stride of 2: odd indices
    }
}
Of course, to actually perform this operation on the CPU you would need more code than this. For example, you would need to write a certain amount of code to create the worker threads, each of which executes the function add(), and assume that the threads execute in parallel. However, this assumption is idealized but impractical: the actual behavior of the thread scheduling mechanism is often not like this.
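As a rough sketch (our own addition, not code from the original post), the worker threads could be created with C++11 std::thread, one covering the even indices and one the odd indices:

#include <thread>

const int N = 10;

// Each worker starts at its own offset and strides by the number of workers.
void add(int *a, int *b, int *c, int start, int stride)
{
    int tid = start;
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += stride;
    }
}

int main()
{
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = -1; b[i] = i * i; }

    // Two worker threads, each executing add() on half of the indices.
    std::thread t0(add, a, b, c, 0, 2); // even indices
    std::thread t1(add, a, b, c, 1, 2); // odd indices
    t0.join();
    t1.join();
    return 0;
}

Whether the two threads actually run on separate cores is, as noted above, up to the operating system's scheduler.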
2. GPU-based vector summation:
We can implement the same addition operation on the GPU, which requires writing add() as a device function. Here is the code:
#include <iostream>
using namespace std;

#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++)
    {
        a[i] = -1;
        b[i] = i * i;
    }

    // Copy the input arrays to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N,1>>>(dev_a, dev_b, dev_c);

    // Copy the result back to the host
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        cout << a[i] << " + " << b[i] << " = " << c[i] << endl;
    }

    // Release the memory on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
Running the program prints the ten sums, from -1 + 0 = -1 through -1 + 81 = 80.
Explanation of the code:
+ cudaMalloc(): allocates memory on the device for the three arrays; dev_a and dev_b hold the input values, and the result is stored in dev_c.
+ cudaFree(): to avoid memory leaks, release GPU memory with cudaFree() once you are done using it.
+ cudaMemcpy(): copies the input data to the device with the parameter cudaMemcpyHostToDevice; after the computation completes, the results are copied back to the host with the parameter cudaMemcpyDeviceToHost. (A sketch of checking the return values of these calls follows this list.)
+ The device code add() is invoked from the host code in main() through the angle-bracket syntax.
+ __global__: for the function add() to execute on the device, the qualifier __global__ is added before the function name.
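All of these CUDA runtime calls return a cudaError_t status, which the example above ignores for brevity. A minimal error-checking sketch (the CHECK macro below is our own illustration, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap a CUDA runtime call and abort with a readable message on failure.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Example usage:
// CHECK(cudaMalloc((void**)&dev_a, N * sizeof(int)));
// CHECK(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));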
A kernel launch has the form: kernel<<<1,1>>>(param1, param2, ...);
But in this example the first value in the angle brackets is not 1: add<<<N,1>>>(dev_a, dev_b, dev_c);
The first parameter in the angle brackets: the number of blocks.
The second parameter in the angle brackets: the number of threads per block, that is, how many threads run in each thread block.
For example, if you specify kernel<<<2,1>>>, you can assume that the runtime will create two copies of the kernel and run them in parallel; we call each of these execution environments a thread block. If kernel<<<256,1>>>() is specified, there will be 256 thread blocks running on the GPU.
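The two parameters can also be combined. As a sketch going beyond this example (our own addition, not from the original post), a kernel launched with several blocks of several threads each computes its global index from both built-in variables:

#define N 10 // same size as in the example above

// Hypothetical variant of add() using both blocks and threads per block.
__global__ void add(int *a, int *b, int *c)
{
    // Global index = thread index within the block
    // plus the block index times the block size.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

// Launched with, e.g., (N + 127) / 128 blocks of 128 threads:
// add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);

The if (tid < N) guard matters here because the grid may contain more threads than elements.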
3. Dynamically allocating the arrays with std::vector:
Code:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

const int N = 10;

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main()
{
    vector<int> vec_a, vec_b;
    int *va, *vb, *vc;
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++)
    {
        vec_a.push_back(-1);    // vec_a[i] = -1;
        vec_b.push_back(i * i); // vec_b[i] = i * i;
    }

    /* The first way: copy the vector contents into plain arrays */
    va = new int[N];
    vb = new int[N];
    copy(vec_a.begin(), vec_a.end(), va);
    copy(vec_b.begin(), vec_b.end(), vb);

    /* The second way: use the vector's storage directly
    va = (int *)&vec_a[0]; // vector to array
    vb = (int *)&vec_b[0];
    */

    cudaMemcpy(dev_a, va, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, vb, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N,1>>>(dev_a, dev_b, dev_c);

    vc = new int[N];
    cudaMemcpy(vc, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

#if 1
    for (int i = 0; i < N; i++)
    {
        cout << va[i] << " + " << vb[i] << " = " << vc[i] << endl;
    }
#endif

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
Here we discuss the problem of converting a std::vector to a plain array.
Because a vector's elements are guaranteed to be stored contiguously in memory, it is simple and safe to write the following:
std::vector<double> v;
double* a = &v[0];
And if you cannot use the vector's storage directly, you need to copy the elements into an array, for example with std::copy:
double arr[100];
std::copy(v.begin(), v.end(), arr);
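Since C++11 there is also vector::data(), which returns the same pointer as &v[0] but may be called even on an empty vector (where &v[0] is undefined behavior). A small sketch (our addition):

#include <vector>

int main()
{
    std::vector<int> v = {1, 2, 3};
    int* p = v.data(); // pointer to the contiguous element storage
    // p can be passed to cudaMemcpy() just like a plain array.
    return 0;
}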
Three links that discuss this issue:
1. http://stackoverflow.com/questions/2923272/how-to-convert-vector-to-array-c?answertab=active#tab-top
2. http://www.cplusplus.com/forum/beginner/7477/
3. http://www.cplusplus.com/reference/algorithm/copy/
Original source: http://blog.csdn.net/lavorange/article/details/41894807
"Cuda parallel programming three" cuda Vector summation operation