Now we need to compute the sum of all the elements of an array. This seemed impossible before, because each thread processes only one element and has no way to relate all the elements to one another. But I recently learned a piece of code that makes it possible, and along the way gained a deeper understanding of shared memory.
I. C++ serial implementation
The serial implementation is straightforward: add all the elements in order to get the result. In fact, what we care about here is not the result itself but the running time. The code is as follows:
array_sum.cc:
#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()
using namespace std;

const int cnt = 100000;

int main()
{
    int *a = new int[cnt];
    for (int i = 0; i < cnt; i++) {
        a[i] = i + 1;
    }

    int sum = 0;   // the declaration was missing in the original listing
    double t = wtime();
    for (int i = 0; i < cnt; i++)
        sum += a[i];
    printf("computation elapsed %.8f \n", wtime() - t);

    delete[] a;
    return 0;
}
wtime.cu:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double wtime(void)
{
    double now_time;
    struct timeval etstart;
    struct timezone tzp;

    if (gettimeofday(&etstart, &tzp) == -1)
        perror("Error: calling gettimeofday() not successful.\n");

    now_time = ((double)etstart.tv_sec) +                /* in seconds */
               ((double)etstart.tv_usec) / 1000000.0;    /* in microseconds */
    return now_time;
}
Run result: (screenshot in the original post)
II. CUDA parallel implementation
First the code, then the explanation:
#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()
using namespace std;

const int count = 1000;

void generate_data(int *arr)
{
    for (int i = 0; i < count; i++) {
        arr[i] = i + 1;
    }
}

/* Round n up to the next power of two by smearing the highest
   set bit into every lower position, then adding one. */
int nextPowerOfTwo(int n)
{
    n--;
    n = n >> 1 | n;
    n = n >> 2 | n;
    n = n >> 4 | n;
    n = n >> 8 | n;
    n = n >> 16 | n;
    // n = n >> 32 | n;   // for 64-bit ints
    return ++n;
}

/* cnt:  count
   cnt2: next power of two of count */
__global__ static void compute_sum(int *array, int cnt, int cnt2)
{
    extern __shared__ unsigned int sharedMem[];
    // threads beyond cnt zero-fill the padding slots up to cnt2
    sharedMem[threadIdx.x] = (threadIdx.x < cnt) ? array[threadIdx.x] : 0;
    __syncthreads();

    // cnt2 *must* be a power of two!
    for (unsigned int s = cnt2 / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sharedMem[threadIdx.x] += sharedMem[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        array[0] = sharedMem[0];
    }
}

int main()
{
    int *a = new int[count];
    generate_data(a);

    int *deviceArray;
    cudaMalloc(&deviceArray, count * sizeof(int));
    cudaMemcpy(deviceArray, a, count * sizeof(int), cudaMemcpyHostToDevice);

    int npt_count = nextPowerOfTwo(count);   // next power of two of count
    // cout << "npt_count = " << npt_count << endl;
    int blockSharedDataSize = npt_count * sizeof(int);

    double t = wtime();
    // note: this loop exists only to make the timing measurable; each
    // launch overwrites array[0], so only the first launch computes
    // the true sum of the original data
    for (int i = 0; i < count; i++) {
        // launch npt_count threads (not count) so the padding slots of
        // shared memory are zero-initialized by the ternary in the kernel
        compute_sum<<<1, npt_count, blockSharedDataSize>>>(deviceArray, count, npt_count);
    }
    printf("computation elapsed %.8f \n", wtime() - t);

    int sum;
    cudaMemcpy(&sum, deviceArray, sizeof(int), cudaMemcpyDeviceToHost);
    cout << "sum = " << sum << endl;
    return 0;
}
main function:
generate_data(a) assigns initial values to array a, whose length is count.
cudaMalloc and cudaMemcpy define the device pointer, allocate device memory, and copy array a into video memory.
nextPowerOfTwo() is a very subtle piece of code: it computes the smallest power of two greater than or equal to its input n. Why this is needed becomes clear inside the kernel function.
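As a quick check (my trace, not in the original post), follow n = 1000 through the bit-smearing:

// nextPowerOfTwo(1000):
// n--:                999 = 0b1111100111
// n = n >> 1 | n:           0b1111110111
// n = n >> 2 | n:           0b1111111111   (= 1023; the high bit has now
// n = n >> 4|8|16 | n:      0b1111111111    been smeared into every lower bit)
// return ++n:               0b10000000000  = 1024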
In the launch compute_sum<<<1, npt_count, blockSharedDataSize>>>, "1" is the number of blocks, "npt_count" is the number of threads per block, and "blockSharedDataSize" is the size of the dynamically allocated shared memory. (Note that a single block supports at most 1024 threads on current GPUs, so this one-block approach only works for small arrays.)
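One caveat worth adding (my note, not in the original post): kernel launches are asynchronous, so wtime() can stop before the GPU has actually finished. A synchronization just before reading the clock makes the measurement honest:

    for (int i = 0; i < count; i++) {
        compute_sum<<<1, npt_count, blockSharedDataSize>>>(deviceArray, count, npt_count);
    }
    cudaDeviceSynchronize();   // wait for all queued kernels to finish
    printf("computation elapsed %.8f \n", wtime() - t);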
The kernel function compute_sum:
extern __shared__ unsigned int sharedMem[] declares the dynamically sized shared-memory buffer; its actual size is the third launch parameter, blockSharedDataSize.
Each thread with threadIdx.x < cnt copies its element of array into sharedMem; the threads from cnt up to cnt2 write zeros, so the padding slots are well defined.
The reduction loop adds all the values together and leaves the total in sharedMem[0]. This code is worth savoring: in each pass, the first s threads each add the element s positions away, and s halves every time, so the O(n) serial sum is performed in O(log n) parallel steps. This is also why cnt2 must be a power of two: it lets the stride halve cleanly all the way down to 1. Finally, thread 0 saves the result to array[0], which is then copied back to main memory.
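To make the index pattern concrete, here is a small host-side sketch (my illustration, not from the original post) that runs the same folding serially on eight elements:

#include <stdio.h>

int main()
{
    // cnt2 = 8; on the GPU, the inner loop below is what threads
    // 0..s-1 do simultaneously in a single pass
    unsigned int sharedMem[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (unsigned int s = 8 / 2; s > 0; s >>= 1) {
        for (unsigned int i = 0; i < s; i++)
            sharedMem[i] += sharedMem[i + s];
        // after s=4: {6,8,10,12,...}; after s=2: {16,20,...}; after s=1: {36,...}
    }
    printf("sum = %u\n", sharedMem[0]);   // 36 = 1+2+...+8, in log2(8) = 3 passes
    return 0;
}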
Makefile
cu:
	nvcc cuda_array_sum.cu wtime.cu
	./a.out
Result: (screenshot in the original post)
III. Efficiency comparison
We compare the efficiency of the two implementations by varying the value of count and increasing the number of timing-loop iterations.
Code modification (shown as a screenshot in the original post):
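Since the exact change appears only as a screenshot, here is a minimal sketch of the idea (my reconstruction), assuming the serial sum is simply repeated cnt times to mirror the parallel version's launch loop:

    // serial side: repeat the whole summation cnt times so the elapsed
    // time is large enough to measure (mirrors the kernel launch loop)
    double t = wtime();
    for (int loop = 0; loop < cnt; loop++) {
        sum = 0;
        for (int i = 0; i < cnt; i++)
            sum += a[i];
    }
    printf("computation elapsed %.8f \n", wtime() - t);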
Run time comparison:
count     | serial (s)      | parallel (s)
----------|-----------------|-------------
1000      | 0.00627995      | 0.00345612
10000     | 0.29315591      | 0.06507015
100000    | 25.18921304     | 0.65188980
1000000   | 2507.66827798   | 5.61648989
Haha, as you can see, with large amounts of data the parallel version's efficiency is quite good.
Author: Yi Solo Show
Email: [email protected]
Original source: http://blog.csdn.net/lavorange/article/details/43031419
[CUDA Parallel Programming 7] Sum of the Elements of an Array