The Sum of the Elements of an Array ("CUDA Parallel Programming" Part 7)


Now we need to get the sum of all the elements of an array. This seemed unlikely before, because each thread processes only one element and cannot relate all the elements to each other. But I recently learned a piece of code that implements it, and it also deepened my understanding of shared memory.


First, C++ serial implementation

The serial implementation is very simple: just add all the elements sequentially to get the result. In fact our focus is not on the result but on the efficiency of the operation. The code is as follows:

array_sum.cc:

#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()

using namespace std;

const int cnt = 100000;

int main()
{
    int *a = new int[cnt];
    for (int i = 0; i < cnt; i++) {
        a[i] = i + 1;
    }

    double t = wtime();
    long long sum = 0;   // long long: the sum overflows a 32-bit int for large cnt
    for (int i = 0; i < cnt; i++)
        sum += a[i];
    printf("computation elapsed %.8f\n", wtime() - t);

    delete[] a;
    return 0;
}

wtime.cu:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double wtime(void)
{
    double now_time;
    struct timeval etstart;
    struct timezone tzp;

    if (gettimeofday(&etstart, &tzp) == -1)
        perror("Error: calling gettimeofday() not successful.\n");

    now_time = ((double)etstart.tv_sec) +                /* in seconds */
               ((double)etstart.tv_usec) / 1000000.0;    /* in microseconds */

    return now_time;
}
Operation result: (screenshot not preserved)



Second, CUDA parallel implementation

First the code, then the explanation:

#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()

using namespace std;

const int count = 1000;

void generate_data(int *arr)
{
    for (int i = 0; i < count; i++) {
        arr[i] = i + 1;
    }
}

/* Smallest power of two >= n. */
int nextPowerOfTwo(int n)
{
    n--;
    n = n >> 1 | n;
    n = n >> 2 | n;
    n = n >> 4 | n;
    n = n >> 8 | n;
    n = n >> 16 | n;
    // n = n >> 32 | n;   // for 64-bit ints
    return ++n;
}

/* cnt: count; cnt2: next power of two of cnt */
__global__ static void compute_sum(int *array, int cnt, int cnt2)
{
    extern __shared__ unsigned int sharedMem[];
    sharedMem[threadIdx.x] = (threadIdx.x < cnt) ? array[threadIdx.x] : 0;
    // With only cnt threads launched, the guard above never reaches the
    // padding slots [cnt, cnt2), so zero them explicitly here:
    for (unsigned int j = threadIdx.x + blockDim.x; j < cnt2; j += blockDim.x)
        sharedMem[j] = 0;
    __syncthreads();

    // cnt2 *must* be a power of two!
    for (unsigned int s = cnt2 / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sharedMem[threadIdx.x] += sharedMem[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        array[0] = sharedMem[0];
    }
}

int main()
{
    int *a = new int[count];
    generate_data(a);

    int *deviceArray;
    cudaMalloc(&deviceArray, count * sizeof(int));
    cudaMemcpy(deviceArray, a, count * sizeof(int), cudaMemcpyHostToDevice);

    int npt_count = nextPowerOfTwo(count);   // next power of two of count
    // cout << "npt_count = " << npt_count << endl;
    int blockSharedDataSize = npt_count * sizeof(int);

    double t = wtime();
    // The kernel is launched count times purely for timing; note that each
    // launch overwrites deviceArray[0], so only the first launch computes
    // the true sum of the original data.
    for (int i = 0; i < count; i++) {
        compute_sum<<<1, count, blockSharedDataSize>>>(deviceArray, count, npt_count);
    }
    printf("computation elapsed %.8f\n", wtime() - t);

    int sum;
    cudaMemcpy(&sum, deviceArray, sizeof(int), cudaMemcpyDeviceToHost);
    cout << "sum = " << sum << endl;

    delete[] a;
    return 0;
}


main function:

generate_data assigns the initial values to array a; its dimension is count.

cudaMalloc and cudaMemcpy define the device pointer, allocate device memory, and copy the values of array a into video memory.

nextPowerOfTwo is a very subtle piece of code: it computes the smallest power of two that is greater than or equal to its input parameter n. Why this is done becomes clear inside the kernel function.

In the launch compute_sum<<<1, count, blockSharedDataSize>>>: "1" is the number of blocks, "count" is the number of threads inside each block, and "blockSharedDataSize" is the size, in bytes, of the dynamic shared memory given to each block.

Kernel function compute_sum:

The extern __shared__ declaration defines the dynamically sized shared memory array sharedMem; its size in bytes is the third launch parameter, blockSharedDataSize.

Every thread whose threadIdx.x is smaller than cnt copies its element of array into the corresponding position of sharedMem, and the padding positions up to cnt2 are zeroed so they do not affect the sum.

The reduction loop adds all the values together and leaves the result in sharedMem[0]. This code is worth savoring: it takes the O(n) time of the serial implementation and reaches O(log n) through parallelism. On each pass the stride s is halved, and every thread with index smaller than s adds sharedMem[threadIdx.x + s] into sharedMem[threadIdx.x]; this pairwise folding is why cnt2 must be a power of two. After log2(cnt2) passes the sum sits in sharedMem[0]; finally thread 0 saves the result to array[0], which is copied back to main memory.

Makefile

cu:
	nvcc cuda_array_sum.cu wtime.cu
	./a.out
Results: (screenshot not preserved)



Third, comparison of efficiency

We observe the difference in efficiency by modifying the value of count and increasing the number of timing cycles.

Code modification: (screenshot not preserved; count and the number of timing cycles are varied for each row of the table below)




Run time comparison:

Count      Serial (s)        Parallel (s)
1000       0.00627995        0.00345612
10000      0.29315591        0.06507015
100000     25.18921304       0.65188980
1000000    2507.66827798     5.61648989



Haha, as you can see, with large amounts of data the efficiency is quite good.

Author: Yi Solo Show

Email:[email protected]

Annotated Source: http://blog.csdn.net/lavorange/article/details/43031419


