Now we need to compute the sum of all the elements of an array. This seemed impossible before, because each thread processes only one element and has no way to relate all the elements to one another. But I recently learned a piece of code that makes it possible, and along the way gained a deeper understanding of shared memory.
I. C++ serial implementation
The serial implementation is straightforward: add all the elements in order to get the result. In fact, what we care about here is not the result itself but the running time. The code is as follows:
array_sum.cc:
#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()
using namespace std;

const int cnt = 100000;

int main()
{
    int *a = new int[cnt];
    for (int i = 0; i < cnt; i++) {
        a[i] = i + 1;
    }

    int sum = 0;   // the declaration was missing in the original listing
    double t = wtime();
    for (int i = 0; i < cnt; i++)
        sum += a[i];
    printf("computation elapsed %.8f \n", wtime() - t);

    delete[] a;
    return 0;
}
wtime.cu:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double wtime(void)
{
    double now_time;
    struct timeval etstart;
    struct timezone tzp;

    if (gettimeofday(&etstart, &tzp) == -1)
        perror("Error: calling gettimeofday() not successful.\n");

    now_time = ((double)etstart.tv_sec) +                /* in seconds */
               ((double)etstart.tv_usec) / 1000000.0;    /* in microseconds */
    return now_time;
}
Run result: (screenshot in the original post)
II. CUDA parallel implementation
First the code, then the explanation:
#include <iostream>
#include <stdio.h>
#include "kmeans.h"   // declares wtime()
using namespace std;

const int count = 1000;

void generate_data(int *arr)
{
    for (int i = 0; i < count; i++) {
        arr[i] = i + 1;
    }
}

/* Round n up to the next power of two by smearing the highest
   set bit into every lower position, then adding one. */
int nextPowerOfTwo(int n)
{
    n--;
    n = n >> 1 | n;
    n = n >> 2 | n;
    n = n >> 4 | n;
    n = n >> 8 | n;
    n = n >> 16 | n;
    // n = n >> 32 | n;   // for 64-bit ints
    return ++n;
}

/* cnt:  count
   cnt2: next power of two of count */
__global__ static void compute_sum(int *array, int cnt, int cnt2)
{
    extern __shared__ unsigned int sharedMem[];
    // threads beyond cnt zero-fill the padding slots up to cnt2
    sharedMem[threadIdx.x] = (threadIdx.x < cnt) ? array[threadIdx.x] : 0;
    __syncthreads();

    // cnt2 *must* be a power of two!
    for (unsigned int s = cnt2 / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sharedMem[threadIdx.x] += sharedMem[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        array[0] = sharedMem[0];
    }
}

int main()
{
    int *a = new int[count];
    generate_data(a);

    int *deviceArray;
    cudaMalloc(&deviceArray, count * sizeof(int));
    cudaMemcpy(deviceArray, a, count * sizeof(int), cudaMemcpyHostToDevice);

    int npt_count = nextPowerOfTwo(count);   // next power of two of count
    // cout << "npt_count = " << npt_count << endl;
    int blockSharedDataSize = npt_count * sizeof(int);

    double t = wtime();
    // note: this loop exists only to make the timing measurable; each
    // launch overwrites array[0], so only the first launch computes
    // the true sum of the original data
    for (int i = 0; i < count; i++) {
        // launch npt_count threads (not count) so the padding slots of
        // shared memory are zero-initialized by the ternary in the kernel
        compute_sum<<<1, npt_count, blockSharedDataSize>>>(deviceArray, count, npt_count);
    }
    printf("computation elapsed %.8f \n", wtime() - t);

    int sum;
    cudaMemcpy(&sum, deviceArray, sizeof(int), cudaMemcpyDeviceToHost);
    cout << "sum = " << sum << endl;
    return 0;
}
main function:
generate_data(a) assigns initial values to array a, whose length is count.
cudaMalloc and cudaMemcpy define the device pointer, allocate device memory, and copy array a into video memory.
nextPowerOfTwo() is a very subtle piece of code: it computes the smallest power of two greater than or equal to its input n. Why this is needed becomes clear inside the kernel function.
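As a quick check (my trace, not in the original post), follow n = 1000 through the bit-smearing:

// nextPowerOfTwo(1000):
// n--:                999 = 0b1111100111
// n = n >> 1 | n:           0b1111110111
// n = n >> 2 | n:           0b1111111111   (= 1023; the high bit has now
// n = n >> 4|8|16 | n:      0b1111111111    been smeared into every lower bit)
// return ++n:               0b10000000000  = 1024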
In the launch compute_sum<<<1, npt_count, blockSharedDataSize>>>, "1" is the number of blocks, "npt_count" is the number of threads per block, and "blockSharedDataSize" is the size of the dynamically allocated shared memory. (Note that a single block supports at most 1024 threads on current GPUs, so this one-block approach only works for small arrays.)
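One caveat worth adding (my note, not in the original post): kernel launches are asynchronous, so wtime() can stop before the GPU has actually finished. A synchronization just before reading the clock makes the measurement honest:

    for (int i = 0; i < count; i++) {
        compute_sum<<<1, npt_count, blockSharedDataSize>>>(deviceArray, count, npt_count);
    }
    cudaDeviceSynchronize();   // wait for all queued kernels to finish
    printf("computation elapsed %.8f \n", wtime() - t);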
The kernel function compute_sum:
extern __shared__ unsigned int sharedMem[] declares the dynamically sized shared-memory buffer; its actual size is the third launch parameter, blockSharedDataSize.
Each thread with threadIdx.x < cnt copies its element of array into sharedMem; the threads from cnt up to cnt2 write zeros, so the padding slots are well defined.
The reduction loop adds all the values together and leaves the total in sharedMem[0]. This code is worth savoring: in each pass, the first s threads each add the element s positions away, and s halves every time, so the O(n) serial sum is performed in O(log n) parallel steps. This is also why cnt2 must be a power of two: it lets the stride halve cleanly all the way down to 1. Finally, thread 0 saves the result to array[0], which is then copied back to main memory.
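To make the index pattern concrete, here is a small host-side sketch (my illustration, not from the original post) that runs the same folding serially on eight elements:

#include <stdio.h>

int main()
{
    // cnt2 = 8; on the GPU, the inner loop below is what threads
    // 0..s-1 do simultaneously in a single pass
    unsigned int sharedMem[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (unsigned int s = 8 / 2; s > 0; s >>= 1) {
        for (unsigned int i = 0; i < s; i++)
            sharedMem[i] += sharedMem[i + s];
        // after s=4: {6,8,10,12,...}; after s=2: {16,20,...}; after s=1: {36,...}
    }
    printf("sum = %u\n", sharedMem[0]);   // 36 = 1+2+...+8, in log2(8) = 3 passes
    return 0;
}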
Makefile
cu:
	nvcc cuda_array_sum.cu wtime.cu
	./a.out
Result: (screenshot in the original post)
III. Efficiency comparison
We compare the efficiency of the two implementations by varying the value of count and increasing the number of timing-loop iterations.
Code modification (shown as a screenshot in the original post):
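Since the exact change appears only as a screenshot, here is a minimal sketch of the idea (my reconstruction), assuming the serial sum is simply repeated cnt times to mirror the parallel version's launch loop:

    // serial side: repeat the whole summation cnt times so the elapsed
    // time is large enough to measure (mirrors the kernel launch loop)
    double t = wtime();
    for (int loop = 0; loop < cnt; loop++) {
        sum = 0;
        for (int i = 0; i < cnt; i++)
            sum += a[i];
    }
    printf("computation elapsed %.8f \n", wtime() - t);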
Run time comparison:
count     | serial (s)      | parallel (s)
----------|-----------------|-------------
1000      | 0.00627995      | 0.00345612
10000     | 0.29315591      | 0.06507015
100000    | 25.18921304     | 0.65188980
1000000   | 2507.66827798   | 5.61648989
Haha, as you can see, with large amounts of data the parallel version's efficiency is quite good.
Author: Yi Solo Show
Email: [email protected]
Original source: http://blog.csdn.net/lavorange/article/details/43031419
[CUDA Parallel Programming 7] Sum of the Elements of an Array