Shared Memory for a Large-Scale Dot Product

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA End User License Agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

#include "../common/book.h"
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "device_functions.h"

#define imin(a,b) (a < b ? a : b)

const int N = 33 * 1024;
const int threadsPerBlock = 256;    // each thread block launches 256 threads
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);

/*
 * kernel function
 */
__global__ void dot(float *a, float *b, float *c) {
    // shared memory on the device; each thread block has its own copy
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // the thread's index within the block serves as its cache index
    int cacheIndex = threadIdx.x;

    float temp = 0;
    // while the current index is smaller than the total number of elements
    while (tid < N) {
        temp += a[tid] * b[tid];
        // the stride is the total number of launched threads
        tid += blockDim.x * gridDim.x;
    }   // if this thread runs the loop again, temp already holds the previous
        // partial sum, so the new product is accumulated onto it

    // set the cache values: each thread stores its partial sum
    // in its own slot of shared memory
    cache[cacheIndex] = temp;

    /*
     * synchronize the threads in this block, so that every thread has
     * finished its accumulation before the reduction below begins
     */
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    /*
     * reduction:
     * blockDim.x / 2 halves the number of active threads each pass, which
     * is equivalent to taking the midpoint; because blockDim.x is a power
     * of 2, the division never leaves a remainder
     */
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            /*
             * each element in the first half is added to the corresponding
             * element in the second half; later passes repeat the pattern
             */
            cache[cacheIndex] += cache[cacheIndex + i];
        /*
         * synchronize so that all threads have completed this pass of the
         * reduction before the next pass begins
         */
        __syncthreads();
        // the midpoint for the next pass
        i /= 2;
    }
    // the final result is stored in cache[0], so thread 0 assigns cache[0]
    // to the output array entry indexed by this block's index
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}


int main(void) {
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    partial_c = (float*)malloc(blocksPerGrid * sizeof(float));

    // allocate the memory on the GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_partial_c,
                            blocksPerGrid * sizeof(float)));

    // fill in the host memory with data
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(float),
                            cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(float),
                            cudaMemcpyHostToDevice));

    dot<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_partial_c);

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR(cudaMemcpy(partial_c, dev_partial_c,
                            blocksPerGrid * sizeof(float),
                            cudaMemcpyDeviceToHost));

    /*
     * finish the final addition on the host; running such a small amount
     * of work on the GPU would waste resources, since most of the device
     * would sit idle
     */
    c = 0;
    for (int i = 0; i < blocksPerGrid; i++) {
        c += partial_c[i];
    }

    #define sum_squares(x)  (x * (x + 1) * (2 * x + 1) / 6)
    printf("Does GPU value %.6g = %.6g?\n", c,
           2 * sum_squares((float)(N - 1)));

    // free memory on the GPU side
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_partial_c));

    // free memory on the CPU side
    free(a);
    free(b);
    free(partial_c);
}
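The halving reduction is the part that requires threadsPerBlock to be a power of two: each pass folds the upper half of cache onto the lower half, so the active range must keep dividing evenly all the way down (256 -> 128 -> 64 -> ... -> 1). As a minimal sketch of the same pattern in isolation (block_sum is a hypothetical kernel written for illustration, not part of the book's listing):

// Hypothetical standalone kernel isolating the halving reduction used in
// dot(); assumes a single block whose size is a power of two (<= 256).
__global__ void block_sum(const float *data, float *result) {
    __shared__ float cache[256];
    int idx = threadIdx.x;
    cache[idx] = data[idx];                 // each thread loads one element
    __syncthreads();
    for (int i = blockDim.x / 2; i != 0; i /= 2) {
        if (idx < i)
            cache[idx] += cache[idx + i];   // fold upper half onto lower half
        __syncthreads();                    // all threads finish this pass first
    }
    if (idx == 0)
        *result = cache[0];                 // thread 0 holds the block's total
}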
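A note on the final check: with a[i] = i and b[i] = i * 2, the true dot product is the sum of 2 * i^2 for i = 0 .. N-1, which has the closed form 2 * (N-1)N(2N-1)/6; that is exactly what 2 * sum_squares((float)(N - 1)) evaluates, so the printf compares the GPU result against a known answer. The same value can also be cross-checked serially on the host (dot_reference is a hypothetical helper added here for illustration, not part of the original listing):

// Hypothetical serial reference, not in the original listing: computes
// the same dot product on the CPU so the GPU result can be compared.
float dot_reference(int n) {
    float sum = 0;
    for (int i = 0; i < n; i++)
        sum += (float)i * (float)(i * 2);   // mirrors a[i] * b[i]
    return sum;
}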
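The listing depends on common/book.h from the book's sample code, which provides the HANDLE_ERROR macro. Assuming that header is on the include path and the file is saved as dot.cu (a name chosen here, not given in the source), a typical build and run looks like:

    nvcc dot.cu -o dot
    ./dot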
 

