CUDA: Supercomputing for the Masses, Part 10

Part 10: CUDPP, a Powerful Data-Parallel CUDA Library
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked on massively parallel computing at several national laboratories and was a co-founder of several startups. He can be reached at [email protected].

In Part 9 of this series on CUDA (Compute Unified Device Architecture), I explored how to use CUDA to extend high-level languages such as Python. In this installment, we look at CUDPP, the CUDA Data Parallel Primitives Library. CUDPP is a fast, mature package that implements some not-so-obvious algorithms for using the GPU efficiently for basic data-parallel operations such as sorting and stream compaction, and for building data structures such as trees and summed-area tables. I discuss CUDPP here because it may provide functionality you need to speed up the development of one of your projects.

I also introduce the concept of creating a "plan", a programming pattern that combines a description of the problem with knowledge of the hardware it will run on. Although a plan is not an optimizing compiler, using plans greatly enhances a programmer's ability to create efficient software for the many kinds of CUDA-enabled GPUs; in addition, a library can select problem-specific optimized code paths within a general framework. For example, the NVIDIA CUFFT library can choose between two FFT algorithms, picking the more efficient one when appropriate. Plans are not a concept new to CUDA or to this series of articles; they are a common design pattern that has stood the test of time.

Why Use CUDPP?

Most of us keep a toolbox of libraries and methods that we use to get work done. In short, these libraries provide building blocks that let us quickly implement computational tasks. Sorting is a simple yet effective example: you call a qsort() routine and get back your data structure in sorted order. The NVIDIA CUBLAS and CUFFT libraries provide similar convenience for less simple tasks, such as computing FFTs and optimized BLAS functionality.
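As a reminder of how familiar this pattern is on the CPU, here is a minimal sketch (my own illustration, not from the article) of sorting with the C standard library's qsort(); the comparator name is just an example.

#include <stdio.h>
#include <stdlib.h>

/* Comparator for qsort(): sort floats in ascending order. */
static int compareFloats(const void *a, const void *b)
{
    float fa = *(const float *)a;
    float fb = *(const float *)b;
    return (fa > fb) - (fa < fb);
}

int main(void)
{
    float data[] = {3.0f, 1.0f, 7.0f, 0.0f, 4.0f, 1.0f, 6.0f, 3.0f};
    size_t n = sizeof(data) / sizeof(data[0]);

    qsort(data, n, sizeof(float), compareFloats);  /* one library call does the work */

    for (size_t i = 0; i < n; ++i)
        printf("%g ", data[i]);                    /* prints: 0 1 1 3 3 4 6 7 */
    printf("\n");
    return 0;
}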


CUDPP applies the same concept to provide optimized "best in class" methods for primitive operations such as parallel prefix sum (scan), parallel sort (radix sort), and parallel reduction, plus methods built on these primitives, such as efficient sparse matrix-vector multiplication.


The parallel prefix scan is a primitive that helps efficiently solve parallel problems in which each output requires global knowledge of the inputs. For example, the prefix sum (also known as scan, prefix reduction, or partial sums) is an operation on a list in which each element of the result is obtained by summing the elements of the operand list up to that element's index. This appears to be an inherently serial operation, because each result depends on all of the values that precede it, as shown below:


Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements

[a0, a1, ..., an-1]

and returns

[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)].

Example: If ⊕ is addition, then the all-prefix-sums operation on the array

[3, 1, 7, 0, 4, 1, 6, 3]

returns

[3, 4, 11, 11, 15, 16, 22, 25].

All-prefix-sums has many uses, including, but not limited to, sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, building histograms, and constructing data structures (graphs, trees, and so on) in parallel. Research papers cover these applications in far greater breadth and depth; see, for example, Guy Blelloch's "Prefix Sums and Their Applications".
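To make one of those uses concrete, here is a minimal CPU sketch (my own illustration, not CUDPP code) of stream compaction built on an exclusive prefix sum: scanning the 0/1 keep-flags gives each surviving element its output index, so every element can be written independently.

#include <stdio.h>
#include <stdlib.h>

/* Keep the elements whose flag is 1. The exclusive scan of the flags
   tells each kept element where it goes in the output array. */
int compact(const float *in, const int *flags, float *out, int n)
{
    int *scan = (int *)malloc(n * sizeof(int));  /* exclusive prefix sum of flags */
    int sum = 0;
    for (int i = 0; i < n; ++i) { scan[i] = sum; sum += flags[i]; }

    /* Each iteration is independent, so on a GPU this loop becomes one thread per i. */
    for (int i = 0; i < n; ++i)
        if (flags[i])
            out[scan[i]] = in[i];

    free(scan);
    return sum;   /* number of elements kept */
}

int main(void)
{
    float in[]   = {3, 1, 7, 0, 4, 1, 6, 3};
    int   keep[] = {1, 0, 1, 0, 1, 0, 1, 0};    /* keep every other element */
    float out[8];
    int m = compact(in, keep, out, 8);
    for (int i = 0; i < m; ++i) printf("%g ", out[i]);  /* prints: 3 7 4 6 */
    printf("\n");
    return 0;
}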
Obviously, the sequential implementation of scan (which a single thread can run on the CPU) is trivial: we loop over all the elements of the input array, add the value of the preceding input element to the sum already stored in the preceding output element, and write the result to the current output element.

void scan(float* output, float* input, int length)
{
    output[0] = 0; // since this is a prescan, not a scan
    for (int j = 1; j < length; ++j) {
        output[j] = input[j-1] + output[j-1];
    }
}

This code loops over an array of n elements and performs n additions, the minimum number of additions required to produce the scanned array. For a parallel version of scan to be work-efficient, it must perform no more addition operations (asymptotically) than the sequential version. In other words, both implementations should have the same work complexity, O(n). That CUDPP achieves an O(n) scan run time illustrates its value, because creating an efficient parallel implementation is decidedly non-trivial. For more information, see "Scan Primitives for GPU Computing" by Shubhabrata Sengupta et al.

CUDPP release 1.0 includes:

Segmented scan, which performs multiple variable-length scans in parallel. It is useful for algorithms such as parallel quicksort, parallel sparse matrix-vector multiplication, and others (a CPU sketch of what a segmented scan computes appears after this list);

Parallel sparse matrix-vector multiplication (based on segmented scan). Sparse matrix support is important because it lets the GPU operate on matrices containing many zeros in a way that is efficient in both computation and storage: since most of the values are zero, most of the work can be skipped, and no space need be wasted storing them;

An improved scan algorithm, called "warp scan", which delivers better performance with simpler code;

Scan and segmented scan now support add, multiply, max, and min operators;

Support for inclusive scans and segmented scans;

An improved, more useful cudppCompact() interface;

Support for backward compaction;

Support for CUDA 2.0;

Better support for Mac OS X and Windows Vista.
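As promised above, here is a minimal CPU sketch (my own illustration, not CUDPP code) of a segmented exclusive sum scan: head flags mark the start of each segment, and the running sum restarts whenever a flag is set. CUDPP performs many such variable-length scans in parallel on the GPU.

#include <stdio.h>

/* Exclusive segmented sum scan: flags[i] == 1 marks the first element of a segment. */
void segmentedScan(const float *in, const int *flags, float *out, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) sum = 0.0f;   /* restart at every segment head */
        out[i] = sum;
        sum += in[i];
    }
}

int main(void)
{
    /* Two segments: {3, 1, 7} and {0, 4, 1, 6, 3} */
    float in[]    = {3, 1, 7, 0, 4, 1, 6, 3};
    int   flags[] = {1, 0, 0, 1, 0, 0, 0, 0};
    float out[8];
    segmentedScan(in, flags, out, 8);
    for (int i = 0; i < 8; ++i) printf("%g ", out[i]);  /* prints: 0 3 4 0 0 4 5 11 */
    printf("\n");
    return 0;
}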

Different problems and different GPUs call for different thread configurations and GPU resources, and a plan is a convenient way to store and reuse these configurations. Plans also let a library apply problem-specific optimizations: CUFFT, for example, can select an optimized code path depending on whether the requested FFT size is a power of two. The widely used FFTW project, which runs on a great variety of platforms, is likewise built around the plan concept. For many reasons, plans are a useful tool when developing general solutions that must run across a wide range of GPU architectures.
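To illustrate the configure/execute/destroy pattern that plans embody, here is a minimal CUFFT sketch (my own illustration; it assumes a device buffer d_signal of NX cufftComplex values has already been allocated and filled, and error checking is omitted for brevity). The same shape appears in the CUDPP example that follows.

#include <cufft.h>

// Sketch: create a plan for a 1D complex-to-complex FFT of NX points,
// execute it in place on device data, then destroy the plan.
void runFFT(cufftComplex *d_signal, int NX)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                   // 1 = batch size
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward transform
    cufftDestroy(plan);                                     // release plan resources
}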

In this simple CUDPP example, a plan is created to perform a forward exclusive float sum scan of numElements elements on the target GPU. This is done by filling in a CUDPPConfiguration struct and passing it to the plan: here we specify the algorithm (CUDPP_SCAN), the data type (CUDPP_FLOAT), the operation (CUDPP_ADD), and the options (CUDPP_OPTION_FORWARD, CUDPP_OPTION_EXCLUSIVE). cudppPlan is then called with this configuration plus the maximum number of elements to scan, numElements. Finally, we tell the plan that we will only scan a one-dimensional array by passing 1 and 0 for the numRows and rowPitch parameters. The CUDPP documentation provides more details on the parameters of cudppPlan().

    CUDPPConfiguration config;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

    CUDPPHandle scanplan = 0;
    CUDPPResult result = cudppPlan(&scanplan, config, numElements, 1, 0);

    if (CUDPP_SUCCESS != result)
    {
        printf("Error creating CUDPPPlan\n");
        exit(-1);
    }

After cudppPlan returns successfully, a handle (pointer) to the plan is held in the scanplan object. CUDPP is then put to work by calling cudppScan(), which is passed the plan handle, the output and input device arrays, and the number of elements to scan.

    // Run the scan
    cudppScan(scanplan, d_odata, d_idata, numElements);

cudaMemcpy is then used to copy the scan results from d_odata back to the host. The GPU results are verified by computing a reference solution on the CPU (via computeSumScanGold()) and comparing the CPU and GPU results for correctness.

    // allocate mem for the result on host side
    float* h_odata = (float*) malloc( memSize);
    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, memSize,
                                cudaMemcpyDeviceToHost) );
    // compute reference solution
    float* reference = (float*) malloc( memSize);
    computeSumScanGold( reference, h_idata, numElements, config);
    // check result
    CUTBoolean res = cutComparef( reference, h_odata, numElements);
    printf( "Test %s\n", (1 == res) ? "PASSED" : "FAILED");

Finally, cudppDestroyPlan() is called to release the memory associated with the plan. The host then frees the host and device arrays with free() and cudaFree(), respectively, and the application exits because simpleCUDPP is complete.

    result = cudppDestroyPlan(scanplan);
    if (CUDPP_SUCCESS != result)
    {
        printf("Error destroying CUDPPPlan\n");
        exit(-1);
    }

Sparse Matrix-Vector Multiplication

CUDPP contains many other powerful capabilities that are not discussed in this article. For example, simple test code that uses CUDPP to perform sparse matrix-vector multiplication, sptest.cu, can be downloaded from http://www.nada.kth.se/~tomaso/gpu08/sptest.cu. It can be compiled and run as follows:

# nvcc -I cudpp_1.0a/cudpp/include -o sptest sptest.cu \
       -L cudpp_1.0a/lib -lcudpp
# ./sptest
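For reference, here is a minimal CPU sketch (my own illustration, not the sptest.cu code) of the operation such a test exercises: y = A·x with A stored in compressed sparse row (CSR) form, so only the nonzero entries are stored and touched.

#include <stdio.h>

/* y = A * x, where A is stored in CSR form:
   val[]    - the nonzero values, row by row
   colIdx[] - the column index of each nonzero
   rowPtr[] - rowPtr[r]..rowPtr[r+1]-1 index the nonzeros of row r */
void spmvCSR(const float *val, const int *colIdx, const int *rowPtr,
             const float *x, float *y, int numRows)
{
    for (int r = 0; r < numRows; ++r) {
        float sum = 0.0f;
        for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[r] = sum;
    }
}

int main(void)
{
    /* 3x3 matrix [[1 0 2], [0 3 0], [4 0 5]] with 5 nonzeros */
    float val[]    = {1, 2, 3, 4, 5};
    int   colIdx[] = {0, 2, 1, 0, 2};
    int   rowPtr[] = {0, 2, 3, 5};
    float x[] = {1, 1, 1};
    float y[3];
    spmvCSR(val, colIdx, rowPtr, x, y, 3);
    printf("%g %g %g\n", y[0], y[1], y[2]);   /* prints: 3 3 9 */
    return 0;
}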
