CUDA: Supercomputing for the Masses - Section 1

Source: Internet
Author: User
Tags: cuda toolkit, gdb debugger, nvcc

Original article link

Section 1

CUDA lets you develop software that runs on the GPU while using familiar programming concepts.


Rob Farber is a senior researcher at Pacific Northwest National Laboratory. He has worked on large-scale parallel computing at several national laboratories and has been a partner in several startups. You can reach him at [email protected].

Are you interested in getting performance increases of several orders of magnitude over standard multi-core processors while programming in a high-level language such as C? Would you like that capability to scale across multiple devices?

Many people (myself included) have achieved this kind of performance and scalability by using NVIDIA's CUDA (Compute Unified Device Architecture) to program inexpensive multi-threaded GPUs. I stress "program" because CUDA is an architecture that serves your work rather than forcing your work into a limited set of performance libraries. With CUDA, you can apply your own talents and design software to get the best performance out of multi-threaded hardware, and have fun doing it, because getting the computation right is genuinely interesting and the software development environment is both sensible and intuitive.

This article is the first in the series. It introduces CUDA's capabilities (through working code) and the thought process that helps you map applications onto multi-threaded hardware such as GPUs for better performance. Of course, not every problem maps efficiently onto multi-threaded hardware, so I will discuss which ones do and which ones do not, and give you a general sense of which mappings run well.

"Cuda programming" is not the same as "gpgpu programming" (although Cuda runs on the GPU ). Previously, writing software for GPUs meant programming in the GPU language. One of my friends once described this process as pulling data from your elbow. Cuda allows you to use familiar programming concepts to develop software that can run on the GPU. By directly compiling software to hardware (for example, GPU assembly language), you can avoid the performance overhead of the graphic layer API, which can provide better performance.

You have a choice of CUDA devices. Figures 1 and 2 show a CUDA multi-body simulation program running on a laptop GPU and on a discrete desktop GPU, respectively.

Figure 1: A multi-body astronomy simulation running on a laptop with a Quadro FX 570M.

Figure 2: A multi-body astronomy simulation running on a desktop with a GeForce 8800 GTS.

Can CUDA really increase application performance by one to two orders of magnitude, or is that an exaggeration rather than reality?


CUDA is a fairly new technology, but books and websites already offer many examples highlighting the large performance gains it achieves on current commodity GPU hardware. Tables 1 and 2 summarize results posted on the NVIDIA and Beckman Institute websites. The core of CUDA is its ability to let programmers keep thousands of threads busy. The current generation of GPUs can efficiently support a very large number of threads, which is why they can increase application performance by one to two orders of magnitude. These graphics processors span a wide range of prices, putting them within reach of almost everyone. Newer boards will extend CUDA's capabilities with additional hardware features such as greater memory bandwidth, asynchronous data transfer, atomic operations, and double-precision floating-point arithmetic. As the technology advances, the CUDA software environment will continue to expand, and eventually the distinction between GPUs and multi-core processors will fade away. As software developers, we can anticipate that applications with thousands of active threads will become commonplace and that CUDA will run on multiple platforms, including general-purpose processors.

Application Example                URL                                              Application Acceleration
Seismic database                   http://www.headwave.com                          66x to 100x
Mobile phone antenna simulation    http://www.acceleware.com                        45x
Molecular dynamics                 http://www.ks.uiuc.edu/Research/vmd              21x to 100x
Neuron simulation                  http://www.evolvedmachines.com                   100x
MRI processing                     http://bic-test.beckman.uiuc.edu                 245x to 415x
Atmospheric cloud simulation       http://www.cs.clemson.edu/~jesteel/clouds.html   50x


Table 1: NVIDIA summary, www.nvidia.com/object/io_43499.html


GPU performance results, March 2008
GeForce 8800 GTX w/ CUDA 1.1, driver 169.09

Computation                                       Algorithm                          Speedup vs. Intel QX6700 CPU
Fluorescence microphotolysis                      Iterative matrix/stencil           12x
Pairlist calculation                              Particle pair distance test        10x to 11x
Pairlist update                                   Particle pair distance test        5x to 15x
Molecular dynamics nonbonded force calculation    N-body cutoff force calculations   10x to 20x
Cutoff electron density sum                       Particle-grid w/ cutoff            15x to 23x
Cutoff potential summation                        Particle-grid w/ cutoff            12x to 21x
Direct Coulomb summation                          Particle-grid w/ cutoff            44x


Table 2: Beckman Institute table from www.ks.uiuc.edu/research/vmd/publications/siam2008vmdcuda.pdf


In the 1980s, I was a researcher at Los Alamos National Laboratory, where I was lucky enough to work with a Thinking Machines supercomputer that had 65,536 parallel processors. CUDA has proven to be a natural framework for modern massively parallel (that is, highly threaded) environments, and its performance benefits are obvious: some of my production code, now written in CUDA and running on NVIDIA GPUs, shows essentially linear scaling and runs nearly two orders of magnitude faster than on a 2.6-GHz quad-core Opteron system.

CUDA-enabled graphics processors operate as co-processors within the host computer. This means each GPU is considered to have its own memory and processing elements, separate from those of the host. To do useful work, data must be transferred between the memory space of the host computer and that of the CUDA device, so performance results should include this I/O time to be meaningful. Colleagues like to call such figures "honest numbers" because they more accurately reflect the performance an application will deliver in production.
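To see what "honest numbers" look like in practice, transfer time can be measured with CUDA events. The following is a minimal sketch, not code from this article; the array size N is an arbitrary choice for illustration:

// copyTiming.cu -- minimal sketch of timing a host-to-device copy
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(void)
{
   int N = 1 << 20;                                  // 1M floats (arbitrary size)
   float *a_h = (float *)malloc(sizeof(float)*N);    // host buffer
   float *a_d;
   cudaMalloc((void **) &a_d, sizeof(float)*N);      // device buffer

   cudaEvent_t start, stop;
   cudaEventCreate(&start);
   cudaEventCreate(&stop);

   cudaEventRecord(start, 0);                        // mark start on stream 0
   cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
   cudaEventRecord(stop, 0);                         // mark end on stream 0
   cudaEventSynchronize(stop);                       // wait until the copy completes

   float ms = 0.f;
   cudaEventElapsedTime(&ms, start, stop);           // elapsed time in milliseconds
   printf("host-to-device copy of %d floats: %f ms\n", N, ms);

   cudaEventDestroy(start); cudaEventDestroy(stop);
   cudaFree(a_d); free(a_h);
   return 0;
}

Timing the copies alongside the computation keeps the reported speedups honest.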

I maintain that a performance increase of one to two orders of magnitude over existing technology is a transformative change that can fundamentally alter some aspects of computing. For example, a computational task that previously took a year can now finish in just a few days; an hours-long computation suddenly becomes interactive because it completes in seconds with the new technology; and previously intractable real-time processing tasks become straightforward. Finally, it creates great opportunities for consultants and engineers with the right skill sets to write highly threaded (or massively parallel) software. What could this computing power do for your career, your applications, or your real-time processing needs?

Getting started costs nothing: just download CUDA from the CUDA Zone homepage (look for "Get CUDA"), then follow the installation instructions for your particular operating system. You don't even need a graphics processor, because you can use the software emulator to run on your laptop or workstation and start working. Of course, you will get much better performance by running on a CUDA-enabled GPU, and perhaps your computer already has one. Check the CUDA-enabled GPUs link on the CUDA Zone homepage (CUDA-capable GPUs include shared on-chip memory and thread management).
If you want to purchase a new graphics processor card, I suggest reading the upcoming articles in this series first, as I will explore how different hardware features (such as memory bandwidth, number of registers, atomic operations, and so on) affect application performance. That will help you select the right hardware for your application. In addition, the CUDA Zone forums provide a wealth of information on all aspects of CUDA, including what hardware to buy.

After installation, the CUDA Toolkit provides a reasonable set of C-language development tools, including:

  • The nvcc C compiler;
  • CUDA FFT and BLAS libraries for the GPU;
  • A performance profiler;
  • An alpha version (as of March 2008) of the gdb debugger for the GPU;
  • The CUDA runtime driver (now also available in the standard NVIDIA GPU drivers);
  • The CUDA programming manual.


The nvcc compiler does most of the work of converting C code into an executable that runs on a GPU or the emulator. Thankfully, assembly-language programming is not required to achieve high performance. Later articles will describe working with CUDA from other high-level languages, including C++, Fortran, and Python. I assume you are familiar with C/C++; no prior experience with parallel programming or CUDA is required. This is consistent with the existing CUDA documentation.

Creating and running a CUDA C program follows the same workflow as other C programming environments. Explicit build and run instructions for Windows and Linux environments are in the CUDA documentation, but briefly, the workflow is:

  • Create or edit the CUDA program with your favorite editor. Note: CUDA C programs have the extension .cu.
  • Compile the program with nvcc to create an executable. (NVIDIA provides complete makefiles with the examples: for a CUDA device, just type make; for the emulator, type make emu=1.)
  • Run the executable. (Example commands are sketched below.)
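If you prefer to compile by hand rather than through the provided makefiles, the invocation might look like the following. This is a sketch based on the toolkits of this era; in particular, the -deviceemu flag selected the device-emulation build:

   nvcc moveArrays.cu -o moveArrays                  # build for a CUDA device
   nvcc -deviceemu moveArrays.cu -o moveArrays_emu   # build for the emulator
   ./moveArrays                                      # run the executable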



Listing One (moveArrays.cu, below) is a simple CUDA program. It does nothing more than call the CUDA API to move data onto and off of a CUDA device; nothing new has been added that might cause confusion while you learn how to use the tools to build and run a CUDA program. In the next article, I will discuss how to get the CUDA device to do some work.

 

// moveArrays.cu
//
// demonstrates CUDA interface to data allocation on device (GPU)
// and data movement between host (CPU) and device.

#include <stdio.h>
#include <stdlib.h>   // malloc()/free()
#include <assert.h>
#include <cuda.h>

int main(void)
{
   float *a_h, *b_h;     // pointers to host memory
   float *a_d, *b_d;     // pointers to device memory
   int N = 14;
   int i;
   // allocate arrays on host
   a_h = (float *)malloc(sizeof(float)*N);
   b_h = (float *)malloc(sizeof(float)*N);
   // allocate arrays on device
   cudaMalloc((void **) &a_d, sizeof(float)*N);
   cudaMalloc((void **) &b_d, sizeof(float)*N);
   // initialize host data
   for (i=0; i<N; i++) {
      a_h[i] = 10.f+i;
      b_h[i] = 0.f;
   }
   // send data from host to device: a_h to a_d
   cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
   // copy data within device: a_d to b_d
   cudaMemcpy(b_d, a_d, sizeof(float)*N, cudaMemcpyDeviceToDevice);
   // retrieve data from device: b_d to b_h
   cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
   // check result
   for (i=0; i<N; i++)
      assert(a_h[i] == b_h[i]);
   // cleanup
   free(a_h); free(b_h);
   cudaFree(a_d); cudaFree(b_d);
   return 0;
}

 

Try out these development tools. A few suggestions for beginners: you can use printf statements to see what is happening on the GPU when running under the emulator (build the executable with make emu=1), and feel free to experiment with the alpha version of the debugger.
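To make the printf suggestion concrete, here is a hypothetical sketch (not from the original listing; kernel syntax such as __global__ and <<<...>>> is introduced later in this series). With the toolkits of this era, printf inside a kernel only worked in emulator builds, where each GPU thread runs as a host thread:

// whoAmI.cu -- hypothetical emulator-only example
#include <stdio.h>
#include <cuda.h>

// Under device emulation, kernels execute as host threads, so ordinary
// C library calls such as printf() work inside them.
__global__ void whoAmI(void)
{
   printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
   whoAmI<<<2, 4>>>();        // launch 2 blocks of 4 threads each
   cudaThreadSynchronize();   // wait for the kernel to finish
   return 0;
}

Build it with make emu=1 (or nvcc -deviceemu) and the output from all eight threads appears on your console.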

 
