CUDA: Supercomputing for the Masses, Part 1
CUDA lets you develop software that runs on the GPU using familiar programming concepts.
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked on massively parallel computing at several national laboratories and was a cofounder of several startups. He can be reached at [email protected].
Are you interested in gaining orders-of-magnitude performance increases over standard multi-core processors, while programming in a high-level language such as C? And would you like that capability to scale across many devices?
Many people (myself included) have achieved this kind of performance and scalability by using NVIDIA's CUDA (Compute Unified Device Architecture) to program inexpensive multi-threaded GPUs. I specifically stress "programming" because CUDA is an architecture designed to serve your work, not to "force" your work into a limited set of performance libraries. With CUDA you can bring your full abilities to bear on designing software that performs well on multi-threaded hardware, and have fun doing it, because getting the computation right is genuinely interesting, and the software development environment is reasonable and intuitive.
This article is the first in a series. It introduces CUDA's capabilities (through working code) and the thought process that helps you map applications onto multi-threaded hardware (such as GPUs) to get large performance increases. Of course, not all problems map efficiently onto multi-threaded hardware, so I will discuss which do and which do not, and give you a common-sense feel for which mappings will run well.
"Cuda programming" is not the same as "gpgpu programming" (although Cuda runs on the GPU ). Previously, writing software for GPUs meant programming in the GPU language. One of my friends once described this process as pulling data from your elbow. Cuda allows you to use familiar programming concepts to develop software that can run on the GPU. By directly compiling software to hardware (for example, GPU assembly language), you can avoid the performance overhead of the graphic layer API, which can provide better performance.
You have a choice of CUDA devices. Figures 1 and 2 show the CUDA N-body simulation program running on a laptop GPU and a discrete desktop GPU, respectively.
Figure 1: An N-body astrophysics simulation running on a laptop with a Quadro FX 570M.
Figure 2: An N-body astrophysics simulation running on a desktop machine with a GeForce 8800 GTS.
Can CUDA really increase application performance by one to two orders of magnitude, or is that claim exaggeration rather than reality?
CUDA is a fairly new technology, yet books and websites already offer many examples that showcase how dramatically it can improve performance on current commodity GPU hardware. Tables 1 and 2 summarize results from the NVIDIA and Beckman Institute websites. The heart of CUDA is the ability to keep thousands of threads busy. The current generation of GPUs can efficiently support very large numbers of threads, which is how they deliver one to two orders of magnitude of application speedup. These graphics processors span a wide range of prices, putting them within reach of almost everyone. Newer boards will expand CUDA's capabilities with more hardware features such as greater memory bandwidth, asynchronous data transfer, atomic operations, and double-precision floating-point arithmetic. As the technology advances, the CUDA software environment will keep expanding, and the distinction between GPUs and multi-core processors will eventually blur. As developers, we can anticipate that applications with thousands of active threads will become commonplace and that CUDA will run on many platforms, including general-purpose processors.
| Application Example | URL | Application Speedup |
| --- | --- | --- |
| Seismic database | http://www.headwave.com | 66x to 100x |
| Mobile phone antenna simulation | http://www.acceleware.com | 45x |
| Molecular dynamics | http://www.ks.uiuc.edu/Research/vmd | 21x to 100x |
| Neuron simulation | http://www.evolvedmachines.com | 100x |
| MRI processing | http://bic-test.beckman.uiuc.edu | 245x to 415x |
| Atmospheric cloud simulation | http://www.cs.clemson.edu/~jesteel/clouds.html | 50x |

Table 1: NVIDIA summary, www.nvidia.com/object/io_43499.html
GPU performance results, March 2008. GeForce 8800 GTX w/CUDA 1.1, driver 169.09.

| Computation | Algorithm | Speedup vs. Intel QX6700 CPU |
| --- | --- | --- |
| Fluorescence microphotolysis | Iterative matrix/stencil | 12x |
| Pairlist calculation | Particle pair distance test | 10x to 11x |
| Pairlist update | Particle pair distance test | 5x to 15x |
| Molecular dynamics nonbonded force calculation | N-body cutoff force calculations | 10x to 20x |
| Cutoff electron density sum | Particle-grid w/cutoff | 15x to 23x |
| Cutoff potential summation | Particle-grid w/cutoff | 12x to 21x |
| Direct Coulomb summation | Particle-grid w/cutoff | 44x |

Table 2: Beckman Institute table from www.ks.uiuc.edu/research/vmd/publications/siam2008vmdcuda.pdf
In the 1980s, as a researcher at Los Alamos National Laboratory, I was lucky enough to work on Thinking Machines supercomputers with as many as 65,536 parallel processors. CUDA has proven itself to be a framework well suited to modern massively parallel (that is, highly threaded) environments, and its performance benefits are evident. One of my production codes, now written in CUDA and running on NVIDIA GPUs, shows both near-linear scaling and a nearly two order-of-magnitude speedup over a 2.6-GHz quad-core Opteron system.
CUDA-enabled GPUs operate as co-processors within the host computer. This means that each GPU is considered to have its own memory and processing elements, separate from the host. To get useful work done, data must be transferred between the memory space of the host and the CUDA device. For this reason, performance results must include the I/O time to be meaningful. Colleagues like to call these "honest numbers," because they more accurately reflect the performance an application will deliver in production.
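To make "honest numbers" concrete, here is a minimal sketch of timing a round-trip host-to-device transfer with CUDA events, so the measurement includes I/O. This is my own illustration rather than code from the article; the file name and buffer size are arbitrary choices.

```c
// timeTransfer.cu -- illustrative sketch (not from the article): include
// host<->device I/O time in a measurement using CUDA events.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(void)
{
    int N = 1 << 20;                       // one million floats (arbitrary)
    size_t bytes = N * sizeof(float);
    float *a_h = (float *)malloc(bytes);   // host buffer
    float *a_d;
    cudaMalloc((void **)&a_d, bytes);      // device buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);             // begin timed region
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(a_h, a_d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);              // end timed region
    cudaEventSynchronize(stop);            // wait for the device to finish

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Round-trip transfer of %d floats: %.3f ms\n", N, ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(a_d); free(a_h);
    return 0;
}
```

Any speedup you report for a GPU kernel should be computed over a timed region like this one, which brackets the data movement as well as the computation.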
I maintain that a one to two order-of-magnitude performance increase over existing technology is a disruptive change that can fundamentally alter some aspects of computing. For example, computational tasks that previously took a year can now finish in days; hours of computation suddenly become interactive because the new technology completes them in seconds; and real-time processing tasks that were once intractable become easy to handle. Finally, it creates tremendous opportunities for consultants and engineers with the right skills to write highly threaded (or massively parallel) software. What could this computing capability mean for your career, your application, or your real-time processing needs?
Getting started costs nothing: just download CUDA from the CUDA Zone homepage (look for "Get CUDA"), then follow the installation instructions for your particular operating system. You don't even need a graphics processor, because you can begin working on your laptop or workstation using the software emulator. Of course, much better performance comes from running on a CUDA-enabled GPU, and perhaps your computer already has one. Check the CUDA-enabled GPUs link on the CUDA Zone homepage (CUDA-enabled GPUs include shared on-chip memory and thread management).
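If you are unsure whether your machine already has a CUDA-capable GPU, a small program along the following lines can ask the runtime. This is a minimal sketch of my own (not code from the article), using the device-query calls in the CUDA runtime API:

```c
// listDevices.cu -- illustrative sketch: enumerate the CUDA devices the
// runtime reports. Prints zero devices on a machine without a CUDA GPU.
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);            // how many CUDA devices?
    printf("Found %d CUDA device(s)\n", count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %lu bytes global memory\n",
               dev, prop.name, prop.major, prop.minor,
               (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}
```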
If you are thinking of purchasing a new graphics card, I suggest reading the upcoming articles in this series first, because I will explore how different hardware characteristics (such as memory bandwidth, number of registers, atomic operations, and so on) affect application performance. That will help you select the right hardware for your application. In addition, the CUDA Zone forums provide a wealth of information on all aspects of CUDA, including what hardware to buy.
Once installed, the CUDA Toolkit provides a reasonable set of C-language development tools, including:
- The CUDA FFT (CUFFT) and BLAS (CUBLAS) libraries for the GPU (see the sketch after this list)
- A gdb debugger for the GPU (in alpha as of March 2008)
- The CUDA runtime driver (now available in the standard NVIDIA GPU drivers)
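As a quick taste of those libraries, here is a minimal sketch that scales a vector on the GPU with CUBLAS. It is my own illustration, assuming the legacy CUBLAS interface (cublasInit, cublasAlloc, cublasSscal, and friends declared in cublas.h) that ships with these toolkits; the file name and data are arbitrary. Link with -lcublas.

```c
// cublasScale.cu -- illustrative sketch using the legacy CUBLAS API to
// scale a vector on the GPU. Build with: nvcc cublasScale.cu -lcublas
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    int N = 8;
    float *x_h = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) x_h[i] = (float)i;

    cublasInit();                                      // start CUBLAS
    float *x_d;
    cublasAlloc(N, sizeof(float), (void **)&x_d);      // device vector
    cublasSetVector(N, sizeof(float), x_h, 1, x_d, 1); // host -> device
    cublasSscal(N, 2.0f, x_d, 1);                      // x_d *= 2 on the GPU
    cublasGetVector(N, sizeof(float), x_d, 1, x_h, 1); // device -> host

    for (int i = 0; i < N; i++) printf("%g ", x_h[i]); // expect 0 2 4 ...
    printf("\n");
    cublasFree(x_d);
    cublasShutdown();
    return 0;
}
```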
The nvcc compiler does most of the work of converting C code into an executable that runs on the GPU or the emulator. Fortunately, assembly-language programming is not required to achieve high performance. Later articles in this series will discuss working with CUDA from other high-level languages, including C++, Fortran, and Python. I assume you are familiar with C/C++; no parallel programming or CUDA experience is required. This is consistent with the existing CUDA documentation.
Creating and running a CUDA C program follows the same workflow as other C programming environments. Explicit build and run instructions for Windows and Linux are in the CUDA documentation. In short, the workflow is:
- Use your favorite editor to create or edit the CUDA program. Note: CUDA C programs use the .cu extension.
- Compile the program with nvcc to create the executable. (NVIDIA provides complete makefiles with the examples; for a CUDA device build you typically only need to type make, and for the emulator, make emu=1.)
- Run the executable.
Listing One (moveArrays.cu, below) is a simple CUDA program. It does nothing more than call the CUDA API to move data onto and off of the CUDA device. Nothing else is added, to avoid confusion while you learn how to use the tools to build and run a CUDA program. In the next article in this series, I will discuss how to get the CUDA device to do some actual work.
```c
// moveArrays.cu
//
// demonstrates CUDA interface to data allocation on device (GPU)
// and data movement between host (CPU) and device.

#include <stdio.h>
#include <assert.h>
#include <cuda.h>

int main(void)
{
    float *a_h, *b_h; // pointers to host memory
    float *a_d, *b_d; // pointers to device memory
    int N = 14;
    int i;
    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*N);
    b_h = (float *)malloc(sizeof(float)*N);
    // allocate arrays on device
    cudaMalloc((void **) &a_d, sizeof(float)*N);
    cudaMalloc((void **) &b_d, sizeof(float)*N);
    // initialize host data
    for (i=0; i<N; i++) {
        a_h[i] = 10.f+i;
        b_h[i] = 0.f;
    }
    // send data from host to device: a_h to a_d
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
    // copy data within device: a_d to b_d
    cudaMemcpy(b_d, a_d, sizeof(float)*N, cudaMemcpyDeviceToDevice);
    // retrieve data from device: b_d to b_h
    cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // check result
    for (i=0; i<N; i++)
        assert(a_h[i] == b_h[i]);
    // cleanup
    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);
    return 0;
}
```
Try out these development tools. A suggestion for beginners: when running under the emulator (build the executable with make emu=1), you can use printf statements inside the code to see what is happening on the GPU. Also feel free to experiment with the alpha version of the debugger.
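For instance, that printf technique might look like the sketch below. This is my own illustration (kernels are covered properly in the next article, so treat the kernel syntax as a preview), and the printf inside the kernel works only when the program is built for the emulator; on the actual device with this generation of tools it is not supported.

```c
// emuPrintf.cu -- illustrative sketch of printf debugging under the
// emulator. Build with "make emu=1" (or nvcc -deviceemu); the printf
// inside the __global__ kernel works only in emulation mode.
#include <stdio.h>
#include <cuda.h>

__global__ void inspect(float *a, int n)
{
    int i = threadIdx.x;
    if (i < n)
        printf("thread %d sees a[%d] = %g\n", i, i, a[i]); // emulator only
}

int main(void)
{
    int N = 4;
    float a_h[4] = {0.f, 1.f, 2.f, 3.f};
    float *a_d;
    cudaMalloc((void **)&a_d, sizeof(float)*N);
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
    inspect<<<1, N>>>(a_d, N);   // launch one block of N threads
    cudaThreadSynchronize();     // wait for the kernel to finish
    cudaFree(a_d);
    return 0;
}
```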