How to program the GPU for general-purpose computing tasks


As modern graphics processors (GPUs) have become more programmable and more powerful, application developers have long hoped that graphics hardware could take on compute-intensive tasks that previously only general-purpose CPUs could handle. Although general-purpose computing on the GPU is promising, traditional graphics APIs still abstract the GPU as a renderer of textures, triangles, and pixels. Finding a way to express an algorithm in terms of these primitives is not simple, even for the most experienced graphics developers.

Fortunately, GPU-based computing is conceptually easy to understand, and a variety of high-level languages and software tools are available to simplify GPU programming. However, developers must first understand how the GPU works when rendering an image, and then identify which of its components can be used for computing.

When rendering an image, the GPU first receives geometric data from the host system in the form of triangle vertices. This vertex data is processed by a programmable vertex processor, which can perform geometric transformations, lighting calculations, and other per-triangle computations. A fixed-function rasterizer then converts these triangles into the individual "fragments" to be displayed on the screen. Before it reaches the screen, each fragment passes through a programmable fragment processor, which computes its final color value.

Figure 1: A simple Brook code example for adding two vectors. Brook supports all C syntax, with the addition of stream data. Stream data is stored in GPU memory, and kernel functions also execute on the GPU.

Computing a fragment's color generally involves vector math operations and fetches of stored data from "textures". A texture is a bitmap that stores the color of a surface material. The final rendered scene can be shown on the output device or copied from GPU memory back to the host processor.

The programmable vertex processor and fragment processor share much of the same functionality and instruction set. However, most GPU programmers use only the fragment processor for general-purpose computing, because it generally offers better performance and can write its output directly to memory.

A simple example of computing with the fragment processor is adding two vectors. First, we issue a large triangle that covers as many fragments as the vectors have elements. The generated fragments are processed by the fragment processor, which executes the code in parallel in single instruction, multiple data (SIMD) fashion. The vector-addition code fetches the two elements to be added from memory, based on the fragment's position, adds them, and assigns the result to the output color. The output memory then holds the vector sum, which can be used in the next computation.
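The vector-addition steps above can be emulated on the CPU to make the data flow concrete. The following is only a sketch in plain Python: `add_fragment` and `draw_triangle` are hypothetical names, no real graphics API is called, and the loop stands in for work that a GPU would perform in parallel.

```python
# CPU-side sketch of the fragment-program approach to vector addition.
# All names are hypothetical; this does not use any real graphics API.

def add_fragment(position, tex_a, tex_b):
    """Fragment program: fetch one element per input at this position, add them."""
    return tex_a[position] + tex_b[position]

def draw_triangle(fragment_program, n_fragments, *textures):
    """Stand-in for rasterization: run the program once per generated fragment.

    The output address is fixed by the fragment's position, not chosen by
    the program. A GPU would run these invocations in parallel.
    """
    return [fragment_program(pos, *textures) for pos in range(n_fragments)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# One fragment per vector element; the result plays the role of output memory.
result = draw_triangle(add_fragment, len(a), a, b)
print(result)
```

Because `result` is an ordinary list, it can be passed as an input "texture" to a later `draw_triangle` call, mirroring how the output memory feeds the next computation.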

The programmable fragment processor's instruction set architecture (ISA) resembles that of a DSP or the Pentium SSE instruction set, and consists of four-way SIMD instructions and registers. The instructions include standard math operations, memory fetch instructions, and several special-purpose graphics instructions.
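To illustrate what "four-way SIMD" means here, the sketch below models a register as a tuple of four floats and applies one operation to all four components at once. The helper name `mad4` is hypothetical, and multiply-add is chosen only as a representative math instruction.

```python
# Sketch of a four-way SIMD operation on four-component registers.
# `mad4` is a hypothetical helper, not an actual GPU instruction binding.

def mad4(a, b, c):
    """Multiply-add applied to each of the four components: a * b + c."""
    assert len(a) == len(b) == len(c) == 4
    return tuple(x * y + z for x, y, z in zip(a, b, c))

r0 = (1.0, 2.0, 3.0, 4.0)
r1 = (2.0, 2.0, 2.0, 2.0)
r2 = (0.5, 0.5, 0.5, 0.5)

# A single "instruction" produces four results.
r3 = mad4(r0, r1, r2)
print(r3)
```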

Comparison between GPU and DSP

The GPU differs from a DSP architecture in several major respects. All of its computation uses floating-point arithmetic, and there are currently no integer or bitwise instructions. In addition, because the GPU is designed for image processing, its memory system is in effect a two-dimensional segmented address space: an address consists of a segment number (which image to read from) and a two-dimensional address (the x and y coordinates within that image).
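One consequence of this addressing scheme is that a one-dimensional array has to be laid out in a two-dimensional image. The sketch below shows one possible row-major mapping. The width, the helper names, and the layout are assumptions made for illustration, not a description of any particular GPU or tool.

```python
# Sketch: storing a 1-D array in a 2-D "image" and addressing it by
# (segment, x, y). Layout and names are illustrative assumptions.

def pack(values, width):
    """Lay a flat list out as rows of `width` elements (row-major)."""
    return [values[i:i + width] for i in range(0, len(values), width)]

def to_2d(index, width):
    """Convert a linear element index to (x, y) image coordinates."""
    return (index % width, index // width)

def fetch(images, segment, x, y):
    """Read one element: the segment selects the image, (x, y) the location."""
    return images[segment][y][x]

WIDTH = 4
images = [
    pack([float(i) for i in range(12)], WIDTH),        # segment 0
    pack([float(100 + i) for i in range(12)], WIDTH),  # segment 1
]

x, y = to_2d(6, WIDTH)
print((x, y), fetch(images, 0, x, y), fetch(images, 1, x, y))
```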

There are also no indirect write instructions. The output write address is determined by the rasterizer and cannot be changed by the program. This is a significant challenge for algorithms that naturally scatter their results across memory. Finally, no communication is allowed between the processing of different fragments. In effect, the fragment processor is a SIMD data-parallel execution unit that executes its code independently on every fragment.
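A common way to live with the lack of indirect writes is to recast a scatter as a gather: instead of each input choosing where to write, each output position computes where to read from. The sketch below shows this for a permutation. It assumes the inverse index table can be prepared ahead of time (here, on the host), and all names are hypothetical.

```python
# Sketch: rewriting a scatter (out[dest[i]] = src[i]) as a gather
# (out[j] = src[inv[j]]). Assumes `dest` is a permutation and that the
# inverse table is built on the host beforehand.

def scatter_reference(src, dest):
    """What the algorithm wants to do; not expressible in a fragment program."""
    out = [None] * len(src)
    for i, d in enumerate(dest):
        out[d] = src[i]
    return out

def invert(dest):
    """Host-side preparation: inv[dest[i]] = i."""
    inv = [0] * len(dest)
    for i, d in enumerate(dest):
        inv[d] = i
    return inv

def gather_fragment(position, src, inv):
    """Fragment program: the write address is fixed, only the read is indirect."""
    return src[inv[position]]

src = [10.0, 20.0, 30.0, 40.0]
dest = [2, 0, 3, 1]

inv = invert(dest)
gathered = [gather_fragment(j, src, inv) for j in range(len(src))]
print(gathered)
```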

Despite these constraints, the GPU can still efficiently execute a wide range of computations, from linear algebra and signal processing to numerical simulation. Although the concepts are simple, GPU computing can still confuse new users because it requires specialized graphics knowledge. Some software tools can help here. Two high-level shading languages, Cg and HLSL, let users write C-like code that is then compiled to fragment assembly language. Compilers for these languages can be downloaded free of charge from the NVIDIA and Microsoft websites. Although these languages greatly simplify the writing of shader assembly code, you must still use a graphics API to set up and issue the computation.

Brook is a high-level language designed for GPU computing that requires no graphics knowledge. It can therefore be a good starting point for developers using the GPU for the first time. Brook is an extension of the C language that incorporates simple data-parallel programming constructs that map directly onto the GPU.

Data that the GPU stores and operates on is represented as "streams", which are similar to arrays in standard C. A kernel is a function that operates on streams. Calling a kernel on a set of input streams performs an implicit loop over the stream elements, that is, the kernel body is invoked once for each element. Brook also provides a reduction mechanism, for example to compute the sum, maximum, or product of all the elements in a stream.
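The stream, kernel, and reduction ideas described above can be mimicked in plain Python to show the programming model. This is not Brook syntax and does not use the Brook runtime: `run_kernel`, `run_reduce`, and the `saxpy` kernel are hypothetical names chosen for this sketch, and lists stand in for streams.

```python
# Sketch of the stream programming model: a kernel is applied to every
# stream element (implicit loop), and a reduction combines all elements.
# Names are hypothetical; this is not Brook code.
from functools import reduce

def run_kernel(kernel, *streams):
    """Invoke the kernel body once per element across the input streams."""
    return [kernel(*elements) for elements in zip(*streams)]

def run_reduce(op, stream, initial):
    """Combine all elements of a stream into a single value."""
    return reduce(op, stream, initial)

def saxpy(x, y, alpha=2.0):
    """Example kernel body: alpha * x + y for one pair of elements."""
    return alpha * x + y

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]

out = run_kernel(saxpy, xs, ys)
total = run_reduce(lambda a, b: a + b, out, 0.0)
largest = run_reduce(max, out, float("-inf"))
print(out, total, largest)
```

The caller never writes the loop over elements; that is what lets such a model map the same kernel onto many fragments executed in parallel.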

The Brook compiler is a source-to-source compiler that maps the user's kernel code to fragment assembly language and generates C++ stub code that links into the larger application. This lets users port only the performance-critical parts of an application to Brook. Brook also completely hides the details of the graphics API and virtualizes aspects of the GPU that many users are unfamiliar with, such as the two-dimensional memory system.

Applications written in Brook include linear algebra subroutines, fast Fourier transforms, ray tracing, and image processing. The Brook compiler and runtime environment can be obtained free of charge from the http: // Brook website.

The sourceforge.net website also provides resources for many of these applications. On ATI's X800 XT and NVIDIA's GeForce 6800 Ultra GPUs, many of these applications run up to seven times faster than equivalent cache-optimized, SSE assembly-optimized Pentium 4 implementations.

Until now, users interested in GPU computing have had to struggle to map their algorithms onto graphics primitives. The advent of high-level programming languages such as Brook makes it easy for programmers new to the GPU to take advantage of its performance. Easier access to GPU computing will help the GPU continue to evolve, not just as a rendering engine but as a primary computing engine of the personal computer.

Author: Ian Buck, Researcher, Graphics Laboratory, Stanford University, ianbuck@graphics.stanford.edu
