GPU Gems 2 — Programming Techniques for High-Performance Graphics and General-Purpose Computation: Stream Programming (Part 1)

Recent academic research papers, as well as other chapters in this book, demonstrate that these stream processors can accelerate a wide range of applications beyond the real-time rendering for which they were originally designed. Exploiting this computational power, however, requires a completely different programming model that is unfamiliar to many programmers. This chapter explores one of the most fundamental differences between CPU and GPU programming: the memory model. Unlike traditional CPU-based programs, GPU-based programs are restricted in when, where, and how they can access memory. This chapter gives an overview of the GPU memory model and explains how fundamental data structures such as multidimensional arrays, structures, lists, and sparse arrays are expressed in this data-parallel programming model.

33.1 Stream Programming

Modern graphics processors can accelerate far more than real-time computer graphics. Recent work in the emerging field of general-purpose computation on graphics processing units (GPGPU) has shown that GPUs can accelerate fluid dynamics, advanced image processing, photorealistic rendering, and even computational chemistry and biology (Buck et al. 2004, Harris et al. 2004). The key to using the GPU for purposes other than real-time rendering is to view it as a streaming, data-parallel computer (see Chapter 29 of this book, as well as Dally et al. 2004). The way computation is structured and memory is accessed in a GPU program is strongly influenced by this stream computing model, so we briefly describe the model before discussing GPU-based data structures.

Stream processors such as GPUs are programmed quite differently from serial processors such as CPUs. Most programmers are familiar with a programming model in which the program may write to any location in memory at any point. When programming a stream processor, in contrast, memory is accessed in a much more structured manner.
In the stream model, shown in Figure 33-1, a program is expressed as a series of operations on data streams. The elements of a stream (that is, an ordered array of data) are processed by the instructions of a kernel (a small program). A kernel operates on each element of a stream and writes its results to an output stream.

Figure 33-1. The stream programming model expresses programs as data-dependency graphs. Nodes are kernels and edges are data streams. Kernels process all data elements in parallel and write their results to output streams.

The constraints imposed by the stream programming model allow the GPU to run kernels in parallel and therefore process many data elements simultaneously. This data parallelism is made possible by ensuring that the computation on one stream element cannot affect the computation on another element of the same stream. Consequently, the only values that can be used in a kernel's computation are that kernel's inputs and reads from global memory. In addition, the GPU requires the outputs of kernels to be independent: kernels cannot perform random writes to global memory (in other words, they may write only to a single stream-element position of the output stream). The data parallelism afforded by this model is fundamental to the speedup offered by GPUs over serial processors.

The following two code samples show how a serial program is transformed into a data-parallel stream program. The first example shows a loop over an array (for example, the pixels of an image) on a serial processor. Note that the instructions in the loop body operate on only one data element at a time:

    for (i = 0; i < data.size(); i++)
        loopBody(data[i]);

The second example shows the same code written in pseudocode for a stream processor:

    inDataStream  = specifyInputData();
    kernel        = loopBody();
    outDataStream = apply(kernel, inDataStream);

The first line specifies the data stream; in our image example, the stream is all of the pixels in the image.
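As a concrete illustration of this serial-to-stream conversion, both versions can be sketched in runnable form on the CPU (the names `loop_body` and `apply_kernel` are illustrative stand-ins, not a real GPU API):

```python
def loop_body(x):
    """The kernel: operates on one stream element in isolation."""
    return 2.0 * x + 1.0

def apply_kernel(kernel, in_stream):
    """apply(kernel, inDataStream): an order-independent map.
    Because each output depends only on its own input element,
    a GPU is free to evaluate all elements in parallel."""
    return [kernel(x) for x in in_stream]

data = [1.0, 2.0, 3.0, 4.0]

# Serial version: one element per loop iteration.
serial_out = []
for i in range(len(data)):
    serial_out.append(loop_body(data[i]))

# Stream version: the same computation expressed as a map over the stream.
out_stream = apply_kernel(loop_body, data)
assert serial_out == out_stream
```

The essential property is that `apply_kernel` never lets one element's result feed into another element's computation; that independence is what a real stream processor exploits.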
The second line specifies the computational kernel, which is simply the loop body from the first example. Finally, the third line applies the kernel to all elements of the input stream and stores the results in an output stream. In the image example, this operation would process the entire image and produce a new, transformed image.

Current GPU fragment processors have an additional programming restriction beyond the stream model described above: they are single-instruction, multiple-data (SIMD) parallel processors. Traditionally this has meant that all stream elements (that is, fragments) must be processed by the same sequence of instructions. Recent GPUs (those supporting Pixel Shader 3.0 [Microsoft 2004a]) relax this strict SIMD model slightly by allowing variable-length loops and limited fragment-level branching. Because the hardware remains SIMD, however, branching must be spatially coherent across fragments in order to run efficiently (see this book's chapter on GPU flow-control idioms for more information). Current vertex processors (Vertex Shader 3.0 [Microsoft 2004b]) are multiple-instruction, multiple-data (MIMD) machines, and can therefore execute kernel branches more efficiently than fragment processors. Although less flexible, the SIMD architecture of the fragment processor is highly efficient and cost-effective.

Because nearly all GPGPU computation is currently performed on the comparatively powerful fragment processor, GPU-based data structures must fit into the fragment processor's stream and SIMD programming models. All of the data structures in this chapter are therefore expressed as streams, and the computations on those data structures are expressed as SIMD, data-parallel kernels.

33.2 The GPU Memory Model

Graphics processors have their own memory architecture, analogous to the main memory, caches, and registers of a serial microprocessor.
This memory architecture, however, is designed for accelerated graphics operations and suits the stream programming model rather than general-purpose, serial computation. Moreover, graphics APIs such as OpenGL and Direct3D further limit the use of this memory to graphics-specific primitives such as vertices, textures, and frame buffers. This section gives an overview of the memory model on current GPUs and of how stream-based computation fits into it.

33.2.1 Memory Architecture

Figure 33-2 shows the memory architectures of the CPU and the GPU. The GPU's memory system is a branch of the modern computer memory hierarchy. Like a CPU, the GPU has its own caches and registers to accelerate data access during computation. The GPU, however, also has its own main memory with its own address space, which means that programmers must explicitly copy data into GPU memory before beginning a program. This transfer has traditionally been a bottleneck for many applications, but the new PCI Express bus standard may make sharing memory between the CPU and GPU more practical in the near future.

Figure 33-2. The memory architectures of the CPU and GPU.

33.2.2 GPU Stream Types

Unlike CPU memory, GPU memory has a number of usage restrictions and is accessible only through the abstractions of a graphics programming interface. Each such abstraction can be thought of as a different stream type, and each stream type has its own set of access rules. GPU programmers can see three stream types: vertex streams, frame-buffer streams, and texture streams. A fourth stream type, the fragment stream, is produced and consumed entirely within the GPU. Figure 33-3 shows the pipeline of a modern GPU, with the three user-accessible streams and the points in the pipeline where each may be used.

Figure 33-3. Streams in a modern GPU. Programmers can directly access vertices, frame buffers, and textures. Fragment streams are created by the rasterizer and consumed by the fragment processor.
They are the input streams of fragment programs, but because they are created and consumed entirely within the GPU, they are not directly accessible to programmers.

1. Vertex Streams

Vertex streams are specified as vertex buffers via the graphics API. These streams hold vertex positions and a variety of per-vertex attributes. These attributes have traditionally been used for texture coordinates, colors, normals, and so on, but they can hold any input stream data for vertex programs. Vertex programs are not allowed to randomly index into their input vertices. Until recently, vertex streams could be updated only by transferring data from the CPU to the GPU; the GPU was not allowed to write to vertex streams. Recent API enhancements, however, have made it possible for the GPU to write to vertex streams. This is accomplished either by "copy-to-vertex-buffer" or by "render-to-vertex-buffer." In the former technique, rendering results are copied from the frame buffer to a vertex buffer; in the latter, the rendering results are written directly to a vertex buffer. The recent addition of GPU-writable vertex streams makes it possible, for the first time, for the GPU to loop results from the end of the pipeline back to the beginning.

2. Fragment Streams

Fragment streams are generated by the rasterizer and consumed by the fragment processor. They are the stream inputs to fragment programs, but they are not directly accessible to programmers because they are created and consumed entirely within the graphics processor. Fragment stream values include all of the interpolated outputs of the vertex processor: position, color, texture coordinates, and so on. As with the per-vertex stream attributes, the per-fragment values that have traditionally been used for texture coordinates may now be used for any stream value required by the fragment program. Fragment programs cannot randomly access fragment streams.
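The render-to-vertex-buffer feedback loop described above can be sketched on the CPU as a ping-pong between passes, in which the output stream of one pass is rebound as the vertex stream of the next (a hypothetical particle-advection kernel; the names are illustrative, not graphics-API calls):

```python
def vertex_kernel(position):
    """Hypothetical per-vertex kernel: advect a 2-D particle position."""
    x, y = position
    return (x + 0.25, y)

def render_pass(vertex_stream):
    # The kernel is applied to the whole stream; the resulting buffer is
    # then rebound as the input vertex buffer of the next pass, which is
    # what render-to-vertex-buffer makes possible without a CPU round trip.
    return [vertex_kernel(v) for v in vertex_stream]

vertices = [(0.0, 0.0), (1.0, 1.0)]
for _ in range(3):          # three simulated passes through the pipeline
    vertices = render_pass(vertices)

print(vertices)  # [(0.75, 0.0), (1.75, 1.0)]
```

Before writable vertex streams existed, each `render_pass` result would have had to be read back to the CPU and re-uploaded, paying the bus-transfer cost every iteration.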
Allowing random access to fragment streams would create dependencies between fragment stream elements, breaking the data-parallel guarantee of the programming model. If an algorithm requires random access to the fragment stream, the stream must first be saved to memory and converted into a texture stream.

3. Frame-Buffer Streams

Frame-buffer streams are written by the fragment processor. They have traditionally been used to hold pixels for display on the screen. Streaming GPU computation, however, uses frame buffers to hold the results of intermediate computation stages. In addition, modern GPUs can write to multiple frame-buffer surfaces (that is, multiple RGBA buffers) simultaneously. Current GPUs can write up to 16 floating-point scalar values per render pass (this value is expected to increase in future hardware). Frame-buffer streams cannot be randomly accessed by fragment or vertex programs, but they can be directly read from and written to by the CPU via the graphics API. Lastly, recent API enhancements have begun to blur the distinction between frame buffers, vertex buffers, and textures by allowing a render pass to write directly to any of these stream types.

4. Texture Streams

Textures are the only GPU memory that is randomly accessible by fragment programs and, on Vertex Shader 3.0 GPUs, by vertex programs. If programmers need to randomly index into a vertex, fragment, or frame-buffer stream, they must first convert it into a texture. Textures can be read from and written to by either the CPU or the GPU. The GPU writes to textures either by rendering directly to them instead of to a frame buffer, or by copying data from the frame buffer to texture memory. Textures are declared as 1-D, 2-D, or 3-D streams and addressed with a 1-D, 2-D, or 3-D address, respectively. A texture can also be declared as a cube map, which can be treated as an array of six 2-D textures.

33.2.3 GPU Kernel Memory Access

Vertex and fragment programs (that is, kernels) are the workhorses of modern GPUs.
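The 1-D, 2-D, and 3-D texture addressing just described can be sketched with explicit address-translation rules over a flat array (an illustrative model, not a graphics API; real texture units also provide filtering and normalized coordinates):

```python
# Each "texture" is a flat list of texels; the declared dimensionality
# determines how a 1-, 2-, or 3-component address maps to a texel.

def tex1d(tex, x):
    return tex[x]

def tex2d(tex, width, x, y):
    # Row-major layout: texel (x, y) lives at flat index y*width + x.
    return tex[y * width + x]

def tex3d(tex, width, height, x, y, z):
    # Slice-major layout: texel (x, y, z) at ((z*height + y)*width + x).
    return tex[(z * height + y) * width + x]

texels = list(range(24))   # 24 texels, viewed three different ways below

assert tex1d(texels, 5) == 5                # declared 1-D: 24 texels
assert tex2d(texels, 6, 5, 3) == 23         # declared 2-D: 6x4
assert tex3d(texels, 3, 4, 2, 3, 1) == 23   # declared 3-D: 3x4x2
```

These address translations are exactly what later sections rely on when packing larger data structures (multidimensional arrays, sparse arrays) into textures.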
Vertex programs operate on elements of the vertex stream and send their output to the rasterizer, while fragment programs operate on fragment streams and write their output to frame buffers. The capabilities of these programs are defined by the arithmetic operations they can perform and the memory they are permitted to access. The variety of arithmetic operations available in GPU kernels approaches that of CPUs, but there are numerous memory access restrictions. As described earlier, many of these restrictions exist to preserve the parallelism the GPU needs to maintain its speed advantage; others are artifacts of the evolution of GPU architecture and will almost certainly be relaxed in future generations.

The following is a list of memory access rules for vertex and fragment kernels on GPUs that support Pixel Shader 3.0 and Vertex Shader 3.0 functionality (Microsoft 2004a, b):

● No CPU main-memory access; no disk access.
● No GPU stack or heap.
● Random reads from global texture memory.
● Reads from constant registers.
  - Vertex programs may use relative indexing of constant registers.
● Reads from and writes to temporary registers.
  - Registers are local to the stream element being processed.
  - No relative indexing of registers.
● Streaming reads from stream input registers.
  - Vertex kernels read from vertex streams.
  - Fragment kernels read from fragment streams (the rasterizer's output).
● Streaming writes (at the end of the kernel only).
  - The write location is fixed by the element's position in the stream; a kernel cannot write to a computed address (that is, no scatter).
  - Vertex kernels write to vertex output streams, with a maximum of 12 four-component floating-point values.
  - Fragment kernels write to frame-buffer streams, with a maximum of 4 four-component floating-point values.

One addressing mode not covered by the rules and stream types of Section 33.2.2 is the pointer stream (Purcell et al. 2002), which arises from the ability to use any input stream as the addresses for texture reads. Figure 33-4 shows that pointer streams are simply streams whose values are memory addresses. If the pointer stream is itself read from a texture, this capability is called dependent texturing.

Figure 33-4. Implementing pointer streams with textures.
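A pointer stream can be sketched as follows: the values of one stream serve as addresses into a texture, and each 1-D address is translated into 2-D texel coordinates before the dependent fetch (illustrative code, not a graphics API; `addr_1d_to_2d` is a hypothetical helper):

```python
WIDTH = 4  # texture width used for the 1-D -> 2-D address translation

def addr_1d_to_2d(i, width=WIDTH):
    """Convert a flat 1-D address into (x, y) texel coordinates."""
    return (i % width, i // width)

def tex_fetch(texture, xy):
    """A dependent texture read: the address came from another stream."""
    x, y = xy
    return texture[y][x]

# A 4x2 "texture" holding the data to be gathered.
data_texture = [
    [10, 11, 12, 13],
    [14, 15, 16, 17],
]

pointer_stream = [5, 0, 7, 2]   # a stream whose values are addresses
gathered = [tex_fetch(data_texture, addr_1d_to_2d(i)) for i in pointer_stream]
print(gathered)  # [15, 10, 17, 12]
```

Note that this is a gather (computed read addresses), which the rules above permit through textures; the corresponding scatter (computed write addresses) remains forbidden in the kernels themselves.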

 
