Scatter and gather in GPU general-purpose programming (GPGPU)



With the improvement of GPU programmability and the continued development of GPGPU technology, it is hoped that a GPU based on the stream-processor model can behave like a CPU: supporting branching while also allowing flexible read/write access to memory. Ian Buck [1] pointed out that the lack of flexible memory operations is the key factor preventing GPUs from completing complex computing tasks; accordingly, he added support for scatter/gather when designing Brook [2], although the implementation still relies on workarounds that come at a cost in performance.
The implementation of scatter/gather in GPUs is similar to that in early vector machines. Scatter allows data to be written to non-contiguous memory addresses; gather allows data to be read from non-contiguous memory addresses. If we regard memory (such as DRAM) as an array, scatter can be seen as writing data to an arbitrary location in the array via a subscript, i.e., a[i] = x, and gather as reading data from an arbitrary location via a subscript, i.e., x = a[i].
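The two operations just described can be sketched in a few lines of plain Python, with a list standing in for memory (a conceptual illustration, not GPU code):

```python
# Minimal sketch of scatter and gather: a list stands in for DRAM.

def scatter(memory, indices, values):
    """Write each value to an arbitrary (possibly non-contiguous)
    address: memory[indices[k]] = values[k]."""
    for i, v in zip(indices, values):
        memory[i] = v

def gather(memory, indices):
    """Read from arbitrary addresses: result[k] = memory[indices[k]]."""
    return [memory[i] for i in indices]

mem = [0] * 8
scatter(mem, [6, 1, 3], [10, 20, 30])  # mem: [0, 20, 0, 30, 0, 0, 10, 0]
print(gather(mem, [3, 6]))             # [30, 10]
```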
In the CUDA [5] architecture (Figure 1), each ALU can be regarded as a processing core. Through scatter/gather operations, multiple ALUs can share memory and read or write data at arbitrary addresses.

Figure 1: CUDA scatter/gather

What is the use of scatter/gather?

For example, suppose we want to sort the data in an array. The most direct idea is a bubble-style algorithm: traverse the array to find the largest element, swap it with the first element, then repeat the process on the remaining sub-array, and so on until only the last element of the remaining sub-array is left and the sort is finished. In this process, swapping two data items requires reading from and writing to arbitrary locations in the array, i.e., memory operations at arbitrary addresses.
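The pass just described (find the extreme of the remaining sub-array, swap it to the front) is a selection-style sort; a minimal sketch showing where the arbitrary reads and writes occur:

```python
def sort_with_swaps(a):
    """Sort in place: each step gathers (reads arbitrary positions to
    find the minimum's index) and scatters (swaps, writing to arbitrary
    positions)."""
    n = len(a)
    for i in range(n - 1):
        # gather: read arbitrary positions of the remaining sub-array
        m = min(range(i, n), key=lambda j: a[j])
        # scatter: write to two arbitrary positions by swapping
        a[i], a[m] = a[m], a[i]
    return a

print(sort_with_swaps([5, 2, 9, 1]))  # [1, 2, 5, 9]
```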

The preceding example is easy to implement on a CPU, because processors of the IA (Intel Architecture) family support scatter/gather-style memory access. On a GPU, however, the situation becomes more complicated, owing to the split between the vertex shader and the fragment shader, and to the stream-processor model's pursuit of parallelism.

Scatter/gather implementation in the GPU

Consider the fragment shader first. Because textures can be fetched, and any data in a texture can be obtained by adjusting texture coordinates [4], the fragment processor can in fact read data from any address in (video) memory; that is, it can implement the gather operation. Conversely, however, the output of a fragment shader can only go to its own fragment, arranged in a fixed order, so the fragment shader cannot truly implement the scatter operation.
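This asymmetry (arbitrary read, fixed write) can be simulated in a few lines; the names below are illustrative, not a real shading API:

```python
# Simulate one full-screen fragment pass: fragment i may read any texel
# (via its texture coordinate) but writes only to its own output slot.

def fragment_pass(texture, coords):
    """Fragment i reads texture[coords[i]] -- an arbitrary read (gather)
    -- and writes only output[i] -- a fixed write (no scatter)."""
    return [texture[c] for c in coords]

texture = [10, 11, 12, 13]
coords = [3, 3, 0, 2]                 # dependent lookup per fragment
print(fragment_pass(texture, coords))  # [13, 13, 10, 12]
```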

However, by using render-to-texture and multi-pass rendering, the data rendered in one pass can be stored as a texture and used as the input to the next pass, so that writing data to arbitrary locations is replaced by reading data from arbitrary locations. As long as the data indices (texture coordinates) are planned in advance, a pseudo-scatter effect can be achieved across multiple rendering passes. This is why the fragment shader usually feels more flexible than the vertex shader, but the flexibility comes at the cost of time-consuming multi-pass rendering and a large amount of storage space.
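The pseudo-scatter idea above can be sketched as a scatter-to-gather conversion: the destination indices are inverted ahead of time, so that a later pass only needs arbitrary reads, which the fragment shader can do. This is a conceptual illustration, not GPU code:

```python
# Convert a scatter into a gather by inverting the index map.

def invert_indices(dest, size):
    """Planning step: for each output slot, record which source element
    should land there. dest[k] is where source element k wants to go."""
    inverse = [None] * size
    for src, d in enumerate(dest):
        inverse[d] = src
    return inverse

def gather_pass(source, inverse, fill=0):
    """Rendering step: each output position reads its source element --
    an arbitrary read, i.e., a gather."""
    return [source[s] if s is not None else fill for s in inverse]

source = [10, 20, 30]
dest = [2, 0, 1]                  # scatter: source[k] -> output[dest[k]]
inv = invert_indices(dest, 3)     # [1, 2, 0]
print(gather_pass(source, inv))   # [20, 30, 10]
```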

Now consider the vertex shader. Ordinarily, when processing each vertex in the vertex stream, it can access only the data of the current vertex and cannot use data from multiple vertices at once. This means it can neither read nor write data at arbitrary memory addresses; that is, it supports neither scatter nor gather.

However, with the advent of vertex texture technology, the vertex shader can, like the fragment shader, obtain data from textures through texture sampling, and therefore gains the ability to read data from any location in memory.

The emergence of the geometry shader and the development of vertex buffer technology [7] make it possible to implement scatter during geometry processing. Both DirectX 10 and the upcoming OpenGL 3.0 [6] provide for data feedback (stream output) and multi-pass operation during geometry processing, as shown in Figure 2, which indicates the possibility of implementing scatter in the geometry-processing stage.

Figure 2: DirectX 10 Pipeline

Conclusion

In the GPGPU field, flexible memory operations are crucial. With the development of graphics processors, parallel scatter/gather operations can now be achieved at various stages of the processing pipeline. However, given the data-precision and memory-performance constraints of current graphics processors, scatter/gather operations on the GPU still have many limitations, which restricts the development of the GPGPU field, but also points out a direction for the future development of graphics processors.

References

1. Ian Buck, Pat Hanrahan, "Data Parallel Computation on Graphics Hardware"

2. Ian Buck, Tim Foley, et al., "Brook for GPUs: Stream Computing on Graphics Hardware"

3. Ian Buck, "Stream Computing on Graphics Hardware", Ph.D. thesis, 2004

4. Timothy John Purcell, "Ray Tracing on a Stream Processor", 2004, p. 19

5. NVIDIA, "CUDA Programming Guide", 2007

6. Evan Hart, "New OpenGL Features", GDC 2007

7. Christophe, "OpenGL Vertex Buffer Objects", 2006, http://www.ozone3d.net//tutorials//opengl_vbo_p1.php
