Preface
This article traces the development of GPU programming technology, to give a preliminary understanding of GPU programming and a way into the field.
The bottleneck of the von Neumann architecture
In the past, almost all processors were based on the von Neumann architecture. Put simply, in this architecture the processor continually fetches instructions from memory, decodes them, and executes them.
But this architecture now has a bottleneck: memory reads and writes are far slower than the CPU clock rate. Systems with this characteristic are known as memory-bound systems, and most computer systems today are of this type.
The traditional solution to this problem is caching. Giving the CPU a multilevel cache greatly reduces the pressure on the memory system:
However, as cache capacity grows, the benefit gained from a larger cache diminishes rapidly, which means a new approach is needed.
A few developments that are instructive for GPU programming technology
1. In the late 1970s, the Cray series of supercomputers was successfully developed (the Cray-1 cost $8 million at the time).
Such computers use a shared-memory structure in which several memory banks can be connected to multiple processors. This design evolved into today's symmetric multiprocessor (SMP) systems.
The Cray-2 is a vector machine: a single operation handles multiple operands.
The core of today's GPU devices is also a vector processor.
2. In the early 1980s, a company designed and developed a computer system called the Connection Machine.
The system has 16 CPU cores and uses standard single instruction, multiple data (SIMD) parallel processing. This design allows the Connection Machine to eliminate redundant memory accesses and cut the memory read-write cycle to 1/16 of the original.
3. The invention of the Cell processor
This processor is interesting. Its architecture is roughly as follows:
In this structure, a PowerPC (PPC) processor acts as a supervisory processor and is connected to multiple SPE stream processors, forming a pipeline.
In graphics processing, for example, one SPE is responsible for fetching the data, another for transforming it, and another for writing the results back to memory. Together they form a complete pipeline, greatly improving processing speed.
Incidentally, the supercomputer ranked third in the world in 2010 was based on this design concept. It covered an area of 560 square meters and cost $125 million.
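To make the SIMD idea from item 2 concrete, here is a minimal plain-Python model, not real vector hardware or GPU code. The function name `simd_add` and the `lanes` parameter are hypothetical, introduced only for illustration. The sketch assumes each "instruction" processes up to `lanes` operand pairs at once, and it counts how many instructions are issued.

```python
def simd_add(a, b, lanes=16):
    """Add two equal-length lists, `lanes` elements per modeled instruction.

    Returns (result, issues), where `issues` is the number of modeled
    instructions that were needed. This is a hypothetical teaching model:
    the lanes are processed one after another here, not in parallel.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")

    result = []
    issues = 0
    for start in range(0, len(a), lanes):
        # One modeled instruction: fetch once, operate on a whole group.
        group_a = a[start:start + lanes]
        group_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(group_a, group_b))
        issues += 1
    return result, issues


# Scalar execution is the special case lanes=1.
data = list(range(64))
ones = [1] * 64
_, scalar_issues = simd_add(data, ones, lanes=1)
_, simd_issues = simd_add(data, ones, lanes=16)
```

For 64 elements, the scalar case issues 64 modeled instructions and the 16-lane case issues 4, which is the 1/16 ratio described above. The ratio only holds exactly when the element count is a multiple of the lane count.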
Multi-point computing model
Cluster computing achieves high-performance computing by connecting several computers of ordinary performance into a computing network. This is a typical multi-point computing model.
A GPU is, in essence, also a multi-point computing model. Compared with today's Hadoop/Spark clusters, the "point" changes from a single computer to a single SM (streaming multiprocessor), and the interconnect changes from a network to memory. In any multi-point computing model, communication between points is always an important issue to consider.
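The multi-point model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_into_chunks` and `multipoint_sum` are hypothetical, worker threads stand in for the "points" (computers in a cluster, or SMs on a GPU), and the final combining step stands in for communication between points.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_points):
    """Split `data` into `num_points` contiguous, nearly equal chunks."""
    if num_points < 1:
        raise ValueError("num_points must be at least 1")
    size, extra = divmod(len(data), num_points)
    chunks = []
    start = 0
    for i in range(num_points):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def multipoint_sum(data, num_points=4):
    """Sum `data` by giving each point one chunk, then combining the results.

    Returns (total, partial_sums).
    """
    chunks = split_into_chunks(data, num_points)
    with ThreadPoolExecutor(max_workers=num_points) as pool:
        # Each point works only on its own chunk; map() keeps chunk order.
        partial_sums = list(pool.map(sum, chunks))
    # The communication step: partial results are gathered and combined.
    return sum(partial_sums), partial_sums


total, partials = multipoint_sum(list(range(10)), num_points=4)
```

Each point computes independently, and the only shared step is the final combination. In a cluster that step crosses the network; on a GPU it goes through memory, which is why the cost of communication shapes the design in both cases.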
GPU Solutions
As the CPU "power wall" problem emerged, GPU solutions began to formally take the stage.
The GPU is particularly well suited to parallel floating-point computation. The comparison below shows the difference between GPU and CPU computing power in this case:
This does not mean that the GPU is better than the CPU or that the CPU should be eliminated. The test was performed on a fully parallel computation.
For more flexible and complex serial programs, the GPU is much less efficient than the CPU (it lacks advanced mechanisms such as branch prediction).
In addition, GPU applications have long since ceased to be limited to image processing. In fact, the Tesla series, NVIDIA's current high-end line of CUDA cards, is dedicated to scientific computing, and these cards do not even have a VGA interface.
Several recent graphics cards and their configurations (NVIDIA cards only)
Note:
1. The specific meaning of each parameter will be analyzed in detail in future articles.
2. The specific parameters of your current graphics card can be obtained with a debugging tool (method omitted).
Mainstream GPU programming interfaces
1. CUDA
An interface designed by NVIDIA for GPU programming on NVIDIA cards. Its documentation is complete, and it is available for almost all NVIDIA cards.
The GPU programming techniques described in this column are based on this interface.
2. OpenCL
An open-standard GPU programming interface with the widest range of applicability, covering almost all graphics cards.
Compared with CUDA, however, it is harder to master. It is recommended to learn CUDA first. On that basis, learning OpenCL becomes simple.
3. DirectCompute
A GPU programming interface developed by Microsoft. It is very powerful and the simplest to learn, but it only works on Windows, so it is unavailable on the many high-end servers that run UNIX systems.
In summary, each of these interfaces has advantages and disadvantages, and the choice depends on the actual situation. But they are very similar in use, and once you have mastered one, the other two are easy to pick up.
Why learn GPU programming
1. It teaches not only how to use the GPU to solve problems but also gives a deeper understanding of parallel programming ideas, paving the way for a comprehensive grasp of the various parallel technologies.
2. Research and development in parallel computing is bound to become a hot topic in the IT industry and academia in the future.
Part 1: The development and current state of GPU programming technology