1 Heterogeneous computing, GPGPU and OpenCL
OpenCL is currently a common standard for heterogeneous computing across CPUs, GPUs and other chips, co-sponsored by many companies and organizations, and it is cross-platform. It is designed to exploit the GPU's massive parallel computing power and let it work together with the CPU to efficiently perform large-scale, highly parallel computations. GPU techniques for accelerating image rendering have long been mature, and we know the GPU's chip architecture excels at large-scale parallel computing (a PC-class GPU can contain hundreds or thousands of small cores), while the CPU excels at logic control. So it need not be limited to image rendering; people want to extend this computing power to more areas, which is what is known as GPGPU (General-Purpose computing on the GPU).
Simply put, the CPU is not built for this kind of computing: each core is essentially a single-instruction-single-data (SISD) pipeline, far better suited to logic control, so a loop like for (i = 0; i < n; i++) is something the CPU runs one iteration at a time, over and over. Your graphics card's GPU is different: it is a classic single-instruction-multiple-data (SIMD) architecture. It is poor at logic control, but it is naturally equipped for data parallelism, so a loop like that can sometimes be executed in a single pass, all iterations at once. This is why the huge numbers of vertices and fragments in the graphics world can be processed in parallel so quickly on the graphics card.
A modern GPU can contain billions of transistors and spends most of them on arithmetic units, whereas a CPU with a comparable transistor budget devotes much of its area to control logic and cache.
The figure above shows the structure of Nvidia's Fermi GF100; it contains a large number of parallel computing units.
So people want to move more computing code onto the GPU, letting it do work that has nothing to do with rendering, while the CPU handles only logic control. An architecture of one CPU (control unit) plus several GPUs (and sometimes several CPUs) as compute units is called heterogeneous computing, and the GPU in such a system is a GPGPU. The prospects and efficiency of heterogeneous programming are very encouraging: in many areas, especially highly parallel computation, the speedup is not a few times but orders of magnitude, even a hundredfold.
In fact, Nvidia rolled out the CUDA architecture for GPGPU computing on its graphics cards very early, and the impact was significant, raising efficiency by orders of magnitude in computational work (scientific computing, image rendering, games). I remember when Nvidia came to Zhejiang University to introduce CUDA, demonstrating real-time ray tracing, large numbers of rigid-body collisions and other examples; it was genuinely exciting. CUDA now seems to have developed to version 5.0 and is Nvidia's flagship general-purpose computing architecture, but its biggest limitation is that it only runs on Nvidia cards, leaving the vast number of AMD-card users out. OpenCL came later, co-sponsored by a large group of mainstream chip vendors, operating system and software developers, academic institutions, middleware providers and so on. The standard was initially proposed by Apple, after which the Khronos Group set up a working group to coordinate these companies in maintaining this common computing language. Khronos Group should sound familiar:
the famous OpenGL, the well-known hardware/software interface API specification in the field of image rendering, is also maintained by this organization. In fact they maintain many multimedia-domain specifications, all named in a similar Open*** style (which is why, when I first heard of OpenCL, I wondered what it had to do with OpenGL). OpenCL does not have a single official SDK; the Khronos Group only defines the standard (you can think of it as defining the header files), and the concrete implementation is done by the participating companies. So you will find that Nvidia implements OpenCL inside its CUDA SDK, AMD implements it in the AMD APP (Accelerated Parallel Processing) SDK, and Intel has its own implementation too; today's mainstream CPUs and GPUs all support the OpenCL architecture. Although different companies ship different SDKs, they follow the same OpenCL specification. In principle, if you use only the interfaces defined in the standard OpenCL headers, a program built with Nvidia's SDK can also run on an AMD card. But each SDK also carries vendor-specific extensions for its own chips, similar to the relationship between the standard OpenGL library and GL extensions.
The advent of OpenCL has allowed AMD to finally catch up with Nvidia in the GPGPU field. Nvidia is also a participant in OpenCL, but they seem more focused on their own weapon, CUDA, so their OpenCL implementation carries fewer extensions than AMD's. AMD, which makes both CPUs and GPUs (and their APUs), appears to be much more invested in OpenCL.
2 About writing code on the GPU
OpenCL also accelerates work by running code on the GPU, but it wraps the CPU, GPU and other chips in a unified abstraction, one layer higher and friendlier to developers. Speaking of which, I suddenly want to recap some of the history of writing code on the GPU.
In fact there were no graphics cards at first; the earliest graphics processing was done on the CPU. Later it was found that a separate chip could be placed on the motherboard to accelerate graphics, and at first these were just called graphics accelerators, until Nvidia made this chip bigger and stronger and was the first to give it an impressive new name: GPU, the graphics processing unit. From then on the GPU grew at a pace several times faster than the CPU.
In the beginning the GPU could not be programmed; it was fixed-function, pushing data through a fixed path.
Since the GPU, like the CPU, is a computing processor, a programmable GPU was the logical next step. But programming a GPU was not easy at that time: you could only write GPU programs in GPU assembly. GPU assembly! It sounds like an exotic, high-level skill, so the ability to use the GPU to draw fancy effects rested in the hands of a handful of graphics engineers. This was called the programmable pipeline.
Soon this monopoly was broken: high-level programming languages for the GPU were born. On the more advanced graphics cards of the time (roughly the third generation of programmable cards), C-like high-level languages let programmers write GPU code much more easily. Representative examples are Cg, created by Nvidia and Microsoft, Microsoft's HLSL, OpenGL's GLSL and so on. Collectively these are called shading languages, and the shaders written in them are now used everywhere in our games.
While using shading languages, some researchers noticed that many non-graphics problems (such as parallel computations in mathematics and physics) could be disguised as graphics problems and computed on the GPU through the shading language, yielding speeds many times faster than on the CPU. People then had a new idea: use the GPU to solve all kinds of parallel computing problems, not just graphics ones. This is what is called general-purpose computing on the GPU (GPGPU). Many people tried it, and for a while many papers were written on using the GPU to compute this or that... But all of this work had to be phrased in the form of graphics processing; there was no natural language for doing general computation on the GPU. At this point Nvidia brought the innovation: the CUDA architecture, introduced around 2007, let developers write general-purpose computing programs for their graphics cards in a high-level language. CUDA took off, and to this day Nvidia cards are printed with a big CUDA logo, but its limitation is that it is locked to Nvidia hardware.
OpenCL breaks down this hardware barrier, trying to build a common computing platform on all supported hardware, CPU and GPU alike. You could say OpenCL blurs the boundary between the two most important processors on the motherboard, and it makes running code on the GPU easier than ever.
3 OpenCL Architecture
3.1 Hardware layer
So much for general-purpose computing and what OpenCL is; here is a rough summary of the OpenCL architecture:
The following is an abstraction of the OpenCL hardware layer.
It consists of a host (the control processor, usually a CPU) and a group of compute devices (the compute processors, usually GPUs, CPUs or other supported chips). Each compute device is divided into compute units, and each compute unit in turn contains many processing elements, the smallest units that independently take part in single-data computation. How these map to hardware differs by implementation (on a GPU a processing element might be one stream processor, on a CPU perhaps one core), and the mapping is hidden from the developer. Processing elements within the same compute unit can easily share memory, and only elements within the same unit can synchronize with each other.
3.2 Memory Architecture
The host has its own memory; memory on the compute device is more complex. First there is global memory, which every work-item can use, and constant memory, read-only to the device, usually fast to access but scarce. Then each processing element has its own private memory, and the elements within a group share a local memory. On careful analysis, this is an efficient and elegant way of organizing memory: data can flow along the host → global → local → private channel (possibly crossing a lot of hardware).
3.3 Software-level composition
Each of these concepts has a corresponding data type in the SDK.
Setup Related:
Device: corresponds to one piece of hardware (the standard specifically states that a multi-core CPU counts as a single device).
Context: an environment context. A context contains several devices (CPUs or GPUs) and is the link between them; only devices within the same context can communicate and work together. There can be many contexts on your machine. You can create a context from one CPU, or from one CPU plus one GPU.
Command queue: a queue of commands submitted to a device; each device has its own command queue.
Memory-Related:
Buffers: easy to understand, a block of memory.
Images: after all, most of the application prospects of parallel computing are in graphics and imaging, so there are several built-in types representing images of various dimensions.
GPU Code Execution Related:
Program: the collection of all the code, which may contain kernels and other libraries. OpenCL is a dynamically compiled language: source is compiled into an intermediate form (which may be virtual machine code or assembly, depending on the implementation), and at run time this is linked into the application and loaded onto the processor.
Kernel: the kernel function, together with its argument set, that runs on the processing elements. If you think of the compute device as many people doing one job for you at the same time, the kernel is the job each of them does: everyone does the same job, but the parameters may differ. This is what single-instruction-multiple-data means.
Work item: the most basic unit of computation, representing one processing element of the hardware.
Synchronization Related:
Events: in such a distributed computing environment, synchronization between different units is a big problem, and events are the mechanism used to synchronize.
Their relationship is shown below
Article reposted from http://blog.csdn.net/leonwei/article/details/8880012 — thanks to the original author for the selfless sharing.