Learning OpenCL Development from Scratch (I): Architecture

Source: Internet
Author: User

This article is a reprint. Please credit the original: http://blog.csdn.net/leonwei/article/details/8880012

 

This is the first article in my "OpenCL development from scratch" series.

 

1. Heterogeneous Computing, GPGPU, and OpenCL

OpenCL is a cross-platform standard for heterogeneous computing across CPUs, GPUs, and other chips, launched by many companies and organizations. It aims to fully exploit the GPU's massive parallel computing power, in cooperation with the CPU, so that hardware can be used more efficiently to complete large-scale (especially highly concurrent) computation. GPU technology for accelerating image rendering has long been mature, but the GPU's chip structure is good at large-scale parallel computation (a PC-class GPU can contain thousands of cores), while the CPU is good at logic control; so the GPU need not be limited to image rendering. People wanted to extend this computing power to more fields, and this is what is known as GPGPU (general-purpose computing on the GPU).

Simply put, the CPU is not built for bulk data computation. A CPU core is essentially a single-instruction, single-data (SISD) machine: it excels at logic control, while data processing goes through what is basically a single pipeline, so code like for (i = 0; ...; i++) must iterate repeatedly. The GPU is different: it is a typical single-instruction, multiple-data (SIMD) architecture. It is poor at logic control, but it is a natural vector-computing machine; a for (i = 0; ...; i++) loop can sometimes effectively run in a single pass, which is why the many vertices and fragments of the graphics world can be rendered in parallel so quickly on the graphics card.

 

A GPU's transistor count can reach several billion, while a contemporary CPU typically has only several hundred million.

The NVIDIA Fermi (GF100) architecture, for example, contains a large number of parallel computing units.

So people want to move more computational code onto the GPU, so that it no longer handles only rendering, while the CPU remains responsible for logic control. This architecture of one CPU (as the control unit) plus several GPUs (sometimes plus several more CPUs) is what is called heterogeneous programming, and the GPU in it is the GPGPU. The prospects and efficiency of heterogeneous programming are exciting: in many fields, especially highly concurrent computation, the efficiency improves by orders of magnitude, not several times but hundreds of times.

In fact, NVIDIA launched CUDA, a GPU-computing architecture for its own graphics cards, quite early. At the time it made a big impact: for a lot of computing work (scientific computing, image rendering, games) it improved efficiency by several orders of magnitude. I remember NVIDIA bringing CUDA to Zhejiang University and demonstrating examples of real-time ray tracing and large-scale rigid-body collision. CUDA has by now grown to version 5.0 and is NVIDIA's general computing architecture. However, CUDA's biggest limitation is that it runs only on NVIDIA's own graphics cards, which leaves the many users of A cards (AMD) out.

OpenCL came into being later. It was jointly initiated by major chip vendors, operating system and software developers, academic institutions, middleware providers, and other companies; it was first proposed by Apple, after which the Khronos Group established a working group to coordinate these companies in jointly maintaining this general computing language. The Khronos Group should sound familiar: OpenGL, the well-known software-hardware interface API specification in the rendering field, is also maintained by this organization. In fact they maintain many specifications in the multimedia field, perhaps all named in the Open*** style (so when I first heard of OpenCL, I wondered about its relationship to OpenGL). OpenCL does not have a single official SDK; the Khronos Group only specifies the standard (you can think of it as defining the header files), while the concrete implementations are provided by the participating companies. Thus NVIDIA's OpenCL implementation ships in its CUDA SDK, AMD puts its implementation in the AMD APP (Accelerated Parallel Processing) SDK, and Intel has an implementation as well, so today's mainstream CPUs and GPUs all support the OpenCL architecture. Although different companies ship different SDKs, they all comply with the same OpenCL specification; in principle, if you use only the interfaces defined in the standard OpenCL headers, a program built with NVIDIA's SDK can also run on an AMD card. However, each SDK also has vendor-specific extensions for its own chips, much like the relationship between the standard OpenGL library and GL extensions.

The emergence of OpenCL lets AMD catch up with NVIDIA in the GPGPU field. Although NVIDIA is a member of the OpenCL effort, it seems to pay more attention to its own dedicated weapon, CUDA, so NVIDIA implements fewer OpenCL extensions than AMD does; AMD, which makes both CPUs and GPUs as well as their combined APUs, seems more energetic about OpenCL.

2. How did writing code on the GPU come about?

OpenCL achieves acceleration by running code on the GPU, but it encapsulates the CPU, GPU, and other chips in a unified way, at a higher and more developer-friendly level. Speaking of this, I want to recount some of the history of writing code on the GPU.

In fact, there was no graphics card at the beginning; the earliest graphics processing was done on the CPU. Later people found they could put a separate chip on the motherboard to accelerate drawing, which came to be called a graphics processing unit. It was NVIDIA that made this chip bigger and stronger and popularized the name GPU, the graphics processor; later, GPU performance grew even faster than CPU performance.

At the beginning the GPU could not be programmed; this is called the fixed pipeline: data passes through a fixed sequence of processing stages.

As the GPU became a computing processor in its own right, people naturally wanted to program it. At first you could only write GPU programs in GPU assembly. GPU assembly? It sounds very advanced, so the skill of using the GPU to draw special effects belonged to only a few graphics engineers. This stage is called the programmable pipeline.

Soon this barrier was broken: high-level programming languages for the GPU were born. On the more advanced graphics cards of the time (as I recall, starting from the third-generation cards), C-like high-level languages let programmers write GPU code much more easily. Representative languages include Cg (created by NVIDIA and Microsoft), Microsoft's HLSL, and OpenGL's GLSL; they are collectively known as shading languages and are widely used in today's games.

While using shading languages, some researchers noticed that many non-graphics problems (such as parallel computation in mathematics and physics) could be disguised as graphics problems and computed on the GPU through a shading language, with results N times faster than on the CPU. This gave people new ideas, and many tried to use this GPU capability for all kinds of parallel computation, not only in graphics; this is called general-purpose processing on the GPU. For a while, many papers were about how to compute this or that with the GPU. However, such work had to be dressed up as image processing; there was no natural language for general computation on the GPU. At this point NVIDIA brought an innovation: the CUDA architecture it launched around that time let developers write general-purpose computing programs in a high-level language on its graphics cards. CUDA flourished, and to this day N cards carry a big CUDA logo. But its limitation is the hardware: it is tied to NVIDIA's own chips.

OpenCL breaks through that hardware barrier and tries to build a universal computing platform across all supported hardware: whether CPU or GPU, all can take part in computation on an equal footing. The significance of OpenCL, one might say, is that it blurs the boundary between the two main processors on the motherboard and makes running code on the GPU easier than ever.

 

3. OpenCL Architecture

3.1 Hardware layer

The above covered general-purpose computing and what OpenCL is; the OpenCL architecture can be summarized as follows.

The following is the abstraction of the opencl hardware layer:

 

OpenCL abstracts the hardware as one host (the control unit, usually a CPU) plus a number of compute devices (the computing units, usually GPUs and other supported chips, including CPUs). Each compute device is divided into many processing elements, the smallest units that independently take part in single-data computation. How a processing element maps to hardware varies by implementation: on a GPU it may be one stream processor, on a CPU one core (I am guessing, since this mapping is hidden from the developer). Several processing elements are grouped into a compute unit; elements within one compute unit can conveniently share memory, and synchronization and similar operations are only possible among elements within the same compute unit.

3.2 Memory architecture

The host has its own memory, while memory on a compute device is more elaborate. There is a global memory visible to all elements, and a constant memory that everyone can read, usually the fastest but the smallest. Each element has its own private memory, and the elements in one group share a local memory. On reflection this is an efficient and elegant memory organization: data flows along the path host -> global -> local -> private (possibly crossing a lot of hardware).
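These memory regions appear directly as address-space qualifiers in OpenCL C device code. A minimal sketch (the kernel and its arguments are illustrative; this is device code, compiled by the OpenCL runtime rather than a host compiler, so it cannot be run on its own):

```c
// OpenCL C device code: each address space has its own qualifier.
__kernel void scale(__constant float *factor,  // constant: fast, read-only, small
                    __global const float *in,  // global: visible to all work-items
                    __global float *out,
                    __local float *tile)       // local: shared within one work-group
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = in[gid];                 // private: per-work-item variables
    tile[lid] = x * factor[0];         // stage the value through local memory
    barrier(CLK_LOCAL_MEM_FENCE);      // synchronize work-items within the group
    out[gid] = tile[lid];
}
```

Data moving from global through local into private variables follows exactly the host -> global -> local -> private channel described above.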

3.3 Software layer

The SDK provides the following main object types.

Setup related:

Device: corresponds to one hardware device. (A multi-core CPU counts as one whole device.)

 

Context: the environment context. One context contains several devices (CPUs or GPUs) and is the link between them: only devices within the same context can communicate with each other. A machine may hold many contexts at the same time. You can create a context from one CPU, or from a CPU and a GPU together.

 

Command queue: the queue of commands submitted to a particular device.

 

Memory related:

Buffers: a plain, contiguous block of memory.

Images: since many applications of parallel computing involve graphics and imaging, there are native image types representing various dimensions.

 

GPU code execution:

Program: the collection of all the device code, which may contain kernels and other libraries. OpenCL is a dynamically compiled language: the source is compiled into an intermediate form (virtual machine code or assembly, depending on the implementation), which at run time is linked into the program and loaded onto the processor.

Kernel: the kernel function that runs on a processing element, together with its arguments. If you imagine the compute device as many people doing something for you at the same time, the kernel is what each of them does: everyone does the same thing, but possibly with different arguments. This is the meaning of single instruction, multiple data.

Work item: corresponds to a processing element on the hardware, the most basic unit of computation.

 

Synchronization related:

Events: in such a distributed computing environment, synchronization between different units is a big problem; events are used for synchronization.
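A rough host-side sketch of how these objects fit together, using the OpenCL 1.2 API (error handling omitted; building it requires a vendor OpenCL SDK, so this is an illustration rather than a ready-to-run program, and the vec_add kernel is a hypothetical example):

```c
#include <CL/cl.h>

// Program source: a single kernel, compiled dynamically at run time.
static const char *src =
    "__kernel void vec_add(__global const float *a,"
    "                      __global const float *b,"
    "                      __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

void run(const float *a, const float *b, float *c, size_t n) {
    // Device -> context -> command queue
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx     = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    // Buffers: host data copied into device global memory
    cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              n * sizeof(float), (void *)a, NULL);
    cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              n * sizeof(float), (void *)b, NULL);
    cl_mem C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                              n * sizeof(float), NULL, NULL);

    // Program -> kernel: dynamic compilation, then pick out one kernel
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    clSetKernelArg(k, 0, sizeof(A), &A);
    clSetKernelArg(k, 1, sizeof(B), &B);
    clSetKernelArg(k, 2, sizeof(C), &C);

    // Launch n work-items, then read the result back to the host
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, C, CL_TRUE, 0, n * sizeof(float), c, 0, NULL, NULL);
}
```

Each step maps onto one of the object types above: device, context, command queue, buffers, program, kernel, and the implicit work items created by the NDRange launch.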

 

Their relationship is as follows:

 

That is the introduction to OpenCL. I have actually followed GPGPU-related work for about ten years; back then many of these techniques still lived in the lab. Later, when CUDA appeared, I was excited and studied it for a while, but CUDA depends too heavily on specific hardware, so its industrial prospects seemed poor; it was only good for engineering experiments, since you cannot ask users installing your game to also go buy a high-end N card. So for a time I lost interest in this field. Recently I saw OpenCL emerge and found that this architecture may have a good future after all; it is something many vendors are now pushing together. It is exciting to think that the next 10000-iteration for loop might run in a single pass.

In the gaming field OpenCL already has successful practice; it seems EA's F1 has applied OpenCL, and some libraries use it too (for example, the FFT used for sea-surface water waves used to be very slow), while others use OpenCL to accelerate existing C code directly, such as parallelizing for loops. There is even an STL-style C++ template library for this kind of parallel computing, Thrust (though Thrust itself is built on CUDA). So I think OpenCL may really bring us something.

 

Some important OpenCL resources:

http://www.khronos.org/opencl/ — the Khronos Group's OpenCL home page

https://developer.nvidia.com/opencl — NVIDIA's OpenCL page

http://developer.amd.com/resources/heterogeneous-computing/opencl-zone/ — AMD's OpenCL zone

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/ — the standard API reference

http://developer.amd.com/wordpress/media/2012/10/opencl-1.2.pdf — the OpenCL 1.2 specification, the latest version at the time of writing (a must-read)

http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf — a must-read overview for getting started

http://www.kimicat.com/opencl-1/opencl-jiao-xue-yi — an OpenCL tutorial site (in Chinese)
