Contents
- 1 Abstract
- 2 Why OpenCL?
- 3 OpenCL Architecture
- 3.1 Introduction
- 3.2 Platform Model
- 3.3 Execution Model
- 3.3.1 Kernel
- 3.3.2 Context
- 3.3.3 Command Queue
- 3.4 Memory Model
- 3.5 Programming Model
- 4 OpenCL-Based Programming Example
- 4.1 Process
- 4.2 Image Rotation
- 4.2.1 Image Rotation Principle
- 5 Summary
- 6 References
1 Abstract
Due to limits on transistor power consumption and physical scaling, CPU development is greatly constrained, and people are looking to other approaches to improve system performance, such as multi-core processors and heterogeneous platforms. The emergence of the Open Computing Language (OpenCL) provides a standard for parallel computing on a wide range of heterogeneous systems. OpenCL offers a hardware-independent programming language, giving programmers a flexible and efficient programming environment. This article discusses the OpenCL computing architecture in depth, points out the advantages and disadvantages of OpenCL programming, and presents related programming practice. Parallel programming tests on different devices show that the OpenCL parallel programming architecture can significantly improve program running efficiency.
Heterogeneous systems are already highly cost-effective today, and we believe that in the near future OpenCL will become an important part of parallel and heterogeneous computing.
Keywords: OpenCL, heterogeneous computing, CPU/GPU computing, parallel computing
2 Why OpenCL?
Over the past few decades, the computer industry has undergone tremendous changes, and continuously improving computer performance has provided a powerful guarantee for today's applications. As described by Moore's Law, computer speed grew by packing in more transistors and raising clock frequencies. Since the beginning of the 21st century, however, this approach has hit its limits: transistors have become so small that their physical characteristics make it difficult to keep scaling up transistor counts and frequency, and because power consumption grows non-linearly with frequency, this method is greatly restricted. This trend will continue and remain one of the most important factors shaping computer systems.
There are usually two ways to address this problem. The first is to support multiple tasks and multithreading by increasing the number of processor cores, thereby improving overall system performance. The second is heterogeneity: combining computing devices such as the CPU (central processing unit), GPU (graphics processing unit), and even the APU (accelerated processing unit, an integration of CPU and GPU) to increase the speed of the system.
Heterogeneous systems are becoming more and more common, and computing that supports such environments is receiving more attention. Currently, each vendor provides programming support only for its own devices. On a heterogeneous system it is therefore difficult to program all the hardware in the same language, and it is also very difficult to treat the different devices as a unified computing unit.
The Open Computing Language (OpenCL) is designed to meet this important requirement. It defines a mechanism for a hardware-independent software development environment. OpenCL can fully exploit the parallel features of devices and supports parallelism at different levels; it maps effectively onto homogeneous or heterogeneous, single-device or multi-device systems built from CPUs, GPUs, FPGAs (field-programmable gate arrays), and future devices. OpenCL also defines a runtime that can manage resources and combine different types of hardware in the same execution environment; hopefully, in the near future, it will support dynamic balancing of computation, power consumption, and other resources in a more natural way.
I believe that in the near future, OpenCL will be widely used in heterogeneous parallel programming.
3 OpenCL Architecture
3.1 Introduction
OpenCL provides an open framework standard for writing programs, especially parallel programs, on heterogeneous platforms. The heterogeneous platforms OpenCL supports can be composed of multi-core CPUs, GPUs, or other types of processors. OpenCL consists of two parts: a language for writing kernel programs (the code that runs on OpenCL devices), and the APIs that define and control the platform. OpenCL provides two parallel computing mechanisms, one based on tasks and one based on data. It greatly extends the application scope of the GPU, which is no longer confined to the graphics field.
OpenCL is maintained by the Khronos Group, a non-profit technical organization that maintains multiple open industry standards, such as OpenGL and OpenAL, which cover three-dimensional graphics and computer audio respectively.
An OpenCL source program can be compiled and executed on both multi-core CPUs and GPUs, which greatly improves code performance and portability. The OpenCL standard is developed by a standards committee made up of major industry vendors (including AMD, Intel, IBM, and NVIDIA). As something users and programmers have long been waiting for, OpenCL brings two important changes: a cross-vendor, non-proprietary software solution, and a cross-platform heterogeneous framework that can simultaneously leverage the capabilities of all the computing units in a system.
OpenCL supports a wide range of applications, and it is difficult to generalize the process of developing one. In general, however, an application based on a heterogeneous platform includes the following steps [3]:
- Discover all the components that make up the heterogeneous platform.
- Probe the characteristics of the components so that the software can adapt to the features of the different hardware.
- Create the set of kernels that will run on the platform.
- Set up the memory objects involved in the computation.
- Execute the kernels, in the right order, on the appropriate components.
- Collect the results.
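The steps above map onto OpenCL host API calls roughly as follows. This is a minimal sketch with no error handling (a real program must check every return code and release its resources); it requires an OpenCL SDK to build and an OpenCL device to run, and the kernel source and buffer sizes are illustrative placeholders.

```c
#include <CL/cl.h>

int main(void) {
    /* 1. Discover the platform and its devices. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2-3. Create a context and build the kernel for this device. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    const char *src = "__kernel void vectorAdd(__global int *a, "
                      "__global int *b, __global int *c) { "
                      "int id = get_global_id(0); c[id] = a[id] + b[id]; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "vectorAdd", NULL);

    /* 4. Set up the memory objects involved in the computation. */
    size_t n = 1024;
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(int), NULL, NULL);
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(int), NULL, NULL);
    cl_mem c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(int), NULL, NULL);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &c);

    /* 5. Execute the kernel through the device's command queue. */
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 6. Collect the results (blocking read back to host memory). */
    int result[1024];
    clEnqueueReadBuffer(q, c, CL_TRUE, 0, n * sizeof(int), result, 0, NULL, NULL);
    clFinish(q);
    return 0;
}
```

The same six calls in the same order appear in virtually every OpenCL 1.x host program; only the kernel source and the buffer setup change.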
These steps are carried out through a series of APIs plus a kernel programming environment in OpenCL. This design adopts a "divide and conquer" strategy, splitting the problem into the following models [1]:
- Platform model
- Execution model
- Memory model
- Programming model
These concepts are the core of OpenCL's overall architecture, and the four models run through the entire OpenCL programming process. The following sections describe each of them.
3.2 Platform model
The platform model (Figure 1) designates one processor (the host) to coordinate program execution, and one or more processors (the devices) to execute OpenCL C code. This is really just an abstract hardware model, so that programmers can easily write OpenCL C functions (called kernels) and execute them on different devices.
The devices in the figure can be viewed as CPUs/GPUs, and the compute units within a device as CPU/GPU cores. All processing elements of a compute unit execute a single instruction stream, either as SIMD units or as SPMD units (each processing element maintaining its own program counter). This abstract platform model is closest to the current GPU architecture.
A platform can be regarded as a vendor's implementation of the OpenCL API. Once a platform is selected, you can only run on the devices that platform supports. As things currently stand, if you select Intel's OpenCL SDK you can only use Intel CPUs for computing, and if you select the AMD APP SDK you can compute on AMD CPUs and AMD GPUs. In general, once selected, company A's platform cannot interact with company B's platform.
3.3 Execution model
The most important concepts in the execution model are the kernel, the context, and the command queue. A context manages multiple devices, each device has a command queue, and the host program submits kernels to the different command queues for execution.
3.3.1 Kernel
The kernel is the core of the execution model and is what executes on the devices. Before executing a kernel, you must specify an N-dimensional range (NDRange): a one-, two-, or three-dimensional index space. You must also specify the total number of global work-items and the work-group size. For example, with a global work-item range of {12, 12} and a work-group size of {4, 4}, there are 9 work-groups in total.
For example, a kernel that adds two vectors:

__kernel void vectorAdd(__global int *a, __global int *b, __global int *c) {
    int id = get_global_id(0);
    c[id] = a[id] + b[id];
}
If the vector has 1024 dimensions, we could define the global work size as 1024 and the work-group size as 128, giving eight groups in total. Defining work-groups is mainly a convenience for programs that only need to exchange data within a group. Of course, the number of work-items that run at once is limited by the device: if a device has 1024 processing elements, the 1024-dimensional vector is finished with each element computing once, while a device with only 128 processing elements needs each one to compute eight times. Setting the work size and the number of work-groups sensibly can improve the program's degree of parallelism.
3.3.2 Context
For a kernel to run on a device, the host must provide a context to interact with that device. A context is an abstract container that manages memory objects on the devices and tracks the programs and kernels created for them.
3.3.3 Command queue
The host program uses command queues to submit commands to a device. Each device has a command queue, which belongs to a context; the queue schedules the commands executed on the device. These commands execute asynchronously with respect to the host program, and their mutual ordering follows one of two modes: (1) in-order execution or (2) out-of-order execution.
Kernel executions and memory commands submitted to a queue generate event objects, which are used to control command execution and to coordinate the host and the devices.
3.4 Memory Model
Different platforms generally have different memory systems; for example, CPUs have hardware-managed caches while many GPUs do not. For program portability, OpenCL defines an abstract memory model: programmers only need to target this abstract model, and the mapping onto the concrete hardware is completed by the driver.
Memory spaces are specified with keywords in the program, and the different qualifiers determine where the data lives. The basic concepts are as follows [2]:
- Global memory: readable and writable by all work-items in all work-groups; a work-item can read or write any element of such a memory object. Global memory reads and writes may be cached, depending on the capabilities of the device.
- Constant memory: a region of global memory that remains unchanged during kernel execution. The host allocates and initializes these memory objects.
- Local memory: a memory region local to a work-group, usable for variables shared by all work-items in that group. On an OpenCL device it may be implemented as dedicated memory or mapped onto global memory.
- Private memory: a memory region private to a single work-item. Variables defined in one work-item's private memory are invisible to other work-items.
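A small illustrative kernel (hypothetical, written in OpenCL C rather than host C, so it compiles only with an OpenCL compiler) shows how the four regions are named with address-space qualifiers:

```c
/* Each qualifier places data in one of the four memory regions. */
__kernel void memory_regions(__global   float *data,    /* global memory   */
                             __constant float *coeff,   /* constant memory */
                             __local    float *scratch) /* local memory    */
{
    int lid = get_local_id(0);
    /* An unqualified automatic variable lives in private memory. */
    float tmp = coeff[0] * data[get_global_id(0)];
    scratch[lid] = tmp;                /* visible to the whole work-group */
    barrier(CLK_LOCAL_MEM_FENCE);      /* synchronize the work-group      */
    data[get_global_id(0)] = scratch[lid];
}
```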
3.5 Programming Model
OpenCL supports data-parallel programming, task-parallel programming, and a mixture of the two. It supports synchronization among work-items in the same work-group, and among commands in command queues within the same context.
4 OpenCL-Based Programming Example
In this section, we describe OpenCL programming through an image rotation example. First the implementation process is outlined, and then both a C loop implementation and an OpenCL C kernel implementation of image rotation are given.
4.1 Process
4.2 Image Rotation
4.2.1 Image rotation principle
Image rotation turns a given image around a point through a certain angle, clockwise or counterclockwise; usually it means rotating counterclockwise around the image center. Assume the center of the image is (xcenter, ycenter). After rotating a point (x, y) counterclockwise by the angle θ, its new coordinates (x', y') are computed as:

x' = (x - xcenter) cos θ - (y - ycenter) sin θ + xcenter,
y' = (x - xcenter) sin θ + (y - ycenter) cos θ + ycenter.
C code:

void rotate(unsigned char *inbuf, unsigned char *outbuf,
            int w, int h, float sintheta, float costheta) {
    int i, j;
    int xc = w / 2;
    int yc = h / 2;
    for (i = 0; i < h; i++) {
        for (j = 0; j < w; j++) {
            int xpos = (j - xc) * costheta - (i - yc) * sintheta + xc;
            int ypos = (j - xc) * sintheta + (i - yc) * costheta + yc;
            if (xpos >= 0 && ypos >= 0 && xpos < w && ypos < h)
                outbuf[ypos * w + xpos] = inbuf[i * w + j];
        }
    }
}

OpenCL C kernel code:

#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void image_rotate(
    __global uchar *src_data, __global uchar *dest_data, /* data in global memory */
    int w, int h,                                        /* image dimensions      */
    float sintheta, float costheta)                      /* rotation parameters   */
{
    const int ix = get_global_id(0);
    const int iy = get_global_id(1);
    int xc = w / 2;
    int yc = h / 2;
    int xpos = (ix - xc) * costheta - (iy - yc) * sintheta + xc;
    int ypos = (ix - xc) * sintheta + (iy - yc) * costheta + yc;
    if ((xpos >= 0) && (xpos < w) && (ypos >= 0) && (ypos < h))
        dest_data[ypos * w + xpos] = src_data[iy * w + ix];
}
(Figure: the image rotated by 45 degrees)
As the code above shows, the C version needs two nested loops to compute the new coordinate positions across rows and columns. In fact, in the image rotation algorithm the computation for each point is independent and bears no relation to the coordinates of the other points, so it is well suited to parallel processing, which is exactly what the OpenCL C kernel code does.
The code above was tested on Intel's OpenCL platform with a dual-core processor and an image of size 4288 x 3216. The plain loop version runs stably at around 6 s, while the parallel OpenCL C kernel runs stably at around 0.132 s. GPU testing on an NVIDIA GeForce G105M graphics card runs stably at around 0.0810 s. Comparing the loop version, dual-core CPU parallelism, and GPU parallelism, we can see that OpenCL programming can greatly improve execution efficiency.
5 Summary
From the analysis and experiments on OpenCL programming we can conclude that applications written with OpenCL have good portability and can run on different devices. OpenCL C kernels are generally processed in parallel, which can greatly improve program running efficiency.
Heterogeneous parallel computing is becoming more and more common, yet the current OpenCL versions still have many shortcomings. For example, writing a kernel requires a fairly deep analysis of the parallelism in the problem, and memory management still requires the programmer to explicitly declare buffers and explicitly move data between host memory and device memory, rather than leaving this to the system. In these respects OpenCL does need to be strengthened, and much work remains before application development with it is both efficient and flexible.
6 References
[1] Aaftab Munshi. The OpenCL Specification, Version 1.1, Document Revision 44. Khronos OpenCL Working Group, 2011.
[2] Aaftab Munshi. The OpenCL Specification, Version 1.0, Document Revision 48 (Chinese translation by Qingliang). Khronos OpenCL Working Group, 2009.
[3] Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg. OpenCL Programming Guide. Addison-Wesley Professional, 2011.
[4] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry. Heterogeneous Computing with OpenCL. Morgan Kaufmann, 1st edition, 2011.
[5] Slo-Li Chu, Chih-Chieh Hsiao. OpenCL: Make Ubiquitous Supercomputing Possible. 12th IEEE International Conference on High Performance Computing and Communications, 2010, pp. 556-561.
[6] John E. Stone, David Gohara, Guochun Shi. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering (IEEE CS/AIP), 2010, pp. 66-72.
[7] Kyle Spafford, Jeremy Meredith, Jeffrey Vetter. Maestro: Data Orchestration and Tuning for OpenCL Devices. Euro-Par 2010, Part II, LNCS 6272, pp. 275-286. Springer-Verlag Berlin Heidelberg, 2010.
Source: http://www.cnblogs.com/wangshide/archive/2012/01/07/2315830.html
Author: let it be! Date: 2011-11-13 00:12:07
Copyright reserved.