New Trends in GPU Parallel Computing

Source: Internet
Author: User

As GPU programmability has continued to increase, GPU applications have expanded far beyond graphics rendering, and using the GPU for general-purpose computation has become increasingly popular. Using the GPU for computation outside of graphics rendering is known as GPGPU (General-Purpose computing on Graphics Processing Units). At the same time, the CPU has run into obstacles of its own: in pursuit of generality, the CPU devotes most of its transistors to control circuitry (such as branch prediction) and cache, leaving only a small fraction for actual computation.

CPU + GPU is a powerful combination: the CPU contains a few cores optimized for serial processing, while the GPU consists of thousands of smaller, more energy-efficient cores designed for strong parallel performance. The serial parts of a program run on the CPU while the parallel parts run on the GPU. GPUs have matured to the point where a wide range of real-world applications can be implemented on them easily, running far faster than on multi-core CPU systems alone. Future computing architectures will be hybrid systems in which parallel-core GPUs run alongside multi-core CPUs.

I. From multi-core CPUs to GPU parallelization (suited to arithmetic-intensive workloads)

Although the GPU is not suited to solving every problem, we find that the scientific workloads that consume the most computing power have a naturally "parallel" character: very high computational density, large numbers of concurrent threads, and frequent memory access during execution. Fields as diverse as audio processing, visual simulation, molecular dynamics simulation, and financial risk assessment all fit this pattern. If such problems can be migrated smoothly to a GPU-based computing environment, we gain far more efficient solutions.
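Workloads like these are data-parallel: the same operation applies independently to every element, which is exactly the shape a GPU exploits by assigning one thread per element. As a minimal sketch, the classic SAXPY operation (y = a*x + y) written as a plain C loop, where each iteration is independent and could map to one GPU thread:

```c
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i]. Every iteration is independent of
 * the others, so on a GPU each index i could run as its own thread. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The key property is the absence of loop-carried dependencies: no iteration reads a value another iteration writes, so all n iterations can execute in any order, or all at once.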

Traditionally, GPUs have not been good at running branching code, but after long-term improvements to their internal architectures, ATI and NVIDIA have enabled GPUs to run complex code such as branches and loops more efficiently. At the same time, because GPUs are parallel machines, they achieve their best performance when the same operation can be applied to every data element. In a CPU programming environment it is easy to write a program that handles varying numbers of input data elements, but on a GPU parallel machine this is still considerable trouble.

General data structures are one of the greatest difficulties in GPU programming. The data structures that CPU programmers use routinely, such as lists and trees, are not easy to implement on the GPU. GPUs currently do not permit arbitrary memory access, and GPU arithmetic units were designed to operate on four-dimensional vectors representing position and color.
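To illustrate the four-component orientation: OpenCL C exposes a built-in float4 vector type whose components are operated on together. The sketch below is a plain-C analogue (the struct and helper here are hypothetical stand-ins; the real built-in vector type exists only in kernel code):

```c
/* Plain-C analogue of OpenCL C's built-in float4 vector type.
 * GPU ALUs were historically designed around such 4-wide values:
 * (x, y, z, w) positions or (r, g, b, a) colors. */
typedef struct { float x, y, z, w; } float4;

/* Component-wise add: conceptually one instruction on 4-wide
 * vector hardware, rather than four separate scalar adds. */
float4 float4_add(float4 a, float4 b) {
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```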

None of this, however, has blocked the accelerating development of GPU programming. The GPU was not originally designed for general-purpose computing, so some effort is required before it can serve general-purpose programs at high speed. In past years these efforts were carried out independently by programmers. As ATI and NVIDIA began to see the hardware requirements of the high-performance computing market, we could observe the changes GPU hardware made to adapt to future computing environments: the Fermi architecture added a versatile L2 cache and unified addressing, while the RV870 architecture continually optimized the LDS and scaled up the number of concurrent threads.

II. Advantages of Parallel Programming

OpenCL is a good choice for GPU parallel programming. OpenCL is short for Open Computing Language; it is the first unified, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL supports heterogeneous systems composed of multi-core CPUs, GPUs, the Cell architecture, digital signal processors (DSPs), and other parallel devices. Its emergence makes it easier for software developers to write code for high-performance servers, desktop computing systems, and handheld devices. OpenCL consists of a language for writing kernel programs and APIs for defining and controlling the platform. It provides two parallel programming models, task-parallel and data-parallel, which allows GPU computing to be applied well beyond graphics. Developing a program that runs across a heterogeneous platform (on both CPU and GPU) is difficult by traditional methods: GPUs from different vendors and product lines generally have different architectures, so it is hard to build software that efficiently uses all the computing resources of every platform. The emergence of OpenCL effectively solves this heterogeneous-platform problem.
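To give a taste of the language: an OpenCL C kernel replaces the loop with a per-work-item function, where get_global_id(0) supplies the index that a CPU loop would iterate. The sketch below holds a hypothetical vector-add kernel as the source string a host program would pass to clCreateProgramWithSource, alongside a CPU reference performing the same computation (the kernel string itself is not executed here, since that requires an OpenCL runtime):

```c
#include <stddef.h>

/* OpenCL C kernel source, kept as a host-side string. Each work-item
 * handles one element; get_global_id(0) replaces the loop index. */
static const char *vec_add_src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* CPU reference: the same computation the kernel expresses, with the
 * implicit per-work-item index written out as an explicit loop. */
void vec_add_ref(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The data-parallel model is visible in the kernel's shape: there is no loop at all, only the work of a single index, and the runtime launches one work-item per element of the index space.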

The OpenCL specification is published by the Khronos Group. OpenCL programs run not only on multi-core CPUs but also on GPUs, which demonstrates OpenCL's cross-platform portability and lets programmers make full use of the GPU's powerful parallel computing capability. Compared with the CPU, the GPU has several distinguishing features.

• GPUs have many more cores than high-end CPUs. Although each GPU core runs at a lower clock frequency, the GPU's overall performance-per-chip-area and performance-per-watt are much higher than the CPU's, so for heavily multi-threaded parallel computing tasks its performance is far greater.

• GPUs can hide global memory latency behind large numbers of parallel threads. In addition, GPUs have many registers, local memory, and caches to improve the effective performance of external memory access.

• In traditional CPU execution, switching between threads carries heavy overhead, so algorithms that spawn a large number of threads run inefficiently. On GPUs, by contrast, switching between threads is cheap.

• The GPU has much greater raw computing power than the CPU.

III. Parallel Programming in the OpenCL Environment

OpenCL is an open industry standard for programming heterogeneous platforms composed of different devices such as CPUs and GPUs. OpenCL is both a language and a framework for parallel programming; with it, programmers can write general-purpose programs that execute on the GPU.

The technical core of OpenCL is organized around the following four models:

• Platform model: defines the roles of the host and devices, and provides an abstract hardware model for the OpenCL C functions (kernels) that programmers write for the device. The platform model specifies that a host processor coordinates execution, and that one or more devices execute OpenCL C code (kernels).

• Execution model: defines how the OpenCL environment is configured on the host and how kernels are executed on the device. This includes setting up an OpenCL context on the host, providing mechanisms for host-device interaction, and defining the manner in which kernels run on the device.

• Memory model: defines the abstract memory hierarchy used by kernels.

• Programming model: defines how the concurrency model maps onto physical hardware.

The OpenCL framework is divided into the platform-layer API and the runtime API. The platform-layer API allows an application to query platforms and devices and to manage them through contexts. The runtime API uses contexts to manage the execution of kernels on devices.
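Putting the platform-layer and runtime APIs together, a host program follows a canonical sequence of calls. The sketch below mocks that sequence in plain C so it runs without an OpenCL SDK; the string passed at each step names the real API call it stands in for (the API names come from the OpenCL specification, while the mock step() function and run_host_flow() are hypothetical stand-ins):

```c
#include <stdio.h>

/* Mock of the canonical OpenCL host-side flow. Each step's label names
 * the real API call; the mock just records and prints that it ran. */
static int steps_done = 0;

static int step(const char *name) {
    printf("step %d: %s\n", ++steps_done, name);
    return 0; /* real calls return CL_SUCCESS (0) on success */
}

int run_host_flow(void) {
    /* Platform-layer API: discover hardware and build a context. */
    step("clGetPlatformIDs");          /* enumerate available platforms */
    step("clGetDeviceIDs");            /* pick a GPU or CPU device */
    step("clCreateContext");           /* context manages the devices */

    /* Runtime API: compile the kernel and run it through the context. */
    step("clCreateCommandQueue");      /* per-device work queue */
    step("clCreateProgramWithSource"); /* load OpenCL C source text */
    step("clBuildProgram");            /* compile it for the device */
    step("clCreateKernel");            /* extract a kernel object */
    step("clSetKernelArg");            /* bind buffers to kernel args */
    step("clEnqueueNDRangeKernel");    /* launch over the index space */
    step("clEnqueueReadBuffer");       /* copy results back to the host */
    return steps_done;
}
```

The first three steps belong to the platform layer (query and manage), the remaining seven to the runtime (execute through a context), mirroring the split described above.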

IV. OpenCL Parallel Debugging Tools

After programming with OpenCL, we can debug with gDEBugger. gDEBugger is an advanced OpenCL and OpenGL debugger, profiler, and memory analyzer. It does work that other tools cannot: it traces an application's OpenCL and OpenGL activity and reveals what is happening inside the system implementation.

Programmers can use gDEBugger in the following scenarios:

• Optimize the performance of OpenCL and OpenGL applications.

• Quickly find bugs related to OpenCL and OpenGL.

• Improve program performance and robustness.

V. Shared Memory Space Between the CPU and GPU

In the past, although the GPU and CPU have been integrated onto the same chip (as in AMD's APUs), locating data in memory during computation remained complicated, because the CPU and GPU memory pools still operated independently. To bridge the two pools, whenever a CPU program needed to offload part of its work to the GPU, the CPU had to copy the input data from CPU memory into GPU memory; when the GPU finished, the results had to be copied back into CPU memory. These steps consume time and reduce processing efficiency. In 2012, AMD joined hands with ARM, Qualcomm, Samsung, and MediaTek to establish the HSA (Heterogeneous System Architecture) Foundation, hoping to develop a new architecture for collaborative CPU-GPU computing and to assist in building a matching heterogeneous software development environment.

More recently, AMD announced a new technology within this architecture: hUMA (heterogeneous Uniform Memory Access). Through hUMA, the CPU and GPU share the same memory space, and the CPU can directly access the GPU's memory addresses without the effort of copying GPU computing data back to the CPU. At the Hot Chips conference, AMD also announced the Steamroller architecture used in desktop FX processors and the Jaguar architecture for low-power platforms. But this is not AMD's ultimate goal: they claim the competition over processor speed has ended, and the future belongs to HSA.

VI. Future Development Trends

Over the course of computer development, incompatible computing modules have continually been added to systems to solve specific problems, but rarely examined from the perspective of global optimization. The low overall efficiency of today's computers is a direct consequence of this design pattern: it is common for a software workload to be scheduled onto a computing device poorly suited to the current task and executed inefficiently. HSA presents a new architecture that can adapt to computing tasks with a wide variety of characteristics.

With HSA, data can be shared seamlessly between the CPU and GPU without memory copies or cache flushes, because tasks are scheduled onto the appropriate processor at extremely low cost. In the cited result, HSA delivers 2.3 times higher performance at 2.4 times lower power consumption. By comparison, neither a multi-core CPU, a GPU alone, nor a non-HSA CPU-GPU hybrid reaches this level. Just as important, programs can be written through simple extensions of C++ without converting to a different programming model.
