Reposted from http://blog.csdn.net/fengbingchun/article/details/19619491#comments
The following is a summary compiled from online sources:
When NVIDIA launched the GeForce 256 in 1999, it introduced the concept of the GPU (graphics processing unit). Since then, a growing number of demanding applications have driven the industry to its present scale.
GPU stands for graphics processing unit. Since its introduction, GPU performance has grown faster than Moore's law would predict, and its computing capability has kept improving. Researchers in the field noticed the potential of the GPU for computation and put forward the concept of GPGPU (general-purpose computing on graphics processing units) at the SIGGRAPH conference in 2003. The GPU has gradually evolved from a dedicated parallel processor built from a number of fixed-function units into a general-purpose computing resource supplemented by fixed-function units.
Although GPU computing has begun to emerge, the GPU has not replaced x86 solutions: many operating systems, applications, and portions of code cannot currently run on the GPU, and the so-called GPU+CPU heterogeneous supercomputers are not based entirely on GPUs. In general, applications suited to the GPU are compute-intensive, highly parallel, simple in control flow, and executed in multiple stages. Applications that meet these conditions, or that can be rewritten to have similar characteristics, can achieve higher performance on the GPU.
The GPU is the "heart and brain" of the graphics card, playing a role equivalent to that of the CPU in a computer. It determines the class and most of the performance of the graphics card, and it is also what distinguishes a 2D graphics card from a 3D one. When a 2D display chip handles 3D images and special effects, it relies mainly on the CPU's processing power, which is known as "software acceleration." A 3D display chip builds 3D image and special-effects processing into the chip itself, which is known as "hardware acceleration." The display chip is usually the largest chip on the graphics card, and the one with the most pins. Most graphics cards on the market today use graphics processing chips from NVIDIA or AMD.
GPU general-purpose computing programming models:
GPU general-purpose computing usually uses a heterogeneous CPU+GPU model: the CPU handles complex logic and transaction processing, which are poorly suited to data-parallel computation, while the GPU handles compute-intensive, large-scale data-parallel computation. Using the GPU's computing power and high bandwidth to compensate for the CPU's limited performance draws out more of the computer's potential and offers a significant advantage in cost and price/performance. Before NVIDIA launched CUDA (Compute Unified Device Architecture) in 2007, GPU general-purpose computing was constrained by limited hardware programmability and by the available development methods, which made development difficult. After 2007, while CUDA continued to evolve, other GPU general-purpose computing standards appeared: OpenCL, proposed by Apple and ultimately released by the Khronos Group; the Stream SDK from AMD; and DirectCompute, which Microsoft integrated into its then-latest Windows 7 system to support general-purpose computing on the GPU.
CUDA is a combined software and hardware system that uses the GPU as a data-parallel computing device. NVIDIA GPUs from the GeForce 8 series onward (including the GeForce, ION, Quadro, and Tesla lines) are built on an architecture that supports CUDA. The CUDA software development kit had reached CUDA Toolkit 3.2 as of November 2010 and supports the three main operating systems: Windows, Linux, and Mac OS. CUDA is programmed in an easy-to-learn C-like language, and a Fortran version for scientific computing was under development. Whether written in CUDA C or OpenCL, code for NVIDIA GPUs is eventually translated to PTX (Parallel Thread Execution, the instruction set of the CUDA architecture, similar to assembly language) and handed to the GPU cores for execution.
The CUDA programming model treats the CPU as the host and the GPU as a coprocessor, or device. A system can contain one host and several devices. The CPU and GPU each have their own memory address space: host memory and device (video) memory. Host memory is handled in CUDA much as in an ordinary C program, with the addition of pinned memory; device memory is managed by calling the memory management functions of the CUDA API. Once the parallel portion of a program has been identified, that portion of the computation can be handed to the GPU. CUDA functions that run in parallel on the GPU are called kernels (kernel functions). A complete CUDA program consists of a series of parallel kernel steps on the device and serial processing steps on the host. These steps execute in the order of the corresponding statements in the program, which preserves sequential consistency.
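The host/device split described above can be illustrated with a CPU-only Python sketch. This is not CUDA code and calls no CUDA API: the names Device, malloc, memcpy_htod, memcpy_dtoh, launch, and vec_add_kernel are hypothetical stand-ins chosen for illustration, and the grid/block indexing follows standard CUDA terminology rather than anything spelled out in the text. The sketch only emulates the sequence of steps: allocate device memory, copy inputs from host to device, run a kernel once per thread index, and copy the result back.

```python
from typing import Callable, Dict, List


class Device:
    """Toy stand-in for a GPU: owns a memory space separate from the host's."""

    def __init__(self) -> None:
        self._mem: Dict[int, List[float]] = {}  # handle -> device buffer
        self._next_handle = 0

    def malloc(self, n: int) -> int:
        """Allocate n elements of device memory and return an opaque handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._mem[handle] = [0.0] * n
        return handle

    def memcpy_htod(self, dst: int, src: List[float]) -> None:
        """Copy a host list into an existing device buffer of the same size."""
        buf = self._mem[dst]
        if len(src) != len(buf):
            raise ValueError("host and device buffer sizes differ")
        buf[:] = src

    def memcpy_dtoh(self, src: int) -> List[float]:
        """Copy a device buffer back into a new host list."""
        return list(self._mem[src])

    def launch(self, kernel: Callable, grid_dim: int, block_dim: int, *args) -> None:
        """Run the kernel body once per (block, thread) pair.

        A real GPU runs these in parallel; this emulation runs them one by one.
        """
        for block_idx in range(grid_dim):
            for thread_idx in range(block_dim):
                kernel(self._mem, block_idx, block_dim, thread_idx, *args)


def vec_add_kernel(mem, block_idx, block_dim, thread_idx, a, b, out, n):
    """Kernel body: each thread adds exactly one pair of elements."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:  # guard: the grid may contain more threads than elements
        mem[out][i] = mem[a][i] + mem[b][i]


def vec_add(host_a: List[float], host_b: List[float]) -> List[float]:
    """Host-side serial steps wrapped around one device-side kernel step."""
    n = len(host_a)
    dev = Device()
    d_a, d_b, d_out = dev.malloc(n), dev.malloc(n), dev.malloc(n)
    dev.memcpy_htod(d_a, host_a)
    dev.memcpy_htod(d_b, host_b)
    block_dim = 4  # threads per block (arbitrary for this sketch)
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
    dev.launch(vec_add_kernel, grid_dim, block_dim, d_a, d_b, d_out, n)
    return dev.memcpy_dtoh(d_out)
```

Calling vec_add with two five-element lists launches two blocks of four threads, so three threads fall outside the data and are filtered out by the bounds check inside the kernel.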
The APIs provided by the CUDA SDK are divided into the CUDA runtime API and the CUDA driver API. The runtime API is a wrapper built on top of the driver API; it hides some implementation details and makes programming more convenient. All runtime API functions carry the prefix cuda. The driver API is a lower-level, handle-based interface that can load kernel modules in binary or assembly form, specify parameters, and launch computation. Programming with the driver API is more complex, but it can sometimes achieve more sophisticated functionality or higher performance by controlling hardware execution directly. Because the device code it uses is binary or assembly code, it can be called from a variety of languages. All driver API functions carry the prefix cu. In addition, the CUDA SDK provides libraries such as CUFFT (CUDA Fast Fourier Transform), CUBLAS (CUDA Basic Linear Algebra Subroutines, a CUDA-based library of basic matrix and vector operations), and CUDPP (CUDA Data Parallel Primitives, a CUDA-based library of common parallel operations), which give developers simple and efficient routines to use directly.
Starting with CUDA Toolkit 3.0, CUDA supports NVIDIA's then-new Fermi architecture, making full use of Fermi's strengths in general-purpose computing. CUDA 3.0 also began to support C++ inheritance and templates to improve programming flexibility. CUDA C/C++ kernels are now compiled to the standard ELF format, hardware debugging is supported, and a new unified interoperability API for Direct3D and OpenGL was added, with support for OpenGL textures, the Direct3D 11 standard, and all OpenCL features.
NVIDIA has announced CUDA 6, the latest version of its parallel computing development tools, which it presents as a major advance over the previous CUDA 5.5. Key features of CUDA 6 include unified addressing, which allows direct access to both CPU memory and GPU video memory without manually copying data between them, and which makes it simpler to add GPU acceleration in a wide range of programming languages.
OpenCL (Open Computing Language) is a framework for writing programs for heterogeneous platforms, which may be composed of CPUs, GPUs, and other types of processors. OpenCL consists of a language (based on C99) for writing kernels, the functions that run on OpenCL devices, and a set of APIs for defining and controlling the platform. OpenCL provides parallel computing mechanisms based on both task partitioning and data partitioning.
OpenCL was originally developed by Apple, which holds the trademark, and was initially refined in collaboration with technical teams from AMD, IBM, Intel, and NVIDIA. Apple then submitted the draft to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed. Five months later, on November 18, 2008, the working group completed the technical details of the OpenCL 1.0 specification. The specification was published on December 8, 2008, after review by Khronos members. OpenCL 1.1 was released on June 14, 2010.
OpenCL is also a C-based programming language, divided into three parts: the platform layer, the runtime, and the compiler. The platform layer manages computing devices, provides the interface for initializing devices, and is used to create compute contexts and work queues. The runtime manages resources and executes the program's kernels. The compiler accepts a subset of ISO C99 extended with OpenCL-specific syntax. The OpenCL execution model has compute kernels and compute programs. A compute kernel is essentially the same as a kernel in CUDA: the most basic unit of computation. A compute program is a collection of compute kernels and built-in functions, similar to a dynamic library. To a large extent, OpenCL resembles the CUDA driver API.
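The relationship between a compute program, its compute kernels, and a work queue can be sketched in plain Python. This is a CPU-only emulation under stated assumptions, not the OpenCL API: the class names Program and CommandQueue and the kernels scale and add_one are hypothetical, and the queue is assumed to execute commands in the order they were enqueued. Each kernel body is called once per global id, which stands in for one work-item.

```python
from typing import Callable, Dict, List, Tuple

# A kernel takes its global id plus an input buffer and an output buffer.
Kernel = Callable[[int, List[float], List[float]], None]


class Program:
    """A compute program: a named collection of compute kernels."""

    def __init__(self, kernels: Dict[str, Kernel]) -> None:
        self._kernels = dict(kernels)

    def kernel(self, name: str) -> Kernel:
        """Look up one kernel by name, like picking a function from a library."""
        return self._kernels[name]


class CommandQueue:
    """Toy work queue: commands are recorded, then run in order by finish()."""

    def __init__(self) -> None:
        self._pending: List[Tuple[Kernel, int, tuple]] = []

    def enqueue(self, kernel: Kernel, global_size: int, *args) -> None:
        """Record a kernel launch over global_size work-items."""
        self._pending.append((kernel, global_size, args))

    def finish(self) -> None:
        """Execute every recorded launch in submission order."""
        for kernel, global_size, args in self._pending:
            for gid in range(global_size):  # one call per work-item
                kernel(gid, *args)
        self._pending.clear()


def scale(gid: int, src: List[float], dst: List[float]) -> None:
    """Kernel: double one element."""
    dst[gid] = 2.0 * src[gid]


def add_one(gid: int, src: List[float], dst: List[float]) -> None:
    """Kernel: add one to one element."""
    dst[gid] = src[gid] + 1.0


def run_pipeline(data: List[float]) -> List[float]:
    """Enqueue two kernels from the same program and run them in order."""
    program = Program({"scale": scale, "add_one": add_one})
    queue = CommandQueue()
    tmp = [0.0] * len(data)
    out = [0.0] * len(data)
    queue.enqueue(program.kernel("scale"), len(data), data, tmp)
    queue.enqueue(program.kernel("add_one"), len(data), tmp, out)
    queue.finish()
    return out
```

Because the second launch reads the buffer written by the first, the result is only correct if the queue preserves submission order, which is the behavior this sketch assumes.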
Since December 2008, when NVIDIA presented the world's first OpenCL GPU demo, running on a laptop, at SIGGRAPH Asia, AMD, NVIDIA, Apple, RapidMind, Gallium3D, ZiiLabs, IBM, and Intel have each released their own implementations of the OpenCL specification. (When OpenCL devices from different vendors are present in a single machine, this can also cause problems for application development.) Besides AMD and NVIDIA, other vendors such as S3 and VIA have also released hardware products that support OpenCL.
OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It gives software developers a unified programming environment for writing efficient, portable code for high-performance computing servers, desktop systems, and handheld devices, and it applies to a wide range of parallel processors, including multi-core CPUs, GPUs, Cell-type architectures, and digital signal processors (DSPs). It has broad prospects in games, entertainment, scientific research, medicine, and other fields. Products from both AMD (ATI) and NVIDIA now support OpenCL.
DirectCompute is an application programming interface developed and promoted by Microsoft for general-purpose computing on the GPU. It is integrated into Microsoft DirectX and allows programs running on Windows Vista or Windows 7 to use the GPU for general-purpose computation. Although DirectCompute was first implemented in the DirectX 11 API, a DirectX 10 GPU can use a subset of this API for general-purpose computing (DirectX 10 integrates DirectCompute 4.0, and DirectX 10.1 integrates DirectCompute 4.1), while a GPU that supports DirectX 11 can use the complete DirectCompute feature set (DirectX 11 integrates DirectCompute 5.0). Neither DirectCompute nor OpenCL is tied to a single GPU vendor: both are supported by the NVIDIA CUDA architecture and by ATI Stream technology.
Windows 7 adds drag-and-drop video conversion, which converts a video on the computer directly for a mobile media player. If the computer's GPU supports DirectCompute, the conversion is carried out by the GPU, at 5 to 6 times the speed of the CPU. Internet Explorer 9 adds support for DirectCompute, calling on the GPU to accelerate computation-heavy elements in web pages, and Excel 2010 and PowerPoint 2010 also support DirectCompute.
AMD's stream computing model likewise comprises a stream processor architecture and a corresponding software package. In December 2007 AMD released Stream SDK v1.0, which runs on Windows XP. It uses Brook+ as its development language; Brook+ is AMD's improved version of the Brook language (based on ANSI C) developed at Stanford University. The Stream SDK offers developers an open system and platform standard, making it easier for partners to develop third-party tools. The package includes the following components: a compiler that supports Brook+; CAL (Compute Abstraction Layer), the device driver library for stream processors; the ACML (AMD Core Math Library) library; and a kernel function analyzer.
In the Stream programming model, a program executed on a stream processor is called a kernel (kernel function). Each instance of a kernel running on the stream processor's SIMD engines is called a thread, and the threads are mapped onto a physical region of execution called the execution domain. The stream processor dispatches the array of threads to the thread processors and does not run the next kernel until all threads have finished.
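The idea of one kernel instance (thread) per position in the execution domain can be sketched in plain Python. This is a CPU-only illustration, not Brook+ or CAL code: run_kernel, saxpy, and the Stream2D alias are hypothetical names, and a two-dimensional domain is assumed for the example. The loops run the kernel instances one after another, whereas a stream processor would run them in parallel.

```python
from typing import Callable, List, Tuple

# A 2D stream: a grid of elements that all have the same type.
Stream2D = List[List[float]]


def run_kernel(kernel: Callable[..., float],
               domain: Tuple[int, int],
               *inputs: Stream2D) -> Stream2D:
    """Run one kernel instance per (y, x) position of the execution domain.

    Each instance reads the matching element of every input stream and
    produces one element of the output stream. The function returns only
    after every instance has finished, so a following kernel would always
    see the complete output.
    """
    height, width = domain
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            output[y][x] = kernel(*(stream[y][x] for stream in inputs))
    return output


def saxpy(a: float) -> Callable[[float, float], float]:
    """Build a kernel that computes a * x + y for a single element pair."""
    def kernel(x: float, y: float) -> float:
        return a * x + y
    return kernel
```

For example, run_kernel(saxpy(2.0), (2, 2), xs, ys) applies the kernel to all four positions of a 2x2 domain, and the kernel itself never mentions indices or hardware details.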
Brook+ is the high-level language of stream computing. It abstracts away hardware details: to write a kernel that runs on the stream processor, a developer only needs to specify the inputs, outputs, and execution domain, without knowing how the stream processor hardware is implemented. The two key concepts of the Brook+ language are the stream and the kernel. A stream is a collection of elements of the same type that can be processed in parallel, and a kernel is a function that can be executed in parallel over the execution domain. The Brook+ package contains BRCC and BRT. BRCC is a source-to-source compiler that translates Brook+ programs into device-specific IL (Intermediate Language) code, which is subsequently linked and executed. BRT is a runtime library that executes kernels; parts of it run on the CPU and parts on the stream processor. The kernel function library that runs on the stream processor is also called CAL (Compute Abstraction Layer). CAL is a device driver library written in C that lets developers optimize for the stream processor core at a low level while keeping the front end consistent. CAL provides device management, resource management, kernel loading and execution, multi-device support, and interoperability with 3D graphics APIs. The Stream SDK also provides a general-purpose math library, ACML (AMD Core Math Library), so that developers can obtain high-performance computation quickly. ACML includes a complete set of basic linear algebra subroutines, FFT routines, random number generation routines, and transcendental function routines.
Faced with NVIDIA's steady innovation in GPU computing, AMD has kept pace by continually improving its own Stream SDK. By November 2010, AMD had released Stream SDK v2.2, which runs on Windows 7 and some Linux distributions and supports the OpenCL 1.1 specification and double-precision floating-point arithmetic.
References:
1. http://baike.baidu.com/view/1196.htm
2. http://www.cnblogs.com/chunshan/archive/2011/07/18/2110076.html
3. http://blog.csdn.net/caiye917015406/article/details/9166115
4. http://www.rosoo.net/a/201306/16652.html