CUDA Parallel Computing Framework (I): Concepts and Comparisons

Source: Internet
Author: User
Tags: definition, comparison, data structures, thread

I. Concepts.

1. Related keywords.

CUDA (Compute Unified Device Architecture).

GPU stands for Graphics Processing Unit, translated into Chinese as "graphics processor."

2. CUDA is a general-purpose parallel computing architecture introduced by NVIDIA that enables the GPU to solve complex computational problems. It comprises the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU.

3. The CUDA architecture consists of three parts: the development libraries, the runtime, and the driver. The development libraries are application development libraries built on CUDA technology. The runtime provides application development interfaces and run-time components, including definitions of basic data types and functions for computation, type conversion, memory management, device access, and execution scheduling. The driver can be understood as a device abstraction layer for CUDA-enabled GPUs, providing an abstract access interface to the hardware devices.

4. Typical applications involve large-scale data, such as games, high-definition video, and satellite imagery.

II. Preliminary discussion.

1. CUDA Framework

CUDA is NVIDIA's GPGPU programming model. It uses the C language, so programs that execute on the graphics chip can be written directly in the C that most developers already know, without having to learn chip-specific instructions or special architectures.

Under the CUDA architecture, a program is divided into two parts: the host side and the device side. The host side is the part that executes on the CPU; the device side is the part that executes on the graphics chip. The device-side program is also known as the "kernel." Typically, the host program copies the input data into the video card's memory, the graphics chip then executes the device-side program, and when it finishes, the host program retrieves the results from the video card's memory.
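The host/device flow described above can be sketched as a minimal CUDA program. This is an illustrative vector-add example; the kernel name, array size, and block size of 256 are assumptions, not anything prescribed by the source:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device-side program ("kernel"): runs on the graphics chip, one thread per element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    // Host program copies the input data into the video card's memory...
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // ...launches the device-side program on the graphics chip...
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // ...and retrieves the results from the video card's memory when done.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```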

2. CPU and GPU in their respective fields

CPUs and GPUs each have their strengths. In general, CPUs are adept at handling irregular data structures and unpredictable access patterns, as well as recursive algorithms, branch-intensive code, and single-threaded routines. Such tasks involve complex instruction scheduling, loops, branches, logical decisions, and the like; examples include system software and common applications such as operating systems, word processing, interactive applications, general computation, system control, and virtualization. GPUs are adept at processing regular data structures and predictable access patterns, for example lighting and shading, 3D coordinate transformation, oil and gas exploration, financial analysis, medical imaging, finite element analysis, gene analysis, geographic information systems, and other scientific computing applications. Graphics chips usually have larger memory bandwidth and a larger number of execution units, and compared with high-end CPUs, graphics cards are cheaper.

At present, the guiding idea in designing a GPU+CPU platform is this: devote more of the CPU's resources to caching, and more of the GPU's resources to data computation. Combining the two not only reduces the cost of transmission bandwidth but also lets the CPU and GPU, the two fastest-running components in a PC, complement each other. The reason is that a CPU usually has only a few ALUs, while a GPU has far more ALUs than a CPU; conversely, the CPU's caches are relatively large, while the GPU's caches are much smaller than the CPU's. When necessary, the CPU can take over part of the GPU's software rendering work, while the GPU can use mainstream programming languages to handle general computing problems. This is equivalent to equipping the CPU with a powerful floating-point unit and the GPU with a pixel-processing unit.

From a micro-architecture perspective, CPUs and GPUs do not follow the same design philosophy. Current CPU micro-architectures are designed around both "parallel instruction execution" and "parallel data operation," balancing program execution, data operations, generality, and the trade-offs among them. CPU micro-architecture focuses on the overall efficiency of program execution rather than blindly pursuing the ultimate speed of one kind of program. GPU micro-architecture is designed for matrix-style numerical computation: it replicates a large number of computational units, so that a workload can be divided into many independent numerical threads, whose data carry little of the logical dependence found in general program execution.

In terms of clock frequency, a GPU performs each individual numerical calculation no faster than a CPU. Looking at current mainstream parts, CPU clock rates exceed 1 GHz, 2 GHz, or even 3 GHz, while GPU clock rates are generally below 1 GHz, with the mainstream at 500-600 MHz (note that 1 GHz = 1000 MHz). So for a numerical calculation using only a small number of threads, the GPU does not outperform the CPU.

In terms of instructions executed per clock cycle (IPC), CPUs and GPUs cannot be compared directly, because most GPU instructions are numerical calculations, and its small set of control instructions cannot be used directly by operating systems and software. For data instructions, the GPU's IPC is obviously higher than the CPU's, because of its parallelism. For control instructions, however, the CPU's IPC is naturally much higher. The reason is simple: the CPU focuses on the parallelism of instruction execution.

The GPU design devotes more transistors to data processing than to data caching and flow control.

More specifically, the GPU is designed to solve problems that can be expressed as data-parallel computation: the same program executing in parallel on many data elements, with extremely high computational density (the ratio of arithmetic operations to memory operations). Because every data element runs the same program, the demand for sophisticated flow control is low; and because the program runs on many data elements with high computational density, memory-access latency can be hidden by computation rather than by a large data cache.

3. GPU Thread Hierarchy

Each thread that executes the kernel is assigned a unique thread ID. The thread's index and its thread ID are directly related: for a one-dimensional block, the two are the same; for a two-dimensional block of size (Dx, Dy), the ID of the thread with index (x, y) is (x + y*Dx); and for a three-dimensional block of size (Dx, Dy, Dz), the ID of the thread with index (x, y, z) is (x + y*Dx + z*Dx*Dy).
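The ID formulas above map directly onto CUDA's built-in variables. A small sketch of a device-side helper (the function name is illustrative; threadIdx and blockDim are the built-ins):

```cuda
// Thread ID within its block, computed from the built-in 3D thread index
// (threadIdx) and the block dimensions (blockDim), matching the formulas:
//   1D: id = x
//   2D: id = x + y*Dx
//   3D: id = x + y*Dx + z*Dx*Dy
__device__ int threadIdInBlock() {
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}
```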

Threads within a block can cooperate with one another, sharing data through shared memory and synchronizing their execution to coordinate memory accesses. A thread block can contain up to 512 threads. However, a kernel can be executed by multiple thread blocks of the same size, so the total number of threads equals the number of threads per block multiplied by the number of blocks. These blocks are organized into a one-dimensional or two-dimensional grid of thread blocks.
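Cooperation through shared memory and barrier synchronization might look like the following sketch of a block-wide sum. The 256-thread block size (within the 512-thread limit mentioned above) and the kernel name are assumptions for illustration:

```cuda
#define BLOCK 256  // assumed threads per block; must be a power of two here

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[BLOCK];          // shared memory visible to the whole block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();                      // barrier: all loads finish before any reads

    // Tree reduction: each step halves the number of active threads.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                  // coordinate shared-memory access per step
    }
    if (tid == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}
```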

Thread blocks are required to execute independently: it must be possible to run them in any order, in parallel or sequentially. This independence requirement allows thread blocks to be scheduled across any number of cores, enabling the programmer to write scalable code.
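Because blocks may run in any order on any number of cores, a common way to exploit this independence is to write kernels that are correct for any grid size. A grid-stride loop sketch (the kernel name and parameters are illustrative):

```cuda
__global__ void scale(float *data, float k, int n) {
    // Grid-stride loop: each thread strides across the array, so the kernel
    // is correct however many blocks the hardware runs, and in whatever
    // order -- no block ever depends on another block's results.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= k;
    }
}
```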
