AMD OpenCL University Course (1)

The AMD OpenCL University Course is a very good entry-level OpenCL tutorial. By reading the slides in the course, we can quickly learn about OpenCL's mechanisms and programming model: Http://developer.amd.com/zones/OpenCLZone/universities/Pages/default.aspx
The English in the tutorial is quite simple. I believe anyone learning OpenCL can follow it, and reading the original English expressions helps us understand where the various terms come from.
I have translated these tutorials into my own words, mainly to deepen my own understanding. My English is, frankly, poor.
1. Overview of Parallel Computing
In computing, parallelism refers to breaking a complex problem down into multiple sub-problems that can be processed simultaneously. To perform parallel computing, we must first have hardware that supports it, such as a multi-core CPU, where each core can carry out arithmetic or logical operations at the same time.
Generally, two types of parallelism are exploited on GPUs:
Task parallelism: the problem is divided into multiple tasks that can be executed simultaneously.
Data parallelism: the different parts of one task's data are processed simultaneously.
The following example illustrates the different types of parallelism through a farmer hiring workers to pick apples.

  • The workers who pick apples are the processing elements (the parallel processing units in hardware).
  • The trees are the tasks to be executed.
  • The apples are the data to be processed.
As the figure shows, in serial processing a single worker with a ladder picks the apples on all of the trees (one processing element processes the data of all tasks).

Data parallelism is like the farmer hiring many workers to pick the apples from a single tree (multiple processing elements process the data of one task in parallel), so that tree's apples are picked very quickly.

The farmer can also assign one worker to each tree, which is task parallelism. Within each task there is only one worker, so each task runs serially, but the tasks run in parallel with one another.

Many factors affect how a complex problem is computed in parallel. Generally, we design and implement the algorithm by decomposing the problem.

This involves two aspects:

  • Task decomposition: divide the algorithm into many small tasks, as in the example above where the orchard is divided tree by tree. At this stage we do not focus on the data; that is, we pay no attention to how many apples are on each tree.
  • Data decomposition: divide the large volume of data into discrete small chunks that can be processed in parallel, like the individual apples in the example above.

Generally, tasks are decomposed according to the dependencies within the algorithm, forming a task dependency graph. A task can be executed only once all of the tasks it depends on have completed.

This is similar to a directed acyclic graph in data structures: two tasks with no connecting path can run in parallel. Consider the classic toast-making example shown below, in which two of the tasks, preheating the oven and buying the flour and sugar, can be executed in parallel.

For most scientific and engineering applications, data decomposition is usually based on the output data, for example:

  • In image filtering, each output pixel is obtained by convolving the input pixels inside a sliding window (for example, 3×3 pixels).
  • In matrix multiplication, the dot product of row i of the first input matrix with column j of the second input matrix yields element (i, j) of the output matrix.

This method is effective when the mapping between input and output data is one-to-one or many-to-one; a matrix-multiplication sketch follows.
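As a minimal sketch of output-based decomposition (the names here are illustrative, not code from the course), each work-item can compute exactly one element of the output matrix, assuming row-major storage:

    __kernel void MatMul(__global const float* A,   /* M x K, row-major */
                         __global const float* B,   /* K x N, row-major */
                         __global float* C,         /* M x N, row-major */
                         int M, int N, int K)
    {
        int i = get_global_id(1);   /* row of the output element */
        int j = get_global_id(0);   /* column of the output element */
        if (i >= M || j >= N) return;

        /* Dot product of row i of A with column j of B yields C[i][j]:
           a many-to-one mapping from the inputs onto one output element. */
        float sum = 0.0f;
        for (int k = 0; k < K; k++)
            sum += A[i * K + k] * B[k * N + j];
        C[i * N + j] = sum;
    }

Here the decomposition follows the output: the global NDRange matches the M×N shape of C, and each work-item owns exactly one output element.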

Some data decompositions are based instead on the input data. In this case the relationship between input and output data is generally one-to-many. An image histogram is one example: each pixel must be placed into its corresponding bin (for grayscale images there are usually 256 bins). A search function is another: it may read many input records but output only a single value. For such applications, we generally have each thread compute a portion of the output and then obtain the final value through synchronization and atomic operations; the OpenCL kernel that computes a minimum value is a typical example [see the minimum-value example in chapter 2 of the ATI Stream Computing OpenCL Programming Guide]. A histogram sketch is given below.
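A minimal sketch of input-based decomposition (illustrative, not the Programming Guide's exact code): each work-item reads one input pixel and atomically increments the corresponding bin, since many pixels may land in the same bin:

    __kernel void Histogram256(__global const uchar* image,  /* grayscale input pixels */
                               __global uint* bins,          /* 256 bins, zeroed by the host */
                               int numPixels)
    {
        int i = get_global_id(0);
        if (i < numPixels)
            /* Several work-items may target the same bin at once,
               so the increment must be atomic to avoid lost updates. */
            atomic_inc(&bins[image[i]]);
    }

Real implementations usually build per-work-group histograms in local memory first and merge them afterwards, precisely to reduce this atomic contention.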

In general, how to decompose a problem depends on the specific algorithm, and you also need to consider your hardware and software; for example, optimization differs considerably between the AMD GPU platform and the NVIDIA GPU platform.

2. Commonly Used Parallel Hardware and Software

In the 1990s, parallel computing research focused mainly on automatic instruction-level parallelism in the CPU:

  • Multiple instructions with no dependencies between them are issued at the same time and executed in parallel.
  • This tutorial does not cover automatic hardware-level parallelism; if you are interested, consult a computer architecture textbook.

Higher-level parallelism, such as thread-level parallelism, is generally hard to automate and requires the programmer to tell the computer what to do and what not to do. The programmer must also consider the specific hardware, since particular hardware suits particular styles of parallel programming: multi-core CPUs suit task-parallel programming, while GPUs are better suited to data-parallel programming.

Hardware type                     | Examples             | Parallelism
----------------------------------|----------------------|------------
Multi-core superscalar processors | Phenom II CPU        | Task
Vector or SIMD processors         | SSE units (x86 CPUs) | Data
Multi-core SIMD processors        | Radeon 5870 GPU      | Data

Modern GPUs consist of many independent processor cores; on AMD GPUs these are the stream cores, which execute SIMD (single instruction, multiple data) operations and are therefore particularly well suited to data-parallel work. Generally, when a task executes on a GPU, its data is distributed across the independent cores.

On GPUs, we generally use loop unrolling and loop strip mining to convert serial code into parallel execution. For example, on the CPU, vector addition is usually implemented as follows:

    for(i = 0; i < n; i++)
    {
        C[i] = A[i] + B[i];
    }

On the GPU, we can instead launch n threads, each performing one addition, which greatly increases the parallelism of the vector addition.

    __kernel void VectorAdd(__global const float* a, __global const float* b, __global float* c, int n)
    {
        int i = get_global_id(0);
        if (i < n)               /* guard so work-items beyond n do nothing */
            c[i] = a[i] + b[i];
    }

The figure above shows the SPMD implementation of vector addition; from it we can see how the loop strip mining is carried out.
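On the host side, the loop is replaced by launching n work-items. A hedged sketch (error checks omitted; kernel, queue, bufA, bufB, and bufC are assumed to have been created earlier with the usual OpenCL setup calls):

    size_t globalSize = n;   /* one work-item per element: the loop is fully unrolled */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    clSetKernelArg(kernel, 3, sizeof(int), &n);

    /* Enqueue n instances of VectorAdd; each instance sees a distinct
       get_global_id(0), which plays the role of the loop index i. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);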

A GPU program is generally called a kernel. It uses the SPMD (Single Program, Multiple Data) programming model: SPMD runs multiple instances of the same code, with each instance operating on a different part of the data.

In data-parallel applications, loop strip mining is the most common way to implement SPMD:

    • On distributed systems, SPMD is implemented with the Message Passing Interface (MPI).
    • On shared-memory parallel systems, SPMD is implemented with POSIX threads.
    • On GPUs, SPMD is implemented with kernels.

On a modern CPU, the overhead of creating a thread is still high, so to implement SPMD on the CPU, each thread should process as large a block of data as possible, doing more work to amortize the per-thread overhead (see the sketch below). GPU threads, by contrast, are lightweight, and thread creation and scheduling are cheap, so we can fully unroll the loop and have one thread process one data element.
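A sketch of CPU-side strip mining with POSIX threads (illustrative; the names are my own, not from the course): each thread gets one large contiguous strip of the vector, so the thread-creation cost is amortized over many elements:

    #include <pthread.h>

    typedef struct { const float *a, *b; float *c; int begin, end; } Strip;

    static void* add_strip(void* arg)
    {
        Strip* s = (Strip*)arg;
        /* One big strip per thread: coarse-grained work amortizes the
           relatively high cost of creating and scheduling a CPU thread. */
        for (int i = s->begin; i < s->end; i++)
            s->c[i] = s->a[i] + s->b[i];
        return NULL;
    }

    void vector_add(const float* a, const float* b, float* c, int n, int numThreads)
    {
        pthread_t tid[numThreads];
        Strip strips[numThreads];
        int chunk = (n + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            int begin = t * chunk;
            int end = begin + chunk < n ? begin + chunk : n;
            strips[t] = (Strip){ a, b, c, begin, end };
            pthread_create(&tid[t], NULL, add_strip, &strips[t]);
        }
        for (int t = 0; t < numThreads; t++)
            pthread_join(tid[t], NULL);
    }

With, say, 4 threads and n = 1,000,000, each thread handles 250,000 elements, which is exactly the opposite granularity of the one-thread-per-element GPU version above.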

The parallel hardware in a GPU is generally organized as SIMD: after an instruction is issued, it is executed on multiple ALU units (the number of ALUs equals the SIMD width). This design reduces the amount of control-flow hardware needed per ALU.

[Figure: SIMD hardware layout]

In vector addition, a SIMD unit of width 4 can split the loop into four parts that execute simultaneously. In the apple-picking example, a worker using both hands is like a SIMD unit of width 2. Note also that current GPU hardware is based on SIMD designs, and the hardware implicitly maps SPMD threads onto the SIMD cores: developers need not worry about this mapping for correctness, only about its effect on performance. A width-4 SSE sketch follows.
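For the width-4 case, a hedged SSE sketch in C (assuming n is a multiple of 4 and the arrays are 16-byte aligned):

    #include <xmmintrin.h>

    /* One SSE instruction adds four floats at once, so each loop step
       covers four iterations of C[i] = A[i] + B[i]: SIMD of width 4. */
    void vector_add_sse(const float* a, const float* b, float* c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));
        }
    }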

GPUs generally support parallel atomic operations, which guarantee that different threads can read and write shared data without interfering with one another. Some GPUs also support system-wide atomic operations, such as synchronization through global memory, but these are very expensive. A minimum-value sketch using atomics follows.
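A hedged sketch of the minimum-value pattern mentioned in section 1 (illustrative; the Programming Guide's version is more elaborate): every work-item folds its element into one shared result with an atomic operation:

    __kernel void GlobalMin(__global const int* data,
                            __global int* result,   /* host initializes to INT_MAX */
                            int n)
    {
        int i = get_global_id(0);
        if (i < n)
            /* atomic_min serializes only the conflicting updates, so
               concurrent work-items cannot corrupt the shared result. */
            atomic_min(result, data[i]);
    }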
