AMD OpenCL University Course (6)

Source: Internet
Author: User
Tags: scalar

GPU Architecture

The content includes:

1. The relationship between the OpenCL spec and multi-core hardware

  • AMD GPU architecture
  • NVIDIA GPU architecture
  • Cell Broadband Engine

2. Selected OpenCL topics

  • The OpenCL compilation system
  • The Installable Client Driver (ICD)

 

First of all, a natural question: since OpenCL is platform-independent, why study the specific hardware of different vendors?

  • Understanding how the loops and data in a program map onto OpenCL kernels lets us improve code quality and achieve higher performance.
  • Understanding the differences between AMD and NVIDIA graphics cards.
  • Understanding the differences between the various hardware platforms helps us use vendor-specific OpenCL extensions, which will be discussed later in the course.

3. Traditional CPU architecture

  • The CPU is optimized for minimum latency on a single thread, and it is well suited to control-flow-intensive tasks, such as code with many if/else branches or jump instructions.
  • The control logic occupies more die area than the ALUs.
  • A multi-level cache hierarchy is used to hide latency (exploiting spatial and temporal locality).
  • The limited number of registers restricts how many threads can be active at the same time.
  • The control logic reorders program execution to provide instruction-level parallelism (ILP) and to minimize pipeline stalls (cycles in which the ALU does nothing).

4. Modern GPGPU architecture

 

  • In a modern GPU, the control logic is relatively simple (compared with a CPU), and the caches are relatively small.
  • Thread-switching overhead is low; GPU threads are lightweight.
  • Each "core" of the GPU has a large number of ALUs and a small amount of user-managed memory. [The "core" here refers to the entire GPU.]
  • The memory bus is optimized for bandwidth: roughly 150 GB/s lets a large number of ALUs perform memory operations at the same time.

5. AMD GPU hardware architecture

Now let's take a look at the architecture of the AMD 5870 graphics card (Cypress):

  • 20 SIMD engines
  • Each SIMD engine contains 16 stream cores
  • Each stream core is a 5-way VLIW multiply-add unit
  • Single-precision throughput can reach 2.72 TFLOPS
  • Double-precision throughput can reach 544 GFLOPS

A SIMD engine consists of a series of stream cores.

  • Each stream core is a five-way VLIW processor: a single VLIW instruction can issue up to five scalar operations, each executed on a processing element (PE).
  • The stream cores in a compute unit (CU; in the 8xx-series hardware, a CU corresponds to a SIMD engine) execute the same VLIW instruction.
  • The work-items executing together in a CU (or SIMD engine) are grouped into a wavefront, which is the number of threads executed together in the CU. On the 5870 the wavefront size is 64, i.e. up to 64 work-items execute together in a CU.

Note: the five slots correspond to X, Y, Z, W, and T (transcendental functions). On Cayman the T unit was removed, making the stream core a 4-way VLIW.

 

Now let's look at how the AMD GPU hardware maps onto OpenCL:

  • A work-item corresponds to a PE, and a PE is a single VLIW core.
  • A CU corresponds to multiple PEs; the CU is the SIMD engine.

The figure shows the memory architecture of the AMD GPU (the figure in the original courseware is slightly wrong: the global memory is labeled LDS).

  • Each CU has its own on-chip LDS (local data share) and registers. On the 5870, each LDS is 32 KB, organized as 32 banks of 1 KB each, with a 4-byte read/write unit.
  • Each CU has an 8 KB L1 cache (on the 5870).
  • The L2 cache shared by all CUs is 512 KB on the 5870.
  • The fast path can only perform 32-bit load/store operations.
  • The complete path can perform atomic operations and memory operations on data smaller than 32 bits.

The relationship between the AMD GPU memory architecture and the OpenCL memory model:

  • LDS corresponds to local memory, which is used to share data between work-items in a work-group. Stream cores can access the LDS much faster than global memory.
  • Private memory corresponds to the registers of each PE.
  • Constant memory mainly uses the L1 cache.

Note: there are three access patterns for AMD GPU constant memory:

  • Direct-addressing patterns: no indexing is involved; the value is determined when the kernel is launched, e.g. a fixed kernel argument.
  • Same-index patterns: all work-items access the same index address.
  • Globally scoped constant arrays: the array is initialized; if it is smaller than 16 KB, it is cached in L1, which speeds up access.

When work-items access different index addresses, the accesses cannot be cached; in that case they must be read from global memory.

 

 
