This article summarizes the terms AMD uses officially for its HD graphics hardware (the architectures from R700 onward) and relates them to the corresponding terms in OpenCL. It mainly covers the hardware architecture and the execution model. The content is excerpted from amd_accelerated_parallel_processing_opencl_programming_guide.pdf.
A GPU compute device is composed of compute units (see Figure 1.1). Different GPU compute devices have different characteristics (such as the number of compute units), but follow a similar design pattern.
A compute unit (Note: AMD's compute unit is a hardware term that corresponds to the logical compute unit abstraction in OpenCL; in AMD hardware it is also called a SIMD, while on NVIDIA GPGPUs the equivalent is an SM, i.e., a streaming multiprocessor) contains several stream cores, which are responsible for executing the kernel, each operating on an independent data stream. Each stream core (Note: the older name is stream processor, also called an SP on NVIDIA GPGPUs) contains several processing elements, which are the basic programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence in lock-step; different compute units can execute different instructions.
A stream core is arranged as a five-way or four-way (depending on the GPU type) (Note: the R700 and R800 GPUs have five lanes, i.e., five PEs, while the HD 6900-series GPUs have four lanes, i.e., four PEs) very long instruction word (VLIW) processor (see the bottom of Figure 1.2). Up to five scalar operations (or four, depending on the GPU type) can be co-issued in one VLIW instruction. Processing elements can perform single-precision floating-point or integer operations. One of the five processing elements can additionally perform transcendental operations (sine, cosine, logarithm, and so on). Double-precision floating-point operations are performed by connecting two or four of the processing elements (excluding the transcendental core). The stream core also contains a branch execution unit to handle branch instructions.
Different GPU compute devices have different numbers of stream cores. For example, the ATI Radeon HD 5870 GPU contains 20 compute units, each compute unit contains 16 stream cores, and each stream core contains five processing elements; this yields 1600 physical processing elements.
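As an aside, the number of compute units on a given device can be queried at run time through the standard OpenCL device query; a minimal host-side sketch (assuming a valid cl_device_id named device) might look like this:

    cl_uint numComputeUnits = 0;
    // CL_DEVICE_MAX_COMPUTE_UNITS reports the number of parallel compute units on the device
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(numComputeUnits), &numComputeUnits, NULL);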
Each instance of a kernel running on a compute unit is called a work-item. The specified rectangular region of the output buffer onto which work-items are mapped is known as the n-dimensional index space, called an NDRange. The GPU schedules the range of work-items onto a group of stream cores until all work-items have been processed. Subsequent kernels can then be executed, until the application completes.
OpenCL maps the total number of work-items to be launched onto an n-dimensional grid (NDRange). The developer can specify how to divide these work-items into work-groups. AMD GPUs execute on wavefronts (groups of work-items within a compute unit [Translator's note: within a work-group] that execute in lock-step).
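For reference, the division into an NDRange and work-groups is expressed on the host through clEnqueueNDRangeKernel; the following is only an illustrative sketch (the queue, kernel, and sizes are assumed here, not taken from the guide):

    size_t globalSize[1] = { 1024 };  // total number of work-items (the NDRange)
    size_t localSize[1]  = { 64 };    // work-items per work-group
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,           // work_dim
                                        NULL,        // global offset
                                        globalSize,
                                        localSize,   // may be NULL to let the runtime choose
                                        0, NULL, NULL);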
1.3.1 Work-Item Processing
All stream cores within a compute unit execute the same instruction in each cycle. One VLIW instruction for one work-item can be issued per cycle. The block of work-items that executes together is called a wavefront. To hide latency caused by memory accesses and processing-element operations, up to four work-items from the same wavefront are pipelined on the same stream core. (Note: in short, one stream core runs the same instruction for up to four work-items of a wavefront. Thus, if a compute unit has 16 stream cores, a wavefront executes the same instruction for 64 work-items over four clock cycles. These four cycles can be regarded as one wavefront execution cycle.)
Compute units operate independently of each other, so it is possible for each compute unit to execute different instructions.
1.3.2 Flow Control
Before discussing flow control, it is necessary to clarify the relationship between a wavefront and a work-group. If the user defines a work-group, it consists of one or more wavefronts. The wavefront is the unit of execution. A wavefront consists of 64 or fewer work-items; if the wavefront size on a device is 64, a work-group with 65 to 128 work-items consists of two wavefronts. For optimal hardware usage, it is recommended to use an integer multiple of 64 work-items (Note: when the GPU's wavefront size is 64; in general, an integer multiple of the wavefront size).
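A portable way to discover this multiple at run time is the preferred work-group size multiple reported per kernel; a minimal host-side sketch (device and kernel are assumed to exist) could be:

    size_t preferredMultiple = 0;
    // On AMD GPUs of this generation this typically reports the wavefront size (e.g., 64)
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);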
Flow control, such as branching, is implemented by combining all necessary execution paths as a wavefront. If work-items within a wavefront diverge, all paths are executed serially. For example, if a work-item contains a branch with two paths, the wavefront first executes one path, then the other; the total execution time of the branch is the sum of the execution times of each path. It is important to note that even if only one work-item in a wavefront diverges, the remaining work-items in the wavefront also execute that branch. [Translator's note: for example,
__kernel void test(__global int *porg)
{
    size_t id = get_global_id(0);
    if ((id & 63) == 0)          // all work-items in the wavefront evaluate this condition
    {
        // only work-item 0 of the wavefront takes this path;
        // while it executes, the remaining work-items of the wavefront wait for it to finish
        if (porg[id] < 0)
            porg[id]++;
    }
    porg[id] *= 2;
}
] Within a branch, the number of work-items that must be executed together is called the branch granularity. On AMD hardware, the branch granularity is the same as the wavefront granularity. (Note: all work-items in one wavefront participate in that wavefront's branch execution, while work-items in other wavefronts are not affected by the current branch.)
The execution mask of a wavefront (Translator's note: a per-lane flag indicating whether the current branch is executed) is illustrated by the following construct:
if (x)
{
    // items within these braces = A
}
else
{
    // items within these braces = B
}
The wavefront mask is set to true for the lanes (elements/work-items) in which x is true, and then A is executed. [Translator's note: in high-performance dense computing, a "lane" usually corresponds to one element of a SIMD operation. For example,

    ; x86 SSE2
    ; xmm1 and xmm2 are 16 bytes long; treating each byte as a separate element,
    ; this instruction adds each single-byte element of xmm2 to the corresponding element
    ; of xmm1 and stores the result in xmm1.
    ; In this operation, each of the 16 single-byte elements of xmm1 and xmm2 is called a lane.
    paddb xmm1, xmm2

    ; Treating every four bytes of xmm1 and xmm2 as a separate element (four elements in total),
    ; this instruction adds each four-byte element of xmm2 to the corresponding element of xmm1
    ; and stores the result in xmm1.
    ; In this operation, each four-byte element of xmm1 and xmm2 is called a lane.
    paddd xmm1, xmm2

So here, if a wavefront has 64 work-items, each work-item is one lane of the wavefront.] The mask is then inverted, and B is executed.
Example 1: If two branches, A and B, each take the same time t to execute over a wavefront, then if any work-item diverges, the total execution time is 2t.
Loops execute in a similar fashion: the wavefront occupies a compute unit as long as at least one of its work-items is still being processed. Thus, the total execution time of the wavefront is determined by the work-item with the longest execution time.
Example 2: If t is the time it takes to execute a single iteration of a loop, and all work-items in a wavefront execute the loop once, but one of the work-items executes the loop 100 times, then it takes 100t to execute the whole wavefront.
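The situation in Example 2 can be sketched as a kernel whose trip count is data-dependent; the kernel below is only an illustration (its name, buffers, and the constant 100 are assumptions):

    __kernel void divergentLoop(__global const int *iterations, __global float *out)
    {
        size_t id = get_global_id(0);
        float v = 1.0f;
        // If one lane of the wavefront has iterations[id] == 100 while the others
        // have 1, the whole wavefront remains resident for roughly 100 iterations.
        for (int i = 0; i < iterations[id]; i++)
            v *= 0.5f;
        out[id] = v;
    }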
1.6 GPU Compute Device Scheduling
GPU compute devices efficiently process large numbers of work-items in parallel in a way that is transparent to the application. Each GPU compute device uses a large number of wavefronts to hide memory access latency: the resource scheduler switches to another active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete. Hiding memory access latency requires that each work-item contain a large number of ALU operations per memory load/store.
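As a rough illustration of a high ALU-to-fetch ratio, the sketch below performs one load and one store around many arithmetic operations (the kernel name, buffers, and iteration count are made up for illustration):

    __kernel void aluHeavy(__global const float *in, __global float *out)
    {
        size_t id = get_global_id(0);
        float v = in[id];             // one memory load per work-item ...
        for (int i = 0; i < 32; i++)
            v = v * 1.0001f + 0.5f;   // ... followed by many ALU operations
        out[id] = v;                  // ... and one store
    }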
Figure 1.9 shows a simplified execution sequence for work-items in a stream core. At time 0, the work-items are queued and waiting for execution. In this example, only four work-items (T0...T3) are scheduled on the compute unit. The hardware limit on the number of active work-items depends on the resource usage of the program being executed (such as the number of active registers used). An optimally programmed GPU compute device typically has thousands of active work-items.
At run time, work-item T0 executes until cycle 20; at that point it stalls because of a memory fetch request. The scheduler starts executing the next work-item, T1. Work-item T1 executes until it stalls or completes. New work-items execute, and the process continues until all available active work-items have been executed (Translator's note: for example, up to the maximum number of active work-items in the work-group). The scheduler then returns to the first work-item, T0.
If the data that work-item T0 is waiting for has been returned from the memory operation, T0 continues execution. In the example in Figure 1.9 the data is ready, so T0 continues. Because there were enough work-items and processing-element operations to cover the long memory latency, the stream core does not idle. This memory latency hiding method helps the GPU compute device achieve maximum performance.
If none of T0 through T3 is ready to execute, the stream core waits (stalls) until one of them is ready. In the example shown in Figure 1.10, T0 is the first to continue execution.
1.7 Terminology
1.7.1 Compute Kernel
To define a compute kernel, it is first necessary to define a kernel. A kernel is a small, user-developed program that is run repeatedly on a stream of data. It is a parallel function that operates on every element of the input streams (called an NDRange). Unless otherwise specified, on an AMD compute device a program is a kernel consisting of a main function and zero or more other functions; this is also called a shader program. This kernel is not to be confused with an OS kernel, which controls hardware. The most basic form of an NDRange is simply mapped over the input data and produces one output item for each input tuple. Subsequent extensions of the basic model provide random-access functionality, variable output counts, and reduction/accumulation operations. Kernels are specified using the __kernel keyword.
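In the most basic form described above, a kernel simply produces one output element per input element; a minimal illustrative sketch (the name and parameters are assumptions) is:

    __kernel void scale(__global const float *in, __global float *out, float factor)
    {
        size_t i = get_global_id(0);   // one work-item per input element
        out[i] = in[i] * factor;       // one output item per input tuple
    }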
Multiple kernel types are executed on an AMD accelerated parallel processing device, including vertex, pixel, geometry, domain, hull, and now compute kernels. Before compute kernels were developed, pixel shaders were sometimes used for non-graphics computing. Instead of relying on pixel shaders to perform computation, new hardware supports compute kernels, which are better suited to general computation and which can also be used to supplement graphics applications, enabling rendering techniques beyond those of the traditional graphics pipeline. A compute kernel is a specific type of kernel that is not part of the traditional graphics pipeline. A compute kernel can be used for graphics processing, but is better suited to non-graphics fields such as physics, AI, modeling, HPC, and other computationally intensive applications.
1.7.1.1 Work-Item Creation Order
In a compute kernel, work-items are created sequentially. This means that on a chip where each wavefront contains N work-items, the first N work-items go to wavefront 1, the second N work-items go to wavefront 2, and so on. Thus, the work-item IDs for wavefront K range from (K * N) to ((K + 1) * N) - 1.
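Inside a kernel, this mapping can be recovered from the flattened global ID; the lines below are only an illustrative fragment assuming a wavefront size of 64:

    size_t flatId = get_global_id(0);
    uint wavefrontIndex = (uint)(flatId / 64);  // wavefront K holds IDs K*64 .. (K+1)*64 - 1
    uint lane           = (uint)(flatId % 64);  // position of this work-item within its wavefront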
1.7.2 Wavefronts and Work-Groups
Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes N work-items in parallel, where N is specific to the hardware chip (64 for the ATI Radeon HD 5870 series). A single instruction is executed over all work-items in a wavefront in parallel. It is the lowest level that flow control can affect; this means that if two work-items in a wavefront take divergent control-flow paths, all work-items in the wavefront execute both paths.
Work-groups are the next-higher level of data-parallel granularity; they are enforced in software, not hardware. Synchronization points within a kernel guarantee that all work-items in a work-group reach that point (a barrier) in the code before the next statement is executed.
Work-groups are composed of wavefronts. Best performance is attained when the work-group size is an integer multiple of the wavefront size.
1.7.3 Local Data Store (LDS)
The LDS is a high-speed, low-latency memory that is private to each compute unit. It is a full gather/scatter model: a work-group can write anywhere in its allocated space. This model applies to the ATI Radeon HD 5XXX series. Current constraints of the LDS include:
1. All reads/writes are 32-bit and dword (Translator's note: 32-bit) aligned.
2. The amount of LDS is allocated per work-group. Each work-group specifies how much LDS it requires; the hardware scheduler uses this information to determine which work-groups can share a compute unit (see the host-side sketch after this list).
3. Data can only be shared within the work-items of a work-group.
4. Memory accesses outside of the work-group result in undefined behavior.
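One common way a work-group's LDS requirement is specified is by passing a size for a __local kernel argument from the host; the fragment below is only a sketch (the kernel, argument index, and size are assumptions):

    // For a kernel parameter declared as "__local float *scratch":
    size_t ldsBytes = 64 * sizeof(float);                    // LDS requested per work-group
    cl_int err = clSetKernelArg(kernel, 1, ldsBytes, NULL);  // NULL value => local memory allocation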
4.7 Local Memory (LDS) Optimization
AMD Evergreen GPUs include a Local Data Store (LDS) cache, which accelerates local memory accesses. LDS is not supported in OpenCL on AMD R700-family GPUs. LDS provides high-bandwidth access (more than 10 times higher than global memory), efficient data transfers between any work-items in a work-group, and high-performance atomic support. Local memory offers significant advantages when data is reused; for example, subsequent accesses can be read from local memory, reducing global memory bandwidth. Another advantage is that local memory does not require coalescing.
Determine the local memory size:
clGetDeviceInfo( ..., CL_DEVICE_LOCAL_MEM_SIZE, ... );
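Filled in, that call could look like the following host-side sketch (device is assumed to be a valid cl_device_id):

    cl_ulong localMemSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(localMemSize), &localMemSize, NULL);
    // On the Evergreen GPUs discussed here this reports 32 kB per compute unit.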
On all AMD Evergreen GPUs, each compute unit contains a 32 kB LDS. On high-end GPUs, the LDS contains 32 banks, each bank 4 bytes wide and 256 bytes deep (Note: that is, 256 entries per bank); the bank address is determined by bits 6:2 of the address. On low-end GPUs, the LDS contains 16 banks, each bank still 4 bytes wide, and the bank is determined by bits 5:2 of the address.
In a single cycle, local memory can service one request per bank (up to 32 accesses per cycle on the ATI Radeon HD 5870 GPU). For the ATI Radeon HD 5870 GPU, this provides a memory bandwidth of more than 100 GB/s per compute unit, and more than 2 TB/s for the whole chip. This is more than 14 times the global memory bandwidth. However, accesses that map to the same bank are serialized and serviced on consecutive cycles. A wavefront that generates bank conflicts stalls on the compute unit until all of its LDS accesses have completed. The GPU reprocesses the wavefront on subsequent cycles, enabling only the lanes that have not yet received their data (Note: a lane here refers to one lane of the wavefront, i.e., one corresponding work-item), until all conflicting accesses complete. The bank with the largest number of conflicting accesses determines the latency for the wavefront to complete the local memory operation. The worst case occurs when all 64 work-items map to the same bank, since each access then proceeds at a rate of one per cycle; in this case it takes 64 cycles to complete the local memory access for the wavefront. A program with a large number of bank conflicts may benefit from using constant or image memory instead.
Therefore, the key to effective use of the local memory is to control the access pattern so that accesses generated in the same cycle map to different banks of the local memory. One exception to note is that accesses to the same address (even though they share the same bits 6:2) can be broadcast to all requestors and do not generate a bank conflict. The LDS hardware examines the requests generated over two cycles (32 work-items of execution) for bank conflicts. Ensure, as much as possible, that the memory requests generated by a quarter-wavefront use unique address bits 6:2 to avoid bank conflicts. A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the ATI Radeon HD 5870 GPU. Note that a sequential access pattern where each work-item reads a float4 value from LDS uses only half the banks in each cycle, and therefore delivers only half the performance of the float2 access pattern.
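The conflict-free float2 pattern mentioned above can be sketched roughly as follows (the kernel name, buffers, and fixed work-group size of 64 are assumptions made for illustration):

    __kernel void sequentialFloat2(__global const float2 *in, __global float2 *out)
    {
        __local float2 lds[64];
        uint tx = get_local_id(0);

        // Sequential float2 writes: consecutive lanes touch consecutive 8-byte
        // locations, so the addresses spread across the banks without conflicts.
        lds[tx] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Sequential reads (here simply reversed) are likewise conflict-free.
        out[get_global_id(0)] = lds[63 - tx];
    }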
Each stream processor can generate up to two 4-byte LDS requests per cycle. Byte and short reads also consume four bytes of LDS bandwidth. Since each stream processor can execute five (or four) operations in the VLIW per cycle (typically requiring 10-15 input operands), two local memory requests might not provide enough bandwidth to service the entire instruction (Note: one VLIW instruction). Developers can instead use the large register file: each compute unit has 256 kB of register space available (8 times the size of the LDS), which can provide up to twelve 4-byte values per cycle (6 times the LDS bandwidth). Registers do not offer the same indexing flexibility as the LDS, but for some algorithms this can be overcome with loop unrolling and explicit addressing.
LDS reads require one ALU operation to initiate them. Each operation can initiate two loads of up to four bytes each.
Local memory is software-controlled "scratchpad" memory. In contrast, the caches typically used on CPUs monitor the access stream and automatically capture recent accesses in a tagged cache. Local memory permits the kernel to explicitly load data into the memory; the data remains in local memory until the kernel replaces it or the work-group ends. To declare a block of local memory, use the __local keyword. For example,
__local float localBuffer[64];
These declarations can appear either in the kernel's parameter list or in the body of the kernel. The __local syntax allocates a single block of memory, which is shared among all work-items in the work-group.
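For the parameter-list form, the host supplies the size with clSetKernelArg (passing NULL for the value); the kernel below is only an illustrative sketch with made-up names:

    // Host side (sketch): clSetKernelArg(kernel, 1, 64 * sizeof(float), NULL);
    __kernel void copyThroughLocal(__global const float *in,
                                   __local float *scratch,
                                   __global float *out)
    {
        uint tx = get_local_id(0);
        uint gx = get_global_id(0);
        scratch[tx] = in[gx];              // stage data in local memory
        barrier(CLK_LOCAL_MEM_FENCE);      // wait for the whole work-group
        out[gx] = scratch[tx];
    }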
To write data into local memory, write it into an array allocated with __local. For example:
localBuffer[i] = 5.0f;
A typical access pattern is for work-items to collaboratively write to the local memory: each work-item writes a sub-section, and as the work-items execute in parallel, they write the entire array. With the appropriate access pattern and bank alignment, these cooperative writes can lead to highly efficient memory access. Local memory is guaranteed to be consistent across work-items only at a work-group barrier; thus, before reading the values written collaboratively, the kernel must include a barrier() instruction.
The following example is a simple kernel fragment that demonstrates collaboratively writing to, and then reading from, local memory:
__kernel void localMemoryExample(__global float *In, __global float *Out)
{
    __local float localBuffer[64];
    uint tx = get_local_id(0);
    uint gx = get_global_id(0);

    // Initialize local memory:
    // Copy from this work-group’s section of global memory to local:
    // Each work-item writes one element; together they write it all
    localBuffer[tx] = In[gx];

    // Ensure writes have completed:
    barrier(CLK_LOCAL_MEM_FENCE);

    // Toy computation to compute a partial factorial, shows re-use from local
    float f = localBuffer[tx];
    for (uint i = tx + 1; i < 64; i++)
    {
        f *= localBuffer[i];
    }
    Out[gx] = f;
}
Note: host code cannot read from or write to local memory.