R700 Instruction Set Architecture Reference Manual, Chapter 2, Section 2.6: Data Sharing


The R700 family of stream processors can share data between different execution threads. Data sharing can significantly improve performance. Figure 2.1 shows the memory hierarchy available to each thread.

 

(Translator's comment on the figure:

In one SIMD, the figure is labeled processor 0 through processor 63, yet a SIMD physically has only 16 stream processors. Processors 0 to 63 should therefore be understood as 64 logical processors: because the core clock is higher than the system clock, an instruction can be issued twice per system cycle, and other mechanisms, such as the odd/even wavefront pairing mentioned later, raise the number of issues per system cycle to four. So although a SIMD has only 16 SPs, 64 independent threads execute per unit of time, with each SP corresponding to four thread indices. This matches a wavefront of 64 threads exactly. Each thread (a logical processor in the figure) can access 128 GPRs, with a width of 256 dwords (1024 bytes).)
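A minimal C sketch of this logical-to-physical mapping, assuming the 64 logical threads are interleaved across the 16 stream processors by thread index modulo 16; the interleaving order and the names sp_of_thread and slot_of_thread are illustrative assumptions, not taken from the manual.

#include <stdio.h>

#define NUM_SP 16      /* physical stream processors per SIMD */
#define SLOTS_PER_SP 4 /* logical threads multiplexed onto each SP */

/* Hypothetical mapping: which SP and which issue slot a logical thread uses. */
static int sp_of_thread(int tid)   { return tid % NUM_SP; }
static int slot_of_thread(int tid) { return tid / NUM_SP; }

int main(void) {
    /* All 64 threads of one wavefront land on the 16 SPs, 4 per SP. */
    for (int tid = 0; tid < NUM_SP * SLOTS_PER_SP; ++tid)
        printf("thread %2d -> SP %2d, slot %d\n",
               tid, sp_of_thread(tid), slot_of_thread(tid));
    return 0;
}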

The following describes some terms:

Thread group: in R700, a thread group corresponds to a block in the CUDA model. In the CUDA model a block can have at most 512 threads; in R700 a thread group can have at most 1024 threads. As with a CUDA block, the number of threads is variable and can take any value from 1 to 1024, but for performance it should be a multiple of the SIMD width, here a multiple of 64.

Wavefront: a wavefront is a subdivision of a thread group and corresponds to the warp concept in the CUDA model. In CUDA a warp contains 32 threads. Although the warp is defined by the physical model, it matters for execution performance: all threads in a warp execute the same instruction in lockstep, so if a warp contains a branch whose threads jump to different targets, the instructions of both branches must be issued to every SP, which seriously hurts performance. It can therefore also be viewed as a logical unit, and the wavefront here is similar. A wavefront can contain up to 64 threads. In CUDA, a block of 512 threads contains 16 warps; correspondingly, in R700 a thread group of 1024 threads contains 16 wavefronts (the sketch below works through the count).
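A short sketch of that count, using the 64-thread wavefront size stated above; a partial wavefront still occupies a whole wavefront, hence the round-up.

#include <stdio.h>

#define WAVEFRONT_SIZE 64

/* How many wavefronts a thread group of the given size occupies. */
static int wavefronts_in_group(int threads) {
    return (threads + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE; /* ceiling */
}

int main(void) {
    printf("%d\n", wavefronts_in_group(1024)); /* 16, as in the text       */
    printf("%d\n", wavefronts_in_group(100));  /* 2: one full, one partial */
    return 0;
}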

Lane: a wavefront can contain up to 64 threads, each of which corresponds to one lane. Figure 2.3 shows the wavefront array of a thread group. Lanes are grouped four at a time, probably to match access to the LDS and its memory banks, which are described in detail later.

 

2.6.1 Types of shared registers

 

Shared registers allow data to be shared among threads that reside in the same lane of different wavefronts and are scheduled to run on a given SIMD. (See Figure 2.3: one column, from wave0 to wave15, is a lane, so up to 16 threads can share such a register.) An absolute addressing mode for each source and destination operand allows data to be obtained from a global (absolutely addressed) register rather than from a wavefront-private (relatively addressed) register. The maximum number of shared registers is 128 minus two times the number of clause temporary registers used. Registers placed in this pool are removed from the general pool of wavefront-private registers.

 

2.6.1.1 shared GPR pool

 

Each source and destination operand has an absolute addressing mode. This allows each register pool to be accessed by absolute address, rather than relative to the register pool allocated to the respective wavefront (see Figure 2.2). To use this pool, a state register must be programmed with the number of registers reserved for global use.

Global GPRs are accessed through an index mode (SIMD-global) in the ALU instruction. In this mode, the SRC or DST GPR address is interpreted as an absolute address in the range 0 to 127. This indexing mode is used together with the SRC_REL/DST_REL fields, allowing instructions to mix global and wavefront-local GPRs.

An additional index mode allows indexed addressing: the address is GPR + offset, where the offset comes from the instruction or from INDEX_GLOBAL_AR_X (AR.x only; see Section 4.6.1, "Relative Addressing"). This enables inter-thread communication and kernel-based addressing. (This requires a MOVA* instruction to copy the index into the AR.x register.)
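The following is a conceptual C sketch of the three addressing modes just described; wave_gpr_base, the enum names, and resolve_gpr are illustrative stand-ins for what the hardware does when it decodes the INDEX_MODE and SRC_REL/DST_REL fields, not part of the ISA.

#include <stdio.h>

enum index_mode { WAVE_RELATIVE, SIMD_GLOBAL, SIMD_GLOBAL_INDEXED };

/* Resolve an instruction's GPR field to a physical register number. */
static int resolve_gpr(enum index_mode mode, int gpr_field,
                       int wave_gpr_base, int ar_x) {
    switch (mode) {
    case WAVE_RELATIVE:       return wave_gpr_base + gpr_field; /* private */
    case SIMD_GLOBAL:         return gpr_field;          /* absolute 0..127 */
    case SIMD_GLOBAL_INDEXED: return gpr_field + ar_x;   /* GPR + AR.x      */
    }
    return -1;
}

int main(void) {
    printf("%d\n", resolve_gpr(WAVE_RELATIVE, 5, 200, 0));       /* 205 */
    printf("%d\n", resolve_gpr(SIMD_GLOBAL, 5, 200, 0));         /* 5   */
    printf("%d\n", resolve_gpr(SIMD_GLOBAL_INDEXED, 5, 200, 3)); /* 8   */
    return 0;
}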

The global GPR pool can be used to provide many powerful features, including:

1. Per-lane atomic reduction variables (the number depends on the number of GPRs), for example (see the sketch after this list):

-- per-lane maximum, minimum, and small histograms

-- software-based fences or synchronization primitives

2. A set of constants unique to each lane. This prevents:

-- duplicated loads

-- serialized thread execution due to constant lookups
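A host-side C sketch of the per-lane reduction idea: one global GPR per lane accumulates a running maximum across successive wavefronts. This is purely illustrative; on the hardware it would be ALU operations on a SIMD-global GPR rather than C arrays, and the sample data is made up.

#include <stdio.h>

#define LANES 64 /* lanes per wavefront */
#define WAVES 16 /* wavefronts per full thread group */

int main(void) {
    int global_gpr[LANES];            /* one shared register per lane */
    for (int l = 0; l < LANES; ++l) global_gpr[l] = 0;

    for (int w = 0; w < WAVES; ++w)        /* wavefronts run one after another */
        for (int l = 0; l < LANES; ++l) {  /* each lane sees only its own copy */
            int value = (w * 37 + l * 13) % 100;  /* stand-in per-thread data  */
            if (value > global_gpr[l]) global_gpr[l] = value; /* per-lane max  */
        }

    printf("lane 0 max = %d\n", global_gpr[0]);
    return 0;
}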

 

2.6.1.2 clause temporary GPR pool

 

The GPR pool can contain a section that holds clause temporary (temp) GPRs. Clause temporary GPRs avoid latency and allow peak performance because they are stored in two banks, one for odd and one for even wavefronts (see Figure 2.2). Since there are two unique sections per wavefront executing on a SIMD (that is, each wavefront actually corresponds to only one section, either odd or even), there is no conflict between the clause-temporary reads and writes of odd and even wavefronts. (Translator's note: the mechanism can be understood as follows: a clause temporary register in a lane is mapped to two banks, so when an odd and an even wavefront access the same clause temporary at the same time, one read and one write do not conflict during that period; the reading thread reads the original value, and the written value may be propagated by the hardware in the background to the clause temporary's home location.) When a globally shared register is used, both wavefronts map the register to the same location in memory, which leads to a conflict and a delay, because a write takes a whole instruction to become visible; thus, if one read and one write occur in the same instruction group but come from different wavefronts, there is a read/write conflict. The hardware delays one of the wavefronts until the write is visible to the read.

(Translator's note: based on the description above, combined with "the maximum number of shared registers is 128 minus two times the number of clause temporary registers used" from Section 2.6.1, we can see why the count is 128 - 2 * (number of clause temporaries): the clause temporary register of each lane exists in two banks, hence the factor of 2. A thread group has exactly 64 lanes, that is, each wavefront has 64 threads, so it suffices to allocate one clause temporary per lane in each bank.)

Physically, the GPR ordering starts from 0 and runs global, clause_temp, then private. Note that this ordering allows a program to use a MOV instruction with index_global to access the clause temporaries lying just past the global registers. Both the globally shared registers and the clause temporaries must lie within the first 128 GPRs, due to the size limit of the ALU instruction's DST_GPR field.
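A small C sketch of this physical ordering and of the 128 - 2 * N limit from Section 2.6.1; the counts are example parameters, and the base-address arithmetic is an assumption about how the regions pack together.

#include <stdio.h>

int main(void) {
    int num_global       = 8; /* registers reserved for global use */
    int num_clause_temps = 4; /* clause temporaries per wavefront  */

    int global_base      = 0;
    int clause_temp_base = global_base + num_global;
    /* two banks of clause temporaries: one odd, one even wavefront */
    int private_base     = clause_temp_base + 2 * num_clause_temps;

    /* Section 2.6.1: shared registers = 128 minus 2x clause temps. */
    int max_shared = 128 - 2 * num_clause_temps;

    printf("global at %d, clause temps at %d, private at %d\n",
           global_base, clause_temp_base, private_base);
    printf("max shared registers = %d\n", max_shared);
    return 0;
}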

SIMD-global GPRs are allowed only in dynamic GPR mode.

 

2.6.2 local data sharing (LDS)

 

Each SIMD has 16 KB of storage that allows low-latency communication between threads in a thread group, or between threads in a wavefront. The memory is organized as four banks, each with 256 entries of 16 bytes. The memory's write port uses an owner's-write model, in which each thread writes to its own private location. All write-address logic is provided in dedicated hardware; the instruction supplies a per-thread stride and a 16-byte line offset within the current stride. The write model prevents bank and address conflicts. Read addresses, by contrast, are computed in the kernel, and a read can fetch up to four dword-aligned 32-bit values from the location of any other thread in the thread group.

Writes are specified statically at compile time; reads are specified dynamically at run time. Each write can store up to four dwords per thread and is always aligned to four dwords. The thread group size can vary between 1 and 1024 threads (preferably a multiple of the SIMD width). The LDS space available per thread is inversely proportional to the number of threads: if a thread group has 1024 threads, each thread gets 4 dwords of writable memory; with 64 or fewer threads, each thread gets 64 dwords. The absolute addressing mode gives every thread 64 dwords regardless of group size; however, all its writes must then complete within one uninterrupted clause of a single wavefront.
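A sketch of that per-thread budget in C: the 16 KB (4096-dword) SIMD store divided across the thread group, capped at the 64 dwords per thread the text describes.

#include <stdio.h>

#define LDS_DWORDS 4096 /* 16 KB / 4 bytes per dword */

/* Writable dwords each thread gets for a given thread-group size. */
static int dwords_per_thread(int group_size) {
    int d = LDS_DWORDS / group_size;
    return d > 64 ? 64 : d; /* absolute addressing mode fixes this at 64 */
}

int main(void) {
    printf("%d\n", dwords_per_thread(1024)); /* 4, as in the text */
    printf("%d\n", dwords_per_thread(64));   /* 64                */
    printf("%d\n", dwords_per_thread(32));   /* still 64: capped  */
    return 0;
}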

Figure 2.3 shows a diagram of the LDS memory.

(Translator's note: the LDS layout can be read as follows. Horizontally, every four lanes form a group, and a wavefront contains at most 16 such groups. Within a group, each lane corresponds to one LDS memory bank (hence exactly four banks), and one such group in a wavefront corresponds to one entry in the LDS memory. Vertically, each wavefront has its own entries. Since a wavefront contains 16 groups and a SIMD's thread group contains 16 wavefronts, there are 16 * 16 = 256 entries in total.)

 

Memory allows two write access modes:

1. Wavefront-relative addressing (private), and

2. Absolute (global) addressing.

When a write is scheduled, the data is read from the GPRs and written to an address in the LDS. Within each group of four threads, each thread's address and bank are determined by the dst_stride and dst_index supplied by the instruction, together with the thread_id within the thread group or wavefront, depending on the addressing mode used (a worked example follows the parameter definitions below):

bank_id = thread_id mod 4

bank_offset = (thread_id >> 2) * dst_stride + dst_index

 

thread_id -- selected by the simd_wave_rel mode control:

0: absolute -- the thread's ID within the whole thread group

1: relative -- the thread's ID within its wavefront

 

dst_stride -- the stride, in dwords, of writes from the instruction to shared memory. Valid values include: 4, 8, 12, 16, and 64.

dst_index -- the destination index, in dwords, of writes from the instruction to shared memory. Valid values include: 4, 8, 12, 16, and 64.
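A worked example of the two formulas above in C; thread_id here is whichever ID simd_wave_rel selects, and dst_stride/dst_index are taken from the instruction fields just defined.

#include <stdio.h>

int main(void) {
    int dst_stride = 4; /* dwords; one of the valid values listed above */
    int dst_index  = 0;

    for (int thread_id = 0; thread_id < 8; ++thread_id) {
        int bank_id     = thread_id % 4;  /* thread_id mod 4 picks the bank */
        int bank_offset = (thread_id >> 2) * dst_stride + dst_index;
        printf("thread %d -> bank %d, offset %d dwords\n",
               thread_id, bank_id, bank_offset);
    }
    return 0;
}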

 
