I personally think that both the SM4.0 hardware specification and DX10 are transitional products. Now that SM5.0 and DX11 have arrived, I believe everyone has been waiting for them for a long time. DX11 has been in the DX SDK since the end of last year, but without any documentation, and DX12 is presumably already under development. At present, ATI's commercial product should be the RV770, and in the third quarter it will start selling the RV870, which supports DX11. NV, meanwhile, is stuck in an awkward spot with the GT200, and nobody knows when the DX11-capable GT300 will come out; my guess is the fourth quarter. I believe the most exciting battle will come in the first half of 2010, when AMD's brand-new architecture RV970, the GT300, and Intel's monster Larrabee will all appear. In this article, all GPU architecture discussion is based on the RV770 and GT200; I make guesses about the RV870 and GT300 from there and then explain them in terms of DX11. The content therefore cannot be guaranteed to be correct and is for your reference only :).
RV870: The last warrior of R600
RV770 and RV870 are both R600-family architectures. After AMD acquired ATI, it grafted the advanced on-chip bus technology from its CPUs onto the GPU. Compared with ATI's earlier GPUs, performance improved greatly, and the architecture has not hit any major technical bottleneck since it was introduced. The RV870 only raises the clock frequency, the number of stream processors, the number of texture and raster units, and other parameters; a new architecture will not be used until the RV970. The RV870's texture units and raster units also increase from the RV770's 10 and 4 to 12 and 8 (this is speculation). AMD claims the RV770 has 800 stream processors, but that figure counts individual arithmetic logic units. Since each stream processing core bundles five of them, the meaningful number is 160. The RV870 should have up to 1200 such units, which again must be divided by five to get a meaningful number, namely 240. Given the speculation that it has 12 texture units, the RV870 would organize its complete stream processors into 12 groups of 20, each group paired with one texture unit, which is indeed considerably more than the RV770.
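The unit counts above are easy to check with a few lines of Python. This is only a sketch of the arithmetic in the paragraph: the function names are made up for illustration, and the RV870 inputs (1200 ALUs, 12 texture units) are the speculation from the text, not confirmed specifications.

```python
# Marketing "stream processor" counts refer to individual ALUs;
# each complete stream processing core bundles five of them.
ALUS_PER_CORE = 5


def vliw_cores(alu_count: int) -> int:
    """Number of complete 5-wide cores behind a marketed ALU count."""
    if alu_count % ALUS_PER_CORE:
        raise ValueError("ALU count is not a multiple of the core width")
    return alu_count // ALUS_PER_CORE


def cores_per_texture_unit(alu_count: int, texture_units: int) -> int:
    """Cores in each group when every group is paired with one texture unit."""
    cores = vliw_cores(alu_count)
    if cores % texture_units:
        raise ValueError("cores do not divide evenly across texture units")
    return cores // texture_units


# RV770 figures from the text: 800 ALUs, 10 texture units.
rv770_cores = vliw_cores(800)                        # 160
rv770_group = cores_per_texture_unit(800, 10)        # 16

# RV870 figures are speculative (assumed 1200 ALUs, 12 texture units).
rv870_cores = vliw_cores(1200)                       # 240
rv870_group = cores_per_texture_unit(1200, 12)       # 20
```

The RV770 result of 16 cores per group matches the 16 cores per cluster described for that chip.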
The RV770 uses a two-level organization. The first level is 10 internal clusters connected by a crossbar bus, the same switch hub used in AMD CPUs. Each cluster contains 16 cores, one texture unit (comprising one vertex address generator, one vertex fetch unit, four texture interpolation filters, 16 texture samplers, and 4 texture address generators), 16 KB of shared memory (used for high-speed shared reads and writes between cores), and an L1 texture cache (size unclear). Each core contains one branch execution unit and five superscalar computing units, one of which can also execute transcendental functions. Now look at the GT200. Its first level consists of 10 so-called TPCs. Each TPC contains three SMs, and each SM contains eight SPs, one double-precision floating-point unit, and two special function units; each SP contains a branch unit, an integer and single-precision floating-point ALU, and a MAD unit, which is quite complicated. The most important difference is between ATI's superscalar architecture and NV's purely scalar one. Let's start with the GT200's scalar architecture. Each GT200 SM has two instruction issue ports: one feeds the branch units, the ALU and MAD units, and the DPU, and the other feeds the FMUL and SFU units. A single issue can pick up at most 2 instructions (which must satisfy the pairing requirements) and dispatch up to 8 of them, for a total of 16. The GT200 breaks vector computation down into scalar computation; for example, a float4 multiplication is split into four float multiplication instructions, which obviously increases the instruction count. To achieve good parallelism, NV has to increase the number of SPs. The RV770 instead uses so-called VLIW, a very long instruction word design, in which each SP has five computing units.
These five computing units can be combined freely. For example, float2 f0 = f1 * f2; float3 f3 = f4 * f5; can be merged into a single SIMD instruction of width 5, so five float operations complete in one issue cycle and one compute cycle. On that basis ATI's advantage looks considerable: there are fewer instructions, and it has far more ALU and MAD units than the GT200, so performance should be far higher than the GT200's. This is not the case, mainly for the following reasons. 1. The shader compiler in the current ATI driver has no out-of-order analysis and cannot split and recombine instructions. When it cannot form a perfect 5-wide vector operation, shader utilization is poor. For example, two float3 multiplications cannot be merged and need two separate issues; across those two operations shader utilization is only 60%, with two units idle each time, whereas the GT200's SP utilization should stay close to 100% for long stretches. 2. The shader clocks are completely different. The popular GT200-based GTX 280 runs its shaders at about 1300 MHz, a separate and higher clock than its core, whereas the RV770-based HD 4870 runs both core and shaders at about 750 MHz. From this we can roughly calculate the floating-point throughput of the two chips. GT200: 1.3 * 240 * 2 (one MAD counts as two floating-point operations) = 624 GFLOPS (the 2 SFUs and the DPU in each NV SM are not counted here because they are too restricted in what they can compute). RV770: 0.75 * 800 * 2 = 1200 GFLOPS. Another NV concept I find really awkward is the warp. It should have been a purely software-level CUDA programming concept, and I did not expect it to be tied to the actual hardware architecture. A GT200 warp is 32 threads, but in fact there is no concept of a thread at the GPU hardware level, only instructions.
The so-called 32-thread warp is nothing more than 32 instructions. With only 8 SPs, I am not quite sure why 32 instructions can be executed at a time. The average latency of GPU instructions is generally about four cycles, so this could be achieved either by reusing hardware that would otherwise sit idle waiting, or by running at a high core frequency. I'm fairly sure it is the former, because I did not see separate LOAD and STORE execution units, so they should be part of the SP. The RV870's core frequency is estimated to reach 950 MHz, about 25% higher than the RV770's, and core and shaders are expected to keep running at the same clock.
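The numbers in the comparison above can be reproduced with a small Python sketch. All function names here are hypothetical. The packing model is a deliberate simplification: it assumes the operations are independent and that the compiler only merges neighbouring operations when they fit whole into one 5-wide bundle, which is how the text describes the current driver. The warp calculation is one possible reading of the "reuse idle hardware" guess (32 threads spread over 8 SPs), not something the hardware documentation quoted here confirms.

```python
LANES = 5  # computing units per RV770 core, as described in the text


def pack_in_order(widths, lanes=LANES):
    """Greedy in-order packing: an op joins the current bundle only if it fits whole.

    Returns the number of occupied lanes in each issued bundle.
    """
    bundles = []
    for width in widths:
        if not 1 <= width <= lanes:
            raise ValueError("operation width must be between 1 and the lane count")
        if bundles and bundles[-1] + width <= lanes:
            bundles[-1] += width      # merge into the bundle being built
        else:
            bundles.append(width)     # no room (or no reordering): new issue
    return bundles


def utilization(widths, lanes=LANES):
    """Fraction of lanes doing useful work over all issued bundles."""
    bundles = pack_in_order(widths, lanes)
    return sum(bundles) / (lanes * len(bundles))


def peak_gflops(shader_ghz, alu_count, flops_per_alu=2):
    """Peak rate when every ALU retires one MAD (two flops) per clock."""
    return shader_ghz * alu_count * flops_per_alu


def cycles_per_warp(warp_size=32, sps=8):
    """Clocks needed to push one warp through the SPs of an SM (assumed model)."""
    return -(-warp_size // sps)  # ceiling division


print(utilization([2, 3]))    # float2 + float3 share one bundle: 1.0
print(utilization([3, 3]))    # two float3 ops need two issues: 0.6
print(peak_gflops(1.3, 240))  # GT200, about 624
print(peak_gflops(0.75, 800)) # RV770, 1200.0
print(cycles_per_warp())      # 4
```

Under this reading, the four clocks per warp line up with the roughly four-cycle instruction latency mentioned above, which would explain how 8 SPs can keep 32 threads in flight.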
The GT300 uses a 40 nm process and was originally planned to be manufactured by TSMC. It has as many as 2.4 billion transistors and 512 stream processors, up from 240 on the GT200. It still uses the TPC architecture: there are 16 TPCs, each with 32 stream processors and eight texture units, and the package area is 495 mm². Core and shaders run at asynchronous frequencies. Following NVIDIA's naming practice, the first GPU based on the GT300 core should be the GeForce GTX 380. Overall, the architecture is basically the same as the GT200's. The biggest improvements are the move to a 40 nm process, the larger number of stream processors, and some added peripheral functions to support DX11. In addition, the GT200's dual issue ports can, in the best case, execute an operation every two cycles, although this places fairly strict requirements on instruction type pairing; we believe this feature will be retained or enhanced in the GT300.
DX11 provides parallel programming features similar to CUDA and Brook through the Compute Shader (CS), which may upset NV and ATI. Parallel programming is essentially a process in which one piece of code performs operations on a large amount of data at once and then writes the results to the target at the same time; CUDA, Brook, and DX11 CS all abstract this process in software as threads. In CS, the code executed on each data element is called a thread. Threads are organized into a Thread Group in the form of a three-dimensional array, and Thread Groups can be further organized into a Thread Grid. All threads in a Thread Group share a block of shared memory, whose capacity in DX11 is 32 KB. We know that in the GT200 and RV770 this shared memory is only 16 KB in hardware. In the GT200 it sits in the SM (streaming multiprocessor) and is shared by the SM's 8 SPs (stream processors); in the RV770 it sits in the PC (stream processor cluster) and is shared by 16 five-wide superscalar shader processors. It follows that the shared memory of the GT300 and RV870 should grow to 32 KB. When a thread needs to read data that is being modified by other threads, memory access synchronization is required. The MOSI protocol (Modified, Owned, Shared, Invalid) may be used, which differs somewhat from the MESI protocol commonly used in CPUs; it will be elaborated on later. In fact, the GT200 and RV770 already support CS in hardware. According to the hardware specifications, each Thread Group on the RV770 can have at most 768 threads, while on the GT200 the limit is 1024.
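To make the Thread Group idea concrete, here is a rough Python sketch of how a three-dimensional thread ID flattens to a single index and how a group's size and shared memory use compare with the limits quoted above. The function names and the per-thread byte counts are made up for illustration; the limits (768 and 1024 threads, 16 KB and 32 KB of shared memory) are the figures from the text, and the flattening order is the usual x-fastest layout, which I am assuming matches what CS does.

```python
from math import prod

KB = 1024
SHARED_MEM_CURRENT = 16 * KB   # GT200 / RV770 hardware, per the text
SHARED_MEM_DX11 = 32 * KB      # DX11 CS capacity, per the text
MAX_THREADS = {"RV770": 768, "GT200": 1024}  # per Thread Group, per the text


def flat_index(thread_id, group_dims):
    """Flatten an (x, y, z) thread ID inside a group, with x varying fastest."""
    x, y, z = thread_id
    dx, dy, dz = group_dims
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread ID lies outside the group dimensions")
    return z * dx * dy + y * dx + x


def group_size(group_dims):
    """Total number of threads in a three-dimensional Thread Group."""
    return prod(group_dims)


def group_fits(group_dims, bytes_per_thread, max_threads, shared_bytes):
    """True if the group respects both the thread limit and the shared memory budget."""
    threads = group_size(group_dims)
    return threads <= max_threads and threads * bytes_per_thread <= shared_bytes


dims = (16, 16, 4)                      # 1024 threads in one group
print(group_size(dims))                 # 1024
print(flat_index((15, 15, 3), dims))    # 1023, the last thread in the group
print(group_fits(dims, 16, MAX_THREADS["GT200"], SHARED_MEM_DX11))  # True
print(group_fits(dims, 16, MAX_THREADS["RV770"], SHARED_MEM_DX11))  # False: too many threads
```

A 512-thread group that wants 48 bytes of shared memory per thread needs 24 KB, so it would fit the 32 KB DX11 budget but not the 16 KB available on current hardware, which is the gap the GT300 and RV870 are expected to close.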