64 KB shared memory
In its first-generation CUDA architecture, NVIDIA introduced the concept of shared memory to improve application execution efficiency, and it delivered good results. Shared memory is built into each SM and connected directly to the stream processors, greatly improving data access speed.
Having seen how important shared memory is, NVIDIA provides 64 KB of combined shared memory and L1 cache for each SM in the GF100.
Each SM contains 64 KB of on-chip memory, organized as a 16 KB portion and a 48 KB portion. Two configurations are available: 16 KB of L1 cache with 48 KB of shared memory, or 48 KB of L1 cache with 16 KB of shared memory.
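The split can be selected per kernel through the CUDA runtime. A minimal sketch (the kernel and its shared-memory usage here are illustrative, not from the original text):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that benefits from a large shared-memory allocation.
__global__ void tiledKernel(float *data)
{
    __shared__ float tile[256];      // per-block scratch in shared memory
    int i = threadIdx.x;
    tile[i] = data[i];
    __syncthreads();
    data[i] = tile[255 - i];         // trivial use of the shared tile
}

int main()
{
    // Ask for the 48 KB shared memory / 16 KB L1 configuration
    // for this kernel (a preference, honored where supported):
    cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);

    // Alternatively, prefer 48 KB L1 / 16 KB shared memory:
    // cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferL1);

    return 0;
}
```

`cudaDeviceSetCacheConfig` applies the same preference device-wide instead of per kernel.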
The two forms complement each other. Shared memory improves access speed when the program's memory access pattern is explicitly defined and managed, while the L1 cache accelerates the remaining irregular algorithms, whose data addresses are not known in advance.
During graphics processing, each SM uses the 16 KB L1 cache configuration, and the L1 cache also serves as a buffer for register spills, improving efficiency. In parallel computing, the L1 cache and shared memory work together: threads within a thread block can cooperate through shared memory, reducing off-chip data traffic and greatly improving the execution efficiency of CUDA programs. Allocating the 64 KB appropriately for different workloads yields better performance.
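The cooperation described above can be sketched with a classic block-level reduction: each thread reads one element from global memory exactly once, and the rest of the work happens entirely in shared memory, so no further off-chip traffic is needed (kernel and names are illustrative):

```cuda
#include <cuda_runtime.h>

// Each block sums 256 input elements using shared memory.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];       // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];        // single global-memory read per thread
    __syncthreads();

    // Tree reduction carried out entirely in on-chip shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];    // one global-memory write per block
}
```

Without shared memory, the same reduction would repeatedly re-read partial results from off-chip DRAM, which is exactly the traffic the 48 KB shared-memory configuration is meant to avoid.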
Globally shared second-level (L2) cache
In addition to the L1 cache, NVIDIA also designed an L2 cache for the GF100, shared among the four GPCs. The L2 cache serves load, store, and texture requests, and its contents can be shared across the entire GPU, greatly improving data communication between GPCs and SMs.
The L2 cache in GF100 supports both read and write operations, which is more flexible than the read-only L2 cache of the GT200 architecture. NVIDIA states that it uses a priority-based algorithm to evict data from the L2 cache, with various checks that help ensure the data a program needs stays resident.
The L2 cache is particularly effective for workloads with unpredictable data addresses, such as physics simulation, ray tracing, and sparse data structures. It is also a good solution when multiple SMs need to read the same data, as in a post-processing filter.
In addition, the L2 cache balances demand across the SMs. The on-chip storage in each SM is private: if a program over-commits the cache in one SM, the excess cannot spill into another SM, even when cache in that other SM sits partly idle. The shared L2 cache resolves this by absorbing the overflow from the over-committed SM, making fuller use of the available cache capacity.
In GF100, the unified L2 cache replaces the L2 texture cache, the ROP caches, and the on-chip FIFOs of earlier NVIDIA GPUs.
In addition, the L2 cache executes memory accesses in program order, which provides a solid foundation for NVIDIA CUDA's C/C++ support. With separate read and write paths (for example, a read-only texture path and a write-only ROP path), read-after-write hazards can arise; a unified read/write path ensures that programs execute correctly.