8.5 System Bus Optimization
The system bus services requests from bus agents (such as logical processors) to fetch data or code from the memory subsystem. The performance impact of fetching data or code from memory depends on the characteristics of the workload, the degree to which software optimizes memory access, and the locality enhancements implemented in the software code. Techniques for characterizing a workload's memory traffic are discussed in Appendix A. Guidelines for locality enhancement are also discussed in Section 3.6.10.
The techniques discussed in Chapter 3 and Chapter 7 benefit application performance on platforms where the bus system services a single-threaded environment. In a multi-threaded environment, the bus system typically services many more logical processors, each of which can issue bus requests independently. Techniques for locality enhancement, conserving bus bandwidth, and reducing the delay of large-stride cache misses can therefore have a strong impact on processor scaling performance.
8.5.1 Conserve Bus Bandwidth
In a multi-threaded environment, bus bandwidth may be shared by memory traffic originating from multiple bus agents (these agents can be several logical processors and/or several processor cores). Conserving bus bandwidth can improve processor scaling performance. In addition, effective bus bandwidth typically decreases when there are significant large-stride cache misses. Reducing the number of large-stride cache misses (or reducing DTLB misses) alleviates the bandwidth reduction caused by such misses.
One way to conserve available bus command bandwidth is to improve the locality of code and data. Improving data locality reduces the number of cache line evictions and requests to fetch data. This technique also reduces the number of instruction fetches from system memory.
User/Source Coding Rule 26: Improve data and code locality to conserve bus command bandwidth.
A compiler that supports profile-guided optimization can improve code locality by keeping frequently used code paths in the cache, which reduces instruction fetches. Loop blocking can also enhance data locality. Other locality enhancement techniques can likewise be applied in a multi-threaded environment to conserve bus bandwidth (see Section 7.6).
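As a brief illustration of loop blocking, the following sketch contrasts a naive matrix transpose with a blocked version. It is a hypothetical example: the matrix size N and tile size BLOCK are illustrative parameters to be tuned so that one tile of the source and one tile of the destination fit in the cache together.

#include <stddef.h>

#define N      4096    /* matrix dimension (illustrative)                  */
#define BLOCK  64      /* tile size; tune so two BLOCK x BLOCK tiles of    */
                       /* doubles fit comfortably in the cache             */

/* Naive transpose: the column-wise writes to dst touch a new cache line on
 * every inner-loop iteration, so lines are evicted and re-fetched often.   */
void transpose_naive(double *dst, const double *src)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j * N + i] = src[i * N + j];
}

/* Blocked transpose: each BLOCK x BLOCK tile is read and written while it
 * is still resident in the cache, reducing evictions and bus traffic.      */
void transpose_blocked(double *dst, const double *src)
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK; i++)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    dst[j * N + i] = src[i * N + j];
}

Because each tile is reused while resident, fewer cache lines are evicted and re-requested, which translates directly into fewer bus transactions.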
Because the system bus is shared among many bus agents (logical processors or processor cores), software tuning should recognize symptoms of the bus approaching saturation. One useful technique is to examine the queue depth of bus read traffic (see Appendix A.2.1.3). When the bus queue depth is high, locality enhancements that improve cache utilization deliver more performance gain than other techniques, such as inserting additional software prefetches or overlapping bus reads to mask memory latency. An approximate working guideline for software to operate below bus saturation is to check that the bus read queue depth stays significantly below 5.
Some MP and workstation platforms may have a chipset that provides two system buses, each servicing one or more physical processors. The guidelines for conserving bus bandwidth described above also apply to each bus domain.
8.5.2 Understand the Bus and Cache Interactions
Be careful when the data sets touched by parallelized code sections cause the total working set to exceed the second-level (L2) cache and/or the consumed bandwidth to exceed the capacity of the bus. On an Intel Core Duo processor, if only one thread is using the L2 cache and/or the bus, it can expect the maximum benefit from the cache and bus systems, because the other core does not interact with the first thread. However, if two threads use the L2 cache concurrently, performance may degrade if one of the following conditions is true:
● The combined working set of the two threads is larger than the L2 cache.
● The combined bus usage of the two threads exceeds the capacity of the bus.
● Both threads access the same set of data in the L2 cache extensively, and at least one of them writes to the shared cache lines.
To avoid these pitfalls, multi-threaded software should investigate parallelism schemes in which only one thread accesses the L2 cache at a time, or in which the combined L2 cache and bus usage does not exceed their capacity limits (the third condition above can often be removed directly, as in the sketch below).
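The following minimal sketch addresses the third condition by giving each thread a private, cache-line-sized slot for the data it writes. The names, NUM_THREADS, and the 64-byte line size are assumptions for illustration; thread creation with pthread_create() is omitted.

#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS     2     /* assumption: two cores sharing one L2      */
#define CACHE_LINE_SIZE 64    /* assumption: 64-byte cache lines           */

/* One partial result per thread, padded so each occupies its own cache
 * line; the two threads therefore never write to the same L2 line.        */
struct padded_sum {
    double value;
    char   pad[CACHE_LINE_SIZE - sizeof(double)];
};

static struct padded_sum partial[NUM_THREADS];

struct work { const double *data; size_t count; int tid; };

static void *worker(void *arg)            /* run via pthread_create()      */
{
    struct work *w = (struct work *)arg;
    double local = 0.0;                   /* accumulate in a register      */
    for (size_t i = 0; i < w->count; i++)
        local += w->data[i];              /* shared input is only read     */
    partial[w->tid].value = local;        /* one write, to a private line  */
    return NULL;
}

Each thread reads the shared input but writes only once, to its own padded slot, so neither thread forces ownership transfers of L2 cache lines written by the other.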
8.5.3 Avoid Excessive Software Prefetches
The Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns. In most situations, the hardware prefetcher reduces system memory latency without explicit intervention from software prefetches. It is also preferable to adjust data access patterns in the code to take advantage of the characteristics of the automatic hardware prefetcher, either to improve locality or to mask memory latency. Processors based on Intel Core microarchitecture provide several advanced hardware prefetching mechanisms; data access patterns that take advantage of earlier hardware prefetch mechanisms can generally take advantage of the more recent implementations as well.
Using software prefetch instructions excessively or indiscriminately will inevitably cause performance penalties, because doing so wastes the command and data bandwidth of the system bus.
Software prefetches can also delay the hardware prefetcher from starting to fetch data needed by the processor core; they consume critical execution resources and can result in execution stalls. In some cases it is fruitful to evaluate reducing or removing software prefetches and migrating toward more effective use of the hardware prefetch mechanisms. Guidelines for using software prefetch instructions are described in Chapter 3. Techniques for using the automatic hardware prefetcher are discussed in Chapter 7.
User/Source Coding Rule 27: Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization.
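The contrast below is a minimal sketch of this rule. The functions are hypothetical, and the prefetch distance of 16 elements is an assumption that would need tuning; the point is that the sequential loop needs no software prefetch at all, while the irregular loop issues at most one targeted prefetch per future element.

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sequential traversal: the ascending-address pattern is exactly what the
 * automatic hardware prefetcher detects, so adding PREFETCH instructions
 * here would only consume bus command bandwidth.                          */
double sum_sequential(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Indexed (irregular) traversal: the hardware prefetcher cannot predict
 * the next address, so one software prefetch a modest distance ahead may
 * help -- one per future element, not several per iteration.              */
double sum_indexed(const double *a, const int *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)    /* prefetch distance: an assumption, tune it    */
            _mm_prefetch((const char *)&a[idx[i + 16]], _MM_HINT_T0);
        s += a[idx[i]];
    }
    return s;
}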
8.5.4 Improve Effective Latency of Cache Misses
System memory access latency due to cache misses is affected by bus traffic, because bus read requests must be arbitrated along with other requests for bus transactions. Reducing the number of outstanding bus transactions helps improve effective memory access latency.
One technique to improve the effective latency of memory read transactions is to use multiple overlapping bus reads to reduce the latency of sparse reads. In situations where there is little data locality, or where memory reads must be arbitrated with other bus transactions, the effective latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back so that multiple outstanding read transactions overlap. The average latency of back-to-back bus reads is likely to be lower than that of scattered reads interspersed with other bus transactions, because only the first memory read needs to wait for the full delay of a cache miss.
User/Source Coding Rule 28: Consider using overlapping multiple back-to-back memory reads to improve effective cache miss latencies.
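The following sketch (hypothetical functions and data layout) contrasts a dependent read chain, where miss latencies serialize, with independent back-to-back reads that the processor can keep outstanding on the bus at the same time:

#include <stddef.h>

/* Dependent (serialized) reads: each miss must complete before the next
 * address is even known, so miss latencies add up one after another.      */
long walk_list(const long *next, long head, size_t hops)
{
    long node = head;
    for (size_t i = 0; i < hops; i++)
        node = next[node];          /* the next read depends on this one   */
    return node;
}

/* Independent back-to-back reads: the four loads per iteration have no
 * mutual dependence, so several read transactions can be outstanding on
 * the bus simultaneously and their miss latencies largely overlap.        */
long sum_scattered(const long *table, const size_t *idx, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += table[idx[i + 0]];
        s1 += table[idx[i + 1]];
        s2 += table[idx[i + 2]];
        s3 += table[idx[i + 3]];
    }
    for (; i < n; i++)              /* remainder                           */
        s0 += table[idx[i]];
    return s0 + s1 + s2 + s3;
}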
Another way to reduce effective memory latency is possible if the data access pattern can be adjusted so that the access strides causing successive cache misses in the last-level cache are predominantly smaller than the trigger threshold distance of the automatic hardware prefetcher. See Section 7.6.3.
User/Source Coding Rule 29: Consider adjusting the sequencing of memory references such that the distribution of distances between successive last-level cache misses peaks toward 64 bytes.
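As a simple illustration of this rule, the sketch below (hypothetical array dimensions) traverses the same two-dimensional array column-wise, producing large-stride misses, and row-wise, where successive last-level cache misses fall 64 bytes apart and stay within the hardware prefetcher's trigger threshold:

#include <stddef.h>

#define ROWS 1024
#define COLS 8192    /* one row of doubles is 64 KB (illustrative sizes)   */

/* Column-wise traversal: successive accesses are COLS * 8 bytes apart, so
 * every last-level cache miss is a large-stride miss that the hardware
 * prefetcher's trigger threshold does not cover.                          */
double sum_column_major(const double (*m)[COLS])
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* Row-wise traversal of the same data: successive last-level cache misses
 * are 64 bytes apart (one cache line), so the hardware prefetcher can
 * service most of them ahead of demand.                                   */
double sum_row_major(const double (*m)[COLS])
{
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}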
8.5.5 Use Full Write Transactions to Achieve a Higher Data Rate
Write transactions across the bus result in writes to physical memory either using the full 64-byte line size or less than the full line size; the latter is referred to as a partial write. Typically, writes to write-back (WB) memory addresses are full-size, while writes to write-combining (WC) or uncacheable (UC) memory addresses result in partial writes. Both cached WB store operations and WC store operations use a set of six 64-byte-wide WC buffers to manage the traffic of write transactions. When competing traffic closes a WC buffer before all writes to the buffer are completed, the result is a series of 8-byte partial bus transactions rather than a single 64-byte write transaction.
User/Source Coding Rule 30: Use full write transactions to achieve higher data throughput.
Frequently, multiple partial writes to WC memory can be combined into full-size writes using a software write-combining technique, which separates WC store operations from competing WB store traffic. To implement software write-combining, writes destined for memory with the WC attribute are first written to a small temporary buffer (WB type) that fits in the L1 data cache. When the temporary buffer is full, the application copies its contents to the final WC destination.
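A minimal sketch of software write-combining is shown below. The structure and function names are hypothetical, and wc_dst is assumed to point into a WC-mapped region (for example, a device aperture): small writes are staged in a cacheable 64-byte buffer and copied to the WC destination only when a full line has accumulated.

#include <stddef.h>
#include <string.h>

#define WC_CHUNK 64   /* flush one full 64-byte bus line at a time         */

struct sw_wc {
    unsigned char buf[WC_CHUNK];   /* temporary WB staging buffer (in L1)  */
    size_t        fill;            /* bytes currently staged               */
    unsigned char *wc_dst;         /* next write position in WC memory     */
};

static void swwc_flush(struct sw_wc *w)
{
    if (w->fill) {
        memcpy(w->wc_dst, w->buf, w->fill);   /* one burst to WC memory    */
        w->wc_dst += w->fill;
        w->fill = 0;
    }
}

static void swwc_write(struct sw_wc *w, const void *src, size_t len)
{
    const unsigned char *p = (const unsigned char *)src;
    while (len) {
        size_t room = WC_CHUNK - w->fill;
        size_t n = len < room ? len : room;
        memcpy(w->buf + w->fill, p, n);       /* cheap write: hits WB L1   */
        w->fill += n;
        p += n;
        len -= n;
        if (w->fill == WC_CHUNK)
            swwc_flush(w);                    /* emit a full-line write    */
    }
}

Because the staging buffer is WB memory resident in the L1 data cache, the scattered small writes are cheap; only the copy to the WC destination reaches the bus, and it does so as full 64-byte write transactions instead of 8-byte partial writes.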
When partial writes are transacted on the bus, the effective data rate to system memory is reduced to only 1/8 of the system bus bandwidth.