Intel 64 and IA-32 Architectures Optimization Guide: Multi-Core and Hyper-Threading Technology, 8.6 Memory Optimization


8.6 Memory Optimization

Efficient cache operation is a key aspect of memory optimization. Efficient cache operation relies on the following:

● Cache blocking

● Shared-memory optimization

● Eliminating 64-KByte aliased data accesses

● Preventing excessive first-level data cache evictions

8.6.1 Cache Blocking Technique

Loop blocking is useful for reducing cache misses and improving memory access performance. Selecting an appropriate block size is critical when applying the loop blocking technique. Loop blocking can be applied to single-threaded applications as well as to multi-threaded applications running on processors with or without HT Technology. This technique transforms the memory access pattern into blocks that fit efficiently in the target cache size.

When targeting Intel processors supporting HT Technology, the loop blocking technique applied to a shared (unified) cache can select a block size no greater than one half of the target cache if the cache is shared by two logical processors. In general, the upper limit of the block size should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Because some cache lines are needed to access data that is not part of the source or destination buffers used in cache blocking, the block size can be chosen between one quarter and one half of the target cache size (see Chapter 1).
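To make the blocking idea concrete, the sketch below processes a large array in blocks sized to roughly half of an assumed 32-KByte target cache, so that a second pass over each block still hits the cache. The buffer name, block size, and the transformation itself are illustrative assumptions, not taken from the manual.

#include <stddef.h>

/* Hypothetical sketch of loop blocking: process a large array in blocks
   sized to roughly half of an assumed 32-KByte target cache, so the
   second pass over each block reuses data that is still cached. */
#define TARGET_CACHE_BYTES (32 * 1024)
#define BLOCK_ELEMS ((TARGET_CACHE_BYTES / 2) / sizeof(float))

void transform_blocked(float *data, size_t n)
{
    for (size_t base = 0; base < n; base += BLOCK_ELEMS) {
        size_t end = base + BLOCK_ELEMS;
        if (end > n)
            end = n;
        /* First pass over the block: loads the block into the cache. */
        for (size_t i = base; i < end; i++)
            data[i] = data[i] * 2.0f;
        /* Second pass reuses the same block while it is still cached. */
        for (size_t i = base; i < end; i++)
            data[i] = data[i] + 1.0f;
    }
}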

Software can use the CPUID deterministic cache parameters to determine which subset of logical processors shares a given cache (see Chapter 7). Thus, the guideline above can be extended to allow all the logical processors serviced by a given cache to use that cache simultaneously, by setting the block size to the total cache size divided by the upper limit of the number of logical processors serviced by that cache. This technique can also be applied to single-threaded applications that are part of a multi-threaded workload.
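As a rough sketch of this query, the following code reads the CPUID deterministic cache parameters leaf (leaf 4) to obtain each cache's size and the upper limit on the number of logical processors it services, then derives a block-size bound as described above. It assumes a GCC/Clang environment providing <cpuid.h> and the __cpuid_count macro; the program structure and printed output are illustrative.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    for (unsigned int idx = 0; ; idx++) {
        __cpuid_count(4, idx, eax, ebx, ecx, edx);
        unsigned int type = eax & 0x1f;              /* 0 = no more caches */
        if (type == 0)
            break;
        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int sharing    = ((eax >> 14) & 0xfff) + 1; /* upper limit of logical CPUs per cache */
        unsigned int line_size  = (ebx & 0xfff) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
        unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
        unsigned int sets       = ecx + 1;
        unsigned long cache_bytes =
            (unsigned long)ways * partitions * line_size * sets;
        /* Per the guideline above: divide the cache size by the number of
           logical processors it services to bound the block size. */
        unsigned long max_block = cache_bytes / sharing;
        printf("L%u cache: %lu bytes, shared by up to %u logical CPUs, "
               "max block ~%lu bytes\n",
               level, cache_bytes, sharing, max_block);
    }
    return 0;
}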

User/Source Coding Rule 31: Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously.

8.6.2 Shared Memory Optimization

Maintaining cache coherency between separate physical processors frequently involves moving data across a bus that operates at a clock rate substantially slower than the processor frequency.

8.6.2.1 Minimize Data Sharing Between Physical Processors

When two threads executing on two physical processors share data, reading from or writing to the shared data usually involves several bus transactions (including snooping, requests for ownership, and sometimes fetching data across the bus). A thread that accesses a large amount of shared memory is likely to exhibit poor processor-scaling performance.

User/Source Coding Rule 32: Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. A platform consisting of multiple bus domains should also minimize data sharing across bus domains.

One technique for minimizing data sharing is to copy data to local stack variables if the data is to be accessed repeatedly over an extended period. If necessary, results produced by multiple threads can be combined later by writing them back to a shared memory location. This approach also minimizes the time spent synchronizing access to shared data.
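A minimal sketch of this technique follows; the buffer size, worker function, and mutex-protected write-back are illustrative assumptions. Each thread copies the shared input to a local stack buffer once, performs its repeated accesses on the copy, and writes only its final result back to shared memory.

#include <string.h>
#include <pthread.h>

/* Hypothetical sketch: shared_input and shared_result are shared between
   threads; names, sizes, and the mutex are illustrative. */
#define N 1024
extern double shared_input[N];
extern double shared_result;
extern pthread_mutex_t result_lock;

void *worker(void *arg)
{
    (void)arg;
    double local[N];                              /* local stack copy */
    memcpy(local, shared_input, sizeof(local));   /* one read of shared data */

    double partial = 0.0;
    for (int i = 0; i < N; i++)                   /* repeated accesses hit the copy */
        partial += local[i] * local[i];

    pthread_mutex_lock(&result_lock);             /* single write-back at the end */
    shared_result += partial;
    pthread_mutex_unlock(&result_lock);
    return NULL;
}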

8.6.2.2 Batched Producer-Consumer Model

The key benefit of a threaded producer-consumer design is that a shared second-level cache is used to pass data between the producer and the consumer, minimizing bus traffic. On an Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache, re-ordering the producer and consumer tasks is necessary to achieve optimal performance. This is because fetching data from L2 to L1 is much faster than having a cache line invalidated in one core and then fetched from the bus.

Figure 8-5 illustrates a batched producer-consumer model that can be used to overcome the drawback of using small work buffers in a producer-consumer model. In the batched producer-consumer model, each scheduling quantum batches two or more producer tasks, with each producer working on a designated buffer. The number of tasks to batch is determined by the criterion that the total working set be larger than the first-level cache but smaller than the second-level cache.
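For example, under assumed (illustrative) buffer and cache sizes, the number of buffers to batch could be bounded so that the combined working set exceeds the first-level cache but stays comfortably within the second-level cache:

/* Illustrative only: pick batchsize so that batchsize * BUFF_BYTES lies
   between the L1 and L2 cache sizes. The cache and buffer sizes here are
   assumptions, not values from the manual. */
#define L1_BYTES   (32 * 1024)
#define L2_BYTES   (2 * 1024 * 1024)
#define BUFF_BYTES (16 * 1024)

enum {
    /* smallest count whose working set exceeds L1 ... */
    MIN_BATCH = (L1_BYTES / BUFF_BYTES) + 1,
    /* ... while still fitting (with headroom) within L2 */
    MAX_BATCH = (L2_BYTES / 2) / BUFF_BYTES
};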

Example 8-8 shows a batched implementation of the producer and consumer thread functions.

Example 8-8. Batched Implementation of the Producer and Consumer Threads

void producer_thread()
{
    int iter_num = workamount - batchsize;
    int mode1;
    for (mode1 = 0; mode1 < batchsize; mode1++)
        produce(buffs[mode1], count);
    while (iter_num--) {
        signal(&signal1, 1);
        produce(buffs[mode1], count);   // placeholder function
        waitforsignal(&end1);
        mode1++;
        if (mode1 > batchsize)
            mode1 = 0;
    }
}

void consumer_thread()
{
    int mode2 = 0;
    int i;
    int iter_num = workamount - batchsize;
    while (iter_num--) {
        waitforsignal(&signal1);
        consume(buffs[mode2], count);   // placeholder function
        signal(&end1, 1);
        mode2++;
        if (mode2 > batchsize)
            mode2 = 0;
    }
    for (i = 0; i < batchsize; i++) {
        consume(buffs[mode2], count);
        mode2++;
        if (mode2 > batchsize)
            mode2 = 0;
    }
}

8.6.3 Eliminating 64-KByte Aliased Data Accesses

The 64-KByte aliasing condition is discussed in Chapter 3. Memory accesses that satisfy the 64-KByte aliasing condition cause excessive evictions of the first-level data cache. Eliminating 64-KByte aliased data accesses originating from each thread generally helps improve frequency scaling. Furthermore, it enables the first-level data cache to perform efficiently when HT Technology is fully utilized by software applications.

User/Source Coding Rule 33: Minimize data access patterns in each thread that are offset by multiples of 64 KBytes.
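One common way to apply this rule, sketched below with illustrative names and sizes, is to stagger per-thread buffers by a small per-thread pad so that equally sized buffers used by different threads no longer start at addresses offset by exact multiples of 64 KBytes:

#include <stdlib.h>

/* Sketch of avoiding 64-KByte aliasing between per-thread buffers:
   each thread's buffer start is padded by a different multiple of
   128 bytes, so buffers of equal size do not begin at addresses offset
   by exact multiples of 64 KBytes. Sizes and names are assumptions. */
#define BUFF_BYTES (64 * 1024)   /* per-thread buffer, a 64-KByte multiple */
#define PAD_STRIDE 128           /* stagger each thread by 128 bytes */

char *alloc_thread_buffer(int thread_id, void **raw_out)
{
    size_t pad = (size_t)thread_id * PAD_STRIDE;
    char *raw = malloc(BUFF_BYTES + pad);
    if (raw == NULL)
        return NULL;
    *raw_out = raw;              /* keep the raw pointer for free() */
    return raw + pad;            /* staggered start avoids 64-KByte aliasing */
}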

Performance monitoring events on the Pentium 4 processor can be used to detect whether 64-KByte aliased data accesses are present. Appendix B includes an updated list of Pentium 4 processor performance metrics. These metrics are based on events accessed using the Intel VTune Performance Analyzer.

The performance penalty associated with 64-KByte aliasing applies mainly to current processor implementations of HT Technology or the Intel NetBurst microarchitecture. The next section discusses memory optimization techniques that are applicable to multithreaded applications running on processors supporting HT Technology.
