CPU Cache Line Principles

Source: Internet
Author: User
General knowledge
A CPU cache is typically on the order of tens to hundreds of KB (for example, 128 KB) and is divided into a number of fixed-size cache lines, each usually 32 or 64 bytes.

A CPU contains at least three kinds of cache:
1) Instruction cache
2) Data cache, usually with multiple levels
3) TLB, which accelerates virtual-to-physical address translation


Cache entry
Each cache entry contains the following parts:
1) Cache line: the block of data copied from main memory in one transfer
2) Tag: identifies which main-memory address the cache line corresponds to
3) Flag: marks whether the cache line is valid and, for a data cache, whether it is dirty
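As a rough illustration, the three parts of a cache entry can be sketched as a C struct. The 64-byte line size and the field layout are assumptions for illustration only; real hardware stores this metadata in SRAM arrays, not C structs.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed cache line size */

/* Sketch of one cache entry as described above. */
struct cache_entry {
    uint32_t tag;              /* which main-memory block this line holds   */
    bool     valid;            /* flag: does this line hold usable data?    */
    bool     dirty;            /* flag (data cache only): modified but not  */
                               /* yet written back to main memory           */
    uint8_t  line[LINE_SIZE];  /* copy of LINE_SIZE bytes of main memory    */
};
```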


Rules for CPU access to main memory
1) The CPU never accesses main memory directly; all access goes through the cache.
2) Every memory access first searches the cache to see whether the address is already held in some cache line.
3) On a miss, a new cache entry is allocated, the data is copied from main memory into its cache line, and the read is then served from the cache line.


Cache entries are limited in number, so a suitable eviction (replacement) policy is needed.
The LRU policy is generally used.
Marking some main-memory regions as non-cacheable can raise the cache hit rate by keeping useless data out of the cache.
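The LRU policy mentioned above can be sketched with a tiny tag cache. The size and the array-based bookkeeping are purely illustrative; real hardware usually implements cheap approximations of LRU rather than exact ordering.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ENTRIES 4  /* tiny cache for illustration */

/* Illustrative LRU: tags[0] is most recently used, tags[used-1] is least. */
struct lru_cache {
    uint32_t tags[NUM_ENTRIES];
    int      used;  /* number of valid entries */
};

/* Returns 1 on hit, 0 on miss (installing the tag, evicting LRU if full). */
int lru_access(struct lru_cache *c, uint32_t tag) {
    for (int i = 0; i < c->used; i++) {
        if (c->tags[i] == tag) {
            /* Hit: promote this tag to the front. */
            memmove(&c->tags[1], &c->tags[0], i * sizeof(uint32_t));
            c->tags[0] = tag;
            return 1;
        }
    }
    /* Miss: grow if there is room, otherwise the shift drops the last
       (least recently used) entry. */
    if (c->used < NUM_ENTRIES)
        c->used++;
    memmove(&c->tags[1], &c->tags[0], (c->used - 1) * sizeof(uint32_t));
    c->tags[0] = tag;
    return 0;
}
```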


Write-back policy
When data in the cache is updated, it must eventually be written back to main memory. The timing varies:
1) Write back on every update: a write-through cache.
2) Do not write back on update; mark the entry dirty and write back only when the cache entry is evicted.
3) After an update, put the cache entry into a write-back queue, and write entries back to main memory in batches once several have accumulated.


Cache coherence problem
Two situations can make data in the cache stale:
1) DMA: another device updates main memory directly.
2) SMP: the same cache line is present in the caches of multiple CPUs, and one of the CPUs updates it.


CPU stall
When a cache miss occurs (especially a read miss), the CPU can do nothing while it waits for the data to be read from memory into the cache.
One way to mitigate this:
1) Hyper-Threading. At the hardware level, one physical CPU presents itself as two logical CPUs, so from above it looks like two CPUs executing two threads concurrently. When one thread stalls on a cache miss, the other thread can keep executing.


Which cache line should a given main-memory address map into? (term: associativity)
This depends on the mapping strategy.


1) The simplest: an address can map into any cache line (fully associative).
The problem is that looking up an address already in the cache means searching every cache line, which is unacceptably slow.
It is like a parking lot where anyone can park anywhere: parking is easy, but finding your car means walking past every space.
Think about it: for the CPU to know whether an address is already cached, it would have to check every cache line one by one. How slow would that be?


2) Direct-mapped cache (equivalent to 1-way associative)
This is essentially hashing: each address maps to exactly one fixed cache line.
Everyone has an assigned, well-distributed parking space, so a car can be found directly.
The disadvantage: because there are far fewer cache lines than memory addresses, several addresses may compete for the same line, causing frequent evictions and frequent reloads from main memory, which is also costly.
Because the number of cache lines is a power of two, the hash is trivial: no modulo is needed; a few bits are simply taken from the memory address. For example, with 128 (2^7) cache lines of 32 (2^5) bytes each,
a 32-bit address splits as follows: bits 0-4 are the offset within the cache line, bits 5-11 are the cache line index, and the remaining bits 12-31 are the tag. When another address maps to the same cache line, the tag is compared to tell whether it is the same address; after all, many memory locations map to the same cache line.
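The bit slicing in this example can be written out directly. The geometry is as stated in the text (128 lines of 32 bytes, 32-bit addresses); the function names are illustrative.

```c
#include <stdint.h>

/* Geometry from the text: 128 lines (2^7) of 32 bytes (2^5) each. */
#define LINE_SIZE  32u
#define NUM_LINES 128u

uint32_t line_offset(uint32_t addr) { return addr & (LINE_SIZE - 1); }        /* bits 0-4   */
uint32_t line_index(uint32_t addr)  { return (addr >> 5) & (NUM_LINES - 1); } /* bits 5-11  */
uint32_t line_tag(uint32_t addr)    { return addr >> 12; }                    /* bits 12-31 */
```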


3) 2-way set associative
A compromise between fully associative and direct-mapped.
With 2 ways, each car has two possible parking spaces, so when one space is occupied there is a chance the other is free. Even though there are many cars, few drivers are looking for a space at the same moment (most cars are out on the road and have not come back yet).
Thus a 2-way set-associative cache performs approximately like a direct-mapped cache of twice the size.
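A sketch of the lookup in a 2-way set-associative cache: the index selects one set, and only the two ways of that set need their tags compared. The set count and struct layout are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64  /* assumed: 128 lines / 2 ways */

struct way { uint32_t tag; bool valid; };
struct set { struct way ways[2]; };

/* Only two tag comparisons per lookup, instead of searching every line. */
bool lookup(const struct set sets[], uint32_t index, uint32_t tag) {
    const struct set *s = &sets[index % NUM_SETS];
    return (s->ways[0].valid && s->ways[0].tag == tag) ||
           (s->ways[1].valid && s->ways[1].tag == tag);
}
```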

Note that the chart (omitted here) only compares cache miss rates, and on that measure fully associative clearly does best. But deciding whether an address is in a fully associative cache is very expensive, so production caches are generally set associative, e.g. 2-way.
Avoiding and identifying false sharing among threads
This mainly addresses the problem of cache lines being frequently invalidated when multiple threads share variables in an SMP environment. See:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/

Example: the following code suffers frequent cache-line invalidation in an SMP environment.
double sum = 0.0, sum_local[num_threads];
#pragma omp parallel num_threads(num_threads)
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (int i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];

    #pragma omp atomic
    sum += sum_local[me];
}
Because sum_local is an array shared by all threads, and each thread's element lies right next to its neighbours' elements, they are likely to sit on the same cache line. When one thread updates its element, that cache line is invalidated in the other CPUs' caches. The ways to solve this problem are:
1) Share as few variables as possible between threads; prefer thread-local variables.
2) If shared access is unavoidable, align each thread's region to a cache-line boundary.
3) Keep frequently updated data and rarely updated data in separate storage.
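Point 2 above can be sketched by padding each thread's accumulator out to its own cache line, using C11 alignas. The 64-byte line size is an assumption (common on x86); the struct name is illustrative.

```c
#include <stdalign.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Each thread's accumulator occupies its own cache line, so one thread's
   update no longer invalidates the line holding its neighbours' slots. */
struct padded_sum {
    alignas(CACHE_LINE) double value;
};

struct padded_sum sum_local[8];  /* one slot per thread, one line per slot */
```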
