References
A very easy-to-read article explaining direct mapped caches:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
CPU cache: http://en.wikipedia.org/wiki/cpu_cache
http://blog.csdn.net/zqy2000zqy/article/details/1137895
=============================================
General knowledge:
The CPU cache is usually fairly large, e.g. 128KB, and is divided into fixed-size cache lines; a cache line is usually 32 or 64 bytes.
There are at least three kinds of caches inside a CPU:
1) Instruction cache
2) Data cache, which usually has multiple levels
3) TLB, which accelerates virtual-to-physical address translation
Cache entry
A cache entry contains the following parts:
1) Cache line: the block of data copied from main memory in one transfer
2) Tag: records which main-memory address the cache line holds
3) Flag: marks whether the cache line is valid and, for a data cache, whether it is dirty
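As a concrete picture, here is a minimal C sketch of what one cache entry conceptually holds. The field widths and the 64-byte line size are illustrative assumptions, not any real CPU's layout:

    #include <stdint.h>

    #define LINE_SIZE 64  /* assumed cache line size in bytes */

    struct cache_entry {
        uint32_t tag;              /* which main-memory block this line holds */
        uint8_t  valid;            /* flag: does this entry contain usable data? */
        uint8_t  dirty;            /* flag (data cache): modified but not yet written back? */
        uint8_t  data[LINE_SIZE];  /* the cache line: one block copied from main memory */
    };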
How the CPU accesses main memory:
1) The CPU never accesses main memory directly; it always accesses it indirectly through the cache.
2) On every access, the cache is checked to see whether the requested main-memory address is already held in some cache line.
3) If it is not found in the cache (a miss), a cache entry is allocated, the block of main memory is copied into its cache line, and the data is then read from the cache line.
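A toy software model of that read path, assuming a linear search over all lines (the mapping strategies discussed further below narrow this search); all names and sizes here are made up for illustration:

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64
    #define NUM_LINES 128

    static uint8_t main_memory[1 << 20];  /* simulated main memory; addr must stay below this size */

    struct entry { uint32_t tag; int valid; uint8_t data[LINE_SIZE]; };
    static struct entry cache[NUM_LINES];
    static int next_victim;               /* trivial round-robin eviction, just for the sketch */

    uint8_t cache_read(uint32_t addr) {
        uint32_t tag = addr / LINE_SIZE;
        for (int i = 0; i < NUM_LINES; i++)              /* rule 2: search the cache lines */
            if (cache[i].valid && cache[i].tag == tag)
                return cache[i].data[addr % LINE_SIZE];
        struct entry *e = &cache[next_victim];           /* rule 3: miss, allocate an entry... */
        next_victim = (next_victim + 1) % NUM_LINES;
        e->tag = tag;
        e->valid = 1;
        memcpy(e->data, &main_memory[tag * LINE_SIZE], LINE_SIZE);  /* ...copy the block in */
        return e->data[addr % LINE_SIZE];                /* rule 1: the read goes through the cache */
    }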
Cache entries are limited in number, so a suitable eviction policy is needed;
the LRU (least recently used) policy is generally used (a sketch of the idea follows below).
Marking certain main-memory regions as non-cacheable can improve the cache hit rate by keeping data that will not be reused out of the cache.
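The idea behind LRU eviction: track when each line was last touched and evict the oldest. Real hardware uses cheap approximations (e.g. pseudo-LRU bits); the timestamp scan below is purely illustrative:

    #include <stdint.h>

    #define NUM_LINES 128

    static uint64_t last_used[NUM_LINES];  /* per-line logical timestamp, set on every access */
    static uint64_t clock_ticks;           /* logical clock, incremented on every access */

    void touch(int line) { last_used[line] = ++clock_ticks; }

    /* Choose the least recently used line as the eviction victim. */
    int lru_victim(void) {
        int victim = 0;
        for (int i = 1; i < NUM_LINES; i++)
            if (last_used[i] < last_used[victim])
                victim = i;
        return victim;
    }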
Write-back policies
After data in the cache is updated, it must eventually be written back to main memory. There are several possible times to do so:
1) Write back on every update (a write-through cache).
2) Do not write back immediately after an update; mark the line as dirty, and write back only when the cache entry is evicted.
3) After an update, put the cache entry on a write-back queue, and write back several entries in one batch once the queue has collected them.
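A minimal sketch contrasting policies 1) and 2), reusing a simulated main memory; the dirty flag defers the write-back until eviction (policy 3 would collect evicted entries on a queue instead of writing each one immediately):

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64
    static uint8_t main_memory[1 << 20];  /* simulated main memory */

    struct entry { uint32_t tag; int dirty; uint8_t data[LINE_SIZE]; };

    /* 1) Write-through: every store updates the line and main memory. */
    void store_write_through(struct entry *e, int off, uint8_t v) {
        e->data[off] = v;
        main_memory[e->tag * LINE_SIZE + off] = v;
    }

    /* 2) Write-back: the store only marks the line dirty... */
    void store_write_back(struct entry *e, int off, uint8_t v) {
        e->data[off] = v;
        e->dirty = 1;
    }

    /* ...and main memory is updated only when the entry is evicted. */
    void evict(struct entry *e) {
        if (e->dirty)
            memcpy(&main_memory[e->tag * LINE_SIZE], e->data, LINE_SIZE);
        e->dirty = 0;
    }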
Cache consistency issues
There are two scenarios in which data in the cache may become stale:
1) DMA: some other device updates main memory directly.
2) SMP: the same cache line is held in the caches of multiple CPUs, and one of the CPUs updates it.
CPU stall
On a cache miss (especially a read miss), the CPU has nothing to do while it waits for the data to be read from memory into the cache.
There are ways to mitigate this problem:
1) Hyper-threading. At the hardware level, one physical CPU presents itself as two logical CPUs, which from above look like two CPUs executing two threads concurrently. When one thread is waiting on a cache miss, the other thread can keep executing.
Which cache line should a given main-memory address be mapped into? (Key term: associativity)
This depends on the mapping strategy:
1) The most naive: an address can be mapped into any cache line (fully associative).
The problem is that finding out whether an address is already in the cache requires walking through every cache line, which is unacceptable.
It is like a parking lot where anyone may park in any space: parking is easy, but finding your car afterwards means searching space by space.
Think about it: with 128 cache lines, the CPU may need up to 128 tag comparisons on every single access (unless the hardware performs them all in parallel). How slow would the CPU be if it had to scan all the cache lines just to know whether an address is cached?
2) Direct mapped cache (equivalent to 1-way associative)
This works like a hash: the cache line that each address maps to is fixed.
Everyone's parking space is fixed and pre-allocated, so a car can be found directly.
The disadvantage is that, because there are many more cars than parking spaces, several people are likely to contend for the same space, causing frequent cache evictions; and frequently re-reading data from main memory into the cache is expensive.
Because the number of cache lines is a power of 2, the hash algorithm is very simple: no modulo operation is needed, just take certain bits straight out of the memory address. For example, if there are 128 (2^7) cache lines and each cache line is 32 (2^5) bytes,
then bits 0~4 of a 32-bit address are the byte offset within the cache line, bits 5~11 are the cache line index, and the remaining bits 12~31 are the tag of the cache line. When another address maps to the same cache line, the tag is compared to decide whether the two are really the same address; after all, very many memory locations correspond to the same cache line.
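A minimal runnable sketch of that bit extraction, using exactly the parameters above (128 lines of 32 bytes, 32-bit addresses); the sample address is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32   /* 2^5 bytes per line -> offset is bits 0~4  */
    #define NUM_LINES 128  /* 2^7 lines          -> index is bits 5~11 */

    int main(void) {
        uint32_t addr   = 0x12345678;                      /* arbitrary example address */
        uint32_t offset = addr & (LINE_SIZE - 1);          /* bits 0~4: byte within the line */
        uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;  /* bits 5~11: which cache line */
        uint32_t tag    = addr >> 12;                      /* bits 12~31: distinguishes addresses
                                                              that share this cache line */
        printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
        return 0;
    }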
3) 2-way associative
This is a trade-off between fully associative and direct mapped.
With 2-way, each person has two possible parking spaces, so when one space is occupied there is still a chance the other is free. And although there are many people, not many of them are looking for a space at the same time (many people's cars are out on the road, not parked).
Therefore, a 2-way associative cache behaves roughly like a direct-mapped cache of twice the size.
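A minimal sketch of the lookup for this case, keeping the 128-line/32-byte parameters from above but grouping the lines into 64 sets of 2 ways (hypothetical model code, not any real CPU's logic):

    #include <stdint.h>
    #include <stddef.h>

    #define LINE_SIZE 32
    #define NUM_SETS  64   /* 128 cache lines grouped into 64 sets of 2 ways */

    struct entry { uint32_t tag; int valid; uint8_t data[LINE_SIZE]; };
    static struct entry cache[NUM_SETS][2];

    /* Returns the matching entry, or NULL on a miss. Only the two ways of
       one set are checked: more than direct mapped (one line), far fewer
       than fully associative (all lines). */
    struct entry *lookup(uint32_t addr) {
        uint32_t set = (addr / LINE_SIZE) % NUM_SETS;
        uint32_t tag = addr / LINE_SIZE / NUM_SETS;
        for (int way = 0; way < 2; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return &cache[set][way];
        return NULL;  /* miss: the caller picks a victim way, e.g. by LRU */
    }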
Note that the figure in the referenced article (not reproduced here) only counts the cache miss rate, and on that measure fully associative clearly does best. But for a fully associative cache, the cost of determining whether an address is in the cache is very high. Therefore, production hardware generally uses 2-way associative caches.
=============================================
Avoiding and identifying false sharing among threads: how to handle variables shared between threads, mainly addressing the problem of cache lines being flushed frequently in SMP environments:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/
Example:
The following code suffers from frequent cache-line invalidation (false sharing) in an SMP environment (snippet adapted from the Intel article; x, y and N are defined elsewhere):

    double sum = 0.0, sum_local[NUM_THREADS];
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me] = 0.0;

        #pragma omp for
        for (int i = 0; i < N; i++)
            sum_local[me] += x[i] * y[i];

        #pragma omp atomic
        sum += sum_local[me];
    }
Because sum_local is an array shared by multiple threads, and the elements the threads access sit very close together in memory (typically within one cache line), an update by one thread invalidates that cache line in the other CPUs' caches.
Ways to solve this problem:
1) Minimize access to shared variables across threads; use thread-local variables as much as possible.
2) If shared data must be accessed, align and pad the region each thread touches to cache-line boundaries (see the sketch after this list).
3) Keep frequently updated data separate from rarely updated data.
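A sketch of fix 2) applied to the example above: pad each thread's slot of sum_local out to a full cache line so that no two threads' partial sums share one. The 64-byte line size and the NUM_THREADS and N values are assumptions for illustration; this padding approach is one common workaround, adapted here rather than copied verbatim from the article:

    #include <omp.h>

    #define CACHE_LINE_SIZE 64  /* assumed; check the target CPU */
    #define NUM_THREADS 4
    #define N 100000

    /* Each partial sum sits a full cache line away from the next, so
       two threads' updates can no longer land in the same cache line. */
    struct padded_sum {
        double value;
        char pad[CACHE_LINE_SIZE - sizeof(double)];
    };

    double dot(const double *x, const double *y) {
        struct padded_sum sum_local[NUM_THREADS];
        double sum = 0.0;

        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int me = omp_get_thread_num();
            sum_local[me].value = 0.0;

            #pragma omp for
            for (int i = 0; i < N; i++)
                sum_local[me].value += x[i] * y[i];

            #pragma omp atomic
            sum += sum_local[me].value;
        }
        return sum;
    }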