Analysis of the TLB Mechanism under Linux x86


TLB: Translation Lookaside Buffer

Also known as the "fast table", the TLB can be understood as a page-table cache: a buffer of recent address translations.

Since page tables are stored in main memory, every memory reference by a program requires at least two memory accesses: one to read the page table entry and obtain the physical address, and a second to fetch the data itself. The key to improving access performance is the locality of page-table references: once a virtual page number has been translated, it is likely to be used again in the near future.

The TLB is a cache used by the memory-management hardware to speed up the translation of virtual addresses to physical addresses. All current desktop, notebook, and server processors use a TLB to map virtual addresses to physical addresses. With a TLB, the kernel can resolve a virtual address to a physical address quickly, without going out to RAM to fetch the mapping. In this respect the TLB is very similar to the data and instruction caches.

TLB Principle

When the CPU accesses a virtual (linear) address, it first looks up the TLB using the high 20 bits of the virtual address (20 is specific to x86 with 4 KiB pages; other architectures use different widths). If no matching entry is found, this is called a TLB miss, and the corresponding physical address must be computed by walking the page table in (slow) RAM. The resulting translation is then stored in a TLB entry, so that a later access to the same linear address obtains the physical address directly from the TLB; this is called a TLB hit.
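As a minimal illustration (hypothetical helper names, not kernel code; assuming 4 KiB pages), the address split looks like this:

#include <stdint.h>

/* Illustrative sketch: with 4 KiB pages on x86_32, the high 20 bits of
 * a linear address form the virtual page number, which is the TLB
 * lookup key; the low 12 bits are the in-page offset and pass through
 * translation unchanged. */
static inline uint32_t vpn_of(uint32_t vaddr)    { return vaddr >> 12; }
static inline uint32_t offset_of(uint32_t vaddr) { return vaddr & 0xFFF; }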

Imagine x86_32 without a TLB: to access a linear address, the CPU first obtains the page table address from the PGD entry (first memory access), then obtains the page frame address from the PTE (second memory access), and finally accesses the physical address itself, for a total of 3 RAM accesses. If a TLB is present and hits, only one RAM access is needed.
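A sketch of that two-level walk (hypothetical function, assuming the page tables are reachable at their physical addresses and ignoring present bits and permission flags; bits 31-22 index the page directory, bits 21-12 the page table, bits 11-0 the page offset):

#include <stdint.h>

/* Illustrative only: walks an x86_32-style two-level page table. */
uint32_t walk_two_level(const uint32_t *pgd, uint32_t vaddr)
{
    uint32_t pde = pgd[(vaddr >> 22) & 0x3FF];           /* 1st RAM access */
    const uint32_t *pt = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = pt[(vaddr >> 12) & 0x3FF];            /* 2nd RAM access */
    /* Fetching the data at the returned address is the 3rd RAM access. */
    return (pte & ~0xFFFu) | (vaddr & 0xFFF);
}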


TLB Table Entries

The basic unit of the TLB is an entry corresponding to a page table entry stored in RAM. Page table entries have a fixed size, so the larger the TLB, the more page table entries it can hold and the more likely a TLB hit becomes. But TLB capacity is limited, so TLB entries cannot correspond one-to-one with the page table entries in RAM. When the CPU receives a linear address, it must therefore make two quick judgments:

1. Whether the required page table entry is already cached in the TLB (TLB hit or TLB miss)

2. Which TLB entry holds the required page table entry

To minimize the time the CPU spends on these judgments, the correspondence between TLB entries and the page table entries in memory must be organized carefully. There are three common organizations:

Fully Associative

In this organization there is no fixed relationship between a TLB entry and a linear address: any TLB entry can cache the page table entry of any linear address. This gives the best utilization of TLB entry space, but the lookup latency can also be large: on each CPU request the TLB hardware must compare the linear address against the TLB entries one by one until it hits or all entries have been checked. As TLBs grow, more and more entries must be compared, so this organization is suitable only for small-capacity TLBs.
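A sketch of a fully associative lookup (illustrative structures, not hardware or kernel code; real hardware performs these comparisons in parallel, which is exactly why large fully associative TLBs are expensive):

#include <stdint.h>
#include <stdbool.h>

struct fa_entry { uint32_t vpn, pfn; bool valid; };

/* Any VPN may occupy any entry, so every valid entry must be compared. */
bool fa_lookup(const struct fa_entry *tlb, int nentries,
               uint32_t vpn, uint32_t *pfn)
{
    for (int i = 0; i < nentries; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;   /* TLB hit */
        }
    }
    return false;          /* TLB miss */
}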

Direct Mapped

Each linear address block corresponds to a unique TLB entry via a modulo operation, so only one comparison is needed, which minimizes lookup latency. However, the probability of conflicts is high, causing frequent TLB misses and lowering the hit rate.

For example, assume the TLB holds 16 entries and the CPU accesses the following linear address blocks in order: 1, 17, 1, 33. When the CPU accesses block 1, 1 mod 16 = 1, so the TLB checks whether its entry 1 holds block 1; if so it hits, otherwise the entry is loaded from RAM. The CPU then accesses block 17; 17 mod 16 = 1, the TLB finds that entry 1 does not correspond to block 17, a TLB miss occurs, and the TLB loads the page table entry for block 17 from RAM into entry 1. The CPU next accesses block 1 again and misses once more, so the TLB must go back to RAM to reload the entry for block 1. Under certain access patterns, therefore, the performance of direct mapping is extremely poor, as the sketch below demonstrates.
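A runnable sketch of that thrashing pattern (hypothetical 16-entry direct-mapped TLB; every one of the four accesses misses because blocks 1, 17, and 33 all collide on slot 1):

#include <stdio.h>
#include <stdint.h>

#define NENTRIES 16

int main(void)
{
    uint32_t slot_tag[NENTRIES] = {0};   /* 0 = empty slot           */
    uint32_t pages[] = {1, 17, 1, 33};   /* accessed address blocks  */

    for (int i = 0; i < 4; i++) {
        uint32_t vpn  = pages[i];
        uint32_t slot = vpn % NENTRIES;  /* direct-mapped index      */
        if (slot_tag[slot] == vpn)
            printf("block %2u -> slot %u: hit\n", vpn, slot);
        else {
            printf("block %2u -> slot %u: miss (reload from RAM)\n",
                   vpn, slot);
            slot_tag[slot] = vpn;        /* evict whatever was there */
        }
    }
    return 0;
}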

Set Associative

Set-associative organization is a compromise between the low lookup efficiency of full associativity and the high conflict rate of direct mapping. All TLB entries are divided into sets, and each linear address block maps not to a single TLB entry but to a set of entries. On an address translation, the CPU first computes which set the linear address block corresponds to, then compares the entries within that set in turn. Depending on the number of entries per set, we speak of 2-way, 4-way, or 8-way set associativity.

Long-term engineering practice has found 8 ways to be a performance demarcation point: an 8-way set-associative TLB achieves almost the same hit rate as a fully associative one, while beyond 8 ways the added intra-set comparison latency outweighs the benefit of the slightly higher hit rate.
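A sketch of a set-associative lookup (illustrative 16-set, 4-way layout; structure names are hypothetical): the modulo picks the set, and only the ways within that set are compared.

#include <stdint.h>
#include <stdbool.h>

#define NSETS 16
#define NWAYS 4

struct sa_entry { uint32_t vpn, pfn; bool valid; };
struct sa_entry tlb[NSETS][NWAYS];

bool sa_lookup(uint32_t vpn, uint32_t *pfn)
{
    struct sa_entry *set = tlb[vpn % NSETS];   /* pick the set       */
    for (int way = 0; way < NWAYS; way++) {    /* compare within it  */
        if (set[way].valid && set[way].vpn == vpn) {
            *pfn = set[way].pfn;
            return true;   /* TLB hit */
        }
    }
    return false;          /* TLB miss */
}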

All three organizations have advantages and disadvantages; set associativity is the compromise that suits most application environments. For particular domains, of course, other cache organizations can be used.

TLB Table Entry Updates

TLB entries can be updated automatically by the TLB hardware, or actively by software:

1. After a TLB miss, the CPU fetches the page table entry from RAM and the TLB entry is updated automatically.

2. In some cases the entries in the TLB become invalid, for example on a process switch or when the kernel page tables change. The CPU hardware cannot know which TLB entries have become invalid, so in these scenarios they can only be flushed by software.

At the software level, the Linux kernel provides a rich set of TLB flush methods (a sketch of the common software interface follows the list below), but different architectures offer different hardware interfaces. For example, x86_32 provides only two hardware interfaces for flushing TLB entries:

1. Writing a value to the CR3 register causes the processor to automatically flush the TLB entries for all non-global pages.

2. Starting with the Pentium Pro, the INVLPG instruction invalidates a single TLB entry for a specified linear address.
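On top of these two hardware primitives, the architecture-independent flush interface that the kernel asks each architecture to implement includes entry points like the following (signatures as found in kernels of this era, reproduced from memory, so treat them as approximate):

struct mm_struct;
struct vm_area_struct;

/* Flush everything, on all CPUs. */
void flush_tlb_all(void);
/* Flush all entries belonging to one address space. */
void flush_tlb_mm(struct mm_struct *mm);
/* Flush the entry for a single user page. */
void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr);
/* Flush the entries for a range of user pages. */
void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end);
/* Flush the entries for a range of kernel pages. */
void flush_tlb_kernel_range(unsigned long start, unsigned long end);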

TLB Refresh Mechanism

With the MMU enabled, translating a linear address to a physical address requires a page table lookup; doing this walk on every access would clearly hurt system performance, so the TLB caches previous lookup results. Because of its limited capacity, the TLB obviously cannot hold all linear-to-physical translations; when all entries in the TLB are full, the processor decides which old entry to replace.

The TLB is essentially a cache, so it also has a cache-consistency problem: if the operating system modifies a page table entry whose mapping is stored in the TLB, the TLB becomes inconsistent with memory. Unlike the consistency between the system's physical memory and the processor caches, which x86 maintains in hardware, TLB consistency must be resolved by system software. x86 provides two ways to resolve TLB consistency issues:

1. Updating CR3. Reloading the CR3 register invalidates the entire TLB (except global entries), so the OS can flush the whole TLB with an instruction sequence such as: mov eax, cr3; mov cr3, eax. This is also why CR3 is reloaded on a Linux process switch: the page table entries of the old and new processes differ, so the TLB must be invalidated to prevent inconsistency.

2. x86 provides the INVLPG instruction, a privileged instruction the operating system can use to invalidate a single TLB entry. For details on this instruction, refer to the Intel IA-32 Architecture Software Developer's Manual. An example of its use in the Linux kernel:
<arch/x86/mm/pgtable.c>
void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
{
    ...
    /*
     * It's enough to flush this one mapping.
     * (PGE mappings get flushed as well)
     */
    __flush_tlb_one(vaddr);
}

The set_pte_vaddr function is used by the kernel to manipulate page table entries directly, which touches on how x86 maps linear addresses to physical addresses; at the end it calls __flush_tlb_one to flush the corresponding TLB entry:
<arch/x86/mm/pgtable_32.c>
static inline void __flush_tlb_one(unsigned long addr)
{
    if (cpu_has_invlpg)
        __flush_tlb_single(addr);
    else
        __flush_tlb();
}
From this implementation we can see that if the CPU supports the INVLPG instruction, __flush_tlb_single is called to flush the TLB entry specified by the virtual (linear) address addr. Its native implementation, __native_flush_tlb_single, is:

static inline void __native_flush_tlb_single(unsigned long addr)
{
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
The above is a very intuitive example of the INVLPG instruction in use. In __flush_tlb_one we can also see the call to __flush_tlb, which flushes the entire TLB; this is in fact the CR3 rewrite via MOV mentioned earlier in this article.
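In kernels of the same vintage, that whole-TLB flush ultimately reduces to rewriting CR3 with its current value, roughly as follows (approximate; the exact helper names vary across kernel versions):

<arch/x86/include/asm/tlbflush.h>
static inline void __native_flush_tlb(void)
{
    /* Reloading CR3 with its own value invalidates all
     * non-global TLB entries. */
    native_write_cr3(native_read_cr3());
}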

With paging enabled, when the CPU accesses a linear address it first looks in the TLB. If the translation for that linear address is already present, the physical address is obtained directly (the TLB in effect stores VPN-to-PPN mappings), and the page directory and page tables need not be consulted. If the mapping of the current linear address is not yet in the TLB, a TLB miss occurs: the page directory and page table entries must be walked, and the result is recorded in the TLB.
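Putting the pieces together, the overall flow can be sketched like this (hypothetical helpers from the earlier sketches, not kernel APIs):

#include <stdint.h>
#include <stdbool.h>

/* Illustrative helpers, assumed to exist as sketched above. */
bool     tlb_lookup(uint32_t vaddr, uint32_t *paddr);  /* hit/miss       */
uint32_t page_walk(uint32_t vaddr);                    /* 2 RAM accesses */
void     tlb_fill(uint32_t vpn, uint32_t pfn);         /* record result  */

uint32_t translate(uint32_t vaddr)
{
    uint32_t paddr;
    if (tlb_lookup(vaddr, &paddr))
        return paddr;                     /* TLB hit: no page walk     */
    paddr = page_walk(vaddr);             /* TLB miss: walk the tables */
    tlb_fill(vaddr >> 12, paddr >> 12);   /* cache VPN -> PFN mapping  */
    return paddr;
}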
