Linux process switching (2): TLB handling


One. Preface

Process switching is a complex process, and this article does not try to describe every aspect of it; instead it focuses on one small piece of the puzzle: how the TLB is handled. To explain the problem clearly, chapter two describes the TLB-related details in a single-core scenario, and chapter three extends this to the multi-core scenario, which concludes the theoretical part. In chapters two and three we reason from basic principles without tying ourselves to a specific CPU or OS; the reader is expected to understand the basic organisation of a TLB (see the "TLB operation" article on this site). Good logic also has to be reflected in the actual HW and SW design, so chapter four walks through the TLB handling code of the Linux 4.4.6 kernel on the ARM64 platform (with some x86 code brought in when describing TLB lazy mode), in the hope that concrete code and real CPU hardware behaviour will deepen our understanding of the principles.

Two. How it works in a single-core scenario

1. Block Diagram

Let's first look at the logic blocks related to process switching in a single-core scenario:

A number of user-space processes and kernel threads run on the CPU, and to make them run faster, hardware blocks such as the TLB and the cache are added: to speed up access to data and instructions in main memory, the cache keeps copies of memory contents, and the Translation Lookaside Buffer (TLB) caches part of the page table contents so that address translation does not have to walk the page tables in main memory.

Without any special handling, when process A switches to process B, the TLB and cache contain data belonging to both A and B. For kernel space this does not matter, because kernel space is shared by all processes; but A and B each have their own independent user address space, i.e. the same virtual address X translates to PA in A's address space and to PB in B's address space. If translations for both A and B coexist in the TLB, stale entries from A's address space would corrupt address translation for process B, so at process-switch time the TLB must be operated on to clear the effects of the old process. We discuss the options one by one below.

2. A scheme that is absolutely correct, but performs poorly

When the system switches from process A to process B, the address space also switches from A to B. We can assume that while a process runs, all TLB and cache contents belong to that process; once we switch to B, the entire address space is different, so everything needs to be flushed. (Note: I use the Linux kernel term here; flush means marking TLB or cache entries invalid. Embedded engineers on the ARM platform are usually more used to the term invalidate, but in this article flush equals invalidate.)

This is certainly correct: when process B is switched in, the CPU it sees is a clean, fresh hardware environment, and the TLB and cache contain no residual data from process A that could affect B's execution. The small pity is that the TLB and cache are cold (empty) when B starts to run, so B initially suffers heavy TLB misses and cache misses, which degrades performance.

3. How can TLB performance be improved?

Optimising a module usually requires a more detailed analysis and classification of its characteristics. In the previous section we used the term process address space; it can in fact be further divided into the kernel address space and the user address space. For all processes (including kernel threads) the kernel address space is the same, so no matter how processes switch, the mapping from kernel addresses to physical addresses never changes. When process A switches to process B, the corresponding TLB entries (the orange blocks in the figure) do not need to be flushed, because process B can keep using them. The user address space, on the other hand, is private to each process: when A switches to B, the TLB entries related to A (the cyan blocks in the figure) are completely meaningless to B and must be flushed.

Guided by this idea, we need to distinguish between two kinds of translations: global and local (that is, process-specific). Page table descriptors therefore usually have a bit that marks a translation as global or local, and the TLB caches this global/local flag together with the translation. With such a design we can either flush everything, or flush only the local TLB entries, depending on the scenario.
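On ARM64, for example, this flag is the nG ("not Global") bit of the stage-1 page table descriptor. As a rough illustration (abridged from the Linux ARM64 headers; the exact composition of the PROT_*/PAGE_* macros is simplified here), user mappings carry the bit and kernel mappings do not:

/* arch/arm64/include/asm/pgtable-hwdef.h: bit 11 of a stage-1 descriptor */
#define PTE_NG		(_AT(pteval_t, 1) << 11)	/* nG: not Global, i.e. process-local */

/* arch/arm64/include/asm/pgtable.h (simplified): user mappings set nG,
 * so their TLB entries are treated as local/per-process ... */
#define PAGE_SHARED	__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
/* ... while kernel mappings leave nG clear and the TLB treats them as global. */
#define PAGE_KERNEL	__pgprot(_PAGE_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE)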

4. Special cases to consider

Consider the following scenario: process A switches to kernel thread K. There is no need to switch the address space at all; thread K only accesses kernel-space addresses, which are shared with process A anyway. Since the address space is not switched, there is no need to flush the process-specific TLB entries either, and when we later switch from K back to A, all the TLB contents are still valid, which greatly reduces A's TLB misses. Similarly, in a multi-threaded environment a switch may happen between two threads of the same process; the threads share one address space, so no TLB flush is needed at all.

5. Further improving TLB performance

Can TLB performance be improved even further? Is it possible not to flush the TLB at all?

Yes, but it requires the TLB hardware to be able to tell which process a process-specific TLB entry belongs to; in other words, the TLB block has to be aware of the address spaces of individual processes. To make this work we need a way to identify different address spaces, and this is where the term ASID (Address Space ID) comes in. Originally a TLB lookup was keyed only on the virtual address VA; with ASID support, the TLB hit criterion becomes (virtual address + ASID), where each process is assigned its own ASID identifying its address space. How does the TLB block know the ASID of a TLB entry? It usually comes from a CPU system register (on the ARM64 platform, from the TTBRx_EL1 register), so when the TLB block caches a (VA, PA, global flag) tuple, it also caches the current ASID into the entry; a TLB entry thus contains (VA, PA, global flag, ASID).

With ASID support, switching from process A to process B no longer requires a TLB flush: the entries left in the TLB from A's address space cannot affect B, because even if A and B use the same VA, the ASID lets the hardware distinguish the two address spaces.
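To make the hit rule concrete, here is a minimal software model of a TLB entry and its lookup test. This is purely illustrative C (the structure and field names are invented for this sketch; real hardware performs the comparison in parallel across all entries):

#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
	uint64_t vpn;		/* virtual page number */
	uint64_t pfn;		/* physical frame number it maps to */
	uint16_t asid;		/* address space ID of the owning process */
	bool     global;	/* kernel mapping, shared by every address space */
	bool     valid;
};

/* An entry hits when the VPN matches and the entry is either global
 * (kernel space) or tagged with the ASID of the current process. */
static bool tlb_hit(const struct tlb_entry *e, uint64_t vpn, uint16_t cur_asid)
{
	return e->valid && e->vpn == vpn &&
	       (e->global || e->asid == cur_asid);
}

With this rule, entries left behind by process A simply never hit while process B (with a different ASID) is running, which is why no flush is needed at switch time.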

Three. TLB operations in multi-core scenarios

1. Block Diagram

Having finished the analysis of the single-core scenario, let's look at multi-core. The logic blocks related to TLB handling at process switch are as follows:

In a multi-core system the TLB operations at process switch are more complicated, mainly for two reasons. First, each CPU core has its own TLB, so TLB operations fall into two categories: flush all, which flushes the TLBs of all CPU cores, and flush local, which flushes only the TLB of the current CPU core. Second, a process can be scheduled onto any CPU core (subject, of course, to its CPU affinity settings), so a task tends to leave its traces everywhere: residual TLB entries on every CPU it has run on.

2. The basic approach to TLB operations

From the previous chapter we know that address translations come in global (shared by all processes) and local (process-specific) flavours, so TLB entries are also divided into global and local. If we do not distinguish the two, then at process switch we simply flush everything left on the current CPU: when process A is switched out it leaves a clean TLB for the next process B, and when A is later scheduled again on another CPU it faces a completely empty TLB (the TLBs of the other CPUs are not affected). If we do distinguish global from local, the operation is basically the same, except that at process switch we do not flush all TLB entries on the CPU, only all the local ones.

Local TLB entries can be subdivided further, which is where the concept of ASID (Address Space ID) or PCID (Process Context ID) comes in (global TLB entries do not carry an ASID). With ASID (or PCID) support, the TLB handling becomes simpler: we basically do not need to touch the TLB at all, because TLB lookups can distinguish between task contexts, so the entries left in each CPU's TLB cannot affect the execution of other tasks. On a single-core system this gives very good performance. For example, in an A--->B--->A scenario, if the TLB is large enough to hold the entries of both tasks (modern CPUs generally can manage this), then when A is switched back in, its TLB is still hot, which greatly improves performance.

For multi-core systems, however, this approach has a problem: the infamous TLB shootdown and the performance cost it brings. In a multi-core system, if the CPUs support PCID and the TLB is not flushed at process switch, then each CPU's TLB accumulates entries belonging to all kinds of tasks. When a process is destroyed on some CPU, or modifies its own page tables (i.e. changes a VA-to-PA mapping), the TLB entries associated with that task must be removed from the whole system. This means flushing not only the relevant entries on the local CPU, but also shooting down the task-related TLB remnants on the other CPUs. This is usually implemented via IPIs (e.g. on x86), which introduces overhead; in addition, allocating and managing PCIDs has its own cost. Whether an OS uses PCID (or ASID) is therefore left to the architecture-specific code (Linux on x86 does not use PCID, while the ARM platform does use ASID).

Four. Analysis of the TLB operation code in process switching

1. TLB lazy mode

There is a piece of code in context_switch:

if (!mm) {
	next->active_mm = oldmm;
	atomic_inc(&oldmm->mm_count);
	enter_lazy_tlb(oldmm, next);
} else
	switch_mm(oldmm, mm, next);

The meaning of this code is: if the next task to be switched in is a kernel thread (next->mm == NULL), then the enter_lazy_tlb function marks this CPU as having entered lazy TLB mode for the next task. Since enter_lazy_tlb is an empty function on the ARM64 platform, we use x86 to describe lazy TLB mode.

Of course, some preparation is needed first; after all, for embedded engineers familiar with the ARM platform, x86 is somewhat unfamiliar.

So far we have described TLB operations purely from a logical point of view, but in practice, is the TLB operation during a process switch done by hardware or by software? Different processors take different approaches (for reasons unknown). Some processors do it in hardware, such as x86: when CR3 is loaded to switch the address space, the hardware operates on the TLB automatically. Others require software to take part, such as the ARM series of processors: when the TTBR register is switched, the hardware takes no TLB action and software must perform the TLB operations itself. Therefore, on the x86 platform, software does not need to call a TLB flush function explicitly at process switch; the switch_mm function loads the mm->pgd of the next task into CR3, and that CR3 load causes the local TLB entries on this CPU to be flushed.
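For reference, the relevant part of x86's switch_mm looks roughly like this (heavily abridged from arch/x86/include/asm/mmu_context.h in this kernel version; CR4/LDT handling and the prev == next branch are omitted):

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned cpu = smp_processor_id();

	if (likely(prev != next)) {
		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		this_cpu_write(cpu_tlbstate.active_mm, next);
		cpumask_set_cpu(cpu, mm_cpumask(next));

		/* Re-load the page tables: the CR3 write itself invalidates
		 * this CPU's non-global TLB entries, no explicit flush needed. */
		load_cr3(next->pgd);

		/* Stop TLB-flush IPIs for the previous mm. */
		cpumask_clear_cpu(cpu, mm_cpumask(prev));
	}
	/* ... */
}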

What happens when x86 supports PCID (the x86 term, analogous to ARM's ASID)? Will loading CR3 still flush all the local TLB entries on this CPU? In practice, because of the TLB shootdown cost, ordinary Linux does not use PCID (KVM uses it, but that is outside the scope of this article), so on x86 the process address space switch does have the side effect of flushing the local TLB entries.

Another difference between ARM64 and x86: ARM64 has instructions that, executed on one CPU core, also flush the TLBs of other cores. For example, tlbi vmalle1is flushes the TLBs of all CPU cores in the inner shareable domain. x86 cannot do this; to flush the TLBs of other CPU cores it can only notify them via IPIs.
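The two flavours of flush correspond to two different TLBI operations. In this kernel version the wrappers in arch/arm64/include/asm/tlbflush.h look roughly as follows (other variants omitted):

/* Broadcast flush: every core in the inner shareable domain */
static inline void flush_tlb_all(void)
{
	dsb(ishst);
	asm("tlbi	vmalle1is");
	dsb(ish);
	isb();
}

/* Local flush: this CPU core only */
static inline void local_flush_tlb_all(void)
{
	dsb(nshst);
	asm("tlbi	vmalle1");
	dsb(nsh);
	isb();
}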

Okay, all the preparatory knowledge is now in place, and we can get to the TLB lazy mode topic. Although process switching is normally accompanied by TLB flushes, in some scenarios the flush can be avoided. In the following scenarios we can skip the TLB flush (still using the A--->B switch to describe them):

(1) If the next task B is a kernel thread, then we do not need to flush the TLB for the time being, because kernel threads do not access user space; the residual TLB entries of process A cannot affect the kernel thread's execution. After all, B has no user address space of its own and shares the kernel address space with A.

(2) If A and B belong to the same address space (two threads of one process), then we likewise do not need to flush the TLB for the time being.

Besides process switching there are other TLB flush scenarios. Let's look at a common one, shown in the figure below:

In a 4-core system, tasks A0, A1 and A2 belong to the same process address space; A0 runs on cpu_0 and A2 runs on cpu_2. cpu_1 is a bit special: it is running a kernel thread, but that kernel thread is borrowing the address space of task A1. cpu_3 runs task B, which is unrelated to this address space.

When task A0 modifies its own address translations, it cannot just flush the cpu_0 TLB; it must also notify cpu_1 and cpu_2, because the address space currently active on those two CPUs is the same as on cpu_0. Because of A0's modification, the TLB entries cached on cpu_1 and cpu_2 for this address space are now stale and need to be flushed. This generalises to more CPUs: when a task running on cpu_x changes an address mapping, the TLB flush must be propagated to all relevant CPUs (those whose current mm is the same as that task's mm). In a multi-core system, the number of these IPI-delivered TLB flush messages grows with the number of CPU cores. Is there a way to cut down on unnecessary flushes? Of course there is, and that is the famous lazy TLB mode, aimed exactly at scenarios like the A1/kernel-thread situation on cpu_1.
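On x86 the "notify all relevant CPUs" step is driven by mm_cpumask(mm), the bitmap of CPUs that currently have this mm active. The following is only a simplified sketch of the idea (the real arch/x86/mm/tlb.c code also handles page-range flushes and accounting; flush_tlb_mm_sketch is a name made up for this illustration):

/* Simplified sketch: flush the TLB entries of 'mm' wherever they may live. */
static void flush_tlb_mm_sketch(struct mm_struct *mm)
{
	preempt_disable();

	/* Flush locally if this CPU is currently using the mm. */
	if (this_cpu_read(cpu_tlbstate.active_mm) == mm)
		local_flush_tlb();

	/* IPI every other CPU recorded in mm_cpumask(mm); each of them runs
	 * the flush handler (or defers the work if it is in lazy TLB mode). */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);

	preempt_enable();
}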

Let's look at the code first. In context_switch, if the next task is a kernel thread, we do not call switch_mm (which would cause a TLB flush); instead we call enter_lazy_tlb and enter lazy TLB mode. On the x86 architecture the code is as follows:

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
#endif
}

On x86, entering lazy TLB mode simply means setting this CPU's cpu_tlbstate variable to TLBSTATE_LAZY. Since we enter lazy mode, we do not call switch_mm to switch the process address space, and we do not perform a meaningless TLB flush either. enter_lazy_tlb touches no hardware; it merely records the software state of the CPU.

After the switch the kernel thread starts to run, and the cpu_1 TLB still holds entries of process A, which do not affect the kernel thread's execution. But what happens when another CPU sends an IPI asking for a TLB flush? Normally we should flush immediately, but in lazy TLB mode we may skip the flush. So the question becomes: when are the residual entries of process A actually flushed? The answer is: at the next process switch. Once the kernel thread is scheduled out and a new process C is switched in, switch_mm switches to C's address space and all the previous residue is wiped out (thanks to the CR3 load). So while a kernel thread is running, TLB invalidate requests for the borrowed mm can be deferred: when the IPI asking us to invalidate that mm's TLB entries arrives, we do not have to act on it right away; recording the state is enough.
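The receiving side is where lazy TLB mode pays off. The IPI handler in arch/x86/mm/tlb.c does roughly the following (abridged; statistics and the partial-range path are omitted): a CPU in lazy mode does not flush, it just drops the borrowed mm via leave_mm, which also removes it from mm_cpumask so that no further shootdown IPIs are sent to it.

static void flush_tlb_func(void *info)
{
	struct flush_tlb_info *f = info;

	/* The IPI is about an mm this CPU is no longer using: nothing to do. */
	if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
		return;

	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
		/* Normal case: really invalidate the local TLB. */
		local_flush_tlb();
	} else {
		/* Lazy TLB mode: a kernel thread merely borrowed this mm.
		 * Detach from it instead of flushing; a later switch_mm will
		 * reload CR3 anyway and wipe the stale entries. */
		leave_mm(smp_processor_id());
	}
}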

2. How is the ASID managed on ARM64?

Unlike x86, ARM64 supports ASID (similar to x86's PCID); does ARM64 then avoid the TLB shootdown problem? I have actually been thinking about this and have not fully figured it out. Obviously, on ARM64 we do not need IPIs to flush the TLBs of all CPU cores: ARM64 supports, at the instruction set level, TLB flush operations that act on all PEs in a shareable domain. Perhaps these instructions make the TLB flush cheap enough that it becomes worthwhile to support ASID, do no TLB operations at all on process switch, and, since no IPIs are needed to propagate TLB flushes, skip any special handling of lazy TLB mode.

Since Linux on ARM64 chooses to support ASID, it has to face the problems of ASID allocation and management. The hardware ASID is limited to 8 or 16 bits, i.e. at most 256 or 65,536 IDs. What happens when the ASIDs run out? Software has to step in and coordinate. Using the 256-ASID case to describe the basic idea: as long as the system has handed out no more than 256 ASIDs, everything runs normally; once the limit of 256 is reached, all TLBs are flushed and the ASIDs are reallocated. Every time the limit is hit, the TLB must be flushed and the HW ASIDs reassigned. The ASID allocation code is as follows:

static u64 new_context(struct mm_struct *mm, unsigned int cpu)
{
	static u32 cur_idx = 1;
	u64 asid = atomic64_read(&mm->context.id);
	u64 generation = atomic64_read(&asid_generation);

	if (asid != 0) {                                               /* (1) */
		u64 newasid = generation | (asid & ~ASID_MASK);

		if (check_update_reserved_asid(asid, newasid))
			return newasid;

		asid &= ~ASID_MASK;
		if (!__test_and_set_bit(asid, asid_map))
			return newasid;
	}

	asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, cur_idx);  /* (2) */
	if (asid != NUM_USER_ASIDS)
		goto set_asid;

	generation = atomic64_add_return_relaxed(ASID_FIRST_VERSION,   /* (3) */
						 &asid_generation);
	flush_context(cpu);

	asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, 1);        /* (4) */

set_asid:
	__set_bit(asid, asid_map);
	cur_idx = asid;
	return asid | generation;
}

(1) When a new process is created, a new mm is allocated and its software ASID (mm->context.id) is initialised to 0. If the ASID read here is not 0, this mm has already been assigned a software ASID (generation + HW ASID) before, and the new context simply replaces the old generation in the software ASID with the current generation.

(2) If the ASID equals 0, we really do need to allocate a new HW ASID. First look for a free HW ASID; if one is found (jump to set_asid), return the new software ASID (the current generation + the newly allocated HW ASID).

(3) If no free HW ASID can be found, the HW ASIDs are exhausted and the only option is to bump the generation. At this point all old-generation TLB entries on all CPUs must be flushed, because the system is about to enter the new generation. As a side note, the generation variable now already holds the new generation value.

(4) In the flush_context function, the asid_map bitmap that tracks HW ASID allocation is cleared, so this find_next_zero_bit is an HW ASID allocation in the new generation.
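For completeness, flush_context (abridged from the same arch/arm64/mm/context.c; the I-cache handling is omitted) roughly does three things: wipe the asid_map bitmap, re-reserve the HW ASIDs still live on each CPU, and mark every CPU as needing a local TLB flush via the tlb_flush_pending mask, which the code in the next section consumes.

static void flush_context(unsigned int cpu)
{
	int i;
	u64 asid;

	/* New generation: forget all previously allocated hw ASIDs. */
	bitmap_clear(asid_map, 0, NUM_USER_ASIDS);

	/* Keep the hw ASIDs currently running on each CPU reserved,
	 * so those tasks remain valid across the rollover. */
	for_each_possible_cpu(i) {
		asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);
		if (asid == 0)
			asid = per_cpu(reserved_asids, i);
		__set_bit(asid & ~ASID_MASK, asid_map);
		per_cpu(reserved_asids, i) = asid;
	}

	/* Every CPU must flush its local TLB before using a new-generation ASID. */
	cpumask_setall(&tlb_flush_pending);
}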

3. ARM64 TLB operations and ASID handling during process switching

The code is check_and_switch_context in arch/arm64/mm/context.c:

void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
{
	unsigned long flags;
	u64 asid;

	asid = atomic64_read(&mm->context.id);                            /* (1) */

	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)       /* (2) */
	    && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
		goto switch_mm_fastpath;

	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
	asid = atomic64_read(&mm->context.id);
	if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {       /* (3) */
		asid = new_context(mm, cpu);
		atomic64_set(&mm->context.id, asid);
	}

	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))           /* (4) */
		local_flush_tlb_all();

	atomic64_set(&per_cpu(active_asids, cpu), asid);
	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);

switch_mm_fastpath:
	cpu_switch_mm(mm->pgd, mm);
}

You may well be exasperated when you see this code: if ASID is supported, isn't a process switch supposed to need no TLB flush at all? Why is there so much code? Well, the ideal is rosy, the reality is bony: the code has to embed quite a lot of ASID management.

(1) We are about to switch to the address space pointed to by the mm variable, so first obtain that address space's ID (the software ASID) from the memory descriptor. Note that this ID is not the HW ASID; mm->context.id is in fact 64 bits wide, of which the low bits hold the HW ASID (ARM64 supports 8-bit or 16-bit ASIDs; this article assumes the current system uses a 16-bit ASID). The remaining bits are a software extension that we call the generation.

(2) ARM64 supports the ASID concept, so in theory a process switch needs no TLB operation at all. However, because the HW ASID space is limited, we extend the software ASID with extra bits: part of it corresponds to the HW ASID, and the rest is called the ASID generation. The ASID generation starts at ASID_FIRST_VERSION and is incremented every time the HW ASID space overflows. asid_bits is the number of ASID bits supported by the hardware, 8 or 16, which can be read from the ID_AA64MMFR0_EL1 register.

When the mm's software ASID still belongs to the current generation, the switch needs no TLB operation: we can call cpu_switch_mm directly to switch the address space (and, of course, also update the active_asids per-cpu variable).

(3) If the generation of the process being switched in does not match the current generation, the address space needs a new software ASID, or more precisely needs to be advanced to the new generation. So new_context is called here to allocate a new context ID and store it into mm->context.id.

(4) When a CPU switches into the new generation of the ASID space, it calls local_flush_tlb_all to flush its local TLB.
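The helper macros that the code above relies on (from the top of arch/arm64/mm/context.c, give or take) make the software-ASID layout explicit:

static u32 asid_bits;			/* 8 or 16, probed from ID_AA64MMFR0_EL1 */

#define ASID_FIRST_VERSION	(1UL << asid_bits)	/* one generation step */
#define NUM_USER_ASIDS		ASID_FIRST_VERSION	/* hw ASIDs available per generation */
#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))	/* generation bits of mm->context.id */

/* So for a software ASID x: (x & ~ASID_MASK) is the hw ASID programmed into the
 * TTBR, and (x & ASID_MASK) is the generation used for the rollover check. */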

Reference documents:

1. 64-ia-32-architectures-software-developer-manual-325462.pdf

2. DDI0487A_e_armv8_arm.pdf

3. Linux 4.4.6 Kernel source code
