1. Kernel space and user space
User space: in Linux, each user process can access a 4 GB linear virtual address space. Virtual addresses from 0 to 3 GB form the user space, which a user process accesses directly through its own page directory and page tables.
Kernel space: virtual addresses from 3 GB to 4 GB form the kernel space, which holds the code and data accessed by the kernel. A process running in user mode cannot address this range; only a process running in kernel mode can. All processes share the same virtual space from 3 GB to 4 GB, which is how Linux lets processes in kernel mode share the kernel's code and data segments.
Thanks to the virtual memory mechanism, a process can use the entire 4 GB linear address space supported by a 32-bit address system. The linear address space of a process is divided into two parts:
• Linear addresses from 0x00000000 to 0xbfffffff can be addressed by a process in either user mode or kernel mode.
• Linear addresses from 0xc0000000 to 0xffffffff can be addressed only by a process in kernel mode.
When a process runs in user mode, it generates linear addresses below 0xc0000000. When it runs in kernel mode, it executes kernel code and generates linear addresses greater than or equal to 0xc0000000, a range shared by all processes. On a multi-core CPU, several processes may access these addresses concurrently. That makes them critical-section resources, which is why we will focus on synchronization and mutual exclusion in later posts. The PAGE_OFFSET macro has the value 0xc0000000; it is the offset in the linear address space of a process at which the kernel's address space begins.
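As a rough illustration of the split described above, here is a minimal user-space sketch in C. It only models the comparison against the 0xc0000000 boundary; the helper names is_kernel_address and is_user_address are made up for this example and are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boundary between user and kernel linear addresses (3 GB), as in the text. */
#define PAGE_OFFSET 0xc0000000u

/* Hypothetical helper: true if the address lies in the 3 GB..4 GB range. */
static bool is_kernel_address(uint32_t linear)
{
    return linear >= PAGE_OFFSET;
}

/* Hypothetical helper: true if a process in user mode may generate it. */
static bool is_user_address(uint32_t linear)
{
    return linear < PAGE_OFFSET;
}
```

Every 32-bit linear address falls into exactly one of the two ranges, so the two helpers are always complementary.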
The Page Global Directory is divided into two parts. The first group of entries maps linear addresses below 0xc0000000 (the first 768 of the 1024 entries when PAE is disabled, or the first 3 entries when PAE is enabled); their contents depend on the specific process. The remaining entries, by contrast, should be the same for all processes and equal to the corresponding entries of the master kernel Page Global Directory.
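The figure of 768 entries follows from the fact that, without PAE, the top 10 bits of a linear address select the directory entry, so each entry covers 4 MB. The sketch below is a user-space model of that arithmetic, not kernel code; the names mirror common kernel naming, but user_pgd_entries and kernel_pgd_entries are hypothetical helpers.

```c
#include <stdint.h>

#define PAGE_OFFSET  0xc0000000u
#define PGDIR_SHIFT  22        /* without PAE, one directory entry covers 4 MB */
#define PTRS_PER_PGD 1024u     /* entries in the directory without PAE */

/* Index of the directory entry that maps a given linear address. */
static unsigned int pgd_index(uint32_t linear)
{
    return (linear >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

/* Hypothetical helper: entries mapping addresses below PAGE_OFFSET. */
static unsigned int user_pgd_entries(void)
{
    return pgd_index(PAGE_OFFSET);
}

/* Hypothetical helper: entries shared by all processes. */
static unsigned int kernel_pgd_entries(void)
{
    return PTRS_PER_PGD - user_pgd_entries();
}
```

With these values, the kernel part starts at entry 0x300, which is the same index the temporary page table setup uses later in this post.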
So what is the master kernel Page Global Directory? The kernel maintains a set of page tables for its own use, rooted at the so-called master kernel Page Global Directory. After system initialization, this set of page tables is never used directly by any process or kernel thread. Its main purpose is to serve as a reference model for the corresponding Page Global Directory entries of every regular process in the system.
How does the kernel initialize its own page tables? The process has two phases. Right after the kernel image is loaded into memory, the CPU is still running in real mode, so paging is not enabled.
In the first phase, the kernel creates a limited address space that includes the kernel's code and data segments, the initial page tables, and 128 KB for dynamic data structures. This minimal address space is just large enough to load the kernel into RAM and initialize its core data structures.
In the second phase, the kernel takes advantage of all remaining RAM and sets up the page tables properly.
1.1 Kernel page table initialization
Why does Linux create kernel page tables at all? There are two purposes. The first is dynamic management of the kernel's own data structures, for example swapping out data structures that processes in kernel mode temporarily do not need; the second is to provide a reference for the page tables of processes, which a later post will discuss in detail.
During system initialization, the kernel must first create an initial page table, the temporary kernel page table. The temporary Page Global Directory that points to this page table is initialized statically at kernel compile time, while the temporary page tables themselves are initialized by the startup_32() assembly language function (defined in arch/i386/kernel/head.S).
After the kernel is compiled, the temporary Page Global Directory is stored in the swapper_pg_dir variable. The temporary page tables are stored starting at pg0, right after the kernel's uninitialized data segment (if this is unclear, see the earlier blog post on memory layout). The purpose of the temporary kernel page tables is to provide a mapping mechanism during kernel initialization. In general, the segments the kernel uses during initialization, the temporary page tables, and the 128 KB of dynamic data fit in the first 8 MB of RAM. Mapping the first 8 MB of RAM takes only two page tables, because one page table has 1024 entries and 2 * 1024 * 4 KB is exactly 8 MB.
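The "two page tables for 8 MB" arithmetic can be checked with a few lines of C. This is only a sketch of the calculation; bytes_mapped and page_tables_needed are hypothetical helper names, and the constants assume 4 KB pages and 1024 entries per page table, as stated in the text.

```c
#include <stdint.h>

#define PAGE_SIZE    4096u   /* 4 KB pages */
#define PTRS_PER_PTE 1024u   /* entries in one page table without PAE */

/* Bytes of RAM that a given number of page tables can map. */
static uint32_t bytes_mapped(unsigned int page_tables)
{
    return page_tables * PTRS_PER_PTE * PAGE_SIZE;
}

/* Page tables needed to map `bytes` of RAM, rounding up. */
static unsigned int page_tables_needed(uint32_t bytes)
{
    const uint32_t per_table = PTRS_PER_PTE * PAGE_SIZE;   /* 4 MB */
    return (bytes + per_table - 1) / per_table;
}
```

One page table maps 4 MB, so anything even one byte above 8 MB would need a third table.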
The goal of this first phase of paging is to let these 8 MB be addressed easily in both real mode and protected mode, which smooths the switch from real mode to protected mode. Therefore, the kernel must create a mapping that translates both the linear addresses from 0x00000000 to 0x007fffff and the linear addresses from 0xc0000000 to 0xc07fffff into the physical addresses from 0x00000000 to 0x007fffff. Dizzy yet? In other words, during the first phase of initialization the kernel can address the first 8 MB of physical RAM either through linear addresses identical to the physical ones or through the 8 MB of linear addresses starting at 0xc0000000. Still dizzy? That's all right. Draw a picture and work through it a few times.
The kernel creates the desired mapping by filling all 1024 swapper_pg_dir entries with zeros, except for entries 0, 1, 0x300 (decimal 768), and 0x301 (decimal 769). The latter two entries span all linear addresses in the 8 MB from 0xc0000000 to 0xc07fffff. Entries 0, 1, 0x300, and 0x301 are initialized as follows:
• The address field of entries 0 and 0x300 is set to the physical address of pg0, while the address field of entries 1 and 0x301 is set to the physical address of the page frame following pg0.
• The Present, Read/Write, and User/Supervisor flags are set in all four entries.
• The Accessed, Dirty, PCD, PWT, and Page Size flags are cleared in all four entries.
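The initialization described by the bullets above can be modeled in user space with a plain array. This is a sketch under stated assumptions, not the real startup_32() code: init_temporary_pgd is a made-up function, the flag values are the standard x86 bit positions for Present, Read/Write, and User/Supervisor, and the physical address of pg0 is passed in as a parameter because its real value depends on the kernel image.

```c
#include <stdint.h>
#include <string.h>

#define PTRS_PER_PGD  1024
#define PAGE_SIZE     4096u

/* Standard x86 page table entry flag bits. */
#define _PAGE_PRESENT 0x001u
#define _PAGE_RW      0x002u
#define _PAGE_USER    0x004u

/*
 * Hypothetical model of the temporary Page Global Directory setup:
 * zero everything, then point entries 0/0x300 at pg0 and entries
 * 1/0x301 at the page frame that follows pg0.
 */
static void init_temporary_pgd(uint32_t pgd[PTRS_PER_PGD], uint32_t pg0_phys)
{
    const uint32_t flags = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER;

    memset(pgd, 0, PTRS_PER_PGD * sizeof(pgd[0]));
    pgd[0] = pgd[0x300] = pg0_phys | flags;
    pgd[1] = pgd[0x301] = (pg0_phys + PAGE_SIZE) | flags;
}
```

Because only the three low flag bits are set, the Accessed, Dirty, PCD, PWT, and Page Size bits stay cleared, as the last bullet requires.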
Once the temporary kernel page tables exist, we need to put the mapping to use right away, because the kernel has to enter protected mode to initialize its various data structures. How is it enabled? The startup_32() assembly language function turns on the paging unit during initialization by loading the address of swapper_pg_dir into the cr3 control register and setting the PG flag of the cr0 control register. The following snippet is equivalent:
movl $swapper_pg_dir-0xc0000000,%eax
movl %eax,%cr3          /* set the page table pointer... */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0          /* ...and set the paging (PG) bit */
With the temporary kernel page tables in place, we can finally leave real mode. The rest of the story is about using the protected mode of 80x86 CPUs to implement virtual memory management for the kernel and for processes. However, the temporary kernel page tables map only 8 MB, which is enough to address the kernel during initialization but not enough for memory management as a whole. Next, we set up the final kernel page tables. On 32-bit 80x86 systems, how the final kernel page tables are built depends on the actual amount of RAM:
1.2 Final kernel page table when RAM is less than 896 MB
The final mapping provided by the kernel page tables must translate linear addresses starting at 0xc0000000 into physical addresses starting at 0. The __pa macro converts a linear address starting from PAGE_OFFSET into the corresponding physical address, while the __va macro does the reverse and converts a physical address into a linear address. The master kernel Page Global Directory is still stored in the swapper_pg_dir variable. It is initialized by the paging_init() function, which performs the following operations:
1. Call pagetable_init() to set up the page table entries properly.
2. Write the physical address of swapper_pg_dir into the cr3 control register.
3. If the CPU supports PAE and the kernel was compiled with PAE support, set the PAE flag in the cr4 control register.
4. Call __flush_tlb_all() to invalidate all TLB entries.
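The __pa and __va conversions mentioned in this section boil down to subtracting or adding PAGE_OFFSET. Here is a small user-space sketch of that idea; model_pa and model_va are hypothetical stand-ins so as not to suggest these are the kernel's actual macro definitions.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u

/* Hypothetical stand-in for __pa: kernel linear address -> physical address. */
static uint32_t model_pa(uint32_t linear)
{
    return linear - PAGE_OFFSET;
}

/* Hypothetical stand-in for __va: physical address -> kernel linear address. */
static uint32_t model_va(uint32_t phys)
{
    return phys + PAGE_OFFSET;
}
```

The two conversions are inverses of each other, but only for the directly mapped part of RAM; they say nothing about memory that has no fixed linear address.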
1.3 Final kernel page table when RAM size is between 896 MB and 4096 MB
In this case, not all of the RAM can be mapped into the kernel linear address space. During initialization, Linux maps only an 896 MB window of RAM into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM by changing the values of some page table entries. The kernel initializes the Page Global Directory with the same code as in the previous case.
1.4 Final kernel page table when RAM is greater than 4096 MB
How are the kernel page tables initialized on modern computers, especially high-performance servers, with more than 4 GB of memory? More precisely, we deal with the case in which all of the following hold:
• The CPU model supports Physical Address Extension (PAE)
• The amount of RAM is larger than 4 GB
• The kernel is compiled with PAE support
Although PAE handles 36-bit physical addresses, linear addresses are still 32 bits wide. As above, Linux maps an 896 MB window of RAM into the kernel linear address space; the remaining RAM is left unmapped and is handled by dynamic remapping. The main difference from the previous case is that a three-level paging model is used. Even with PAE, the CPU can address at most 64 GB of physical memory through the kernel page tables. So to build an even more capable server, either improve the dynamic remapping algorithm or simply move to a 64-bit processor.
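Two numbers in this section can be checked with a short sketch: 36 physical address bits give 64 GB, and with PAE a 32-bit linear address is split into 2 + 9 + 9 + 12 bits for the three paging levels plus the page offset. The function names below are made up for this example.

```c
#include <stdint.h>

/* Bytes addressable with a given number of physical address bits. */
static uint64_t addressable_bytes(unsigned int bits)
{
    return (uint64_t)1 << bits;
}

/* PAE split of a 32-bit linear address: 2 + 9 + 9 + 12 bits. */
static unsigned int pae_pgd_index(uint32_t linear)
{
    return linear >> 30;                 /* top 2 bits: 4 directory entries */
}

static unsigned int pae_pmd_index(uint32_t linear)
{
    return (linear >> 21) & 0x1ffu;      /* next 9 bits: 512 entries */
}

static unsigned int pae_pte_index(uint32_t linear)
{
    return (linear >> 12) & 0x1ffu;      /* next 9 bits: 512 entries */
}
```

Under this split, address 0xc0000000 lands in top-level entry 3, which is why the first three entries cover user space when PAE is enabled.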
2 Fix-mapped linear addresses
We have seen that the initial 896 MB of the kernel linear addresses in the fourth gigabyte map the physical memory of the system. However, at least 128 MB of linear addresses are always left for other purposes, because the kernel uses them for noncontiguous memory allocation and for fix-mapped linear addresses.
Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and a later blog post will describe it. This section focuses on fix-mapped linear addresses.
The Linux kernel provides a range of virtual addresses for fixed mappings, known as the fixmap.
A fix-mapped linear address is a constant linear address whose physical address is not obtained by a simple linear conversion but is specified explicitly. Each fix-mapped linear address maps one page frame of physical memory, and it can be mapped to any page of physical memory.
Fix-mapped linear addresses are allocated downward from just below the last 4 KB of the linear address space, that is, from linear address 0xfffff000 toward lower addresses. One page is left unused between the last 4 KB and the fix-mapped linear address space (the reason is unknown). The region below the fix-mapped linear address space is the area used by vmalloc, and a blank page separates these two as well.
A fix-mapped linear address is basically a constant linear address such as 0xffffc000. Its physical address does not have to be the linear address minus 0xc0000000; it can be set up arbitrarily through the page tables. Thus each fix-mapped linear address maps one page frame of physical memory.
Each fix-mapped linear address is represented by an integer index defined in the enum fixed_addresses data structure:
enum fixed_addresses {
    FIX_HOLE,
    FIX_VSYSCALL,
    FIX_APIC_BASE,
    FIX_IO_APIC_BASE_0,
    ...
    __end_of_fixed_addresses
};
Fix-mapped linear addresses are placed at the end of the fourth gigabyte of linear addresses. The fix_to_virt() function computes the constant linear address for a given index:
inline unsigned long fix_to_virt(const unsigned int idx)
{
    /* Undefined symbol: forces a link error for an invalid index. */
    if (idx >= __end_of_fixed_addresses)
        __this_fixmap_does_not_exist();
    return (0xfffff000UL - (idx << PAGE_SHIFT));
}
For example, suppose some kernel function calls fix_to_virt(FIX_IO_APIC_BASE_0). Because the function is declared "inline", the C compiler does not generate a call to fix_to_virt() but inserts its code into the calling function. Moreover, the index value is never checked at run time.
By the rules of enumerations, FIX_IO_APIC_BASE_0 is a constant equal to 3. The compiler can therefore remove the if statement, because its condition is known to be false at compile time. Conversely, if the condition is true, or if the argument of fix_to_virt() is not a constant, the build fails at link time because the symbol __this_fixmap_does_not_exist is not defined anywhere.
Finally, the compiler computes 0xfffff000 - (3 << PAGE_SHIFT), that is, 0xfffff000 minus three pages. Counting down one page at a time gives 0xffffe000, 0xffffd000, 0xffffc000, so the constant linear address 0xffffc000 is the return value of fix_to_virt().
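The worked example above is easy to reproduce in user space. The sketch below is an approximation, not the kernel source: the enum is cut down to the four entries shown in this post, END_OF_FIXED_ADDRESSES replaces the kernel's end marker, FIXADDR_TOP is a name assumed here for the constant 0xfffff000, and a run-time assert stands in for the link-time trick.

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define FIXADDR_TOP 0xfffff000UL   /* assumed name for the top fix-mapped address */

/* Truncated model of enum fixed_addresses; the real list is longer. */
enum fixed_addresses {
    FIX_HOLE,
    FIX_VSYSCALL,
    FIX_APIC_BASE,
    FIX_IO_APIC_BASE_0,
    END_OF_FIXED_ADDRESSES
};

/* Each index moves one page (4 KB) down from FIXADDR_TOP. */
static unsigned long fix_to_virt(unsigned int idx)
{
    assert(idx < END_OF_FIXED_ADDRESSES);   /* run-time stand-in for the link error */
    return FIXADDR_TOP - ((unsigned long)idx << PAGE_SHIFT);
}
```

Calling fix_to_virt(FIX_IO_APIC_BASE_0) in this model yields 0xffffc000, the same constant derived by hand above.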
Given a fix-mapped linear address, how do we associate a physical address with it? The kernel uses the set_fixmap(idx, phys) and set_fixmap_nocache(idx, phys) macros. Both initialize the page table entry corresponding to the fix_to_virt(idx) linear address with the physical address phys (note that the page directory is still swapper_pg_dir, so only the page table entry needs to be set here); the second one also sets the PCD flag of the page table entry, which disables the hardware cache when the data in this page frame is accessed. Conversely, clear_fixmap(idx) removes the link between a fix-mapped linear address idx and the physical address.
What is this fixed mapping good for? It is generally used in place of some frequently used pointers. Compared with pointer variables, fix-mapped linear addresses are more efficient: dereferencing a pointer variable requires one more memory access than dereferencing an immediate constant address. For example, FIX_APIC_BASE identifies a linear address whose target object lives somewhere in physical memory. Once the link between the two is set up with set_fixmap (and torn down with clear_fixmap), the object can be addressed directly, with no pointer-style indirection.
In addition, checking the value of a pointer variable before dereferencing it is good programming practice; by contrast, checking a constant linear address is unnecessary.
3 Kernel mappings of high memory
As analyzed above, in Linux memory management the kernel uses the linear address space from 3 GB to 4 GB, 1 GB in total. On x86, the first 896 MB of these linear addresses map one-to-one onto physical addresses through the kernel page tables, while the remaining 128 MB of linear addresses are reserved for other uses (noncontiguous memory allocation and fix-mapped linear addresses). Physical memory above 896 MB is generally called high memory. How does the kernel manage high memory? Let's analyze this issue today.
The kernel has three ways to manage high memory. The first is noncontiguous mapping; we only mention it briefly here: when vmalloc requests a page and that page comes from high memory, it is mapped into the range between VMALLOC_START and VMALLOC_END. The second is permanent kernel mapping. The third is temporary kernel mapping.
Next, we analyze the second and third methods in detail.
The kernel has a global variable named high_memory that marks the start of high memory; it corresponds to the 896 MB boundary (0x38000000). Page frames above the 896 MB boundary are generally not mapped in the fourth gigabyte of the kernel linear address space (on 32-bit 80x86 without PAE, the direct mapping ends at linear address 3 GB + 896 MB), so the kernel cannot access them directly. This means that page allocator functions that return the linear address of the allocated page frame, such as __get_free_pages(GFP_HIGHMEM, 0), do not work for high memory, that is, for page frames in the ZONE_HIGHMEM memory zone.
Page frames in high memory can be allocated only through the alloc_pages() function and its shortcut alloc_page(). These functions do not return the linear address of the first allocated page frame, because no such linear address exists if the page frame belongs to high memory. Instead, they return the linear address of the page descriptor of the first allocated page frame. That linear address always exists, because all page descriptors are allocated in low memory once and for all during kernel initialization and never move.
Don't celebrate too soon. Although alloc_pages() can allocate a page in high memory, that page has no linear address and so cannot be accessed by the kernel. For this reason, part of the last 128 MB of the kernel linear address space is dedicated to mapping pages of high memory. Of course, such a mapping is temporary; otherwise only 128 MB of high memory would be accessible. By recycling linear addresses, the whole of high memory can be accessed at different times.
3.1 Permanent kernel mapping
Permanent kernel mappings allow the kernel to establish long-lasting mappings of high-memory page frames into the kernel address space. They use a single dedicated page table in the master kernel page tables (note: just one page table), whose address is stored in the pkmap_page_table variable. The number of entries in this page table is given by the LAST_PKMAP macro. As usual, the page table contains 512 or 1024 entries, depending on whether PAE is enabled, so the kernel can have at most 2 MB or 4 MB of high memory permanently mapped at any one time. Because the window is only 4 MB or 2 MB, a single page table suffices, and the kernel locates it through pkmap_page_table.
The linear addresses mapped by this page table start at PKMAP_BASE (note that LAST_PKMAP and PKMAP_BASE are macros). The pkmap_count array contains LAST_PKMAP counters, one for each entry of the pkmap_page_table page table. There are three cases:
Counter is 0: the corresponding page table entry does not map any high-memory page frame and is available.
Counter is 1: the corresponding page table entry does not map any high-memory page frame, but it cannot be used because the corresponding TLB entry has not been flushed since its last use.
Counter is n (greater than 1): the corresponding page table entry maps a high-memory page frame, which is used by exactly n-1 kernel components.
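The three counter cases, and the 2 MB or 4 MB window size, can be written down as a tiny classification sketch. Everything here is illustrative: the enum and the helper functions are invented for this example, and counters are assumed to be non-negative.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical names for the three cases of a pkmap_count entry. */
enum pkmap_state { PKMAP_FREE, PKMAP_STALE, PKMAP_IN_USE };

static enum pkmap_state pkmap_classify(int count)
{
    if (count == 0)
        return PKMAP_FREE;     /* unmapped and usable */
    if (count == 1)
        return PKMAP_STALE;    /* unmapped, but TLB not flushed yet */
    return PKMAP_IN_USE;       /* mapped, count - 1 users */
}

/* Number of kernel components currently using the mapping. */
static int pkmap_users(int count)
{
    return count > 1 ? count - 1 : 0;
}

/* Size of the permanent mapping window for a given LAST_PKMAP value. */
static size_t pkmap_window_bytes(unsigned int last_pkmap)
{
    return (size_t)last_pkmap * PAGE_SIZE;
}
```

The off-by-one between the counter and the number of users is what lets the value 1 encode "free but not yet flushed".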
To keep track of the association between high-memory page frames and the linear addresses of permanent kernel mappings, the kernel uses the page_address_htable hash table. This table contains one page_address_map data structure for each high-memory page frame that is currently mapped. In turn, this data structure contains a pointer to the page descriptor and the linear address assigned to the page frame.
The page_address() function returns the linear address associated with a page frame, or NULL if the page frame is in high memory and is not mapped. The function takes a page descriptor pointer page as its parameter and distinguishes two cases:
1. If the page frame is not in high memory (the PG_highmem flag is clear), the linear address always exists. It is obtained by computing the page frame index, converting it into a physical address, and finally deriving the linear address from that physical address. The following code does this: __va((unsigned long)(page - mem_map) << 12)
2. If the page frame is in high memory (the PG_highmem flag is set), the function looks it up in the page_address_htable hash table. If the page frame is found there, page_address() returns its linear address; otherwise it returns NULL.
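The low-memory case of page_address() is pure pointer arithmetic, which a user-space sketch can imitate. The model below assumes a tiny mem_map array of 16 descriptors and a stripped-down struct page; lowmem_page_address is a hypothetical name, and the high-memory hash lookup is deliberately left out.

```c
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xc0000000u

/* Stripped-down page descriptor, just enough for the arithmetic. */
struct page {
    unsigned long flags;
};

/* Tiny model of the page descriptor array (16 page frames only). */
static struct page mem_map[16];

/* Hypothetical model of the low-memory branch of page_address(). */
static uint32_t lowmem_page_address(const struct page *page)
{
    uint32_t pfn  = (uint32_t)(page - mem_map);   /* page frame index */
    uint32_t phys = pfn << PAGE_SHIFT;            /* physical address */
    return phys + PAGE_OFFSET;                    /* same as __va(phys) */
}
```

The descriptor's position in mem_map is the page frame number, so no lookup structure is needed for directly mapped memory.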
The kmap() function establishes a permanent kernel mapping. It is essentially equivalent to the following code:
void *kmap(struct page *page)
{
    /* Low-memory pages already have a linear address. */
    if (!PageHighMem(page))
        return page_address(page);
    return kmap_high(page);
}
If the page frame really does belong to high memory, kmap() calls the kmap_high() function, which is essentially equivalent to the following code:
void *kmap_high(struct page *page)
{
    unsigned long vaddr;
    spin_lock(&kmap_lock);
    /* Reuse the existing mapping if the page is already mapped. */
    vaddr = (unsigned long) page_address(page);
    if (!vaddr)
        vaddr = map_new_virtual(page);
    /* Account for the new user of this mapping. */
    pkmap_count[(vaddr - PKMAP_BASE) >> PAGE_SHIFT]++;
    spin_unlock(&kmap_lock);
    return (void *) vaddr;
}
This function acquires the kmap_lock spin lock to protect the page table against concurrent accesses on multiprocessor systems. Next, kmap_high() checks whether the page frame is already mapped by calling page_address(). If not, it calls map_new_virtual() to insert the physical address of the page frame into an entry of pkmap_page_table and to add an element to the page_address_htable hash table. Then kmap_high() increases the counter corresponding to the linear address of the page frame by 1, to account for the new kernel component that called this function. Finally, kmap_high() releases the kmap_lock spin lock and returns the linear address that maps the page frame.
The map_new_virtual() function essentially executes two nested loops:
for (;;) {
    int count;
    DECLARE_WAITQUEUE(wait, current);
    /* Inner loop: look for a free entry, starting after the last one used. */
    for (count = LAST_PKMAP; count > 0; --count) {
        last_pkmap_nr = (last_pkmap_nr + 1) & (LAST_PKMAP - 1);
        if (!last_pkmap_nr) {
            /* Wrapped around: recycle entries whose counter is 1. */
            flush_all_zero_pkmaps();
            count = LAST_PKMAP;
        }
        if (!pkmap_count[last_pkmap_nr]) {
            unsigned long vaddr = PKMAP_BASE +
                                  (last_pkmap_nr << PAGE_SHIFT);
            set_pte(&(pkmap_page_table[last_pkmap_nr]),
                    mk_pte(page, __pgprot(0x63)));
            pkmap_count[last_pkmap_nr] = 1;
            set_page_address(page, (void *) vaddr);
            return vaddr;
        }
    }
    /* No free entry: sleep until another process releases one. */
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue(&pkmap_map_wait, &wait);
    spin_unlock(&kmap_lock);
    schedule();
    remove_wait_queue(&pkmap_map_wait, &wait);
    spin_lock(&kmap_lock);
    /* Someone else may have mapped the page while we slept. */
    if (page_address(page))
        return (unsigned long) page_address(page);
}
In the inner loop, the function scans all the counters in pkmap_count until it finds a zero value. When an unused entry is found in pkmap_count, the large if block runs. That code determines the linear address corresponding to the entry, creates an entry for it in the pkmap_page_table page table, sets the counter to 1 because the entry is now in use, calls set_page_address() to insert a new element into the page_address_htable hash table, and returns the linear address.
The function scans the pkmap_count array starting from where it left off last time. It does so by keeping the index of the last used entry of the pkmap_page_table page table in a variable named last_pkmap_nr. Thus the search resumes from where the previous invocation of map_new_virtual() stopped.
When the last counter in pkmap_count is reached, the search restarts from the counter at index 0. Before continuing, however, map_new_virtual() calls flush_all_zero_pkmaps(), which starts another scan looking for counters with the value 1. Each counter with the value 1 denotes an entry in pkmap_page_table that is free but cannot be used because the corresponding TLB entry has not yet been flushed. flush_all_zero_pkmaps() resets those counters to 0, removes the corresponding elements from the page_address_htable hash table, and flushes the TLB entries for all entries of pkmap_page_table.
If the inner loop finds no zero counter in pkmap_count, map_new_virtual() blocks the current process until some other process releases an entry of the pkmap_page_table page table. It does so by inserting current into the pkmap_map_wait wait queue, setting the state of current to TASK_UNINTERRUPTIBLE, and calling schedule() to relinquish the CPU. Once the process is woken up, the function checks whether another process has mapped the page in the meantime by calling page_address(); if nobody has mapped the page yet, the inner loop starts again.
The kunmap() function destroys a permanent kernel mapping previously established by kmap(). If the page really is in high memory, it calls the kunmap_high() function, which is essentially equivalent to the following code:
void kunmap_high(struct page *page)
{
    spin_lock(&kmap_lock);
    /* Drop one user; a counter of 1 means nobody uses the mapping. */
    if ((--pkmap_count[((unsigned long) page_address(page)
                        - PKMAP_BASE) >> PAGE_SHIFT]) == 1)
        if (waitqueue_active(&pkmap_map_wait))
            wake_up(&pkmap_map_wait);
    spin_unlock(&kmap_lock);
}
The expression inside the brackets computes the index into the pkmap_count array from the linear address of the page. The counter is decreased by 1 and compared with 1. A match means that no process is using the page any longer, so the function can wake up the processes (if any) that map_new_virtual() put in the wait queue.
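To see how a counter moves through the values 0, 2, 1, and back to 0, here is a single-threaded user-space model of the kmap_high() / kunmap_high() bookkeeping. It is a simplified sketch under several assumptions: the window has only 4 entries instead of 512 or 1024, an owner array stands in for page_address_htable, there is no locking, no page table, and no TLB, and map_new_virtual() returns 0 instead of sleeping when every entry is busy. PKMAP_BASE is set to 4 GB - 8 MB, following the range given in this post. The helper model_page_address is a made-up name.

```c
#include <stddef.h>

#define LAST_PKMAP 4               /* tiny on purpose; the text uses 512 or 1024 */
#define PAGE_SHIFT 12
#define PKMAP_BASE 0xff800000UL    /* 4 GB - 8 MB */

static int pkmap_count[LAST_PKMAP];
static const void *pkmap_owner[LAST_PKMAP];   /* stands in for page_address_htable */
static unsigned int last_pkmap_nr;

/* Recycle entries whose counter is 1 (free, but not flushed yet). */
static void flush_all_zero_pkmaps(void)
{
    for (int i = 0; i < LAST_PKMAP; i++) {
        if (pkmap_count[i] == 1) {
            pkmap_count[i] = 0;
            pkmap_owner[i] = NULL;
        }
    }
}

/* Model of page_address() for high memory: 0 if the page is not mapped. */
static unsigned long model_page_address(const void *page)
{
    for (int i = 0; i < LAST_PKMAP; i++)
        if (pkmap_owner[i] == page)
            return PKMAP_BASE + ((unsigned long)i << PAGE_SHIFT);
    return 0;
}

/* Same scan as in the text, but returns 0 instead of sleeping. */
static unsigned long map_new_virtual(const void *page)
{
    for (int count = LAST_PKMAP; count > 0; --count) {
        last_pkmap_nr = (last_pkmap_nr + 1) & (LAST_PKMAP - 1);
        if (!last_pkmap_nr) {
            flush_all_zero_pkmaps();
            count = LAST_PKMAP;
        }
        if (!pkmap_count[last_pkmap_nr]) {
            pkmap_count[last_pkmap_nr] = 1;
            pkmap_owner[last_pkmap_nr] = page;
            return PKMAP_BASE + ((unsigned long)last_pkmap_nr << PAGE_SHIFT);
        }
    }
    return 0;
}

static unsigned long kmap_high(const void *page)
{
    unsigned long vaddr = model_page_address(page);
    if (!vaddr)
        vaddr = map_new_virtual(page);
    if (vaddr)
        pkmap_count[(vaddr - PKMAP_BASE) >> PAGE_SHIFT]++;
    return vaddr;
}

static void kunmap_high(const void *page)
{
    unsigned long vaddr = model_page_address(page);
    if (vaddr)
        --pkmap_count[(vaddr - PKMAP_BASE) >> PAGE_SHIFT];
}
```

In this model the first mapping lands in entry 1 rather than entry 0, because the scan increments last_pkmap_nr before testing an entry, just like the loop shown earlier.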
To sum up: if alloc_page() gives us a page in high memory, how do we find a linear address for it?
The kernel sets aside a linear address range from PKMAP_BASE to FIXADDR_START for mapping high memory. On a 2.6 kernel without PAE, this range lies between 4 GB - 8 MB and 4 GB - 4 MB. It is called the "kernel permanent mapping space" or "permanent kernel mapping space".
This range uses the same Page Global Directory as the rest of the address space. For the kernel it is swapper_pg_dir; for a regular process, the cr3 register points to its Page Global Directory.
The range is normally 4 MB in size, so a single page table suffices. The kernel locates this page table through pkmap_page_table.
kmap() maps a page into this range.
Because the range is 4 MB, at most 1024 pages can be mapped at the same time. Pages that are no longer in use should therefore be released from it (that is, have their mapping removed); kunmap() releases the linear address of a page from this range.
3.2 Temporary kernel mapping
Temporary kernel mappings are simpler than permanent kernel mappings. Any page frame in high memory can be mapped into the kernel address space through a "window" (a page table entry reserved for this purpose). The number of windows reserved for temporary kernel mappings is quite small.
Each CPU has its own set of 13 windows, represented by the enum km_type data structure. Each symbol defined in this data structure, such as KM_BOUNCE_READ, KM_USER0, or KM_PTE0, identifies the linear address of a window.
The kernel must ensure that the same window is never used by two kernel control paths at the same time. Thus each symbol in km_type may be used by only one kernel component and is named after it. The last symbol, KM_TYPE_NR, does not represent a linear address by itself; it gives the number of different windows available to each CPU.
Each symbol in km_type (except the last one) is an index of a fix-mapped linear address (see the earlier discussion of fix-mapped linear addresses).
The enum fixed_addresses data structure includes the symbols FIX_KMAP_BEGIN and FIX_KMAP_END; the latter is assigned the index FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS) - 1. In this way, there are KM_TYPE_NR fix-mapped linear addresses for each CPU in the system. Furthermore, the kernel initializes the kmap_pte variable with the address of the page table entry corresponding to the fix_to_virt(FIX_KMAP_BEGIN) linear address.
To establish a temporary kernel mapping, the kernel calls the kmap_atomic() function, which is essentially equivalent to the following code:
void *kmap_atomic(struct page *page, enum km_type type)
{
    enum fixed_addresses idx;
    unsigned long vaddr;
    /* Disable kernel preemption while the window is in use. */
    current_thread_info()->preempt_count++;
    if (!PageHighMem(page))
        return page_address(page);
    /* Pick this CPU's window for the given type. */
    idx = type + KM_TYPE_NR * smp_processor_id();
    vaddr = fix_to_virt(FIX_KMAP_BEGIN + idx);
    set_pte(kmap_pte - idx, mk_pte(page, 0x063));
    __flush_tlb_single(vaddr);
    return (void *) vaddr;
}
The type argument and the CPU identifier (obtained through smp_processor_id()) specify which fix-mapped linear address must be used to map the requested page. The function returns the linear address of the page frame if it does not belong to high memory; otherwise, it sets up the page table entry corresponding to the fix-mapped linear address with the physical address of the page and the Present, Accessed, Read/Write, and Dirty bits. Finally, the function flushes the proper TLB entry and returns the linear address.
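The way kmap_atomic() picks a window is just index arithmetic, which the following user-space sketch reproduces. Several values are assumptions made for illustration only: FIX_KMAP_BEGIN is set to 4 and NR_CPUS to 2 (the real values depend on the kernel configuration), FIXADDR_TOP is an assumed name for 0xfffff000, and kmap_atomic_vaddr is a hypothetical helper that returns only the address without touching any page table.

```c
#define PAGE_SHIFT     12
#define FIXADDR_TOP    0xfffff000UL   /* assumed name for the top fix-mapped address */
#define KM_TYPE_NR     13             /* windows per CPU, as stated in the text */
#define NR_CPUS        2              /* hypothetical CPU count */
#define FIX_KMAP_BEGIN 4              /* hypothetical; depends on kernel config */
#define FIX_KMAP_END   (FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS) - 1)

/* Each fixmap index is one page below the previous one. */
static unsigned long fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((unsigned long)idx << PAGE_SHIFT);
}

/* Hypothetical helper: window address for a (type, cpu) pair. */
static unsigned long kmap_atomic_vaddr(unsigned int type, unsigned int cpu)
{
    unsigned int idx = type + KM_TYPE_NR * cpu;
    return fix_to_virt(FIX_KMAP_BEGIN + idx);
}
```

Windows of the same type on neighboring CPUs sit exactly KM_TYPE_NR pages apart, so two CPUs never share a window.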
To destroy a temporary kernel mapping, the kernel uses the kunmap_atomic() function. On the 80x86 architecture, this function decreases the preempt_count of the current process; thus, if the kernel control path was preemptible before it requested the temporary kernel mapping, it becomes preemptible again after the mapping is destroyed. In addition, kunmap_atomic() checks whether the TIF_NEED_RESCHED flag of current is set and, if so, calls schedule().
Now let's summarize temporary kernel mappings. As mentioned earlier, there is a "fixed mapping space" that grows downward from near the top of the 4 GB linear address space. Part of this space is used for temporary mappings of high memory.
This space has the following features:
1. Each CPU owns one portion of the space.
2. The portion owned by each CPU is divided into several small slots. Each slot is one page in size and serves one purpose; the purposes are defined by km_type in kmap_types.h.
To create a temporary mapping, you specify its purpose. The purpose selects the corresponding slot, and the address of that slot is used as the mapping address. This means that a new temporary mapping overwrites the previous mapping made for the same purpose.
kmap_atomic() implements temporary mappings.
Finally, we will use an online figure to summarize today's blog: