I. Preface
After the transition provided by Memory Initialization Code Analysis (I) and Memory Initialization Code Analysis (II), we finally reach the core of memory initialization: paging_init. This article cannot analyze the whole function (it would take too long), so we focus only on the code that creates the address mapping for system memory, that is, the map_mem function called from paging_init.
As before, we use kernel version 4.4.6, and the architecture-specific code is for ARM64.
II. Preparation Phase
Before going into the actual code analysis, let's review the current state of memory. In the vast physical address space, system memory occupies one or several address ranges. These are stored in the memory type array of the memblock module, and each memory region in the array describes one range of system memory (base, size and node ID). That is all of system memory, but not every part of the memory type array is free. Some areas have already been used or are reserved for later use; these are carved out of one or several ranges of the memory type array and recorded in the reserved type array. During system initialization (more precisely, until the memory management module completes its initialization), memory must be managed temporarily by a boot-time memory manager such as memblock (collecting memory layout information is only its sideline). Each allocation either adds a new item to the reserved type array or expands the size of one of its existing memory regions.
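The behaviour described above, where each allocation either adds a new item to the reserved array or grows an existing region, can be illustrated with a small user-space sketch. This is not the kernel's memblock code: the toy_region and toy_type structures, the TOY_MAX_REGIONS capacity and the toy_reserve function are hypothetical names made up for this illustration, and the real implementation additionally keeps regions sorted and merges neighbours in both directions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 8   /* hypothetical capacity for this sketch */

struct toy_region {
    uint64_t base;          /* physical start address */
    uint64_t size;          /* length in bytes */
};

struct toy_type {
    unsigned int cnt;       /* number of regions in use */
    struct toy_region regions[TOY_MAX_REGIONS];
};

/*
 * Record [base, base + size) as reserved. If the new range starts exactly
 * where an existing region ends, that region is grown; otherwise a new
 * region is appended. Returns 0 on success, -1 if the array is full.
 */
static int toy_reserve(struct toy_type *type, uint64_t base, uint64_t size)
{
    unsigned int i;

    assert(size != 0);
    for (i = 0; i < type->cnt; i++) {
        struct toy_region *r = &type->regions[i];

        if (r->base + r->size == base) {
            r->size += size;        /* expand an existing region */
            return 0;
        }
    }
    if (type->cnt == TOY_MAX_REGIONS)
        return -1;                  /* no room left in the array */
    type->regions[type->cnt].base = base;
    type->regions[type->cnt].size = size;
    type->cnt++;                    /* add a new region */
    return 0;
}
```

Reserving 0x1000 bytes at 0x80000000 and then another 0x1000 bytes at 0x80001000 leaves a single region of 0x2000 bytes, while a reservation at an unrelated address adds a second entry.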
With the memblock module, we have collected the memory layout (the memory type array) and we know which memory is free (the memory type array minus the reserved type array; in set theory terms, the reserved type array is a proper subset of the memory type array). But to manage this precious system memory, we first have to be able to access it (by the way, the addresses stored in the memblock arrays are physical addresses). From the previous articles we know that two pieces of memory are already visible (their address mappings are done): one is the kernel image segment and the other is the FDT segment. The vast remainder of system memory is still in the dark, waiting for us to rescue it (by creating its address mapping).
Finally, consider this question: does the memory type array represent the entire system RAM address space? Of course not. Some drivers may keep part of system memory for their own use and do not want the OS to manage it (or even see it); instead they create the address mapping for it themselves. If you are familiar with the reserved-memory node in DTS, you will know that a reserved memory region can carry the no-map attribute. In that case, when the reserved-memory node is parsed during kernel initialization, the region is removed from the memblock module. Since map_mem creates address mappings for everything in the memory type array, memory with the no-map attribute gets no mapping and stays outside the control of the OS.
III. Overview
The code that creates the system memory address mapping is in map_mem, as follows:
static void __init map_mem(void)
{
    struct memblock_region *reg;
    phys_addr_t limit;

    limit = PHYS_OFFSET + SWAPPER_INIT_MAP_SIZE;            /* (1) */
    memblock_set_current_limit(limit);

    for_each_memblock(memory, reg) {                        /* (2) */
        phys_addr_t start = reg->base;          /* start address of the region */
        phys_addr_t end = start + reg->size;    /* end address of the region */

        if (start >= end)                       /* sanity check */
            break;

        if (ARM64_SWAPPER_USES_SECTION_MAPS) {              /* (3) */
            if (start < limit)
                start = ALIGN(start, SECTION_SIZE);
            if (end < limit) {
                limit = end & SECTION_MASK;
                memblock_set_current_limit(limit);
            }
        }
        __map_memblock(start, end);                         /* (4) */
    }

    memblock_set_current_limit(MEMBLOCK_ALLOC_ANYWHERE);    /* (5) */
}
(1) First, restrict the current memblock allocation limit. This is necessary because creating a mapping may require allocating page table memory whenever the translation table at some level does not exist yet. At this point the buddy system is not ready, so dynamic allocation through it is impossible. memblock is ready, but not all system memory has an address mapping yet (the whole physical memory layout is known and stored in the memblock module, but creating the mapping for all of it is exactly the job of map_mem). If memblock_alloc returned physical memory that has no mapping, disaster would strike as soon as the kernel accessed it. How do we get around this? By restricting the memblock limit: once the upper limit is set, memblock_alloc will not allocate physical memory above it.
What should the upper limit be? The basic idea is to avoid allocating translation tables during the map_mem call wherever possible, by reusing the page tables that were defined statically. PHYS_OFFSET is the start address of physical memory, and SWAPPER_INIT_MAP_SIZE is the size of the kernel direct mapping created during the startup phase. In other words, for the range from PHYS_OFFSET to PHYS_OFFSET + SWAPPER_INIT_MAP_SIZE, all page tables (the translation tables at every level) already exist, so nothing needs to be allocated; we only need to write descriptors into them. Therefore, with the current memblock limit set here, mapping this range triggers no allocation (the page tables are ready), and any later allocation is served from memory that is already mapped.
(2) Create the address mapping for every memory type region in the system. Because the reserved type regions are a proper subset of the memory type regions, the mappings for reserved memory are created as well.
(3) If section maps are not used, the PGD through PTE page tables for the kernel direct mapping area are allocated statically, and aligning the start address and setting the memblock limit is enough to guarantee that create_mapping() does not allocate page table memory. However, consider the following scenario:
(A) the start or end address of the memory block is not aligned to 2M
(B) section maps are used
In this case, create_mapping() has to allocate PTE page table memory (an address that is not 2M aligned cannot use a section mapping). How do we get around this? Fortunately, the start address of the first memory block (the block that holds the kernel image) is guaranteed to be 2M aligned, so only the end address needs attention: the limit is lowered to end & SECTION_MASK, which ensures that any page table memory allocated already has an address mapping.
(4) The code of __map_memblock is as follows:
static void __init __map_memblock(phys_addr_t start, phys_addr_t end)
{
    create_mapping(start, __phys_to_virt(start), end - start,
                   PAGE_KERNEL_EXEC);
}
Note that after map_mem, all the descriptors created earlier by __create_page_tables have been overwritten by the new mappings, whose memory attributes are as follows:
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
Most memory attributes stay the same (for example MT_NORMAL, PTE_AF and PTE_SHARED). A few bits deserve explanation. PTE_UXN is the unprivileged execute-never bit, which prevents userspace from executing code from these addresses. PTE_DIRTY is a software-defined bit that the hardware does not touch; the OS uses it to mark whether the entry is clean or dirty. A dirty entry means the page has been written, so if the page is to be swapped out, the dirty data must be saved before the page is reclaimed. The explanation of PTE_WRITE is still TODO.
(5) The address mappings for all system memory are now in place, so the earlier upper limit is lifted and the memblock module can allocate memory freely.
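To make the arithmetic in steps (1) and (3) concrete, here is a minimal user-space sketch of the start alignment and the limit adjustment. It only mirrors the address calculations, not the kernel code itself: the toy_ names are hypothetical, the 2M section size is an assumption that corresponds to a 4K page configuration, and the limit value used in the example (PHYS_OFFSET plus 1G) is likewise assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTION_SIZE (UINT64_C(1) << 21)        /* 2 MB, assumed */
#define TOY_SECTION_MASK (~(TOY_SECTION_SIZE - 1))

/* Round x up to the next multiple of a, where a is a power of two. */
static uint64_t toy_align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/*
 * Mirror of step (3): adjust the start of a memory bank and the allocation
 * limit so that any page table allocated while mapping the bank comes from
 * memory that is already mapped.
 * Returns the (possibly raised) start address; *limit may be lowered.
 */
static uint64_t toy_adjust_bank(uint64_t start, uint64_t end, uint64_t *limit)
{
    assert(start < end);
    if (start < *limit)
        start = toy_align_up(start, TOY_SECTION_SIZE);
    if (end < *limit)
        *limit = end & TOY_SECTION_MASK;
    return start;
}
```

For a first bank covering 0x80000000 to 0x90180000 with an initial limit of 0xC0000000, the start stays at 0x80000000 and the limit drops to 0x90000000, the last 2M boundary below the end of the bank. A bank that lies entirely above the limit leaves both values untouched.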
IV. Filling the descriptors in the PGD
create_mapping actually calls the underlying __create_mapping function to complete the address mapping. The code is as follows:
static void __init create_mapping(phys_addr_t phys, unsigned long virt,
                                  phys_addr_t size, pgprot_t prot)
{
    if (virt < VMALLOC_START) {
        pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
                &phys, virt);
        return;
    }
    __create_mapping(&init_mm, pgd_offset_k(virt & PAGE_MASK), phys, virt,
                     size, prot, early_alloc);
}
create_mapping maps the physical memory that starts at physical address phys and has length size to the virtual address space starting at virt, with memory attributes prot. The kernel's virtual address space starts at VMALLOC_START, and an address below it is invalid, so the virtual address is checked first. The underlying __create_mapping function is then called with these arguments: init_mm is the memory descriptor of the kernel address space; pgd_offset_k finds the descriptor for the given virtual address in the kernel's PGD; early_alloc is the function called when memory has to be allocated during mapping (page tables need memory). The code of __create_mapping is as follows:
static void __create_mapping(struct mm_struct *mm, pgd_t *pgd,
                             phys_addr_t phys, unsigned long virt,
                             phys_addr_t size, pgprot_t prot,
                             void *(*alloc)(unsigned long size))
{
    unsigned long addr, length, end, next;

    addr = virt & PAGE_MASK;                                /* (1) */
    length = PAGE_ALIGN(size + (virt & ~PAGE_MASK));

    end = addr + length;
    do {                                                    /* (2) */
        next = pgd_addr_end(addr, end);                     /* (3) */
        alloc_init_pud(mm, pgd, addr, next, phys, prot, alloc); /* (4) */
        phys += next - addr;
    } while (pgd++, addr = next, addr != end);
}
To create an address mapping you need to know which address space it belongs to. Different processes have different address spaces, and struct mm_struct describes the virtual address space of a process. Here we are creating mappings in the kernel virtual address space, so the argument passed is init_mm. virt is the starting virtual address of the mapping to create. The descriptor in the PGD that corresponds to this virtual address is 8 bytes of memory, and pgd is a pointer to that descriptor.
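How a virtual address selects its 8-byte PGD descriptor can be sketched in a few lines of user-space C. This is an illustration, not the kernel's pgd_offset_k: the toy_ names are hypothetical, and the values of TOY_PGDIR_SHIFT (39) and TOY_PTRS_PER_PGD (512) assume a configuration with 4K pages and a 48-bit virtual address space; other configurations use different values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PGDIR_SHIFT   39    /* assumed: 4 KB pages, 48-bit VA */
#define TOY_PTRS_PER_PGD  512   /* assumed: entries in the PGD table */

typedef uint64_t toy_pgd_t;     /* one descriptor occupies 8 bytes */

/* Index of the PGD descriptor that covers virtual address virt. */
static unsigned int toy_pgd_index(uint64_t virt)
{
    return (unsigned int)((virt >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1));
}

/* Pointer to that descriptor inside the table that starts at pgd_base. */
static toy_pgd_t *toy_pgd_offset(toy_pgd_t *pgd_base, uint64_t virt)
{
    assert(pgd_base != NULL);
    return pgd_base + toy_pgd_index(virt);
}
```

Under these assumptions the address 0xffff800000000000 selects entry 256, the first entry of the upper half of the table.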
(1) Because the smallest unit of address mapping is a page, the virtual address of the mapping must be aligned to the page size, and so must the length. After alignment, the range defined by (addr, length) is page aligned and covers the range defined by (virt, size).
(2) The virtual address range (addr, length) may span multiple PGD entries, so a loop is needed that keeps calling alloc_init_pud until the whole range is mapped. alloc_init_pud also creates the entries in the lower-level translation tables (PUD, PMD, PTE).
(3) One descriptor in the PGD can map only a limited virtual address area (PGDIR_SIZE). The pgd_addr_end macro computes the end address of the area that contains addr; if that computed end address lies beyond the end argument passed in, it returns end instead. That is, if the mapping of (addr, length) spans multiple PGD entries, the next variable holds the starting virtual address of the next PGD entry.
(4) This function does two things: it fills the PGD entry, and it creates the subsequent PUD translation table (if necessary) and sets up the lower-level translation tables.
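The loop in steps (2) and (3) can be tried out in user space with the sketch below. It follows the logic of the pgd_addr_end macro as described above, but the toy_ names are hypothetical and TOY_PGDIR_SHIFT is again assumed to be 39 (4K pages, 48-bit virtual addresses). Instead of building page tables, the loop only counts how many PGD entries a range touches.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PGDIR_SHIFT 39                      /* assumed configuration */
#define TOY_PGDIR_SIZE  (UINT64_C(1) << TOY_PGDIR_SHIFT)
#define TOY_PGDIR_MASK  (~(TOY_PGDIR_SIZE - 1))

/*
 * End of the PGD-sized area that contains addr, clamped to end.
 * Subtracting 1 on both sides keeps the comparison correct when a
 * value has wrapped around to 0 at the top of the address space.
 */
static uint64_t toy_pgd_addr_end(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + TOY_PGDIR_SIZE) & TOY_PGDIR_MASK;

    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count how many PGD entries the range [addr, end) touches. */
static unsigned int toy_count_pgd_entries(uint64_t addr, uint64_t end)
{
    unsigned int n = 0;
    uint64_t next;

    assert(addr != end);
    do {
        next = toy_pgd_addr_end(addr, end);
        n++;                    /* alloc_init_pud() would be called here */
    } while (addr = next, addr != end);
    return n;
}
```

A range of two pages inside one PGD area yields a single iteration, while a range of TOY_PGDIR_SIZE bytes that starts one page past a PGD boundary needs two.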
V. Allocating the PUD page table memory and populating the appropriate descriptors
alloc_init_pud does not only operate on the PUD. It actually operates on a PGD entry, allocates and initializes the PUD, and sets up the subsequent translation tables. Filling a PGD entry requires the memory address of the corresponding PUD translation table, so if the PUD does not exist, alloc_init_pud must first allocate one (one page in size). Only once the physical address of the PUD translation table is known can the PGD entry be populated. The code is as follows:
static void alloc_init_pud(struct mm_struct *mm, pgd_t *pgd,
                           unsigned long addr, unsigned long end,
                           phys_addr_t phys, pgprot_t prot,
                           void *(*alloc)(unsigned long size))
{
    pud_t *pud;
    unsigned long next;

    if (pgd_none(*pgd)) {                                   /* (1) */
        pud = alloc(PTRS_PER_PUD * sizeof(pud_t));
        pgd_populate(mm, pgd, pud);
    }

    pud = pud_offset(pgd, addr);                            /* (2) */
    do {                                                    /* (3) */
        next = pud_addr_end(addr, end);

        if (use_1G_block(addr, next, phys)) {               /* (4) */
            pud_t old_pud = *pud;
            set_pud(pud, __pud(phys |
                               pgprot_val(mk_sect_prot(prot)))); /* (5) */

            if (!pud_none(old_pud)) {                       /* (6) */
                flush_tlb_all();                            /* (7) */
                if (pud_table(old_pud)) {
                    phys_addr_t table = __pa(pmd_offset(&old_pud, 0));
                    if (!WARN_ON_ONCE(slab_is_available()))
                        memblock_free(table, PAGE_SIZE);    /* (8) */
                }
            }
        } else {
            alloc_init_pmd(mm, pud, addr, next, phys, prot, alloc);
        }
        phys += next - addr;
    } while (pud++, addr = next, addr != end);
}
(1) If the current PGD entry is all zeros, there is no lower-level PUD table memory yet, so the memory for the PUD page table must be allocated. Note that the buddy system is still not ready at this point, so the allocation uses the memblock module. pgd_populate then links the PGD entry to the PUD page table memory.
(2) At this point the PUD page table memory exists, but which descriptor in the PUD does addr correspond to? pud_offset gives the answer: the pointer it returns points to the PUD descriptor for the address addr. Our next task is to populate the PUD entries.
(3) Although the virtual address range (addr, end) shares one PGD entry, it may cover multiple PUD entries. The loop fills the PUD entries one by one and allocates and initializes the next-level page tables.
(4) Without 1G block mappings, the logic here would be the same as in the previous section, except that the loop calls alloc_init_pmd instead of alloc_init_pud. However, the ARM64 MMU hardware offers a very powerful feature: 1G block mappings. Using it brings several benefits: no lower-level translation tables need to be allocated, which saves memory, and, more importantly, TLB misses drop significantly, which improves performance. Such a good feature should certainly be used, but under what conditions? First, the system must be configured with a 4K page size, because in that configuration one PUD entry covers a 1G memory block. In addition, the start and end virtual addresses and the physical address being mapped must all be aligned to 1G.
(5) Fill in one PUD descriptor and a 1G address mapping is done: no PMD or PTE page table memory, no PMD or PTE descriptors to walk. How simple, how wonderful. Assuming system memory is 4G and its physical address is 1G aligned (the virtual address PAGE_OFFSET is already 1G aligned), four PUD descriptors cover the whole linear mapping range of the kernel space.
(6) If the old PUD entry is not empty, a mapping for this address already exists (perhaps only a partial one). A simple example is the initial mapping of the kernel image, for which __create_page_tables created PUD and PMD entries. If section mappings could not be used, descriptors in a PTE table were created as well. These descriptors are now useless and can be discarded.
(7) Although the new page table entry has been written, the old translation may still be cached in the TLB, so the TLB must be flushed.
(8) If the old PUD entry was a table descriptor, meaning it pointed to a PMD table, that table's memory must be freed.
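The conditions listed in step (4) boil down to a page size check plus one mask operation. The following user-space sketch expresses them; the function name toy_use_1g_block and the TOY_ constants are hypothetical, and the shift values assume the 4K page configuration discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                       /* assumed: 4 KB pages */
#define TOY_PUD_SHIFT  30                       /* one PUD entry covers 1 GB */
#define TOY_PUD_SIZE   (UINT64_C(1) << TOY_PUD_SHIFT)
#define TOY_PUD_MASK   (~(TOY_PUD_SIZE - 1))

/*
 * A 1 GB block mapping is possible only with 4 KB pages and only when the
 * start virtual address, the end virtual address and the physical address
 * are all 1 GB aligned. OR-ing the three values lets one mask test check
 * all of them at once.
 */
static bool toy_use_1g_block(uint64_t addr, uint64_t next, uint64_t phys)
{
    assert(addr < next);
    if (TOY_PAGE_SHIFT != 12)
        return false;
    return ((addr | next | phys) & ~TOY_PUD_MASK) == 0;
}
```

With a 1G aligned virtual range and physical address 0x80000000 the check succeeds; moving the physical address to 0x80200000, or shortening the range to 2M, makes it fail, and the code falls back to the PMD level.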
VI. Allocating the PMD page table memory and populating the appropriate descriptors
Although 1G block mappings are good, they do not suit every system. Let's look at the case where the PUD entry is filled with a table descriptor (a descriptor that points to a PMD translation table):
static void alloc_init_pmd(struct mm_struct *mm, pud_t *pud,
                           unsigned long addr, unsigned long end,
                           phys_addr_t phys, pgprot_t prot,
                           void *(*alloc)(unsigned long size))
{
    pmd_t *pmd;
    unsigned long next;

    if (pud_none(*pud) || pud_sect(*pud)) {                 /* (1) */
        pmd = alloc(PTRS_PER_PMD * sizeof(pmd_t));  /* allocate PMD page table memory */
        if (pud_sect(*pud)) {                               /* (2) */
            split_pud(pud, pmd);
        }
        pud_populate(mm, pud, pmd);                         /* (3) */
        flush_tlb_all();
    }
    BUG_ON(pud_bad(*pud));

    pmd = pmd_offset(pud, addr);                            /* (4) */
    do {
        next = pmd_addr_end(addr, end);
        if (((addr | next | phys) & ~SECTION_MASK) == 0) {  /* (5) */
            pmd_t old_pmd = *pmd;
            set_pmd(pmd, __pmd(phys |
                               pgprot_val(mk_sect_prot(prot))));
            if (!pmd_none(old_pmd)) {                       /* (6) */
                flush_tlb_all();
                if (pmd_table(old_pmd)) {
                    phys_addr_t table = __pa(pte_offset_map(&old_pmd, 0));
                    if (!WARN_ON_ONCE(slab_is_available()))
                        memblock_free(table, PAGE_SIZE);
                }
            }
        } else {
            alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
                           prot, alloc);
        }
        phys += next - addr;
    } while (pmd++, addr = next, addr != end);
}
(1) Two scenarios require allocating PMD page table memory. In the first, the PUD entry is empty, so the subsequent PMD page table must be allocated. In the second, the old PUD entry is a section descriptor that maps a 1G block, and for some reason it now has to be modified, so the 1G section mapping must be removed.
(2) Although a new mapping is being created, the old 1G mapping must be preserved, because this time we may only want to update part of it. So we first call split_pud to convert the 1G block mapping into an equivalent mapping through a PMD table: one section descriptor in the PUD (1G in size) becomes 512 section descriptors in the PMD (2M each). The form changes but the substance does not: the result is still the same 1G address mapping.
(3) Update the PUD entry to point to the new PMD page table memory, and flush the TLB.
(4) From here on, the logic is similar to alloc_init_pud. If a 2M section mapping is not possible, the loop calls alloc_init_pte to create the mapping; we skip that here and focus on the handling of 2M section mappings.
(5) If the requirements for a 2M section are met (addr, next and phys are all 2M aligned), call set_pmd to populate the PMD entry.
(6) If the old PMD entry was not empty, flush the TLB; if it was a table descriptor pointing to a PTE table, also free the memory of that now unneeded PTE page table.
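The splitting described in step (2) is easy to picture with a toy descriptor format. The sketch below is not the kernel's split_pud: it assumes, purely for illustration, that a descriptor is a physical address with all attribute bits held in the low 12 bits (real ARM64 descriptors also carry attributes in the upper bits), and the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PTRS_PER_PMD 512
#define TOY_PMD_SIZE     (UINT64_C(1) << 21)    /* 2 MB per PMD section */
#define TOY_PUD_SIZE     (UINT64_C(1) << 30)    /* 1 GB per PUD block */
#define TOY_ATTR_MASK    UINT64_C(0xfff)        /* toy layout: attributes in low 12 bits */

/*
 * Turn one 1 GB block descriptor into 512 consecutive 2 MB section
 * descriptors that carry the same attributes and together cover exactly
 * the same physical range.
 */
static void toy_split_pud(uint64_t old_pud, uint64_t pmd_table[TOY_PTRS_PER_PMD])
{
    uint64_t attrs = old_pud & TOY_ATTR_MASK;
    uint64_t addr = old_pud & ~TOY_ATTR_MASK;
    int i;

    assert((addr & (TOY_PUD_SIZE - 1)) == 0);   /* a 1 GB block is 1 GB aligned */
    for (i = 0; i < TOY_PTRS_PER_PMD; i++) {
        pmd_table[i] = addr | attrs;
        addr += TOY_PMD_SIZE;
    }
}
```

Since 512 entries of 2M each add up to 1G, the mapping after the split is the same as before; only its granularity has changed, which is what allows a single 2M piece to be updated afterwards.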
VII. Allocating the PTE page table memory and populating the appropriate descriptors
static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
                           unsigned long end, unsigned long pfn,
                           pgprot_t prot,
                           void *(*alloc)(unsigned long size))
{
    pte_t *pte;

    if (pmd_none(*pmd) || pmd_sect(*pmd)) {                 /* (1) */
        pte = alloc(PTRS_PER_PTE * sizeof(pte_t));
        if (pmd_sect(*pmd))
            split_pmd(pmd, pte);                            /* (2) */
        __pmd_populate(pmd, __pa(pte), PMD_TYPE_TABLE);     /* (3) */
        flush_tlb_all();
    }
    BUG_ON(pmd_bad(*pmd));

    pte = pte_offset_kernel(pmd, addr);
    do {
        set_pte(pte, pfn_pte(pfn, prot));                   /* (4) */
        pfn++;
    } while (pte++, addr += PAGE_SIZE, addr != end);
}
(1) Reaching this function means that page descriptors at the PTE level must be created, so PTE page table memory has to be allocated. There are two scenarios: either the address was never mapped, or a mapping exists but it is a section mapping, which does not meet the requirement.
(2) If there was a section mapping before, it must be split into 512 page descriptors in the PTE table.
(3) Make the PMD entry point to the new PTE page table memory. Note that if the previous PMD entry was empty, the new PTE page table contains 512 invalid descriptors; if it was a section mapping, the new PTE page table has already been populated with 512 page descriptors by split_pmd.
(4) Loop over the address range (addr, end) and set the page descriptors in the PTE table.
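The fill loop of step (4) can also be sketched in user space. As before, this is only an illustration under stated assumptions: 4K pages with 512 entries per PTE table, a toy descriptor made of the page frame address OR-ed with attribute bits in the low 12 bits, and hypothetical toy_ names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12                     /* assumed: 4 KB pages */
#define TOY_PAGE_SIZE    (UINT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_PTRS_PER_PTE 512

/* Index of the PTE descriptor that covers virtual address addr. */
static unsigned int toy_pte_index(uint64_t addr)
{
    return (unsigned int)((addr >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_PTE - 1));
}

/*
 * Write page descriptors for [addr, end), starting with page frame pfn.
 * The range must be page aligned and must fit inside one PTE table.
 */
static void toy_fill_ptes(uint64_t *pte_table, uint64_t addr, uint64_t end,
                          uint64_t pfn, uint64_t attrs)
{
    uint64_t *pte = pte_table + toy_pte_index(addr);

    assert(addr < end);
    assert(addr % TOY_PAGE_SIZE == 0 && end % TOY_PAGE_SIZE == 0);
    assert((end - addr) / TOY_PAGE_SIZE <= TOY_PTRS_PER_PTE - toy_pte_index(addr));
    do {
        *pte = (pfn << TOY_PAGE_SHIFT) | attrs; /* one descriptor per page */
        pfn++;
    } while (pte++, addr += TOY_PAGE_SIZE, addr != end);
}
```

Mapping the two pages at 0xffff800000003000 and 0xffff800000004000 to frames 0x80003 and 0x80004 fills entries 3 and 4 of the table and leaves the neighbouring entries untouched.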
References:
1. ARMv8 technical manual
2. Linux 4.4.6 kernel source code
Linux Memory Initialization (IV): Creating System Memory Address Mappings