This article briefly introduces how the Linux kernel on the x86-32 architecture switches from real mode to protected mode and enables the paging mechanism after being loaded into memory by the boot loader (such as GRUB). It does not cover how the boot loader loads the kernel into memory, because that is the boot loader's business and has little to do with the kernel itself (of course, there must be a pre-agreed protocol between them). Because the startup code does not change frequently, this analysis is basically applicable from version 2.6.24 through 3.0.4. For simplicity, we focus on the common case in which PAE is not enabled. Before reading this article, make sure you understand real mode, protected mode, and the basic paging mechanism.
First, let's see where the boot loader loads the kernel; this makes clear where the kernel sits in physical memory.
In general, the kernel image is divided into two parts. The code under the arch/x86/boot directory is loaded into the first 1 MB of physical memory, that is, into the region between some address X and X + 0x8000. This part is called the setup code and is 16-bit real-mode code. The other part is the main body of the kernel; it is loaded above the 1 MB mark in compressed form and is decompressed before the kernel proper starts. Note that during linking, the main kernel code is given addresses starting at 0xc0100000.
The kernel starts running at the _start label in the arch/x86/boot/header.S file, with the CPU in real mode. Note that there is still a piece of code before the _start label, located between the bootsect_start label and the _start label. It is exactly 512 bytes long and its last two bytes are 0xaa55: this is a classic boot sector. However, the modern Linux kernel no longer supports booting from the boot sector. If you study this code, you will find that it only prints the string at bugger_off_msg, which tells you that booting directly from a floppy is no longer supported and asks you to use a boot loader instead.
The first instruction at the _start label is an intra-segment jump written directly as binary machine code:
112 	.byte	0xeb		# short (2-byte) jump
113 	.byte	start_of_setup-1f
114 1:
This instruction jumps to the start_of_setup label. Between the two labels lies a large block of variables used for kernel loading and initialization. Some of their values are filled in at kernel build time, and some are written by the boot loader when it loads the kernel. These variables are the protocol, mentioned above, through which the boot loader and the kernel communicate. We will ignore how these values are initialized and simply assume they are available once the kernel has been loaded.
Starting from the start_of_setup label, the kernel performs some simple initialization, such as setting up the stack and heap. Let's look at the few lines at the end:
292 	# Zero the BSS
293 	movw	$__bss_start, %di
294 	movw	$_end+3, %cx
295 	xorl	%eax, %eax
296 	subw	%di, %cx
297 	shrw	$2, %cx
298 	rep; stosl
299
300 	# Jump to C code (should not return)
301 	calll	main
Lines 292-298 zero the .bss section. In ordinary application development, uninitialized global variables in a program are placed in the .bss section, and the loader (note: not the boot loader mentioned above, but the loader that starts ordinary programs under a running operating system) zeroes that section when it loads the program into memory. Here, however, we are studying operating-system kernel code; the OS has not started yet, let alone a loader, so the kernel must clear its own .bss data. The __bss_start and _end labels are not defined in the code but in the arch/x86/boot/setup.ld linker script.
The calll on line 300 takes the kernel into the void main(void) function in arch/x86/boot/main.c. Yes, we have entered C code, but this C code still runs in real mode. The main function calls a number of other initialization functions; we focus on the last one it calls, go_to_protected_mode(). As the name suggests, this function switches the kernel into protected mode. It is located in arch/x86/boot/pm.c. After calling enable_a20() to open the A20 address line, it calls the following three functions in sequence:
122 	setup_idt();
123 	setup_gdt();
124 	protected_mode_jump(boot_params.hdr.code32_start,
125 			    (u32)&boot_params + (ds() << 4));
These functions construct the IDT and the GDT in turn, then switch to protected mode. Here is the setup_gdt() function:
66 static void setup_gdt(void)
67 {
68 	/* There are machines which are known to not boot with the GDT
69 	   being 8-byte unaligned.  Intel recommends 16 byte alignment. */
70 	static const u64 boot_gdt[] __attribute__((aligned(16))) = {
71 		/* CS: code, read/execute, 4 GB, base 0 */
72 		[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
73 		/* DS: data, read/write, 4 GB, base 0 */
74 		[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
75 		/* TSS: 32-bit TSS, 104 bytes, base 4096 */
76 		/* We only have a TSS here to keep Intel VT happy;
77 		   we don't actually use it for anything. */
78 		[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
79 	};
80 	/* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead
81 	   of the gdt_ptr contents.  Thus, make it static so it will
82 	   stay in memory, at least long enough that we switch to the
83 	   proper kernel GDT. */
84 	static struct gdt_ptr gdt;
85
86 	gdt.len = sizeof(boot_gdt)-1;
87 	gdt.ptr = (u32)&boot_gdt + (ds() << 4);
88
89 	asm volatile("lgdtl %0" : : "m" (gdt));
90 }
This code is simple. It initializes the boot_gdt array, the legendary GDT, then initializes the gdt variable, which records the physical address and length of the GDT, and finally loads that address and length into the GDTR register with the lgdtl instruction. Clearly, the kernel defines only three segments here: a code segment, a data segment, and a TSS segment. Both the code segment and the data segment are defined as a 4 GB linear address space starting at 0x00000000.
Note the "extra" ds() << 4 added when computing the physical address of the GDT on line 87. So far, the kernel is still running in real mode, where the CPU forms addresses as "segment base * 16 + offset". &boot_gdt yields the offset of the boot_gdt array within the current data segment; adding ds() << 4 yields the array's linear address. Furthermore, because the paging mechanism is not yet enabled, the linear address equals the physical address.
After the GDT is initialized, the kernel calls the protected_mode_jump() function to switch to protected mode. This function is located in arch/x86/boot/pmjump.S. Yes, this is another piece of assembly; from here on, the kernel jumps back and forth between assembly and C code. The code is as follows:
26 GLOBAL(protected_mode_jump)
27 	movl	%edx, %esi	# Pointer to boot_params table
28
29 	xorl	%ebx, %ebx
30 	movw	%cs, %bx
31 	shll	$4, %ebx
32 	addl	%ebx, 2f
33 	jmp	1f		# Short jump to serialize on 386/486
34 1:
35
36 	movw	$__BOOT_DS, %cx
37 	movw	$__BOOT_TSS, %di
38
39 	movl	%cr0, %edx
40 	orb	$X86_CR0_PE, %dl	# Protected mode
41 	movl	%edx, %cr0
42
43 	# Transition to 32-bit mode
44 	.byte	0x66, 0xea	# ljmpl opcode
45 2:	.long	in_pm32		# offset
46 	.word	__BOOT_CS	# segment
47 ENDPROC(protected_mode_jump)
Before analyzing the code, let's first introduce the parameter-passing convention used when calling C functions here. The common calling conventions (stdcall, cdecl) pass parameters on the stack, pushed from right to left. The code under arch/x86/boot, however, is compiled with a register convention (GCC's regparm(3), similar to fastcall): the first three parameters are passed, from left to right, in the %eax, %edx, and %ecx registers; any parameters beyond three go on the stack.
Combined with the code in go_to_protected_mode(), we know that on entry to protected_mode_jump(), %eax holds boot_params.hdr.code32_start and %edx holds (u32)&boot_params + (ds() << 4). The boot_params.hdr.code32_start parameter is important: it is the start address of the compressed kernel mentioned above. You might guess that it is simply the code32_start variable following the hdr label in arch/x86/boot/header.S, copied into boot_params.hdr. That is indeed where it comes from, but its value is no longer the one assembled into header.S: after the boot loader loads the kernel into memory, it overwrites code32_start with the memory location of the compressed kernel.
Inside protected_mode_jump(), lines 29-32 do something curious: they shift the value of %cs left by four bits and add it to the 32-bit operand stored at label 2, which initially holds the offset of the in_pm32 label. After %cs << 4 is added, that operand becomes the real-mode linear address of in_pm32. The purpose of this will become clear shortly. Lines 39-41 set the PE flag in the %cr0 register, enabling protected mode.
Then comes the climax. Lines 44-46, like the jump at the _start label in boot/header.S, form a jump instruction written in raw bytes, but this one is an inter-segment (far) jump. The operand at line 45 specifies the offset of the destination. Thanks to lines 29-32, this offset now equals the real-mode linear address of in_pm32. A linear address can be viewed as an offset from a base address of 0, and, bingo: according to the GDT initialized earlier, the __BOOT_CS selector at line 46 selects a segment whose base happens to be exactly 0, so the far jump lands precisely on in_pm32! From this point on, the kernel runs in protected mode.
The last instruction at in_pm32 is:
76 jmpl * % eax # Jump to the 32-bit entrypoint
This instruction jumps to the code addressed by %eax, which is the startup_32 label in arch/x86/boot/compressed/head_32.S. The main job of that code is to decompress the compressed kernel in memory and jump to the startup_32 label in arch/x86/kernel/head_32.S. Yes, there are two startup_32 labels, but they do not conflict, because the two pieces of code are not linked together. In the latter code, the kernel builds the provisional kernel page tables and enables the paging mechanism:
202 page_pde_offset = (__PAGE_OFFSET >> 20);
203
204 	movl $pa(__brk_base), %edi
205 	movl $pa(initial_page_table), %edx
206 	movl $PTE_IDENT_ATTR, %eax
207 10:
208 	leal PDE_IDENT_ATTR(%edi),%ecx		/* Create PDE entry */
209 	movl %ecx,(%edx)			/* Store identity PDE entry */
210 	movl %ecx,page_pde_offset(%edx)		/* Store kernel PDE entry */
211 	addl $4,%edx
212 	movl $1024, %ecx
213 11:
214 	stosl
215 	addl $0x1000,%eax
216 	loop 11b
217 	/*
218 	 * End condition: we must map up to the end + MAPPING_BEYOND_END.
219 	 */
220 	movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
221 	cmpl %ebp,%eax
222 	jb 10b
223 	addl $__PAGE_OFFSET, %edi
224 	movl %edi, pa(_brk_end)
225 	shrl $12, %eax
226 	movl %eax, pa(max_pfn_mapped)
The page table constructed by this code maps two linear ranges, one starting at linear address 0x00000000 and one starting at __PAGE_OFFSET, onto the same physical space. First, let's introduce the key macros and labels in the code.
1) The macro __PAGE_OFFSET is defined as 0xc0000000 by default; the linear address space starting at __PAGE_OFFSET is kernel space.
2) page_pde_offset on line 202 is the byte offset, within the page directory, of the directory entry corresponding to the virtual address __PAGE_OFFSET. It may be clearer written as: page_pde_offset = (__PAGE_OFFSET >> (10 + 12)) << 2.
3) The __brk_base label is the start address of the page tables; like the __bss_start label mentioned above, it is defined by the kernel/vmlinux.lds linker script.
4) The initial_page_table label is defined later in the kernel/head_32.S file, where 4 KB of space is reserved for it; it serves as the page directory of the whole provisional page table.
5) The macros PTE_IDENT_ATTR and PDE_IDENT_ATTR are the attribute bits for page table entries and page directory entries respectively.
6) The macro pa() is the famous conversion from kernel-space linear addresses to physical addresses. It is equivalent to: #define pa(x) ((x) - __PAGE_OFFSET). Of course, there is also an inverse conversion. This macro lets the kernel code access memory correctly before paging is enabled. As mentioned above, this part of the kernel is loaded at the 1 MB mark of physical memory but linked at addresses starting from 0xc0100000, so before paging is on, the code must subtract 0xc0000000 from each address to reach the right physical memory. In effect, we can regard pa() as a simple paging mechanism of its own.
7) pa(_end) + MAPPING_BEYOND_END is the end of the physical range that the page table must map.
The following is a detailed analysis of the code.
Lines 204-205 initialize %edi to the physical address of the first page table and %edx to the physical address of the first entry in the page directory.
Line 208 constructs a page directory entry in %ecx: the page-table address field is the value of %edi, and the attribute field is PDE_IDENT_ATTR.
Line 209 stores the directory entry in %ecx into the entry pointed to by %edx. This is clearly the identity mapping of physical memory starting at linear address 0x00000000.
Line 210 stores the same directory entry at (page_pde_offset + %edx). This is clearly the mapping of physical memory starting at linear address __PAGE_OFFSET; the two linear ranges thus map onto the same physical space.
Line 211 adds 4 to %edx so that it points to the next directory entry.
Lines 212-216 fill the page table that %edi points to. %eax was initialized on line 206 with the page-table-entry attribute PTE_IDENT_ATTR. Line 212 makes the loop iterate 1024 times; each iteration (line 214) stores one page table entry, and line 215 then adds 0x1000 to the physical page address field in %eax. The loop thus maps physical pages in order from low addresses to high. When it finishes, %edi holds the physical address of the next page table.
Lines 220-222: if the last page table entry filled is still below pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, jump back to label 10 and build another page table; otherwise the mapping is complete.
Lines 223-224 set the variable at the _brk_end label to the linear address of the end of the page tables.
Lines 225-226 set the variable at the max_pfn_mapped label to the total number of physical pages mapped by this page table.
At this point, the provisional page table is built. The final kernel page tables are created later, in the kernel_physical_mapping_init() function in arch/x86/mm/init_32.c. Its main code follows; parts we do not care about have been removed.
279 repeat:
280 	pages_2m = pages_4k = 0;
281 	pfn = start_pfn;
282 	pgd_idx = pgd_index((pfn<<PAGE_SHIFT) + PAGE_OFFSET);
283 	pgd = pgd_base + pgd_idx;
284 	for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
285 		pmd = one_md_table_init(pgd);
286
287 		if (pfn >= end_pfn)
288 			continue;
293 		pmd_idx = 0;
295 		for (; pmd_idx < PTRS_PER_PMD && pfn < end_pfn;
296 		     pmd++, pmd_idx++) {
297 			unsigned int addr = pfn * PAGE_SIZE + PAGE_OFFSET;
298
330 			pte = one_page_table_init(pmd);
331
332 			pte_ofs = pte_index((pfn<<PAGE_SHIFT) + PAGE_OFFSET);
333 			pte += pte_ofs;
334 			for (; pte_ofs < PTRS_PER_PTE && pfn < end_pfn;
335 			     pte++, pfn++, pte_ofs++, addr += PAGE_SIZE) {
336 				pgprot_t prot = PAGE_KERNEL;
337 				/*
338 				 * First pass will use the same initial
339 				 * identity mapping attribute.
340 				 */
341 				pgprot_t init_prot = __pgprot(PTE_IDENT_ATTR);
342
343 				if (is_kernel_text(addr))
344 					prot = PAGE_KERNEL_EXEC;
345
346 				pages_4k++;
347 				if (mapping_iter == 1) {
348 					set_pte(pte, pfn_pte(pfn, init_prot));
349 					last_map_addr = (pfn << PAGE_SHIFT) + PAGE_SIZE;
350 				} else
351 					set_pte(pte, pfn_pte(pfn, prot));
352 			}
353 		}
354 	}
355 	if (mapping_iter == 1) {
363 		/*
364 		 * Local global flush TLB, which will flush the previous
365 		 * mappings present in both small and large page TLB's.
366 		 */
367 		__flush_tlb_all();
368
369 		/*
370 		 * Second iteration will set the actual desired PTE attributes.
371 		 */
372 		mapping_iter = 2;
373 		goto repeat;
374 	}
Linux abstracts the paging mechanism into a four-level model: PGD (page global directory), PUD (page upper directory), PMD (page middle directory), and PT (page table). A 32-bit kernel without PAE needs only two levels, PGD and PT; PUD and PMD are "folded", that is, PGD, PUD, and PMD are the same thing. Before explaining the code in detail, here are several key macros and functions.
1) The macro pgd_index(vaddr) returns the index of the PGD entry corresponding to the linear address vaddr.
2) The one_md_table_init(pgd) function creates and returns the next-level PMD table corresponding to the pgd entry. In a 32-bit kernel, this function simply returns pgd, i.e. the PMD table is folded into the PGD table.
3) The one_page_table_init(pmd) function creates and returns the next-level PT table corresponding to the pmd entry.
4) The set_pmd(pmd, pmde) function fills the PMD entry with pmde.
5) The set_pte(pte, ptee) function fills the PTE with ptee.
6) The macros PTRS_PER_PGD, PTRS_PER_PMD, and PTRS_PER_PTE are the number of entries in the PGD, PMD, and PT tables respectively. In a 32-bit kernel without PAE, they are defined as 1024, 1, and 1024.
This code maps the physical space starting at physical page start_pfn to the linear space starting at the linear address PAGE_OFFSET. The page table is built in two passes; the two passes differ only in the attribute field of the page table entries they fill, and are otherwise identical.
Lines 282-283 initialize the pointer to the PGD entry corresponding to physical page pfn.
Line 285 would create and return the PMD table attached to the PGD entry; however, in a 32-bit kernel, this function simply returns pgd, i.e. pmd equals pgd.
The loop at line 295 iterates only once in a 32-bit kernel, since PTRS_PER_PMD is 1.
Line 330 creates and returns the PT table pte corresponding to the PMD entry.
Lines 332-333 point pte at the PT entry corresponding to physical page pfn.
The loop starting at line 334 fills the PT entries from pte onward.