So when does a program use the user stack, and when the kernel stack? The answer: on a system call. When C library functions such as printf, open, read, and write are executed, they eventually invoke the corresponding system calls, such as sys_open and sys_read, and at that moment the CPU switches to the kernel stack.
1. Stack switching in Linux
We will discuss the 80x86. In fact, Linux uses stacks in only four situations (SS and ESP together locate the current top of the stack):
• The temporary stack used in real mode while the system boots and initializes itself
• The stack set up for kernel code once the CPU enters protected mode; this same stack is later used as the user-mode stack of process 0
• The stack each process uses when it executes kernel code via a system call, called the process's kernel-mode stack; every process has its own independent kernel-mode stack
• The stack a process uses while executing in user mode, located near the end of the process's logical address space
The following describes the two stacks associated with ordinary processes.
Each process has two stacks, one for executing user-mode code and one for executing kernel-mode code, called the user-mode stack and the kernel-mode stack respectively.
Apart from being used at different CPU privilege levels, the main difference between the two stacks is size: a task's kernel-mode stack is very small; as we will see in the later process-management topics, it can hold at most 8192 bytes of data (in practice less, since it shares this space with the process descriptor). The user-mode stack, by contrast, starts near the top of the process's 4 GB user space and can grow downward to a very large size.
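As a concrete illustration (a sketch based on the 2.4 kernel; names and sizes vary slightly across versions), the kernel-mode stack shares its two page frames with the process descriptor, which is also what makes the kernel's current macro work:

/* Based on 2.4's include/linux/sched.h and include/asm-i386/current.h. */
union task_union {
	struct task_struct task;	/* process descriptor, at the low end */
	unsigned long stack[2048];	/* 8192-byte kernel-mode stack, grows down */
};

/* Because every task_union starts on an 8 KB boundary, masking the low
 * 13 bits of the kernel ESP recovers the current process descriptor: */
static inline struct task_struct * get_current(void)
{
	struct task_struct *current;
	__asm__("andl %%esp,%0; " : "=r" (current) : "0" (~8191UL));
	return current;
}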
When running in user mode, each process (except process 0 and process 1) has its own 4 GB address space. When a process is first created, its user-mode stack pointer is set near the end of that address space. The application uses this stack while the process runs in user mode, and the physical memory behind it is supplied through the CPU's paging mechanism.
When running in kernel mode, each task uses its own kernel-mode stack, which serves the task while it executes kernel code, that is, after it has entered a system call. The position of this stack in the linear address space is given by the ss0 and esp0 fields of the process's TSS segment. Where do those two values come from?
As mentioned in our "Memory Management" topic, Linux uses 80x86 segmentation only symbolically, defining essentially just code segments and data segments. The SS register in the CPU points to the stack segment, but Linux does not use a dedicated stack segment; instead it uses part of a data segment as the stack. Therefore, when the CPL is 3, SS points to the user stack inside the user data segment; when the CPL is 0, SS points to the kernel stack inside the kernel data segment. Note this well! It is important, especially when we discuss process switching later; without this piece of knowledge, that material will drive you crazy.
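Concretely, in the 2.4 kernel these four segments are selected through the following constants from include/asm-i386/segment.h. Note that the low two bits of each selector encode the privilege level, which is exactly the CPL 0 / CPL 3 distinction just described:

#define __KERNEL_CS	0x10	/* GDT index 2, RPL 0 */
#define __KERNEL_DS	0x18	/* GDT index 3, RPL 0 */
#define __USER_CS	0x23	/* GDT index 4, RPL 3 */
#define __USER_DS	0x2B	/* GDT index 5, RPL 3 */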
In addition to the user data segment, user code segment, kernel data segment, and kernel code segment, Linux uses a few other specialized segments, which we will discuss here. In a uniprocessor system there is only one GDT, while in a multiprocessor system each CPU has its own GDT. All GDTs are stored in the cpu_gdt_table array, and their addresses and sizes (used when initializing the GDTR register) are stored in the cpu_gdt_descr array; these symbols are defined in the file arch/i386/kernel/head.S.
Let's extend this knowledge. Starting with the 80286, the 80x86 family gained a new kind of segment, the Task State Segment (TSS), which is mainly used to store the contents of the processor's registers. Linux keeps one TSS-related data structure per processor. The linear address range of each TSS is a small subset of the linear address range of the kernel data segment. All task state segments are stored sequentially in the init_tss array; notably, the Base field of the n-th CPU's TSS descriptor points to the n-th element of init_tss. The G (granularity) flag is cleared to 0, while the Limit field is set to 0xeb, because the TSS segment is 236 bytes long. The Type field is set to 9 or 11 (available 32-bit TSS), and the DPL is set to 0, because user-mode processes are not allowed to access TSS segments.
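These field values match what the kernel actually writes into the GDT. In the 2.4 source, for instance, the TSS descriptor is installed roughly as follows (a sketch after include/asm-i386/desc.h; note that 235 is the 0xeb limit and 0x89 encodes P = 1, DPL = 0, Type = 9):

/* install the n-th CPU's TSS descriptor into the GDT */
#define set_tss_desc(n,addr) \
	_set_tssldt_desc(&gdt_table[__TSS(n)], (int)(addr), 235, 0x89)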
Now, back to the question raised a moment ago. We said that a user process must enter the kernel to reach the data structures and functions the kernel provides, that is, it must switch from user mode to kernel mode. So where does the kernel stack address come from?
When a process enters kernel mode from user mode through an interrupt or exception, the privilege level changes (kernel mode's CPL 0 is more privileged), so a stack switch is required. The CPU reads the tr register to locate the TSS of the current process, and loads the SS and ESP registers from the kernel-mode stack fields ss0 and esp0 stored there. This accomplishes the switch from the user stack to the kernel stack. In the process, the CPU automatically pushes the user-mode SS and ESP (together with EFLAGS, CS, and EIP) onto the new kernel stack, and the kernel then saves the remaining registers there with a sequence of push instructions.
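For reference, the saved context ends up on the kernel-mode stack in the fixed layout described by struct pt_regs; here is the i386 definition as it appears in the 2.4 kernel's include/asm-i386/ptrace.h (the last two fields hold the user-mode ESP and SS that the CPU pushed during the privilege change):

struct pt_regs {
	long ebx;		/* saved by the kernel's SAVE_ALL entry code */
	long ecx;
	long edx;
	long esi;
	long edi;
	long ebp;
	long eax;
	int  xds;
	int  xes;
	long orig_eax;		/* system call number, or IRQ cookie */
	long eip;		/* from here down: pushed by the CPU itself */
	int  xcs;
	long eflags;
	long esp;		/* user-mode ESP (only on a privilege change) */
	int  xss;		/* user-mode SS  (only on a privilege change) */
};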
At the end of interrupt or exception handling, the CPU control unit executes the iret instruction, which re-reads the saved contents from the kernel stack, restores the CPU registers, and resumes the user-mode process from the values found on that stack.
It should also be emphasized that the kernel-entry stack values are recorded in only one place (one per CPU on multiprocessor machines): the ss0 and esp0 fields of the TSS, which user-mode processes are not allowed to access. The data structure Linux uses to describe the TSS format is tss_struct (the version below is from the 2.6 i386 tree's include/asm-i386/processor.h):
struct tss_struct {
	unsigned short	back_link, __blh;
	unsigned long	esp0;
	unsigned short	ss0, __ss0h;
	unsigned long	esp1;
	unsigned short	ss1, __ss1h;	/* ss1 is used to cache MSR_IA32_SYSENTER_CS */
	unsigned long	esp2;
	unsigned short	ss2, __ss2h;
	unsigned long	__cr3;
	unsigned long	eip;
	unsigned long	eflags;
	unsigned long	eax, ecx, edx, ebx;
	unsigned long	esp;
	unsigned long	ebp;
	unsigned long	esi;
	unsigned long	edi;
	unsigned short	es, __esh;
	unsigned short	cs, __csh;
	unsigned short	ss, __ssh;
	unsigned short	ds, __dsh;
	unsigned short	fs, __fsh;
	unsigned short	gs, __gsh;
	unsigned short	ldt, __ldth;
	unsigned short	trace, io_bitmap_base;
	/*
	 * The extra 1 is there because the CPU will access an
	 * additional byte beyond the end of the IO permission
	 * bitmap. The extra byte must be all 1 bits, and must
	 * be within the limit.
	 */
	unsigned long	io_bitmap[IO_BITMAP_LONGS + 1];
	/*
	 * Cache the current maximum and the last task that used the bitmap:
	 */
	unsigned long	io_bitmap_max;
	struct thread_struct *io_bitmap_owner;
	/*
	 * pads the TSS to be cacheline-aligned (size is 0x100)
	 */
	unsigned long	__cacheline_filler[35];
	/*
	 * .. and then another 0x100 bytes for emergency kernel stack
	 */
	unsigned long	stack[64];
} __attribute__((packed));
That is the entire content of the TSS segment; there is not much to it. At each switch, the kernel updates certain TSS fields so that the CPU control unit can safely retrieve the information it needs, which is one facet of Linux's security model. The TSS therefore only reflects the privilege context of the process currently running on the CPU; there is no need to maintain TSSs for processes that are not running.
Kernels before Linux 2.4 had a hard upper limit on the number of processes. The reason is that each process had its own TSS and LDT, and both the TSS descriptor and the LDT descriptor had to be placed in the GDT. The GDT can hold at most 8192 descriptors, 12 of which are reserved for the system, so the maximum number of processes was (8192 - 12)/2 = 4090.
Since Linux 2.4, all processes share TSSs. To be precise, each CPU has one TSS, and all processes running on that CPU use the same one. The TSS array is declared in include/asm-i386/processor.h as follows:
extern struct tss_struct init_tss[NR_CPUS];
The TSS is initialized and loaded along the path start_kernel() -> trap_init() -> cpu_init():
void __init cpu_init (void)
{
	int nr = smp_processor_id();		// obtain the current CPU number
	struct tss_struct * t = &init_tss[nr];	// the TSS used by the current CPU

	t->esp0 = current->thread.esp0;		// update esp0 in the TSS with the current process's esp0
	set_tss_desc(nr, t);			// install this TSS's descriptor in the GDT
	gdt_table[__TSS(nr)].b &= 0xfffffdff;	// clear the descriptor's busy bit (type 11 -> 9)
	load_TR(nr);				// load the TSS selector into the tr register
	load_LDT(&init_mm.context);		// load the LDT
}
We know that hardware task switching ("hard switching", performed with a far JMP to the next task's TSS descriptor, and used before 2.4) requires the TSS to store all the registers, and that on an interrupt the CPU must read ring 0's esp0 from the TSS. So what happens now that all processes use the same TSS?
In fact, since 2.4 hard switching is no longer used; switching is done in software ("soft switching"). Registers are no longer stored in the TSS but in each task's task->thread structure. Of the TSS, only esp0 and the I/O permission bitmap are still used, so during a process switch the kernel only needs to update esp0 and io_bitmap in the TSS. The call chain starts in sched.c:
schedule() -> switch_to() -> __switch_to():
void fastcall __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread,
				 *next = &next_p->thread;
	struct tss_struct *tss = init_tss + smp_processor_id();	// the TSS of the current CPU

	/*
	 * Reload esp0, LDT and the page table pointer:
	 */
	tss->esp0 = next->esp0;		// update tss->esp0 with the next process's esp0

	// copy the next process's io_bitmap into tss->io_bitmap
	if (prev->ioperm || next->ioperm) {
		if (next->ioperm) {
			/*
			 * 4 cachelines copy ... not good, but not that
			 * bad either. Anyone got something better?
			 * This only affects processes which use ioperm().
			 * [Putting the TSSs into 4k-tlb mapped regions
			 * and playing VM tricks to switch the IO bitmap
			 * is not really acceptable.]
			 */
			memcpy(tss->io_bitmap, next->io_bitmap,
				IO_BITMAP_BYTES);
			tss->bitmap = IO_BITMAP_OFFSET;
		} else
			/*
			 * a bitmap offset pointing outside of the TSS limit
			 * causes a nicely controllable SIGSEGV if a process
			 * tries to use a port IO instruction. The first
			 * sys_ioperm() call sets up the bitmap properly.
			 */
			tss->bitmap = INVALID_IO_BITMAP_OFFSET;
	}
}
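And where does next->esp0 itself come from? It is set once, when the task is created; a minimal sketch based on 2.4's copy_thread() in arch/i386/kernel/process.c shows that it is simply the top of the child's own 8 KB kernel stack:

/* inside copy_thread(): p is the new task's descriptor, which sits at the
 * bottom of the two page frames that also hold its kernel-mode stack */
struct pt_regs *childregs =
	((struct pt_regs *) (THREAD_SIZE + (unsigned long) p)) - 1;

p->thread.esp  = (unsigned long) childregs;	   /* kernel ESP, just below the saved regs */
p->thread.esp0 = (unsigned long) (childregs + 1); /* == p + THREAD_SIZE, the stack top */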
2. Summary of 80x86 segments
Let's now take a closer look at the 80x86 segment registers. This knowledge is very important for Linux kernel analysis. We have already mentioned the GDT; here we go over segment registers from the beginning, so that readers who did not follow earlier can pick up the thread.
Starting with the 80286, Intel microprocessors perform address translation in two different ways: real mode and protected mode. A logical address consists of two parts: a segment identifier (note that this is not the "segment base address" we learned in class; it has been upgraded!) and an offset giving a relative address within the segment. The segment identifier is a 16-bit field called the segment selector, and the offset is a 32-bit field.
To make segment selectors quick and convenient to locate, the processor provides segment registers whose only purpose is to hold segment selectors (16 bits each; note again that these registers no longer contain segment base addresses). The segment registers are called CS, SS, DS, ES, FS, and GS. Although there are only six of them, a program can reuse the same segment register for different purposes by saving its value in memory and restoring it afterwards.
Three of the six registers have special purposes:
CS: the code segment register, pointing to the segment containing program instructions.
SS: the stack segment register, pointing to the segment containing the current program stack.
DS: the data segment register, pointing to a segment containing static or global data.
The other three segment registers can point to any data segment for general purposes.
The CS register has one more important feature: it includes a 2-bit field specifying the Current Privilege Level (CPL) of the CPU. The value 0 denotes the highest privilege level, while 3 denotes the lowest. Linux uses only levels 0 and 3, which are called kernel mode and user mode respectively.
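Since the CPL is simply the low two bits of CS, the kernel can tell which mode a saved context was executing in with a mask. The user_mode() test from the 2.4 kernel's include/asm-i386/ptrace.h is essentially:

/* true if the saved context ran in vm86 mode, or with CPL == 3 in its CS;
 * VM_MASK is the EFLAGS VM flag (0x00020000) */
#define user_mode(regs)	((VM_MASK & (regs)->eflags) || (3 & (regs)->xcs))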
Each segment is represented by an 8-byte segment descriptor (see the figure), which describes the segment's characteristics (note: the descriptor itself is not the segment). Segment descriptors are stored either in the Global Descriptor Table (GDT) or in a Local Descriptor Table (LDT). Usually only one GDT is defined, while each process may have its own LDT if it needs to create segments beyond those stored in the GDT. The address and size of the GDT in main memory are kept in the GDTR processor register, and the address and size of the currently used LDT are kept in the LDTR processor register.
The descriptor fields have the following meanings (a decoded example follows the list):
Base: The linear address of the first byte of the segment.
G: the granularity flag. If this bit is 0, the segment size is expressed in bytes; otherwise it is expressed in multiples of 4096 bytes.
Limit: holds the offset of the last memory cell in the segment, thereby determining the segment's length. If G is 0, the size of a segment may vary between 1 byte and 1 MB; otherwise, between 4 KB and 4 GB.
S: the system flag. If it is cleared to 0, the segment is a system segment storing a critical data structure such as a local descriptor table; otherwise it is an ordinary code or data segment.
Type: characterizes the segment type and its access rights (see the discussion below).
DPL: the descriptor privilege level, used to restrict access to the segment. It represents the minimum CPL required to access the segment: a segment with DPL 0 is accessible only when CPL is 0 (that is, in kernel mode), while a segment with DPL 3 is accessible at every CPL value.
P: the Segment-Present flag; 0 means the segment is currently not in main memory. Linux always sets this flag (bit 47) to 1, because it never swaps whole segments out to disk.
D or B: called D for a code segment and B for a data segment, with slightly different meanings in the two cases. Roughly, it is set to 1 if segment offsets are 32 bits long and cleared to 0 if they are 16 bits long (see the Intel manuals for details).
AVL: this flag may be used by the operating system but is ignored by Linux.
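Here is the promised decoded example. The kernel code segment descriptor in 2.4's arch/i386/kernel/head.S is the quadword 0x00cf9a000000ffff, which unpacks against the fields above as follows:

/* .quad 0x00cf9a000000ffff  -- the __KERNEL_CS descriptor in head.S */
/* Base  = 0x00000000            the segment starts at linear address 0     */
/* Limit = 0xfffff with G = 1    0x100000 units of 4 KB, i.e. the full 4 GB */
/* S = 1, Type = 0xA             ordinary segment: code, execute/read       */
/* DPL = 0, P = 1                kernel privilege; always present           */
/* D = 1, AVL = 0                32-bit offsets                             */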
A logical address thus consists of a 16-bit segment selector and a 32-bit offset, and the segment registers hold only the selector. The CPU's segmentation unit then performs the following operations (all mechanical conversions; a C sketch follows the list):
• Examine the TI field of the segment selector to determine which descriptor table holds the segment descriptor. The TI field indicates whether the descriptor is in the GDT (in which case the segmentation unit gets the table's linear base address from the GDTR register) or in the currently active LDT (in which case it gets the base address from the LDTR register).
• Compute the address of the segment descriptor from the Index field of the selector: the Index value is multiplied by 8 (the size of a segment descriptor), which is effectively the same as masking off the selector's low 3 bits (the RPL bits and the TI field), and the result is added to the contents of the GDTR or LDTR register.
• Add the offset of the logical address to the Base field of the segment descriptor, obtaining the linear address.
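The same three steps can be written out as a C sketch (illustrative only; the real work happens inside the CPU's segmentation unit, and the gdt/ldt parameters below are hypothetical stand-ins for the linear base addresses held in GDTR and LDTR):

/* desc_struct mirrors an 8-byte segment descriptor as two 32-bit words */
struct desc_struct { unsigned long a, b; };

static unsigned long descriptor_base(const struct desc_struct *d)
{
	/* the Base field is scattered across the descriptor's two words */
	return (d->a >> 16) | ((d->b & 0xff) << 16) | (d->b & 0xff000000);
}

unsigned long translate(unsigned short selector, unsigned long offset,
			const struct desc_struct *gdt, const struct desc_struct *ldt)
{
	/* step 1: the TI bit (bit 2) selects the GDT (0) or the active LDT (1) */
	const struct desc_struct *table = (selector & 0x4) ? ldt : gdt;

	/* step 2: Index * 8, i.e. the selector with its low 3 bits (RPL + TI) masked off */
	const struct desc_struct *desc = &table[selector >> 3];

	/* step 3: linear address = descriptor Base + offset */
	return descriptor_base(desc) + offset;
}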
Note that the CPU keeps a set of hidden caches associated with the segment registers (some books call them nonprogrammable registers) that hold the corresponding segment descriptors. Therefore the first two operations are needed only when the selector held in a segment register has changed.
3. Pointers in Linux
When it saves a pointer to an instruction or to a data structure, the kernel does not need to store a segment selector along with the logical address, because the CS register already holds the current selector. For example, when the kernel invokes a function, it executes a call assembly instruction that specifies only the offset part of the logical address; the selector is left implicit in CS. Since only one segment is in use while executing in kernel mode, namely the code segment defined by the macro __KERNEL_CS, it is sufficient to load __KERNEL_CS into CS when the CPU switches to kernel mode. The same reasoning applies to pointers to kernel data structures (which implicitly use the DS register) and to pointers to user data structures (for which the kernel explicitly uses the ES register).