Linux 0.11 kernel series, part 2: analysis of the system call mechanism
[All Rights Reserved. For more information, see the source. Source: http://www.cnblogs.com/joey-hua/p/5570691.html]
Having read some of the Linux kernel source files that cover startup and initialization, this time we look at system_call.s in the kernel folder. This file handles system calls. System calls involve much more than this one file, however, so this note records the complete mechanism, from setting up the interrupt to the final invocation of the system call.
The write function is used as the example system call throughout.
The essence of a system call is that a user process needs to run kernel-level code. The user process has the lowest privilege and the kernel code has the highest, so direct access is not allowed; an interrupt gate must be used as the intermediary that switches privilege levels. In short, the user process triggers an interrupt, and the interrupt handler then runs the kernel code. Below we look at how the Linux kernel implements this.
1. Create the Interrupt Descriptor Table IDT
Because an interrupt is used, the Interrupt Descriptor Table (IDT) must be created first.
The IDT is created in head.s. When int 0x80 is executed, for example, the CPU looks up entry 0x80 in the table that starts at _idt.
    .align 3                 # align the memory address on an 8-byte boundary
    _idt:   .fill 256,8,0    # idt is uninitialized: 256 entries, 8 bytes each, filled with 0

    idt_descr:               # the next two lines are the 6-byte operand of lidt: length, base address
            .word 256*8-1    # idt contains 256 entries
            .long _idt

            lidt idt_descr   # load the Interrupt Descriptor Table register
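The arithmetic behind the table can be checked with a small C sketch. It is only an illustration: the names IDT_ENTRIES, IDT_ENTRY_SIZE, idt_limit, and idt_entry_offset are hypothetical and do not appear in the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define IDT_ENTRIES    256u  /* number of gates, matching ".fill 256,8,0" */
#define IDT_ENTRY_SIZE 8u    /* each gate descriptor occupies 8 bytes */

/* Limit field of the lidt operand: table size in bytes minus one (256*8-1). */
uint16_t idt_limit(void)
{
    return (uint16_t)(IDT_ENTRIES * IDT_ENTRY_SIZE - 1u);
}

/* Byte offset of gate n, measured from the start of the table (_idt). */
uint32_t idt_entry_offset(uint32_t n)
{
    assert(n < IDT_ENTRIES);  /* only vectors 0..255 exist */
    return n * IDT_ENTRY_SIZE;
}
```

With these definitions, idt_limit() gives 2047 (0x7FF), and the gate for int 0x80 sits 0x400 bytes after _idt.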
2. Create a 0x80 interrupt
All system calls go through interrupt 0x80, so the next step is to set up that interrupt. In sched.c:
    // Set the system call interrupt gate.
    set_system_gate(0x80, &system_call);
Here the macro set_system_gate associates interrupt 0x80 with the system_call function. Leaving system_call aside for the moment, we first look at set_system_gate, in system.h:
    // Set the system call gate.
    // Parameters: n - interrupt number; addr - offset address of the interrupt handler.
    // &idt[n] is the address of entry n in the Interrupt Descriptor Table.
    // The descriptor type is 15 and the privilege level is 3.
    #define set_system_gate(n,addr) \
        _set_gate(&idt[n],15,3,addr)

    // Macro that fills in a gate descriptor.
    // Parameters: gate_addr - descriptor address; type - descriptor type field;
    // dpl - descriptor privilege level; addr - offset address.
    // %0 - type flags built from dpl and type; %1 - low 4 bytes of the descriptor;
    // %2 - high 4 bytes of the descriptor; %3 - edx (handler offset address addr);
    // %4 - eax (the high word holds the segment selector).
    // movw %%dx,%%ax  combines the low word of the offset with the selector into the low 4 bytes (eax).
    // movw %0,%%dx    combines the type flags with the high word of the offset into the high 4 bytes (edx).
    // The two movl instructions store the low and high 4 bytes of the gate descriptor.
    #define _set_gate(gate_addr,type,dpl,addr) \
    __asm__ ("movw %%dx,%%ax\n\t" \
        "movw %0,%%dx\n\t" \
        "movl %%eax,%1\n\t" \
        "movl %%edx,%2" \
        : \
        : "i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
        "o" (*((char *) (gate_addr))), \
        "o" (*(4+(char *) (gate_addr))), \
        "d" ((char *) (addr)),"a" (0x00080000))
Referring to the interrupt gate structure, we can see that because the gate's privilege level is set to 3 and a user process also runs at level 3, the user process may invoke this interrupt directly. The offset address is that of system_call above, so executing int 0x80 enters the system_call function. Note that n is 0x80, that is, the entry is idt[0x80]; idt is declared in head.h and, after compilation, becomes the symbol _idt defined in head.s.
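To see which bits the macro produces, the same packing can be reproduced in plain C. This is a sketch for illustration only: struct gate_words and make_gate are hypothetical names, and the handler address used in the example below is made up. The flag expression and the selector value 0x0008 are taken from _set_gate.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves of an 8-byte gate descriptor. */
struct gate_words {
    uint32_t low;   /* segment selector (bits 31..16) | offset bits 15..0 */
    uint32_t high;  /* offset bits 31..16 | present bit, DPL and type     */
};

/* Pack a gate descriptor the same way _set_gate does with eax and edx. */
struct gate_words make_gate(uint32_t addr, unsigned type, unsigned dpl,
                            uint16_t selector)
{
    struct gate_words g;

    assert(type <= 15 && dpl <= 3);  /* 4-bit type field, 2-bit DPL field */

    /* 0x8000 sets the present bit; dpl lands in bits 14..13, type in bits 11..8. */
    uint32_t flags = 0x8000u + (dpl << 13) + (type << 8);

    g.low  = ((uint32_t)selector << 16) | (addr & 0xFFFFu);
    g.high = (addr & 0xFFFF0000u) | flags;
    return g;
}
```

For a made-up handler address 0x12345678, make_gate(0x12345678, 15, 3, 0x0008) yields low = 0x00085678 and high = 0x1234EF00; the 0xEF00 part encodes present = 1, DPL = 3, type = 15.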
3. Declare the system call Function
Taking the write system call as the example, the function is declared in write.c:
_syscall3 (int, write, int, fd, const char *, buf, off_t, count)
_syscall3 is a macro defined in unistd.h:
    // System call macro for calls with three parameters: type name(atype a, btype b, ctype c)
    // %0 - eax (__res), %1 - eax (__NR_name), %2 - ebx (a), %3 - ecx (b), %4 - edx (c).
    #define _syscall3(type,name,atype,a,btype,b,ctype,c) \
    type name(atype a,btype b,ctype c) \
    { \
    long __res; \
    __asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name),"b" ((long)(a)),"c" ((long)(b)),"d" ((long)(c))); \
    if (__res>=0) \
        return (type) __res; \
    errno=-__res; \
    return -1; \
    }
After macro expansion, the declaration in write.c is therefore equivalent to:
    int write(int fd, const char *buf, off_t count)
    {
        long __res;
        __asm__ volatile ("int $0x80"
            : "=a" (__res)
            : "0" (__NR_write), "b" ((long)(fd)), "c" ((long)(buf)), "d" ((long)(count)));
        if (__res >= 0)
            return (int) __res;
        errno = -__res;
        return -1;
    }
Now things become clear: when a user process calls the write function, it executes int 0x80, with the three parameters fd, buf, and count stored in the ebx, ecx, and edx registers respectively. Most important is __NR_write, whose value is placed in the eax register; its purpose is discussed later. It is defined in unistd.h:
#define __NR_write 4
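The return convention in the second half of the macro (a non-negative result is returned as is, a negative result is turned into errno and -1) can be tried out in user space without executing int 0x80. The sketch below is an assumption-laden stand-in: fake_kernel_write is a hypothetical function that plays the role of the kernel, and demo_write mirrors only the tail of _syscall3.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel side: returns the byte count on
 * success or a negative error code on failure, as the kernel does in eax. */
long fake_kernel_write(int fd, const char *buf, long count)
{
    (void)buf;            /* the buffer is not inspected in this sketch */
    assert(count >= 0);   /* this sketch only handles non-negative counts */
    if (fd < 0)
        return -EBADF;    /* invalid descriptor: negative error code */
    return count;         /* pretend every byte was written */
}

/* Mirrors the tail of _syscall3: translate the raw result for the caller. */
int demo_write(int fd, const char *buf, long count)
{
    long res = fake_kernel_write(fd, buf, count);

    if (res >= 0)
        return (int)res;  /* success: pass the value through */
    errno = (int)-res;    /* failure: store the positive error code */
    return -1;
}
```

Calling demo_write(1, "hi", 2) returns 2, while demo_write(-1, "hi", 2) returns -1 and leaves EBADF in errno.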
With that, all the initialization and declarations are in place, and everything is ready for use.
4. System Call Process
When a user process calls the write function, it executes int 0x80. As described above, executing int 0x80 enters the system_call function, which is declared in sched.c:
    extern int system_call(void);  // system call interrupt handler (kernel/system_call.s, 80).
It is defined in system_call.s. Note that an underscore is prepended to the name after compilation. Only the first half of the code is shown below:
    _system_call:
        cmpl $nr_system_calls-1,%eax   # if the call number is out of range, set eax to -1 and exit
        ja bad_sys_call
        push %ds                       # save the original segment register values
        push %es
        push %fs
        pushl %edx                     # ebx, ecx and edx hold the arguments of the C system call function
        pushl %ecx                     # push %ebx,%ecx,%edx as parameters
        pushl %ebx                     # to the system call
        movl $0x10,%edx                # set up ds,es to kernel space
        mov %dx,%ds                    # ds, es point to the kernel data segment (data segment descriptor in the GDT)
        mov %dx,%es
        movl $0x17,%edx                # fs points to local data space
        mov %dx,%fs                    # fs points to the local data segment (data segment descriptor in the LDT)
    # The operand below means: call address = _sys_call_table + %eax * 4.
    # sys_call_table is defined in the C header include/linux/sys.h; it is an array
    # holding the addresses of the 72 system call C handler functions.
        call _sys_call_table(,%eax,4)
        pushl %eax                     # push the return value of the system call (not the call number)
        movl _current,%eax             # load the address of the current task (process) data structure into eax
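The selector values 0x10 and 0x17 loaded above follow the standard x86 protected-mode selector layout: a table index in bits 15..3, a table indicator in bit 2 (0 for the GDT, 1 for the LDT), and a requested privilege level in bits 1..0. A minimal C sketch of the decoding, using the hypothetical names struct selector_fields and decode_selector:

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
struct selector_fields {
    unsigned index;  /* bits 15..3: descriptor index within the table */
    unsigned ti;     /* bit 2: 0 = GDT, 1 = LDT                       */
    unsigned rpl;    /* bits 1..0: requested privilege level          */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;

    /* Sanity check: re-packing the fields must give back the selector. */
    assert(((f.index << 3) | (f.ti << 2) | f.rpl) == sel);
    return f;
}
```

Decoding 0x10 gives index 2 in the GDT with RPL 0 (the kernel data segment), and 0x17 gives index 2 in the LDT with RPL 3 (the task's local data segment), which matches the comments in the listing.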
Note that the three instructions starting with pushl %edx push the three parameters described in section 3, from right to left. The key line is call _sys_call_table(,%eax,4), which means call [eax * 4 + _sys_call_table]. As explained in section 3, eax holds the value of __NR_write, that is, 4. Because _sys_call_table is an array of type int (*)() defined in sys.h that stores the addresses of all the system call functions, the instruction calls sys_call_table[4], which is the sys_write function:
    // System call function pointer table, used as a jump table by the int 0x80 interrupt handler.
    fn_ptr sys_call_table[] = { sys_setup, sys_exit, sys_fork, sys_read, sys_write, ... };
In read_write.c:
    int sys_write(unsigned int fd, char *buf, int count)
    {
        struct file *file;
        struct m_inode *inode;
        ...
    }
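The dispatch step, a range check on the call number followed by an indexed call through the table, can be modelled in ordinary C. This is a simplified user-space sketch and not the kernel code: demo_fn, demo_stub, demo_sys_write, demo_call_table, and demo_system_call are all hypothetical names, and the handlers take their arguments as plain parameters rather than from the stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: three register-sized arguments. */
typedef long (*demo_fn)(long, long, long);

/* Placeholder for entries 0..3 (setup, exit, fork, read in the real table). */
long demo_stub(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 0;
}

/* Placeholder for sys_write: pretends that count bytes were written. */
long demo_sys_write(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;
}

/* Index 4 holds the write handler, matching __NR_write. */
demo_fn demo_call_table[] = {
    demo_stub, demo_stub, demo_stub, demo_stub, demo_sys_write
};

#define DEMO_NR_CALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* Models "cmpl $nr_system_calls-1,%eax; ja bad_sys_call" and the indexed call. */
long demo_system_call(unsigned long nr, long ebx, long ecx, long edx)
{
    if (nr > DEMO_NR_CALLS - 1)
        return -1;                      /* out of range: same result as bad_sys_call */
    assert(demo_call_table[nr] != NULL);
    return demo_call_table[nr](ebx, ecx, edx);
}
```

Here demo_system_call(4, 1, 0, 5) reaches demo_sys_write and returns 5, while an out-of-range number such as 99 returns -1 without touching the table.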
So the call finally ends up in the sys_write function, and the analysis is complete.