Linux system call

Chapter 5

5.5 Linux system call

5.5.1 System Call Interface

System calls (commonly called syscalls) are the only interface between the Linux kernel and upper-layer applications; see Figure 5-4. As described in the discussion of the interrupt mechanism, a user program invokes interrupt int 0x80 directly or indirectly (through library functions), with the system call function number in the eax register, in order to use kernel resources, including system hardware resources. In general, however, applications use the functions of the C library that implement the standard interface definitions, and so use the kernel's system calls only indirectly, as shown in Figure 5-19.

A system call is normally invoked in the form of a function call, so it can take one or more parameters. The result of a system call is reported in its return value. A negative value usually indicates an error, and 0 indicates success. On error, the error type code is stored in the global variable errno. By calling the library function perror(), we can print the error message string that corresponds to that error code.
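The error convention above can be sketched in C on a present-day POSIX system. This is an illustrative sketch only, not code from the kernel source; the function name read_from_bad_fd is hypothetical, and the sketch assumes that descriptor -1 is never open, so the read() wrapper fails and sets errno.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Try to read from a descriptor that cannot be open (-1 is never valid).
 * Returns the errno value left behind by the failed call, or 0 if the
 * call unexpectedly succeeded. The name is hypothetical. */
int read_from_bad_fd(void)
{
    char buf[16];
    int saved;

    errno = 0;
    if (read(-1, buf, sizeof buf) < 0) {
        saved = errno;    /* keep the code before calling anything else */
        perror("read");   /* prints "read: " plus the message for errno */
        return saved;
    }
    return 0;
}
```

On a POSIX system the returned code is expected to be EBADF, the "bad file descriptor" error.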

In Linux, each system call has a unique system call function number. These function numbers are defined starting at line 62 of include/unistd.h. For example, the function number of the write system call is 4, defined as the symbol __NR_write. The function numbers are in fact the index values into the system call handler pointer array sys_call_table[] defined in include/linux/sys.h. The handler pointer for the write() system call is therefore item 4 of that array.

To use these system call symbols in our own program, we need to define the symbol "__LIBRARY__" before including the <unistd.h> file, as shown below.

#define __LIBRARY__
#include <unistd.h>

In addition, we can see from sys_call_table[] that the names of all system call handler functions in the kernel basically start with the prefix 'sys_'. For example, the function that implements the system call read() in the kernel source code is sys_read().

5.5.2 system call handling process

When an application issues the interrupt call int 0x80 to the kernel through a library function, execution of a system call begins. Register eax holds the system call number, and the parameters it carries are stored in registers ebx, ecx and edx in that order. A user program in Linux 0.12 can therefore pass at most three parameters directly to the kernel, and of course it may pass none. The handler for the system call interrupt int 0x80 is system_call in the program kernel/sys_call.s.

To make system calls easier to use, the kernel source defines the macro functions _syscalln() in the include/unistd.h file (lines-), where n is the number of parameters carried and ranges from 0 to 3. At most three parameters can therefore be passed directly. If a large block of data has to be passed to the kernel, a pointer to that data can be passed instead. For example, the read() system call is defined as:

int read(int fd, char *buf, int n);

If we execute the corresponding system call directly in a user program, it takes the following form:

#define __LIBRARY__
#include <unistd.h>
_syscall3(int, read, int, fd, char *, buf, int, n)

We can therefore use _syscall3() directly in a user program to execute the read() system call, without going through the C library as an intermediary. In fact, the form in which the C library functions finally invoke a system call is exactly the one given here.
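For comparison, a present-day Linux C library offers syscall(), which invokes a system call by number in a similar spirit. The sketch below is an assumption-laden illustration for a modern Linux system, not Linux 0.12 code: raw_read and pipe_round_trip are hypothetical names, and SYS_read comes from the modern <sys/syscall.h> header rather than from the 0.12 unistd.h.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the read system call by number, bypassing the read() wrapper.
 * Hypothetical helper; relies on the modern Linux syscall() function. */
long raw_read(int fd, char *buf, unsigned long n)
{
    return syscall(SYS_read, fd, buf, n);
}

/* Write a short message into a pipe and read it back with raw_read().
 * Returns the number of bytes read, or -1 on failure. */
long pipe_round_trip(char *out, unsigned long cap)
{
    int fds[2];
    long got;

    if (pipe(fds) < 0)
        return -1;
    if (write(fds[1], "ping", 4) != 4) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    got = raw_read(fds[0], out, cap);
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Calling pipe_round_trip() with a buffer of at least four bytes should fill it with the four bytes written to the pipe. As with _syscall3(), the system call number and up to three arguments are all that the caller supplies.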

Each system call macro in include/unistd.h takes 2 + 2*n parameters. The first parameter is the type of the system call's return value and the second is the name of the system call; these are followed by the type and name of each parameter the system call carries. The macro expands into a C function containing inline assembly, as shown below.

int read(int fd, char *buf, int n)
{
    long __res;
    __asm__ volatile (
        "int $0x80"
        : "=a" (__res)
        : "0" (__NR_read), "b" ((long)(fd)), "c" ((long)(buf)), "d" ((long)(n)));
    if (__res >= 0)
        return (int) __res;
    errno = -__res;
    return -1;
}

As can be seen, the expanded macro is a concrete implementation of the read system call. It uses inline assembly to execute the Linux system interrupt call 0x80 with function number __NR_read (3). The interrupt call returns the number of bytes actually read in the eax (__res) register. If the returned value is less than 0, the read operation failed, so the error number is negated and stored in the global variable errno, and the value -1 is returned to the caller.

If a system call needs more than three parameters, the kernel's usual approach is to treat the parameters as a parameter buffer block and to pass a pointer to that block to the kernel as the single parameter. For system calls with more than three parameters, we therefore only need the one-parameter macro _syscall1(), which passes a pointer to the first parameter to the kernel. For example, the select() system call has five parameters, but we only need to pass a pointer to its first parameter. For more information, see the description of the fs/select.c program.
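The parameter-block idea can be illustrated with a small user-space model. This is a toy sketch, not the kernel's select() code: toy_handler and toy_call5 are hypothetical names, and the block is an explicit array here, whereas the text describes passing a pointer to the first parameter.

```c
/* Hypothetical "kernel side": receives ONE pointer and unpacks five
 * values from the block it points to. */
long toy_handler(const long *args)
{
    return args[0] + args[1] + args[2] + args[3] + args[4];
}

/* Hypothetical "user side": the five parameters are packed into one
 * contiguous block, and only the address of the block is handed over,
 * so a single register would be enough to carry it. */
long toy_call5(long a, long b, long c, long d, long e)
{
    long block[5] = { a, b, c, d, e };

    return toy_handler(block);
}
```

The handler here simply adds the values so that the unpacking can be checked; a real handler would also have to validate the pointer before reading through it.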

After the system call handler in kernel/sys_call.s is entered, the system_call code first checks whether the function number in eax lies within the valid range of system call numbers, and then calls the corresponding handler through the function pointer table sys_call_table[].

call _sys_call_table(,%eax,4)    // kernel/sys_call.s, line 99.

This assembly statement is an indirect call to the function whose address is stored at _sys_call_table + %eax * 4. Because each item of sys_call_table[] is 4 bytes long, the system call function number must be multiplied by 4. The resulting value is then used to fetch the address of the handler function from the table.
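The range check followed by a table lookup can be modelled in plain C. The sketch below is a toy model under stated assumptions: the table, its two handlers and the -1 error value are all hypothetical, and C array indexing takes the place of the explicit multiplication by the entry size.

```c
typedef long (*fn_ptr)(long, long, long);

/* Two hypothetical handlers standing in for sys_xxx() functions. */
static long toy_sys_add(long a, long b, long c) { return a + b + c; }
static long toy_sys_max(long a, long b, long c)
{
    long m = a > b ? a : b;

    return m > c ? m : c;
}

/* Hypothetical table: the function number is the index. */
static fn_ptr toy_call_table[] = { toy_sys_add, toy_sys_max };
#define TOY_NR_CALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Range-check the number, then call indirectly through the table.
 * Indexing scales nr by the pointer size automatically, which is what
 * the "multiply by 4" step does by hand on a 32-bit machine. */
long toy_system_call(unsigned long nr, long b, long c, long d)
{
    if (nr >= TOY_NR_CALLS)
        return -1;   /* placeholder error value chosen for this sketch */
    return toy_call_table[nr](b, c, d);
}
```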

5.5.3 Linux system call parameter passing method

When a user-mode process enters the kernel through the system call interrupt, Linux passes parameters in general-purpose registers such as ebx, ecx and edx. One obvious advantage of passing parameters in registers is that when the system call interrupt service routine is entered and the register values are saved, the registers that carry parameters are automatically placed on the kernel-mode stack, so no special handling of those registers is needed. This was the simplest and fastest parameter passing method that Linus knew of at the time. There is also another method that uses the call gate provided by Intel CPUs, which automatically copies parameters between the process's user-mode stack and its kernel-mode stack. However, the steps needed to use that method are more complicated.

In addition, each system call handler function should validate the parameters passed to it to make sure that all of them are legal and valid. In particular, pointers supplied by the user must be checked strictly, to ensure that the memory range they refer to is valid and carries the appropriate read/write permission.

5.6 system time and timing

5.6.1 system time

So that the operating system can automatically provide the current time and date, PC/AT microcomputers include a battery-powered real-time (RT) clock circuit. This circuit is usually integrated on a single chip together with a small amount of CMOS RAM that stores system information, so it is called the RT/CMOS RAM circuit. PC/AT microcomputers and compatibles use the Motorola MC146818 chip.

During initialization, the Linux 0.12 kernel uses the time_init() function in init/main.c to read the current time and date stored in this chip, and uses the kernel_mktime() function in kernel/mktime.c to convert it into the number of seconds elapsed from midnight, January 1, 1970 to the present. This is called UNIX calendar time. It fixes the calendar time at which the system started running and is saved in the global variable startup_time for use by all kernel code. A user program can use the system call time() to read the value of startup_time, while the superuser can call stime() to modify this system time value.

In addition, using the system tick count jiffies described below, a program can uniquely determine the current time. Because each tick lasts 10 ms, the kernel code defines a macro that gives code convenient access to the current time. This macro is defined in the include/linux/sched.h file:

#define CURRENT_TIME (startup_time + jiffies/HZ)

HZ = 100 is the system clock frequency of the kernel. The macro CURRENT_TIME is defined as the system boot time startup_time plus the time elapsed since boot, jiffies/100. The macro is used when a file's access time or the modification time of its i-node is updated.
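The arithmetic behind this macro can be tried out in isolation. The following is a minimal sketch that turns the macro into a function with explicit arguments; the function name current_time and the sample values are assumptions made for illustration.

```c
#define HZ 100   /* ticks per second, as stated in the text */

/* Calendar time in seconds, given the boot time in seconds and the
 * number of ticks elapsed since boot. Integer division discards any
 * partial second, exactly as jiffies/HZ does in the macro. */
long current_time(long startup_time, long jiffies)
{
    return startup_time + jiffies / HZ;
}
```

For example, 250 ticks after a boot time of 1000 the function yields 1002, because 250/100 truncates to 2 whole seconds.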

5.6.2 system timing

During initialization of the Linux 0.12 kernel, counter channel 0 of the Intel 8253 (8254) counter/timer chip on the PC is set to operate in mode 3 (square wave generator mode), and the initial count value LATCH is set so that the output of channel 0 produces a rising square wave edge every 10 milliseconds. Because the clock input frequency of the 8254 chip is 1.193180 MHz, the initial count value is LATCH = 1193180/100, about 11931. Since the output of channel 0 (OUT0) is connected to level 0 of the programmable interrupt controller chip, the system issues a clock interrupt request (IRQ0) every 10 milliseconds. This time beat is the pulse of the operating system; we call it one system tick or one system clock period. On every tick, therefore, the system invokes the clock interrupt handler (timer_interrupt).
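The LATCH arithmetic can be checked with a few lines of C. This is only a worked calculation under the figures given in the text (1193180 Hz input, 100 ticks per second); the names CLOCK_HZ, latch_value and period_us are hypothetical.

```c
#define CLOCK_HZ 1193180L   /* 8254 input clock in Hz, from the text */
#define HZ 100              /* desired ticks per second */

/* Initial count for channel 0: input clock divided by tick rate,
 * truncated to an integer because the counter holds whole counts. */
long latch_value(void)
{
    return CLOCK_HZ / HZ;
}

/* Actual tick period, in microseconds, produced by a given count. */
double period_us(long latch)
{
    return latch * 1e6 / CLOCK_HZ;
}
```

Because the division truncates 11931.8 to 11931, the resulting period is very slightly shorter than the nominal 10 ms.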

The clock interrupt handler timer_interrupt mainly uses the jiffies variable to accumulate the number of clock ticks that have elapsed since the system booted. The value of jiffies is incremented by 1 each time a clock interrupt occurs. The handler then calls the C function do_timer() for further processing. The parameter CPL passed in this call is the current privilege level, obtained from the segment selector of the interrupted program (the CS segment register value saved on the stack).

The do_timer() function accumulates the run time of the current process according to the privilege level. If CPL = 0, the process was interrupted while running in kernel mode, so its kernel-mode run time statistic stime is incremented by 1; otherwise its user-mode run time statistic utime is incremented by 1. If the floppy.c program has added timers during operation, the timer linked list is processed, and if a timer has counted down to 0, that timer's handler function is called. Next, the time slice of the current process is handled by decrementing it by 1. A time slice is the amount of CPU time a process can run continuously before it is switched out, measured in the ticks defined above. If the time slice is still greater than 0 after decrementing, it has not been used up, and do_timer() returns so that the current process continues to run.

If the time slice has reached 0, the process has used up its CPU time, and the program decides what to do next according to the privilege level of the interrupted program. If the interrupted process was running in user mode (privilege level greater than 0), do_timer() calls the scheduler schedule() to switch to another process. If the interrupted process was running in kernel mode, that is, executing kernel code, do_timer() returns immediately. This handling means that a Linux process is never switched out by the scheduler while it runs in kernel mode. In other words, a process is non-preemptive while executing kernel code, but preemptive while executing in user mode.
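The accounting and preemption decision described above can be condensed into a toy model. The sketch below is a simplification under stated assumptions, not the kernel's do_timer(): struct toy_task and toy_do_timer are hypothetical names, timer list handling is omitted, and a return value replaces the actual call to schedule().

```c
/* Hypothetical, cut-down stand-in for the relevant task fields. */
struct toy_task {
    long utime;     /* ticks spent in user mode */
    long stime;     /* ticks spent in kernel mode */
    long counter;   /* remaining time slice, in ticks */
};

/* Account for one tick. Returns 1 if the scheduler should be invoked,
 * 0 if the interrupted process simply continues. */
int toy_do_timer(struct toy_task *t, int cpl)
{
    if (cpl == 0)
        t->stime++;        /* interrupted in kernel mode */
    else
        t->utime++;        /* interrupted in user mode */

    if (--t->counter > 0)
        return 0;          /* time slice not used up yet */
    t->counter = 0;
    if (cpl == 0)
        return 0;          /* kernel mode is never preempted */
    return 1;              /* user mode and slice exhausted */
}
```

With a slice of 2 ticks and two user-mode ticks, the second call asks for a reschedule; the same situation in kernel mode does not.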

Starting with the Linux 2.4 kernel, Robert Love developed a preemptible kernel patch. It allows a low-priority process running in kernel space to be preempted by a high-priority process, which improves system responsiveness by up to 200%. See Robert Love's book "Linux Kernel Development".

Note that the timers mentioned above are used exclusively for timing the switching on and off of the floppy drive motor. This kind of timer is similar to the dynamic timer in modern Linux systems and is used only by the kernel. Such a timer can be created dynamically whenever it is needed and is removed dynamically when it expires. In Linux 0.12 there can be at most 64 timers at the same time. The timer handling code is on lines 283-368 of the sched.c program.

5.7 Linux process control

A program is an executable file, and a process is an instance of a program in execution. Using time-sharing, multiple processes can run at the same time on a Linux operating system. The basic principle of time-sharing is to divide CPU run time into time slices of a specified length and to let each process run for one time slice. When a process's time slice is used up, the system uses the scheduler to switch to another process. On a machine with a single CPU, therefore, only one process is actually running at any given moment. But because each time slice is very short (for example, 15 system ticks = 150 milliseconds), all processes appear to run simultaneously.

In the Linux 0.12 kernel, the system can have at most 64 processes at the same time. Except for the first process, which is created "by hand", all others are new processes created by an existing process using the system call fork. The process that is created is called the child process, and the process that creates it is called the parent process. The kernel uses the process ID (pid) to identify each process. A process consists of executable instruction code, data, and a stack area. The code and data parts of a process correspond to the code segment and data segment of an executable file. Each process can execute only its own code and access only its own data and stack area. Communication between processes has to go through system calls. On a system with only one CPU, only one process can be running at any moment. The kernel runs the processes in turn through the time-sharing scheduler.
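The parent and child relationship created by fork can be shown with the POSIX interface available on present-day systems. This is a minimal sketch, not Linux 0.12 code; the function name spawn_and_collect is hypothetical, and waitpid() with the WIFEXITED and WEXITSTATUS macros is the modern way to collect a child's exit code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once with `code`. The parent waits for
 * it and returns the exit code it collected, or -1 on any failure. */
int spawn_and_collect(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                 /* fork failed */
    if (pid == 0)
        _exit(code);               /* child: fork returned 0 */
    /* parent: fork returned the child's pid */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The two processes are told apart only by the return value of fork(): 0 in the child, the child's pid in the parent.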

We already know that a process in a Linux system can run in kernel mode or in user mode, and that it uses a separate kernel-mode stack and user-mode stack for each. The user-mode stack temporarily holds function call parameters, local variables and other data while the process runs in user mode. The kernel-mode stack holds information about the function calls made while kernel code is executing.

In addition, in the Linux kernel source a process is usually called a task, while a program running in user space is called a process. This text tries to follow that convention while also using the two terms interchangeably.

5.7.1 task Data Structure

The kernel manages processes through a process table, in which each process occupies one entry. In Linux, a process table entry is a pointer to a task_struct task structure. The task data structure is defined in the header file include/linux/sched.h. Some books call it the process control block (PCB) or the process descriptor PD (Processor Descriptor). It holds all the information used to control and manage the process, mainly the current state of the process, signals, the process number, the parent process number, accumulated run-time values, the files in use, the local descriptors of the task, and the task state segment information. The meaning of each field in this structure is as follows.


■ The long state field holds the current state code of the process. If the process is waiting to use the CPU or is being executed, the value of state is TASK_RUNNING. If the process is waiting for some event to occur and is therefore idle (asleep), the value of state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. The difference between the two is that a process in the TASK_INTERRUPTIBLE state can be woken up and activated by a signal, whereas a process in the TASK_UNINTERRUPTIBLE state is usually waiting, directly or indirectly, for a hardware condition to be met and therefore does not accept any signal. The TASK_STOPPED state indicates that a process is stopped, for example when it receives a relevant signal (such as SIGSTOP, SIGTTIN or SIGTTOU), or when it is being monitored by another process through the ptrace system call and control is in the monitoring process. The TASK_ZOMBIE state describes a process that has terminated but whose task data structure entry still exists in the task structure table. The transitions of a process between these states are described below.

■ The long counter field holds the number of ticks the process can still run before it is temporarily stopped, that is, under normal conditions, how many system clock periods remain before a switch to another process. The scheduler uses the counter value of each process to choose the next process to run, so counter can be regarded as a dynamic property of the process. When a process is first created, the initial value of counter equals priority.

■ long priority is used to give counter its initial value. In Linux 0.12 this initial value is 15 system clock periods (15 ticks). When necessary, the scheduler uses the value of priority to assign an initial value to counter. For details, see the sched.c and fork.c programs. Naturally, the unit of priority is also the tick.

■ The long signal field is the bitmap of the signals the process has currently received. It has 32 bits in total. Each bit represents one signal, and the signal value is the bit offset + 1. The Linux kernel can therefore have at most 32 signals. At the end of each system call, the system uses this signal bitmap to pre-process signals.

■ The struct sigaction sigaction[32] structure array holds the actions and attributes used to handle each signal. Each item of the array corresponds to one signal.

■ The long blocked field is the bitmap of the signals that the process currently does not want to handle (the blocked signals). As with the signal field, each bit represents one blocked signal.

■ The int exit_code field holds the exit code of the program when the process terminates. After a child process ends, its parent process can query this exit code.

■ The unsigned long start_code field is the start address of the process code in the CPU linear address space.

■ The unsigned long end_code field holds the length of the process code in bytes.

■ The unsigned long end_data field holds the total length in bytes of the process code plus the process data.

■ The unsigned long brk field is also the total length in bytes of the process code and data (a pointer value), but it additionally includes the uninitialized data area bss; see Figure 13-6. This is the initial value of brk when a process starts running. By modifying this pointer, the kernel can add and release dynamically allocated memory for the process. This is usually done by calling the malloc() function, which invokes the kernel through the brk system call.

■ The unsigned long start_stack field points to the start of the stack in the logical address space of the process. Again, see the stack pointer position in Figure 13-6.

■ long pid is the process identification number, that is, the process number. It uniquely identifies a process.

■ long pgrp is the number of the process group to which the process belongs.

■ long session is the session number of the process, that is, the number of the session to which it belongs.

■ long leader is the process number of the session leader. For the concepts of process groups and sessions, see the description after the program list in Chapter 7.

■ int groups[NGROUPS] is the array of group numbers of the groups to which the process belongs. A process can belong to several groups.

■ task_struct *p_pptr is a pointer to the task structure of the parent process.

■ task_struct *p_cptr is a pointer to the task structure of the most recently created child process.

■ task_struct *p_ysptr is a pointer to the adjacent process created later than this one.

■ task_struct *p_osptr is a pointer to the adjacent process created earlier than this one. For the relationship among these four pointers, see Figure 5-20. The task data structure of the Linux 0.11 kernel has a parent process number field named father, but it is no longer used in the 0.12 kernel. Instead, we can use the process's p_pptr->pid to obtain the process number of the parent.

■ unsigned short uid is the user identifier (user ID) of the user who owns the process.

■ unsigned short euid is the effective user ID, which indicates the permission to access files.

■ unsigned short suid is the saved user ID. When the set-user-ID flag of the executable file is set, suid holds the uid of the executable file. Otherwise suid equals the euid of the process.

■ unsigned short gid is the identifier of the group to which the user belongs (group ID). It indicates the user group that owns the process.

■ unsigned short egid is the effective group ID, which indicates the permission of the group to access files.

■ unsigned short sgid is the saved group ID. When the set-group-ID flag of the executable file is set, sgid holds the gid of the executable file. Otherwise sgid equals the egid of the process. For a description of these user IDs and group IDs, see the overview before the sys.c program in Chapter 5.

■ long timeout is the kernel timing timeout value.

■ long alarm is the alarm timer value of the process (in ticks). This value is decremented during the system timer interrupt. After a process sets it by calling alarm() (sched.c, line 338), the value reaches 0 once the specified number of seconds has passed (the parameter is given in seconds, but the kernel converts it to system ticks before saving it in the alarm field). The system then sends the process a SIGALRM signal, which by default terminates the program. Of course, a signal-catching function (signal() or sigaction()) can also be used to catch this signal and carry out a specified action.

■ long utime is the accumulated time (in ticks) that the process has run in user mode.

■ long stime is the accumulated time (in ticks) that the process has run in system mode (kernel mode).

■ long cutime is the accumulated time (in ticks) that the child processes of the process have run in user mode.

■ long cstime is the accumulated time (in ticks) that the child processes of the process have run in kernel mode.

■ long start_time is the time at which the process was created and started to run.

■ struct rlimit rlim[RLIM_NLIMITS] is the resource usage statistics array of the process.

■ unsigned int flags holds the flags of the process. It is not yet used in the 0.12 kernel.

■ unsigned short used_math is a flag indicating whether the process has used the math coprocessor.

■ int tty is the minor device number of the tty terminal used by the process. -1 means that none is used.

■ unsigned short umask is the attribute mask bits used when the process creates a new file, that is, the access attributes that will be set for the new file.

■ struct m_inode *pwd is a pointer to the i-node structure of the current working directory of the process. Each process has a current working directory, which is used to resolve relative path names and can be changed with the system call chdir.

■ struct m_inode *root is a pointer to the i-node structure of the process's own root directory. Each process can have its own root directory, which is used to resolve absolute path names. Only the superuser can change this root directory by calling chroot.

■ struct m_inode *executable is a pointer to the in-memory i-node structure of the executable file of the process. From this field the system can tell whether another process in the system is running the same executable file. If so, the reference count of that in-memory i-node, executable->i_count, is greater than 1. When a process is created, this field is given the same value as the parent's field, which means that the child is running the same program as its parent. When the process calls an exec() family function to run a specified executable file, the field is replaced with the in-memory i-node pointer of the program executed by exec(). When the process calls exit() and carries out exit processing, the reference count of the in-memory i-node this field points to is decremented by 1 and the field is cleared. The main use of this field is in the share_page() function of the memory.c program. Based on the reference count of the node that the process's executable field points to, the code of that function can determine whether several copies (at least two) of the currently running program exist in the system. If so, it attempts page sharing among them.

■ During system initialization, before execve() is called for the first time, the executable field of every task created by the system is 0. These tasks include task 0, task 1, and any task created directly by task 1 that has not yet executed execve(); that is, executable is 0 for every task whose code is contained directly in the kernel code. Because the code of task 0 is part of the kernel code and is not an executable file loaded by the system from the file system, the executable value of task 0 is fixed at 0 in the kernel code. Moreover, since fork() copies the task data structure of the parent when it creates a new process, the executable of task 1 is also 0. After execve() has been executed, however, executable is given the in-memory i-node pointer of the executed file, and from then on this value is no longer 0 for any subsequent task.

■ unsigned long close_on_exec is a bitmap of the file descriptors (file handles) of the process. Each bit represents one file descriptor and determines which file descriptors are to be closed when the system call execve() is invoked (see include/fcntl.h). When a program uses the fork() function to create a child process, it usually calls execve() in the child to load and run another new program. The child process is then completely replaced by the new program and begins to execute it. If the bit for a file descriptor in close_on_exec is set, that descriptor is closed when the child process executes execve(); that is, the descriptor is closed in the new program. Otherwise the file descriptor remains open.

■ struct file *filp[NR_OPEN] is the table of file structure pointers for all files the process has open, with at most 32 items. The value of a file descriptor is the index into this table. Each item is used by a file descriptor to locate the file pointer and to access the file.

■ struct desc_struct ldt[3] is the local descriptor table structure of the process. It defines the code segment and data segment of the task in the virtual address space. Item 0 of the array is a null item, item 1 is the code segment descriptor, and item 2 is the data segment (data and stack) descriptor.

■ struct tss_struct tss is the task state segment (TSS) information structure of the process. When the task is switched out of execution, the tss_struct structure saves all the register values of the current processor. When the CPU runs the task again, it uses these saved values to restore the state at the time the task was switched out and resumes execution.
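The signal and blocked bitmaps in the field list follow the rule "signal value = bit offset + 1". The sketch below models that rule in user space. It is an illustration only: sig_bit, raise_sig and next_deliverable are hypothetical helpers, and choosing the lowest-numbered deliverable signal is an assumption made to keep the example deterministic.

```c
/* Bit that represents signal `signr` (1..32): offset = signr - 1. */
unsigned long sig_bit(int signr)
{
    return 1UL << (signr - 1);
}

/* Record that `signr` has been received in the `signal` bitmap. */
unsigned long raise_sig(unsigned long signal, int signr)
{
    return signal | sig_bit(signr);
}

/* Lowest-numbered signal that is pending and not blocked, or 0 when
 * there is nothing to deliver. Only the low 32 bits are examined. */
int next_deliverable(unsigned long signal, unsigned long blocked)
{
    unsigned long pending = signal & ~blocked & 0xFFFFFFFFUL;
    int signr;

    for (signr = 1; signr <= 32; signr++)
        if (pending & sig_bit(signr))
            return signr;
    return 0;
}
```

The same set-a-bit and test-a-bit pattern applies to close_on_exec, where each bit stands for a file descriptor instead of a signal.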

When a process is running, the values in all the CPU registers, the state of the process, and the contents of its stack are called the context of the process. When the kernel needs to switch to another process, it must save the entire state of the current process, that is, its context, so that when the process runs again it can be restored to the state it had at the time of the switch. In Linux, the context of the current process is saved in the task data structure of the process. When an interrupt occurs, the kernel executes the interrupt service routine in kernel mode, in the context of the interrupted process. At the same time it preserves all the resources it needs so that the interrupted process can resume when the interrupt service ends.

5.7.2 Process execution states

During its lifetime a process can be in one of a set of different states, called process states; see Figure 5-21. The process state is stored in the state field of the process's task structure. When a process is waiting for a resource in the system, it is said to be in the sleep-waiting state. In Linux, the sleep-waiting state is divided into the interruptible and the uninterruptible waiting states.

Running state (TASK_RUNNING)
When a process is being executed by the CPU, or is ready to be run by the scheduler at any time, it is said to be in the running state. If the process is not being executed by the CPU at the moment, it is said to be in the ready-to-run state. See the three states marked 0 in Figure 5-21. A process can run in kernel mode or in user mode. When a process is executing kernel code, we say it is in kernel running mode, or kernel mode for short. When a process is executing its own user code, we say it is in user running mode (user mode). When system resources become available, the process is woken up and enters the ready-to-run state, which is called the ready state. These states (the middle column in the figure) are represented in the same way in the kernel: all of them are the TASK_RUNNING state. A newly created process is in this state (the bottom 0).

Interruptible sleep state (TASK_INTERRUPTIBLE)

When a process is in the interruptible waiting (sleep) state, the system does not schedule it. When the system raises an interrupt, releases a resource the process is waiting for, or the process receives a signal, the process can be woken up and moved to the ready state (running state).

Uninterruptible sleep state (TASK_UNINTERRUPTIBLE)

This state is similar to the interruptible sleep state, except that the process is not woken up by receiving a signal. A process in this state can be moved to the runnable ready state only when it is woken up explicitly with the wake_up() function. This state is typically used when the process has to wait without being disturbed, or when the awaited event will occur soon.

Stopped state (TASK_STOPPED)
When a process receives the signal SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU, it enters the stopped state. Sending it a SIGCONT signal moves the process back to the runnable state. A process that is being debugged enters this state when it receives any signal. In Linux 0.12, the handling of transitions for this state is not yet implemented, and a process in this state is treated as a terminated process.

Zombie state (TASK_ZOMBIE)
A process is in the zombie state when it has stopped running but its parent process has not yet called wait() to ask about its status. The task data structure of the child process must be kept so that the parent can obtain information about how the child stopped. Once the parent calls wait() and collects that information, the task data structure of the zombie process is released.
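The five states described above can be summarised in a few lines of C. This is a minimal sketch: the numeric values are the ones recalled from include/linux/sched.h in the early kernels and should be checked against the source, and the helper functions are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* State values as recalled from include/linux/sched.h (assumption). */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_ZOMBIE           3
#define TASK_STOPPED          4

/* Hypothetical helper: only TASK_RUNNING tasks (running or ready)
 * are candidates for the scheduler. */
bool state_is_schedulable(int state)
{
    assert(state >= TASK_RUNNING && state <= TASK_STOPPED);
    return state == TASK_RUNNING;
}

/* Hypothetical helper: a signal wakes an interruptible sleeper,
 * but never an uninterruptible one. */
bool state_wakes_on_signal(int state)
{
    assert(state >= TASK_RUNNING && state <= TASK_STOPPED);
    return state == TASK_INTERRUPTIBLE;
}

/* Hypothetical helper: the task structure must be kept until the
 * parent has called wait() on a zombie. */
bool state_awaits_parent_wait(int state)
{
    assert(state >= TASK_RUNNING && state <= TASK_STOPPED);
    return state == TASK_ZOMBIE;
}
```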

When the time slice of a process is used up, the system uses the scheduler to force a switch to another process. In addition, if a process running in kernel mode has to wait for some system resource, it calls sleep_on() or interruptible_sleep_on() to give up the CPU voluntarily and let the scheduler run other processes. The process then enters a sleep state (TASK_UNINTERRUPTIBLE or TASK_INTERRUPTIBLE).
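The voluntary sleep and the later wake-up can be modelled in user space as follows. This is a simplified sketch, not the kernel's code: the real sleep_on() chains waiters through local variables on each sleeper's kernel stack and then calls schedule(), whereas this model uses an explicit, hypothetical next_waiter link and only records the state changes.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2

struct task {
    int state;
    struct task *next_waiter;   /* hypothetical link field, not in the real kernel */
};

/* Put `self` on the wait queue and mark it as sleeping. The real
 * functions would now call schedule(); this model only records the
 * state change. */
void sleep_on_sketch(struct task **queue, struct task *self, int sleep_state)
{
    assert(sleep_state == TASK_INTERRUPTIBLE ||
           sleep_state == TASK_UNINTERRUPTIBLE);
    self->next_waiter = *queue;
    *queue = self;
    self->state = sleep_state;
}

/* Make every waiter on the queue runnable again and empty the queue. */
void wake_up_sketch(struct task **queue)
{
    while (*queue != NULL) {
        struct task *t = *queue;
        *queue = t->next_waiter;
        t->next_waiter = NULL;
        t->state = TASK_RUNNING;
    }
}
```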

The kernel performs a process switch only when a process moves from the kernel running state to a sleep state. A process running in kernel mode cannot be preempted by other processes, and one process cannot change the state of another. To avoid corrupting kernel data during a process switch, the kernel disables all interrupts while it executes critical-section code.

5.7.3 Process Initialization

The boot programs in the boot/ directory load the kernel from disk into memory and switch the system to protected mode. The system initialization program init/main.c then starts to run. It first determines how the physical memory of the system is to be allocated and used, and then calls the initialization functions of the various kernel parts to initialize memory management, interrupt handling, block and character devices, process management, and the hard disk. Once these operations are complete, all parts of the system are in a runnable state. The program then "manually" moves itself into task 0 (process 0) and calls fork() for the first time to create process 1. In process 1 the program goes on to initialize the application environment and runs the shell login program. The original process 0 is scheduled whenever the system is idle. At that point task 0 merely executes the pause() system call, which in turn calls the scheduling function.

The step of "moving to task 0 for execution" is carried out by the macro move_to_user_mode (include/asm/system.h). It moves the execution flow of main.c from kernel mode (privilege level 0) into task 0 in user mode (privilege level 3), where execution continues. Before the move, the system sets up the execution environment of task 0 during scheduler initialization (sched_init()). This includes manually presetting the values of the fields of the task 0 data structure (include/linux/sched.h), adding the task state segment (TSS) descriptor of task 0 and the segment descriptor of its local descriptor table (LDT) to the global descriptor table, and loading them into the task register TR and the local descriptor table register LDTR respectively.

It should be stressed that kernel initialization is a special process: the kernel initialization code is also the code of task 0. According to the initial data set in the task 0 data structure, the code segment and data segment of task 0 have a base address of 0 and a limit of 640 KB. The kernel code segment and data segment have a base address of 0 and a limit of 16 MB. The code and data segments of task 0 are therefore contained within the kernel code and data segments. The kernel initialization program main.c is the code of task 0; it is just that, before the move to task 0, the system runs main.c at kernel privilege level 0. The macro move_to_user_mode changes the running privilege level from level 0 of kernel mode to level 3 of user mode while continuing to execute the same instruction stream.

To move into task 0, the macro move_to_user_mode uses an interrupt-return instruction to bring about the privilege change. This method of transferring control is dictated by the CPU protection mechanism. The CPU allows lower-privileged code (such as privilege level 3) to call or transfer to higher-privileged code through a call gate, an interrupt gate or a trap gate, but not the other way round. The kernel therefore simulates an IRET return to lower-privileged code. The main idea is to build on the stack the contents that the interrupt-return instruction expects, with the segment selector of the return address set to the task 0 code segment selector, whose privilege level is 3. When the iret instruction is then executed, the CPU jumps from privilege level 0 to the outer privilege level 3.
See Figure 5-22 for the layout of the interrupt-return stack when the privilege level changes.

The macro move_to_user_mode first pushes onto the kernel stack the task 0 stack segment (that is, data segment) selector and the kernel stack pointer. It then pushes the contents of the flags register. Finally, it pushes the task 0 code segment selector and the offset of the next instruction to be executed after the interrupt return. This offset is that of the instruction immediately following iret.

When the iret instruction is executed, the CPU loads the return address into CS:EIP and pops the flags register contents from the stack. The CPU determines that the privilege level of the target code segment is 3, which differs from the current kernel level 0, so it also pops the stack segment selector and stack pointer from the stack into SS:ESP. Because the privilege level has changed, the values in the segment registers DS, ES, FS and GS become invalid, and the CPU clears them. These segment registers must therefore be reloaded after the iret instruction. From then on, the system executes the code of task 0 at privilege level 3. The user-mode stack in use is still the stack that was used before the move. The kernel-mode stack of task 0 is set to start at the top of the page that holds its task data structure (PAGE_SIZE + (long)&init_task). Because the task data structure of task 0, including its user stack pointer, is copied when new processes are created later, the user-mode stack of task 0 must be kept "clean" until task 1 (process 1) has been created.
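The stack frame that move_to_user_mode prepares for iret, as described above, can be pictured with a small C model. This is an illustrative sketch: the selector values 0x0f and 0x17 are the ones recalled from the early kernel sources rather than taken from this text, so treat them as assumptions, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* What iret pops when returning to an outer privilege level,
 * listed from the lowest stack address upwards. */
struct iret_frame {
    uint32_t eip;      /* offset of the instruction after iret */
    uint32_t cs;       /* task 0 code segment selector, RPL 3 */
    uint32_t eflags;   /* saved flags register */
    uint32_t esp;      /* stack pointer to continue with */
    uint32_t ss;       /* task 0 stack (data) segment selector, RPL 3 */
};

/* Assumed selector values: LDT entries 1 and 2, requested privilege 3. */
#define TASK0_CODE_SEL 0x0f
#define TASK0_DATA_SEL 0x17

/* Decode the three fields of an x86 segment selector. */
unsigned selector_rpl(uint16_t sel)      { return sel & 0x3; }
unsigned selector_uses_ldt(uint16_t sel) { return (sel >> 2) & 0x1; }
unsigned selector_index(uint16_t sel)    { return sel >> 3; }

/* Build the frame in the order the macro pushes it: SS, ESP, EFLAGS,
 * CS, EIP. The stack grows downwards, so EIP ends up lowest. */
struct iret_frame build_user_frame(uint32_t esp, uint32_t eflags,
                                   uint32_t resume_eip)
{
    struct iret_frame f;
    f.ss = TASK0_DATA_SEL;
    f.esp = esp;
    f.eflags = eflags;
    f.cs = TASK0_CODE_SEL;
    f.eip = resume_eip;
    /* The privilege change happens because CS requests level 3. */
    assert(selector_rpl((uint16_t)f.cs) == 3);
    return f;
}
```

Decoding 0x0f gives index 1, the LDT bit set, and privilege level 3, which is why the CPU also pops SS:ESP and switches to the outer level.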

5.7.4 Creating a New Process

New processes are created in Linux with the fork() system call. All processes are obtained by copying process 0, and are all descendants of process 0.

When creating a new process, the system first looks in the task array for an empty entry (empty slot) that is not used by any process. If 64 processes are already running, the fork() system call returns an error because the task array has no free entry. The system then requests one page of memory in the main memory area to hold the task data structure of the new process, and copies the entire contents of the current process's task data structure into it as a template. To keep the scheduler from running the new process before it has been fully set up, the state of the new process is immediately set to the uninterruptible wait state (TASK_UNINTERRUPTIBLE).
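The search for a free entry can be sketched as below. This is a minimal model that assumes a table of 64 task pointers in which NULL marks an empty slot and slot 0 belongs to task 0. The name find_empty_slot is hypothetical, and returning -EAGAIN for a full table is an assumption based on recollection of the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_TASKS 64   /* size of the task array, as stated in the text */

struct task { int pid; };

/* Zero-initialised, so every slot starts out empty. */
struct task *task_table[NR_TASKS];

/* Return the index of the first unused slot, or -EAGAIN when the
 * table is full. Slot 0 is reserved for task 0 and is never handed out. */
int find_empty_slot(void)
{
    for (int i = 1; i < NR_TASKS; i++) {
        if (task_table[i] == NULL)
            return i;
    }
    return -EAGAIN;
}
```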

The copied task data structure is then modified. The current process is set as the parent of the new process, the signal bitmap is cleared, the statistics of the new process are reset, and the initial time slice is set to 15 system ticks (150 ms). Next, the registers in the task state segment (TSS) are set from the current process. Because fork() must return 0 in the new process, tss.eax is set to 0. The kernel-mode stack pointer tss.esp0 is set to the top of the memory page that holds the new process's task data structure, and the stack segment tss.ss0 is set to the kernel data segment selector. tss.ldt is set to the index of the local descriptor table descriptor in the GDT. If the current process has used the math coprocessor, the complete coprocessor state is also saved in the tss.i387 structure of the new process.
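The adjustments made to the copied task structure can be sketched as follows. The structures are cut-down, hypothetical stand-ins for the real task and TSS structures. The kernel data selector value 0x10, the 100 ticks per second, and the use of the priority field to seed the time slice are assumptions drawn from recollection of the source, not from this text.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE      4096   /* one memory page */
#define TICKS_PER_SECOND      100    /* assumed tick rate: 10 ms per tick */
#define KERNEL_DATA_SEL       0x10   /* assumed kernel data segment selector */
#define TASK_UNINTERRUPTIBLE  2

struct tss_sketch {
    uintptr_t esp0;   /* kernel-mode stack pointer */
    unsigned  ss0;    /* kernel-mode stack segment */
    long      eax;    /* value fork() returns in this task */
};

struct task_sketch {
    int state;
    long counter, priority;        /* remaining and initial time slice, in ticks */
    unsigned long signal;          /* pending-signal bitmap */
    long utime, stime;             /* accumulated user and system time */
    struct task_sketch *parent;
    struct tss_sketch tss;
};

/* Copy the parent as a template, then fix up the child's fields. */
void init_child(struct task_sketch *child, struct task_sketch *parent)
{
    assert(child != parent);
    *child = *parent;                       /* start from the parent's data */
    child->state = TASK_UNINTERRUPTIBLE;    /* keep the scheduler away for now */
    child->parent = parent;
    child->signal = 0;                      /* clear the signal bitmap */
    child->utime = child->stime = 0;        /* reset statistics */
    child->counter = child->priority;       /* fresh time slice */
    child->tss.eax = 0;                     /* fork() returns 0 in the child */
    child->tss.esp0 = (uintptr_t)child + SKETCH_PAGE_SIZE;  /* top of its page */
    child->tss.ss0 = KERNEL_DATA_SEL;
}

/* Convert a number of ticks into milliseconds. */
long ticks_to_ms(long ticks)
{
    return ticks * 1000 / TICKS_PER_SECOND;
}
```

With a priority of 15 ticks, ticks_to_ms(15) gives the 150 ms time slice mentioned above.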

The system then sets the base address and limit of the code and data segments of the new task, and copies the page tables that manage the current process's memory. Note that at this point the system does not allocate any physical memory pages for the new process; instead, the new process shares the memory pages of its parent. Only when either the parent or the new process performs a write to memory does the system allocate a private memory page for the process that is writing. This technique is called copy on write.
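The idea behind copy on write can be shown with a small user-space model based on reference counts. This is only a sketch of the principle: the real kernel works on page-table entries and page-fault handling, and the structure and function names here are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

struct page {
    int refcount;                          /* how many processes map this page */
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Allocate a zeroed page owned by a single process. */
struct page *page_alloc(void)
{
    struct page *p = calloc(1, sizeof *p);
    if (p != NULL)
        p->refcount = 1;
    return p;
}

/* At fork time: the child maps the same page, nothing is copied. */
struct page *page_share(struct page *p)
{
    assert(p != NULL);
    p->refcount++;
    return p;
}

/* On the first write: a sole owner writes in place, otherwise the
 * writer receives a private copy. Returns NULL if memory runs out. */
struct page *page_make_writable(struct page *p)
{
    assert(p != NULL && p->refcount >= 1);
    if (p->refcount == 1)
        return p;
    struct page *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    memcpy(copy->data, p->data, SKETCH_PAGE_SIZE);
    copy->refcount = 1;
    p->refcount--;                         /* the writer no longer shares p */
    return copy;
}
```

After a write by the child, the parent still sees its original data, and each page is again owned by exactly one process.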

Next, if the parent process has open files, the reference count of each such file is incremented by 1. The TSS and LDT descriptor entries for the new task are then set in the GDT, with their base addresses pointing to the tss and ldt members of the new process's task structure. Finally, the new task is set to the runnable state and the new process number is returned.

Note that creating a new child process and loading and running an executable file are two different concepts. When a child process is created, it is a complete copy of the parent's code and data areas, and it runs the child's part of that code. To run a program stored on a block device, the exec() system call is normally invoked in the child process. On entering exec(), the child's original code and data areas are cleared (released). When the child starts to run the new program, the kernel has not yet loaded the program's code from the block device, so the CPU immediately raises a page-not-present fault on the code page. The memory management code then loads the corresponding code page from the block device, and the CPU re-executes the instruction that caused the fault.
Only at this point does the code of the new program really start to execute.
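Seen from user space, the two separate steps look like this. The sketch uses the standard POSIX calls fork(), execl() and waitpid(); the helper name run_shell_command is hypothetical, and it assumes that a shell is available at /bin/sh.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `command` through /bin/sh in a child process and return its exit
 * status, or -1 if the child could not be created or ended abnormally. */
int run_shell_command(const char *command)
{
    assert(command != NULL);

    pid_t pid = fork();              /* step 1: duplicate the current process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: fork() returned 0. Step 2 replaces the copied code and
         * data with a new program. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                  /* reached only if exec failed */
    }

    /* Parent: fork() returned the child's process number. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell_command("exit 7") returns 7 once the child has terminated and the parent has collected its status.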

5.7.5 Scheduling

The scheduler in the kernel selects the next process to run in the system. This selection mechanism is the basis of a multitasking operating system. The scheduler can be regarded as the management code that allocates CPU time among all processes in the running state. From the earlier description we know that Linux processes are preemptible, but a preempted process remains in the TASK_RUNNING state and is merely not being executed by the CPU for the time being. Preemption takes place only while a process is executing in user mode; a process cannot be preempted while it executes in kernel mode.

A scheduling policy is needed for process switching so that processes use system resources efficiently and still respond quickly. Linux 0.12 uses a priority-based queuing scheduling policy.

The schedule() function first scans the task array. It compares the counter value, a count of remaining ticks that decreases as the process runs, of every task in the ready state (TASK_RUNNING) to find out which process has run for the least time. The larger the value, the less time the process has run so far, so that process is selected and the task-switching macro is used to switch to it.

If the time slices of all processes in the TASK_RUNNING state have been used up, the system recalculates the time slice counter of every process in the system (including sleeping processes) from its priority value, priority. The formula is:

counter = counter / 2 + priority

As a result, sleeping processes have a higher counter value when they are woken up. The schedule() function then rescans all tasks in the task array that are in the TASK_RUNNING state and repeats the process until a process has been selected. Finally, switch_to() is called to perform the actual process switch.
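The selection loop described above can be sketched in C as follows. This is a simplified model: the table layout, the structure fields and the name pick_next_task are assumptions made for illustration, and the recalculation uses the rule counter = counter / 2 + priority as recalled from the kernel source. Priorities are assumed to be positive, otherwise the loop could not make progress.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TASKS      64
#define TASK_RUNNING  0

struct task {
    int  state;
    long counter;    /* remaining time slice, in ticks */
    long priority;   /* must be positive */
};

struct task *task_table[NR_TASKS];   /* NULL marks an empty slot */

/* Return the slot number of the task to run next. Slot 0 (the idle
 * task) is returned when no other task is runnable. */
int pick_next_task(void)
{
    for (;;) {
        long best = -1;
        int next = 0;

        /* Find the runnable task with the largest remaining counter. */
        for (int i = NR_TASKS - 1; i > 0; i--) {
            struct task *t = task_table[i];
            if (t != NULL && t->state == TASK_RUNNING && t->counter > best) {
                best = t->counter;
                next = i;
            }
        }

        /* best > 0: a task with time left was found.
         * best == -1: nothing is runnable, fall back to slot 0. */
        if (best != 0)
            return next;

        /* Every runnable task has used up its slice: recompute the
         * counters of all tasks, including sleeping ones. */
        for (int i = NR_TASKS - 1; i > 0; i--) {
            struct task *t = task_table[i];
            if (t != NULL) {
                assert(t->priority > 0);
                t->counter = (t->counter >> 1) + t->priority;
            }
        }
    }
}
```

Because a sleeping task keeps half of its unused counter at every recalculation, it ends up with a larger counter than tasks that have just used up their slices.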

If no other process is runnable at this point, the system selects process 0. In Linux 0.12, process 0 calls pause() to put itself into the interruptible sleep state and then calls schedule() again. When scheduling, however, schedule() does not care what state process 0 is in. Whenever the system is idle, process 0 is scheduled to run.

Process Switching
Each time a new runnable process has been selected, the schedule() function calls the switch_to() macro defined in include/asm/system.h to perform the actual process switch. The macro replaces the current process state (context) of the CPU with the state of the new process. Before switching, switch_to() first checks whether the process to switch to is the current process. If it is, the macro does nothing and exits. Otherwise, it sets the kernel global variable current to point to the new task and then performs a far jump to the address formed from the new task's task state segment (TSS) selector, which causes the CPU to carry out a task switch. The CPU saves the state of all its registers in the tss structure of the current process's task data structure, which is the one pointed to by the TSS segment selector in the task register TR. It then restores into the CPU the register information from the tss structure of the new task's data structure, which is pointed to by the new TSS selector, and the system begins to run the newly switched task. See Figure 5-23.
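The sequence of checks and saves can be modelled in plain C. In the real kernel the save and restore are performed by the CPU's hardware task-switch mechanism, not by C code, so the following is only a sketch of the logic, with hypothetical structure and function names and a three-register stand-in for the TSS.

```c
#include <assert.h>
#include <stddef.h>

/* A tiny stand-in for the register image kept in a TSS. */
struct cpu_context { long eip, esp, eax; };

struct task { struct cpu_context tss; };

struct task *current;       /* the task that currently owns the CPU */
struct cpu_context cpu;     /* stand-in for the live CPU registers */

void switch_to_sketch(struct task *next)
{
    assert(current != NULL && next != NULL);

    if (next == current)    /* already running: nothing to do */
        return;

    struct task *prev = current;
    current = next;         /* the global pointer is updated first */

    prev->tss = cpu;        /* registers saved into the outgoing task's TSS */
    cpu = next->tss;        /* registers loaded from the incoming task's TSS */
}
```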

5.7.6 Terminating a Process

When a process finishes or is terminated part-way through, the kernel has to release the system resources the process occupies. These include the files the process opened while running and the memory it requested.

When a user program calls the exit() system call, the kernel function do_exit() is executed. The function first releases the memory pages occupied by the code and data segments of the process, closes all files the process has open, and synchronizes the i-nodes of the current working directory, the root directory and the running program. If the process has child processes, the init process becomes the parent of all of them. If the process is a session leader with a controlling terminal, the terminal is released and a SIGHUP signal is sent to all processes in the session, which normally terminates them. The state of the process is then set to the zombie state TASK_ZOMBIE, and a SIGCHLD signal is sent to its original parent to notify it that one of its children has terminated. Finally, do_exit() calls the scheduling function to run other processes. As this shows, the task data structure of a process is retained when the process terminates, because the parent process still needs the information in it.
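The bookkeeping part of this sequence (re-parenting the children, becoming a zombie and notifying the parent) can be sketched as follows. The structures and function names are hypothetical, the release of memory, files and the terminal is left out, and init is assumed to be process 1.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#define NR_TASKS     64
#define TASK_ZOMBIE  3
#define INIT_PID     1    /* assumption: init is process 1 */

struct task {
    int pid;
    int parent_pid;
    int state;
    int exit_code;
    unsigned long pending_signals;   /* one bit per signal number */
};

struct task *task_table[NR_TASKS];   /* NULL marks an empty slot */

/* Look up a task by process number, or return NULL. */
struct task *find_task(int pid)
{
    for (int i = 0; i < NR_TASKS; i++) {
        if (task_table[i] != NULL && task_table[i]->pid == pid)
            return task_table[i];
    }
    return NULL;
}

void exit_sketch(struct task *self, int code)
{
    assert(self != NULL);

    /* Hand every child over to init. */
    for (int i = 0; i < NR_TASKS; i++) {
        struct task *t = task_table[i];
        if (t != NULL && t->parent_pid == self->pid)
            t->parent_pid = INIT_PID;
    }

    /* Become a zombie; the slot in task_table is deliberately kept. */
    self->state = TASK_ZOMBIE;
    self->exit_code = code;

    /* Tell the original parent that a child has terminated. */
    struct task *parent = find_task(self->parent_pid);
    if (parent != NULL)
        parent->pending_signals |= 1UL << (SIGCHLD - 1);
}
```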

While a child process is running, the parent process usually calls wait() or waitpid() to wait for one of its children to terminate. When the awaited child has terminated and is in the zombie state, the parent adds the time used by the child to its own accounting. Finally, the memory page occupied by the task data structure of the terminated child is released, and the pointer entry that the child occupied in the task array is cleared.
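The reaping step performed on behalf of the parent can be sketched like this. It is a simplified model with hypothetical names: the field names for the accumulated child times are assumptions, and a heap allocation stands in for the memory page that holds the child's task structure.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_TASKS     64
#define TASK_ZOMBIE  3

struct task {
    int pid, parent_pid, state, exit_code;
    long utime, stime;      /* time used by this process */
    long cutime, cstime;    /* time used by its terminated children */
};

struct task *task_table[NR_TASKS];   /* NULL marks an empty slot */

/* Reap one zombie child of `parent`.
 * Returns the child's pid, 0 if children exist but none has terminated
 * yet, or -1 if the parent has no children at all. */
int reap_zombie_child(struct task *parent, int *status)
{
    assert(parent != NULL);
    int have_children = 0;

    for (int i = 1; i < NR_TASKS; i++) {
        struct task *t = task_table[i];
        if (t == NULL || t->parent_pid != parent->pid)
            continue;
        have_children = 1;
        if (t->state != TASK_ZOMBIE)
            continue;

        /* Add the child's running time to the parent's totals. */
        parent->cutime += t->utime;
        parent->cstime += t->stime;

        int pid = t->pid;
        if (status != NULL)
            *status = t->exit_code;

        task_table[i] = NULL;   /* clear the slot in the task array */
        free(t);                /* release the child's task structure */
        return pid;
    }
    return have_children ? 0 : -1;
}
```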
