Getting Started with Linux kernels


Linux is a family of free, freely redistributable Unix-like operating systems, first deployed on computers built around x86-series CPUs. The system has been designed and implemented by thousands of programmers around the world, with the aim of creating a UNIX-compatible product that can be used freely anywhere, without the copyright restrictions that come with proprietary software.

We will not recount the history of operating systems here, nor survey how they are classified; gossip aside, Linux is simply very hard to pin to one category. Judged by its kernel it is a time-sharing operating system, yet it also has real-time (RTOS) characteristics; it has a monolithic kernel, yet it offers the modular, loadable-component flavor of a microkernel; it supports a wide range of network protocols, so it is also a network operating system; and it supports large-scale clustering, grid computing, and nowadays even cloud computing and cloud storage deployments, so it can fairly be called a distributed operating system as well...

In any case, the open-source nature of Linux means that people all over the world can develop and improve it to suit their own needs, which is why Linux enjoys high performance, high availability, extensibility, portability, and many other desirable properties.

In today's IT industry, Linux can be seen everywhere: from embedded devices and mobile phones, through PCs, up to large-scale clusters, grids, and clouds. I am writing this series of Linux kernel blog posts to let readers see what this great operating system really looks like on the inside, and to let you feel why it is great.

Linux is an operating system, and an operating system is ultimately software. Software, in essence, is algorithms + data structures + documentation. This article begins with its architecture and steps from there into its interior.

1 Linux Architecture

Above is a Linux architecture diagram that I think captures the system quite well. We can see from it that an application used by the user ultimately reaches the kernel by way of an interrupt. In more detail: the application issues a system call, which is a special kind of interrupt; the processor then switches the program from user mode into kernel mode, where it can access the functions and data structures the kernel provides. We will elaborate on this later; first, some concepts:

"Files" and "processes" are the two most basic entities and central concepts in the Linux kernel; every operation on a Linux system is built on these two. The core of the system consists of the following five parts:
① Virtual file system: file management and disk cache management (inode and space management)
② I/O device management: block device drivers (random-access devices), character devices, and raw ("bare") devices
③ Process control: scheduling, synchronization, and communication of processes
④ Storage management: moving programs between main memory and secondary storage
⑤ Network protocol stack: implements a wide range of network protocols

2 Execution of an Ordinary Program

When a process issues the exec system call (exec("command name", arguments)), the executable file is loaded into three regions of the process (a small program for inspecting these regions follows the list):
• Text region: corresponds to the text segment of the executable file
• Data region: corresponds to the data segment of the executable file
• Stack region: a newly created workspace for the process
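
As a quick way to see these regions from a running program: on Linux the linker provides the symbols etext, edata, and end, which mark the end of the text segment, the end of the initialized data segment, and the end of the uninitialized data (BSS), respectively (see the end(3) manual page). A minimal sketch:

#include <stdio.h>

extern char etext, edata, end;  /* set by the linker; their addresses mark region boundaries */

int main(void)
{
    printf("end of text (etext): %p\n", (void *)&etext);
    printf("end of data (edata): %p\n", (void *)&edata);
    printf("end of bss  (end)  : %p\n", (void *)&end);
    return 0;
}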

The stack is an important concept. It is used mainly to pass parameters, save the caller's context, store return addresses, and provide storage for local (automatic) variables. Later posts will dwell on it, because the stack is simply too important to gloss over.

While a process runs in kernel mode its workspace is the kernel stack, and while it runs in user mode its workspace is the user stack. The kernel stack and the user stack can never be used in each other's place.

Now consider a command the user types at the terminal: copy oldfile newfile, where oldfile is the name of an existing file and newfile is the name of a new one. A possible implementation follows (the variable version is initialized data; the array buffer is uninitialized data):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char buffer[2048];      /* uninitialized data */
int version = 1;        /* initialized data */

void copy(int old, int new);

int main(int argc, char *argv[])
{
    /* The system invokes main with argc set to the number of
     * command-line arguments and each member of argv given an
     * initial value:
     *   argv[0] points to the string "copy";
     *   argv[1] points to the string oldfile;
     *   argv[2] points to the string newfile. */
    int fdold, fdnew;

    if (argc != 3) {
        printf("Need 2 arguments for copy program\n");
        exit(1);
    }
    fdold = open(argv[1], O_RDONLY);   /* open the source file read-only */
    if (fdold == -1) {
        printf("Cannot open file %s\n", argv[1]);
        exit(1);
    }
    fdnew = creat(argv[2], 0666);      /* create a target file readable and writable by everyone */
    if (fdnew == -1) {
        printf("Cannot create file %s\n", argv[2]);
        exit(1);
    }
    copy(fdold, fdnew);
    exit(0);
}

void copy(int old, int new)
{
    int count;

    while ((count = read(old, buffer, sizeof(buffer))) > 0)
        write(new, buffer, count);
}

When main is called, the parameters argc and argv, the variables fdold and fdnew, and the return-address information for main are pushed onto the stack. Whenever another function is entered (here, the copy function), its parameters, local variables, and return address are pushed in turn. (We assume for simplicity that the program does not enter any of the three if blocks; if it did, their stack activity would be added to this picture. We omit that, but do not forget it happens.) A small sketch of this frame layering follows.
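
A minimal sketch of the layering: each call pushes a new frame, and on x86 the stack grows toward lower addresses, so a callee's locals typically sit below the caller's (exact addresses vary, and the compiler may rearrange or inline frames; this is illustrative only):

#include <stdio.h>

static void callee(void)
{
    int local_in_callee;
    printf("callee frame: %p\n", (void *)&local_in_callee);
}

int main(void)
{
    int local_in_main;
    printf("main frame  : %p\n", (void *)&local_in_main);
    callee();   /* typically prints a lower address than main's local */
    return 0;
}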

We see that a Linux process works in two states: kernel mode and user mode. Accordingly, the kernel stack and the user stack of a Linux system are kept separate. The user stack holds the frames of the ordinary functions in the program and of the system-call wrapper functions, and it is visible to the user; the kernel stack holds kernel functions and data, such as the getblk function, and it is transparent to the user.

So when does a program use the user stack, and when does it use the kernel stack?

When a system call is made. That is, when you execute C library functions such as printf, open, read, and write, each of them eventually issues the corresponding system call, such as sys_open or sys_read. That is the moment the process switches to the kernel stack.
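
To make the library-to-system-call step concrete, here is a minimal sketch that skips the printf/write wrappers and issues the system call directly through the generic syscall(2) wrapper; SYS_write is the number the kernel uses to dispatch to sys_write:

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello via sys_write\n";

    /* file descriptor 1 is standard output */
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}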

1 Stack Switching in Linux


We are talking about the 80x86 here. In fact, Linux uses stack segments (from SS:ESP down to the bottom of the stack) in only four places:
• the temporary stack used in real mode while the system boots;
• the stack used by the kernel initialization code after entering protected mode, which later also serves as the user-mode stack of process 0;
• the stack each process uses to execute kernel code after a system call, called the kernel stack of the process; every process has its own independent kernel-mode stack;
• the stack a process uses while executing in user mode, located near the end of the process's logical address space.

The following is a brief introduction to the two stacks associated with an ordinary process.

Each process has two stacks, used respectively when executing user-mode code and kernel-mode code; we call them the user-mode stack and the kernel-mode stack.

Besides sitting at different CPU privilege levels, the main difference between the two stacks is size: a task's kernel stack is small, and as we will see in the later process-management topic, the data stored there cannot exceed 8,192 bytes; the process's user-mode stack, by contrast, sits near the end of the user's 4GB address space and can grow as needed. A sketch of the kernel-stack allocation follows.
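
For reference, 2.4-era kernels allocated the kernel stack together with the process descriptor in a single two-page block, roughly as in the following definition from include/linux/sched.h (quoted from memory for illustration; the exact form varies by kernel version):

union task_union {
    struct task_struct task;    /* the process descriptor ... */
    unsigned long stack[2048];  /* ... shares one 8,192-byte block with the kernel stack (i386) */
};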

Each process (except process 0 and process 1) has its own 4GB address space when running in user mode; this is the world your code sees. When a process has just been created, its user-mode stack pointer is set near the end of that address space, and the application keeps using this stack while it runs in user mode; the actual physical memory behind it is determined by the CPU's paging mechanism.

At kernel-mode run time, each task has its own kernel-mode stack, used while the task executes kernel code, for instance after a system call. The linear address where it sits is specified by the ss0 and esp0 fields in the process's TSS segment. Where do those two values come from?

In our "Memory Management" topic we will mention that, on the 80x86, Linux uses segmentation only nominally: it works with just code segments and data segments. The SS register in the CPU points to the stack segment, but Linux does not use a dedicated stack segment; it uses a portion of a data segment as the stack segment. Hence, when the current privilege level (CPL) is 3, SS points to the user stack inside the user data segment, and when the CPL is 0, it points to the kernel stack inside the kernel data segment. Attention! This is important, especially when we come to process switching; not knowing it will drive you mad.

Besides these four segments (user data segment, user code segment, kernel data segment, kernel code segment), Linux uses a few other specialized segments, which we explore below. In a uniprocessor system there is only one GDT, while in a multiprocessor system each CPU has its own GDT. All GDTs are stored in the cpu_gdt_table array, and the addresses of all GDTs (used when initializing the GDTR register) and their sizes are stored in the cpu_gdt_descr array; both are defined in the file arch/i386/kernel/head.S.

Let us expand on this. The 80x86 family gained a new kind of segment with the 80286, the Task State Segment (TSS), used mainly to save the contents of the processor's registers. Linux keeps one TSS-related data structure for each processor, and the linear address range of each TSS is a small subset of the linear address range of the kernel data segment. All task state segments are stored sequentially in the init_tss array; it is worth noting that the Base field of the TSS descriptor for the nth CPU points to the nth element of init_tss. The G (granularity) flag is cleared to 0 and the Limit field is set to 0xeb, because the TSS segment is 236 bytes long. The Type field is set to 9 or 11 (available 32-bit TSS), and the DPL is set to 0, because processes in user mode are not allowed to access TSS segments.

OK, back to the question. We said that for a user process to access the data structures and functions provided by the kernel, a switch is needed, from user mode to kernel mode. So where does the address of the kernel stack come from?

When the process passes from user mode to kernel mode, an interrupt occurs, and because the kernel runs at a higher privilege level, the stack must be switched. The CPU reads the TR register to reach the TSS of the current process (which is still running in user mode at this point), then loads the SS and ESP registers with the kernel-mode stack segment ss0 and stack pointer esp0 held in the TSS; this switches from the user stack to the kernel stack. The kernel then uses a series of mov/push instructions to save the remaining registers on the kernel stack, including the user-mode contents of SS and ESP. A toy model of this transition follows.
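
Here is that toy model, written in plain C (this is not real hardware or kernel code; the selector and address values are made up for illustration):

#include <stdint.h>
#include <stdio.h>

struct toy_tss  { uint32_t esp0; uint16_t ss0; };   /* just the two fields we need */
struct cpu_regs { uint32_t esp; uint16_t ss; };

/* Model of the user->kernel stack switch: remember the user SS:ESP
 * (the real CPU pushes them onto the new kernel stack), then load
 * SS:ESP from the ss0/esp0 fields of the TSS. */
static void enter_kernel(struct cpu_regs *regs, const struct toy_tss *tss,
                         uint16_t *user_ss, uint32_t *user_esp)
{
    *user_ss  = regs->ss;
    *user_esp = regs->esp;
    regs->ss  = tss->ss0;
    regs->esp = tss->esp0;
}

int main(void)
{
    struct toy_tss  tss  = { .esp0 = 0xc0402000u, .ss0 = 0x18 };  /* made-up values */
    struct cpu_regs regs = { .esp  = 0xbffff000u, .ss  = 0x23 };
    uint16_t uss;
    uint32_t uesp;

    enter_kernel(&regs, &tss, &uss, &uesp);
    printf("now on kernel stack %04x:%08x (user was %04x:%08x)\n",
           regs.ss, regs.esp, uss, uesp);
    return 0;
}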

At the end of interrupt or exception handling, the CPU control unit executes the iret instruction, re-reading the register contents saved on the stack to restore the individual CPU registers, and the user-mode process resumes, back on its own stack.


Let me stress again: there is only one such kernel-stack anchor address (on a multi-CPU architecture, one per CPU); its SS and ESP values are kept in the TSS structure, which user-mode processes are not allowed to access. The Linux data structure that describes the TSS format is tss_struct:


struct tss_struct {
    unsigned short  back_link, __blh;
    unsigned long   esp0;
    unsigned short  ss0, __ss0h;
    unsigned long   esp1;
    unsigned short  ss1, __ss1h;    /* ss1 caches MSR_IA32_SYSENTER_CS */
    unsigned long   esp2;
    unsigned short  ss2, __ss2h;
    unsigned long   __cr3;
    unsigned long   eip;
    unsigned long   eflags;
    unsigned long   eax, ecx, edx, ebx;
    unsigned long   esp;
    unsigned long   ebp;
    unsigned long   esi;
    unsigned long   edi;
    unsigned short  es, __esh;
    unsigned short  cs, __csh;
    unsigned short  ss, __ssh;
    unsigned short  ds, __dsh;
    unsigned short  fs, __fsh;
    unsigned short  gs, __gsh;
    unsigned short  ldt, __ldth;
    unsigned short  trace, io_bitmap_base;
    /*
     * The extra 1 is there because the CPU will access an
     * additional byte beyond the end of the IO permission
     * bitmap. The extra byte must be all 1 bits, and must
     * be within the limit.
     */
    unsigned long   io_bitmap[IO_BITMAP_LONGS + 1];
    /*
     * Cache the current maximum and the last task that used the bitmap:
     */
    unsigned long   io_bitmap_max;
    struct thread_struct *io_bitmap_owner;
    /*
     * Pads the TSS to be cacheline-aligned (size is 0x100)
     */
    unsigned long   __cacheline_filler[35];
    /*
     * .. and then another 0x100 bytes for the emergency kernel stack
     */
    unsigned long   stack[64];
} __attribute__((packed));

That is the whole content of the TSS segment; not much. At each switch, the kernel updates some fields of the TSS so that the CPU control unit can safely retrieve the information it needs; this is one of Linux's safety measures. The TSS therefore only reflects the privilege-level state of the process currently on the CPU; processes that are not running have no need to maintain a TSS.

Kernels before Linux 2.4 had a limit on the maximum number of processes: each process had its own TSS and LDT, the TSS descriptor and LDT descriptor had to be placed in the GDT, and the GDT can hold only 8,192 descriptors. Subtracting the 12 descriptors reserved for the system, the maximum number of processes was (8192 - 12) / 2 = 4,090.

From Linux 2.4 onward, all processes use the same TSS; to be exact, there is one TSS per CPU, and all processes running on the same CPU share it. The TSS array is declared in asm-i386/processor.h as follows:

extern struct tss_struct init_tss[NR_CPUS];

The TSS is initialized and loaded in start_kernel() -> trap_init() -> cpu_init():

void __init cpu_init(void)
{
    int nr = smp_processor_id();            /* index of the current CPU */

    struct tss_struct *t = &init_tss[nr];   /* the TSS this CPU uses */

    t->esp0 = current->thread.esp0;         /* update esp0 in the TSS to the current process's esp0 */
    set_tss_desc(nr, t);
    gdt_table[__TSS(nr)].b &= 0xfffffdff;   /* clear the TSS descriptor's busy bit */
    load_TR(nr);                            /* load the task register with this TSS */
    load_LDT(&init_mm.context);             /* load the LDT */
}

We know that a hardware task switch ("hard switch") requires the TSS to hold all the registers (before 2.4 the switch was performed with a jmp to the task's TSS descriptor), and that when an interrupt occurs the CPU must also read ring 0's esp0 from the TSS. So if all processes use the same TSS, what happens to task switching?

In fact, since 2.4 the kernel no longer performs the hard switch but a soft switch: the registers are no longer saved in the TSS but in task->thread, and only the TSS's esp0 and I/O permission bitmap remain in use. Therefore, during a process switch, only esp0 and io_bitmap in the TSS need to be updated. The code is in sched.c:

schedule() -> switch_to() -> __switch_to():

void fastcall __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    struct thread_struct *prev = &prev_p->thread,
                         *next = &next_p->thread;
    struct tss_struct *tss = init_tss + smp_processor_id();   /* the TSS of the current CPU */

    /*
     * Reload esp0, LDT and the page table pointer:
     */
    tss->esp0 = next->esp0;   /* update tss->esp0 with the esp0 of the next process */

    /* copy the io_bitmap of the next process into tss->io_bitmap */
    if (prev->ioperm || next->ioperm) {
        if (next->ioperm) {
            /*
             * 4 cachelines copy ... not good, but not that
             * bad either. Anyone got something better?
             * This only affects processes which use ioperm().
             * [Putting the TSSs into 4k-TLB mapped regions
             * and playing VM tricks to switch the IO bitmap
             * is not really acceptable.]
             */
            memcpy(tss->io_bitmap, next->io_bitmap,
                   IO_BITMAP_BYTES);
            tss->bitmap = IO_BITMAP_OFFSET;
        } else
            /*
             * A bitmap offset pointing outside of the TSS limit
             * causes a nicely controllable SIGSEGV if a process
             * tries to use a port IO instruction. The first
             * sys_ioperm() call sets up the bitmap properly.
             */
            tss->bitmap = INVALID_IO_BITMAP_OFFSET;
    }
}

2 A Summary of 80x86 Segmentation


Feeling dizzy? Then let us tidy up our knowledge of the 80x86 segment registers; this material matters a great deal in Linux kernel analysis. The GDT has already come up; here we comb through the segment registers as a whole, so that comrades who were lost should now find some clues.

Starting with the 80286, Intel microprocessors perform address translation in two different ways, called real mode and protected mode. A logical address consists of two parts: a segment identifier (note: not the "segment base address" we learned in class; this is an upgrade!) and an offset giving the relative address within the segment. The segment identifier is a 16-bit field called the segment selector, and the offset is a 32-bit field.

To make segment selectors quick and easy to find, the processor provides segment registers whose only purpose is to hold segment selectors (16 bits each; note carefully that the contents of these registers are not segment base addresses). These registers are called CS, SS, DS, ES, FS, and GS. Although there are only six of them, a program can use the same segment register for different purposes by first saving its value in memory and then restoring it after use.

Three of the six registers have specific uses:
cs -- the code segment register, pointing to the segment containing program instructions.
ss -- the stack segment register, pointing to the segment containing the current program stack.
ds -- the data segment register, pointing to the segment containing static or global data.

The other three segment registers are general-purpose and may point to arbitrary data segments.

The CS register has one more very important function: it includes a 2-bit field specifying the Current Privilege Level (CPL) of the CPU. A value of 0 denotes the highest privilege level, and 3 the lowest. Linux uses only levels 0 and 3, called kernel mode and user mode, respectively.
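
You can observe this from user space. A minimal sketch (x86-specific): read the CS selector with inline assembly and mask off its low two bits, which hold the privilege level; run in user mode, it prints 3:

#include <stdio.h>

int main(void)
{
    unsigned short cs;

    __asm__("mov %%cs, %0" : "=r"(cs));   /* copy the CS selector into a register */
    printf("cs = 0x%04hx, CPL = %d\n", cs, cs & 3);
    return 0;
}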

Each segment is represented by an 8-byte segment descriptor (see the figure), which describes the characteristics of the segment (note carefully: the descriptor is not the address of the segment). Segment descriptors are placed either in the Global Descriptor Table (GDT) or in a Local Descriptor Table (LDT). Usually only one GDT is defined, while each process may have its own LDT if it needs to create segments beyond those stored in the GDT. The address and size of the GDT in main memory are kept in the gdtr processor register, and the address and size of the currently used LDT are kept in the ldtr processor register.
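
As an aside, the gdtr register can be inspected from user space with the sgdt instruction. A sketch (on recent CPUs and kernels that enable UMIP this may fault or return a dummy value):

#include <stdint.h>
#include <stdio.h>

struct __attribute__((packed)) dtr {
    uint16_t  limit;   /* size of the table minus 1 */
    uintptr_t base;    /* linear base address of the table */
};

int main(void)
{
    struct dtr gdtr;

    __asm__("sgdt %0" : "=m"(gdtr));
    printf("GDT base = %p, limit = 0x%04hx\n", (void *)gdtr.base, gdtr.limit);
    return 0;
}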

Returning to the segment descriptor, its fields have the following meanings (a small decoding sketch follows the list):
Base: the linear address of the first byte of the segment.
G: granularity flag: if cleared to 0, the segment size is expressed in bytes; otherwise it is expressed in multiples of 4096 bytes.
Limit: the offset of the last memory cell in the segment, which determines the segment length. With G cleared to 0, the size of a segment can vary from 1 byte to 1 MB; otherwise, from 4 KB to 4 GB.
S: system flag: if cleared to 0, this is a system segment that stores a critical data structure such as a Local Descriptor Table; otherwise it is an ordinary code or data segment.
Type: characterizes the segment type and its access rights (see the table below).
DPL: Descriptor Privilege Level field: used to restrict access to the segment. It represents the minimum CPU privilege level required to access the segment. Thus a segment whose DPL is 0 is accessible only when the CPL is 0 (that is, in kernel mode), while a segment whose DPL is 3 is accessible at any CPL value.
P: Segment-Present flag: equal to 0 if the segment is not currently in main memory. Linux always sets this flag (bit 47) to 1, because it never swaps whole segments out to disk.
D or B: a flag called D or B depending on whether the segment contains code or data. Its meaning differs slightly in the two cases, but basically it is set to 1 if segment offsets are 32 bits long and cleared to 0 if they are 16 bits long (see the Intel manuals for details).
AVL: may be used by the operating system, but is ignored by Linux.
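
Here is a sketch that unpacks these fields from a raw 8-byte descriptor, following the bit layout above (the sample value is the classic Linux kernel code segment descriptor: base 0, limit 0xfffff, G = 1, DPL = 0):

#include <stdint.h>
#include <stdio.h>

static void decode_descriptor(uint64_t d)
{
    uint32_t base  = ((d >> 16) & 0xffffff) | (((d >> 56) & 0xff) << 24);
    uint32_t limit = (d & 0xffff) | (((d >> 48) & 0xf) << 16);
    unsigned g     = (d >> 55) & 1;     /* granularity */
    unsigned p     = (d >> 47) & 1;     /* present */
    unsigned dpl   = (d >> 45) & 3;     /* descriptor privilege level */
    unsigned s     = (d >> 44) & 1;     /* system flag */
    unsigned type  = (d >> 40) & 0xf;

    printf("base=0x%08x limit=0x%05x (%s) type=%u S=%u DPL=%u P=%u\n",
           base, limit, g ? "4 KB units" : "bytes", type, s, dpl, p);
}

int main(void)
{
    decode_descriptor(0x00cf9a000000ffffULL);
    return 0;
}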

The logical address consists of a 16-bit segment selector and a 32-bit offset, and the segment registers hold only the segment selector. The CPU's segmentation unit performs the following operations (all done mechanically, but worth understanding; a C rendering follows the list):
• First examine the TI field of the segment selector to determine which descriptor table holds the segment descriptor. TI indicates that the descriptor is either in the GDT (in which case the segmentation unit gets the linear base address of the GDT from the gdtr register) or in the active LDT (in which case it gets the linear base address of the LDT from the ldtr register).
• Compute the address of the segment descriptor from the Index field of the selector: multiply the Index by 8 (the size of a segment descriptor; in practice this simply masks out the 3 low bits holding the RPL and TI fields) and add the result to the base address held in the gdtr or ldtr register.
• Add the offset of the logical address to the Base field of the segment descriptor to obtain the linear address.
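
Expressed in C, reusing the descriptor decoding from the previous sketch (illustrative only; gdt stands in for the table whose base address the real hardware takes from the gdtr register):

#include <stdint.h>
#include <stdio.h>

static uint32_t descriptor_base(uint64_t d)
{
    return ((d >> 16) & 0xffffff) | (((d >> 56) & 0xff) << 24);
}

static uint32_t logical_to_linear(const uint64_t *gdt, uint16_t selector,
                                  uint32_t offset)
{
    unsigned index = selector >> 3;          /* strip TI (bit 2) and RPL (bits 0-1) */
    uint64_t desc  = gdt[index];             /* table base + index * 8 */

    return descriptor_base(desc) + offset;   /* linear = Base + offset */
}

int main(void)
{
    /* a 4-entry toy GDT; entries 2 and 3 are flat segments with base 0 */
    uint64_t gdt[4] = { 0, 0, 0x00cf9a000000ffffULL, 0x00cf92000000ffffULL };

    /* selector 0x10: index 2, TI = 0 (GDT), RPL = 0 */
    printf("linear = 0x%08x\n", logical_to_linear(gdt, 0x10, 0x1234));
    return 0;
}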

Note that the CPU pairs each segment register with a hidden cache (some books call these non-programmable registers) that holds the segment descriptor. Thanks to this cache, the first two operations above need to be performed only when the selector in a segment register changes.

3 Pointers in Linux

When it stores a pointer to an instruction or to a data structure, the kernel does not need to record a segment selector in the logical address at all, because the CS register already holds the current segment selector. For example, when the kernel calls a function, it executes a call assembly-language instruction that specifies only the offset part of the logical address; the selector is implicit in the CS register. Because there is only one segment "in which the kernel executes", namely the code segment defined by the macro __KERNEL_CS, it is enough to load __KERNEL_CS into CS whenever the CPU switches into kernel mode. The same reasoning applies to pointers to kernel data structures (implicitly using the DS register) and to pointers to user data structures (for which the kernel explicitly uses the ES register).
