In the previous post, we saw that the kernel successfully defers the allocation of dynamic memory to processes by using a new kind of resource. When a User Mode process asks for dynamic memory, it does not get the page frames it requested; it only gets the right to use a new range of linear addresses, and this range becomes part of the process address space. Such an interval is called a "linear zone". In this post, let's discuss the linear zone in detail.
1. Linear Zone Data Structure
In Linux, the linear zone is implemented through objects of the vm_area_struct type. Its fields are as follows:
struct vm_area_struct {
    struct mm_struct *vm_mm;        /* points to the memory descriptor that owns the linear zone */
    unsigned long vm_start;         /* first linear address inside the linear zone */
    unsigned long vm_end;           /* first linear address after the linear zone */
    /* linked list of VM areas per task, sorted by address: next linear zone in the list */
    struct vm_area_struct *vm_next;
    pgprot_t vm_page_prot;          /* access permissions of the page frames in the linear zone */
    unsigned long vm_flags;         /* flags of the linear zone */
    struct rb_node vm_rb;           /* data for the red-black tree */
    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap prio tree, or
     * linkage to the list of like vmas hanging off its node, or
     * linkage of vma in the address_space->i_mmap_nonlinear list.
     */
    union {
        struct {
            struct list_head list;
            void *parent;           /* aligns with prio_tree_node parent */
            struct vm_area_struct *head;
        } vm_set;
        struct raw_prio_tree_node prio_tree_node;
    } shared;                       /* links to the data structures used for reverse mapping */
    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages. A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_node; /* serialized by anon_vma->lock; pointers for the list of anonymous linear zones */
    struct anon_vma *anon_vma;      /* serialized by page_table_lock; pointer to the anon_vma data structure */
    /* function pointers to deal with this struct: the methods of the linear zone */
    struct vm_operations_struct *vm_ops;
    /* information about our backing store: */
    unsigned long vm_pgoff;         /* offset in the mapped file; for anonymous pages,
                                       it is either 0 or vm_start/PAGE_SIZE */
    struct file *vm_file;           /* points to the file object of the mapped file (if any) */
    void *vm_private_data;          /* points to private data of the linear zone */
    unsigned long vm_truncate_count; /* used when releasing a linear address range in a nonlinear file memory mapping */
#ifndef CONFIG_MMU
    atomic_t vm_usage;              /* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
};
Each linear zone descriptor identifies a linear address range. The vm_start field contains the first linear address of the range, while the vm_end field contains the first linear address outside the range, so vm_end - vm_start gives the length of the linear zone. The vm_mm field points to the mm_struct memory descriptor of the process that owns the range. We will describe the other fields of vm_area_struct later.
The linear zones owned by processes never overlap, and the kernel tries its best to merge the newly allocated linear zones with the existing adjacent linear zones. If the access permissions of the two adjacent zones match, they can be merged.
When a new linear address range is added to the address space of a process, the kernel checks whether an existing linear zone can be enlarged; if not, it creates a new linear zone. Similarly, when a linear address range is removed from the address space of a process, the kernel resizes the affected linear zones. In some cases, resizing forces a linear zone to be split into two smaller ones (in theory, removing a linear address range can fail because no free memory is available for the new descriptor, but this is very unlikely).
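As a rough illustration of the resizing described above, the sketch below removes an address range from a single linear zone and reports what is left. It is a user-space model, not kernel code: struct range and remove_interval() are hypothetical names introduced only for this example, and ranges are half-open, like [vm_start, vm_end).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a linear zone: the half-open range [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Remove [hole_start, hole_end) from zone z.
 * Writes the surviving pieces to out[0..1] and returns how many there are:
 *   0 - the whole zone was removed
 *   1 - the zone was untouched or shrunk from one side
 *   2 - the hole fell strictly inside, so the zone is split in two
 */
static size_t remove_interval(struct range z, unsigned long hole_start,
                              unsigned long hole_end, struct range out[2])
{
    size_t n = 0;

    assert(z.start < z.end);
    assert(hole_start < hole_end);

    /* Part of the zone that lies below the hole, if any. */
    if (z.start < hole_start) {
        out[n].start = z.start;
        out[n].end = hole_start < z.end ? hole_start : z.end;
        n++;
    }
    /* Part of the zone that lies above the hole, if any. */
    if (hole_end < z.end) {
        out[n].start = hole_end > z.start ? hole_end : z.start;
        out[n].end = z.end;
        n++;
    }
    return n;
}
```

Punching a hole in the middle of a zone is the only case that returns 2. In the real kernel that case needs a second descriptor, which is why the removal can, in theory, fail.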
The vm_ops field points to a vm_operations_struct data structure, which stores the methods of the linear zone. Only the methods outside the CONFIG_NUMA block apply to UMA systems:
struct vm_operations_struct {
    /* called when the linear zone is added to the set of linear zones owned by a process */
    void (*open)(struct vm_area_struct *area);
    /* called when the linear zone is removed from the set of linear zones owned by a process */
    void (*close)(struct vm_area_struct *area);
    /* called by the Page Fault exception handler when a process tries to access a page that is not present in RAM but whose linear address belongs to the linear zone */
    struct page *(*nopage)(struct vm_area_struct *area, unsigned long address, int *type);
    unsigned long (*nopfn)(struct vm_area_struct *area, unsigned long address);
    /* sets the page table entries corresponding to the linear addresses of the linear zone (prefaulting); mainly used for nonlinear file memory mappings */
    int (*populate)(struct vm_area_struct *area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
    /* notification that a previously read-only page is about to become
     * writable, if an error is returned it will cause a SIGBUS */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
#ifdef CONFIG_NUMA
    int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
    struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
                                    unsigned long addr);
    int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
                   const nodemask_t *to, unsigned long flags);
#endif
};
All linear zones owned by a process are linked together in a simple list. The zones appear in the list in ascending order of memory address; however, two consecutive zones can be separated by an unused range of memory addresses. The vm_next field of each vm_area_struct element points to the next element in the list. The kernel finds the linear zones through the mmap field of the process memory descriptor, which points to the first linear zone descriptor in the list.
The map_count field of the memory descriptor stores the number of linear zones owned by the process. By default, a process can own up to 65536 different linear zones; the system administrator can change this limit by writing to the /proc/sys/vm/max_map_count file.
[Figure: the relationship between the address space of a process, its memory descriptor, and the linked list of linear zones.]
One operation the kernel performs very often is looking up the linear zone that contains a given linear address. Because the list is sorted, the search can stop as soon as it finds a linear zone that ends after the specified address.
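The list search just described can be sketched in a few lines of user-space C. This is only a model under stated assumptions: struct zone is a cut-down stand-in for vm_area_struct that keeps three of its fields, and link_zones() and find_zone_list() are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for vm_area_struct: only the fields used here. */
struct zone {
    unsigned long vm_start;   /* first address inside the zone */
    unsigned long vm_end;     /* first address after the zone */
    struct zone *vm_next;     /* next zone in ascending address order */
};

/* Link an array of zones into a list, checking that they are sorted and do not overlap. */
static struct zone *link_zones(struct zone *zones, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(zones[i].vm_start < zones[i].vm_end);
        if (i > 0)
            assert(zones[i - 1].vm_end <= zones[i].vm_start);
        zones[i].vm_next = (i + 1 < n) ? &zones[i + 1] : NULL;
    }
    return n ? &zones[0] : NULL;
}

/*
 * Return the first zone whose vm_end is greater than addr, or NULL.
 * The caller must still compare addr with vm_start: the returned zone
 * may begin after addr if addr falls in an unused gap.
 */
static struct zone *find_zone_list(struct zone *head, unsigned long addr)
{
    for (struct zone *z = head; z != NULL; z = z->vm_next)
        if (addr < z->vm_end)
            return z;   /* sorted list: no later zone can be a better match */
    return NULL;
}
```

In the worst case the loop visits every descriptor, which is exactly the cost the following paragraphs worry about.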
However, the list is convenient only when the process has few linear zones, say ten or twenty. Searching for, inserting, and deleting an element in a list all take time linearly proportional to the length of the list.
Although most Linux processes use very few linear zones, some large applications, such as object-oriented databases or specialized debuggers for malloc() usage, can have hundreds or even thousands of them. In such cases, managing the list of linear zones becomes very inefficient, and the performance of the memory-related system calls degrades to an intolerable level.
Therefore, Linux 2.6 also stores the linear zone descriptors in a data structure called a red-black tree.
2. Red-Black Tree Algorithm
A red-black tree is a kind of self-balancing binary search tree. Let's first recall the binary search tree: each element (or node) usually has two children, a left child and a right child. The elements in the tree are sorted: for a node with key N, the keys of all elements in its left subtree are smaller than N and, conversely, the keys of all elements in its right subtree are larger than N (see figure (a); the key of each node is written inside the node). On top of these basic binary search tree properties, a red-black tree must satisfy the following five rules:
1. Each node must be black or red.
2. The root of the tree must be black.
3. A newly inserted node is initially colored red.
4. The children of a red node must be black.
5. Each path from a node to a descendant leaf node contains the same number of black nodes. When counting the number of black nodes, the NULL pointer is also counted as a black node.
These rules ensure that the height of any red-black tree with n internal nodes is at most 2 × log2(n + 1).
Searching for an element in a red-black tree is therefore very efficient, because the operation runs in time linearly proportional to the logarithm of the tree size. In other words, doubling the number of linear zones adds only one more iteration to the search.
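To get a feel for the numbers, here is a small sketch that evaluates the height bound with integer arithmetic. Both helper names are hypothetical, and the logarithm is rounded up to a whole number, so the result is a safe but slightly loose upper bound rather than the exact height of any particular tree.

```c
#include <assert.h>

/* Smallest k such that 2^k >= v. Assumes 1 <= v and v far below ULONG_MAX. */
static unsigned int ceil_log2(unsigned long v)
{
    unsigned int k = 0;
    unsigned long p = 1;

    assert(v >= 1);
    while (p < v) {
        p <<= 1;
        k++;
    }
    return k;
}

/* Upper bound on the tree height: 2 * log2(n + 1), with the log rounded up. */
static unsigned int rbtree_height_bound(unsigned long n)
{
    return 2 * ceil_log2(n + 1);
}
```

For a process with 65535 linear zones, rbtree_height_bound(65535) gives 32, whereas a scan of the sorted list may have to visit all 65535 descriptors.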
For example, suppose an element with value 4 must be inserted into the red-black tree shown in figure (a). Its proper position is the right child of the node with key 3. Once it is inserted, however, the red node with key 3 has a red child, which violates Rule 4. To satisfy the rule, the colors of the nodes with values 3, 4, and 7 are changed. This, in turn, violates Rule 5, so the algorithm performs a "rotation" on the subtree rooted at the node with key 19, producing the new red-black tree shown in figure (b). This looks complicated, but inserting or deleting an element in a red-black tree requires only a small number of operations: the cost is linearly proportional to the logarithm of the tree size.
Therefore, to store the linear zones of a process, Linux uses both a linked list and a red-black tree. The two data structures contain pointers to the same linear zone descriptors. When a linear zone descriptor is inserted or removed, the kernel looks up the preceding and following elements through the red-black tree and uses them to update the linked list quickly, without scanning it.
The head of the linked list is referenced by the mmap field of the memory descriptor, and each linear zone object stores the pointer to the next element of the list in its vm_next field. The root of the red-black tree is referenced by the mm_rb field of the memory descriptor, and each linear zone object stores the node color and the pointers to the parent, the left child, and the right child in its vm_rb field of type rb_node.
Generally, the red-black tree is used to determine the linear zone containing the specified address, while the linked list is usually used to scan the entire linear zone set.
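The tree lookup can be sketched as follows. To keep the example short it uses a plain, unbalanced binary search tree with no node colors, so it only illustrates the descent, not the rebalancing. struct tzone, tree_insert(), and find_zone_tree() are hypothetical names for this example, and the descent loosely follows the shape of the kernel's own lookup without being a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: a zone with embedded binary-tree links (colors omitted). */
struct tzone {
    unsigned long vm_start;
    unsigned long vm_end;
    struct tzone *left;
    struct tzone *right;
};

/* Plain (unbalanced) ordered insert keyed by address; returns the new root. */
static struct tzone *tree_insert(struct tzone *root, struct tzone *z)
{
    struct tzone **link = &root;

    assert(z->vm_start < z->vm_end);
    z->left = z->right = NULL;
    while (*link != NULL)
        link = (z->vm_start < (*link)->vm_start) ? &(*link)->left
                                                 : &(*link)->right;
    *link = z;
    return root;
}

/* First zone with vm_end > addr, found by descending the tree. */
static struct tzone *find_zone_tree(struct tzone *root, unsigned long addr)
{
    struct tzone *best = NULL;

    while (root != NULL) {
        if (addr < root->vm_end) {
            best = root;             /* candidate; a lower one may exist */
            if (root->vm_start <= addr)
                break;               /* addr lies inside this zone */
            root = root->left;
        } else {
            root = root->right;      /* this zone ends at or before addr */
        }
    }
    return best;
}
```

Each step discards one subtree, so on a balanced tree the number of steps grows with the height of the tree rather than with the number of zones.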
3. Linear Zone Access Permissions
Before moving on, let's clarify the relationship between a page and a linear zone. As mentioned in the previous post, we use the term "page" to refer both to a set of linear addresses and to the data contained in that group of addresses. In particular, we call the linear address range from 0 to 4095 page 0, the range from 4096 to 8191 page 1, and so on. Each linear zone therefore consists of a set of pages with consecutive page numbers.
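The page numbering above is plain arithmetic on the linear address, which a short sketch makes concrete. The helper names are hypothetical; PAGE_SHIFT is fixed at 12 here on the assumption of 4 KB pages, matching the 0 to 4095 example in the text.

```c
#include <assert.h>

#define PAGE_SHIFT 12                  /* 4 KB pages, as in the example above */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Number of the page that contains a linear address. */
static unsigned long page_number(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* Number of pages spanned by the zone [vm_start, vm_end). */
static unsigned long zone_pages(unsigned long vm_start, unsigned long vm_end)
{
    /* A zone is made of whole pages, so both boundaries are page aligned. */
    assert((vm_start & (PAGE_SIZE - 1)) == 0);
    assert((vm_end & (PAGE_SIZE - 1)) == 0);
    assert(vm_start < vm_end);
    return (vm_end - vm_start) >> PAGE_SHIFT;
}
```

For instance, addresses 4096 and 8191 both map to page 1, and a zone from 0x1000 to 0x4000 spans three pages.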
We have already discussed two kinds of flags associated with a page:
- A few flags stored in each page table entry, such as Read/Write and Present (for details, see the "80x86 Linux segmentation and paging mechanism" post).
- A set of flags stored in the flags field of each page descriptor (see the "Linux page frame management" post).
The first kind of flag is used by the 80x86 hardware to check whether the requested kind of addressing can be performed. The second kind is used by Linux for many different purposes.
Now we introduce a third kind of flag: those associated with the pages of a linear zone. They are stored in the vm_flags field of the vm_area_struct descriptor. Some flags give the kernel information about all the pages of the linear zone, such as what they contain and what rights the process has to access each page. Other flags describe the linear zone itself, such as how it can grow (these flags are defined in include/linux/mm.h):
VM_READ: pages can be read.
VM_WRITE: pages can be written.
VM_EXEC: pages can be executed.
VM_SHARED: pages can be shared by several processes.
VM_MAYREAD: the VM_READ flag may be set.
VM_MAYWRITE: the VM_WRITE flag may be set.
VM_MAYEXEC: the VM_EXEC flag may be set.
VM_MAYSHARE: the VM_SHARED flag may be set.
VM_GROWSDOWN: the linear zone can expand toward lower addresses.
VM_GROWSUP: the linear zone can expand toward higher addresses.
VM_SHM: the linear zone is used for IPC shared memory.
VM_DENYWRITE: the linear zone maps a file that cannot be opened for writing.
VM_EXECUTABLE: the linear zone maps an executable file.
VM_LOCKED: pages in the linear zone are locked and cannot be swapped out.
VM_IO: the linear zone maps the I/O address space of a device.
VM_SEQ_READ: the application accesses the pages sequentially.
VM_RAND_READ: the application accesses the pages in a truly random order.
VM_DONTCOPY: the linear zone is not copied when a new process is created.
VM_DONTEXPAND: expanding the linear zone through the mremap() system call is forbidden.
VM_RESERVED: the linear zone is special (for example, it maps the I/O address space of a device), so its pages must not be swapped out.
VM_ACCOUNT: check whether there is enough free memory for the mapping when creating an IPC shared linear zone.
VM_HUGETLB: the pages in the linear zone are handled through the extended paging mechanism.
VM_NONLINEAR: the linear zone implements a nonlinear file mapping.
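One way to read the VM_MAY* flags is as a ceiling on the permissions that may later be switched on. The sketch below checks a requested set of permissions against that ceiling. The flag values are the ones I recall from include/linux/mm.h in 2.6 kernels, so treat them as an assumption and check your own tree; may_set_permissions() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Low vm_flags bits; values assumed to match include/linux/mm.h in 2.6 kernels. */
#define VM_READ      0x0001UL
#define VM_WRITE     0x0002UL
#define VM_EXEC      0x0004UL
#define VM_SHARED    0x0008UL
#define VM_MAYREAD   0x0010UL
#define VM_MAYWRITE  0x0020UL
#define VM_MAYEXEC   0x0040UL
#define VM_MAYSHARE  0x0080UL

/*
 * Return 1 if every permission in `requested` (a mix of VM_READ, VM_WRITE,
 * VM_EXEC) has its VM_MAY* counterpart set in vm_flags, 0 otherwise.
 * With the values above, each VM_MAY* bit sits four positions above
 * the permission bit it guards.
 */
static int may_set_permissions(unsigned long vm_flags, unsigned long requested)
{
    const unsigned long mask = VM_READ | VM_WRITE | VM_EXEC;
    unsigned long allowed;

    assert((requested & ~mask) == 0);
    allowed = (vm_flags >> 4) & mask;
    return (requested & ~allowed) == 0;
}
```

A zone created with VM_READ | VM_MAYREAD | VM_MAYWRITE could therefore be made writable later, but never executable.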
The page access permissions in a linear zone descriptor can be combined arbitrarily. For example, the pages of a linear zone may be executable but not readable. To implement this protection scheme efficiently, the read, write, and execute permissions associated with the pages of a linear zone must be duplicated in all the corresponding page table entries, so that the paging unit can perform the checks directly. In other words, the page access permissions dictate which kinds of access should generate a Page Fault exception. In a later post, we will see that Linux delegates the job of working out what caused a Page Fault to the Page Fault handler, which implements several page-handling strategies.
The initial values of the page table flags (which must be the same for all pages in the same linear zone) are stored in the vm_page_prot field of the vm_area_struct descriptor. When adding a page, the kernel sets the flags in the corresponding page table entry according to the value of vm_page_prot.
typedef struct { unsigned long pgprot; } pgprot_t; /* include/asm-i386/page.h */
Some readers have asked why the access permissions of a linear zone cannot be translated directly into page protection bits. The reasons are:
- In some cases, access to a page should generate a Page Fault exception even though the vm_flags field of the corresponding linear zone descriptor allows that type of access. For example, with Copy On Write the kernel may decide to store two identical private pages (whose VM_SHARED flag is cleared) belonging to two different processes in the same page frame; in that case, an exception must be raised as soon as either process tries to modify the page.
- The page tables of 80x86 processors have only two protection bits, the Read/Write and User/Supervisor flags. Moreover, the User/Supervisor flag of any page included in a linear zone must always be set to 1, because the page must always be accessible to User Mode processes.
- Recent Intel Pentium 4 microprocessors with PAE enabled provide an NX (No eXecute) flag in each 64-bit page table entry.
If the kernel is not compiled with PAE support, Linux adopts the following rules to overcome the hardware limitations of the 80x86 microprocessor:
- Read access permission always implies execute access permission, and vice versa.
- Write access permission always implies read access permission.
If the kernel is compiled with PAE support and the CPU has the NX flag, Linux adopts different rules:
- Execute access permission always implies read access permission.
- Write access permission always implies read access permission.
Therefore, the 16 possible combinations of the read, write, execute, and share access permissions are reduced according to the following rules:
- If the page has both write and share access permissions, the Read/Write bit is set to 1.
- If the page has read or execute access permission but lacks either the write or the share access permission, the Read/Write bit is cleared.
- If the NX bit is supported and the page does not have execute access permission, the NX bit is set to 1.
- If the page has no access permissions at all, the Present bit is cleared to 0 so that every access generates a Page Fault exception. However, to distinguish this case from the one where the page frame is really not present, Linux also sets the Page Size bit to 1. (You might consider this use of the Page Size bit improper, because the bit is meant to indicate the real page size. Linux can get away with this trick, however, because the 80x86 chip checks the Page Size bit in Page Directory entries, not in the entries of the page tables.)
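The four rules above can be written down as a small function. This is a sketch under stated assumptions: the R_* constants and struct pte_bits are illustrative names invented for this example, not kernel macros, and the access permissions are encoded one per bit in the order read, write, execute, share.

```c
#include <assert.h>

/* Access permissions, one bit each (illustrative encoding). */
#define R_READ   0x1u
#define R_WRITE  0x2u
#define R_EXEC   0x4u
#define R_SHARE  0x8u

/* Illustrative view of the page table bits discussed above. */
struct pte_bits {
    int present;     /* Present bit */
    int read_write;  /* Read/Write bit */
    int page_size;   /* Page Size bit, reused to mark "no access" */
    int nx;          /* NX bit, meaningful only when NX is supported */
};

static struct pte_bits rights_to_bits(unsigned int rights, int nx_supported)
{
    struct pte_bits b = { 1, 0, 0, 0 };

    assert(rights < 16);

    if ((rights & (R_READ | R_WRITE | R_EXEC)) == 0) {
        /* No access permissions: fault on every access, but tag the entry. */
        b.present = 0;
        b.page_size = 1;
        return b;
    }
    /* Writable only when both write and share are granted. */
    if ((rights & R_WRITE) && (rights & R_SHARE))
        b.read_write = 1;
    /* Without execute permission, set NX where the CPU supports it. */
    if (nx_supported && !(rights & R_EXEC))
        b.nx = 1;
    return b;
}
```

Note that a private writable page (write without share) keeps Read/Write cleared, which is what makes Copy On Write possible: the first write faults and the handler can then copy the page.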
The reduced protection bits corresponding to each combination of access permissions are stored in the 16 elements of the protection_map array (mm/mmap.c):
pgprot_t protection_map[16] = {
    __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
    __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};
/* include/asm-i386/pgtable.h */
#define __P000 PAGE_NONE
#define __P001 PAGE_READONLY
#define __P010 PAGE_COPY
#define __P011 PAGE_COPY
#define __P100 PAGE_READONLY_EXEC
#define __P101 PAGE_READONLY_EXEC
#define __P110 PAGE_COPY_EXEC
#define __P111 PAGE_COPY_EXEC
#define __S000 PAGE_NONE
#define __S001 PAGE_READONLY
#define __S010 PAGE_SHARED
#define __S011 PAGE_SHARED
#define __S100 PAGE_READONLY_EXEC
#define __S101 PAGE_READONLY_EXEC
#define __S110 PAGE_SHARED_EXEC
#define __S111 PAGE_SHARED_EXEC
For example, PAGE_COPY_EXEC is defined as:
#define PAGE_COPY_EXEC \
    __pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
We will not go through the other cases one by one. If you are interested, you can look them up in pgtable.h.
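To see how a 16-element table like protection_map gets used, here is a user-space sketch that picks an entry from the four low bits of vm_flags. It assumes that those bits are read, write, execute, and share in that order (values 1, 2, 4, and 8), which matches the __Pxwr / __Sxwr naming. The enum, prot_table, and lookup_prot() are hypothetical stand-ins, with symbolic labels in place of the real PAGE_* bit patterns.

```c
#include <assert.h>

/* Symbolic stand-ins for the PAGE_* protection values. */
enum page_prot {
    P_NONE, P_READONLY, P_COPY, P_SHARED,
    P_READONLY_EXEC, P_COPY_EXEC, P_SHARED_EXEC
};

/* Same layout as protection_map: index = share<<3 | exec<<2 | write<<1 | read. */
static const enum page_prot prot_table[16] = {
    /* private mappings, __P000 .. __P111 */
    P_NONE, P_READONLY, P_COPY, P_COPY,
    P_READONLY_EXEC, P_READONLY_EXEC, P_COPY_EXEC, P_COPY_EXEC,
    /* shared mappings, __S000 .. __S111 */
    P_NONE, P_READONLY, P_SHARED, P_SHARED,
    P_READONLY_EXEC, P_READONLY_EXEC, P_SHARED_EXEC, P_SHARED_EXEC
};

/* Pick the entry selected by the four low bits of vm_flags. */
static enum page_prot lookup_prot(unsigned long vm_flags)
{
    unsigned long idx = vm_flags & 0x0fUL;  /* read, write, execute, share */

    assert(idx < 16);
    return prot_table[idx];
}
```

A private readable and writable zone (flags 0x3) lands on the PAGE_COPY entry, while the same permissions with the share bit set (0xb) land on PAGE_SHARED. Higher bits such as VM_GROWSDOWN are masked off and do not affect the result.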