At this point many readers are probably confused. Everything covered so far has been data structures and concepts. Where is the actual mechanism that manages dynamic page frames? In other words, how are the page frames of each node and zone handed out to the processes that need them? To answer that, we first have to learn an algorithm: the buddy system algorithm.
The kernel needs a robust and efficient strategy for allocating groups of contiguous page frames. Such a strategy has to deal with the well-known problem of external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes inevitably leave many small blocks of free page frames scattered among the allocated ones. As a result, a request for a large block of contiguous page frames can fail even though the total number of free page frames would be enough to satisfy it.
Linux uses the buddy system algorithm to address external fragmentation. All free page frames are grouped into 11 lists of blocks, which hold blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames respectively. The largest request, 1024 page frames, corresponds to a 4 MB chunk of contiguous RAM. The physical address of the first page frame of each block is a multiple of the block size. For example, the starting address of a 16-page-frame block is a multiple of 16 × 2^12 (2^12 = 4096, the size of a regular page).
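To make these sizes concrete, here is a minimal C sketch of the arithmetic. It assumes 4 KB pages, and the names PAGE_SHIFT, NR_ORDERS, block_pages, block_bytes and is_block_aligned are made up for this illustration; they are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KB pages, i.e. 2^12 bytes */
#define NR_ORDERS  11   /* orders 0..10, i.e. 1..1024 page frames */

/* Number of page frames in a block of the given order. */
uint32_t block_pages(unsigned int order)
{
    assert(order < NR_ORDERS);
    return UINT32_C(1) << order;
}

/* Size in bytes of a block of the given order. */
uint64_t block_bytes(unsigned int order)
{
    return (uint64_t)block_pages(order) << PAGE_SHIFT;
}

/* A block must start at a physical address that is a multiple of its size. */
bool is_block_aligned(uint64_t phys_addr, unsigned int order)
{
    return (phys_addr & (block_bytes(order) - 1)) == 0;
}
```

With these helpers, block_bytes(10) is 4 MB, and a 16-page-frame block (order 4) may start at 0x10000 but not at 0x11000.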
A simple example is provided to illustrate how the algorithm works.
Suppose there is a request for a block of 256 contiguous page frames (that is, 1 MB). The algorithm first checks whether a free block exists in the 256-page-frame list. If there is none, it looks for the next larger block, that is, a free block in the 512-page-frame list. If such a block exists, the kernel splits its 512 page frames into two halves, uses one half to satisfy the request, and inserts the other half into the 256-page-frame list. If there is no free block in the 512-page-frame list either, it looks for the next larger block, a 1024-page-frame one. If such a block exists, the kernel uses 256 of its 1024 page frames to satisfy the request, inserts 512 of the remaining 768 page frames into the 512-page-frame list, and inserts the last 256 into the 256-page-frame list. If the 1024-page-frame list is empty too, the algorithm gives up and signals an error.
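The walk-through above can be sketched with a toy model that only counts how many free blocks each list holds. This is an illustration under simplifying assumptions, not the kernel's implementation: struct toy_zone and toy_alloc are hypothetical names, and no real page frames are tracked.

```c
#include <assert.h>

#define NR_ORDERS 11

/* Toy model: nr_free[k] is the number of free blocks of 2^k page frames. */
struct toy_zone {
    unsigned int nr_free[NR_ORDERS];
};

/*
 * Take one block of the requested order, splitting a larger block if needed.
 * Returns the order of the list the block was taken from, or -1 if no list
 * holds a block that is large enough.
 */
int toy_alloc(struct toy_zone *z, unsigned int order)
{
    unsigned int cur;
    int found;

    assert(order < NR_ORDERS);
    for (cur = order; cur < NR_ORDERS; cur++)
        if (z->nr_free[cur] > 0)
            break;
    if (cur == NR_ORDERS)
        return -1;              /* nothing big enough: the request fails */

    z->nr_free[cur]--;          /* remove the block that was found */
    found = (int)cur;
    while (cur > order) {       /* each split leaves one unused half */
        cur--;
        z->nr_free[cur]++;
    }
    return found;
}
```

Starting with a single free 1024-page-frame block, toy_alloc(&z, 8) takes it from list 10 and leaves one free block on list 9 (512 page frames) and one on list 8 (256 page frames), matching the example.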
Releasing a block of page frames is the reverse of this process, and it is where the algorithm gets its name. The kernel tries to merge a pair of free buddy blocks of size B into a single block of size 2B. Two blocks are considered buddies if:
• Both blocks have the same size, say B.
• They are located at contiguous physical addresses.
• The physical address of the first page frame of the first block is a multiple of 2 × B × 2^12.
The algorithm is iterative: if it succeeds in merging the released blocks, it doubles B and tries again to form an even bigger block.
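The three conditions can be written as a small predicate. In this sketch blocks are identified by the index of their first page frame rather than by physical address, so "a multiple of 2 × B × 2^12 bytes" becomes "a multiple of 2 × B page frames". The function name are_buddies is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Two blocks of 2^order page frames, given by the index of their first
 * page frame, are buddies if they are adjacent and the lower one starts
 * at a multiple of 2^(order+1) page frames.
 */
bool are_buddies(unsigned long idx_a, unsigned long idx_b, unsigned int order)
{
    unsigned long size = 1UL << order;
    unsigned long lo = idx_a < idx_b ? idx_a : idx_b;
    unsigned long hi = idx_a < idx_b ? idx_b : idx_a;

    assert(order < 10);  /* 1024-page-frame blocks are the largest; they never merge */
    return hi - lo == size && lo % (2 * size) == 0;
}
```

Note that adjacency alone is not enough: the 16-page-frame blocks starting at 16 and 32 are neighbors, but they are not buddies because 16 is not a multiple of 32.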
Dizzy yet? If it really doesn't make sense, draw a picture for yourself; the principle behind the algorithm is fairly simple. Now let's look at how Linux implements it:
1. Data Structure
Linux 2.6 uses a separate buddy system for each zone. On the 80x86 architecture there are therefore three buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the "normal" page frames, and the third handles the high-memory page frames. Each buddy system relies on the following main data structures:
(1) The mem_map array introduced earlier. Each zone is concerned with a subset of the mem_map elements. The first element of the subset and the number of elements in it are given by the zone_mem_map and size fields of the zone descriptor, respectively.
(2) An array of 11 elements of type free_area, one for each block size. The array is stored in the free_area field of the zone descriptor.
Consider the k-th element of the free_area array in the zone descriptor, which identifies all the free blocks of 2^k page frames. The free_list field of this element is the head of a doubly linked circular list that collects the page descriptors associated with the free blocks of 2^k page frames. More precisely, the list contains the page descriptor of the first page frame of each free block of 2^k page frames; the pointers to adjacent list elements are stored in the lru field of the page descriptor.
Besides the list head, the k-th element of the free_area array also contains the nr_free field, which gives the number of free blocks of 2^k page frames. Of course, if there are no free blocks of 2^k page frames, nr_free is 0 and the free_list list is empty (both the next and prev pointers of free_list point to the free_list field itself).
Finally, the private field of the descriptor of the first page frame of a free block of 2^k page frames stores the order of the block, that is, the number k. Thanks to this field, when a block is released the kernel can determine whether its buddy is also free and, if so, merge the two into a single block of 2^(k+1) page frames.
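As a rough picture of these fields, here is a user-space sketch of one free_area entry. The struct and field names follow the text above, but the definitions are simplified stand-ins written for this article, and the helpers free_area_init, free_area_is_empty and free_area_add are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's doubly linked circular list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* One entry of the free_area array: all free blocks of one order. */
struct free_area {
    struct list_head free_list;  /* head of the list of free blocks */
    unsigned long nr_free;       /* number of free blocks on the list */
};

/* An empty list is a head whose next and prev point back at itself. */
void free_area_init(struct free_area *area)
{
    area->free_list.next = &area->free_list;
    area->free_list.prev = &area->free_list;
    area->nr_free = 0;
}

bool free_area_is_empty(const struct free_area *area)
{
    return area->free_list.next == &area->free_list;
}

/* Insert a node right after the head and count it. */
void free_area_add(struct free_area *area, struct list_head *node)
{
    assert(area != NULL && node != NULL);
    node->next = area->free_list.next;
    node->prev = &area->free_list;
    area->free_list.next->prev = node;
    area->free_list.next = node;
    area->nr_free++;
}
```

In the kernel the node being linked is the lru field embedded in the page descriptor of the block's first page frame, which is why list_entry() is needed later to get back from the list node to the page descriptor.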
2. Allocating a Block
The kernel uses the __rmqueue() function to find a free block in a zone. The function takes two parameters: zone, the address of the zone descriptor, and order, the base-2 logarithm of the requested block size in page frames (0 for a one-page-frame block, 1 for a two-page-frame block, 2 for a four-page-frame block, and so on). If the page frames are successfully allocated, __rmqueue() returns the address of the page descriptor of the first allocated page frame; otherwise it returns NULL.
__rmqueue() searches the lists in a loop, starting with the list for the requested order and moving on to larger orders whenever the current list is empty:
struct free_area *area;
unsigned int current_order;

for (current_order = order; current_order < 11; ++current_order) {
    area = zone->free_area + current_order;
    if (!list_empty(&area->free_list))
        goto block_found;
}
return NULL;
If the loop ends without finding a suitable free block, __rmqueue() returns NULL. Otherwise a suitable free block has been found; the function removes the descriptor of its first page frame from the list and decreases the free_pages field of the zone descriptor:
block_found:
    page = list_entry(area->free_list.next, struct page, lru);
    list_del(&page->lru);
    ClearPagePrivate(page);
    page->private = 0;
    area->nr_free--;
    zone->free_pages -= 1UL << order;
If the block was found in a list whose order, current_order, is greater than the requested order, a while loop is executed. The idea behind these lines is this: when a block of 2^k page frames has to be used to satisfy a request for 2^h page frames (h < k), the function allocates the first 2^h page frames and repeatedly hands the remaining 2^k - 2^h page frames back to the free_area lists with indexes between h and k:
size = 1 << current_order;
while (current_order > order) {
    area--;
    current_order--;
    size >>= 1;
    buddy = page + size;
    /* insert buddy as first element in the list */
    list_add(&buddy->lru, &area->free_list);
    area->nr_free++;
    buddy->private = current_order;
    SetPagePrivate(buddy);
}
return page;
Because __rmqueue() has found a suitable free block, it returns page, the address of the page descriptor of the first allocated page frame.
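To see which halves the splitting loop puts back, the sketch below repeats its index arithmetic on plain page frame indexes. It is a simplified illustration, not kernel code: split_block and the returned array are hypothetical, and no lists are touched.

```c
#include <assert.h>

#define NR_ORDERS 11

/*
 * Mimic the split loop: a free block of order found_order that starts at
 * page_idx serves a request of order `order`. For every order k that gets
 * a block back, returned[k] receives the index of that block's first page
 * frame; other entries are left untouched. The allocated block itself
 * always starts at page_idx.
 */
void split_block(unsigned long page_idx, unsigned int found_order,
                 unsigned int order, unsigned long returned[NR_ORDERS])
{
    unsigned long size = 1UL << found_order;

    assert(order <= found_order && found_order < NR_ORDERS);
    while (found_order > order) {
        found_order--;
        size >>= 1;
        returned[found_order] = page_idx + size;  /* upper half goes back */
    }
}
```

For a 1024-page-frame block starting at index 0 and a request of order 8, the loop keeps page frames 0 to 255 for the caller, puts the block starting at 512 on list 9, and puts the block starting at 256 on list 8.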
3. Releasing a Block
The __free_pages_bulk() function releases page frames according to the buddy system strategy. It takes three basic parameters:
page: the address of the descriptor of the first page frame in the block being released.
zone: the address of the zone descriptor.
order: the base-2 logarithm of the number of page frames in the block.
__free_pages_bulk() starts by declaring and initializing some local variables:
struct page *base = zone->zone_mem_map;
unsigned long buddy_idx, page_idx = page - base;
struct page *buddy, *coalesced;
int order_size = 1 << order;
The page_idx local variable holds the index of the first page frame in the block, relative to the first page frame of the zone. The order_size local variable is used to increase the zone's counter of free page frames:
zone->free_pages += order_size;
The function now runs a loop that executes at most 10 - order times, once for each possible merge of a block with its buddy. It starts with the smallest block and works upward:
while (order < 10) {
    buddy_idx = page_idx ^ (1 << order);
    buddy = base + buddy_idx;
    if (!page_is_buddy(buddy, order))
        break;
    list_del(&buddy->lru);
    zone->free_area[order].nr_free--;
    ClearPagePrivate(buddy);
    buddy->private = 0;
    page_idx &= buddy_idx;  /* merge */
    order++;
}
I stared at this loop for a long time without understanding it. Only after working through an example and drawing a picture did it gradually make sense. Suppose order is 4; then order_size is 2^4 = 16, meaning a block of 16 contiguous page frames is being released. page_idx is the index, relative to the zone's first page frame, of the first of those 16 page frames. Inside the loop, the function first looks for the block's buddy, that is, the index buddy_idx, which is either page_idx - 16 or page_idx + 16. Put another way, it looks in free_area[4], the list of 16-page-frame blocks, for a free block adjacent to the 16-page-frame block that starts at page.
Note the line buddy_idx = page_idx ^ (1 << order). It is clever and compact. Since order is 4, the first iteration computes buddy_idx = page_idx ^ (1 << 4), that is, buddy_idx = page_idx ^ 10000 in binary. If the fifth bit of page_idx is 1, as in page 20 (10100), the XOR gives page 4 (00100). If the fifth bit is 0, as in page 40 (101000), the XOR gives page 56 (111000). These numbers only illustrate the bit arithmetic; a real 16-page-frame block starts at a multiple of 16, so in practice the pair would be something like 32 (100000) and 48 (110000).
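The XOR trick is easy to try in isolation. The helper below, buddy_index, is a hypothetical name used only for this sketch; it insists that page_idx is aligned to the block size, as the first page frame of a real block would be, so the usage examples use 32 and 48 rather than arbitrary page numbers.

```c
#include <assert.h>

/* Index of the buddy of the block of 2^order page frames starting at page_idx. */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    /* A block of this order must start at a multiple of 2^order. */
    assert((page_idx & ((1UL << order) - 1)) == 0);
    /* Flip bit `order`: subtracts 2^order if the bit is set, adds it otherwise. */
    return page_idx ^ (1UL << order);
}
```

For order 4, buddy_index(48, 4) is 32 and buddy_index(32, 4) is 48. Applying the function twice always returns the original index, which is what makes the two blocks each other's buddy.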
Why is this operation needed? Think about what we are trying to do. __free_pages_bulk() wants to locate the buddy of the block of 2^order page frames that starts at its page parameter and merge with it. In the mem_map array, the first page frame of the buddy is either 2^order entries before page or 2^order entries after it. Always adding would miss a buddy that lies before the block, and always subtracting would miss one that lies after it. The XOR flips bit order of page_idx, so it subtracts when that bit is set and adds when it is clear. As for why the code does not test the bit and then add or subtract explicitly, my guess is that the Linux developers avoided it for performance reasons. I won't pursue that question any further here.
Once the buddy's index is known, the address of the descriptor of its first page frame is assigned to buddy:
buddy = base + buddy_idx;
The function now calls page_is_buddy() to check whether buddy really is a valid buddy, that is, whether it describes the first page frame of a free block of order_size page frames:
int page_is_buddy(struct page *page, int order)
{
    if (PagePrivate(page) && page->private == order &&
        !PageReserved(page) && page_count(page) == 0)
        return 1;
    return 0;
}
As you can see, a block counts as a buddy only if the following four conditions hold:
(1) The first page frame of the buddy must be free (its _count field equals -1);
(2) It must belong to dynamic memory (the PG_reserved bit is clear);
(3) Its private field must be meaningful (the PG_private bit is set);
(4) Its private field must store the order of the block being released.
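The four conditions map one-to-one onto a boolean expression. The sketch below uses a toy page descriptor instead of the kernel's struct page; struct toy_page, its fields, and toy_page_is_buddy are hypothetical names, with plain booleans standing in for the PG_private and PG_reserved flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page descriptor holding only the fields the check looks at. */
struct toy_page {
    bool has_private;       /* stands in for the PG_private flag */
    bool reserved;          /* stands in for the PG_reserved flag */
    int count;              /* stands in for _count: -1 means free */
    unsigned long private;  /* order of the free block, if has_private */
};

bool toy_page_is_buddy(const struct toy_page *page, unsigned int order)
{
    assert(page != NULL);
    return page->count == -1          /* (1) the page frame is free */
        && !page->reserved            /* (2) it is dynamic memory */
        && page->has_private          /* (3) private is meaningful */
        && page->private == order;    /* (4) same order as our block */
}
```

Condition (4) is the one that stops a merge when the neighboring block is free but has been split: its first page frame then records a smaller order than the block being released.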
If all these conditions are met, the buddy is a genuine one. The buddy block is removed from its free list and merged with our block by executing page_idx &= buddy_idx. Note how closely this line is tied to the earlier buddy_idx = page_idx ^ (1 << order): the two indexes differ only in bit order, so the AND yields the lower of the two, which is the first page frame of the merged block. The loop then runs again, looking for the buddy of the block that is now twice as large.
If at least one condition in page_is_buddy() is not met, the function breaks out of the loop, because the free block obtained so far cannot be merged any further. The function then inserts it into the proper list and stores its order in the private field of its first page frame:
coalesced = base + page_idx;
coalesced->private = order;
SetPagePrivate(coalesced);
list_add(&coalesced->lru, &zone->free_area[order].free_list);
zone->free_area[order].nr_free++;
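Putting the release path together, here is a toy user-space version of the merge loop. It is a sketch under simplifying assumptions and not the kernel's code: the zone has only 64 page frames, an array entry per page frame replaces the free lists and the private field, and the names struct toy_map, toy_map_init and toy_free are hypothetical.

```c
#include <assert.h>

#define TOY_PAGES 64  /* assumption: a tiny zone of 64 page frames */

/* free_order[i] is the order of the free block starting at page i, or -1. */
struct toy_map {
    int free_order[TOY_PAGES];
};

void toy_map_init(struct toy_map *m)
{
    for (int i = 0; i < TOY_PAGES; i++)
        m->free_order[i] = -1;
}

/*
 * Free the block of 2^order page frames starting at page_idx, merging it
 * with free buddies for as long as possible. Returns the index of the
 * first page frame of the resulting block; its order goes to *out_order.
 */
unsigned long toy_free(struct toy_map *m, unsigned long page_idx,
                       unsigned int order, unsigned int *out_order)
{
    assert((1UL << order) <= TOY_PAGES);
    assert(page_idx < TOY_PAGES && (page_idx & ((1UL << order) - 1)) == 0);

    while ((1UL << (order + 1)) <= TOY_PAGES) {
        unsigned long buddy_idx = page_idx ^ (1UL << order);

        if (m->free_order[buddy_idx] != (int)order)
            break;                      /* buddy is busy, split, or absent */
        m->free_order[buddy_idx] = -1;  /* take the buddy off its "list" */
        page_idx &= buddy_idx;          /* merged block starts at the lower index */
        order++;
    }
    m->free_order[page_idx] = (int)order;
    *out_order = order;
    return page_idx;
}
```

For example, if the block at 32 is free with order 4 and the block at 48 (order 4) is released, the two merge into an order-5 block starting at 32. If the block at 16 is released instead, nothing merges: its buddy is the block at 0, not the adjacent block at 32.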