2.6.29: an improvement to vmscan in the kernel


Active reclaim is not unnecessary, but it cannot go unrestricted. In the past, the scanning logic only prevented high-end (highmem) zones from being over-scanned, and even that not completely. Although kswapd's background scanning does not overlap with an active scan of the same high-end zone, multiple concurrent active scans do compete with each other. For example, suppose physical memory on a node is tight: several processes on several CPUs may call try_to_free_pages at the same time, unaware of one another. With several scanners running at once, too much memory gets reclaimed, and the result is thrashing, because many frequently used pages are evicted. How can the replacement algorithm end up evicting frequently referenced pages? This is, in fact, a small weakness of the replacement algorithm in Linux. Consider the replacement process. The kernel scans memory and ages pages, and in this process the LRU algorithm works well, because frequently used pages are referenced again and ptep_clear_flush_young returns non-zero, indicating the page was accessed recently. Most processor hardware has a mechanism for this: when the MMU accesses a page, the accessed bit of the corresponding page-table entry is set to 1, and the kernel can use this mechanism to implement the LRU algorithm very easily.

If all scanning happened that way, there would be no imbalance at all. But kswapd is not the only thing walking the LRU lists: when an application allocates memory and the current free memory cannot satisfy the allocation, try_to_free_pages also scans the LRU lists, in the same way kswapd does, aging pages and finally evicting the oldest pages from the inactive list. So even with the kswapd kernel thread running, under heavy memory pressure the memory subsystem still reclaims directly. The aging done by this kind of direct reclaim is what I call active scanning, and it disturbs the standard LRU algorithm. Imagine a multi-processor system with four CPUs, and exactly four CPUs executing try_to_free_pages. The first CPU scans the LRU list and ages every page by one unit; then the second CPU runs and ages them by another unit, and so on. In a short period, many active pages are aged repeatedly even though they really are active, so a large number of pages end up evicted, causing heavy memory thrashing soon afterwards.

Therefore, some restriction is needed: an active scan must not walk the LRU lists without limit the way kswapd is allowed to. In kernel 2.6.29, that restriction looks like this:

static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        unsigned long nr[NR_LRU_LISTS];
        unsigned long nr_to_scan;
        unsigned long percent[2];
        enum lru_list l;
        unsigned long nr_reclaimed = sc->nr_reclaimed;
        unsigned long swap_cluster_max = sc->swap_cluster_max;

        get_scan_ratio(zone, sc, percent);

        for_each_evictable_lru(l) {
                int file = is_file_lru(l);
                int scan;

                scan = zone_nr_pages(zone, sc, l);
                if (priority) {
                        scan >>= priority;
                        scan = (scan * percent[file]) / 100;
                }
                if (scanning_global_lru(sc)) {
                        zone->lru[l].nr_scan += scan;
                        nr[l] = zone->lru[l].nr_scan;
                        if (nr[l] >= swap_cluster_max)
                                zone->lru[l].nr_scan = 0;
                        else
                                nr[l] = 0;
                } else
                        nr[l] = scan;
        }
        /*
         * If priority is small or the system has a lot of memory, nr[l]
         * easily stays non-zero, which drives the chain of scans below
         * (freeing inactive pages leaves the inactive list short, which
         * triggers scanning of active pages, and the next round repeats
         * the same unwanted step; with several active scanners running
         * this is even more likely).
         */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
               nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(l) {
                        if (nr[l]) {
                                nr_to_scan = min(nr[l], swap_cluster_max);
                                nr[l] -= nr_to_scan;

                                nr_reclaimed += shrink_list(l, nr_to_scan,
                                                            zone, sc, priority);
                        }
                }
                /*
                 * On large memory systems, scan >> priority can become
                 * really large. This is fine for the starting priority;
                 * we want to put equal scanning pressure on each zone.
                 * However, if the VM has a harder time of freeing pages,
                 * with multiple processes reclaiming pages, the total
                 * freeing target can get unreasonably large.
                 */
                /*
                 * So a check is added here: as soon as enough pages have
                 * been reclaimed, bail out of the scan immediately rather
                 * than over-reclaiming.
                 */
                if (nr_reclaimed > swap_cluster_max &&
                    priority < DEF_PRIORITY && !current_is_kswapd())
                        break;
        }

        sc->nr_reclaimed = nr_reclaimed;

        /*
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio.
         */
        if (inactive_anon_is_low(zone, sc))
                shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

        throttle_vm_writeout(sc->gfp_mask);
}

When a process calls try_to_free_pages, it needs to free sc->swap_cluster_max pages, the same swap_cluster_max seen above. The correct outcome is that roughly swap_cluster_max pages get freed, but older kernels' vmscan did not behave that way: the shrink_zone above is the new 2.6.29 version, while in 2.6.28 and earlier there was no if (nr_reclaimed > swap_cluster_max && ... check. What happened without it?

At the beginning, priority is at its highest value and the pressure to shrink memory is relatively small, although on a large-memory system it can still be considerable. Suppose several active scanning processes detect the memory shortage almost simultaneously. While evicting inactive pages, they take turns aging the active lists, so in the first few priority passes the effect is mainly to evict a small number of inactive pages and then age many more active pages onto the inactive list: an accumulation process. The more active scanners there are, the more pages eventually pile up on the inactive list, because whenever a scanner finds the inactive list too short, it ages pages off the active list to refill it. As priority decreases, the number of pages scanned per pass grows, and it is easy to see that the number of pages actually freed ends up far above swap_cluster_max. Why? It is the work of the other active scanners: all scanning processes operate on the same LRU lists, and many processes aging together move pages much faster than one would. As a concrete picture: four processes start scanning almost all zones at the same time.
Because each zone has few inactive pages, they enter shrink_active_list almost simultaneously, and the result is that far more pages accumulate on the inactive list than intended. If at the lower priorities the inactive list is no longer found to be too short, the scan returns as usual; but by then a large number of pages, many of them frequently used, sit on the inactive list and take a long time to drain. Multiple processes put them there through repeated scans. Kernel 2.6.29 improved this situation: before it, the vmscan code could decide whether to stop only after a full priority pass over all zones, whereas the 2.6.29 code adds a check inside each zone scan. In practice, the first pass runs at the highest priority value and mostly clears the inactive list while refilling it from the active list; the real freeing for the allocation happens at the smaller priority values. This is a fairly intricate interaction, so there is no need to trace every case. For example, when one scanning process stalls on a dirty page, the delay keeps many pages from being evicted in time (even though they could be evicted immediately), so a following scanner still judges the inactive list too short and goes on to scan the active list.

There are many other similar scenarios, and there is no need to enumerate them all; it is enough to understand the problem. Adding the check to every zone scan helps avoid this kind of misjudgment, in which multiple processes scanning together reclaim far too much memory, and the thrashing caused by the needless evictions wastes a great deal of time.

Appendix: an example of the boundary between mechanism and policy

Today I was reading the vmscan threads on lkml and found that someone had proposed a patch titled "make mapped executable pages the first class citizen", which roughly means that when scanning pages, those with the PROT_EXEC attribute should be spared. At first glance this idea looks wrong: it largely duplicates mlock, it is mlock turned into a policy, and in Linux one never hard-codes into the kernel something that can be implemented in N different ways on top of an existing mechanism; in Linux philosophy, such a thing is called a policy. But there is a counter-argument, also from vmscan: why did the kernel adopt split LRU lists at all? Under the split-LRU scheme, file cache is preferentially reclaimed over anonymous pages, because the latter are more likely to be dirty, and the LRU lists are scanned with different priorities. Isn't that a policy inside the kernel? Is implementing policy in the kernel appropriate? Strictly speaking, no. But note that LRU classification is a mechanism, not a policy, even though file cache is reclaimed first by default. Does the kernel guarantee that file cache must be reclaimed first? No; that decision is left to user space. The LRU classification mechanism exists so that users can influence vmscan behavior more precisely; it makes kernel behavior more controllable and more fine-grained. The kernel itself only ships a default policy, and every other policy must be implemented from user space, via sysctl or similar mechanisms. A mechanism degenerates into a policy in two cases: first, when it is made too fine-grained; second, when it is fixed, written once and unchangeable. To remain a mechanism, a design must provide something like a fine-tuning knob.
