Linux Buddy System (III) -- Allocation



In the previous parts I introduced the principles of the buddy system and the data structures the Linux buddy system uses. Now let's see how the buddy system actually allocates pages. The core allocation algorithm is not complex in itself, but the allocator also has to minimize fragmentation (which involves the migration-type mechanism) and, when memory runs short, take a series of increasingly aggressive measures, so a complete analysis of the kernel's page-allocation functions is fairly large. Here we only follow the most common allocation path; the other cases will be discussed separately later.

We start from __alloc_pages_nodemask(). All of the page-allocation functions eventually end up in this function, which is the entry point of the buddy system.

[cpp]
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
            struct zonelist *zonelist, nodemask_t *nodemask)
{
    /* determine the highest usable zone for this allocation from gfp_mask */
    enum zone_type high_zoneidx = gfp_zone(gfp_mask);
    struct zone *preferred_zone;
    struct page *page;
    /* determine the migrate type of the allocation from gfp_mask */
    int migratetype = allocflags_to_migratetype(gfp_mask);

    gfp_mask &= gfp_allowed_mask;

    lockdep_trace_alloc(gfp_mask);

    might_sleep_if(gfp_mask & __GFP_WAIT);

    if (should_fail_alloc_page(gfp_mask, order))
        return NULL;

    /*
     * Check the zones suitable for the gfp_mask contain at least one
     * valid zone. It's possible to have an empty zonelist as a result
     * of GFP_THISNODE and a memoryless node
     */
    if (unlikely(!zonelist->_zonerefs->zone))
        return NULL;

    /* The preferred zone is used for statistics later */
    /* find the first zone in the zonelist whose index does not exceed
       high_zoneidx, i.e. the preferred zone determined above */
    first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
    if (!preferred_zone)
        return NULL;

    /* First allocation attempt */
    page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
            zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
            preferred_zone, migratetype);
    if (unlikely(!page))
        /* if the first attempt fails, fall back to the slow path, which may
           wake up kswapd and take other, more aggressive measures */
        page = __alloc_pages_slowpath(gfp_mask, order,
                zonelist, high_zoneidx, nodemask,
                preferred_zone, migratetype);

    trace_mm_page_alloc(page, order, gfp_mask, migratetype);
    return page;
}
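To make the role of high_zoneidx concrete, here is a small userspace model (not the kernel's implementation) of the decision gfp_zone() makes: which is the highest zone the allocation may come from. The real kernel derives this from the __GFP_DMA/__GFP_DMA32/__GFP_HIGHMEM/__GFP_MOVABLE bits; the MODEL_GFP_* flag values below are made up for illustration only.

[cpp]
/* simplified userspace model of the gfp_zone() decision; flag values are
 * hypothetical, only the mapping idea matches the kernel */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define MODEL_GFP_DMA     0x01u   /* hypothetical flag values */
#define MODEL_GFP_HIGHMEM 0x02u

static enum zone_type model_gfp_zone(unsigned int gfp_mask)
{
    if (gfp_mask & MODEL_GFP_DMA)
        return ZONE_DMA;       /* e.g. buffers for old ISA DMA devices */
    if (gfp_mask & MODEL_GFP_HIGHMEM)
        return ZONE_HIGHMEM;   /* e.g. user-space / page-cache pages */
    return ZONE_NORMAL;        /* default, e.g. a GFP_KERNEL request */
}

int main(void)
{
    printf("kernel-style request  -> highest zone %d\n", model_gfp_zone(0));
    printf("highmem-style request -> highest zone %d\n",
           model_gfp_zone(MODEL_GFP_HIGHMEM));
    return 0;
}

The allocator will never go above the zone returned here, but it may fall back to lower zones, as the traversal in get_page_from_freelist() below shows.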
The first thing done is to determine the highest zone the allocation may use; its index is saved in high_zoneidx. Then the first allocation attempt is made. The procedure is: scan the zonelist starting from the preferred zone --> find a zone with enough free memory --> allocate from the free list of the requested migrate type --> if that migrate type has nothing suitable, fall back to the other migrate types. If none of the zones can satisfy the request, memory really is short, and the slow path is entered, which may, among other things, try to swap out rarely used pages; the kernel is much more aggressive there, and the details involve other complications. Let's now analyze get_page_from_freelist():

[cpp]
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
        struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
        struct zone *preferred_zone, int migratetype)
{
    struct zoneref *z;
    struct page *page = NULL;
    int classzone_idx;
    struct zone *zone;
    nodemask_t *allowednodes = NULL;   /* zonelist_cache approximation */
    int zlc_active = 0;                /* set if using zonelist_cache */
    int did_zlc_setup = 0;             /* just call zlc_setup() one time */

    /* get the index of the preferred zone */
    classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
     */
    /* traverse the zones starting from the preferred one until a zone with
       enough free memory is found. For example, if high_zoneidx corresponds
       to ZONE_HIGHMEM the traversal order is HIGHMEM --> NORMAL --> DMA;
       if high_zoneidx corresponds to ZONE_NORMAL, the order is NORMAL --> DMA */
    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    high_zoneidx, nodemask) {
        if (NUMA_BUILD && zlc_active &&
            !zlc_zone_worth_trying(zonelist, z, allowednodes))
                continue;
        /* check whether this zone belongs to the cpuset the process may use */
        if ((alloc_flags & ALLOC_CPUSET) &&
            !cpuset_zone_allowed_softwall(zone, gfp_mask))
                goto try_next_zone;

        BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
        if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
            unsigned long mark;
            int ret;

            /* alloc_flags selects which watermark to use: pages_min,
               pages_low or pages_high. The zone's free space must stay
               above the selected watermark for the allocation to proceed */
            mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
            /* if the zone is above the watermark, allocate from it */
            if (zone_watermark_ok(zone, order, mark,
                        classzone_idx, alloc_flags))
                goto try_this_zone;

            if (zone_reclaim_mode == 0)
                goto this_zone_full;

            /* on NUMA, try to reclaim pages within this zone */
            ret = zone_reclaim(zone, gfp_mask, order);
            switch (ret) {
            case ZONE_RECLAIM_NOSCAN:
                /* did not scan */
                goto try_next_zone;
            case ZONE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                goto this_zone_full;
            default:
                /* did we reclaim enough */
                if (!zone_watermark_ok(zone, order, mark,
                            classzone_idx, alloc_flags))
                    goto this_zone_full;
            }
        }

try_this_zone:
        /* allocate 2^order pages from this zone */
        page = buffered_rmqueue(preferred_zone, zone, order,
                                gfp_mask, migratetype);
        if (page)
            break;
this_zone_full:
        if (NUMA_BUILD)
            zlc_mark_zone_full(zonelist, z);
try_next_zone:
        if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
            /*
             * we do zlc_setup after the first zone is tried but only
             * if there are multiple nodes make it worthwhile
             */
            allowednodes = zlc_setup(zonelist, alloc_flags);
            zlc_active = 1;
            did_zlc_setup = 1;
        }
    }

    if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
        /* Disable zlc cache for second zonelist scan */
        zlc_active = 0;
        goto zonelist_scan;
    }
    return page;
}

Starting from the preferred zone, the zones are traversed in the order defined by the zonelist. If a zone's watermark is fine, buffered_rmqueue() is called to allocate from that zone; if the watermark is too low, then on NUMA systems zone_reclaim() may first try to reclaim some memory inside the zone.
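What "enough free memory" means deserves a closer look. The following is a simplified userspace model of the check zone_watermark_ok() performs (the lowmem_reserve and ALLOC_HIGH/ALLOC_HARDER adjustments of the real kernel are left out): the request must leave the zone above the watermark, and for every order below the requested one the test is repeated against a halved watermark, counting only the blocks that are still large enough.

[cpp]
/* simplified userspace model of the zone_watermark_ok() check */
#include <stdio.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11

/* nr_free[o] = number of free blocks of size 2^o in the modelled zone */
static bool watermark_ok(const unsigned long nr_free[MODEL_MAX_ORDER],
                         unsigned int order, long mark)
{
    long free_pages = 0;
    long min = mark;

    for (unsigned int o = 0; o < MODEL_MAX_ORDER; o++)
        free_pages += (long)nr_free[o] << o;

    free_pages -= 1L << order;    /* assume the request succeeds */
    if (free_pages <= min)
        return false;

    for (unsigned int o = 0; o < order; o++) {
        /* blocks of this order cannot serve a larger request */
        free_pages -= (long)nr_free[o] << o;
        min >>= 1;                /* require fewer, but larger, free blocks */
        if (free_pages <= min)
            return false;
    }
    return true;
}

int main(void)
{
    /* plenty of single pages but very few large blocks */
    unsigned long nr_free[MODEL_MAX_ORDER] = { 512, 8, 2, 1 };

    printf("order 0: %s\n", watermark_ok(nr_free, 0, 128) ? "ok" : "low");
    printf("order 3: %s\n", watermark_ok(nr_free, 3, 128) ? "ok" : "low");
    return 0;
}

The example shows why a fragmented zone can fail an order-3 request even though it passes the same watermark for order 0: the free pages exist, but not as large enough blocks.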
Next, let's look at buffered_rmqueue():

[cpp]
static inline
struct page *buffered_rmqueue(struct zone *preferred_zone,
            struct zone *zone, int order, gfp_t gfp_flags,
            int migratetype)
{
    unsigned long flags;
    struct page *page;
    int cold = !!(gfp_flags & __GFP_COLD);
    int cpu;

again:
    cpu = get_cpu();
    if (likely(order == 0)) {
        /* order is 0, i.e. a single page frame is requested */
        struct per_cpu_pages *pcp;
        struct list_head *list;

        /* get the pcp structure of the local CPU */
        pcp = &zone_pcp(zone, cpu)->pcp;
        /* get the list for the requested migrate type */
        list = &pcp->lists[migratetype];
        local_irq_save(flags);
        /* if the list is empty there is nothing to hand out; refill it with
           pcp->batch pages taken from the buddy system */
        if (list_empty(list)) {
            pcp->count += rmqueue_bulk(zone, 0,
                    pcp->batch, list,
                    migratetype, cold);
            if (unlikely(list_empty(list)))
                goto failed;
        }

        if (cold)
            /* a cold page is taken from the tail of the list */
            page = list_entry(list->prev, struct page, lru);
        else
            /* a hot page is taken from the head of the list */
            page = list_entry(list->next, struct page, lru);

        list_del(&page->lru);
        pcp->count--;
    } else {
        if (unlikely(gfp_flags & __GFP_NOFAIL)) {
            /*
             * __GFP_NOFAIL is not to be used in new code.
             *
             * All __GFP_NOFAIL callers should be fixed so that they
             * properly detect and handle allocation failures.
             *
             * We most definitely don't want callers attempting to
             * allocate greater than order-1 page units with
             * __GFP_NOFAIL.
             */
            WARN_ON_ONCE(order > 1);
        }
        spin_lock_irqsave(&zone->lock, flags);
        /* pick a suitable block from the zone's buddy system */
        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);
        if (!page)
            goto failed;
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
    }

    __count_zone_vm_events(PGALLOC, zone, 1 << order);
    zone_statistics(preferred_zone, zone);
    local_irq_restore(flags);
    put_cpu();

    VM_BUG_ON(bad_range(zone, page));
    if (prep_new_page(page, order, gfp_flags))
        goto again;
    return page;

failed:
    local_irq_restore(flags);
    put_cpu();
    return NULL;
}

This function handles two cases: allocating a single page frame, and allocating 2^order contiguous page frames. For a single page, the kernel allocates from the per-CPU page cache. Its core structure is again an array of MIGRATE_TYPES lists, but the elements on those lists are individual pages, which are divided into hot and cold pages. A hot page is one that is probably still in the CPU's hardware cache; a cold page is not. For single-page requests, handing out a hot page improves efficiency.
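The hot/cold ordering can be made concrete with a small userspace sketch (not kernel code) of such a per-CPU single-page list: freed pages go to the head, hot requests take from the head, cold requests take from the tail. The pcp_page/pcp_list types are illustrative stand-ins for the kernel's per_cpu_pages lists, and refilling from the buddy system is left out.

[cpp]
/* userspace sketch of the per-CPU hot/cold single-page list */
#include <stdio.h>
#include <stddef.h>

struct pcp_page {
    int id;
    struct pcp_page *prev, *next;
};

struct pcp_list {
    struct pcp_page *head, *tail;
    int count;
};

/* a free puts the page at the head: it was just used, so it is hot */
static void pcp_free(struct pcp_list *l, struct pcp_page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head)
        l->head->prev = p;
    else
        l->tail = p;
    l->head = p;
    l->count++;
}

/* hot requests take from the head, cold requests from the tail */
static struct pcp_page *pcp_alloc(struct pcp_list *l, int cold)
{
    struct pcp_page *p = cold ? l->tail : l->head;
    if (!p)
        return NULL;   /* the real kernel would refill from the buddy system */
    if (p->prev)
        p->prev->next = p->next;
    else
        l->head = p->next;
    if (p->next)
        p->next->prev = p->prev;
    else
        l->tail = p->prev;
    l->count--;
    return p;
}

int main(void)
{
    struct pcp_list l = { NULL, NULL, 0 };
    struct pcp_page pages[3] = { { 1 }, { 2 }, { 3 } };

    for (int i = 0; i < 3; i++)
        pcp_free(&l, &pages[i]);   /* page 3 is freed last -> hottest */

    printf("hot alloc  -> page %d\n", pcp_alloc(&l, 0)->id);   /* prints 3 */
    printf("cold alloc -> page %d\n", pcp_alloc(&l, 1)->id);   /* prints 1 */
    return 0;
}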
Note that pages nearer the head of the list are hotter and pages nearer the tail are colder, because whenever a single page frame is freed it is inserted at the head of the list. In other words, the pages near the head were freed most recently and are therefore the most likely to still be in the CPU cache. To allocate a block of contiguous page frames, __rmqueue() is called:

[cpp]
static struct page *__rmqueue(struct zone *zone, unsigned int order,
                        int migratetype)
{
    struct page *page;

retry_reserve:
    page = __rmqueue_smallest(zone, order, migratetype);

    /* if the allocation failed and the migrate type is not MIGRATE_RESERVE
       (if it is MIGRATE_RESERVE, there are no other migrate types left to
       try), fall back to the other migrate-type lists */
    if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
        page = __rmqueue_fallback(zone, order, migratetype);

        /*
         * Use MIGRATE_RESERVE rather than fail an allocation. goto
         * is used because __rmqueue_smallest is an inline function
         * and we want just one call site
         */
        if (!page) {
            migratetype = MIGRATE_RESERVE;
            goto retry_reserve;
        }
    }

    trace_mm_page_alloc_zone_locked(page, order, migratetype);
    return page;
}

First, __rmqueue_smallest() is called to allocate a block from the requested migrate type; this function is the core algorithm of the buddy system. If it fails, the requested migrate type does not have enough memory, and the other migrate-type lists are searched in the order defined by the fallbacks array. That work is done by __rmqueue_fallback(), which is more involved and embodies the idea of using migrate types to avoid fragmentation.

[cpp]
static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        /* get the free_area for the current order */
        area = &(zone->free_area[current_order]);
        /* if the free_list of this migrate type is empty, try the next order */
        if (list_empty(&area->free_list[migratetype]))
            continue;

        /* take the first block that satisfies the request */
        page = list_entry(area->free_list[migratetype].next,
                            struct page, lru);
        list_del(&page->lru);
        rmv_page_order(page);   /* clear the order stored in page->private */
        area->nr_free--;        /* one block fewer at this order */
        /* if current_order > order, split the block and put the unused
           halves back on the lower-order free lists */
        expand(zone, page, order, current_order, area, migratetype);
        return page;
    }

    return NULL;
}

[cpp]
static inline void expand(struct zone *zone, struct page *page,
    int low, int high, struct free_area *area,
    int migratetype)
{
    unsigned long size = 1
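The article's listing of expand() breaks off above, so as a complement here is a self-contained userspace simulation of the step that __rmqueue_smallest() and expand() perform together: search upward for the smallest free block of at least the requested order, remove it from its list, and return the unused upper halves to the lower-order lists. All names here (free_block, buddy_zone, alloc_block) are illustrative, not kernel structures.

[cpp]
/* userspace simulation of the buddy "find smallest and split" step */
#include <stdio.h>
#include <stdlib.h>

#define SIM_MAX_ORDER 6

struct free_block {
    unsigned long pfn;            /* first page frame number of the block */
    struct free_block *next;
};

struct buddy_zone {
    struct free_block *free_list[SIM_MAX_ORDER];   /* one list per order */
};

static void add_block(struct buddy_zone *z, unsigned int order, unsigned long pfn)
{
    struct free_block *b = malloc(sizeof(*b));
    b->pfn = pfn;
    b->next = z->free_list[order];
    z->free_list[order] = b;
}

/* allocate 2^order pages; returns the starting pfn or -1 on failure */
static long alloc_block(struct buddy_zone *z, unsigned int order)
{
    for (unsigned int cur = order; cur < SIM_MAX_ORDER; cur++) {
        struct free_block *b = z->free_list[cur];
        if (!b)
            continue;                  /* nothing of this size, go bigger */

        z->free_list[cur] = b->next;   /* take the block off its list */
        unsigned long pfn = b->pfn;
        free(b);

        /* split: give the upper half back at each lower order (expand()) */
        while (cur > order) {
            cur--;
            add_block(z, cur, pfn + (1UL << cur));
        }
        return (long)pfn;
    }
    return -1;                         /* no block large enough */
}

int main(void)
{
    struct buddy_zone zone = { { NULL } };

    add_block(&zone, 5, 0);            /* one free block of 2^5 = 32 pages */

    printf("order-2 block starts at pfn %ld\n", alloc_block(&zone, 2));
    printf("order-2 block starts at pfn %ld\n", alloc_block(&zone, 2));
    return 0;
}

The first order-2 request splits the single 32-page block and leaves blocks of 16, 8 and 4 pages on the lower-order lists; the second request is then satisfied directly from the order-2 list without any further splitting.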
