Non-contiguous Memory Areas


We learned in the previous post that it is best to map the memory area that stores a slab's objects onto a group of contiguous physical page frames: this makes full use of the hardware cache and yields a lower average access time.

However, that approach mainly suits kernel data structures that are used frequently, such as task_struct and inode. If requests for a memory area are infrequent, it makes sense to access non-contiguous physical page frames through a contiguous range of linear addresses rather than a contiguous range of physical addresses.

The main advantage of this scheme is that it avoids external fragmentation; the drawback is that the kernel page tables must be modified. In addition, the size of a non-contiguous memory area must be a multiple of 4096. Linux uses non-contiguous memory areas in several places: to allocate the data structures of active swap areas, to allocate space for modules, and to allocate buffers for some I/O drivers. Non-contiguous memory areas also provide yet another way to make use of high-memory page frames (see the earlier post on high-memory mappings).

1. Linear Addresses of Non-contiguous Memory Areas

To find a free range of linear addresses, we can start the search from PAGE_OFFSET (usually 0xc0000000, the start of the fourth gigabyte). Recall how the fourth gigabyte of linear addresses is used:

 

 

Recall:

(1) The beginning of this region contains the linear addresses that map the first 896 MB of RAM; the linear address corresponding to the end of the directly mapped physical memory is stored in the high_memory global variable. When physical memory is smaller than 896 MB, the 896 MB of linear addresses following 0xc0000000 correspond to it one to one. When physical memory is larger than 896 MB but smaller than 4 GB, only its first 896 MB are directly mapped to the linear space after 0xc0000000; the remaining linear addresses are dynamically remapped onto the physical memory between 896 MB and 4 GB, and this dynamic remapping is the focus of this post. When physical memory exceeds 4 GB, PAE has to be taken into account, but otherwise nothing changes, so we will not dwell on it here.

(2) The kernel page tables are rooted at the master kernel Page Global Directory, referenced by the swapper_pg_dir variable; pagetable_init() sets up the kernel page table entries.

(3) The end of the region contains the fix-mapped linear addresses, which are mainly used for linear addresses that behave like constants. For details, see the post on high-memory mappings.

(4) Starting from PKMAP_BASE we find the linear addresses used for the permanent kernel mapping of high-memory page frames. Again, see the post on high-memory mappings.

(5) The remaining linear addresses can be used for non-contiguous memory areas. A safety interval of 8 MB (the VMALLOC_OFFSET macro) is inserted between the end of the physical memory mapping and the first memory area in order to "capture" out-of-bounds memory accesses. For the same reason, further 4 KB safety intervals are inserted to separate the non-contiguous memory areas from one another.

This post discusses item (5) in detail. The starting address of the linear address space reserved for non-contiguous memory areas is defined by the VMALLOC_START macro, and the ending address by the VMALLOC_END macro.
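As a rough illustration, the following sketch shows how these macros are typically derived in the i386 Linux 2.6 headers from the end of the direct physical mapping plus the 8 MB safety gap. The exact expressions vary between kernel versions, so treat this as an assumption rather than the authoritative definition:

/* Sketch based on the i386 Linux 2.6 headers (assumption: exact
 * expressions differ slightly between kernel versions). */
#define VMALLOC_OFFSET  (8 * 1024 * 1024)   /* 8 MB safety gap */
#define VMALLOC_START   (((unsigned long)high_memory + 2 * VMALLOC_OFFSET - 1) \
                         & ~(VMALLOC_OFFSET - 1))
#ifdef CONFIG_HIGHMEM
#define VMALLOC_END     (PKMAP_BASE - 2 * PAGE_SIZE)
#else
#define VMALLOC_END     (FIXADDR_START - 2 * PAGE_SIZE)
#endif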

2. Descriptors of Non-contiguous Memory Areas

Each non-contiguous memory area corresponds to a descriptor of type vm_struct:
struct vm_struct {
    void             *addr;
    unsigned long    size;
    unsigned long    flags;
    struct page      **pages;
    unsigned int     nr_pages;
    unsigned long    phys_addr;
    struct vm_struct *next;
};

Its fields are as follows:

void *addr: linear address of the first memory cell of the area
unsigned long size: size of the area plus 4096 (the inter-area safety interval)
unsigned long flags: type of memory mapped by the non-contiguous memory area
struct page **pages: pointer to an array of nr_pages pointers to page descriptors
unsigned int nr_pages: number of page frames filling the area
unsigned long phys_addr: set to 0 unless the area was created to map the I/O shared memory of a hardware device
struct vm_struct *next: pointer to the next vm_struct structure

By means of the next field, these descriptors are inserted into a simple linked list; the address of the first element of the list is stored in the vmlist variable. Accesses to the list are protected by the vmlist_lock read/write spin lock.

The flags field identifies the type of memory mapped by the area:
(1) VM_ALLOC denotes pages obtained with vmalloc();
(2) VM_MAP denotes already allocated pages mapped with vmap();
(3) VM_IOREMAP denotes on-board memory of a hardware device mapped with ioremap().

The get_vm_area() function looks for a free range of linear addresses between VMALLOC_START and VMALLOC_END. It takes two parameters: the size in bytes of the memory area to be created, and a flag specifying the type of the area (see above). It performs the following steps (a simplified code sketch follows the list):

1. It calls kmalloc() to obtain a memory area for the new descriptor of type vm_struct.

2. It acquires the vmlist_lock lock for writing and scans the list of vm_struct descriptors looking for a free range of linear addresses that covers at least size + 4096 addresses (4096 being the safety interval between areas).

3. If such a range exists, the function initializes the descriptor's fields, releases the vmlist_lock lock, and terminates by returning the starting address of the non-contiguous memory area.

4. Otherwise, get_vm_area() releases the descriptor obtained previously, releases vmlist_lock, and returns NULL.
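The sketch below illustrates that scan. It is a simplification of the steps above, not the verbatim kernel source: alignment handling, overflow checks, and the IOREMAP special cases are omitted.

/* Simplified sketch of get_vm_area() (illustration of the steps above,
 * not verbatim kernel source). */
struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
{
    struct vm_struct **p, *tmp, *area;
    unsigned long addr = VMALLOC_START;

    area = kmalloc(sizeof(*area), GFP_KERNEL);          /* step 1 */
    if (!area)
        return NULL;
    size += PAGE_SIZE;                                   /* 4 KB safety interval */

    write_lock(&vmlist_lock);                            /* step 2 */
    for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
        if ((unsigned long)tmp->addr - addr >= size)
            break;                                       /* hole before tmp is large enough */
        addr = (unsigned long)tmp->addr + tmp->size;     /* keep scanning past tmp */
    }
    if (addr + size > VMALLOC_END) {                     /* step 4: no room left */
        write_unlock(&vmlist_lock);
        kfree(area);
        return NULL;
    }

    area->flags = flags;                                 /* step 3: fill in the descriptor */
    area->addr = (void *)addr;
    area->size = size;
    area->pages = NULL;
    area->nr_pages = 0;
    area->phys_addr = 0;
    area->next = *p;                                     /* insert, keeping vmlist sorted */
    *p = area;
    write_unlock(&vmlist_lock);
    return area;
}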

3. Allocating a Non-contiguous Memory Area

The vmalloc() function allocates a non-contiguous memory area to the kernel. Its size parameter denotes the size of the requested area. If the function can satisfy the request, it returns the starting linear address of the new area; otherwise it returns a NULL pointer (mm/vmalloc.c):

void *vmalloc(unsigned long size)
{
    struct vm_struct *area;
    struct page **pages;
    unsigned int array_size, i;

    size = (size + PAGE_SIZE - 1) & PAGE_MASK;
    area = get_vm_area(size, VM_ALLOC);
    if (!area)
        return NULL;
    area->nr_pages = size >> PAGE_SHIFT;
    array_size = (area->nr_pages * sizeof(struct page *));
    area->pages = pages = kmalloc(array_size, GFP_KERNEL);
    if (!area->pages) {
        remove_vm_area(area->addr);
        kfree(area);
        return NULL;
    }
    memset(area->pages, 0, array_size);
    for (i = 0; i < area->nr_pages; i++) {
        area->pages[i] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
        if (!area->pages[i]) {
            area->nr_pages = i;
fail:       vfree(area->addr);
            return NULL;
        }
    }
    if (map_vm_area(area, __pgprot(0x63), &pages))
        goto fail;
    return area->addr;
}

The function begins by rounding up the size parameter to a multiple of 4096 (the page size). Then vmalloc() invokes get_vm_area(), which creates a new descriptor and returns the linear addresses assigned to the area; the descriptor's flags field is initialized with the VM_ALLOC flag, which means that non-contiguous page frames will be mapped onto this linear address range by means of vmalloc(). Next, vmalloc() invokes kmalloc() to request a group of contiguous page frames large enough to hold an array of page descriptor pointers, and calls memset() to set all these pointers to NULL. It then repeatedly invokes alloc_page(), once for each of the nr_pages pages in the range, allocating one page frame per page and storing the address of the corresponding page descriptor in the area->pages array. Notice that the area->pages array is necessary because the page frames may belong to the ZONE_HIGHMEM memory zone and therefore are not necessarily mapped to a linear address yet.
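For context, here is a minimal, hypothetical usage sketch (the names example_init/example_exit and the 4 MB size are assumptions for illustration): the buffer returned by vmalloc() is virtually contiguous even though it may be backed by scattered page frames.

/* Hypothetical usage sketch: allocate a large, virtually contiguous
 * buffer that may be backed by non-contiguous page frames. */
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/errno.h>

static void *big_table;

static int example_init(void)
{
    big_table = vmalloc(4 * 1024 * 1024);   /* 4 MB, rounded up to the page size */
    if (!big_table)
        return -ENOMEM;
    memset(big_table, 0, 4 * 1024 * 1024);  /* safe: linear addresses are contiguous */
    return 0;
}

static void example_exit(void)
{
    vfree(big_table);                        /* releases both mappings and page frames */
}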

 

As an aside, this is the low-level i386 function behind memset(area->pages, 0, array_size):
static inline void *__memset_generic(void *s, char c, size_t count)
{
    int d0, d1;
    __asm__ __volatile__(
        "rep\n\t"
        "stosb"
        : "=&c" (d0), "=&D" (d1)
        : "a" (c), "1" (s), "0" (count)
        : "memory");
    return s;
}

Now comes the tricky part. Up to this point, a fresh range of contiguous linear addresses has been obtained, and a group of non-contiguous page frames has been allocated to map them. The last crucial step is to modify the page table entries used by the kernel so that each page frame allocated to the non-contiguous memory area corresponds to a linear address inside the contiguous linear address range produced by vmalloc(). This is the job of map_vm_area(), described in detail below:

 

The map_vm_area() function takes three parameters:

area: pointer to the vm_struct descriptor of the area.
prot: protection bits of the allocated page frames. It is always set to 0x63, which corresponds to Present, Accessed, Read/Write, and Dirty (see the short breakdown after this list).
pages: address of a variable pointing to an array of pointers to page descriptors (hence struct page *** is used as the data type!).
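As a check on the 0x63 value, here is a worked illustration (not kernel source; the SKETCH_ names are hypothetical) using the standard x86 page table flag positions:

/* Worked illustration of prot == 0x63 on i386. */
#define SKETCH_PRESENT  (1 << 0)   /* 0x01 */
#define SKETCH_RW       (1 << 1)   /* 0x02 */
#define SKETCH_ACCESSED (1 << 5)   /* 0x20 */
#define SKETCH_DIRTY    (1 << 6)   /* 0x40 */
/* 0x01 + 0x02 + 0x20 + 0x40 == 0x63 */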

The function starts by assigning the starting and ending linear addresses of the area to the local variables address and end, respectively:
address = area->addr;
end = address + (area->size - PAGE_SIZE);

Remember that area->size stores the actual size of the area plus the 4 KB inter-area safety interval. The function then uses the pgd_offset_k macro to derive the entry in the master kernel Page Global Directory that corresponds to the linear address at the start of the area, and acquires the kernel page table spin lock:
pgd = pgd_offset_k(address);
spin_lock(&init_mm.page_table_lock);

 

Then, the function executes the following loop:
int ret = 0;
for (i = pgd_index(address); i < pgd_index(end-1); i++) {
    pud_t *pud = pud_alloc(&init_mm, pgd, address);
    ret = -ENOMEM;
    if (!pud)
        break;
    next = (address + PGDIR_SIZE) & PGDIR_MASK;
    if (next < address || next > end)
        next = end;
    if (map_area_pud(pud, address, next, prot, pages))
        break;
    address = next;
    pgd++;
    ret = 0;
}
spin_unlock(&init_mm.page_table_lock);
flush_cache_vmap((unsigned long)area->addr, end);
return ret;

Each cycle first calls pud_alloc() to create a Page Upper Directory for the new area and writes its physical address into the appropriate entry of the master kernel Page Global Directory. It then calls map_area_pud() to allocate all the page tables associated with the new Page Upper Directory. Next, it adds the constant 2^30 (2^22 if PAE is not enabled) to the current value of address (2^30 being the size of the range of linear addresses spanned by a Page Upper Directory), and it advances the pgd pointer into the Page Global Directory.

 

The loop terminates when all the page table entries referring to the non-contiguous memory area have been set up.

The map_area_pud() function executes a similar cycle for all the page tables that a Page Upper Directory points to:
do {
    pmd_t *pmd = pmd_alloc(&init_mm, pud, address);
    if (!pmd)
        return -ENOMEM;
    if (map_area_pmd(pmd, address, end - address, prot, pages))
        return -ENOMEM;
    address = (address + PUD_SIZE) & PUD_MASK;
    pud++;
} while (address < end);

The map_area_pmd() function executes a similar cycle for all the page tables that a Page Middle Directory points to:
do {
    pte_t *pte = pte_alloc_kernel(&init_mm, pmd, address);
    if (!pte)
        return -ENOMEM;
    if (map_area_pte(pte, address, end - address, prot, pages))
        return -ENOMEM;
    address = (address + PMD_SIZE) & PMD_MASK;
    pmd++;
} while (address < end);

The pte_alloc_kernel() function allocates a new page table and updates the corresponding entry in the Page Middle Directory. Next, map_area_pte() allocates all the page frames corresponding to the entries in the page table. The value of address is increased by 2^22 (the size of the range of linear addresses spanned by a single page table), and the cycle is repeated.

The main cycle of map_area_pte () is:
do {
    struct page *page = **pages;
    set_pte(pte, mk_pte(page, prot));
    address += PAGE_SIZE;
    pte++;
    (*pages)++;
} while (address < end);

The page descriptor address page of the page frame to be mapped is read from the array entry pointed to by the variable at address pages. The physical address of the new page frame is written into the page table by the set_pte and mk_pte macros. The cycle is repeated after adding the constant 4096 (the length of a page frame) to address.

 

Notice that map_vm_area() does not touch the page tables of the current process. Therefore, a Page Fault occurs when a process in Kernel Mode accesses the non-contiguous memory area, because the entries in the process's page tables corresponding to the area are null. However, the Page Fault handler checks the faulting linear address against the master kernel page tables (that is, init_mm.pgd and its child page tables). Once the handler discovers that a master kernel page table includes a non-null entry for that address, it copies the value into the corresponding process page table entry and resumes normal execution of the process. This mechanism will be described in the post on the Page Fault exception handler.
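A simplified sketch of that fix-up follows. It is modeled on the vmalloc case of the i386 Page Fault handler, but the function name and the level at which the copy happens are assumptions for illustration (on i386 the real handler actually synchronizes at the Page Middle Directory level):

/* Simplified, hypothetical sketch of the vmalloc fault fix-up
 * (assumption: modeled on the i386 handler, not verbatim). */
static int sketch_vmalloc_fault(unsigned long address)
{
    pgd_t *pgd, *pgd_k;

    pgd   = pgd_offset(current->active_mm, address);  /* process page tables */
    pgd_k = pgd_offset_k(address);                     /* master kernel page tables */

    if (pgd_none(*pgd_k))
        return -1;            /* no master entry: this is a genuine error */

    set_pgd(pgd, *pgd_k);     /* copy the entry and let the process resume */
    return 0;
}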

Besides the vmalloc() function, a non-contiguous memory area can be allocated by the vmalloc_32() function, which is similar to vmalloc() but only allocates page frames from the ZONE_NORMAL and ZONE_DMA memory zones.

Linux 2.6 also features a vmap() function, which maps page frames that have already been allocated into a non-contiguous memory area: essentially, this function receives as a parameter an array of pointers to page descriptors, invokes get_vm_area() to obtain a new vm_struct descriptor, and then invokes map_vm_area() to map the page frames. The function is therefore similar to vmalloc(), but it does not allocate page frames.
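A hedged usage sketch of vmap() follows; the surrounding helper and the page count are hypothetical, and the Linux 2.6 prototype assumed here is vmap(pages, count, flags, prot):

/* Hypothetical sketch: map 16 independently allocated page frames into
 * one contiguous kernel virtual range with vmap(). */
#include <linux/vmalloc.h>
#include <linux/mm.h>

#define NPAGES 16

static void *map_scattered_pages(struct page **pages /* NPAGES entries */)
{
    /* The pages were obtained elsewhere, e.g. with alloc_page(GFP_KERNEL). */
    return vmap(pages, NPAGES, VM_MAP, PAGE_KERNEL);
}

/* The mapping (but not the page frames) is later released with vunmap(addr). */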

 

4. Releasing a Non-contiguous Memory Area

The vfree() function releases non-contiguous memory areas created by vmalloc() or vmalloc_32(), while the vunmap() function releases memory areas created by vmap(). Both functions take one parameter, the starting linear address of the area to be released; both rely on the __vunmap() function to do the substantial work.

The __vunmap() function receives two parameters: addr, the starting address of the memory area to be released, and the deallocate_pages flag, which is set if the page frames mapped onto the area should be released to the zoned page frame allocator (the vfree() case), and cleared otherwise (the vunmap() case). The function performs the following operations (a sketch putting them together follows the list):
1. It invokes remove_vm_area() to get the address of the vm_struct descriptor of the area and to clear the kernel page table entries corresponding to the linear addresses of the non-contiguous memory area.
2. If deallocate_pages is set, it scans the area->pages array of pointers to page descriptors; for each element of the array it invokes __free_page() to release the page frame to the zoned page frame allocator. Moreover, it executes kfree(area->pages) to release the array itself.
3. It invokes kfree(area) to release the vm_struct descriptor.
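Put together, the operations amount to the following sketch. It is simplified from mm/vmalloc.c, not verbatim: sanity checks, warnings, and the large-array case are omitted.

/* Simplified sketch of __vunmap() (not verbatim kernel source). */
void __vunmap(void *addr, int deallocate_pages)
{
    struct vm_struct *area;
    int i;

    if (!addr)
        return;
    area = remove_vm_area(addr);           /* step 1: unlink and clear kernel PTEs */
    if (!area)
        return;
    if (deallocate_pages) {                /* step 2: vfree() case only */
        for (i = 0; i < area->nr_pages; i++)
            __free_page(area->pages[i]);
        kfree(area->pages);
    }
    kfree(area);                           /* step 3: release the descriptor */
}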

 

The remove_vm_area () function executes the following loop:
write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p); p = &tmp->next) {
    if (tmp->addr == addr) {
        unmap_vm_area(tmp);
        *p = tmp->next;
        break;
    }
}
write_unlock(&vmlist_lock);
return tmp;

The area itself is released by calling unmap_vm_area(). This function takes a single parameter, namely a pointer area to the vm_struct descriptor of the area. It executes the following loop to reverse the work done by map_vm_area():
address = area->addr;
end = address + area->size;
pgd = pgd_offset_k(address);
for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
    next = (address + PGDIR_SIZE) & PGDIR_MASK;
    if (next <= address || next > end)
        next = end;
    unmap_area_pud(pgd, address, next - address);
    address = next;
    pgd++;
}

unmap_area_pud(), in turn, executes the reverse operation of map_area_pud() in its loop:
do {
    unmap_area_pmd(pud, address, end - address);
    address = (address + PUD_SIZE) & PUD_MASK;
    pud++;
} while (address && (address < end));

The unmap_area_pmd() function executes the reverse operation of map_area_pmd() in its loop:
do {
    unmap_area_pte(pmd, address, end - address);
    address = (address + PMD_SIZE) & PMD_MASK;
    pmd++;
} while (address < end);

Finally, unmap_area_pte() reverses the work of map_area_pte() in its loop:
do {
    pte_t page = ptep_get_and_clear(pte);
    address += PAGE_SIZE;
    pte++;
    if (!pte_none(page) && !pte_present(page))
        printk("Whee... Swapped out page in kernel page table\n");
} while (address < end);

In each cycle, the ptep_get_and_clear macro sets the page table entry pointed to by pte to 0.

As with vmalloc(), the kernel modifies the entries of the master kernel Page Global Directory and of its child page tables, but the entries of the process page tables that map the fourth gigabyte are left unchanged. This is reasonable, because the kernel never reclaims Page Upper Directories, Page Middle Directories, or page tables rooted at the master kernel Page Global Directory.

For example, suppose a process in Kernel Mode accessed a non-contiguous memory area that is later released. Thanks to the mechanism described in the post on the Page Fault exception handler, the process's Page Global Directory entries are equal to the corresponding entries of the master kernel Page Global Directory, so they point to the same Page Upper Directories, Page Middle Directories, and page tables. The unmap_area_pte() function only clears the entries in the page tables (the page tables themselves are not reclaimed). Further accesses by the process to the released non-contiguous memory area will trigger a Page Fault because the page table entries are now null; the Page Fault handler, however, will treat such an access as an error, because the master kernel page tables no longer include valid entries for those addresses.
