Authoritative explanation of high-end memory

Source: Internet
Author: User

Note: This is the most detailed and clearest explanation of high-end memory I have seen; other posts dispose of the topic in a few words. I am saving it here for my own convenience and for others. Thanks to the original author!

Address: http://bbs.chinaunix.net/thread-1938084-1-1.html



Note: The physical address space mentioned in this article can be understood as physical memory, but in some cases, it is wrong to regard it as physical memory.

The environment discussed in this article is the non-PAE i386 platform, kernel version 2.6.31-14.

1. What is high-end memory?

In Linux, the kernel uses the linear address space from 3 GB to 4 GB, so only 1 GB of linear address space is available for mapping physical address space. What if there is more than 1 GB of memory? Is the memory beyond 1 GB unusable? To solve this, the kernel introduces the concept of high-end memory (high memory) and divides the 1 GB linear address space into two parts. Physical address space below 896 MB is called low-end memory; its physical addresses correspond to the linear addresses starting at 3 GB, that is, the kernel linear range 3 GB to (3 GB + 896 MB) maps the physical range 0 to 896 MB. The remaining 128 MB of linear space is used to map the physical address space above 896 MB, which is what we usually call the high-end memory zone.
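The fixed offset between low-end physical addresses and kernel linear addresses can be sketched in user-space C as follows. This is only an illustration that assumes the default 3 GB / 1 GB split described above; the helper names `is_low_phys`, `low_phys_to_virt` and `low_virt_to_phys` are hypothetical and are not kernel APIs (the kernel's own macros for this conversion are `__va()` and `__pa()`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_BASE  0xC0000000u            /* 3 GB: start of the kernel linear space */
#define LOWMEM_LIMIT (896u * 1024 * 1024)   /* 896 MB: size of the direct mapping */

/* True if a physical address lies in low-end memory (directly mapped). */
static inline bool is_low_phys(uint32_t paddr)
{
    return paddr < LOWMEM_LIMIT;
}

/* Low-end physical address -> kernel linear address (fixed offset). */
static inline uint32_t low_phys_to_virt(uint32_t paddr)
{
    assert(is_low_phys(paddr));
    return paddr + KERNEL_BASE;
}

/* Kernel linear address inside the direct mapping -> physical address. */
static inline uint32_t low_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= KERNEL_BASE && vaddr - KERNEL_BASE < LOWMEM_LIMIT);
    return vaddr - KERNEL_BASE;
}
```

A physical address at or above 896 MB has no such fixed linear address, which is why it needs one of the dynamic mapping mechanisms.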

High-end memory mapping means using a linear address to access a page in high-end memory. How should we understand this? Once paging is enabled, every access to a physical memory address must be translated by the MMU: the high 10 bits (bits 22-31) of the 32-bit address vaddr select the page directory entry, bits 12-21 select the page table entry, and bits 0-11 are added to the starting physical address of the page to produce paddr. The paddr is then placed on the front-side bus, and the physical memory corresponding to vaddr can be accessed. For low-end memory, such a mapping is set up for every physical page during system initialization. For high-end memory no such mapping exists (the page directory entries and page tables are empty), so a set of functions must be provided to create it after initialization. This is what high-end memory mapping means. Why not set up mappings for all memory during initialization? The main reason is that the kernel linear address space cannot hold the whole physical address space (1 GB of kernel linear address space versus up to 4 GB of physical address space), so part of the linear address space (128 MB) is reserved to map the rest of the physical address space dynamically. That is where high-end memory mapping comes from.
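The two-level translation described above can be illustrated with a small user-space sketch. It assumes the non-PAE i386 layout (10 + 10 + 12 bits); the function names are hypothetical and only model the arithmetic the MMU performs, not any kernel interface.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12      /* bits 0-11: offset inside the page */
#define PGDIR_SHIFT    22      /* bits 22-31: page directory index */
#define PTRS_PER_TABLE 1024u   /* entries per page directory / page table */

/* Index of the page directory entry selected by vaddr (high 10 bits). */
static inline uint32_t pgd_index_of(uint32_t vaddr)
{
    return vaddr >> PGDIR_SHIFT;
}

/* Index of the page table entry selected by vaddr (bits 12-21). */
static inline uint32_t pte_index_of(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Offset of vaddr inside its page (bits 0-11). */
static inline uint32_t page_offset_of(uint32_t vaddr)
{
    return vaddr & ((1u << PAGE_SHIFT) - 1);
}

/* Final step: page frame base (taken from the pte) plus the page offset. */
static inline uint32_t make_paddr(uint32_t frame_base, uint32_t vaddr)
{
    assert(page_offset_of(frame_base) == 0);   /* frame base is page aligned */
    return frame_base + page_offset_of(vaddr);
}
```

For example, the kernel address 0xC0101234 selects page directory entry 768 and page table entry 257, with offset 0x234 inside the page.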

2. How to manage high-end memory in the kernel


The figure above shows how the kernel uses the linear address space from 3 GB to 4 GB. First, what is high_memory?

The following code is used in arch/x86/mm/init_32.c:

    #ifdef CONFIG_HIGHMEM
        highstart_pfn = highend_pfn = max_pfn;
        if (max_pfn > max_low_pfn)
            highstart_pfn = max_low_pfn;
        e820_register_active_regions(0, 0, highend_pfn);
        sparse_memory_present_with_active_regions(0);
        printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
               pages_to_mb(highend_pfn - highstart_pfn));
        num_physpages = highend_pfn;
        high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1;
    #else
        e820_register_active_regions(0, 0, max_low_pfn);
        sparse_memory_present_with_active_regions(0);
        num_physpages = max_low_pfn;
        high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1;
    #endif

high_memory is the virtual address corresponding to the upper limit of the directly mapped physical memory. When the memory size is smaller than 896 MB, high_memory = (void *) __va(max_low_pfn * PAGE_SIZE), where max_low_pfn is the last page frame number of the memory, so high_memory = 0xc0000000 + physical memory size. When the memory is larger than 896 MB, highstart_pfn = max_low_pfn; in this case max_low_pfn is no longer the last page frame number of physical memory but the page frame number at 896 MB, so high_memory = 0xc0000000 + 896 MB. In short, high_memory never exceeds 0xc0000000 + 896 MB.

Because we are discussing the case where physical memory is larger than 896 MB, high_memory is actually 0xc0000000 + 896 MB. The 128 MB starting at high_memory (4 GB - high_memory) is used to map the remaining memory above 896 MB. Of course, this 128 MB can also be used to map device memory (MMIO).
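As a rough check of these numbers, the following user-space sketch computes high_memory for a given amount of RAM. It is a simplification: it assumes max_low_pfn is just the smaller of max_pfn and the page frame number at 896 MB, whereas the kernel derives it in its own boot code. The names prefixed with `sketch_` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xC0000000u
#define MAXMEM_PFN  ((896u * 1024 * 1024) >> PAGE_SHIFT)   /* frames in 896 MB */

/* Assumed model of max_low_pfn: capped at the 896 MB boundary. */
static inline uint32_t sketch_max_low_pfn(uint32_t max_pfn)
{
    return max_pfn < MAXMEM_PFN ? max_pfn : MAXMEM_PFN;
}

/* Linear address just past the end of the direct mapping. */
static inline uint32_t sketch_high_memory(uint32_t max_pfn)
{
    uint32_t low = sketch_max_low_pfn(max_pfn);
    assert(low > 0);
    /* Same "- 1 ... + 1" shape as the kernel snippet above. */
    return (PAGE_OFFSET + (low << PAGE_SHIFT) - 1) + 1;
}

/* Number of page frames left over for high-end memory. */
static inline uint32_t sketch_highmem_pages(uint32_t max_pfn)
{
    return max_pfn - sketch_max_low_pfn(max_pfn);
}
```

With 512 MB of RAM (131072 frames) the sketch gives 0xE0000000 and no high-end pages; with 2 GB (524288 frames) it gives 0xF8000000, leaving 294912 frames that have no permanent linear address.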

We can see macros such as VMALLOC_START, VMALLOC_END, PKMAP_BASE, and FIXADDR_START. These macros divide the 128 MB of linear space into three areas: the VMALLOC area (not covered here; see the other articles in this blog), the permanent kernel mappings area, and the temporary kernel mappings area. All three areas can be used to map high-end memory. This article focuses on how the last two areas map high-end memory.

3. Permanent kernel mappings

1. Several definitions are introduced:

PKMAP_BASE: the starting linear address of the permanent mapping area.

pkmap_page_table: the page table corresponding to the permanent mapping area.

LAST_PKMAP: the number of entries in pkmap_page_table (1024).

pkmap_count[LAST_PKMAP] array: each element is the reference count of the corresponding entry. The reference count takes the following values:

0: the entry is available.

1: the entry is unavailable. Although it no longer maps any memory, its TLB entry has not been flushed, so it still cannot be used.

n (n >= 2): n-1 users are using this page.
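The three meanings of a pkmap_count[] value can be written down as a tiny classifier. This is only a reading aid; the enum and both function names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>

/* Hypothetical names for the three states an entry can be in. */
enum sketch_pkmap_state {
    SKETCH_PKMAP_FREE,     /* count == 0: entry can be handed out */
    SKETCH_PKMAP_STALE,    /* count == 1: unmapped, TLB not yet flushed */
    SKETCH_PKMAP_IN_USE    /* count >= 2: count - 1 users hold the mapping */
};

/* Classify a reference count according to the list above. */
static inline enum sketch_pkmap_state sketch_pkmap_state_of(int count)
{
    assert(count >= 0);
    if (count == 0)
        return SKETCH_PKMAP_FREE;
    if (count == 1)
        return SKETCH_PKMAP_STALE;
    return SKETCH_PKMAP_IN_USE;
}

/* Number of current users of the mapping (0 for free or stale entries). */
static inline int sketch_pkmap_users(int count)
{
    assert(count >= 0);
    return count >= 2 ? count - 1 : 0;
}
```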

First, note that this area is 4 MB in size; that is, only 4 MB of the 128 MB linear address space is used for permanent mappings. Which 4 MB? That is determined by PKMAP_BASE, which gives the starting linear address of the 4 MB range used for permanent mappings.

On non-PAE i386, each page directory entry covers 4 MB of address space, so the permanent mapping area needs only one page directory entry. Since a page directory entry points to one page table, the whole permanent mapping area can be described by a single page table, and pkmap_page_table points to that page table.

    pgd = swapper_pg_dir + pgd_index(vaddr);
    pud = pud_offset(pgd, vaddr);          // pud = pgd
    pmd = pmd_offset(pud, vaddr);          // pmd = pud = pgd
    pte = pte_offset_kernel(pmd, vaddr);
    pkmap_page_table = pte;
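The claim that one page directory entry is enough can be checked with a little arithmetic. The sketch below counts how many page directory entries a region touches; the base address 0xFF800000 used in the note after the code is an assumed, 4 MB-aligned value chosen for illustration and is not taken from the kernel configuration discussed here. The names prefixed with `sketch_` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       4096u
#define PTRS_PER_PTE    1024u                        /* entries in one page table */
#define PGDIR_SHIFT     22
#define SKETCH_PGD_SPAN (PTRS_PER_PTE * PAGE_SIZE)   /* 4 MB per directory entry */

/* How many page directory entries does [start, start + size) touch? */
static inline uint32_t sketch_pgd_entries_needed(uint32_t start, uint32_t size)
{
    assert(size > 0 && start <= UINT32_MAX - (size - 1));
    uint32_t first = start >> PGDIR_SHIFT;
    uint32_t last  = (start + (size - 1)) >> PGDIR_SHIFT;
    return last - first + 1;
}
```

A 4 MB region starting at the aligned address 0xFF800000 touches exactly one entry, while the same region starting at 0xFF801000 would straddle two. So a single page table can only describe the area if its base is aligned to 4 MB.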

2. Code Analysis (2.6.31)

    void *kmap(struct page *page)
    {
        might_sleep();
        if (!PageHighMem(page))
            return page_address(page);
        return kmap_high(page);
    }

kmap() is the function used to establish a permanent mapping. Because calling kmap() may block the process, it must not be called from interrupt handlers or other contexts that cannot block; might_sleep() prints stack information when the function is called from such a context. Next, kmap() checks whether the page really belongs to high-end memory. Every page of low-end memory already has a linear address mapping, so for those pages page_address() simply returns the page's linear address (for details on page_address(), see the dedicated article in this blog). Finally, kmap_high(page) is called; kmap_high() is what actually creates the permanent mapping.

    /**
     * kmap_high - map a highmem page into memory
     * @page: &struct page to map
     *
     * Returns the page's virtual memory address.
     *
     * We cannot call this from interrupts, as it may block.
     */
    void *kmap_high(struct page *page)
    {
        unsigned long vaddr;

        /*
         * For highmem pages, we can't trust "virtual" until
         * after we have the lock.
         */
        lock_kmap();
        vaddr = (unsigned long)page_address(page);
        if (!vaddr)
            vaddr = map_new_virtual(page);
        pkmap_count[PKMAP_NR(vaddr)]++;
        BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2);
        unlock_kmap();
        return (void *)vaddr;
    }

Analysis of kmap_high(): it first takes the lock that protects pkmap_page_table, then calls page_address() to check whether the page is already mapped. Why is this check made only after taking the lock? Because while we are waiting for the lock, another CPU may hold it, run this same code, and map the same page. By the time it releases the lock the mapping already exists, and we must not create it again. The check is therefore only reliable once the lock is held.

If vaddr is non-zero, the mapping already exists (as just mentioned, it may have been created by a task running on another CPU), and we only need to increment the page's reference count in pkmap_count[]. BUG_ON then verifies that the count is at least 2; anything less indicates a bug. Finally vaddr is returned and the mapping is complete.
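The reference-count behaviour of kmap_high() can be modelled in a few lines of user-space C. This is a deliberately simplified sketch: the pool has 4 slots instead of 1024, locking is only indicated by comments, a full pool returns -1 instead of sleeping, and all names prefixed with `sim_` are hypothetical.

```c
#include <assert.h>

#define SIM_SLOTS 4   /* tiny pool for illustration; the kernel uses 1024 */

/* slot is -1 while the page is unmapped; it stands in for page_address(). */
struct sim_page {
    int slot;
};

static int sim_count[SIM_SLOTS];   /* models pkmap_count[] */

/* Stand-in for map_new_virtual(): claim the first free slot, count = 1. */
static inline int sim_map_new(struct sim_page *page)
{
    for (int i = 0; i < SIM_SLOTS; i++) {
        if (sim_count[i] == 0) {
            sim_count[i] = 1;
            page->slot = i;
            return i;
        }
    }
    return -1;   /* the kernel would sleep here instead */
}

/* Model of kmap_high(): re-check under the lock, then take a reference. */
static inline int sim_kmap_high(struct sim_page *page)
{
    /* lock_kmap() would be taken here */
    int slot = page->slot;            /* already mapped by someone else? */
    if (slot < 0)
        slot = sim_map_new(page);
    if (slot < 0)
        return -1;
    sim_count[slot]++;
    assert(sim_count[slot] >= 2);     /* mirrors BUG_ON(count < 2) */
    /* unlock_kmap() would be released here */
    return slot;
}
```

Mapping the same page twice returns the same slot and raises its count from 2 to 3, while a different page gets the next free slot.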

What if vaddr is zero? Then map_new_virtual() is called; the code that actually creates the mapping is in this function.

    static inline unsigned long map_new_virtual(struct page *page)
    {
        unsigned long vaddr;
        int count;

    start:
        count = LAST_PKMAP;    // LAST_PKMAP = 1024
        /* Find an empty entry */
        for (;;) {
            last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
            if (!last_pkmap_nr) {
                flush_all_zero_pkmaps();
                count = LAST_PKMAP;
            }
            if (!pkmap_count[last_pkmap_nr])
                break;    /* Found a usable entry */
            if (--count)
                continue;

            /*
             * Sleep for somebody else to unmap their entries
             */
            {
                DECLARE_WAITQUEUE(wait, current);

                __set_current_state(TASK_UNINTERRUPTIBLE);
                add_wait_queue(&pkmap_map_wait, &wait);
                unlock_kmap();
                schedule();
                remove_wait_queue(&pkmap_map_wait, &wait);
                lock_kmap();

                /* Somebody else might have mapped it while we slept */
                if (page_address(page))
                    return (unsigned long)page_address(page);

                /* Re-start */
                goto start;
            }
        }
        vaddr = PKMAP_ADDR(last_pkmap_nr);
        set_pte_at(&init_mm, vaddr,
                   &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));

        pkmap_count[last_pkmap_nr] = 1;
        set_page_address(page, (void *)vaddr);

        return vaddr;
    }

last_pkmap_nr records the position in pkmap_page_table of the most recently allocated page table entry. Its initial value is 0, so the first allocation uses last_pkmap_nr = 1.

Next comes the check for last_pkmap_nr == 0. A value of 0 means the search has wrapped around after passing the last entry, number 1023 (LAST_PKMAP - 1). In that case flush_all_zero_pkmaps() is called: for every page table entry whose pkmap_count[] is 1, it flushes the TLB entry and resets the count to 0, so that the entry can be used again. You may wonder why the TLB is not flushed at the moment pkmap_count drops to 1. I personally think it is for efficiency: flushing only when entries run short should be cheaper.

If pkmap_count[last_pkmap_nr] is 0, the page table entry is available and we break out of the loop.
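The circular search and the deferred flush can be simulated in user space. The sketch below shrinks the table to 8 entries, replaces the real TLB flush with a counter, and returns -1 at the point where the kernel would put the task to sleep; every name prefixed with `sim_` or `SIM_` is hypothetical.

```c
#include <assert.h>

#define SIM_LAST_PKMAP 8                      /* kernel: 1024; power of two */
#define SIM_MASK       (SIM_LAST_PKMAP - 1)

static int sim_pkmap_count[SIM_LAST_PKMAP];   /* models pkmap_count[] */
static unsigned int sim_last_nr;              /* models last_pkmap_nr */
static int sim_flushes;                       /* how often the "flush" ran */

/* Model of flush_all_zero_pkmaps(): stale entries (count 1) become free. */
static inline void sim_flush_all_zero(void)
{
    for (int i = 0; i < SIM_LAST_PKMAP; i++)
        if (sim_pkmap_count[i] == 1)
            sim_pkmap_count[i] = 0;
    sim_flushes++;
}

/* Search loop of map_new_virtual(); -1 where the kernel would sleep. */
static inline int sim_find_slot(void)
{
    int count = SIM_LAST_PKMAP;

    for (;;) {
        sim_last_nr = (sim_last_nr + 1) & SIM_MASK;
        if (!sim_last_nr) {                   /* wrapped around: flush */
            sim_flush_all_zero();
            count = SIM_LAST_PKMAP;
        }
        if (!sim_pkmap_count[sim_last_nr])
            return (int)sim_last_nr;          /* found a usable entry */
        if (--count)
            continue;
        return -1;                            /* every entry is in use */
    }
}
```

Starting from an empty table the first slot handed out is 1, not 0, and entries left at count 1 are only recycled when the search wraps around to index 0.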

PKMAP_ADDR(last_pkmap_nr) returns the linear address vaddr corresponding to this page table entry:

    #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
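The conversion between an entry number and its linear address, and back again as PKMAP_NR() does, can be tried out in user space. The base address below is an assumed value for illustration only; the real PKMAP_BASE depends on the kernel configuration. The names prefixed with `sketch_` or `SKETCH_` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT        12
#define LAST_PKMAP        1024u
#define SKETCH_PKMAP_BASE 0xFF800000u   /* assumed value, not from the kernel */
#define SKETCH_PKMAP_END  (SKETCH_PKMAP_BASE + (LAST_PKMAP << PAGE_SHIFT))

/* Entry number -> linear address (same arithmetic as PKMAP_ADDR). */
static inline uint32_t sketch_pkmap_addr(uint32_t nr)
{
    assert(nr < LAST_PKMAP);
    return SKETCH_PKMAP_BASE + (nr << PAGE_SHIFT);
}

/* Linear address -> entry number (same arithmetic as PKMAP_NR). */
static inline uint32_t sketch_pkmap_nr(uint32_t vaddr)
{
    assert(vaddr >= SKETCH_PKMAP_BASE && vaddr < SKETCH_PKMAP_END);
    return (vaddr - SKETCH_PKMAP_BASE) >> PAGE_SHIFT;
}
```

Any address inside a page maps back to the same entry number, which is what lets kmap_high() and kunmap_high() index pkmap_count[] from a virtual address.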

On non-PAE i386, set_pte_at(mm, addr, ptep, pte) is very simple; it is equivalent to the following code:

    static inline void native_set_pte(pte_t *ptep, pte_t pte)
    {
        *ptep = pte;
    }

We already know that pkmap_page_table holds the linear start address of the page table, so the address of the available page table entry is &pkmap_page_table[last_pkmap_nr]. With the address of the page table entry in hand, filling in the corresponding pte completes the mapping.

A pte consists of two parts: the high 20 bits hold the physical address of the page frame, and the low 12 bits hold the page's attribute flags.

How do we get the physical address from the page (refer to page_address)? It is simple: compute (page - mem_map) and shift the result left by PAGE_SHIFT bits.

The low 12 bits of flags are fixed: kmap_prot = (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL).

The following code is used to do these tasks:

    mk_pte(page, kmap_prot);

    #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))

    #define page_to_pfn __page_to_pfn

    #define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + \
                                 ARCH_PFN_OFFSET)

    static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
    {
        return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
                     massage_pgprot(pgprot));
    }
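How a page table entry is assembled from a page frame number and the flag bits can be reproduced with plain integers. The sketch assumes the flat memory model with ARCH_PFN_OFFSET equal to 0 and uses the architectural x86 flag bit positions; the `sketch_` and `PTE_` names are hypothetical stand-ins for the kernel's own types and macros.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Architectural x86 page table entry flag bits. */
#define PTE_PRESENT  0x001u
#define PTE_RW       0x002u
#define PTE_ACCESSED 0x020u
#define PTE_DIRTY    0x040u
#define PTE_GLOBAL   0x100u

#define SKETCH_KMAP_PROT \
    (PTE_PRESENT | PTE_RW | PTE_DIRTY | PTE_ACCESSED | PTE_GLOBAL)

/* Minimal stand-in for struct page; only its array position matters. */
struct sketch_page {
    int unused;
};

/* page - mem_map gives the page frame number (flat model, offset 0). */
static inline uint32_t sketch_page_to_pfn(const struct sketch_page *page,
                                          const struct sketch_page *mem_map)
{
    return (uint32_t)(page - mem_map);
}

/* High 20 bits: frame number; low 12 bits: attribute flags. */
static inline uint32_t sketch_mk_pte(uint32_t pfn, uint32_t prot)
{
    assert(pfn < (1u << 20));
    return (pfn << PAGE_SHIFT) | (prot & 0xFFFu);
}
```

For the first high-end page frame (number 0x38000, at 896 MB) this yields the entry value 0x38000163.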

Next, pkmap_count[last_pkmap_nr] is set to 1. Doesn't 1 mean the entry is unavailable? Since the mapping has been established, shouldn't the value be 2? In fact, the increment to 2 is done in the caller, kmap_high() (pkmap_count[PKMAP_NR(vaddr)]++).

At this point the mapping is complete, and set_page_address() adds the page and its linear address to the page_address_htable hash list (refer to page_address).

Now let's see what happens when all page table entries are in use, that is, all 1024 entries are mapped. In that case count reaches 0 and the following code is executed:

    /*
     * Sleep for somebody else to unmap their entries
     */
    {
        DECLARE_WAITQUEUE(wait, current);

        __set_current_state(TASK_UNINTERRUPTIBLE);
        add_wait_queue(&pkmap_map_wait, &wait);
        unlock_kmap();
        schedule();
        remove_wait_queue(&pkmap_map_wait, &wait);
        lock_kmap();

        /* Somebody else might have mapped it while we slept */
        if (page_address(page))
            return (unsigned long)page_address(page);

        /* Re-start */
        goto start;
    }

This code is simple: it adds the current task to the wait queue pkmap_map_wait. When another task wakes the queue, execution continues with goto start and the whole process is repeated, unless the page has been mapped in the meantime. This is why calling kmap() may block.

When is the pkmap_map_wait queue woken? When kunmap_high() is called to release a mapping.

kunmap_high() itself is simple: it decrements the count of the page table entry being released by 1. If the result is 1, a page table entry has become available, and the pkmap_map_wait queue is woken.

    /**
     * kunmap_high - map a highmem page into memory
     * @page: &struct page to unmap
     *
     * If ARCH_NEEDS_KMAP_HIGH_GET is not defined then this may be called
     * only from user context.
     */
    void kunmap_high(struct page *page)
    {
        unsigned long vaddr;
        unsigned long nr;
        unsigned long flags;
        int need_wakeup;

        lock_kmap_any(flags);
        vaddr = (unsigned long)page_address(page);
        BUG_ON(!vaddr);
        nr = PKMAP_NR(vaddr);

        /*
         * A count must never go down to zero
         * without a TLB flush!
         */
        need_wakeup = 0;
        switch (--pkmap_count[nr]) {    // subtract one
        case 0:
            BUG();
        case 1:
            /*
             * Avoid an unnecessary wake_up() function call.
             * The common case is pkmap_count[] == 1, but
             * no waiters.
             * The tasks queued in the wait-queue are guarded
             * by both the lock in the wait-queue-head and by
             * the kmap_lock.  As the kmap_lock is held here,
             * no need for the wait-queue-head's lock.  Simply
             * test if the queue is empty.
             */
            need_wakeup = waitqueue_active(&pkmap_map_wait);
        }
        unlock_kmap_any(flags);

        /* do wake-up, if needed, race-free outside of the spin lock */
        if (need_wakeup)
            wake_up(&pkmap_map_wait);
    }
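The decision logic of kunmap_high() reduces to a decrement plus one comparison, which the following user-space sketch models. The wait queue is replaced by a boolean parameter, locking is omitted, and the `sim_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_SLOTS 4

static int sim_unmap_count[SIM_SLOTS];   /* models pkmap_count[] */

/*
 * Model of kunmap_high(): drop one reference to slot nr and report
 * whether a wake-up of the waiting tasks would be issued.
 */
static inline bool sim_kunmap_high(int nr, bool waiters_present)
{
    assert(nr >= 0 && nr < SIM_SLOTS);

    int remaining = --sim_unmap_count[nr];
    assert(remaining >= 1);        /* mirrors BUG() when the count hits 0 */

    if (remaining == 1)            /* last user is gone */
        return waiters_present;    /* stands in for waitqueue_active() */
    return false;
}
```

The count stops at 1 rather than 0: the entry has no users left, but it only becomes free again after the next flush resets it to 0.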
