The first generation of UNIX systems implemented process creation naively: when a fork() system call was made, the kernel copied the parent's entire address space page by page and assigned the copy to the child process. This is very time-consuming, because it requires the kernel to:
- Allocate page frames for the child's page tables
- Allocate page frames for the child's pages
- Initialize the child's page tables
- Copy each page of the parent into the corresponding page of the child
This method involves many memory accesses, consumes many CPU cycles, and thoroughly pollutes the hardware caches. Worse, the work is usually wasted: in most cases the child immediately loads a new program to execute, discarding the inherited address space entirely.
The current Linux kernel adopts a far more efficient approach called copy-on-write (COW). The idea is simple: parent and child share page frames instead of duplicating them. Shared frames, however, must not be modified, so they are marked write-protected. Whenever parent or child tries to write to a shared frame, an exception occurs; the kernel then copies the page into a new frame and marks that frame writable. The original frame remains write-protected: when another process later attempts to write to it, the kernel checks whether the writer is the frame's sole remaining owner, and if so it simply marks the frame writable for that process.
The _count field of the page descriptor is used to track the number of processes sharing the corresponding page frame. Whenever a process releases a page frame, or copies it while handling a COW write, _count is decremented; the page frame is freed only when _count drops to -1. This was discussed in the previous post.
Now let's look at how Linux implements copy-on-write. Recall that when the handle_pte_fault() function from the previous post determines that the fault was caused by an access to a page already present in memory, it executes:
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (!pte_same(old_entry, entry)) {
		ptep_set_access_flags(vma, address, pte, entry, write_access);
		update_mmu_cache(vma, address, entry);
		lazy_mmu_prot_update(entry);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (write_access)
			flush_tlb_page(vma, address);
	}
unlock:
	pte_unmap_unlock(pte, ptl);
	return VM_FAULT_MINOR;
The handle_pte_fault() function is architecture-independent: it considers every possible way a page-access permission can be violated. On the 80x86 architecture, however, if the page is present, the access is a write (write_access = 1), and the page is write-protected (see the previous post "Handling faulty addresses inside the address space"), then do_wp_page() is always the function that gets called.
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		spinlock_t *ptl, pte_t orig_pte)
{
	struct page *old_page, *new_page;
	pte_t entry;
	int reuse = 0, ret = VM_FAULT_MINOR;
	struct page *dirty_page = NULL;
	int dirty_pte = 0;

	old_page = vm_normal_page(vma, address, orig_pte);
	if (!old_page)
		goto gotten;

	/*
	 * Take out anonymous pages first, anonymous shared vmas are
	 * not dirty accountable.
	 */
	if (PageAnon(old_page)) {
		if (TestSetPageLocked(old_page)) {
			page_cache_get(old_page);
			pte_unmap_unlock(page_table, ptl);
			lock_page(old_page);
			page_table = pte_offset_map_lock(mm, pmd, address,
							 &ptl);
			if (!pte_same(*page_table, orig_pte)) {
				unlock_page(old_page);
				page_cache_release(old_page);
				goto unlock;
			}
			page_cache_release(old_page);
		}
		reuse = can_share_swap_page(old_page);
		unlock_page(old_page);
	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
				(VM_WRITE|VM_SHARED))) {
		/*
		 * Only catch write-faults on shared writable pages,
		 * read-only shared pages can get COWed by
		 * get_user_pages(.write=1, .force=1).
		 */
		vfs_check_frozen(vma->vm_file->f_dentry->d_inode->i_sb,
				 SB_FREEZE_WRITE);
		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
			/*
			 * Notify the address space that the page is about to
			 * become writable so that it can prohibit this or wait
			 * for the page to get into an appropriate state.
			 *
			 * We do this without the lock held, so that it can
			 * sleep if it needs to.
			 */
			page_cache_get(old_page);
			pte_unmap_unlock(page_table, ptl);

			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
				goto unwritable_page;
			page_cache_release(old_page);

			/*
			 * Since we dropped the lock we need to revalidate
			 * the PTE as someone else may have changed it.  If
			 * they did, we just return, as we can count on the
			 * MMU to tell us if they didn't also make it writable.
			 */
			page_table = pte_offset_map_lock(mm, pmd, address,
							 &ptl);
			if (!pte_same(*page_table, orig_pte))
				goto unlock;
		}
		dirty_page = old_page;
		get_page(dirty_page);
		reuse = 1;
	}

	if (reuse) {
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		entry = pte_mkyoung(orig_pte);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		dirty_pte++;
		ptep_set_access_flags(vma, address, page_table, entry, 1);
		update_mmu_cache(vma, address, entry);
		lazy_mmu_prot_update(entry);
		ret |= VM_FAULT_WRITE;
		goto unlock;
	}

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	page_cache_get(old_page);
gotten:
	pte_unmap_unlock(page_table, ptl);

	if (unlikely(anon_vma_prepare(vma)))
		goto oom;
	if (old_page == ZERO_PAGE(address)) {
		new_page = alloc_zeroed_user_highpage(vma, address);
		if (!new_page)
			goto oom;
	} else {
		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
		if (!new_page)
			goto oom;
		cow_user_page(new_page, old_page, address);
	}

	/*
	 * Re-check the pte - we dropped the lock
	 */
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (likely(pte_same(*page_table, orig_pte))) {
		if (old_page) {
			page_remove_rmap(old_page);
			if (!PageAnon(old_page)) {
				dec_mm_counter(mm, file_rss);
				inc_mm_counter(mm, anon_rss);
				trace_mm_filemap_cow(mm, address, new_page);
			}
		} else {
			inc_mm_counter(mm, anon_rss);
			trace_mm_anon_cow(mm, address, new_page);
		}
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		entry = mk_pte(new_page, vma->vm_page_prot);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		dirty_pte++;
		lazy_mmu_prot_update(entry);
		/*
		 * Clear the pte entry and flush it first, before updating the
		 * pte with the new entry. This will avoid a race condition
		 * seen in the presence of one thread doing SMC and another
		 * thread doing COW.
		 */
		ptep_clear_flush_notify(vma, address, page_table);
		set_pte_at(mm, address, page_table, entry);
		update_mmu_cache(vma, address, entry);
		lru_cache_add_active(new_page);
		page_add_new_anon_rmap(new_page, vma, address);

		/* Free the old page.. */
		new_page = old_page;
		ret |= VM_FAULT_WRITE;
	}
	if (new_page)
		page_cache_release(new_page);
	if (old_page)
		page_cache_release(old_page);
unlock:
	pte_unmap_unlock(page_table, ptl);
	if (dirty_page) {
		if (flush_mmap_pages || !dirty_pte)
			set_page_dirty_balance(dirty_page);
		put_page(dirty_page);
	}
	return ret;
oom:
	if (old_page)
		page_cache_release(old_page);
	return VM_FAULT_OOM;
unwritable_page:
	page_cache_release(old_page);
	return VM_FAULT_SIGBUS;
}
The do_wp_page() function (to simplify the description, we skip the statements that deal with reverse mapping) first obtains the page descriptor of the page frame involved in the fault (the frame referenced by the faulting page-table entry):
	old_page = vm_normal_page(vma, address, orig_pte);
Next, the function determines whether copying the page is really necessary. If only one process owns the page, copy-on-write does not apply, and the process may write the page freely. Concretely, the function reads the _count field of the page descriptor: if it equals 0 (a single owner), no copy is needed.
In practice the check is slightly more complex, because _count is also incremented when the page is inserted into the swap cache (and when the PG_private flag of the page descriptor is set). When the copy turns out to be unnecessary, the page is simply marked writable, so that further write accesses cause no more page faults:
	set_pte(page_table, maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma));
	flush_tlb_page(vma, address);
	pte_unmap(page_table);
	spin_unlock(&mm->page_table_lock);
	return VM_FAULT_MINOR;
If instead two or more processes share the page frame through copy-on-write, the function copies the old page's contents into a newly allocated frame. To avoid races, get_page() is called to increment old_page's usage counter before the copy begins:
	old_page = pte_page(pte);
	pte_unmap(page_table);
	get_page(old_page);
	spin_unlock(&mm->page_table_lock);
	if (old_page == virt_to_page(empty_zero_page)) {
		new_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
	} else {
		new_page = alloc_page(GFP_HIGHUSER);
		vfrom = kmap_atomic(old_page, KM_USER0);
		vto = kmap_atomic(new_page, KM_USER1);
		copy_page(vto, vfrom);
		kunmap_atomic(vfrom, KM_USER0);
		kunmap_atomic(vto, KM_USER1);
	}
If the old page frame is the zero page, the new frame is zero-filled at allocation time (the __GFP_ZERO flag); otherwise its contents are copied with the copy_page() macro. Special-casing the zero page is not strictly necessary, but it does improve performance: it avoids referencing the old frame at all, and thus preserves the contents of the processor's hardware cache.
Because page-frame allocation may block the process, the function then checks whether the page-table entry has changed since the handler started (that is, whether pte and *page_table now differ). If so, the new frame is released, old_page's usage counter is decremented (undoing the earlier increment), and the function returns.
If everything went smoothly, the physical address of the new page frame is finally written into the page table, and the corresponding TLB entry is invalidated:
	spin_lock(&mm->page_table_lock);
	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page,
						 vma->vm_page_prot)), vma);
	set_pte(page_table, entry);
	flush_tlb_page(vma, address);
	lru_cache_add_active(new_page);
	pte_unmap(page_table);
	spin_unlock(&mm->page_table_lock);
The lru_cache_add_active() function inserts the new page into the data structures used by the swapping subsystem.
Finally, do_wp_page() decrements old_page's usage counter twice. The first decrement undoes the safety increment taken before the copy; the second reflects the fact that the current process no longer owns the page frame.