What Every Programmer Should Know About Memory, Part 3: Virtual Memory


English original: http://lwn.net/Articles/253361/

Chinese translation: http://www.oschina.net/translate/what-every-programmer-should-know-about-virtual-memory-part3

4. Virtual Memory

The virtual memory subsystem of the processor implements the virtual address space of each process. This makes every process think it is alone in the system. The list of advantages of virtual memory is described in detail elsewhere, so it will not be repeated here. This section concentrates on the actual implementation details of virtual memory and the associated costs.

The virtual address space is implemented by the Memory Management Unit (MMU) of the CPU. The OS has to fill in the page table data structures, but most CPUs do the rest of the work themselves. This is actually a quite complicated mechanism; the best way to understand it is to introduce the data structures used to describe the virtual address space.

The input to the address translation performed by the MMU is a virtual address. There are usually few restrictions on its value, if any. Virtual addresses are 32-bit values on 32-bit systems and 64-bit values on 64-bit systems. On some systems, such as x86 and x86-64, the addresses in use actually involve another level of indirection: these architectures use segments, which simply cause an offset to be added to every logical address. We can ignore this part of address generation; it is insignificant and not something programmers have to care about with respect to memory handling performance. {Segment limits on x86 are performance-relevant, but that is another story.}

4.1 Simplest Address Translation

The interesting part is the translation of the virtual address to a physical address. The MMU can remap addresses on a page-by-page basis. Just as when addressing cache lines, the virtual address is split into distinct parts. These parts are used to index into various tables, which are used in the construction of the final physical address. The simplest model has only one level of tables.

Figure 4.1: 1-level Address Translation

Figure 4.1 shows how the different parts of the virtual address are used. The top part is used to select an entry in a page directory; each entry in that directory can be individually set by the OS. The page directory entry determines the address of a physical memory page; more than one entry can point to the same physical address. The complete physical address is determined by combining the page address from the page directory with the low bits of the virtual address. The page directory entry also contains some additional information about the page, such as access permissions.

The data structure for the page directory is stored in main memory. The OS has to allocate contiguous physical memory for it and store the base address of this memory region in a special register. The appropriate bits of the virtual address are then used as an index into the page directory, which is actually an array of directory entries.

As an example, this is the layout used for 4MB pages on x86 machines. The offset part of the virtual address is 22 bits in size, enough to locate every byte in a 4MB page. The remaining 10 bits of the virtual address select one of the 1024 entries in the page directory. Each entry contains a 10-bit base address of a 4MB page, which is combined with the offset to form a complete 32-bit address.
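To make the bit splitting concrete, here is a minimal C sketch of such a one-level translation (my own illustration, not from the article; the directory layout is a simplified assumption):

#include <stdint.h>

#define OFFSET_BITS 22                        /* 4MB pages: 2^22 bytes each */
#define DIR_ENTRIES 1024                      /* indexed by the top 10 bits */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

/* hypothetical page directory: each entry holds the physical base
   address of a 4MB page (its low 22 bits are zero) */
static uint32_t page_directory[DIR_ENTRIES];

static uint32_t translate(uint32_t vaddr)
{
    uint32_t index  = vaddr >> OFFSET_BITS;   /* top 10 bits: directory index */
    uint32_t offset = vaddr & OFFSET_MASK;    /* low 22 bits: page offset */
    return page_directory[index] | offset;    /* combine into the physical address */
}

Since the low 22 bits pass through unchanged, only the top 10 bits ever need to be looked up; this is the property all the following schemes exploit.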

4.2 Multi-Level Page Tables

4MB pages are not the norm, though; they would waste a lot of memory, since many operations the OS has to perform require alignment to memory pages. With 4kB pages (the norm on 32-bit machines, and frequently still on 64-bit machines), the offset part of the virtual address is only 12 bits in size. This leaves 20 bits as the selector into the page directory. A table with 2^20 entries is not practical: even if each entry were only 4 bytes, the table would be 4MB in size. With each process potentially having its own distinct page directory, much of the physical memory of the system would be tied up by these page directories.

The solution is to use multiple levels of page tables. These can then represent a sparse, huge page directory in which regions that are not actually used need no allocated memory. The representation is therefore much more compact, making it possible to have the page tables of many processes in memory without impacting performance too much.

Today the most complicated page table structures comprise four levels. Figure 4.2 shows the schematics of such an implementation.

Figure 4.2: 4-level Address Translation

The virtual address is, in this example, split into at least five parts. Four of these parts are indexes into the various directories. The level 4 directory is referenced using a special register in the CPU. The content of the level 4 to level 2 directories is a reference to the next lower level directory. If a directory entry is marked empty, it obviously need not point to any lower directory; this way the page table tree can be sparse and compact. The entries of the level 1 directory are, just as in Figure 4.1, partial physical addresses plus auxiliary data like access permissions.

To determine the physical address corresponding to a virtual address, the processor first determines the address of the highest-level directory. This address is usually stored in a register. The CPU then takes the part of the virtual address that indexes this directory and uses it to pick the appropriate entry. This entry is the address of the next directory, which is indexed using the next part of the virtual address. The process continues until the processor reaches the level 1 directory, at which point the value of the directory entry is the high part of the physical address. The physical address is completed by adding the page offset bits from the virtual address. This process is called the page table walk. Some processors (like x86 and x86-64) perform this operation in hardware; others need assistance from the OS.
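As a rough illustration (my own sketch, not from the article), the walk can be written as a loop over the four levels. The 9-bit index width and 12-bit offset match x86-64 with 4kB pages; for simplicity the hypothetical directory entries here store pointers directly, whereas real entries hold physical addresses plus flag bits:

#include <stdint.h>

#define LEVELS     4
#define INDEX_BITS 9                       /* 512 entries per directory */
#define PAGE_SHIFT 12                      /* 4kB pages */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

/* a directory is an array of 512 entries; an entry refers to the next
   lower level's directory (or to the page itself, at level 1) */
typedef struct { uint64_t entry[1 << INDEX_BITS]; } directory;

uint64_t page_tree_walk(directory *level4, uint64_t vaddr)
{
    directory *dir = level4;               /* from the special CPU register */
    for (int level = LEVELS; level >= 1; level--) {
        unsigned shift = PAGE_SHIFT + (level - 1) * INDEX_BITS;
        unsigned index = (vaddr >> shift) & INDEX_MASK;
        uint64_t entry = dir->entry[index];
        if (entry == 0)
            return 0;                      /* empty entry: page fault */
        if (level == 1)                    /* level 1 holds the page address */
            return entry | (vaddr & ((1u << PAGE_SHIFT) - 1));
        dir = (directory *)(uintptr_t)entry;  /* descend one level */
    }
    return 0;                              /* not reached */
}

Note that the four loads are strictly dependent on each other; this serial dependency is exactly why the walk is slow, as discussed in section 4.3.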

Each process running on the system might need its own page table tree. It is possible to partially share trees, but this is rather the exception. It is therefore good for performance and scalability if the memory needed by the page table trees is as small as possible. The ideal case is to place the used memory close together in the virtual address space; the actual physical addresses used do not matter. A small program might get by with just one directory at each of levels 2, 3, and 4 and a few level 1 directories. On x86-64 with 4kB pages and 512 entries per directory, this allows the addressing of 2MB with a total of four directories (one per level). 1GB of contiguous memory can be addressed with one directory each for levels 2 to 4 and 512 level 1 directories.

Assuming all memory can be allocated contiguously is too simplistic, though. For flexibility reasons, the stack and the heap area of a process are in most cases allocated at pretty much opposite ends of the address space. This allows either area to grow as much as possible if needed. This means that two level 2 directories, and correspondingly more lower-level directories, are most likely required.

But even this does not always match actual practice. For security reasons, the various parts of an executable (code, data, heap, stack, Dynamic Shared Objects, aka shared libraries) are mapped at randomized addresses [not selected]. The randomization extends to the relative positions of the various parts, which implies that the memory regions in use in a process are spread throughout the virtual address space. Limits can be applied to the number of randomized address bits, but in most cases this certainly will not allow a process to run with just one or two directories at levels 2 and 3.

If performance really is much more important than security, randomization can be turned off. The OS will then usually at least load all dynamic shared objects (DSOs) contiguously in virtual memory.

4.3 Optimizing Page Table Access

All the data structures for the page tables are kept in main memory; this is where the OS constructs and updates the tables. Upon creation of a process, or a change of a page table, the CPU is notified. The page tables are used to resolve every virtual address into a physical address using the page table walk described above. More to the point: at least one directory for each level is used in the process of resolving a virtual address. This requires up to four memory accesses (for a single access by the running process), which is slow. It is possible to treat these directory table entries as normal data and cache them in L1d, L2, and so on, but that would still be far too slow.

From the earliest days of virtual memory, CPU designers have therefore used a different optimization. A simple computation shows that keeping the directory table entries only in L1d and the higher caches would lead to horrible performance: each absolute address computation would require a number of L1d accesses corresponding to the page table depth. These accesses cannot be parallelized, since each depends on the result of the previous lookup. On a machine with four page table levels alone, this would require at least 12 cycles (four dependent lookups at an L1d hit latency of about three cycles each). Add to that the probability of an L1d miss, and the result is nothing the instruction pipeline could hide. The additional L1d accesses would also steal precious cache bandwidth.

So, instead of caching the directory table entries, the complete computation of the physical page address is cached. For the same reason that code and data caches work, such a cache of address computations is effective. Since the page offset part of the virtual address plays no role in the computation of the physical page address, only the rest of the virtual address is used as the tag for the cache. Depending on the page size, this means hundreds or thousands of instructions or data objects share the same tag, and therefore the same physical address prefix.

The cache into which the computed values are stored is called the Translation Look-Aside Buffer (TLB). It is usually a small cache, since it has to be extremely fast. Like the other caches, modern CPUs provide multi-level TLB caches; the higher-level caches are larger and slower. The small size of the L1 TLB is often made up for by making it fully associative with an LRU eviction policy. Recently this cache has been growing in size and, in the process, has been changed to be set associative. As a result, it may not be the oldest entry that is evicted when a new entry has to be added.

As noted above, the tag used to access the TLB is a part of the virtual address. If the tag has a match in the cache, the final physical address is computed by adding the page offset from the virtual address to the cached value. This is a very fast process; it has to be, since a physical address must be available for every instruction using absolute addresses and, in some cases, for L2 lookups which use the physical address as the key. If the TLB lookup misses, the processor has to perform a page table walk; this can be quite costly.
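To make the mechanics concrete, here is a toy, fully associative TLB in C (an illustrative sketch only; real TLBs are hardware structures, and all names here are invented):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define TLB_SIZE   16                       /* small, as real L1 TLBs are */

struct tlb_entry {
    uint64_t tag;    /* virtual address without the page offset bits */
    uint64_t frame;  /* cached physical page address */
    bool     valid;
};

static struct tlb_entry tlb[TLB_SIZE];

/* returns true and fills *paddr on a hit; a miss means a page table walk */
static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t tag = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].tag == tag) {
            *paddr = tlb[i].frame | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;  /* miss: the processor must walk the page table tree */
}

A hit is just a tag compare plus an OR with the page offset, which is why the TLB can deliver a physical address in time for every instruction.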

Prefetching code or data through software or hardware would implicitly prefetch TLB entries whenever an address is located on another page. This cannot be allowed for hardware prefetching, because the hardware could initiate page table walks that are invalid. Programmers therefore cannot rely on hardware prefetching to prefetch TLB entries; it has to be done explicitly using prefetch instructions. TLBs, just like data and instruction caches, can appear in multiple levels. And just as for the data cache, the TLB usually appears in two flavors: an instruction TLB (ITLB) and a data TLB (DTLB). Higher-level TLBs, such as the L2 TLB, are usually unified, as is the case with the other caches.
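As an example of such explicit prefetching, GCC's __builtin_prefetch can be used to touch an address on the next page ahead of time; the loop below is a made-up illustration, not code from the article:

/* sketch: walk a large array page by page, prefetching one page ahead
   so the translation (and cache line) is ready when we get there */
#define PAGE_SIZE 4096

void sum_pages(const char *data, long npages, long *out)
{
    long sum = 0;
    for (long p = 0; p < npages; p++) {
        if (p + 1 < npages)
            __builtin_prefetch(data + (p + 1) * PAGE_SIZE);
        for (long i = 0; i < PAGE_SIZE; i++)
            sum += data[p * PAGE_SIZE + i];
    }
    *out = sum;
}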

4.3.1 Caveats of Using the TLB

The TLB is a processor-core global resource. All threads and processes executing on the core use the same TLB. Since the translation of virtual to physical addresses depends on which page table tree is installed, the CPU cannot blindly reuse the cached entries if the page table changes. Each process has a different page table tree (but not the threads of the same process), as does the kernel and the VMM (hypervisor), if present. It is also possible that the address space layout of a process changes. There are two ways to deal with this problem:

  1. The TLB is flushed whenever the page table tree is changed.
  2. The tags of the TLB entries are extended to additionally and uniquely identify the page table tree they refer to.

In the first case, the TLB is flushed whenever a context switch is performed. Since in most OSes a switch from one thread/process to another requires executing some kernel code, TLB flushes are restricted to entering and leaving the kernel address space. On virtualized systems it also happens when the kernel has to call the VMM, and on the way back. If the kernel and/or VMM does not have to use virtual addresses, or can reuse the same virtual addresses as the process or kernel that made the system/VMM call, the TLB only has to be flushed if, upon leaving the kernel or VMM, the processor resumes execution of a different process or kernel.

Flushing the TLB is effective but expensive. When executing a system call, for instance, the kernel code touched might be restricted to a few thousand instructions that touch perhaps a handful of new pages (or one huge page, as is the case for Linux on some architectures). This would replace only as many TLB entries as pages are touched. For Intel's Core 2 architecture with its 128 ITLB and 256 DTLB entries, a full flush means that more than 100 and 200 entries (respectively) are flushed unnecessarily. When the system call returns to the same process, all those flushed TLB entries could have been used again, but they are gone. The same is true for frequently used code in the kernel or VMM. On each entry into the kernel the TLB has to be filled from scratch, even though the page tables of the kernel and VMM usually do not change, so that, in theory, TLB entries could be preserved for a very long time. This also explains why the TLB caches in today's processors are not bigger: programs will most likely not run long enough to fill all these entries.

This fact, of course, has not escaped the CPU architects. One possible way to optimize the cache flushes is to invalidate TLB entries individually. For instance, if the kernel code and data fall into a specific address range, only the pages falling into that range have to be evicted from the TLB. This only requires comparing tags and is therefore not very expensive. This method is also useful when a part of the address space changes, for instance through a call to munmap.

A much better solution is to extend the tag used for TLB accesses. If, in addition to a part of the virtual address, a unique identifier for each page table tree (i.e., a process's address space) is added, the TLB does not have to be completely flushed at all. The kernel, the VMM, and the individual processes can all have unique identifiers. The only problem with this scheme is that the number of bits available for the TLB tag is limited, while the number of address spaces is not. This means some identifiers must be reused, and when this happens the TLB has to be partially flushed (if that is possible): all entries with the reused identifier must be flushed, but this is hopefully a much smaller set.

This extended TLB tagging is of general advantage when multiple processes are running on the system. If the memory use (and hence TLB entry use) of each runnable process is limited, there is a good chance that the most recently used TLB entries of a process are still in the TLB when it is scheduled again. But there are two additional advantages:

  1. Special address spaces, such as those used by the kernel and the VMM, are often entered only for a short time; afterward, control usually returns to the address space which initiated the entry. Without tags, one or two TLB flushes are performed. With tags, the calling address space's cached translations are preserved and, since the kernel and VMM address spaces do not change their TLB entries often at all, the translations from previous system calls, etc. can still be used.
  2. When switching between two threads of the same process, no TLB flush is necessary at all. Without extended TLB tags, though, the entry into the kernel destroys the first thread's TLB entries.

Some processors have, for some time, implemented such extended tags. AMD introduced a 1-bit tag extension with the Pacifica virtualization extensions. In the context of virtualization, this 1-bit Address Space ID (ASID) is used to distinguish the VMM's address space from those of the guest domains. This allows the OS to avoid flushing the guest's TLB entries every time the VMM is entered (for instance, to handle a page fault), or the VMM's TLB entries when control returns to the guest. The architecture will allow the use of more bits in the future. Other mainstream processors will likely follow suit and support this feature.

4.3.2 Influencing TLB Performance

A couple of factors influence TLB performance. The first is the size of the pages. Obviously, the larger a page is, the more instructions or data objects fit into it. So a larger page size reduces the overall number of address translations needed, meaning fewer TLB cache entries are required. Most architectures allow the use of multiple different page sizes; some sizes can even coexist. For instance, x86/x86-64 processors have a normal page size of 4kB, but they can also use 4MB and 2MB pages respectively. IA-64 and PowerPC allow sizes like 64kB as the base page size.

The use of large page sizes brings some problems with it, though. The memory regions used for large pages must be contiguous in physical memory. If the unit size for the administration of physical memory is raised to the size of the virtual memory pages, the amount of wasted memory grows. All kinds of memory operations (like loading executables) require alignment to page boundaries, which means each mapping wastes, on average, half the page size in physical memory. This waste easily adds up; it thus puts an upper limit on the reasonable unit size for physical memory allocation.

It is certainly not practical to raise the unit size to 2MB to accommodate large pages on x86-64; that is just too large a size. But this in turn means that each large page has to be composed of many smaller pages, and these small pages have to be contiguous in physical memory. Allocating 2MB of contiguous physical memory with a unit page size of 4kB can be challenging: it requires finding a free area of 512 contiguous pages. This can be extremely difficult (or impossible) after the system has run for a while and physical memory has become fragmented.

On Linux it is therefore necessary to pre-allocate these big pages at system start time, using the special hugetlbfs filesystem. A fixed number of physical pages is reserved for exclusive use as big virtual pages. This ties down resources which might not always be used. It is also a limited pool; increasing it normally means restarting the system. Still, huge pages are the way to go in situations where performance is at a premium, resources are plentiful, and cumbersome setup is not a big deterrent. Database servers are an example.
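On Linux kernels newer than those the article describes, a reserved huge page can also be requested directly with mmap's MAP_HUGETLB flag (hugetlbfs mounts work as well). A minimal sketch, assuming huge pages have been reserved beforehand, for example via /proc/sys/vm/nr_hugepages:

#include <sys/mman.h>
#include <stdio.h>

#define HUGE_LEN (2UL * 1024 * 1024)   /* one 2MB huge page on x86-64 */

int main(void)
{
    /* fails with ENOMEM if no huge pages are reserved in the pool */
    void *p = mmap(NULL, HUGE_LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    ((char *)p)[0] = 1;                /* touch it: one TLB entry covers 2MB */
    munmap(p, HUGE_LEN);
    return 0;
}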

The smallest virtual page size (as opposed to big pages) has its problems too. Memory mapping operations (loading applications, for example) must conform to these page sizes; no smaller mappings are possible. For most architectures, the various parts of an executable have a fixed relationship to each other. If the page size is increased beyond what was taken into account when the executable or DSO (Dynamic Shared Object) was built, the load operation cannot be performed. It is important to keep this limitation in mind. Figure 4.3 shows how the alignment requirements of an ELF binary can be determined; they are encoded in the ELF program header.

$ eu-readelf -l /bin/ls
Program Headers:
  Type   Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
...
  LOAD   0x000000 0x0000000000400000 0x0000000000400000 0x0132ac 0x0132ac R E 0x200000
  LOAD   0x0132b0 0x00000000006132b0 0x00000000006132b0 0x001a71 0x001a71 RW  0x200000
...

Figure 4.3: ELF Program header indicates alignment requirements

In this example, an x86-64 binary, the alignment value of 0x200000 = 2,097,152 = 2MB corresponds to the maximum page size supported by the processor.

A second effect of using larger page sizes is that the number of levels of the page table tree is reduced. Since the part of the virtual address corresponding to the page offset grows, there are not that many bits left which need to be handled through page directories. This means that, on a TLB miss, the amount of work that has to be done is reduced.

Beyond using large page sizes, it is possible to reduce the number of TLB entries needed at the same time by moving data which is used together onto fewer pages. This is similar to some of the cache optimizations mentioned above, only now the alignment requirement is large. Given that the number of TLB entries is quite small, this can be an important optimization.
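As an illustration of this packing (my own sketch, not from the article), frequently used objects can be allocated together from one page-aligned block, so that a single TLB entry covers all of them instead of each separate allocation landing on its own page:

#include <stdlib.h>

#define PAGE_SIZE 4096

struct hot { int key; int value; };   /* hypothetical frequently-used object */

/* allocate all hot objects from one page-aligned block: together they
   then need one 4kB TLB entry instead of one entry each */
struct hot *alloc_hot_array(size_t n)
{
    void *p = NULL;
    if (n * sizeof(struct hot) > PAGE_SIZE)
        return NULL;                   /* keep the hot set within one page */
    if (posix_memalign(&p, PAGE_SIZE, PAGE_SIZE) != 0)
        return NULL;
    return (struct hot *)p;
}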

4.4 Impact of Virtualization

Virtualization of OS images will become more and more prevalent; this means another layer of memory handling is added to the picture. Virtualization of processes (basically jails) or OS containers does not fall into this category, since only one OS is involved. Technologies like Xen or KVM enable the execution of independent OS images, with or without help from the processor. In these situations there is one piece of software alone which directly controls access to the physical memory.

Figure 4.4: Xen Virtualization Model

In the case of Xen (see Figure 4.4), the Xen VMM is that piece of software. The VMM does not implement much of the other hardware control itself, though. Unlike VMMs on other, earlier systems (and the first release of the Xen VMM), the hardware outside of memory and processors is controlled by the privileged Dom0 domain. Currently, this is basically the same kernel as the unprivileged DomU kernels and, as far as memory handling is concerned, they do not differ. What is important here is that the VMM hands out physical memory to the Dom0 and DomU kernels, which then implement the usual memory handling themselves, as if they were running directly on a processor.

To implement the separation between the domains which is required for complete virtualization, the memory handling in the Dom0 and DomU kernels does not have unrestricted access to physical memory. The VMM does not hand out memory by giving out individual physical pages and letting the guest OSes handle the addressing; this would not provide any protection against faulty or rogue guest domains. Instead, the VMM creates its own page table tree for each guest domain and hands out memory using these data structures. The good thing is that access to the administrative information of the page table tree can be controlled. If the code does not have appropriate privileges, it cannot do anything.

This access control is exploited in the virtualization Xen provides, regardless of whether para- or hardware (aka full) virtualization is used. The guest domains construct their page table trees for each process in a way that is intentionally quite similar for para- and hardware virtualization. Whenever the guest OS modifies its page tables, the VMM is invoked. The VMM then uses the updated information in the guest domain to update its own shadow page tables; these are the page tables actually used by the hardware. Obviously, this process is quite expensive: each modification of the page table tree requires an invocation of the VMM. While changes to memory mappings are not cheap without virtualization, they become even more expensive now.

The additional costs can be really large, considering that the transitions from the guest OS to the VMM and back are themselves already quite expensive. This is why processors are starting to have additional functionality to avoid the creation of shadow page tables. This is good not only because of speed concerns; it also reduces the memory consumed by the VMM. Intel has Extended Page Tables (EPTs) and AMD calls its version Nested Page Tables (NPTs). Basically, both technologies have the page tables of the guest OS produce guest physical addresses from the guest virtual addresses. These addresses must then be further translated, using the per-domain EPT/NPT trees, into actual physical addresses. This allows memory handling at almost the speed of the non-virtualized case, since most VMM entries for memory handling are removed. It also reduces the memory use of the VMM, since only one page table tree per domain (as opposed to per process) has to be maintained.
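Conceptually, the two translation stages compose as in the following toy sketch (hypothetical single-level tables for brevity; a real EPT/NPT walk is two-dimensional, since every guest page table access must itself be translated):

#include <stdint.h>

#define PAGE_SHIFT 12
#define TOY_PAGES  16

/* toy stand-ins for the two trees: the guest's own page table and the
   per-domain EPT/NPT tree maintained by the VMM */
static uint64_t guest_table[TOY_PAGES]; /* guest virtual  -> guest physical */
static uint64_t ept_table[TOY_PAGES];   /* guest physical -> host physical  */

/* with EPT/NPT the hardware composes both translations itself and caches
   the combined guest-virtual-to-host-physical result in the TLB */
static uint64_t translate_in_guest(uint64_t gva)
{
    uint64_t off = gva & ((1u << PAGE_SHIFT) - 1);
    uint64_t gpa = guest_table[gva >> PAGE_SHIFT] | off;  /* stage 1 */
    return ept_table[gpa >> PAGE_SHIFT] | off;            /* stage 2 */
}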

The results of the additional address translation steps are also stored in the TLB. That means the TLB does not store the guest physical address but, instead, the complete result of the lookup. It was already explained that AMD's Pacifica extension introduced the ASID to avoid TLB flushes on each entry. The number of ASID bits is one in the initial release of the processor extensions; this is just enough to distinguish the VMM from a guest OS. Intel has virtual processor IDs (VPIDs) which serve the same purpose, only there are more of them. But the VPID is fixed for each guest domain, so it cannot be used to mark separate processes, nor to avoid TLB flushes at that level.

The amount of work needed for each address space modification is one problem with virtualized OSes. There is another problem inherent in VMM-based virtualization, though: there is no way around having two layers of memory handling. But memory handling is hard (especially when taking complications like NUMA into account; see section 5). The Xen approach of using a separate VMM makes optimal (or even good) handling difficult, since all the complications of a memory management implementation, including "trivial" things like discovery of memory regions, must be duplicated in the VMM. The OSes have fully fledged and optimized implementations; one really wants to avoid duplicating them.

Figure 4.5: KVM Virtualization Model

This is why carrying the VMM/Dom0 model to its conclusion is such an attractive alternative. Figure 4.5 shows how the KVM Linux kernel extensions try to solve the problem. There is no separate VMM running directly on the hardware and controlling all the guests; instead, a normal Linux kernel takes over this functionality. This means the complete and sophisticated memory handling functionality of the Linux kernel is used to manage the memory of the system. Guest domains run alongside normal user-level processes in what their creators call "guest mode". The virtualization functionality is controlled by the KVM VMM: just another user-level process which happens to control a guest domain using the special KVM device the kernel implements.

The benefit of this model over the separate VMM of the Xen model is that, even though there are still two memory handlers at work when guest OSes are used, only one implementation is needed: the one in the Linux kernel. It is not necessary to duplicate the same functionality in another piece of code, as with the Xen VMM. This leads to less work, fewer bugs, and perhaps less friction where the two memory handlers touch, since the memory handler of a Linux guest makes the same assumptions as the memory handler of the outer Linux kernel running on the bare hardware.

Overall, programmers must be aware that, when virtualization is used, the cost of memory operations is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments. Over time, processor designers will reduce the difference more and more through technologies like EPT and NPT, but it will never go away completely.
