With the widespread deployment of multi-core systems, clusters, grids, and even cloud computing in recent years, virtualization technology has become increasingly advantageous in business applications: it not only reduces IT costs but also enhances system security and reliability, and the concept of virtualization has gradually become part of people's daily work and life. Focusing on the x86 platform, this article first presents the basic concepts and classification of virtualization technology, then explains the implementation principles and challenges of pure software virtualization, and finally introduces Intel VT hardware-assisted virtualization technology in detail.
Introduction to Virtualization Technology
What is Virtualization
Virtualization technology first appeared on IBM mainframe systems in the 1960s and became popular with the System/370 series in the 1970s. On these systems, a program called the virtual machine monitor (VMM) creates many virtual machine instances on top of the physical hardware, each of which can run its own operating system software independently.
Virtualization is a broad term that can mean different things to different people, depending on their environment. In computer science, virtualization means the abstraction of computing resources, not just the concept of virtual machines. For example, abstracting physical memory gives rise to virtual memory, which lets an application believe that it owns a contiguous address space. In reality, the application's code and data may be split into many fragmented pages or segments, or even swapped out to external storage such as disk or flash memory, so the application can run smoothly even when there is not enough physical memory.
Classification of virtualization technologies
Virtualization technology is divided into the following major categories:
Platform virtualization: virtualization of computers and operating systems.
Resource virtualization: virtualization of specific system resources, such as memory, storage, and network resources.
Application virtualization: includes simulation, emulation, and interpretation technologies.
What we usually call virtualization is platform virtualization, which hides the actual physical characteristics of a particular computing platform behind a control program, known as a virtual machine monitor (VMM) or hypervisor, and provides users with an abstract, unified, simulated computing environment called a virtual machine. The operating system running in a virtual machine is called the guest operating system (guest OS), and the operating system on which the hypervisor runs is called the host operating system, although some virtual machine monitors run directly on the hardware (for example, VMware's ESX products). The real system that runs the virtual machines is called the host system.
Platform virtualization technology can be subdivided into the following subcategories:
Full virtualization
In full virtualization, the virtual machine simulates the complete underlying hardware, including processors, physical memory, clocks, and peripherals, so that operating systems and other system software designed for the original hardware can run in the virtual machine without any modification. An operating system interacts with real hardware through a predefined hardware interface. A full-virtualization VMM presents every such interface by simulating the hardware completely, which means it must also simulate the execution of privileged instructions. For example, on the x86 architecture the hardware exposes the privileged CR3 register as the interface for switching process page tables, and the operating system only needs to execute the assembly instruction "mov pgtable, %cr3". A full-virtualization VMM must simulate the entire execution of this interface. If the hardware provides no special support for virtualization, the simulation is complex. In general, the VMM must run at the highest privilege level to control the host system fully, while the guest OS runs deprivileged and cannot perform privileged operations. When the guest OS executes the privileged instruction above, the host processor raises a general protection exception and control transfers from the guest OS to the VMM. The VMM has allocated a variable in advance to serve as the guest OS's shadow CR3 register. It stores the guest physical address of pgtable in the shadow CR3 register, translates pgtable into a host physical address, loads that address into the physical CR3 register, and finally returns to the guest OS. Afterwards the VMM also has to handle the guest OS's page faults, which is a complex task.
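The CR3 example can be sketched as a toy simulation. This is only an illustration of the steps described above, not real VMM code: the names `ToyVMM`, `guest_to_host`, and the two trap handlers are hypothetical, and physical addresses are modelled as plain integers.

```python
class ToyVMM:
    """Toy model of how a full-virtualization VMM emulates guest access to CR3."""

    def __init__(self, guest_to_host):
        # Hypothetical map: guest physical address -> host physical address.
        self.guest_to_host = guest_to_host
        self.shadow_cr3 = None    # value the guest believes is in CR3
        self.physical_cr3 = None  # value actually loaded into the hardware register

    def on_cr3_write_trap(self, guest_pgtable_addr):
        """Runs after the guest's privileged 'mov to CR3' raised an exception."""
        # 1. Record the guest physical address in the shadow CR3.
        self.shadow_cr3 = guest_pgtable_addr
        # 2. Translate it to a host physical address and load the real CR3.
        self.physical_cr3 = self.guest_to_host[guest_pgtable_addr]
        # 3. Control would now return to the guest OS.

    def on_cr3_read_trap(self):
        # The guest always sees the value it wrote, never the host address.
        return self.shadow_cr3


vmm = ToyVMM({0x1000: 0x7F000})
vmm.on_cr3_write_trap(0x1000)
print(hex(vmm.on_cr3_read_trap()), hex(vmm.physical_cr3))
```

The point of keeping two values is that the guest and the hardware each see the address that is meaningful in their own address space.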
Well-known full-virtualization VMMs include Microsoft Virtual PC, VMware Workstation, Sun VirtualBox, Parallels Desktop for Mac, and QEMU.
Paravirtualization
Paravirtualization modifies the parts of the guest OS code that access privileged state so that they interact with the VMM directly. In a paravirtualized virtual machine, some hardware interfaces are offered to the guest operating system in software, in the form of hypercalls (direct calls that the VMM provides to the guest OS, similar to system calls). For example, the guest OS code that switches page tables is modified to invoke a hypercall, which directly updates the shadow CR3 register and translates the address. Because there is no extra exception to raise and no hardware execution to simulate, paravirtualization can improve performance dramatically. Well-known paravirtualizing VMMs include Denali and Xen.
Hardware-assisted virtualization
Hardware-assisted virtualization uses hardware support (primarily in the host processor) to achieve efficient full virtualization. For example, with Intel VT the execution environments of the guest OS and the VMM are completely isolated, and the guest OS has its own full set of registers and can run directly at the highest privilege level. In the example above, the guest OS can therefore execute the assembly instruction that switches page tables. Intel VT and AMD-V are the two hardware-assisted virtualization technologies currently available on the x86 architecture.
Partial virtualization
The VMM simulates only part of the underlying hardware, so the guest operating system cannot run in the virtual machine unmodified, and other programs may need modification as well. Historically, partial virtualization was an important milestone on the way to full virtualization; it first appeared in the first-generation time-sharing system CTSS and in the IBM M44/44X experimental paging system.
Operating system-level virtualization
In a traditional operating system, all users' processes essentially run in a single operating system instance, so a defect in the kernel or in an application may affect other processes. Operating system-level virtualization is a lightweight virtualization technology used in server operating systems: the kernel isolates processes by creating multiple virtual operating system instances (kernels and libraries), and processes in different instances are completely unaware of each other's existence. Well-known examples are Solaris Containers [2], FreeBSD Jail, and OpenVZ.
This classification is not absolute, and good virtualization software often combines several techniques. For example, VMware Workstation is a well-known full-virtualization VMM, but it uses a technique called dynamic binary translation to convert accesses to privileged state into operations on shadow state, thereby avoiding inefficient trap-and-emulate processing. This resembles paravirtualization, except that paravirtualization modifies the program code statically. Paravirtualization, in turn, can exploit hardware features where they are available, which greatly simplifies virtual machine management while maintaining high performance.
The virtualization technology discussed in this article targets the x86 platform only (including AMD64) and assumes that the guest OS running in the virtual machine is also designed for the x86 platform.
Principles and challenges of pure software virtualization
Conditions for virtual machine monitors
In 1974, Popek and Goldberg presented a set of sufficient conditions, also known as virtualization criteria, in their paper Formal Requirements for Virtualizable Third Generation Architectures [3]. A control program that meets these conditions can be called a virtual machine monitor (VMM):
Resource control. The control program must be able to manage all system resources.
Equivalence. Programs running under the control program (including operating systems) must behave exactly as they would without it, apart from timing and resource availability, and predefined privileged instructions can be executed freely.
Efficiency. The vast majority of guest instructions should be executed directly by the host hardware, without intervention by the control program.
Although based on simplified assumptions, the above conditions still provide a convenient way to judge whether a computer architecture can effectively support virtualization, as well as guidelines for designing a virtualized computer architecture.
Principle Introduction
Because the traditional x86 architecture lacks the necessary hardware support, no virtual machine monitor can meet the above conditions directly, so x86 is not a virtualizable architecture. A virtual machine monitor can nevertheless be built purely in software.
A virtual machine is an abstraction and simulation of a real computing environment, and the VMM allocates a set of data structures for each virtual machine to manage its state, including the full register set of the virtual processor, its use of physical memory, and the state of its virtual devices. When the VMM schedules a virtual machine, it restores part of that state to the host system. Not all state is restored: for example, the host CR3 register holds the physical address of a page table set up by the VMM, not the value set by the guest OS. The host processor then executes the guest OS's machine instructions directly. Because the guest OS runs at a low privilege level, any access to privileged state of the host system (such as writing the GDT register) fails the permission check, the host processor raises an exception, and control automatically returns to the VMM. The arrival of an external interrupt also causes the VMM to run. The VMM may need to write the virtual machine's current state back into its state data structure, work out why the virtual machine was suspended, and then perform the appropriate privileged operation on behalf of the guest OS. In the simplest case, such as the guest OS modifying its CR3 register, only the virtual machine's state data structure needs updating. In most cases, however, the VMM must go through a complex procedure to complete what is nominally a simple operation. Finally, the VMM hands control back to the guest OS, which either continues from where it was interrupted or handles a virtual interrupt or exception that the VMM has injected. This classic mode of operation is called trap-and-emulate. The virtual machine is completely transparent to the guest OS, which needs no modification, but the VMM's design is complex and overall system performance suffers noticeably.
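The trap-and-emulate cycle can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not how a real VMM is written: instructions are modelled as hypothetical (opcode, operand) pairs, `VirtualCPU` holds only a tiny subset of the per-VM state, and the "trap" is an ordinary function call.

```python
from dataclasses import dataclass, field

PRIVILEGED = {"write_cr3", "write_gdtr", "cli", "sti"}  # illustrative subset


@dataclass
class VirtualCPU:
    """Per-VM state block kept by the VMM (a small, hypothetical subset)."""
    regs: dict = field(default_factory=lambda: {"cr3": 0, "gdtr": 0, "if": 1, "eax": 0})
    traps: int = 0


def emulate(vcpu, opcode, operand):
    """VMM side: apply the instruction's effect to the virtual state only."""
    if opcode == "write_cr3":
        vcpu.regs["cr3"] = operand
    elif opcode == "write_gdtr":
        vcpu.regs["gdtr"] = operand
    elif opcode == "cli":
        vcpu.regs["if"] = 0
    elif opcode == "sti":
        vcpu.regs["if"] = 1


def run_guest(vcpu, program):
    """Run the guest program; privileged instructions trap to the VMM."""
    for opcode, operand in program:
        if opcode in PRIVILEGED:
            vcpu.traps += 1             # exception: control passes to the VMM
            emulate(vcpu, opcode, operand)
        elif opcode == "mov_eax":
            vcpu.regs["eax"] = operand  # unprivileged: executed directly
    return vcpu


vcpu = run_guest(VirtualCPU(), [("mov_eax", 7), ("cli", None),
                                ("write_cr3", 0x1000), ("sti", None)])
print(vcpu.regs, vcpu.traps)
```

Counting the traps makes the performance concern visible: three of the four instructions in this short program need a round trip through the VMM.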
Challenges
The following challenges must be addressed when designing a pure software VMM:
Ensuring that the VMM controls all system resources.
The x86 processor has four privilege levels, ring 0 to ring 3. Privileged resources can be accessed and privileged instructions executed only at rings 0 to 2, and all privileged state is accessible only at ring 0. Operating systems on the x86 platform typically use only rings 0 and 3: the operating system runs at ring 0 and user processes run at ring 3. To satisfy the first sufficient condition, resource control, the VMM itself must run at ring 0, and to keep the guest OS from controlling system resources, the guest OS has to be demoted to ring 1 or ring 3 (ring 2 is not used).
Ring compression.
The VMM protects physical memory using paging or segment limits, but segment limits have no effect in 64-bit mode, and paging does not distinguish between rings 0, 1, and 2. To unify and simplify the VMM design, the guest OS can therefore run only at ring 3, the same level as guest processes. The VMM must monitor how the guest OS sets privileged resources such as the GDT and IDT, prevent the guest OS from running at ring 0, and protect the demoted guest OS from deliberate or accidental damage by guest processes.
Ring aliasing.
Ring aliasing means that the guest OS runs at a privilege level other than the one it expects. The VMM must make sure that the guest OS cannot detect that it is running in a virtual machine, otherwise the equivalence condition may be violated. For example, the current privilege level of an x86 processor is stored in the CS code segment register, and the guest OS can use the unprivileged push instruction to push CS onto the stack, pop it off, and inspect the value. Likewise, reading the privileged registers GDTR, LDTR, IDTR, and TR at a low privilege level raises no exception, so the guest OS may find that their values are not what it expects. To address this challenge, the VMM can use dynamic binary translation, for example replacing a "push %cs" instruction with one that pushes the shadow CS register value onto the stack, or replacing the GDT register read "sgdt dest" with "movl fake_gdt, dest".
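The two rewrites mentioned above can be sketched as a toy translator. This is an illustration only: instructions are handled as text in a simplified pseudo-assembly rather than as machine code, and `translate_block`, `shadow_cs`, and `fake_gdt_addr` are hypothetical names. As in the text, the GDT read is reduced to a single move, although a real SGDT stores both a limit and a base.

```python
def translate_block(instructions, shadow_cs, fake_gdt_addr):
    """Rewrite sensitive but unprivileged instructions before the guest runs them."""
    translated = []
    for insn in instructions:
        op, _, arg = insn.partition(" ")
        if insn == "push %cs":
            # Push the CS value the guest expects, not the real (demoted) one.
            translated.append(f"push ${shadow_cs:#x}")
        elif op == "sgdt":
            # Return the guest's own view of the GDT register instead of the host value.
            translated.append(f"movl ${fake_gdt_addr:#x}, {arg}")
        else:
            translated.append(insn)  # harmless instruction: copy unchanged
    return translated


block = ["push %cs", "sgdt dest", "addl $1, %eax"]
print(translate_block(block, shadow_cs=0x08, fake_gdt_addr=0x5000))
```

Only the sensitive instructions are rewritten; everything else is copied through, which is why most guest code still runs at native speed.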
Address space compression.
Address space compression means that the VMM must reserve part of the guest OS's address space for its own use. For example, the interrupt descriptor table register (IDTR) holds the linear address of the interrupt descriptor table. If an external interrupt arrives or a processor exception is raised while the guest OS is running, control must pass to the VMM immediately, so the VMM has to map part of the guest OS's linear address space to the host physical address of its own interrupt descriptor table. The VMM can run entirely within the guest OS's address space, or it can have an address space of its own, in which case it occupies only a small part of the guest OS's address space to hold important privileged state such as the interrupt descriptor table and the global descriptor table (GDT). Either way, the VMM must prevent the guest OS from directly reading or modifying this part of the address space.
Handling page faults in the guest OS.
Memory is a critical system resource that the VMM must manage completely. What the guest OS regards as a physical address is only a guest physical address, not the final host physical address. When a page fault occurs in the guest OS, the VMM needs to determine its cause: the guest process may be accessing an address it has no permission for, the guest linear address may not yet have been mapped to a guest physical address, or the guest physical address may not yet have been mapped to a host physical address. One workable solution is for the VMM to build a shadow page table for each process page table of the guest OS. The shadow page table maintains the mapping from guest linear addresses to host physical addresses, and the host CR3 register holds the physical address of the shadow page table. The VMM also maintains, for each guest OS, a global table mapping guest physical addresses to host physical addresses. The faulting address is always a guest linear address. The VMM first looks in the guest OS's page table to find the cause. If the page table entry is present, meaning that the corresponding guest physical address exists, then the mapping to a host physical address has not yet been established, so the VMM allocates a page of physical memory and updates the shadow page table and the mapping table. Otherwise, the VMM returns to the guest OS and lets it handle the exception itself.
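The page fault handling just described can be sketched as follows. This is a minimal model under stated assumptions: addresses are page numbers, page tables are flat dictionaries rather than multi-level structures, permission checks are omitted, and `ShadowPager` and its fields are hypothetical names.

```python
class ShadowPager:
    """Toy shadow paging for one guest address space; all addresses are page numbers."""

    def __init__(self, guest_page_table, free_host_pages):
        self.guest_page_table = guest_page_table  # guest linear -> guest physical (guest-owned)
        self.p2m = {}                             # guest physical -> host physical (per guest OS)
        self.shadow = {}                          # guest linear -> host physical (used by host CR3)
        self.free_host_pages = list(free_host_pages)

    def handle_page_fault(self, guest_linear):
        """Return 'fixed' if the VMM resolved the fault, 'inject' if the guest must handle it."""
        guest_physical = self.guest_page_table.get(guest_linear)
        if guest_physical is None:
            # The guest itself has no mapping: forward the fault to the guest OS.
            return "inject"
        if guest_physical not in self.p2m:
            # No host memory backs this guest physical page yet: allocate one.
            self.p2m[guest_physical] = self.free_host_pages.pop()
        # Update the shadow page table so the access succeeds when retried.
        self.shadow[guest_linear] = self.p2m[guest_physical]
        return "fixed"


pager = ShadowPager({0x10: 0x2}, free_host_pages=[100, 101])
print(pager.handle_page_fault(0x10), pager.handle_page_fault(0x99))
```

The guest never sees the shadow table or the host page numbers; it only notices a fault when the VMM decides the fault belongs to it.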
Handling system calls in the guest OS.
System calls are service routines that the operating system provides to users, and they are used very frequently. Recent operating systems typically use the SYSENTER/SYSEXIT instruction pair to implement fast system calls. SYSENTER enters ring 0 directly using three model-specific registers (MSRs), IA32_SYSENTER_CS, IA32_SYSENTER_EIP, and IA32_SYSENTER_ESP, while SYSEXIT raises an exception if it is executed at any level other than ring 0. If the VMM can handle these two instructions only by trap-and-emulate, overall performance suffers greatly.
Forwarding virtual interrupts and exceptions.
All external interrupts and host processor exceptions are taken directly by the VMM, which constructs the required virtual interrupts and exceptions and forwards them to the guest OS. The VMM has to simulate everything the hardware and the operating system do when handling interrupts and exceptions: for example, it pushes some information onto the guest OS's current kernel stack, looks up the address of the guest OS's handler routine, and jumps to it. The VMM must therefore understand the internal workings of each different guest OS, which makes it harder to implement. In addition, the guest OS may mask and unmask interrupts frequently. Both operations access the privileged EFLAGS register and must be emulated by the VMM, which again hurts performance. When the guest OS re-enables interrupts, the VMM must learn of it promptly and forward the virtual interrupts that have accumulated.
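The queueing behaviour described above can be sketched as a small model. It is an assumption-laden illustration: `VirtualInterruptController` is a hypothetical name, "delivery" is reduced to appending the vector to a list, and details such as the interrupt flag being cleared while a handler runs are left out.

```python
from collections import deque


class VirtualInterruptController:
    """Toy model: the VMM queues interrupts while the guest's virtual IF flag is clear."""

    def __init__(self):
        self.virtual_if = True  # guest's view of EFLAGS.IF, maintained by the VMM
        self.pending = deque()  # vectors waiting to be forwarded
        self.delivered = []     # vectors handed to the guest's handlers, in order

    def external_interrupt(self, vector):
        self.pending.append(vector)
        self._deliver()

    def guest_cli(self):        # emulated masking of interrupts
        self.virtual_if = False

    def guest_sti(self):        # emulated unmasking: forward whatever accumulated
        self.virtual_if = True
        self._deliver()

    def _deliver(self):
        while self.virtual_if and self.pending:
            self.delivered.append(self.pending.popleft())


vic = VirtualInterruptController()
vic.guest_cli()
vic.external_interrupt(0x20)
vic.external_interrupt(0x21)
vic.guest_sti()
print([hex(v) for v in vic.delivered])
```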
Frequent access to privileged resources by the guest OS.
Every access to a privileged resource by the guest OS raises a processor exception that the VMM then emulates, so overall system performance can suffer greatly if such accesses are too frequent. Take masking and unmasking interrupts: the CLI (clear interrupt flag) instruction takes 60 clock cycles on a Pentium 4 processor. As another example, each processor's local Advanced Programmable Interrupt Controller (APIC) has a task-priority register (TPR) that the operating system can modify. The I/O APIC forwards an external interrupt to the processor with the lowest TPR value, on the expectation that this processor is running a low-priority thread, which optimizes interrupt handling. The TPR is a privileged register, and some operating systems set it frequently (the Linux kernel, by contrast, sets the TPR of every processor to the same value once, during initialization).
The challenges facing a software VMM arise essentially because the guest OS does not run at the highest privilege level that it expects. The traditional trap-and-emulate approach handles these challenges largely transparently, but at the cost of great design complexity and degraded performance. More advanced virtualization software today combines binary translation with paravirtualization. The core idea is to change, dynamically or statically, how the guest OS accesses privileged state, so as to minimize unnecessary hardware exceptions and simplify the VMM design.
Intel VT hardware-assisted virtualization technology
In the winter of 2005, Intel introduced Intel VT, the industry's first hardware-assisted virtualization technology for the desktop, together with supporting processor products, opening a new era of virtualization on the IA architecture. Processors that support virtualization technology have specially optimized instruction sets that automate parts of the virtualization process, greatly simplifying the design of the VMM and greatly improving its performance. The virtualization technology for IA-32 processors is called VT-x, and that for Itanium processors is called VT-i. AMD has also launched its own virtualization solution, known as AMD-V. Although Intel VT and AMD-V are not identical, their basic ideas and data structures are similar, so this article discusses only Intel VT-x.
Two new modes of operation
VT-x adds two modes of operation to IA-32 processors: VMX root operation and VMX non-root operation. The VMM runs in VMX root operation, and the guest OS runs in VMX non-root operation. Both modes support all four privilege levels, ring 0 to ring 3, so the VMM and the guest OS are each free to run at the level they expect.
The processor can switch between the two modes. A VMM running in VMX root operation switches to VMX non-root operation by explicitly executing the VMLAUNCH or VMRESUME instruction; the hardware automatically loads the guest OS's context and the guest OS starts running. This transition is called VM entry. When the guest OS encounters an event that the VMM must handle, such as an external interrupt or a page fault, or when it deliberately executes the VMCALL instruction to request a VMM service (similar to a system call), the hardware automatically suspends the guest OS, switches to VMX root operation, and resumes the VMM. This transition is called VM exit. Software in VMX root operation behaves essentially as it would on a processor without VT-x. VMX non-root operation is very different, the main difference being that certain instructions and certain events cause a VM exit.
Virtual machine control structure
The VMM and the guest OS share the underlying processor resources, so the hardware needs a region of physical memory in which to save and restore their execution contexts automatically. This region is called the virtual machine control structure (VMCS). It comprises the guest-state area, the host-state area, and the execution control area. On VM entry, the hardware automatically loads the guest OS's context from the guest-state area. The VMM's context does not need to be saved, because the VMM resembles an interrupt handler: once it starts running it is never disturbed by the guest OS, and it switches back to the guest OS only when it has completely finished its work. The next time the VMM runs, it is necessarily to handle a new event, so after every VM exit the VMM starts from a common event handler. On VM exit, the hardware automatically saves the guest OS's context in the guest-state area and loads from the host-state area the address of the VMM's common event handler, where the VMM begins executing. The execution control area holds flags that govern VM entry and VM exit, such as which events cause a VM exit and which interrupt should be automatically injected into the guest OS on VM entry.
Both the guest-state area and the host-state area hold the contents of certain physical registers: the control registers CR0, CR3, and CR4; ESP and EIP (RSP and RIP if the processor supports the 64-bit extensions); the segment registers CS, SS, DS, ES, FS, and GS and their descriptors; the TR, GDTR, and IDTR registers; and the MSRs IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP, and IA32_PERF_GLOBAL_CTRL. The guest-state area does not include the general-purpose registers; the VMM decides whether to save them on VM exit, which improves system performance. The guest-state area also holds state that does not correspond to any physical register, such as a 32-bit activity state value indicating the processor's activity state while the guest OS executes: active when it is executing instructions normally, shutdown after a triple fault or another serious error, and so on.
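The asymmetry between VM entry and VM exit can be sketched with a toy model. This is not the real VMCS layout: `ToyVMCS`, the dictionary-based CPU, and the three registers saved on exit are hypothetical simplifications chosen to show that only the guest context is saved and that general-purpose registers are left to the VMM.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVMCS:
    guest_state: dict = field(default_factory=dict)  # saved on VM exit, loaded on VM entry
    host_state: dict = field(default_factory=dict)   # loaded on VM exit (fixed VMM entry point)
    exit_reason: str = ""                            # exit information written on VM exit


def vm_entry(cpu, vmcs):
    """Switch to the guest: load the guest context; the VMM context is not saved."""
    cpu.update(vmcs.guest_state)
    cpu["mode"] = "non-root"


def vm_exit(cpu, vmcs, reason):
    """Switch to the VMM: save the guest context, then load the fixed host context."""
    for reg in ("rip", "rsp", "cr3"):
        vmcs.guest_state[reg] = cpu[reg]
    vmcs.exit_reason = reason
    cpu.update(vmcs.host_state)  # RIP now points at the VMM's common event handler
    cpu["mode"] = "root"


vmcs = ToyVMCS(guest_state={"rip": 0x1000, "rsp": 0x8000, "cr3": 0x3000},
               host_state={"rip": 0xFFFF0000, "rsp": 0xFFFF8000, "cr3": 0x1})
cpu = {"rax": 5}
vm_entry(cpu, vmcs)
cpu["rip"] = 0x1234                       # the guest makes progress
vm_exit(cpu, vmcs, "external interrupt")
print(hex(vmcs.guest_state["rip"]), hex(cpu["rip"]), cpu["mode"])
```

Because the host-state area never changes between exits, every exit lands at the same handler, which then reads the exit reason to decide what to do.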
As mentioned earlier, the execution control area is used to store the flags that can manipulate VM entry and VM exit, including:
External-interrupt exiting: controls whether an external interrupt causes a VM exit, regardless of whether the guest OS has masked interrupts.
Interrupt-window exiting: if set, a VM exit occurs when the guest OS unmasks interrupts.
Use TPR shadow: accesses to the task priority register (TPR) through CR8 use the shadow TPR in the VMCS, which avoids a VM exit. The execution control area also holds a TPR threshold, and a VM exit occurs only if the TPR value set by the guest OS is below the threshold.
CR masks and shadows: each control register has a corresponding mask whose bits determine whether the guest OS can write the corresponding bit directly or causes a VM exit. The VMCS also contains shadow control registers, and when the guest OS reads a control register the hardware returns the value of the shadow control register.
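The control register mask and shadow mechanism can be sketched at the bit level. The per-bit rules used here (masked bits are read from the shadow, and a write that changes a masked bit away from its shadow value causes a VM exit) are my reading of the scheme and should be checked against the Intel manual; the function names and the choice of CR0 bits in the example are illustrative.

```python
def guest_read_cr(real_cr, mask, read_shadow):
    """Value a guest read returns: shadow bits where the mask is set, real bits elsewhere."""
    return (read_shadow & mask) | (real_cr & ~mask)


def guest_write_cr(real_cr, mask, read_shadow, new_value):
    """Return (new_real_cr, vm_exit) for a guest write of new_value."""
    if (new_value ^ read_shadow) & mask:
        return real_cr, True  # VM exit: the VMM decides what to do
    # Guest-owned bits are written directly; masked bits keep their real value.
    return (real_cr & mask) | (new_value & ~mask), False


CR0_PG, CR0_TS = 1 << 31, 1 << 3
mask = CR0_PG             # the VMM owns the paging bit
real = CR0_PG | CR0_TS    # paging is really on
shadow = 0                # but the guest believes paging is off

print(hex(guest_read_cr(real, mask, shadow)))
print(guest_write_cr(real, mask, shadow, 0))       # clears TS, no exit
print(guest_write_cr(real, mask, shadow, CR0_PG))  # touches PG, exits
```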
The VMCS also includes a set of bitmaps for finer-grained control:
Exception bitmap: selects which exceptions cause a VM exit.
I/O bitmap: selects which 16-bit I/O ports cause a VM exit when accessed.
MSR bitmaps: similar to the control register masks; each MSR has a "read" bitmap mask and a "write" bitmap mask.
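An I/O bitmap of this kind can be sketched as one bit per port. The helper names are hypothetical, and a single 8 KB array is used for simplicity; treating an access that runs past port 0xFFFF as an exit is an assumption of this sketch.

```python
def make_io_bitmap():
    """One bit per 16-bit I/O port: 65536 bits = 8192 bytes, all clear (no exits)."""
    return bytearray(65536 // 8)


def intercept_port(bitmap, port):
    """Mark a port so that guest accesses to it cause a VM exit."""
    bitmap[port >> 3] |= 1 << (port & 7)


def access_causes_vm_exit(bitmap, port, size=1):
    """An access of `size` bytes exits if any port it touches has its bit set."""
    if port + size > 65536:
        return True  # access wraps past the last port: treated as an exit here
    return any(bitmap[p >> 3] & (1 << (p & 7)) for p in range(port, port + size))


bitmap = make_io_bitmap()
intercept_port(bitmap, 0x3F8)
print(access_causes_vm_exit(bitmap, 0x3F8), access_causes_vm_exit(bitmap, 0x60))
```

The same bit-per-resource layout is what makes the check cheap enough to perform on every I/O instruction.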
On every VM exit, the hardware automatically records detailed information in the VMCS so that the VMM can easily identify the type and cause of the event. On VM entry, the VMM can easily inject events (interrupts and exceptions) into the guest OS: because the VMCS contains the address of the guest OS's interrupt descriptor table (IDT), the hardware can invoke the guest OS's handler automatically.
For more information, see the Intel development manual.
Solving the challenges of pure software virtualization
First, thanks to the new operating modes, the hardware automatically isolates the execution of the VMM from that of the guest OS, and any critical event automatically transfers control of the system to the VMM, so the VMM can fully control all system resources.
Second, the guest OS can run at the highest privilege level, as it expects, so the problems of ring compression and ring aliasing are solved, and system calls in the guest OS do not cause VM exits.
The hardware accesses the virtual machine control structure (VMCS) by physical address, and the VMCS holds separate IDTR and CR3 values for the VMM and the guest OS, so the VMM can have its own address space and the guest OS has full control of its own. The problem of address space compression therefore disappears.
Interrupt and exception virtualization is also handled well. The VMM only needs to specify the virtual interrupt or exception to be forwarded, and on VM entry the hardware automatically invokes the guest OS's interrupt or exception handler, which greatly simplifies the VMM design. The guest OS can also mask and unmask interrupts without causing a VM exit, which improves performance. In addition, the VMM can request a VM exit when the guest OS unmasks interrupts, so that accumulated virtual interrupts and exceptions are forwarded promptly.
Future development of virtualization technology
Hardware-assisted virtualization is clearly the direction of the future. Intel VT is still at an early stage as a processor-level virtualization technology and needs to develop in the following areas:
Improving the transition speed between operating modes.
Transitions between the two modes of operation occur so frequently that, unless their cost is reduced effectively, overall virtual machine performance suffers even when the hardware features are fully exploited. On early Pentium 4 processors supporting hardware-assisted virtualization, a VM entry took 2,409 clock cycles and a VM exit triggered by a page fault took 508 clock cycles. As Intel's technology has advanced, these figures have fallen to 937 and 446 clock cycles respectively on the newer Core architecture. Hardware manufacturers need to speed up mode transitions further and provide more hardware features that avoid unnecessary transitions.
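To put the quoted figures side by side, the cost of one full round trip (a VM exit followed by a VM entry) can be added up for each processor. The helper name is hypothetical, and the calculation uses only the cycle counts given above.

```python
def round_trip_cycles(vm_entry, vm_exit):
    """Cost in clock cycles of one VM exit followed by one VM entry."""
    return vm_entry + vm_exit


pentium4 = round_trip_cycles(2409, 508)  # figures quoted in the text
core = round_trip_cycles(937, 446)
print(pentium4, core, round(pentium4 / core, 2))  # 2917 1383 2.11
```

On these numbers the round trip became roughly twice as fast, with almost all of the gain coming from VM entry.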
Optimizing translation lookaside buffer (TLB) performance.
On every VM entry and VM exit the CR3 register must be reloaded, so the TLB (translation lookaside buffer) is flushed completely. Mode transitions are very frequent in a virtualized system, so overall system performance suffers noticeably. One possible solution is to assign the VMM and each virtual machine a globally unique ID and to tag every TLB entry with that ID, so that the ID takes part in looking up the translation of a linear address.
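The tagging idea can be sketched with a dictionary keyed by both the owner ID and the linear page. This is a conceptual model only: `TaggedTLB` and its methods are hypothetical names, and capacity limits and eviction are ignored.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (owner_id, linear_page), not linear_page alone."""

    def __init__(self):
        self.entries = {}
        self.current_id = 0  # hypothetical ID of the VMM or virtual machine now running

    def switch_to(self, owner_id):
        # No flush: entries of other owners stay cached but can no longer match.
        self.current_id = owner_id

    def insert(self, linear_page, physical_page):
        self.entries[(self.current_id, linear_page)] = physical_page

    def lookup(self, linear_page):
        """Return the cached translation, or None on a TLB miss."""
        return self.entries.get((self.current_id, linear_page))


tlb = TaggedTLB()
tlb.switch_to(1)          # a virtual machine runs and caches a translation
tlb.insert(0x10, 0xA0)
tlb.switch_to(0)          # VM exit: the VMM runs, nothing is flushed
print(tlb.lookup(0x10))   # the VMM cannot hit the guest's entry
tlb.switch_to(1)          # VM entry: the guest's entry is still there
print(hex(tlb.lookup(0x10)))
```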
Providing hardware support for memory management unit (MMU) virtualization.
Even with Intel VT, the VMM still has to use the old methods to handle page faults in the guest OS and to translate guest physical addresses to host physical addresses. The underlying reason is that the VMM fully controls the host's physical memory, so translating a linear address in the guest OS involves the address spaces of both the VMM and the guest OS, while the hardware sees only one of them. Intel and AMD have each proposed a solution, called EPT (Extended Page Table) and Nested Paging respectively. The basic idea of both is that whenever the hardware encounters a guest physical address, it automatically walks a page table that the VMM provides for that guest OS and either translates the address into a host physical address or raises an exception that causes a VM exit.
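The two-stage lookup behind EPT and Nested Paging can be sketched as follows. This is a conceptual model, not the hardware algorithm: both tables are flat dictionaries over page numbers, the extra translations needed while walking the guest's own page table are omitted, and the exception class names are hypothetical.

```python
class GuestPageFault(Exception):
    """The guest's own page table has no mapping; the guest OS can handle this itself."""


class EPTViolation(Exception):
    """The VMM-provided table has no mapping; real hardware would cause a VM exit."""


def translate(guest_linear, guest_page_table, ept):
    """Two-stage lookup: guest linear -> guest physical -> host physical."""
    guest_physical = guest_page_table.get(guest_linear)
    if guest_physical is None:
        raise GuestPageFault(guest_linear)
    host_physical = ept.get(guest_physical)
    if host_physical is None:
        raise EPTViolation(guest_physical)
    return host_physical


print(hex(translate(0x10, {0x10: 0x2}, {0x2: 0x77})))
```

Compared with shadow page tables, the VMM no longer has to intercept the guest's page table updates; it maintains only the second table.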
Supporting efficient I/O virtualization.
I/O virtualization has to balance many factors, including performance, availability, scalability, reliability, and cost. The simplest approach is for the VMM to emulate a common I/O device for the virtual machine, implementing the device's functions in software or by multiplexing the host's I/O devices. Virtual PC, for example, presents an older S3 Trio64 graphics card. This approach improves compatibility and makes use of the device drivers that ship with the guest OS, but the range of virtual I/O devices is limited and their performance is poor. To improve performance, the VMM can assign host I/O devices directly to virtual machines, which raises two main challenges: 1. If several virtual machines share the same device, the VMM must ensure that their accesses to the device do not interfere with one another. 2. If the guest OS accesses the I/O device using DMA, the address it supplies is not a host physical address, so the VMM must ensure that the address is translated correctly before the DMA operation starts. Intel and AMD have each proposed a solution, known as Directed I/O (VT-d) and IOMMU respectively, which aim to solve these problems in hardware and make the VMM easier to implement.
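The DMA translation problem can be sketched with a toy remapping unit. This is a simplified illustration of the idea, not of VT-d or any real IOMMU interface: `ToyIOMMU`, the string device IDs, and the page-number tables are all hypothetical.

```python
class ToyIOMMU:
    """Toy DMA remapping: each device is bound to one VM's guest-to-host page mapping."""

    def __init__(self):
        self.device_tables = {}  # device id -> {guest physical page: host physical page}

    def assign(self, device_id, p2m):
        """Bind a device to the mapping of the virtual machine that owns it."""
        self.device_tables[device_id] = p2m

    def dma_translate(self, device_id, guest_physical_page):
        """Translate a DMA target; block devices or pages that were never mapped."""
        table = self.device_tables.get(device_id)
        if table is None or guest_physical_page not in table:
            raise PermissionError(
                f"DMA blocked: device {device_id}, page {guest_physical_page:#x}")
        return table[guest_physical_page]


iommu = ToyIOMMU()
iommu.assign("nic0", {0x2: 0x77})
print(hex(iommu.dma_translate("nic0", 0x2)))
```

Because the lookup is per device, a device assigned to one virtual machine cannot reach memory that belongs to another, which addresses the isolation challenge as well as the translation one.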
Conclusion
Focusing on the x86 platform, this article has introduced the basics of virtualization technology, in the hope that it will benefit readers in their work and study.