"Editor's note" This share is carried out in the following 4 areas:
- Introduction to the mainstream technology of virtualization
- Introduction to the Forefront technology of virtualization
- Docker Technology Introduction
- Mixsan Technology Introduction
Hello, I am Xu An, a virtualization veteran. I first came into contact with cloud computing and virtualization in 2010 at 21Vianet (CloudEx), which should make me one of the earlier people in this field in China. I currently work at Hanbai Technology Co., Ltd., where I am responsible for the technical work on server virtualization and desktop virtualization products.
I will share on four topics today: mainstream virtualization technology, cutting-edge technology, Docker technology, and Mixsan technology. Due to the limits of my own level and knowledge, there will inevitably be places I have understood incorrectly; I welcome the experts' criticism and corrections.
First let's look at what mainstream virtualization technology covers. It is essentially CPU virtualization, memory virtualization, NIC virtualization, and disk virtualization.
KVM is currently the most mainstream virtualization technology and has been integrated into the major Linux distributions since Linux 2.6.20. KVM divides execution into four modes: guest (virtual machine) user mode, guest kernel mode, host Linux user mode, and host Linux kernel mode. The guest's user and kernel modes correspond to those of an operating system before virtualization and need no further explanation. QEMU-KVM is a host Linux user-space program; each QEMU-KVM process represents one virtual machine. QEMU-KVM is primarily responsible for emulating hardware for the virtual machine, and it controls the KVM kernel module via ioctl calls.
A virtual machine is a process, and each vCPU is a thread created by the main thread. The advantage of this design is that there is no need to design a separate scheduling algorithm for vCPUs; ordinary Linux thread scheduling handles it. So a vCPU thread strives to keep running: in the figure you can see two while loops. The inner while loop asks the KVM kernel module to run the vCPU context; whenever the KVM kernel module stops running the vCPU, QEMU-KVM inspects the exit reason and performs the next action accordingly. For example, if the guest needs to read or write a disk, QEMU-KVM opens the file backing that disk and performs the read or write operation on its behalf.
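To make the two loops concrete, here is a conceptual sketch in Python (the real code is C inside QEMU; `kvm_run` below is a stand-in for `ioctl(vcpu_fd, KVM_RUN)`, the fd number is illustrative, and the exit-reason names are the real KVM API constants):

```python
import random

def kvm_run(vcpu_fd):
    """Stand-in for ioctl(vcpu_fd, KVM_RUN): returns only when the KVM
    kernel module stops running the vCPU and user space must act."""
    return random.choice(
        ["KVM_EXIT_INTR", "KVM_EXIT_IO", "KVM_EXIT_MMIO", "KVM_EXIT_SHUTDOWN"])

def handle_io():   pass   # e.g. open the disk image and read()/write() it
def handle_mmio(): pass   # emulate the device-register access

def vcpu_thread(vcpu_fd):
    while True:                            # outer loop: the vCPU's lifetime
        reason = kvm_run(vcpu_fd)
        while reason == "KVM_EXIT_INTR":   # inner loop: just re-enter the guest
            reason = kvm_run(vcpu_fd)
        if reason == "KVM_EXIT_IO":
            handle_io()
        elif reason == "KVM_EXIT_MMIO":
            handle_mmio()
        elif reason == "KVM_EXIT_SHUTDOWN":
            break                          # guest powered off; thread exits

vcpu_thread(vcpu_fd=3)
```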
The full path from a virtual machine's virtual address to the actual physical memory address is: a GVA (guest virtual address) is translated through the guest page table into a GPA (guest physical address), which is mapped through a data structure (KVM's memory slots) into an HVA (host virtual address), which is finally translated through the host page table into an HPA (host physical address).
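As a toy model of this four-step path (the page size and all addresses below are made up for illustration):

```python
PAGE = 4096

guest_page_table = {0x1000: 0x8000}                    # GVA page -> GPA page
memory_slots = [(0x0, 0x10000, 0x7f0000000000)]        # (gpa_start, size, hva_start)
host_page_table = {0x7f0000008000: 0x3a000}            # HVA page -> HPA page

def gva_to_hpa(gva):
    off = gva & (PAGE - 1)
    gpa = guest_page_table[gva & ~(PAGE - 1)] | off    # step 1: guest page table
    for gpa_start, size, hva_start in memory_slots:    # step 2: slot lookup
        if gpa_start <= gpa < gpa_start + size:
            hva = hva_start + (gpa - gpa_start)
            break
    return host_page_table[hva & ~(PAGE - 1)] | off    # step 3: host page table

print(hex(gva_to_hpa(0x1234)))                         # -> 0x3a234
```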
As you can see from the previous page, this translation chain is too slow and needs to be accelerated. There are two ways to speed it up: 1) shadow page tables; 2) the EPT feature. Which one does KVM use? If the CPU supports EPT, KVM uses EPT.
EPT uses hardware to translate GPA to HPA automatically: alongside the guest's own page table (GVA to GPA), the CPU walks a second, hardware-supported table, the EPT (GPA to HPA), so the guest can use its own notion of physical addresses directly while the hardware completes the real mapping.
The shadow page table is the software approach: it "cheats" the VM by directly building a page table from GVA to HPA. The VM still goes through its normal page-table maintenance, but when it tries to load its page table for the MMU, KVM takes over and substitutes a table whose entries have been replaced with real HPA addresses.
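A toy sketch of the shadow idea, reusing the numbers above: KVM collapses the two lookups into one table, mapping GVA pages straight to HPA pages, and that is what the real MMU uses:

```python
def build_shadow(guest_page_table, gpa_to_hpa):
    # KVM intercepts guest page-table updates and maintains this in parallel
    return {gva: gpa_to_hpa(gpa) for gva, gpa in guest_page_table.items()}

shadow = build_shadow({0x1000: 0x8000}, lambda gpa: gpa + 0x32000)
print(hex(shadow[0x1000]))   # -> 0x3a000: one page walk instead of two
```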
On the left is ordinary NIC virtualization. First, QEMU-KVM emulates an "E1000" physical NIC; the guest OS interacts with this E1000 through emulated interrupts and DMA operations. Inside QEMU-KVM, the E1000 emulation uses internal calls to hand these requests to the "tap agent", which is really just a tap device held open on the host OS, so sending and receiving packets turns into read and write operations on the tap device. The tap device is attached to a bridge, and from the bridge packets are sent and received via the host's TCP/IP protocol stack or a physical NIC.
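What the tap agent does can be shown in a few lines. This Linux-only sketch (run as root; the constants are the standard Linux TUN/TAP values, and `tap0` is an arbitrary name) opens a tap device, after which the VM's packet traffic is just read()/write() on a file descriptor:

```python
import fcntl, os, struct

TUNSETIFF = 0x400454ca   # ioctl that attaches the fd to a tap interface
IFF_TAP, IFF_NO_PI = 0x0002, 0x1000

fd = os.open("/dev/net/tun", os.O_RDWR)
fcntl.ioctl(fd, TUNSETIFF, struct.pack("16sH", b"tap0", IFF_TAP | IFF_NO_PI))

frame = os.read(fd, 2048)   # blocks: a frame from the bridge, destined for the VM
os.write(fd, frame)         # inject a frame, as if the VM had transmitted it
```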
On the right is the virtio NIC, which differs from the ordinary NIC in that the emulated DMA operations are replaced by shared memory.
This is the specific function-call flow; the point to emphasize is that tap operations are processed through the QEMU-KVM process's poll loop. On the left is the packet-receive path; on the right is the packet-send path.
This is the call flow for the virtio-net send and receive paths.
Virtio-blk is similar to NIC virtualization: first a hardware device is emulated, and the guest's reads and writes to that device are converted into reads and writes issued by a disk agent. The disk agent's backend may be a file, a block device, a network device, or a pure simulation.
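A toy "disk agent" backed by a plain file (the path is illustrative) shows the conversion: guest block requests in sectors become pread/pwrite on the backend:

```python
import os

SECTOR = 512

class DiskAgent:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

    def read(self, sector, count):
        return os.pread(self.fd, count * SECTOR, sector * SECTOR)

    def write(self, sector, data):
        return os.pwrite(self.fd, data, sector * SECTOR)

agent = DiskAgent("/tmp/disk.img")
agent.write(0, b"\x00" * SECTOR)
print(len(agent.read(0, 1)))     # -> 512
```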
Those who are interested can discuss this with me privately; I am skipping it here due to time constraints.
This part is a bit complicated to explain. In short, you can set a disk cache policy for each virtual machine: write-back, none, or write-through. Under the hood, QEMU-KVM simply opens the backing file in different ways. Write-back means data is written to the host page cache; none bypasses the page cache and writes to the device buffer (the disk's own cache); write-through bypasses the device buffer too and writes straight through to the device.
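My rough summary (a sketch of the idea, not the QEMU source itself) of how these three policies map onto open(2) flags on Linux:

```python
import os

def open_disk(path, cache="writeback"):
    flags = os.O_RDWR
    if cache == "none":
        flags |= os.O_DIRECT      # bypass the host page cache; device buffer used
    elif cache == "writethrough":
        flags |= os.O_DSYNC       # every write flushes through to the device
    # "writeback": no extra flag; writes land in the host page cache
    return os.open(path, flags)
```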
The "cutting-edge technology" label is just my personal take and is not necessarily correct; criticism is welcome.
The so-called FT (fault-tolerance) technology creates two VMs, one primary and one standby. At fixed intervals the primary VM is paused, a checkpoint is taken, and the VM is resumed; before the next checkpoint, the differential data (CPU state and memory) between the last two checkpoints is sent to the standby VM, which applies these differential updates to stay synchronized.
The difficulty is keeping the interval between checkpoints as short as possible: if memory changes heavily between checkpoints, the data stream to transfer becomes huge, putting considerable pressure on the network bandwidth between primary and standby.
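A toy sketch of the checkpoint rounds (dirty-page diffing only; a real FT system also ships CPU and device state and handles failover):

```python
class VM:
    def __init__(self, memory):
        self.memory = dict(memory)
    def pause(self):  pass   # stop vCPUs so the snapshot is consistent
    def resume(self): pass

def checkpoint(primary, standby, last):
    primary.pause()
    dirty = {a: v for a, v in primary.memory.items() if last.get(a) != v}
    primary.resume()                 # primary runs on while the diff is shipped
    standby.memory.update(dirty)     # standby applies only the differences
    last.update(dirty)

primary, standby, last = VM({0: "a", 1: "b"}), VM({}), {}
checkpoint(primary, standby, last)   # first round transfers everything
primary.memory[1] = "B"              # guest dirties one page
checkpoint(primary, standby, last)   # second round transfers only page 1
print(standby.memory)                # -> {0: 'a', 1: 'B'}
```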
The three-level buffer works like this, taking a Windows system as the example: the master image of the C drive sits in memory, the child disks linked-cloned from it sit on SSD, and the D drive goes on a mechanical hard disk. With this arrangement VM boot is fast: in our tests, 60 desktops could all be booted within 90 seconds.
The read-ahead mechanism means: when the virtual machine asks to read 1 KB of data, the host automatically reads 1 MB into memory on its behalf.
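A toy read-ahead cache along those lines (sizes taken from the text; everything else is illustrative):

```python
import os

READAHEAD = 1 << 20                  # read 1 MB at a time

class ReadAheadFile:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.buf, self.base = b"", 0

    def read(self, offset, size):    # the guest asks for e.g. 1 KB
        if not (self.base <= offset and
                offset + size <= self.base + len(self.buf)):
            self.base = offset
            self.buf = os.pread(self.fd, READAHEAD, offset)  # fetch 1 MB
        return self.buf[offset - self.base : offset - self.base + size]
```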
If VM boot is still too slow, there is another way. Look at what a virtual machine boot essentially is: the CPU comes up and reads data from disk into memory. When starting many identical virtual machines, you can skip that process and copy the memory and CPU state of virtual machine A directly to virtual machine B; in theory B's boot time then equals the time of a memory copy. It can be accelerated further: if the virtual machines are basically identical, some memory data never changes at all, so the memory copy can at first be just a link, and only when a piece of memory changes is it copied on write from the parent memory.
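A toy copy-on-write clone capturing that last idea: the child starts as a link to the parent's pages and only materializes its own copy of a page on the first write:

```python
class CowMemory:
    def __init__(self, parent_pages):
        self.parent = parent_pages    # shared, treated as read-only
        self.own = {}                 # pages this clone has written

    def read(self, page):
        return self.own.get(page, self.parent[page])

    def write(self, page, data):
        self.own[page] = data         # copy-on-write: parent stays untouched

parent = {0: b"boot code", 1: b"kernel"}
clone = CowMemory(parent)
clone.write(1, b"patched")
print(clone.read(0), clone.read(1), parent[1])  # shared, private, unchanged
```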
NIC performance acceleration: the first direction is vhost-net. Packet forwarding can bypass QEMU-KVM and go directly to the vhost device, which forwards to the tap device inside the kernel and then out through the host kernel bridge. To put it bluntly, when creating the VM, the NIC device file given to QEMU-KVM is a vhost device (e.g. `vhost=on` on the `-netdev` option). DPDK can accelerate the forwarding efficiency of the OvS switching module; we have not studied it in depth, so I only mention it in passing.
There are relatively few vGPU solutions for KVM; due to time constraints I likewise only mention them in passing.
As for Docker, we have only just begun to study it, so treat this as study notes. If anything is wrong, please correct me.
On the left is a common comparison chart, which clearly shows that Docker omits the guest OS kernel layer, so it is denser and starts faster. But users do not drive Docker or virtual machines directly with commands; they typically go through a management system, so the correct comparison is between those systems. On the right is the comparison I have drawn up across many dimensions. You can see that Docker holds a fairly large advantage in boot speed, density, and update management; in other respects, such as Windows support, stability, security, monitoring maturity, high availability, and management-platform maturity, it is still behind virtualization technology. Let me again stress that this is my personal understanding and not necessarily correct; criticism is welcome.
As you can see, this is an era of contention for supremacy (a "Battle of Zhuolu"), with four main players: OpenStack, Mesos, Kubernetes, and Docker the company (and its community). What they are contending for is, of course, the cloud era: the market share of open-source platforms and control over the cloud era.
I have always believed that cloud consumers do not care whether you, the service provider, use virtualization technology or Docker, still less whether you wrote your platform yourself or adapted it from an open-source one. What they care about is whether your service is reliable, stable, inexpensive, and secure. So choosing the technology that best suits your team's characteristics and using it to offer competitive services to cloud consumers is the core we can stand on in the future.
I believe a converged Docker-plus-virtualization cloud platform should include three layers: a resource management layer, a virtualization layer, and a service layer. Docker and virtualization can share most of the modules, which in my view is the benefit of building a converged platform. Resource management includes at least compute resource management, storage resource management, network management, and security management. The virtualization layer must include a virtualization engine (typically KVM) and a container engine (typically Docker). The service layer includes at least high availability, orchestration management, disaster-recovery backup, service discovery, application release and upgrade, auto-scaling, lifecycle management, user rights and authentication, monitoring and alarms, log audit, load balancing, image management, system maintenance, and other modules.
Kubernetes's advantage is that it was the first Docker cluster-management platform, and the first to propose and implement the concepts of pods, replication, and service discovery. I will not go into the technical details; please consult Baidu or Google, or build a platform and experience it yourself. The same goes for the other solutions below.
Swarm is a Docker cluster-management tool released by Docker in December 2014. Swarm can manage Docker clusters, managing and allocating compute resources, and includes service discovery (etcd, ZooKeeper, Consul), container orchestration, and more. Swarm's advantage is that its API is unified with Docker's own interface.
Mesos aims to be the next-generation data center operating system (DCOS). Its core functions are cluster management, compute-resource management, and task distribution; it was originally used for managing distributed workloads such as Hadoop. Starting with version 0.20, Mesos supports scheduling tasks as Docker containers (the main points being Docker's task isolation, resource constraints, and releasing with images). On top of Mesos, Marathon can serve as Docker orchestration and lifecycle management.
To secure its leading position in cloud computing, OpenStack presented the idea of an "integration engine" at the May 2015 Vancouver Summit. In effect it sits as a layer above Kubernetes, Swarm, and Mesos, managing them through the OpenStack Magnum interface. OpenStack's advantage lies in the multi-tenancy, orchestration, storage, and networking capabilities it has accumulated in virtualization. I find this scheme somewhat at odds with itself: OpenStack will not be willing to remain a mere "integrator". I boldly guess that, relying on its strong internal capability, OpenStack Magnum will learn from and digest the strengths of Kubernetes, Swarm, and Mesos, and eventually replace them with another "Nova", one that operates and manages Docker.
Comparing the four options comprehensively, scored as 0 (lacks the function), 1 (has it, but imperfectly), or 2 (has it, and does it fairly well), the result is that Kubernetes wins by a small margin. Again, this reflects my personal knowledge and is not necessarily correct.
About storage
- Run Docker's rootfs and volumes in the Ceph cluster. It is said that the Docker community is about to implement a Ceph RBD graph driver.
- Run the Ceph modules Docker-style, co-located with the Docker nodes, following the hyper-converged concept.
About the network
- DHCP is implemented jointly with virtualization, and the actual IP assignment is set with the `--ip` parameter available from Docker 1.9 onward.
- Use OvS so that Docker containers can interconnect with VMs.
Service discovery enables an application or component to discover information about its operating environment and about other applications or components. When a service starts, it registers its own information; for example, a MySQL database service registers the IP and port it runs on. The load-balancing strategy can be based on Nginx or DNS: when an instance's load is found to be too high, Nginx or DNS directs requests to another Docker host.
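A toy registry with least-loaded selection (in a real deployment etcd, ZooKeeper, or Consul would play the registry role; all names and addresses here are made up):

```python
registry = {}   # service name -> list of {"ip", "port", "load"}

def register(name, ip, port):
    registry.setdefault(name, []).append({"ip": ip, "port": port, "load": 0.0})

def discover(name):
    return min(registry[name], key=lambda inst: inst["load"])  # least loaded

register("mysql", "10.0.0.5", 3306)
register("mysql", "10.0.0.6", 3306)
registry["mysql"][0]["load"] = 0.9   # the first instance reports high load
print(discover("mysql"))             # -> the instance on 10.0.0.6
```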
- Static migration & Docker HA: Ceph shared storage + IP reassignment = Docker migration made possible.
- Dynamic migration: hot migration with CRIU (freeze the process, save it to storage, then read it back and recover); see the sketch below.
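A rough sketch of driving CRIU from Python (the `criu dump`/`criu restore` subcommands and the `-t`/`-D`/`--shell-job` flags are from CRIU's documented CLI; the PID and paths are illustrative):

```python
import subprocess

def dump(pid, image_dir):
    # freeze the process tree and save its state into image_dir
    subprocess.run(["criu", "dump", "-t", str(pid),
                    "-D", image_dir, "--shell-job"], check=True)

def restore(image_dir):
    # read the images back and resume the process where it left off
    subprocess.run(["criu", "restore", "-D", image_dir,
                    "--shell-job"], check=True)

# dump(12345, "/mnt/ceph/checkpoint")   # save onto shared storage
# restore("/mnt/ceph/checkpoint")       # recover, possibly on another host
```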
A "set" runs one business service composed of multiple Docker containers. It supports grayscale upgrades, and if monitoring shows the load is too high, the number of Docker containers in the set can be increased.
The Mixsan discussed here is Ceph.
Ceph is a distributed storage system with a unified design aimed at excellent performance, reliability, and scalability (providing object storage, block storage, and file-system storage in one), and it is highly reliable, highly automated, and highly scalable.
The logical structure of RADOS is shown in the diagram on the right:
- The system's cluster map is produced jointly by the OSDs and the monitors exchanging node-state information.
- A client program obtains the cluster map by interacting with an OSD or a monitor, computes an object's storage location locally, and then communicates directly with the corresponding OSD to carry out all operations on the data.
The relationship between pools, RBD volumes, PGs, and OSDs: an RBD volume is created in a pool and divided into many objects of 4 MB each; each object maps to a PG, and each PG is distributed across several OSDs for storage, usually stored as a file on each OSD.
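A toy model of this mapping chain (the 4 MB striping and the hash-based object-to-PG step mirror Ceph's behavior, though Ceph uses its own hash rather than MD5; the PG-to-OSD step here is a stand-in for the real CRUSH algorithm):

```python
import hashlib, random

OBJ_SIZE = 4 * 1024 * 1024            # RBD volumes are striped into 4 MB objects
PG_NUM, REPLICAS, OSDS = 128, 3, list(range(12))

def object_for_offset(volume, offset):
    return f"{volume}.{offset // OBJ_SIZE:016x}"

def pg_for_object(obj):
    return int(hashlib.md5(obj.encode()).hexdigest(), 16) % PG_NUM

def osds_for_pg(pg):                  # CRUSH stand-in: deterministic per PG
    return random.Random(pg).sample(OSDS, REPLICAS)

obj = object_for_offset("volume1", 9 * OBJ_SIZE)
pg = pg_for_object(obj)
print(obj, "-> pg", pg, "-> osds", osds_for_pg(pg))
```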
On the left you can see that one write may be written three times (once per replica), so there is a write-amplification problem.
For reads there are several algorithms: one reads from the primary; one reads from a random one of the three OSDs; another reads from whichever replica is closest to the reader.
In our tests with this setup on a 10-gigabit network, performance was equivalent to gigabit iSCSI storage.
The benefit of this: hot and cold data are separated. Relatively fast/expensive storage devices such as SSDs form a pool that serves as the cache layer, while relatively slow/inexpensive devices at the back end form the cold-data storage pool.
There are two modes for the tiering cache policy:
- Write-back mode
The Ceph client writes data directly to the cache pool and gets an immediate acknowledgement once the write lands there; the tiering agent flushes the data to the base pool in due course. When the client reads data that is not in the cache pool, the tiering agent is responsible for migrating it from the base pool into the cache pool.
- Read-only mode
On writes, the Ceph client writes directly to the base pool. On reads, data is first read from the cache pool; on a miss it is read from the base pool and simultaneously cached in the cache pool.
The cache policy we use is write-back mode; a toy simulation of the two modes follows.
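Here is that simulation (pools as plain dicts; the real tiering agent works with flush and promotion thresholds rather than explicit calls):

```python
class TieredPool:
    def __init__(self, mode):
        self.mode, self.cache, self.base = mode, {}, {}

    def write(self, key, val):
        if self.mode == "write-back":
            self.cache[key] = val     # acked immediately; flushed later
        else:                         # read-only mode: writes bypass the cache
            self.base[key] = val

    def flush(self):                  # tiering agent duty in write-back mode
        self.base.update(self.cache)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        val = self.base[key]          # cache miss: go to the base pool
        self.cache[key] = val         # cache/promote for the next read
        return val
```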
When placing our VMs, we take into account the primary node of the volume's first object in Ceph; by computing this placement we found it to be about 30% faster than leaving it unspecified, mainly because the first object stores the QCOW2 header data.
The diagram is the state-transition diagram of the OSD. At first glance it looks enormously complex, and we have indeed found many pitfalls in it, which we are currently filling in.
Q&A
Q: Is virtual machine migration involved? Is it triggered by load, by disaster recovery, or by business logic?
A: Load-triggered migration includes DRS and a power-saving mode. The basic idea is to migrate VMs to low-load hosts when load is high, while power-saving mode does the opposite: it consolidates the virtual machines and shuts down the hosts that are no longer needed.
Q: What is the purpose of turning on the IP-forwarding kernel parameter on a KVM host? Does it still need to be enabled when using a bridged network?
A: Normally it needs to be turned on if the physical machine has to communicate with the virtual machine network.
Q: Is it really just a matter of time before virtualization is replaced by containers?
A: Virtualization technology arose from the overflow of hardware capacity centered on the CPU: one machine is more than enough for one workload. Its results: 1. It reduces the differences between x86 hardware platforms, so VMs can drift (migrate) freely; 2. Resources are pooled and can be managed uniformly, improving management and utilization efficiency; 3. VMs with different guest OSes can run on the same host at the same time; 4. Software at every level and of every class moves up (into the cloud) at nearly zero cost.
The essence of the container is "a relatively independent, better-managed program runtime environment" on Linux. Its results: 1. It appropriately reduces the differences between Linux operating systems; 2. "Applications" are managed uniformly, improving deployment and operations efficiency; 3. It points toward an ideal world: download any desired application from a registry and boot it quickly and at high density; 4. There is a cost for an application to move up before it has been unified.
Therefore, in the short-term future the more likely outcome is that virtualization and containers each take charge of what they do best; because virtualization is more easily "compatible" with the existing world, its market share may be the larger. In the long-term future, when everything lives in a unified world and anything can just run, virtualization, containers, you and I may all no longer exist.
Q: Are the Docker high availability and migration over Ceph mentioned earlier already implemented, or just an idea?
A: Just an idea; we had planned to do it that way, but other higher-priority work has kept it from being done.
Q: What are the pain points, or difficulties, of desktop virtualization?
A: The pain points on the business side:
The technical difficulties are two:
1) Compatibility work for terminals, especially USB devices, which is pure torture.
2) Analysis, research, and improvement of the remote display protocol.
Q: What are the specific roles of the two Linux kernel modules, kvm and kvm_intel, in KVM virtualization?
A: I have not touched the specific code for a while; kvm and kvm_intel together complete the kernel-side work of KVM.
kvm_intel mainly holds the implementation specific to Intel hardware, such as the CPU context structures; kvm completes the relatively high-level work, such as parsing the ioctl commands issued by QEMU-KVM, memory virtualization, and so on.
The above content is organized from the group share on the night of May 24, 2016. Sharer:
Xu An started in cloud computing in 2010 and is currently the technical leader for Hanbai Technology's virtualization products. DockOne organizes weekly technology shares; interested students are welcome to add WeChat: liyingjiesz to join the group and take part. You can leave us a message with topics you would like to hear about or share.