KVM Performance Optimization Solution



KVM performance optimization concentrates on four areas: CPU, memory, disk, and network. The right tuning also depends on the scenario; different workloads call for different optimization directions. The details of each of the four areas are discussed below.

CPU

Before getting into CPU tuning it is necessary to understand NUMA. The following two articles are recommended reading first:

CPU topology

Playing with CPU Topology

A script to view the CPU topology:

#!/bin/bash
# Simple print CPU topology
# Author: kodango

function get_nr_processor()
{
    grep '^processor' /proc/cpuinfo | wc -l
}

function get_nr_socket()
{
    grep 'physical id' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}' | wc -l
}

function get_nr_siblings()
{
    grep 'siblings' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}'
}

function get_nr_cores_of_socket()
{
    grep 'cpu cores' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}'
}

echo '===== CPU Topology Table ====='
echo

echo '+--------------+---------+-----------+'
echo '| Processor ID | Core ID | Socket ID |'
echo '+--------------+---------+-----------+'

while read line; do
    if [ -z "$line" ]; then
        printf '| %-12s | %-7s | %-9s |\n' $p_id $c_id $s_id
        echo '+--------------+---------+-----------+'
        continue
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi
done < /proc/cpuinfo

echo

awk -F: '{
    if ($1 ~ /processor/) {
        gsub(/ /, "", $2);
        p_id=$2;
    } else if ($1 ~ /physical id/) {
        gsub(/ /, "", $2);
        s_id=$2;
        arr[s_id]=arr[s_id] " " p_id
    }
}
END {
    for (i in arr)
        printf "Socket %s:%s\n", i, arr[i];
}' /proc/cpuinfo

echo
echo '===== CPU Info Summary ====='
echo

nr_processor=`get_nr_processor`
echo "Logical processors: $nr_processor"

nr_socket=`get_nr_socket`
echo "Physical sockets: $nr_socket"

nr_siblings=`get_nr_siblings`
echo "Siblings in one socket: $nr_siblings"

nr_cores=`get_nr_cores_of_socket`
echo "Cores in one socket: $nr_cores"

let nr_cores*=nr_socket
echo "Cores in total: $nr_cores"

if [ "$nr_cores" = "$nr_processor" ]; then
    echo "Hyper-Threading: off"
else
    echo "Hyper-Threading: on"
fi

echo
echo '===== END ====='

Those two articles should make clear how node, socket, core, and logical processor relate to one another, and how memory, L3 cache, L2 cache, and L1 cache relate to the CPU.

For KVM, the usual optimization is to pin the VM's vCPUs to a single node, so that they share the L3 cache and preferentially use that node's memory. The binding can be done dynamically from the Processor pinning tab in virt-manager, and it takes effect immediately.
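As a sketch of the same pinning from the command line, virsh can bind each vCPU to a host CPU; the domain name vm1 and the CPU numbers are assumptions for this example:

# Pin the guest's two vCPUs to host CPUs 10 and 12 (same NUMA node in the test below)
virsh vcpupin vm1 0 10
virsh vcpupin vm1 1 12
# Show the current pinning
virsh vcpupin vm1

The equivalent persistent form in the domain XML (virsh edit vm1) is a <cputune> block:

<cputune>
  <vcpupin vcpu='0' cpuset='10'/>
  <vcpupin vcpu='1' cpuset='12'/>
</cputune>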

Since SPEC CPU2005 was not available for download, I wrote a CPU- and memory-intensive program to verify the effect of CPU binding. The program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define BUF_SIZE (512*1024*1024)
#define MAX      (512*1024)
#define COUNT    (16*1024*1024)

char *buf_1 = NULL;
char *buf_2 = NULL;

/* Each thread reads 1 KB from random offsets in both buffers, COUNT times. */
void *pth_1(void *data)
{
    char *p1 = NULL;
    char *p2 = NULL;
    int value1 = 0;
    int value2 = 0;
    int value_total = 0;
    int i = 0;
    int j = 0;

    for (i = 0; i <= COUNT; i++) {
        /* use MAX (not MAX + 1) so the 1 KB read stays inside the buffer */
        value1 = rand() % MAX;
        value2 = rand() % MAX;
        p1 = buf_1 + value1*1024;
        p2 = buf_2 + value2*1024;
        for (j = 0; j < 1024; j++) {
            value_total += p1[j];
            value_total += p2[j];
        }
    }
    return NULL;
}

void *pth_2(void *data)
{
    char *p1 = NULL;
    char *p2 = NULL;
    int value1 = 0;
    int value2 = 0;
    int value_total = 0;
    int i = 0;
    int j = 0;

    for (i = 0; i <= COUNT; i++) {
        value1 = rand() % MAX;
        value2 = rand() % MAX;
        p1 = buf_1 + value1*1024;
        p2 = buf_2 + value2*1024;
        for (j = 0; j < 1024; j++) {
            value_total += p1[j];
            value_total += p2[j];
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t th_a, th_b;
    void *retval;

    buf_1 = (char *)calloc(1, BUF_SIZE);
    buf_2 = (char *)calloc(1, BUF_SIZE);
    memset(buf_1, 0, BUF_SIZE);
    memset(buf_2, 0, BUF_SIZE);

    pthread_create(&th_a, NULL, pth_1, NULL);
    pthread_create(&th_b, NULL, pth_2, NULL);

    pthread_join(th_a, &retval);
    pthread_join(th_b, &retval);
    return 0;
}

On my test machine, even-numbered CPUs are on node 0 and odd-numbered CPUs are on node 1. The VM has 2 vCPUs and the program runs 2 threads. The VM was bound first to CPUs 8 and 9 (different nodes) and then to CPUs 10 and 12 (same node), and the program was timed with time ./test. The results are as follows:
CPUs 8,9 (different nodes):  real 1m53.999s  user 3m34.377s  sys 0m3.020s
CPUs 10,12 (same node):      real 1m25.706s  user 2m49.497s  sys 0m0.699s
As you can see, binding to the same node takes less time than binding across nodes. During testing I also noticed that when CPUs 8, 9, 10, and 11 were all available, the system mostly chose 8 and 10, or 9 and 11, so presumably KVM's CPU binding has already been optimized to keep a VM's vCPUs on the same node whenever possible.

Note that pinning CPUs through virt-manager only binds the CPUs, so the vCPUs share the L3 cache; it does not restrict memory allocation to a particular node, so the VM may still use memory across nodes.
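One way to close that gap, sketched here under the assumption that the VM should live entirely on node 0, is libvirt's numatune element in the domain XML, which restricts the guest's memory allocation to the given node(s):

<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>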


Memory

The optimizations here are EPT, transparent huge pages, memory defragmentation, and KSM, introduced one by one below.

EPT

Memory access requires translating logical (virtual) addresses to physical addresses. The translation is done through page tables and is accelerated by the CPU's MMU hardware, so it is very fast.

After a VM is introduced, however, the path becomes VM virtual address -> VM physical address -> host physical address. The VM still has to translate its virtual addresses to its physical addresses, but a VM physical address is really just a logical address on the host, so a second translation to the host physical address is needed. With two address translations per access, this process is very inefficient.

Fortunately, Intel provides EPT, which collapses the two address translations into one. EPT is enabled in the BIOS together with the VT feature.
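A quick way to confirm this on an Intel host, as a sketch that assumes the kvm_intel module is loaded:

# The CPU flag should be present
grep -o ept /proc/cpuinfo | sort -u
# kvm_intel reports whether EPT is actually in use ("Y" means enabled)
cat /sys/module/kvm_intel/parameters/ept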

Transparent huge pages

When translating logical to physical addresses, the CPU maintains a translation lookaside buffer (TLB) that caches translation results. The TLB is very small, so with small pages it fills up quickly and cache misses are frequent; with larger pages, fewer TLB entries are needed and misses are reduced.

Enable transparent huge pages: echo always > /sys/kernel/mm/transparent_hugepage/enabled

Enable memory defragmentation: echo always > /sys/kernel/mm/transparent_hugepage/defrag
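To verify the effect, a sketch using the standard sysfs/procfs locations:

# The active setting is shown in brackets, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled
# How much anonymous memory is currently backed by huge pages
grep AnonHugePages /proc/meminfo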

KSM

The article http://blog.chinaunix.net/uid-20794164-id-3601786.html covers KSM in detail. In short, pages with identical content can be merged to save memory. The merging has a performance cost, so it depends on the scenario: if host memory utilization matters more than VM performance, consider enabling it; otherwise leave it off. The services are ksm and ksmtuned under /etc/init.d/.
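A minimal sketch of turning KSM on via the init scripts mentioned above and watching what it merges (the sysfs counters are standard):

service ksm start
service ksmtuned start
# pages_shared: shared pages in use; pages_sharing: how many sites reference them
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing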


Disk

Disk optimizations include virtio-blk, the cache mode, AIO, and the block device I/O scheduler.

Virtio

Virtio is a paravirtualized I/O framework. CPU and memory are fully virtualized by KVM, but for disk and network there are paravirtualized devices whose purpose is to standardize the data exchange interface between guest and host, shorten the exchange path, and reduce memory copies, improving the VM's I/O efficiency. It is configured in the libvirt XML by adding <target dev='vda' bus='virtio'/> to the disk element.

Cache mode

When a VM writes to disk, three caches sit in the path: the guest filesystem page cache, the block driver writeback cache, and the host filesystem page cache. Host-side settings cannot change the guest page cache, but they do control the latter ones. There are five cache modes (writethrough, writeback, none, directsync, and unsafe). When the host page cache is used with synchronous writes, cached data is flushed to disk in real time, which is safer and does not lose data.

The first and last of these are extreme and rarely used. Comparing the middle three:


Writeback performs very poorly for mail-server-style workloads with many small files and high I/O, while none is slightly better than writethrough in most cases, so none is the mode to choose.

It is enabled in the libvirt XML by adding <driver name='qemu' type='qcow2' cache='none'/> to the disk element.

AIO

Asynchronous I/O comes in two flavors: native AIO (kernel AIO) and threaded AIO (user-space AIO emulated by POSIX thread workers). The kernel mode performs slightly better than the user-space mode, so native is generally chosen: <driver name='qemu' type='qcow2' cache='none' aio='native'/>
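Putting the disk settings together, a sketch of a complete <disk> element (the image path and target name are assumptions):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' aio='native'/>
  <source file='/var/lib/libvirt/images/vm1.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>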

Block Device Scheduler

CFQ: per-process I/O queues; better fairness, lower aggregate throughput.

Deadline: per-device I/O queue; good latency and better aggregate throughput, but less fair, so a VM can easily be starved.

I have not benchmarked this myself, but judging from NetEase's and other cloud vendors' setups, CFQ is the usual choice.

To set it: echo cfq > /sys/block/sdb/queue/scheduler
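To check which scheduler a disk is currently using (the active one is shown in brackets); the device name sdb is an assumption:

cat /sys/block/sdb/queue/scheduler
# typical output: noop [deadline] cfq
# To make the choice the default for all disks at boot on older kernels with the
# legacy block layer, append elevator=cfq to the kernel command line.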


Network

The optimizations include virtio, vhost, macvtap, VEPA, and SR-IOV NICs. The following articles cover them well:

https://www.redhat.com/summit/2011/presentations/summit/decoding_the_code/wednesday/wagner_w_420_kvm_performance_improvements_and_optimizations.pdf

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Virtualization_Tuning_and_Optimization_Guide/index.html#chap-Virtualization_Tuning_Optimization_Guide-Networking

Http://www.openstack.cn/p2118.html

http://www.ibm.com/developerworks/cn/linux/1312_xiawc_linuxvirtnet/

http://xiaoli110.blog.51cto.com/1724/1558984

Virtio: change the virtual NIC type from a fully emulated NIC (e1000, rtl8139) to virtio. Virtio requires support in QEMU and a virtio driver in the VM's kernel.

vhost_net: moves the virtio backend processing from user space (QEMU) into the kernel, reducing context switches and CPU utilization.
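A sketch combining the two items above, assuming a host bridge named br0: make sure the vhost_net module is present, then request the virtio model with the in-kernel vhost backend in the interface definition:

modprobe vhost_net
lsmod | grep vhost

<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <driver name='vhost'/>
</interface>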

Macvtap: replaces the traditional tap + bridge setup. It has four modes: bridge, VEPA, private, and passthrough (a configuration sketch follows the list below).
1. Bridge: provides functionality similar to a bridge device; sub-devices belonging to the same parent device can exchange and forward data among themselves. The current Linux implementation has a flaw: in this mode a macvtap sub-device cannot communicate with the Linux host, i.e. the virtual machine cannot talk to the host, whereas with a traditional bridge device this works simply by assigning an IP to the bridge. The limitation can be removed by using VEPA mode. This bridge mode of macvtap is the equivalent of the traditional tap + bridge model.
2. VEPA: a software implementation of the VEPA mechanism from the 802.1Qbg standard. In this mode the macvtap device simply forwards data to the parent device, which performs the aggregation; it usually requires the external switch to support hairpin mode to work properly.
3. Private: similar to VEPA mode, except that the macvtap sub-devices are isolated from one another.
4. Passthrough: can be used directly with an SR-IOV NIC. The kernel's macvlan data-processing logic is skipped and the hardware decides how data is processed, freeing host CPU resources. Macvtap passthrough differs from PCI passthrough: PCI passthrough applies to any PCI device, not necessarily a network device, and lets the guest OS use the host's PCI hardware directly for efficiency; macvtap passthrough applies only to macvtap network devices and bypasses the kernel's macvtap software processing in favor of hardware processing. In short, an SR-IOV network device can be used in either of two modes: macvtap passthrough or PCI passthrough.
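A sketch of a macvtap interface in the domain XML; the physical NIC name eth1 is an assumption, and mode can be any of the four values above (bridge, vepa, private, passthrough):

<interface type='direct'>
  <source dev='eth1' mode='bridge'/>
  <model type='virtio'/>
</interface>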

PCI passthrough: the device is assigned directly to the guest and used exclusively by it.

SR-IOV: the advantage is that the virtual NIC's work is offloaded from the host CPU to the physical NIC, reducing host CPU utilization; the disadvantage is that it requires support from the NIC, the motherboard, and the hypervisor.
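A sketch of bringing up virtual functions on an SR-IOV capable NIC; the interface name eth0 and the VF count are assumptions, and the NIC driver must support the sriov_numvfs interface:

# Create 4 virtual functions
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# The VFs show up as extra PCI devices and as links under the physical function
lspci | grep -i "virtual function"
ip link show eth0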

In summary, the relationship between these options comes down to three levels of network virtualization:

1. Using only Linux software (bridge, VLAN, and macvtap devices), users can build, at zero cost, a customized virtual network resembling a real one;
2. At very low cost, the VEPA model of 802.1Qbg can be followed to build an upgraded virtual network that offloads VM network traffic to the external switch, reducing the load on the host server;
3. With an SR-IOV capable NIC, passthrough technology reduces the host load even further.


Summary: this article described performance optimization options for CPU, memory, disk, and network, most of which are achieved by adjusting KVM parameters and system kernel parameters.


