From: http://blog.aliyun.com/1750
On September 13, the first session of the Aliyun Classroom opened on schedule in Beijing to a large and enthusiastic audience. Three Aliyun lecturers gave excellent talks, and attendees joined actively in the on-site Q&A, getting a great deal out of the exchange. At the request of many users, we are publishing the lecturers' shared content in full for your reference. Aliyun Classroom will continue to open sessions across the country; we welcome your continued support.
The following was shared by lecturer Wu Jiaming (Pulin):
1. SLB Overall Architecture
LVS itself is open source; we have made a number of improvements to it, which we have also open-sourced at https://github.com/alibaba/lvs.
Next, let's look at where LVS sits within SLB; the diagram shows the overall SLB architecture. SLB's functionality is relatively simple: it mainly does load balancing, with two main modules. Layer-4 load balancing is done by LVS and layer-7 by Tengine, both of which are open source; the back ends are ECS instances.
In general, when a service is deployed on two or more ECS VMs, we recommend using SLB for load balancing.
Whether at layer 4 (LVS) or layer 7 (Tengine), our load balancers are deployed as clusters with redundancy, so a single machine going down has no impact on users. SLB in the Hangzhou region also spans multiple IDCs (data centers): the same VIP can live in both IDC1 and IDC2, and if IDC1 goes down, traffic switches to IDC2, giving redundancy across IDCs. For highly reliable services, we recommend deploying your ECS VMs across two availability zones, so that even an IDC outage can be tolerated.
Why does SLB need IDC redundancy when its overall availability target is five nines? It is said that even the best data centers abroad cannot reliably deliver five nines on their own; since SLB runs inside data centers, it must rely on redundancy across data centers to reach five nines.
2. LVS History
LVS, short for Linux Virtual Server, was created by Dr. Zhang Wensong in 1998.
Zhang Wensong is currently a technical director at Aliyun.
3. Main Content of This LVS Talk
This talk covers the following: why LVS was introduced; the problems that appear when it is used in a large-scale network; the improvements we made in response (FULLNAT, SYNPROXY, cluster deployment); LVS performance-optimization techniques, which are useful not only for LVS but also for your own network services; and finally, what we plan to do with LVS next.
4. LVS: Why
For example, when a user visits the Taobao site, suppose the front end has five Apache servers: how do we decide which Apache serves the request? A common approach is DNS load balancing: add the IP addresses of all five Apache servers to the record for www.taobao.com.
But DNS has shortcomings. First: if, say, the second Apache server goes down, operations hurriedly removes its IP from DNS, but many local DNS resolvers do not strictly honor the TTL, so when the removal actually takes effect is out of your control. The problem is even more pronounced on mobile networks; I remember that ten years ago, local DNS in some mobile-network regions was refreshed only once a day.
Second: the scheduling algorithm supports only WRR, so if your users are unevenly distributed, load becomes unbalanced. Third: attack defense is very weak; each machine has to fend off attacks by itself.
To address these shortcomings of DNS, the concept of the virtual server was introduced: a single entry device at the very front that balances traffic across the back-end Apaches. Both software load balancers like LVS and hardware ones like F5 follow this concept.
5. LVS: What
The basic concept of LVS is layer-4 load balancing: it uses the port information of the transport layer in the OSI network model.
LVS supports scheduling algorithms such as WRR and WLC. WRR is weighted round robin; WLC is weighted least connections, which dispatches each request to the back-end server with the fewest active connections relative to its weight.
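As a rough illustration of the two policies, here is a sketch of our own (not the kernel implementation; server names and weights are made up):

```python
import itertools

def wrr(servers):
    """Weighted round robin: servers is a dict name -> weight.

    Expanding each server by its weight and cycling gives the weighted
    rotation (a simplified scheme, not LVS's smooth variant)."""
    expanded = [name for name, w in servers.items() for _ in range(w)]
    return itertools.cycle(expanded)

def wlc_pick(conns, weights):
    """Weighted least connections: pick the server minimizing
    active_connections / weight."""
    return min(weights, key=lambda s: conns[s] / weights[s])

sched = wrr({"rs1": 2, "rs2": 1})
print([next(sched) for _ in range(6)])   # rs1 appears twice as often as rs2

# rs1 has 10 conns at weight 2 (ratio 5); rs2 has 3 at weight 1 (ratio 3)
print(wlc_pick({"rs1": 10, "rs2": 3}, {"rs1": 2, "rs2": 1}))  # rs2
```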
LVS supports three forwarding modes: NAT, DR, and Tunnel; which one fits depends on how your IDC network is deployed.
For transport protocols, it supports TCP and UDP.
The first mode is NAT: both inbound and outbound traffic passes through the LVS device. Inbound, the destination IP is rewritten to the actual back-end server's IP (DNAT); outbound, SNAT is performed. Commercial devices such as F5 generally use NAT mode, because NAT mode permits DDoS defense: attack-defense features depend on traffic passing through the device in both directions.
The second is Tunnel mode: inbound traffic passes through the LVS, but outbound traffic does not. Tunnel mode encapsulates the original packet with an extra IP header; Tencent is said to use this IP-tunnel mode. Its biggest problem is that every packet needs an additional IP header, and if an incoming packet is already at the maximum Ethernet frame length (about 1.5 KB), the extra IP header cannot be added. The common remedy is to return an ICMP "destination unreachable, fragmentation needed" packet caused by the MTU to the client; if the client supports PMTU discovery, it will split the data into smaller packets.
One way to solve this is to enable jumbo frames on the switch. Another is to lower the MSS on the back-end servers: the extra IP header is 20 bytes, so changing the default MSS of 1460 to 1440 is enough. Most clients then behave correctly, but there is always the odd one in ten thousand that does not follow the standard MSS negotiation: no matter how small you set the MSS, the client still sends large packets.
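The numbers above follow directly from standard header sizes; a quick check:

```python
# Arithmetic behind the MSS fix described above (standard header sizes):
# an Ethernet MTU of 1500 bytes minus a 20-byte IP header and a 20-byte
# TCP header leaves the default MSS of 1460; IPIP tunneling adds one
# more 20-byte IP header, so the advertised MSS must drop to 1440.
MTU = 1500
IP_HDR = 20
TCP_HDR = 20

default_mss = MTU - IP_HDR - TCP_HDR
tunnel_mss = default_mss - IP_HDR     # leave room for the extra IPIP header

print(default_mss, tunnel_mss)   # 1460 1440
```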
The third is DR mode, which has the highest performance of the three: it only rewrites the destination MAC address, but it requires the LVS and the back-end servers to be in the same VLAN.
DR is ideal for small networks. For example, Ali's CDN uses DR mode at a scale of dozens of servers, which suits this efficient mode especially well; if your deployment is small, we recommend DR.
6. LVS: Application
We covered the basic characteristics of LVS above. LVS itself is just a kernel module, ip_vs, which does the load balancing; that module alone is not enough for an engineering deployment. For example: what if a real server goes down? What if the LVS itself goes down?
To handle these problems we need auxiliary software to manage LVS; the common choice today is Keepalived. Keepalived supports health checks, both layer-4 and layer-7, which solves the real-server failure problem.
Keepalived also supports the VRRP heartbeat protocol, which enables active/standby redundancy for the LVS itself, removing it as a single point of failure.
Finally, Keepalived manages LVS through a configuration file.
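For illustration, a minimal Keepalived configuration along these lines might look like the sketch below. All addresses, weights, and timeouts are made-up placeholders, not values from the talk:

```conf
# VRRP instance: active/standby failover for the LVS itself
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.0.2.10          # the VIP (placeholder address)
    }
}

# Virtual server: the VIP plus its real servers and health checks
virtual_server 192.0.2.10 80 {
    delay_loop 6            # health-check interval, seconds
    lb_algo wrr             # weighted round robin
    lb_kind NAT             # forwarding mode: NAT / DR / TUN
    protocol TCP

    real_server 10.0.0.11 80 {
        weight 2
        TCP_CHECK {         # layer-4 health check
            connect_timeout 3
        }
    }
    real_server 10.0.0.12 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
```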
With all of the above in place, we are still missing monitoring: how the service is behaving, what the traffic looks like, what the CPU load is. Most companies have their own monitoring systems, and LVS monitoring is usually integrated into them. You can also use open-source components, such as an SNMP patch that exposes LVS information through the same interface as traditional network gear.
This figure shows the CDN network topology I mentioned: two LVS instances in an active/standby pair, doing health checks against the back-end real servers.
7. LVS: Problems & Solutions
So far we have introduced basic knowledge of the official (upstream) LVS.
But in a large-scale network, under Taobao's workload, the official LVS cannot meet the demand, for three reasons:
1) The three forwarding modes just described carry a high deployment cost;
2) Compared with commercial load balancers, LVS has no DDoS attack defense;
3) The active/standby deployment model cannot scale performance: what do you do when the traffic behind a single VIP is particularly large?
On the first point, the shortcomings of the LVS forwarding modes, let me expand:
DR's shortcoming: the LVS must sit in the same VLAN as all the back-end real servers. Some will suggest splitting the network into zones with one LVS per zone, but once a zone's VM resources are exhausted you can only use VMs from other zones, and users need those VMs behind the same VIP, which DR cannot do.
NAT's shortcoming: configuration is very complicated. When Ali bought commercial devices, policy routing had to be configured on the switch for the outbound direction, because with multiple sets of load balancers deployed for redundancy, the default route can reach only one of them.
Tunnel's shortcoming: configuration is also complex; each real server needs the IPIP module loaded plus additional configuration.
To solve these problems, our solutions map as follows:
– High deployment cost of every LVS forwarding mode → a new forwarding mode, FULLNAT: LVS and real servers communicate across VLANs, with both inbound and outbound traffic passing through LVS;
– No attack-defense module → SYNPROXY, a SYN-flood defense module, plus defense policies against other TCP-flag DDoS attacks;
– Performance that cannot scale linearly → cluster deployment mode.
Below we introduce each of these solutions in turn.
8. LVS: FULLNAT Forwarding Mode
FULLNAT forwards packets like NAT mode: both inbound and outbound packets pass through the LVS. The key difference is that the back-end real servers and switches need no special configuration.
The core idea of FULLNAT is to introduce local addresses (intranet IPs): the CIP→VIP connection is translated to LIP→RIP, and since both LIP and RIP are IDC intranet IPs, they can communicate across VLANs.
The following compares NAT and FULLNAT from the perspective of IP address translation.
NAT mode
FULLNAT mode
As the figure shows, compared with NAT mode, FULLNAT introduces a local IP; in the address translation both the source and destination IPs are rewritten, i.e. SNAT + DNAT.
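A tiny sketch (with made-up addresses) of how the two modes rewrite the (src, dst) pair of an inbound packet; CIP = client IP, VIP = virtual IP, LIP = LVS local IP, RIP = real-server IP:

```python
def nat_inbound(pkt, rip):
    """NAT mode: only the destination is rewritten (DNAT), so the
    real server still sees the client IP as the source."""
    return {"src": pkt["src"], "dst": rip}

def fullnat_inbound(pkt, lip, rip):
    """FULLNAT mode: both source and destination are rewritten
    (SNAT + DNAT), so the real server sees LIP -> RIP."""
    return {"src": lip, "dst": rip}

pkt = {"src": "203.0.113.7", "dst": "198.51.100.1"}       # CIP -> VIP
print(nat_inbound(pkt, "10.0.0.11"))                       # CIP -> RIP
print(fullnat_inbound(pkt, "10.1.0.5", "10.0.0.11"))       # LIP -> RIP
```

Note that in FULLNAT the real server's reply naturally routes back to the LIP, which is why no policy routing or IPIP configuration is needed on the back end.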
In FULLNAT mode, the hook point where IPVS attaches to the Netfilter framework also changes.
NAT mode
FULLNAT mode
This diagram shows the kernel's five Netfilter hook points. The original NAT mode hooks at two points, LOCAL_IN and FORWARD; in FULLNAT mode, since both the in and out directions use the LVS's own IP, only the LOCAL_IN hook is needed.
Compared with NAT, session-table management also changes: the single index table becomes two, one for the in direction and one for out. This is because NAT mode only needs the client address as the hash key, whereas FULLNAT must use the full 5-tuple.
One of FULLNAT's biggest problems is that the real server cannot see the user's IP. To solve this we proposed TOA (TCP Option Address): the client address is placed in a TCP option and carried to the back-end real server, where a TOA kernel module hooks the getname call so that the client IP carried in the TCP option is returned to user space.
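Assuming the layout used by the open-source toa module (option kind 254 with an 8-byte body: 2-byte port then 4-byte IPv4 address, network byte order), a user-space parser for such an option might look like this sketch (addresses and ports are made up):

```python
import socket
import struct

TOA_KIND = 254   # option kind assumed from the open-source toa module

def parse_toa(options: bytes):
    """Walk a raw TCP options blob; return (client_ip, client_port)
    if a TOA-style option is found, else None."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:              # End of Option List
            break
        if kind == 1:              # NOP: single byte, no length field
            i += 1
            continue
        if i + 1 >= len(options):
            break
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                  # malformed option, stop
        if kind == TOA_KIND and length == 8:
            port, = struct.unpack("!H", options[i + 2:i + 4])
            ip = socket.inet_ntoa(options[i + 4:i + 8])
            return ip, port
        i += length
    return None

# Example: two NOPs, then a TOA option carrying 203.0.113.7:54321
opts = (b"\x01\x01" + bytes([TOA_KIND, 8])
        + struct.pack("!H", 54321) + socket.inet_aton("203.0.113.7"))
print(parse_toa(opts))   # ('203.0.113.7', 54321)
```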
As an aside, Akamai, the world's largest CDN vendor, also uses TCP options to carry auxiliary information.
While developing FULLNAT we hit several pitfalls that are also relevant to general Linux network development. For example, enabling tcp_tw_recycle in the real server's kernel causes access failures for some users behind NAT gateways.
9. LVS: SYNPROXY
LVS can defend against layer-4 TCP-flag DDoS attacks; SYNPROXY is the module that defends against SYN-flood attacks.
The main idea of SYNPROXY borrows from SYN cookies in the Linux TCP stack: the LVS constructs a SYN+ACK packet with a specially computed sequence number, then verifies that the ack_seq in the returning ACK packet is legitimate, thereby proxying the TCP three-way handshake.
Put simply: the client completes the three-way handshake with the LVS first; only after that succeeds does the LVS complete a three-way handshake with the real server.
10. LVS: Cluster Deployment
We have introduced FULLNAT, the new mode that makes deployment easy; next is the cluster deployment model that enables horizontal scaling.
Who balances traffic across the LVS instances? The switch. LVS and the switch run the OSPF protocol, so the VIP forms equal-cost multi-path (ECMP) routes on the switch.
This deployment pattern is useful in many other places too, for example DNS: every company needs to deploy a DNS system, and rather than adding a layer of LVS, we recommend having the DNS servers run OSPF directly with the switch.
One problem with switch ECMP is that it does not currently support consistent hashing. For example, with three LVS instances, if one goes down, the equal-cost routes drop to two and your packets are all reshuffled. Some switch chips, such as Cisco's, support algorithms resembling consistent hashing, but switches with this feature are not yet in production. Therefore each LVS has to do session synchronization, making the connection table global, so that even if one LVS goes down it doesn't matter: when a request arrives, any other LVS can still forward it correctly to the back end.
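A simplified illustration of why this matters (our own sketch; node and flow names are made up): with plain modulo hashing, shrinking from three paths to two remaps most flows, while a consistent-hash ring moves only the flows that were on the dead node.

```python
import hashlib

def h(key: str) -> int:
    """Deterministic 64-bit hash for the demo."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

flows = [f"203.0.113.{i}:5000>198.51.100.1:80" for i in range(100)]

def modulo_map(nodes):
    """ECMP-style selection: hash modulo the number of live paths."""
    return {f: nodes[h(f) % len(nodes)] for f in flows}

def ring_map(nodes, vnodes=100):
    """Consistent hashing: each node gets vnodes points on a ring;
    a flow goes to the first node point at or after its hash."""
    ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
    def pick(f):
        k = h(f)
        for hv, n in ring:
            if hv >= k:
                return n
        return ring[0][1]          # wrap around the ring
    return {f: pick(f) for f in flows}

before_mod = modulo_map(["lvs1", "lvs2", "lvs3"])
after_mod = modulo_map(["lvs1", "lvs2"])       # lvs3 died
before_ring = ring_map(["lvs1", "lvs2", "lvs3"])
after_ring = ring_map(["lvs1", "lvs2"])

moved_mod = sum(before_mod[f] != after_mod[f] for f in flows)
moved_ring = sum(before_ring[f] != after_ring[f] for f in flows)
print(moved_mod, moved_ring)   # modulo typically remaps far more flows
```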
Note: FULLNAT mode currently cannot support session synchronization.
11. LVS: Performance Optimization
These performance-optimization techniques are useful for any network service.
First, multi-queue NICs: bind each queue to a CPU core so that multiple cores process network packets in parallel. If the NIC does not support multiple queues, you can use RPS (Receive Packet Steering), the software multi-queue mechanism contributed to Linux by Google and integrated in the kernel by default.
Second, we optimized Keepalived, changing its I/O model from select to epoll.
Third, when you buy servers, we recommend turning off the NIC's LRO and GRO features, especially on Broadcom NICs, where we stepped on many pitfalls.
12. LVS: TODO List
Next, let me explain what we plan to do.
We will focus on the control system. LVS had a series of failures in May and June and was very unstable; the root cause was neither LVS nor Tengine but the control system. The first step is to simplify the control system and separate user-facing operation logic from operations logic; the next is to improve its performance.
On the feature side, we will support UDP and HTTPS.
There is also session synchronization: it is difficult to support in the FULLNAT case, and we will solve that problem.
On performance, we are also experimenting with 40G NICs and evaluating them.
Longer term, if we can, we want to combine layer 4 and layer 7 into a single tier.