Architecture Design: Load Balancing Layer Design (4) -- LVS Principles


In the previous two articles we explained Nginx's principles, installation, and feature components in detail; see "Architecture Design: Load Balancing Layer Design (2) -- Nginx Installation" (http://blog.csdn.net/yinwenjie/article/details/46620711) and "Architecture Design: Load Balancing Layer Design (3) -- Nginx Advanced" (http://blog.csdn.net/yinwenjie/article/details/46742661). Although those articles do not cover all of Nginx's features (nor could they), they are enough for readers to apply Nginx in real production and to tune its important features; we will return to Nginx later when time allows. Starting with this article, we begin to introduce LVS, including basic concepts, simple usage, and advanced usage.

1. LVS Introduction

Please search Google or Baidu for background on LVS (Linux Virtual Server) yourself.

2. Basic Network Protocol Knowledge

According to the official documentation, LVS supports three load-balancing working modes: NAT mode, TUN mode, and DR mode. To illustrate how these three modes work, we first need to understand the basic formats of IP packets and TCP segments (note that IP packets and TCP segments are two different message formats) and how the link layer encapsulates IP data. Then, combining text and diagrams, we introduce how each of the three working modes rewrites or encapsulates the message.

To put the basic knowledge we are about to explain in context, we should mention the OSI 7-layer network model.

Dr. Zhang Wensong and his team are my idols. The LVS system is efficient because it directly modifies or encapsulates link-layer frames, IP packets, and TCP segments. So to really understand the three LVS working modes, you cannot just skim a few copied-and-pasted paragraphs the way many articles on the web do; you must understand the link-layer header, the network-layer header, and the transport-layer header. Let's take a brief overview of each.

To keep this series from deviating from its theme, we only explain the fields and meanings we will need to use. If you are interested in the underlying principles of networking, read "TCP/IP Illustrated, Volume 1: The Protocols".

2.1. Link-Layer Frames

The data formats of the link layer share a common feature: they include a destination MAC address and a source MAC address. The following figure illustrates the format of the most commonly used Ethernet frame:

    • Destination MAC address / source MAC address: the range 00:00:00:00:00:00 -- ff:ff:ff:ff:ff:ff is the available range of global MAC addresses. A physical NIC must have a unique MAC address. In fact, the IP protocol commonly used at the network layer is carried on top of MAC addresses: within a subnet, the MAC address corresponding to an IP address is resolved via the ARP protocol by the network device (which may be a router, switch, or network proxy device).

    • Upper-layer protocol type: a link-layer frame carries network-layer protocol data, so the link-layer format needs a field indicating which upper-layer protocol the frame carries:

      • IPv4: 0x0800
      • ARP: 0x0806
      • PPPoE: 0x8864
      • IPv6: 0x86dd
    • Encapsulated upper-layer data: up to 1500 bytes (the standard Ethernet MTU).
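To make the frame layout above concrete, here is a minimal Python sketch that unpacks the destination MAC, source MAC, and upper-layer protocol type from the first 14 bytes of an Ethernet frame. The example frame bytes (MAC addresses and payload) are fabricated for illustration:

```python
import struct

def parse_ethernet_header(frame: bytes):
    """Split the 14-byte Ethernet header: dst MAC (6), src MAC (6), ethertype (2)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    fmt = lambda mac: ":".join(f"{b:02x}" for b in mac)
    return fmt(dst), fmt(src), ethertype

# A made-up frame: broadcast destination, an arbitrary source MAC, IPv4 ethertype.
frame = bytes.fromhex("ffffffffffff" "00163e2a7b01" "0800") + b"...payload..."
dst, src, proto = parse_ethernet_header(frame)
print(dst)          # ff:ff:ff:ff:ff:ff
print(src)          # 00:16:3e:2a:7b:01
print(hex(proto))   # 0x800 -> IPv4, matching the table above
```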

Keep this link-layer frame format in mind: when we discuss LVS-DR later, it mainly modifies the link-layer frame without modifying the IP packet or the TCP segment.

2.2. Network-Layer IP Packets
    • The TCP protocol and the IP protocol are two different protocols and, correspondingly, two different message formats.

    • IP is a network-layer protocol; as the name implies, it describes addressing across the entire network. TCP is a transport-layer protocol, used to describe how two or more endpoints on networks communicate and the current state of that communication.

    • The two protocols share several structural features: both are divided into a "header" part and a "data" part, and for the IP protocol, the TCP segment is stored in the IP packet's data part.

Let's start by looking at how the IP protocol is structured:

Well, the figure is pasted directly from Baidu Encyclopedia, because drawing it myself is too much trouble and the structure is the same anyway ^_^. There are several important fields we will use later; let me point them out:

    • Header.Version: the IP protocol version number — you guessed right, IPv4 or IPv6. 4 represents IPv4 and 6 represents IPv6. If you ask what IPv4 and IPv6 look like: 192.168.220.141 is IPv4 format; fe80:0000:0000:0000:aaaa:0000:00c2:0002 is IPv6 format.

    • Header.Total Length: the total length of the IP header plus the IP data. It is mainly used when generating the header checksum and for the convenience of operating-system processing.

    • Header.IP Flags: this field has 3 bits, but only the last two actually carry values. The "DF" (don't fragment) bit = 1 means the data to be transmitted is small enough that this single IP datagram carries the entire payload and no fragmentation is needed. DF = 0 means the data to be transferred is relatively large, so the IP datagram is fragmented; in that case the last bit comes into play: "MF" (more fragments) = 1 means there are subsequent fragments; MF = 0 means this datagram is the last fragment.

    • Header.Protocol: IP is the network-layer protocol, and above the network layer sits the transport layer; TCP, UDP, ICMP, and IGMP are carried there. The 8 bits in this field describe which upper-layer protocol the IP data part carries.

    • Header.Source Address: this is, of course, the source address of the IP datagram.

    • Header.Destination Address: this is, of course, the destination address of the IP datagram.

    • Header.Checksum: the header checksum. This value verifies the transmission integrity of the IP datagram header (note: the checksum does not cover the data portion of the datagram). This means that when a NAT device rewrites the source or destination IP of a datagram, the checksum must be recalculated. Source Address, Destination Address, and Checksum are the main fields rewritten by various NAT devices; often a NAT device only rewrites these three values to forward an IP datagram (of course, when port mapping is involved, the ports in the TCP segment are rewritten for port translation as well).
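To illustrate why a NAT device must recompute Header.Checksum after rewriting an address, here is a sketch of the standard ones'-complement IPv4 header checksum (RFC 791); the 20-byte sample header is fabricated for illustration:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words; over a valid full header it yields 0."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header)//2}H", header))
    while total > 0xFFFF:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_destination(header: bytearray, new_dst: str) -> bytearray:
    """Rewrite Destination Address (bytes 16-19) and recompute the checksum."""
    header[16:20] = bytes(int(x) for x in new_dst.split("."))
    header[10:12] = b"\x00\x00"                # zero the checksum field first
    header[10:12] = struct.pack("!H", ipv4_checksum(bytes(header)))
    return header

# Fabricated 20-byte header, src 192.168.100.99, dst 192.168.100.10, checksum zeroed.
hdr = bytearray.fromhex("45000054abcd40004006" "0000" "c0a86463" "c0a8640a")
rewrite_destination(hdr, "192.168.220.121")    # NAT: point dst at the real server
print(bytes(hdr[16:20]) == bytes([192, 168, 220, 121]))  # True
print(ipv4_checksum(bytes(hdr)))                         # 0 -> header now verifies
```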

2.3. Transport-Layer TCP Segments

As mentioned above, a TCP segment is carried in the data portion of an IP packet, as the data transmitted across the network from Source Address to Destination Address. The important fields of a TCP segment are:

    • Header.Source Port Number: the port number of the TCP data source.

    • Header.Destination Port Number: the target port of the TCP data.

    • Header.Status bits (URG/ACK/PSH/RST/SYN/FIN): if you have read Section 3.2 of my earlier article on the architectural layering of standard web systems (http://blog.csdn.net/yinwenjie/article/details/46480485), then the ACK, SYN, and FIN flags are not unfamiliar to you, since TCP's three-way handshake and connection teardown use these three flags; note that SYN seq and ACK seq are the sequence and acknowledgment numbers of the TCP segment. In addition, let me explain the PSH and RST flags. Below the application layer, the receiver keeps a buffer: multiple correctly received TCP segments are first placed in this buffer, and only after a certain condition is met are they pushed up to the application-layer protocol, such as HTTP. When PSH = 1, the receiver should not wait for subsequent TCP segments; the data currently in the receive buffer is pushed up to the upper-layer protocol immediately and the buffer is emptied. RST indicates a reset: you can understand it as discarding all buffered TCP data not yet delivered to the upper layer; it generally indicates a problem with the TCP transmission.

    • Header.TCP Checksum: the checksum of a TCP segment is slightly more complex than the IP header checksum. Its input consists of three parts: the TCP pseudo-header, the TCP header, and the TCP data. The TCP pseudo-header is a virtual concept: it borrows fields from the IP packet that carries the TCP segment (source IP, destination IP, and Protocol) together with the TCP segment length (the length of the TCP header plus the length of the TCP data).

As the description above shows, once the source IP or destination IP in the IP packet changes, the TCP checksum changes as well.
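The pseudo-header construction above can be sketched as follows. The segment bytes and IP addresses are fabricated; the point is only that changing either IP changes the TCP checksum:

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Same ones'-complement fold as the IP header checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def tcp_checksum(src_ip: str, dst_ip: str, tcp_segment: bytes) -> int:
    """Checksum input = pseudo-header (src IP, dst IP, zero byte, protocol 6,
    TCP length) + TCP header + TCP data, exactly as described above."""
    ip = lambda s: bytes(int(x) for x in s.split("."))
    pseudo = ip(src_ip) + ip(dst_ip) + struct.pack("!BBH", 0, 6, len(tcp_segment))
    return ones_complement_sum(pseudo + tcp_segment)

# A fabricated 22-byte TCP segment (its own checksum field zeroed).
segment = bytes.fromhex("1f90005000000001000000025010ffff00000000") + b"hi"
before = tcp_checksum("100.64.92.1", "192.168.100.10", segment)
after  = tcp_checksum("100.64.92.1", "192.168.220.121", segment)  # NAT rewrote dst IP
print(before != after)   # True: changing either IP changes the TCP checksum
```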

3. The Three LVS Working Modes

3.1. LVS-NAT Working Mode

In NAT mode, the LVS master node receives the datagram and forwards it to an underlying real server node; when the real server finishes processing, the response is sent back to the LVS master node, which then forwards it out. The LVS management tool ipvsadm is responsible for binding the forwarding rules and completing the rewriting of fields in the IP packet and the TCP segment. Take a few minutes to study the figure closely (for simplicity, only one real server is drawn; if you cannot see it clearly, right-click and choose "view original"):

  • 1. In a formal data-center environment, there are generally two ways to assign an external network address to a machine: either bind the external address directly to the host NIC on the core switch, in which case the ifconfig command shows an external IP address; or configure a mapping rule on the core switch that maps an external address to an intranet address, in which case ifconfig shows an intranet address. We use the latter, mapping-rule approach. Using the former direct-assignment approach would not change how LVS-NAT works, since that IP only matters at the edge of the LVS-NAT setup — the only difference would be that eth1's IP changes from 192.168.100.10 to 100.64.92.199.

  • 2. Let's describe the translation rule in plain language: every datagram sent to "192.168.100.10:80" has its destination address rewritten to "192.168.220.121:8080", so all messages arriving via "100.64.92.199:80" are rewritten. The rewritten fields include IP.header.destinationIP, IP.header.checksum, TCP.header.targetPort, and TCP.header.checksum. Note that the source IP of the IP packet does not change — it remains some "Internet IP".

  • 3. The packet is eventually sent to port 8080 of 192.168.220.121 for processing, and the response datagram is generated by the lower-level real server (whether the "real server" is itself really the final server, LVS does not care). If you want to know how the packet physically gets there, refer to the ARP protocol.

  • 4. Note: the LVS node and the real servers (there may be more than one) form a closed LAN, and apart from the LVS node, no node in this subnet can reach the external network. So can the real server at 192.168.220.121 send data directly back to that "some Internet IP"? Obviously not, because that IP cannot be found in the LAN. The real server can only return datagrams to its gateway, and the gateway then looks up the external address. Only the LVS node in the whole cluster can reach the external address — this is why, in LVS-NAT working mode, every real server must set its gateway to the LVS node.

  • 5. After receiving the response message from "192.168.220.121:8080", IPVS rewrites it again. The rewrite rule: for every datagram coming from "192.168.220.121:8080", the source IP and source port are rewritten to "192.168.100.10:80". To the core switch outside (or the requester outside the data center), it looks as if the LVS node itself accepted the datagram, processed it, and returned the result; the outside world does not know what sits below the LVS node.
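The rewrite rules in the steps above can be sketched as a toy Python model. This is only an illustration of which fields change in each direction — not how IPVS is actually implemented; checksum recomputation (covered in Section 2) is elided, and the client address 100.64.92.1:40001 is made up:

```python
# Toy model of an LVS-NAT node: one virtual service mapped to one real server.
# Addresses follow the figure; checksum recomputation is elided (see Section 2).
VIP, VPORT = "192.168.100.10", 80
RS,  RPORT = "192.168.220.121", 8080

def nat_in(packet: dict) -> dict:
    """Inbound: rewrite the destination to the real server; the source is untouched."""
    assert (packet["dst_ip"], packet["dst_port"]) == (VIP, VPORT)
    return {**packet, "dst_ip": RS, "dst_port": RPORT}

def nat_out(packet: dict) -> dict:
    """Outbound: rewrite the source back to the VIP, hiding the real server."""
    assert (packet["src_ip"], packet["src_port"]) == (RS, RPORT)
    return {**packet, "src_ip": VIP, "src_port": VPORT}

request = {"src_ip": "100.64.92.1", "src_port": 40001,
           "dst_ip": VIP, "dst_port": VPORT}
inner = nat_in(request)
print(inner["dst_ip"], inner["src_ip"])   # 192.168.220.121 100.64.92.1

reply = {"src_ip": RS, "src_port": RPORT,
         "dst_ip": "100.64.92.1", "dst_port": 40001}
outer = nat_out(reply)
print(outer["src_ip"])   # 192.168.100.10 -- the client never sees the real server
```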

The advantages of LVS-NAT are:

    • Configuration and management are simple. LVS-NAT is the easiest of the three LVS working modes to understand, configure, and manage.

    • It saves external IP resources. A data center generally assigns each user a limited number of external IPs, especially relative to the number of racks purchased. LVS-NAT encapsulates your system architecture inside a LAN; only the LVS node needs an external address (or an external-address mapping) to provide access.

    • The system architecture is relatively closed. In an intranet environment we need not set firewall requirements very high, and maintenance of the physical servers is relatively easy. You can require firewall filtering for requests originating from the external network while leaving intranet requests open for access.

    • In addition, the real server does not care about the authenticity of a rewritten message: as long as the TCP and IP checksums pass, it will process it. So in LVS-NAT working mode, real servers can run any operating system, as long as it supports TCP/IP.

    • Of course, as a loyal supporter of Linux systems, I do not recommend Windows Server. But if your real servers run a .NET system and the business scenario requires LVS, then LVS-NAT may be a good choice.

The disadvantages of LVS-NAT stem from the forwarding mode itself:

    • The forwarding point is the bottleneck. Imagine a scenario where 100 real servers all send their processing results through a single LVS node; in practice, the ultimate load LVS-NAT can carry is far less than 100 real servers' worth.

3.2. LVS-DR Working Mode

The DR working mode of LVS is the mode most commonly used in production environments; it has the most material online, and some of the explanations of DR mode are quite thorough. Here we introduce the DR mode of operation (again, if the figure is unclear, right-click "view original"):

The figure reflects the entire working process of DR mode; for simplicity, only one real server is drawn. With multiple real servers, LVS uses its scheduling algorithm to decide which real server to send to. The key points of the LVS-DR working mode are:

    • The response messages formed by the real server are no longer sent back through the LVS node; they are routed directly to the core switch and sent out. The return trip through LVS in LVS-NAT mode is eliminated.

    • The LVS node only rewrites the link-layer encapsulation of the message; the network-layer and transport-layer headers are not rewritten.

    • There is a claim circulating online that DR working mode cannot cross subnets — that the LVS node and the individual real server nodes must be in the same network segment. Why is that? Is it really so? Many online posts do not answer this question; this article answers it shortly (in fact, Mr. Zhang Wensong has already answered it).

    • When using DR mode, the real server must configure the VIP held by the LVS node as its own loopback IP, otherwise the packet will be discarded. Why? Many posts do not answer this question either. OK, we'll answer it right away.

First, how it works:

  • 1. As before, we demonstrate in a full production environment, starting from the moment the data-center core switch receives a datagram. The core switch again uses the IP-mapping approach, but unlike the LVS-NAT case, each real server also needs an external-address mapping bound on the core switch. This ensures that the response messages the real servers send back can reach the external network.

  • 2. When the LVS node receives the request message, it rewrites the message's data-link-layer framing: the target MAC is rewritten to a real server's MAC, but the network and transport layers are not rewritten, and the frame is sent back to the switch. Here lies a subtlety: the target MAC and the destination IP now no longer correspond. If this frame went through layer-3 switching, the mismatch would cause the message's validation to fail and it would be discarded; it can only be layer-2 switched. So the LVS-DR approach requires that the real servers and the LVS node be in the same LAN — or, more precisely: the LVS node needs a layer-2 link over which it can deliver the MAC-rewritten frame to the real server, without any layer-3 switching checks. In this sense, the LVS node and the real server interfaces do not strictly have to be on the same IP subnet; it would also be possible to carry the frames over a dedicated NIC.

  • 3. Through layer-2 switching, the frame arrives at the real server node. So how does the real server judge the correctness of this packet? First, the transport-layer and network-layer checksums are not a problem: LVS-DR does not rewrite the TCP or IP data, so of course the checksums pass (unless the message itself was corrupted). Next, the link-layer MAC address is recognized — and this is the credit of the loopback IP. To the real server, the VIP 192.168.100.10 is its own loopback IP, and the bound MAC is the target MAC that LVS substituted. The real server therefore believes the packet was sent to itself by a local application via the loopback IP, so the packet cannot be discarded and must be processed.

  • 4. The response message generated after processing is sent directly to the gateway. Not much explanation is needed here: just make sure the real server's default route is set to the core switch's 192.168.100.1. One more note: because LVS-DR mode does not change the original IP packet or TCP segment, LVS-DR itself does not support port mapping. In daily practice we generally let Nginx do the port mapping, because: flexibility.
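The key points above can be condensed into a minimal sketch, with made-up MAC addresses: only the frame's destination MAC is rewritten, and the real server accepts the packet precisely because the VIP is bound on its loopback. This illustrates the logic, not the kernel implementation:

```python
# Toy model of LVS-DR: only the frame's destination MAC is rewritten.
# MAC addresses are made up; the VIP matches the figure.
VIP = "192.168.100.10"
RS_MAC = "00:16:3e:00:00:02"

def dr_forward(frame: dict) -> dict:
    """LVS-DR rewrites only the link layer; the inner IP/TCP data is untouched."""
    return {**frame, "dst_mac": RS_MAC}

def real_server_accepts(frame: dict, loopback_ips: set, my_mac: str) -> bool:
    """The RS accepts the frame only if the MAC is its own AND the dst IP is local
    (which is exactly why the VIP must sit on the RS loopback)."""
    return frame["dst_mac"] == my_mac and frame["dst_ip"] in loopback_ips

frame = {"dst_mac": "00:16:3e:00:00:01",   # originally the LVS node's MAC
         "src_mac": "00:16:3e:00:00:99",
         "dst_ip": VIP, "dst_port": 80}

out = dr_forward(frame)
print(out["dst_ip"] == frame["dst_ip"])                      # True: IP layer untouched
print(real_server_accepts(out, {VIP, "127.0.0.1"}, RS_MAC))  # True
# Without the VIP bound on the RS loopback, the packet would be dropped:
print(real_server_accepts(out, {"127.0.0.1"}, RS_MAC))       # False
```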

The advantages of the LVS-DR working mode are:

    • It solves the forwarding-bottleneck problem of LVS-NAT mode and can support larger load-balancing scenarios.

    • By comparison, it consumes more external IP resources, and a data center's external IP resources are limited. If this really is a problem in your production environment, you can mix LVS-NAT and LVS-DR to alleviate it.

LVS-DR, of course, also has drawbacks:

    • Configuration is a bit more troublesome than LVS-NAT. You need to understand at least the basic working principles of LVS-DR mode to guide yourself through configuring it and solving problems during operation.

    • Because of LVS-DR's frame-rewriting rules, the LVS node and the real server nodes must share a network segment, since layer-2 switching cannot cross subnets. For most system architecture scenarios, though, this is not an intrinsic limitation.

3.3. LVS-TUN Working Mode

Many articles on the web present DR and TUN to the reader as similar working modes, directly explaining the installation and configuration of both and then concluding that the two are alike. But then why do we need TUN mode when we already have DR mode? Why does ipvsadm have different configuration parameters for the two modes?

In fact, LVS-DR mode and LVS-TUN mode work completely differently, and their application scenarios are completely different. DR is based on rewriting the data frame; TUN mode is based on IP tunneling, that is, re-encapsulating the data packet. Let's look at how LVS-TUN mode works.

First, we introduce a concept: the IPIP tunnel. It encapsulates a complete IP packet into the data portion of another, new IP packet and transmits it to a specified location via routers. During this process, the routers do not care about the contents of the encapsulated original packet. On arrival at the destination, the receiver — relying on its own processing capability and its support for the IPIP tunnel protocol — removes the encapsulation and recovers the original packet. As shown:

It can be said that tunneling protocols exist to solve cross-subnet transmission. In a production environment, due to business, technical, or security needs, switch VLAN isolation may be in use (forming several virtual, independent LANs): the LVS node may sit in LAN A, while the multiple MySQL read servers that need load balancing sit in LAN B. In that case, we configure LVS tunnel mode. The LVS-TUN mode is shown below (note that the target node must be able to unpack the tunneling protocol — and the good news is that Linux supports the IPIP tunneling protocol):

There are many lines in the figure above; you only need to follow the dashed lines "with arrows".

    • 1. Once the LVS node sees a request whose target is the VIP 192.168.100.10, it encapsulates the request packet using the IPIP tunneling protocol — rather than rewriting the frame's MAC information as LVS-DR mode does. If multiple real servers are configured, LVS uses the configured scheduling algorithm to pick one (for simplicity, only one real server node is drawn).

    • 2. The re-encapsulated IPIP tunnel message is sent back to the router; the router (or layer-3 switch) locates the target server according to the configured VLAN mapping, and forwards the IPIP tunnel message there.

    • 3. When the real server receives this IPIP tunnel message, it unpacks it. Note that, in general, IPIP tunnel messages may be fragmented, just like ordinary IP fragmentation; for convenience, we assume this message needs no fragmentation. The packet obtained after unpacking is the original request message that was sent to the VIP.

    • 4. The real server has the VIP configured on its loopback, which makes it believe the original request message was sent by one of its own local applications; after the original message passes verification, the real server processes it. The rest of the process is the same as LVS-DR and need not be repeated.
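The steps above come down to one operation: wrapping the original packet in a new outer IPv4 header whose Protocol field is 4 ("IP in IP"). Here is a minimal sketch with fabricated addresses and the outer header checksum elided; real IPIP processing is done by the kernel, not by application code:

```python
import struct

IPPROTO_IPIP = 4  # "IP in IP": the outer header's Protocol field carries 4

def ipip_encap(inner_packet: bytes, outer_src: str, outer_dst: str) -> bytes:
    """Wrap a complete IP packet in a new 20-byte outer IPv4 header (checksum elided)."""
    ip = lambda s: bytes(int(x) for x in s.split("."))
    total_len = 20 + len(inner_packet)
    outer = struct.pack("!BBHHHBBH", 0x45, 0, total_len, 0, 0,
                        64, IPPROTO_IPIP, 0) + ip(outer_src) + ip(outer_dst)
    return outer + inner_packet

def ipip_decap(packet: bytes) -> bytes:
    """The real server strips the 20-byte outer header to recover the original packet."""
    assert packet[9] == IPPROTO_IPIP        # byte 9 of an IPv4 header is Protocol
    return packet[20:]

# Stand-in bytes for the full request packet originally addressed to the VIP.
original = b"\x45\x00\x00\x28" + b"(rest of the request sent to the VIP)"
tunneled = ipip_encap(original, "10.0.1.1", "10.0.2.7")   # LVS -> RS across subnets
print(len(tunneled) == len(original) + 20)  # True: routers see only the outer header
print(ipip_decap(tunneled) == original)     # True: RS recovers the untouched request
```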

It can be said that LVS-TUN basically retains the advantages of LVS-DR and, on top of that, supports crossing subnets. Such a capability gives us architects plenty of room in system design.

4. LVS Scheduling Algorithms

In Section 3 of this article, in order to concentrate on the three LVS working modes, we drew only one real server behind LVS in each of the three figures. In practical applications, however, there are usually multiple real servers, and LVS uses one of several scheduling algorithms to decide which real server handles the "current message". In my earlier article "Architecture Design: Load Balancing Layer Design (2) -- Nginx Installation" (http://blog.csdn.net/yinwenjie/article/details/46620711), I spent considerable space on scheduling in Nginx and made clear that these scheduling methods all share the same underlying principles.

That article covered hash algorithms and showed that any attribute can be hashed, including IP, user name, and so on. It also introduced round-robin and weighted round-robin, where the weight can be based on various attributes — such as a node's CPU usage or memory usage — or a fixed value set by the administrator. The same is true of LVS scheduling.

    • Hash-based scheduling

      • Destination hashing (DH): this scheduling algorithm uses the request's destination IP address as a hash key to look up the corresponding server in a statically allocated hash table; if that server is available and not overloaded, the request is sent to it, otherwise null is returned.
      • Source hashing (SH): uses the request's source IP address as a hash key to look up the corresponding server in a statically allocated hash table; if that server is available and not overloaded, the request is sent to it, otherwise null is returned.
    • Round-robin scheduling

      • Round-robin (RR): the scheduler distributes external requests to the real servers in the cluster in turn, treating each server equally regardless of its actual number of connections and system load.

      • Least-connection (LC): note the difference between the two scheduling algorithms "least connection" and "weighted least connection". The scheduler dynamically dispatches network requests to the server with the fewest established connections. Note that the request is definitely assigned to the real server that currently has the fewest connections — no probability is involved.

    • Weighted scheduling:

      • Weighted round-robin (WRR): this scheduling algorithm dispatches requests according to the real servers' different processing capabilities, ensuring that more capable servers handle more traffic. The scheduler can automatically query the real servers' load and adjust their weights dynamically.
      • Weighted least-connection (WLC): servers with higher weights bear a larger proportion of the active connection load. The scheduler can automatically query the real servers' load and adjust their weights dynamically. Note that here a higher weight means a proportionally higher probability of assignment, rather than the "definite assignment" of LC.

A more complete list of scheduling algorithms appears in the official Chinese LVS documentation, which can be consulted at http://zh.linuxvirtualserver.org/node/2903. But remember: all the scheduling algorithms ultimately fall under the big ideas of hashing, round-robin, and weighted round-robin.
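To make the two big families concrete, here is a toy sketch of weighted round-robin (by naive weight expansion — note that the real IPVS WRR implementation uses a different, smoother algorithm) and of source hashing; the server names and client IP are made up:

```python
import hashlib
from itertools import cycle

# Two toy schedulers illustrating the big ideas above; server names are made up.
servers = {"rs-a": 3, "rs-b": 1}   # weights: rs-a should get ~3x the traffic

# Weighted round-robin: expand each server by its weight, then cycle.
wrr = cycle([name for name, w in servers.items() for _ in range(w)])
order = [next(wrr) for _ in range(8)]
print(order)   # ['rs-a', 'rs-a', 'rs-a', 'rs-b', 'rs-a', 'rs-a', 'rs-a', 'rs-b']

# Source hashing: the same client IP always lands on the same server.
def sh_schedule(client_ip: str, pool: list) -> str:
    digest = hashlib.md5(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

pool = list(servers)
print(sh_schedule("100.64.92.1", pool) == sh_schedule("100.64.92.1", pool))  # True
```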

5. Preview of the Next Article

This article focused on how LVS works, introducing in detail the workflow, advantages and disadvantages, and application scenarios of LVS's three working modes. As long as you understand the principles, the LVS installation and configuration described in the next article will be a piece of cake. In the next article we will describe how to install and configure the three LVS working modes, and then show how to install and configure LVS + Keepalived. With that knowledge in place, we will return to Nginx and introduce the installation and configuration of LVS + Keepalived + Nginx.

Copyright notice: this is the blogger's original article; reproduction without the blogger's permission is prohibited.
