An analysis of the core network principles of the LVS cluster system


The rapid growth of the Internet has brought a rapid increase in the number of accesses to network servers, which must therefore handle large volumes of concurrent requests. For heavily loaded servers, CPU and I/O capacity quickly become the bottleneck, and because the performance of a single server is always limited, simply upgrading hardware cannot truly solve the problem. Multi-server and load-balancing techniques are needed to serve large numbers of concurrent accesses. Linux Virtual Server (LVS) uses server load-balancing technology to combine multiple servers into one virtual server, providing easily scalable load capacity at low cost to meet rapidly growing demand for network access.

1. LVS structure and working principle

As Figure 1 shows, LVS is composed of a front-end Load Balancer (LB) and a group of back-end Real Servers (RS). The RS group can be connected over a LAN or WAN. The LVS structure is transparent to users: they see only the single virtual server acting as the LB, not the RS group that actually provides the service.

[Figure 1: LVS structure, with a front-end Load Balancer and a back-end Real Server group]

When a user's request arrives at the virtual server, the LB forwards it to an RS according to the configured packet forwarding policy and the load-balancing scheduling algorithm, and the RS then returns the result to the user. Like the request packet, the way the response packet is returned also depends on the packet forwarding policy.

There are three packet forwarding policies in LVS (a small sketch of the NAT-mode rewrite follows the list):

  • NAT (Network Address Translation) mode. After the LB receives a user request packet, it rewrites the virtual server IP address in the packet to the IP address of a selected RS and forwards the packet to that RS. The RS sends its response back to the LB, which rewrites the RS IP address in the response to the virtual server's IP address and returns the packet to the user.
  • IP Tunneling mode. After receiving a user request packet, the LB encapsulates it according to the IP tunneling protocol and sends it to a selected RS. The RS decapsulates the request and sends the response content directly to the user. In this mode, both the RS and the LB must support the IP tunneling protocol.
  • DR (Direct Routing) mode. After receiving a request packet, the LB rewrites the destination MAC address in the packet to the MAC address of a selected RS and forwards it. The RS can then send the response content directly to the user. In this mode, the LB and all RS must be on the same physical segment, and the LB shares a virtual IP address with the RS group.
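To make the NAT-mode rewrite concrete, here is a minimal userspace sketch of the idea. It is an illustration only, not the IPVS code: the real ip_vs_nat_xmit() works on a kernel sk_buff and must also rewrite the TCP/UDP destination port and the transport-layer checksum.

/* Conceptual sketch of the NAT-mode request rewrite (illustration only). */
#include <netinet/ip.h>
#include <stdint.h>

/* Recompute the 16-bit one's-complement checksum over the IP header. */
uint16_t ip_checksum(struct iphdr *iph)
{
    uint16_t *p = (uint16_t *)iph;
    uint32_t sum = 0;
    int i;

    iph->check = 0;                      /* checksum field counts as zero */
    for (i = 0; i < iph->ihl * 2; i++)   /* ihl is in 32-bit words        */
        sum += p[i];
    while (sum >> 16)                    /* fold the carries back in      */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* NAT-mode request path: the LB substitutes the selected RS for the
 * virtual server as the packet's destination (rs_ip in network order). */
static void nat_rewrite_request(struct iphdr *iph, uint32_t rs_ip)
{
    iph->daddr = rs_ip;                  /* virtual IP -> RS IP           */
    iph->check = ip_checksum(iph);       /* header changed: recompute     */
}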

2. IPVS software structure and implementation

The core of LVS is IPVS, which runs on the LB and performs load balancing at the IP layer. As Figure 2 shows, IPVS consists of three modules: IP packet processing, the load-balancing scheduling algorithms, and system configuration and management, plus the linked lists of virtual servers and real servers.

[Figure 2: IPVS software structure]

2.1 LVS IP packet processing

IP packet processing is implemented with the Netfilter framework of the Linux 2.4 kernel. A data packet passes through the Netfilter framework as follows.

In general, the Netfilter architecture places detection points (HOOKs) at several locations along the packet flow through the network stack, and processing functions, such as packet filtering, NAT, or even user-defined functions, are registered at each detection point. The five IP-layer detection points are listed below; their header constants are shown after the list.


[Figure 3: a packet's path through the Netfilter detection points]

  1. NF_IP_PRE_ROUTING: packets that have just entered the IP layer pass this point after the version number, checksum, and other sanity checks; the built-in destination address translation is performed here;
  2. NF_IP_LOCAL_IN: packets destined for the local machine pass this checkpoint after the routing lookup; INPUT packet filtering is done here;
  3. NF_IP_FORWARD: packets to be forwarded pass this point; FORWARD packet filtering is done here;
  4. NF_IP_LOCAL_OUT: packets generated by local processes pass this point; OUTPUT packet filtering is done here;
  5. NF_IP_POST_ROUTING: every packet about to leave through a network device passes this detection point; the built-in source address translation, including masquerading, is performed here.
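For reference, these five detection points correspond to constants defined in <linux/netfilter_ipv4.h>; the values below are the 2.4-era definitions, quoted for illustration:

/* Hook points at the IP layer, from <linux/netfilter_ipv4.h> (Linux 2.4). */
#define NF_IP_PRE_ROUTING   0   /* after sanity checks, before routing */
#define NF_IP_LOCAL_IN      1   /* destined for this host              */
#define NF_IP_FORWARD       2   /* to be forwarded                     */
#define NF_IP_LOCAL_OUT     3   /* generated by a local process        */
#define NF_IP_POST_ROUTING  4   /* about to leave through a device     */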

In the IP-layer code, several statements invoke the NF_HOOK macro. For example, the IP forwarding function contains:

/* net/ipv4/ip_forward.c, in ip_forward(): */
NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, dev2, ip_forward_finish);

/* The NF_HOOK macro is defined in include/linux/netfilter.h: */
#ifdef CONFIG_NETFILTER
#define NF_HOOK(pf, hook, skb, indev, outdev, okfn)            \
    (list_empty(&nf_hooks[(pf)][(hook)])                       \
     ? (okfn)(skb)                                             \
     : nf_hook_slow((pf), (hook), (skb), (indev), (outdev), (okfn)))
#else /* !CONFIG_NETFILTER */
#define NF_HOOK(pf, hook, skb, indev, outdev, okfn) (okfn)(skb)
#endif /* CONFIG_NETFILTER */

If Netfilter is not configured when the kernel is compiled, NF_HOOK simply calls its last argument; in this example, ip_forward_finish() is executed. Otherwise the packet enters the hook point: nf_hook_slow() is invoked, which in turn runs the processing functions registered at that point with nf_register_hook().

The parameters of the NF_HOOK macro are:

  1. pf: the protocol family. The Netfilter architecture can also be used outside the IP layer, so this can also take values such as PF_INET6 or PF_DECnet;
  2. hook: the name of the hook point; at the IP layer it takes one of the five values above;
  3. skb: the socket buffer carrying the packet;
  4. indev: the incoming device, a struct net_device;
  5. outdev: the outgoing device, a struct net_device;
  6. okfn: a function pointer, invoked once all the functions registered at this hook point have accepted the packet.

These points are fixed in the kernel; unless you maintain this part of the kernel code, you cannot add to or modify them. What happens at each detection point, however, can be specified by the user: features such as packet filtering, NAT, and connection tracking are provided in exactly this way. This matches Netfilter's original design goal of providing a sound, flexible framework that is easy to extend.

To add our own code, we register it with the nf_register_hook() function, whose prototype is:

int nf_register_hook(struct nf_hook_ops *reg);

/* include/linux/netfilter.h */
struct nf_hook_ops
{
    struct list_head list;

    /* User fills in from here down. */
    nf_hookfn *hook;
    int pf;
    int hooknum;
    /* Hooks are ordered in ascending priority. */
    int priority;
};

LVS essentially creates an instance of struct nf_hook_ops and registers it with nf_register_hook(). The list member should be initialized to {NULL, NULL}. Since LVS works at the IP layer, pf is always PF_INET, and hooknum is the hook point. One hook point may have several processing functions, and which runs first is decided by the priority member. The header linux/netfilter_ipv4.h defines the priorities of the built-in processing functions as an enumeration:

enum nf_ip_hook_priorities {
    NF_IP_PRI_FIRST = INT_MIN,
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_NAT_SRC = 100,
    NF_IP_PRI_LAST = INT_MAX,
};

hook is the processing function we provide, that is, where our main work goes. Its prototype is:

unsigned int nf_hookfn(unsigned int hooknum,
                       struct sk_buff **skb,
                       const struct net_device *in,
                       const struct net_device *out,
                       int (*okfn)(struct sk_buff *));

The five parameters are passed in by the NF_HOOK macro.
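Putting the pieces together, a minimal module that registers a do-nothing hook might look like the sketch below. It follows the 2.4-era API described above; the function and variable names are our own, and the hook point and priority were chosen arbitrarily for the example.

/* Minimal sketch: register a hook that accepts every packet (Linux 2.4 API). */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

/* Processing function matching the nf_hookfn prototype shown above. */
static unsigned int my_hook(unsigned int hooknum,
                            struct sk_buff **skb,
                            const struct net_device *in,
                            const struct net_device *out,
                            int (*okfn)(struct sk_buff *))
{
        return NF_ACCEPT;                /* let the kernel keep going      */
}

static struct nf_hook_ops my_ops = {
        { NULL, NULL },                  /* list: initialized to {NULL, NULL} */
        my_hook,                         /* hook: our processing function  */
        PF_INET,                         /* pf: IP layer                   */
        NF_IP_LOCAL_IN,                  /* hooknum: one of the five points */
        NF_IP_PRI_FILTER + 1             /* priority: run after the filter */
};

static int __init my_init(void)
{
        return nf_register_hook(&my_ops);
}

static void __exit my_exit(void)
{
        nf_unregister_hook(&my_ops);
}

module_init(my_init);
module_exit(my_exit);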

These are the basics of using Netfilter when writing your own module. Next, let's look at how LVS is implemented on top of it.

3. Netfilter implementation in LVS

Using Netfilter, LVS processes datagrams as follows. A datagram enters the system and, after IP header validation, passes through the first hook, NF_IP_PRE_ROUTING [HOOK1]. A routing decision then determines whether the datagram is to be forwarded or delivered to the local machine. If it is for the local machine, it is processed by the hook function at NF_IP_LOCAL_IN [HOOK2] and handed to the upper-layer protocol; if it is to be forwarded, it is processed at NF_IP_FORWARD [HOOK3]. Forwarded datagrams pass through the final hook, NF_IP_POST_ROUTING [HOOK4], before being transmitted to the network. Locally generated data is processed at NF_IP_LOCAL_OUT [HOOK5], routed, and then also passes through NF_IP_POST_ROUTING [HOOK4] before transmission.

When IPVS starts and the ip_vs module is loaded, the module's initialization function ip_vs_init() registers hook functions at NF_IP_LOCAL_IN [HOOK2], NF_IP_FORWARD [HOOK3], and NF_IP_POST_ROUTING [HOOK4] to process incoming and outgoing packets.
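Schematically, those registrations amount to three nf_hook_ops instances. The sketch below is a simplified illustration, not the actual kernel source: the handler names match sections 3.1 to 3.3, while the priority values are only placeholders.

/* Sketch of what ip_vs_init() registers (priorities illustrative). */
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

nf_hookfn ip_vs_in, ip_vs_out, ip_vs_post_routing;   /* see sections 3.1-3.3 */

static struct nf_hook_ops ip_vs_ops[] = {
    { { NULL, NULL }, ip_vs_in,           PF_INET, NF_IP_LOCAL_IN,     100 },
    { { NULL, NULL }, ip_vs_out,          PF_INET, NF_IP_FORWARD,      100 },
    { { NULL, NULL }, ip_vs_post_routing, PF_INET, NF_IP_POST_ROUTING, 100 },
};

/* ip_vs_init() then calls nf_register_hook() on each entry. */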

3.1 NF_IP_LOCAL_IN processing

A user request for the virtual server arrives at NF_IP_LOCAL_IN [HOOK2] and enters ip_vs_in() for processing. If the datagram is ICMP, ip_vs_in_icmp() is called; otherwise the function checks whether the datagram is TCP/UDP and, if not, returns NF_ACCEPT (letting the kernel continue processing it). For TCP/UDP datagrams, ip_vs_header_check() first validates the header, returning NF_DROP (discard the datagram) on error. Next, ip_vs_conn_in_get() searches the ip_vs_conn_tab hash table for a connection whose client and virtual-server IP addresses, port numbers, and protocol match the datagram. If no connection exists, none has been established yet; if the datagram is a TCP SYN packet or a UDP datagram, the corresponding virtual server is looked up. If that virtual server exists but is fully loaded, NF_DROP is returned; if it exists and has capacity, ip_vs_schedule() is called to select an RS and create a new connection, and if scheduling fails ip_vs_leave() either passes the datagram along or discards it. If a connection already exists, the RS it is bound to is first checked for availability; if the RS is unavailable, the related state is updated and NF_DROP is returned. Once an existing connection has been found or a new one established, bookkeeping such as the incoming packet count is updated. If a datagram transmit function was bound to the connection when it was created, that function is called to transmit the datagram; otherwise NF_ACCEPT is returned.
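The control flow just described can be paraphrased in schematic C. This is a sketch of the logic, not the kernel source: is_icmp(), is_tcp_or_udp(), rs_available(), and update_stats() are hypothetical stand-ins, while the ip_vs_* names come from the prose above.

#include <linux/netfilter.h>             /* NF_ACCEPT, NF_DROP           */

struct sk_buff;
struct ip_vs_conn {
    unsigned int (*packet_xmit)(struct sk_buff *, struct ip_vs_conn *);
};

/* hypothetical stand-ins for the checks described above */
int is_icmp(struct sk_buff *skb);
int is_tcp_or_udp(struct sk_buff *skb);
int rs_available(struct ip_vs_conn *cp);
void update_stats(struct ip_vs_conn *cp);

/* names taken from the prose */
unsigned int ip_vs_in_icmp(struct sk_buff *skb);
int ip_vs_header_check(struct sk_buff *skb);
struct ip_vs_conn *ip_vs_conn_in_get(struct sk_buff *skb);
struct ip_vs_conn *ip_vs_schedule(struct sk_buff *skb);
unsigned int ip_vs_leave(struct sk_buff *skb);

static unsigned int ip_vs_in_sketch(struct sk_buff *skb)
{
    struct ip_vs_conn *cp;

    if (is_icmp(skb))
        return ip_vs_in_icmp(skb);       /* ICMP handled separately      */
    if (!is_tcp_or_udp(skb))
        return NF_ACCEPT;                /* not ours: kernel continues   */
    if (ip_vs_header_check(skb) != 0)
        return NF_DROP;                  /* malformed header             */

    cp = ip_vs_conn_in_get(skb);         /* search ip_vs_conn_tab        */
    if (cp == NULL) {                    /* new connection (SYN or UDP)  */
        cp = ip_vs_schedule(skb);        /* pick an RS, create the conn  */
        if (cp == NULL)
            return ip_vs_leave(skb);     /* pass along or drop           */
    } else if (!rs_available(cp)) {
        return NF_DROP;                  /* bound RS no longer usable    */
    }

    update_stats(cp);                    /* e.g. incoming packet count   */
    if (cp->packet_xmit)
        return cp->packet_xmit(skb, cp); /* NAT / tunnel / DR transmit   */
    return NF_ACCEPT;
}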

ip_vs_in() calls ip_vs_in_icmp() to process ICMP packets. The function first checks the length of the datagram, returning NF_DROP on error. It handles only ICMP errors caused by TCP/UDP transmission problems: destination unreachable, source quench, and time exceeded; all other ICMP messages are left to the kernel. For these three types the checksum is verified, and NF_DROP is returned if it is wrong. Otherwise the embedded error information is analyzed to determine whether a matching connection exists; if it does not, NF_ACCEPT is returned. If the connection exists, the IP address and port number in the embedded error header and the IP address in the ICMP packet header are rewritten from the connection information, the checksums in each header are recomputed, a route is looked up, ip_send() transmits the modified datagram, and NF_STOLEN is returned (the datagram leaves the normal processing path).

ip_vs_schedule(), called from ip_vs_in(), schedules an available RS for the virtual server and establishes the corresponding connection. It selects an RS according to the scheduling algorithm bound to the virtual server and, on success, calls ip_vs_conn_new() to establish the connection. ip_vs_conn_new() performs a series of initializations: it sets the connection's protocol, IP addresses, port numbers, and protocol timeout information; binds the application helper, the RS, and the datagram transmit function; and finally calls ip_vs_conn_hash() to insert the connection into the ip_vs_conn_tab hash table. The transmit function bound to a connection is one of ip_vs_nat_xmit(), ip_vs_tunnel_xmit(), or ip_vs_dr_xmit(), according to the IPVS forwarding method. For example, the main work of ip_vs_nat_xmit() is to rewrite the packet's destination address and port to those of the RS, recompute and set the checksum, and call ip_send() to transmit the modified datagram.
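As described, the transmit routine is bound once at connection setup. A compact illustration of that design, with simplified stand-in types rather than the kernel's own:

struct sk_buff;                               /* kernel packet buffer    */
struct conn;                                  /* stand-in for ip_vs_conn */
typedef int (*xmit_fn)(struct sk_buff *, struct conn *);

int ip_vs_nat_xmit(struct sk_buff *, struct conn *);
int ip_vs_tunnel_xmit(struct sk_buff *, struct conn *);
int ip_vs_dr_xmit(struct sk_buff *, struct conn *);

enum fwd_method { FWD_NAT, FWD_TUNNEL, FWD_DR };

struct conn {
    enum fwd_method method;                   /* set from the IPVS mode  */
    xmit_fn packet_xmit;                      /* called for each packet  */
};

/* Bind the transmit function once, so the per-packet path is a single
 * indirect call instead of a mode check on every datagram. */
static void bind_xmit(struct conn *cp)
{
    switch (cp->method) {
    case FWD_NAT:    cp->packet_xmit = ip_vs_nat_xmit;    break;
    case FWD_TUNNEL: cp->packet_xmit = ip_vs_tunnel_xmit; break;
    case FWD_DR:     cp->packet_xmit = ip_vs_dr_xmit;     break;
    }
}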

3.2 NF_IP_FORWARD processing

After a datagram enters NF_IP_FORWARD, it is processed by ip_vs_out(). This function is called only in NAT mode. It first checks the datagram type: an ICMP datagram is passed directly to ip_vs_out_icmp(), and a datagram that is not TCP/UDP gets NF_ACCEPT. For TCP/UDP datagrams, ip_vs_header_check() validates the header, returning NF_DROP on error. Then ip_vs_conn_out_get() checks whether a connection exists. If none does, ip_vs_lookup_real_service() checks whether the RS that sent the datagram still appears in the hash table; if the RS exists and the packet is a TCP non-RST packet or a UDP packet, icmp_send() sends a destination-unreachable ICMP packet to the RS and NF_STOLEN is returned, while in all other cases NF_ACCEPT is returned. If a connection exists, the datagram checksum is verified, returning NF_DROP on error; if it is correct, the datagram's source address is rewritten to the virtual server's IP address and its source port to the virtual server's port number, the checksum is recomputed and set, and NF_ACCEPT is returned.
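This source-address rewrite is the mirror image of the request rewrite sketched in section 1. Conceptually (illustration only; ip_checksum() is the helper from that earlier sketch, and the TCP/UDP source port and transport checksum must be rewritten as well):

#include <netinet/ip.h>
#include <stdint.h>

uint16_t ip_checksum(struct iphdr *iph);   /* from the earlier sketch      */

/* NAT-mode reply path: the RS reply leaves with the virtual server as
 * its source address (virtual_ip in network byte order). */
static void nat_rewrite_reply(struct iphdr *iph, uint32_t virtual_ip)
{
    iph->saddr = virtual_ip;               /* RS IP -> virtual server IP   */
    iph->check = ip_checksum(iph);         /* header changed: recompute    */
}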

ip_vs_out_icmp() proceeds much like ip_vs_in_icmp(), but it modifies the datagram differently: in the embedded error message, the source address of the IP header and the destination address of the UDP or TCP header are changed to the virtual server's address, and the destination port number of the UDP or TCP header is changed to the virtual server's port number.

3.3 NF_IP_POST_ROUTING processing

The NF_IP_POST_ROUTING hook function is used only in NAT mode. After a datagram enters NF_IP_POST_ROUTING, ip_vs_post_routing() processes it: it first checks whether the datagram has passed through IPVS, returning NF_ACCEPT if not; otherwise the datagram is transmitted immediately and the function returns NF_STOLEN, preventing the datagram from being modified by iptables rules.

4. LVS system configuration and management

During initialization, the IPVS module registers setsockopt()/getsockopt() handlers. The ipvsadm command calls these two functions to pass system configuration data, in the ip_vs_rule_user structure, to the IPVS kernel module, completing system configuration: adding, modifying, and deleting virtual server and RS addresses. The system manages the virtual server and RS linked lists through these operations.

A virtual server is added by ip_vs_add_service(), which inserts a new node into the virtual server hash table according to the hash algorithm, then looks up the scheduling algorithm set by the user and binds it to the node. Modification of a virtual server is done by ip_vs_edit_service(), which changes the scheduling algorithm of the specified server. Deletion is done by ip_vs_del_service(); before a virtual server is deleted, all of its RS must be deleted and the scheduling algorithm bound to it released.

Similarly, RS addition, modification, and deletion are done by ip_vs_add_dest(), ip_vs_edit_dest(), and ip_vs_del_dest().

5. Load balancing scheduling algorithms

As mentioned above, a scheduling algorithm must be bound when a virtual service is added; binding is done by ip_vs_bind_scheduler(), and the algorithm is located by ip_vs_scheduler_get(). ip_vs_scheduler_get() calls ip_vs_sched_getbyname() to look the algorithm up by name in the scheduler queue; if it is not found, the corresponding scheduling algorithm module is loaded and the search is retried, and the result is returned.

The system currently provides eight load-balancing scheduling algorithms (a sketch of the weighted least-connection rule follows the list):

  • rr (Round-Robin): distributes requests to the RS in turn, that is, spreads them evenly across the RS. The algorithm is simple, but it is only suitable when the RS have similar processing capacity.
  • wrr (Weighted Round-Robin): distributes requests according to the weights of the RS. An RS with a higher weight receives requests first and is allocated more connections than one with a lower weight; RS with equal weights receive equal numbers of connections.
  • dh (Destination Hashing): looks up a static hash table keyed by destination address to obtain the required RS.
  • sh (Source Hashing): looks up a static hash table keyed by source address to obtain the required RS.
  • lc (Least-Connection): a table records all active connections, and each new connection request is sent to the RS with the smallest number of connections.
  • wlc (Weighted Least-Connection): let the weight of each RS be Wi (i = 1..n) and its current number of TCP connections be Ti (i = 1..n); the next request is assigned to the RS that minimizes Ti/Wi.
  • lblc (Locality-Based Least-Connection): requests for the same destination address are distributed to the same RS as long as that server is not fully loaded; otherwise the RS with the smallest number of connections is used and becomes the first candidate for the next assignment.
  • lblcr (Locality-Based Least-Connection with Replication): a subset of the RS is maintained for a given destination address, and a request for that address is assigned to the member of the subset with the smallest number of connections. If every server in the subset is fully loaded, a server with a smaller number of connections is chosen from the whole cluster, added to the subset, and given the connection. If the subset has not been modified for a certain period, its most heavily loaded node is removed from it.
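To make the wlc rule concrete, here is a small sketch of the selection step. The struct is a hypothetical stand-in, not the kernel's data structure; cross-multiplying with Ti*Wj < Tj*Wi compares Ti/Wi against Tj/Wj without division:

#include <stddef.h>

struct rs {
    int weight;            /* Wi: configured weight            */
    int active_conns;      /* Ti: current active connections   */
};

/* Return the server minimizing Ti/Wi, or NULL if none is usable. */
static struct rs *wlc_select(struct rs *servers, size_t n)
{
    struct rs *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (servers[i].weight <= 0)
            continue;      /* weight 0 takes a server out of service */
        if (best == NULL ||
            (long)servers[i].active_conns * best->weight <
            (long)best->active_conns * servers[i].weight)
            best = &servers[i];
    }
    return best;
}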
