A few years ago I wrote a load-balancer scheduling algorithm module that takes a fixed-length field at an arbitrary offset into the application-layer payload, hashes it, and maps the result to a back-end server. There was no use for it at the time and I never followed up, until some reflection and a pre-study a few days ago. I decided to write it down now, in case it turns out to be useful later.
1. Load balancing for UDP services
Load balancing is rarely applied to UDP services. HTTP does not strictly require TCP, yet in practice there is virtually no HTTP over UDP. However, as network reliability improves and centralized network control and distributed optimization techniques mature, UDP is being used in more and more scenarios.
Using UDP means you have to do transmission control at the application layer, but that is not the main issue. The main problem is that there is no universally known UDP service: you cannot expect every load balancer to have built-in support for, say, the OpenVPN protocol, whereas TCP-based HTTP services are supported by almost any gateway, because HTTP is a well-known application-layer protocol with recognized standards at every level. With UDP you must also implement some common-sense connection expiration mechanism, because at the UDP level there is no way to detect that a "connection" has gone away. That means either the application layer signals it explicitly, for example by sending a special UDP packet that means "disconnect", or every UDP "connection" is given an idle timeout.
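Just to make the timeout option concrete, here is a minimal user-space sketch, assuming an invented struct udp_session, SESSION_IDLE_TIMEOUT and expire_sessions(); it has nothing to do with the OpenVPN or IPVS code discussed below:

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-"connection" state kept by a UDP application. */
struct udp_session {
    uint32_t session_id;      /* application-layer connection identity   */
    time_t   last_seen;       /* updated on every packet of this session */
    struct udp_session *next; /* simple singly linked session list       */
};

#define SESSION_IDLE_TIMEOUT 60  /* seconds without traffic => expired */

/* Refresh a session whenever one of its packets arrives. */
void session_touch(struct udp_session *s)
{
    s->last_seen = time(NULL);
}

/* Walk the list and drop sessions that have been idle too long. */
struct udp_session *expire_sessions(struct udp_session *head)
{
    time_t now = time(NULL);
    struct udp_session **pp = &head;

    while (*pp) {
        if (now - (*pp)->last_seen > SESSION_IDLE_TIMEOUT) {
            struct udp_session *dead = *pp;
            *pp = dead->next;
            free(dead);   /* the UDP "connection" is considered disconnected */
        } else {
            pp = &(*pp)->next;
        }
    }
    return head;
}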
Despite all these problems, in the mobile era some problems really do require UDP as the transport protocol.
2. Mobile network problems
If you access a service from a phone or a pad, the terminal is moving, so its IP address will change (please do not bring up LISP; that is an ideal, not a deployment). If TCP is the carrying protocol of the service, TCP connections will keep being torn down and re-established, because TCP is bound to the IP addresses. With UDP there is no such problem; the only price is recording connection state at the application layer. What is really missing here is a session layer. Some people disagree, but armchair arguments get us nowhere; implementing such a mechanism and seeing it work is what counts. With that in mind, I did exactly this for OpenVPN.
OpenVPN also uses the 5-tuple to identify a particular client, so when a mobile terminal's IP address changes, the OpenVPN server drops the client connection and the client has to wait for a reconnect. Strictly speaking this is not caused by TCP, but it is the same kind of problem: as long as a connection is identified by its 5-tuple, a change of IP address breaks the connection. So I added a 4-byte, server-assigned SessionID to the head of the OpenVPN protocol to make up for the missing session layer. After that, the OpenVPN server no longer identifies a client connection by its 5-tuple but by this unique SessionID, so in the UDP case, even if the client's IP address changes, the server side does not disconnect, because the SessionID has not changed. Note that this does nothing for TCP mode, because TCP sits at the transport layer and the TCP connection breaks before OpenVPN ever sees the SessionID. One could hide the TCP disconnect/reconnect behind an extra layer wrapped around the accept call, so that the OpenVPN connection (the session-level connection) never breaks, but the workload would have been considerable, so I let that go.
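To make the SessionID idea concrete, here is a sketch of the concept only, not my actual OpenVPN change; struct sid_hdr, lookup_session and the surrounding names are invented for the example:

#include <stdint.h>

/*
 * Hypothetical session header carried at the very start of every UDP
 * payload. The server assigns sid once, at session setup, and the client
 * echoes it in every packet; the 5-tuple is no longer the identity of
 * the connection.
 */
struct sid_hdr {
    uint32_t sid;  /* server-assigned SessionID, fixed for the whole session */
    /* ... the original application-layer protocol follows ... */
} __attribute__((packed));

struct session;
struct session_table;

/* Server-side lookup: identify the session by sid, never by source IP/port. */
struct session *lookup_session(struct session_table *tbl, uint32_t sid);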
Faced with a capability this useful, any whining sounds pale. This is the power of UDP: by introducing one small field (4 bytes, or even 2), it neatly solves the problem of keeping a "long UDP connection" alive across IP address changes, something TCP truly cannot do unless LISP is introduced. OpenVPN can do it, so why can't everything else? In fact, any application-layer protocol can be encapsulated over UDP; connection control (setup, sequencing, retransmission, teardown and so on) is simply pushed up to the layer above. But if the client's IP address keeps changing, can a load balancer still balance on the source IP?
Obviously it can, but it is problematic. Since the same client may change its IP address, the load balancer may send it to a different server, even though its SessionID has not changed. So load balancing can no longer be based on the source IP address. What then? The answer is load balancing based on the SessionID.
3. Load balancing on the application-layer SessionID of a UDP protocol
Step by step we have arrived here, and the question now is how to do it. What is the SessionID? It is not part of any standard protocol. First, you have to guarantee that the packets actually carry such a field; that is usually easy, since I certainly know what service I am configuring. Second, where in the packet is this SessionID? That cannot be dictated by the load balancer. In fact, the so-called SessionID is simply the part of the packet that does not change for the lifetime of one connection, nothing more. So the best approach is to let whoever configures the service decide where it is and how long it is.
Given an offset relative to the start of the application-layer payload and a length, extracting the field and hashing it is as easy as reaching into your own pocket; it is almost the same as hashing the source IP, just a few more instructions. The IPVS code is as follows:
net/netfilter/ipvs/ip_vs_offh.c:
/*
 * IPVS: layer-7 payload hashing scheduling module
 *
 * Authors: zhaoya
 * Based on ip_vs_sh/ip_vs_dh, modified. For detailed comments please see:
 *     net/netfilter/ipvs/ip_vs_sh.c
 *     net/netfilter/ipvs/ip_vs_dh.c
 */
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/ctype.h>
#include <linux/string.h>
#include <net/ip.h>
#include <net/ip_vs.h>

struct ip_vs_offh_bucket {
    struct ip_vs_dest *dest;
};

struct ip_vs_offh_data {
    struct ip_vs_offh_bucket *tbl;
    u32 offset;
    u32 offlen;
};

#define IP_VS_OFFH_TAB_BITS    8
#define IP_VS_OFFH_TAB_SIZE    (1 << IP_VS_OFFH_TAB_BITS)
#define IP_VS_OFFH_TAB_MASK    (IP_VS_OFFH_TAB_SIZE - 1)

/*
 * Global variables:
 * offset: offset of the layer-7 payload field used for the hash
 *         (relative to the layer-7 header)
 * offlen: length of the layer-7 payload field used for the hash
 */
static u32 offset, offlen;

static int skip_atoi(char **s)
{
    int i = 0;

    while (isdigit(**s))
        i = i * 10 + *((*s)++) - '0';
    return i;
}

static inline struct ip_vs_dest *
ip_vs_offh_get(struct ip_vs_offh_bucket *tbl, const char *payload, u32 length)
{
    __be32 v_fold = 0;

    /* The folding algorithm still needs to be optimized */
    v_fold = (payload[0] ^ payload[length >> 2] ^ payload[length])
             * 2654435761UL;
    return (tbl[v_fold & IP_VS_OFFH_TAB_MASK]).dest;
}

static int
ip_vs_offh_assign(struct ip_vs_offh_bucket *tbl, struct ip_vs_service *svc)
{
    int i;
    struct ip_vs_offh_bucket *b;
    struct list_head *p;
    struct ip_vs_dest *dest;

    b = tbl;
    p = &svc->destinations;
    for (i = 0; i < IP_VS_OFFH_TAB_SIZE; i++) {
        if (list_empty(p)) {
            b->dest = NULL;
        } else {
            if (p == &svc->destinations)
                p = p->next;
            dest = list_entry(p, struct ip_vs_dest, n_list);
            atomic_inc(&dest->refcnt);
            b->dest = dest;
            p = p->next;
        }
        b++;
    }
    return 0;
}

static void ip_vs_offh_flush(struct ip_vs_offh_bucket *tbl)
{
    int i;
    struct ip_vs_offh_bucket *b;

    b = tbl;
    for (i = 0; i < IP_VS_OFFH_TAB_SIZE; i++) {
        if (b->dest) {
            atomic_dec(&b->dest->refcnt);
            b->dest = NULL;
        }
        b++;
    }
}

static int ip_vs_offh_init_svc(struct ip_vs_service *svc)
{
    struct ip_vs_offh_data *pdata;
    struct ip_vs_offh_bucket *tbl;

    pdata = kmalloc(sizeof(struct ip_vs_offh_data), GFP_ATOMIC);
    if (pdata == NULL) {
        pr_err("%s(): no memory\n", __func__);
        return -ENOMEM;
    }
    tbl = kmalloc(sizeof(struct ip_vs_offh_bucket) * IP_VS_OFFH_TAB_SIZE,
                  GFP_ATOMIC);
    if (tbl == NULL) {
        kfree(pdata);
        pr_err("%s(): no memory\n", __func__);
        return -ENOMEM;
    }
    pdata->tbl = tbl;
    pdata->offset = 0;
    pdata->offlen = 0;
    svc->sched_data = pdata;
    ip_vs_offh_assign(tbl, svc);
    return 0;
}

static int ip_vs_offh_done_svc(struct ip_vs_service *svc)
{
    struct ip_vs_offh_data *pdata = svc->sched_data;
    struct ip_vs_offh_bucket *tbl = pdata->tbl;

    ip_vs_offh_flush(tbl);
    kfree(tbl);
    kfree(pdata);
    return 0;
}

static int ip_vs_offh_update_svc(struct ip_vs_service *svc)
{
    struct ip_vs_offh_data *pdata = svc->sched_data;
    struct ip_vs_offh_bucket *tbl = pdata->tbl;

    ip_vs_offh_flush(tbl);
    ip_vs_offh_assign(tbl, svc);
    return 0;
}

static inline int is_overloaded(struct ip_vs_dest *dest)
{
    return dest->flags & IP_VS_DEST_F_OVERLOAD;
}

static struct ip_vs_dest *
ip_vs_offh_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
{
    struct ip_vs_dest *dest;
    struct ip_vs_offh_data *pdata;
    struct ip_vs_offh_bucket *tbl;
    struct iphdr *iph;
    void *transport_hdr;
    char *payload;
    u32 hdrlen = 0;
    u32 _offset = 0;
    u32 _offlen = 0;

    iph = ip_hdr(skb);
    hdrlen = iph->ihl * 4;
    if (hdrlen > skb->len)
        return NULL;

    transport_hdr = (void *)iph + hdrlen;

    switch (iph->protocol) {
    case IPPROTO_TCP:
        /* doff counts 32-bit words */
        hdrlen += ((struct tcphdr *)transport_hdr)->doff * 4;
        break;
    case IPPROTO_UDP:
        hdrlen += sizeof(struct udphdr);
        break;
    default:
        return NULL;
    }

#if 0   /* debug: dump the payload bytes the hash would cover */
    {
        int i = 0;

        _offset = offset;
        _offlen = offlen;
        payload = (char *)iph + hdrlen + _offset;
        printk("begin: iplen:%d\n", hdrlen);
        for (i = 0; i < _offlen; i++)
            printk("%02x ", payload[i]);
        printk("\nend\n");
        return NULL;
    }
#endif

    pdata = (struct ip_vs_offh_data *)svc->sched_data;
    tbl = pdata->tbl;
    _offset = offset;   /* pdata->offset */
    _offlen = offlen;   /* pdata->offlen */
    if (_offlen + _offset > skb->len - hdrlen) {
        IP_VS_ERR_RL("OFFH: offset+length exceeds packet\n");
        return NULL;
    }
    payload = (char *)iph + hdrlen + _offset;

    dest = ip_vs_offh_get(tbl, payload, _offlen);
    if (!dest
        || !(dest->flags & IP_VS_DEST_F_AVAILABLE)
        || atomic_read(&dest->weight) <= 0
        || is_overloaded(dest)) {
        IP_VS_ERR_RL("OFFH: no destination available\n");
        return NULL;
    }

    return dest;
}

static struct ip_vs_scheduler ip_vs_offh_scheduler = {
    .name           = "offh",
    .refcnt         = ATOMIC_INIT(0),
    .module         = THIS_MODULE,
    .n_list         = LIST_HEAD_INIT(ip_vs_offh_scheduler.n_list),
    .init_service   = ip_vs_offh_init_svc,
    .done_service   = ip_vs_offh_done_svc,
    .update_service = ip_vs_offh_update_svc,
    .schedule       = ip_vs_offh_schedule,
};

static ssize_t ipvs_sch_offset_read(struct file *file, char __user *buf,
                                    size_t count, loff_t *ppos)
{
    /* quick-and-dirty proc handler: the buffer is written directly */
    return sprintf(buf, "offset:%u;offlen:%u\n", offset, offlen);
}

/*
 * Set offset/offset length:
 * echo "offset:$value1;offlen:$value2" >/proc/net/ipvs_sch_offset
 */
static ssize_t ipvs_sch_offset_write(struct file *file, const char __user *buf,
                                     size_t count, loff_t *ppos)
{
    ssize_t ret = count;
    char *p = (char *)buf, *pstart;  /* parsed in place for simplicity */

    if ((p = strstr(p, "offset:")) == NULL) {
        ret = -EINVAL;
        goto out;
    }
    p += strlen("offset:");
    pstart = p;
    if ((p = strstr(p, ";")) == NULL) {
        ret = -EINVAL;
        goto out;
    }
    p[0] = 0;
    offset = skip_atoi(&pstart);
    if (offset == 0 && strcmp(pstart, "0")) {
        ret = -EINVAL;
        goto out;
    }
    p += strlen(";");
    if ((p = strstr(p, "offlen:")) == NULL) {
        ret = -EINVAL;
        goto out;
    }
    p += strlen("offlen:");
    pstart = p;
    offlen = skip_atoi(&pstart);
    if (offlen == 0 && strcmp(pstart, "0")) {
        ret = -EINVAL;
        goto out;
    }
out:
    return ret;
}

/*
 * I did not want to modify the user-space configuration interface (ipvsadm),
 * and procfs feels like the simpler way to pass offset/offlen in.
 */
static const struct file_operations ipvs_sch_offset_file_ops = {
    .owner  = THIS_MODULE,
    .read   = ipvs_sch_offset_read,
    .write  = ipvs_sch_offset_write,
};

struct net *net = &init_net;

static int __init ip_vs_offh_init(void)
{
    int ret = -1;

    if (!proc_create("ipvs_sch_offset", 0644, net->proc_net,
                     &ipvs_sch_offset_file_ops)) {
        printk("OFFH: create proc entry failed\n");
        goto out;
    }
    return register_ip_vs_scheduler(&ip_vs_offh_scheduler);
out:
    return ret;
}

static void __exit ip_vs_offh_cleanup(void)
{
    remove_proc_entry("ipvs_sch_offset", net->proc_net);
    unregister_ip_vs_scheduler(&ip_vs_offh_scheduler);
}

module_init(ip_vs_offh_init);
module_exit(ip_vs_offh_cleanup);
MODULE_LICENSE("GPL");
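For completeness, usage would look roughly like this once the module is built and loaded. The addresses here are made up, the offset/offlen values assume a protocol whose SessionID is a 4-byte field starting 4 bytes into the UDP payload, and I am assuming your ipvsadm simply passes an unknown scheduler name such as "offh" through to the kernel:

echo "offset:4;offlen:4" > /proc/net/ipvs_sch_offset
ipvsadm -A -u 10.0.0.1:5000 -s offh
ipvsadm -a -u 10.0.0.1:5000 -r 192.168.1.10:5000 -m
ipvsadm -a -u 10.0.0.1:5000 -r 192.168.1.11:5000 -m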
In fact, many large load-balancer implementations are not based on the kernel protocol stack at all; they use either smart NICs or a user-space stack. The principle in this article applies there just as well, but the only thing I can use, and keep simple, is Linux IPVS. After all, running code beats a long-winded explanation, at least in my opinion.
4. Where the problem is: the connection cache
I think the IPVS mechanism should be changed here, and I think nf_conntrack should be changed too.
We know that in IPVS, only the first packet of a flow invokes the conn_schedule callback of the "specific protocol"; after a destination, a real server, has been chosen, the result is stored in that protocol's conn cache. If you look at what this "specific protocol" actually is, you will find it is the layer-4, transport protocol, TCP or UDP, and at that layer a connection is of course a 5-tuple. So even after I have selected a real server for the first packet of a flow and stored it in the conn cache, once the client's IP address changes, its packets obviously no longer match that cache entry, and they fall through to conn_schedule again. Because scheduling hashes the payload at a fixed offset, the same real server is selected, a new entry is added to the conn cache for future lookups, and the old entry becomes useless and just waits to expire. As long as the client keeps this IP address and the new conn cache entry survives, the cache keeps hitting; when the client changes its IP address again, everything starts over. So the process is automatic and correct. Still, it would be better to have a delete-notification mechanism for the old 5-tuple entry instead of waiting for it to time out by itself.
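As a toy model of the behaviour just described, with all names invented (this is not IPVS code, just the logic of "look up by 5-tuple, fall back to payload hashing on a miss"):

#include <stdint.h>

/* Toy 5-tuple and conn cache entry. */
struct tuple5 { uint32_t saddr, daddr; uint16_t sport, dport; uint8_t proto; };

struct conn_entry {
    struct tuple5 key;          /* the 5-tuple the entry is bound to   */
    int           real_server;  /* destination chosen by the scheduler */
    int           in_use;
};

#define CONN_TAB_SIZE 256
struct conn_entry conn_tab[CONN_TAB_SIZE];

/* Stand-in for the scheduler: hash the payload field (the SessionID). */
int schedule_by_payload(uint32_t session_id, int nr_servers)
{
    return (int)((session_id * 2654435761UL) % (uint32_t)nr_servers);
}

/* Per-packet path: 5-tuple lookup first, scheduler only on a miss. */
int handle_packet(const struct tuple5 *t, uint32_t session_id, int nr_servers)
{
    unsigned slot = (t->saddr ^ t->daddr ^ t->sport ^ t->dport) % CONN_TAB_SIZE;
    struct conn_entry *e = &conn_tab[slot];

    if (e->in_use &&
        e->key.saddr == t->saddr && e->key.daddr == t->daddr &&
        e->key.sport == t->sport && e->key.dport == t->dport &&
        e->key.proto == t->proto)
        return e->real_server;              /* conn cache hit */

    /* Miss: the client changed its IP, or this really is a new flow.   */
    /* Any entry under the old 5-tuple is left behind as a zombie.      */
    e->key = *t;
    e->real_server = schedule_by_payload(session_id, nr_servers);
    e->in_use = 1;
    return e->real_server;
}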
And if you just wait for it to expire, consider how long such a timeout can be.
Client A, with 5-tuple tuple1 and carrying sessionID1, is matched to real server1 and the conn cache entry conn1 is created. Some time later client A changes its IP address; naturally it no longer matches conn1, the cache misses, conn_schedule selects the same real server1 based on sessionID1 and installs a new entry conn2, and conn1 becomes a zombie waiting for its timeout. Much later, client B takes over client A's old IP address and UDP port and accesses the same UDP service. Client B's 5-tuple is exactly tuple1, but it carries sessionID2, and by the sessionID2 hash it should be scheduled to real server2. However, it hits the zombie conn1 and is therefore sent to real server1. If client B then changes its IP address, its 5-tuple becomes tuple2', conn_schedule runs, and it is matched to real server2; since the real server that had been serving it was real server1, the connection is switched to a different server. This is the problem caused by the absence of a notification mechanism.
The solution seems relatively simple, and there are two options. The first is to add 5-tuple change notification callbacks to the ip_vs_protocol structure, called, say, conn_in_update/conn_out_update, or more bluntly conn_in_delete/conn_out_delete. The second, more thorough solution is to track connections directly by SessionID, and that latter one is the surgery I am preparing to perform on Linux's nf_conntrack.
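A sketch of what the first option could look like; these callbacks do not exist in the kernel's struct ip_vs_protocol today, so the snippet uses a stand-in structure purely to illustrate the kind of hook I have in mind:

/* Stand-in for an IPVS cached connection (5-tuple -> real server). */
struct ip_vs_conn_stub;

/*
 * Proposed additions: when a logical (SessionID-identified) connection
 * shows up with a new 5-tuple, the protocol handler is told about the
 * old entry instead of leaving it to expire as a zombie.
 */
struct ip_vs_protocol_ext {
    /* called when the 5-tuple of a logical connection changes */
    void (*conn_in_update)(struct ip_vs_conn_stub *old_cp,
                           struct ip_vs_conn_stub *new_cp);
    void (*conn_out_update)(struct ip_vs_conn_stub *old_cp,
                            struct ip_vs_conn_stub *new_cp);
    /* or, more bluntly, delete the stale inbound/outbound entries at once */
    void (*conn_in_delete)(struct ip_vs_conn_stub *old_cp);
    void (*conn_out_delete)(struct ip_vs_conn_stub *old_cp);
};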
Of course, I am not going to insist on completely abandoning 5-tuple connection tracking; I just want to add one more choice to nf_conntrack, the same way conntrack once added zone support. I believe that even if I do not do it, someone will within a year or so; past experience points that way.