We want to build a packet-forwarding module similar to LVS, so we have been studying the LVS architecture and code; this series summarizes what we found. First, a recommendation: the blog http://yfydz.cublog.cn explains LVS and IPSec very well.
Several important data structures are as follows:
ip_vs_conn: represents a connection, identified by a tuple of caddr (client address, CIP), vaddr (virtual service address, VIP), daddr (destination RealServer address, DIP), cport (client port), vport, dport, and protocol.
ip_vs_service: represents a virtual service. In LVS, a virtual service is a virtual IP address plus a port. It is the service entry point, with a set of RealServers behind it across which the load is balanced. ip_vs_service includes protocol, addr, and port; struct list_head destinations and __u32 num_dests hold the linked list of RealServers and their count.
ip_vs_dest: represents a RealServer. addr, port, and weight are the RealServer's IP address, port, and weight. struct dst_entry *dst_cache caches the route from LVS to the RealServer; in my opinion, this should only matter in NAT and tunnel modes. vaddr, vport, and protocol are the address, port, and protocol of the virtual service the RealServer belongs to.
ip_vs_scheduler: the base class of all schedulers, used to schedule an ip_vs_service. Its most important method is struct ip_vs_dest *(*schedule)(struct ip_vs_service *svc, const struct sk_buff *skb), which selects one ip_vs_dest from the destinations list under the ip_vs_service and returns it.
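To make the relationship between a service, its RealServers, and a scheduler concrete, here is a minimal userspace sketch. It is not the kernel code: the sketch_ structures are simplified stand-ins (a plain next pointer replaces list_head, and the last field is an assumption used to hold scheduler state), and the scheduling policy shown is plain round-robin that skips dests with weight 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures. Field names follow the
 * text; everything else is an assumption made for illustration. */
struct sketch_dest {
    uint32_t addr;             /* RealServer IP (DIP) */
    uint16_t port;
    int weight;                /* 0 means "do not schedule" in this sketch */
    struct sketch_dest *next;  /* stands in for the kernel's list_head */
};

struct sketch_service {
    uint16_t protocol;
    uint32_t addr;             /* VIP */
    uint16_t port;
    struct sketch_dest *destinations;
    uint32_t num_dests;
    struct sketch_dest *last;  /* hypothetical scheduler state: previous pick */
};

/* Plain round-robin: start after the previous pick and return the first
 * dest with a positive weight, or NULL if no dest is usable. */
struct sketch_dest *sketch_rr_schedule(struct sketch_service *svc)
{
    struct sketch_dest *d;
    uint32_t i;

    if (svc->destinations == NULL)
        return NULL;
    d = (svc->last && svc->last->next) ? svc->last->next
                                       : svc->destinations;
    for (i = 0; i < svc->num_dests; i++) {
        if (d->weight > 0) {
            svc->last = d;
            return d;
        }
        d = d->next ? d->next : svc->destinations;  /* wrap around */
    }
    return NULL;
}
```

With three dests a, b, c where b has weight 0, successive calls return a, c, a, and so on.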
static int __init ip_vs_init(void) initializes ip_vs.ko, the core module of LVS:
ip_vs_control_init calls nf_register_sockopt to register a struct nf_sockopt_ops, and ip_vs_genl_register registers the struct genl_ops ip_vs_genl_ops[] array, which holds the commands used for control through Netlink.
ip_vs_protocol_init registers ip_vs_protocol_tcp, ip_vs_protocol_udp, ip_vs_protocol_ah, and ip_vs_protocol_esp.
ip_vs_conn_init first calls vmalloc to allocate a 64 KB array that holds the connection hash table, that is, 4096 struct list_head buckets (16 bytes each on a 64-bit kernel).
Finally, LVS calls nf_register_hooks to register its own hooks with netfilter. LVS has four hooks (not counting IPv6):
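As a rough illustration of a connection hash table with 4096 list_head buckets, the following userspace sketch allocates the bucket array with malloc and does insert and lookup on a reduced key (protocol, client address, client port). The hash function here is a toy chosen for brevity, not the one the kernel uses, and all sketch_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_TAB_BITS 12
#define SKETCH_TAB_SIZE (1u << SKETCH_TAB_BITS)   /* 4096 buckets */
#define SKETCH_TAB_MASK (SKETCH_TAB_SIZE - 1)

struct sketch_list_head { struct sketch_list_head *next, *prev; };

struct sketch_conn {
    struct sketch_list_head c_list;  /* kept first so we can cast back */
    uint16_t protocol;
    uint32_t caddr;
    uint16_t cport;
};

static struct sketch_list_head *sketch_tab;

/* Allocate the bucket array and make every bucket an empty circular list. */
int sketch_conn_init(void)
{
    uint32_t i;

    sketch_tab = malloc(SKETCH_TAB_SIZE * sizeof(*sketch_tab));
    if (sketch_tab == NULL)
        return -1;
    for (i = 0; i < SKETCH_TAB_SIZE; i++)
        sketch_tab[i].next = sketch_tab[i].prev = &sketch_tab[i];
    return 0;
}

/* Toy hash: mixes the key and masks it down to a bucket index. */
static uint32_t sketch_hashkey(uint16_t proto, uint32_t addr, uint16_t port)
{
    uint32_t h = addr ^ ((uint32_t)port << 16) ^ proto;

    h ^= h >> 16;
    return h & SKETCH_TAB_MASK;
}

/* Insert at the head of the bucket's list. */
void sketch_conn_hash(struct sketch_conn *cp)
{
    struct sketch_list_head *head =
        &sketch_tab[sketch_hashkey(cp->protocol, cp->caddr, cp->cport)];

    cp->c_list.next = head->next;
    cp->c_list.prev = head;
    head->next->prev = &cp->c_list;
    head->next = &cp->c_list;
}

/* Walk one bucket and compare the full key; NULL if nothing matches. */
struct sketch_conn *sketch_conn_get(uint16_t proto, uint32_t caddr,
                                    uint16_t cport)
{
    struct sketch_list_head *head =
        &sketch_tab[sketch_hashkey(proto, caddr, cport)];
    struct sketch_list_head *p;

    for (p = head->next; p != head; p = p->next) {
        struct sketch_conn *cp = (struct sketch_conn *)p;

        if (cp->protocol == proto && cp->caddr == caddr &&
            cp->cport == cport)
            return cp;
    }
    return NULL;
}
```

The real table is keyed on more of the connection tuple and is protected by locks; both are left out here to keep the bucket structure visible.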
static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
    /* After packet filtering, forward packet through VS/DR, VS/TUN,
     * or VS/NAT(change destination), so that filtering rules can be
     * applied to IPVS. */
    {
        .hook     = ip_vs_in,
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_LOCAL_IN,
        .priority = 100,
    },
    /* After packet filtering, change source only for VS/NAT */
    {
        .hook     = ip_vs_out,
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_FORWARD,
        .priority = 100,
    },
    /* After packet filtering (but before ip_vs_out_icmp), catch icmp
     * destined for 0.0.0.0/0, which is for incoming IPVS connections */
    {
        .hook     = ip_vs_forward_icmp,
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_FORWARD,
        .priority = 99,
    },
    /* Before the netfilter connection tracking, exit from POST_ROUTING */
    {
        .hook     = ip_vs_post_routing,
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_POST_ROUTING,
        .priority = NF_IP_PRI_NAT_SRC-1,
    },
};
Whichever mode LVS runs in (VS/DR, VS/TUN, or VS/NAT), the VIP is configured on the LVS host, so traffic addressed to the VIP first arrives at NF_INET_LOCAL_IN, which invokes ip_vs_in:
static unsigned int
ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
         const struct net_device *in, const struct net_device *out,
         int (*okfn)(struct sk_buff *))
{
    ...
    // ip_vs_in only processes packets addressed to the local machine
    if (unlikely(skb->pkt_type != PACKET_HOST)) {
        IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s ignored\n",
                      skb->pkt_type,
                      iph.protocol,
                      IP_VS_DBG_ADDR(af, &iph.daddr));
        return NF_ACCEPT;
    }
    ...
    /*
     * Check if the packet belongs to an existing connection entry
     */
    // conn_in_get is implemented per protocol; for TCP it calls tcp_conn_in_get to find an ip_vs_conn
    cp = pp->conn_in_get(af, skb, pp, &iph, iph.len, 0);
    if (unlikely(!cp)) {
        int v;
        /* For local client packets, it could be a response */
        cp = pp->conn_out_get(af, skb, pp, &iph, iph.len, 0); // look for an existing connection in the outgoing direction
        if (cp)
            return handle_response(af, skb, pp, cp, iph.len); // mainly performs SNAT
        // For TCP this runs tcp_conn_schedule: pick a RealServer for the client and
        // save the ip_vs_conn, so later packets are forwarded based on that connection.
        if (!pp->conn_schedule(af, skb, pp, &v, &cp))
            return v;
    }
    ...
    /* Check the server status */
    if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
        /* the destination server is not available */
        if (sysctl_ip_vs_expire_nodest_conn) {
            /* try to expire the connection immediately */
            ip_vs_conn_expire_now(cp);
        }
        /* don't restart its timer, and silently
           drop the packet. */
        __ip_vs_conn_put(cp); // the RealServer behind it is unavailable, so release this ip_vs_conn
        return NF_DROP;
    }
    ip_vs_in_stats(cp, skb);
    restart = ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pp); // for TCP, calls tcp_state_transition to advance the connection state machine
    if (cp->packet_xmit)
        ret = cp->packet_xmit(skb, cp, pp); // send method depends on the mode, e.g. NAT calls ip_vs_nat_xmit, DR calls ip_vs_dr_xmit
        /* do not touch skb anymore */
    else {
        IP_VS_DBG_RL("warning: packet_xmit is null");
        ret = NF_ACCEPT;
    }
    ...
}
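The last step of ip_vs_in, dispatching through cp->packet_xmit, can be pictured with the following userspace sketch. It only models what each forwarding mode changes: NAT rewrites the destination IP, TUN adds an outer IP header, and DR changes only the link-layer destination. All names (sketch_pkt, sketch_fwd_conn, sketch_bind_xmit, the SK_ verdicts) are hypothetical, and treating the RealServer as the direct L2 next hop is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

enum { SK_DROP, SK_ACCEPT, SK_STOLEN };
enum sketch_mode { SK_NAT, SK_TUN, SK_DR };

struct sketch_pkt {
    uint32_t daddr;        /* destination IP in the original IP header */
    uint32_t outer_daddr;  /* set only when the packet is encapsulated */
    uint32_t l2_next_hop;  /* host whose MAC address the frame goes to */
};

struct sketch_fwd_conn {
    uint32_t vaddr, daddr;
    int dest_available;
    int (*packet_xmit)(struct sketch_pkt *, struct sketch_fwd_conn *);
};

static int sketch_nat_xmit(struct sketch_pkt *p, struct sketch_fwd_conn *cp)
{
    p->daddr = cp->daddr;        /* DNAT: VIP becomes the RealServer IP */
    p->l2_next_hop = cp->daddr;
    return SK_STOLEN;
}

static int sketch_tun_xmit(struct sketch_pkt *p, struct sketch_fwd_conn *cp)
{
    p->outer_daddr = cp->daddr;  /* IP-in-IP: inner header is untouched */
    p->l2_next_hop = cp->daddr;
    return SK_STOLEN;
}

static int sketch_dr_xmit(struct sketch_pkt *p, struct sketch_fwd_conn *cp)
{
    p->l2_next_hop = cp->daddr;  /* only the link-layer target changes */
    return SK_STOLEN;
}

/* Chosen once, when the connection is created. */
void sketch_bind_xmit(struct sketch_fwd_conn *cp, enum sketch_mode mode)
{
    switch (mode) {
    case SK_NAT: cp->packet_xmit = sketch_nat_xmit; break;
    case SK_TUN: cp->packet_xmit = sketch_tun_xmit; break;
    case SK_DR:  cp->packet_xmit = sketch_dr_xmit;  break;
    }
}

/* Mirrors the tail of the listing above: drop if the dest is gone,
 * otherwise transmit through the bound method, or accept if none is set. */
int sketch_in(struct sketch_pkt *p, struct sketch_fwd_conn *cp)
{
    if (!cp->dest_available)
        return SK_DROP;
    if (cp->packet_xmit)
        return cp->packet_xmit(p, cp);
    return SK_ACCEPT;
}
```

Binding the function pointer per connection means the hot path never has to branch on the forwarding mode.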
VS/DR and VS/TUN are one-arm modes; only VS/NAT is a two-arm mode. In VS/NAT mode, LVS is the next hop for the RealServer's return packets, so ip_vs_out is registered on NF_INET_FORWARD to process return packets in NAT mode.
static unsigned int
ip_vs_out(unsigned int hooknum, struct sk_buff *skb,
          const struct net_device *in, const struct net_device *out,
          int (*okfn)(struct sk_buff *))
{
    struct ip_vs_iphdr iph;
    struct ip_vs_protocol *pp;
    struct ip_vs_conn *cp;
    int af;
    ....
    ip_vs_fill_iphdr(af, skb_network_header(skb), &iph); // fill in the ip_vs_iphdr from the IP header
    // ICMP handling: the main logic is in ip_vs_out_icmp, which processes ICMP in the outgoing direction
    if (unlikely(iph.protocol == IPPROTO_ICMP)) {
        int related, verdict = ip_vs_out_icmp(skb, &related);
        if (related)
            return verdict;
        ip_vs_fill_iphdr(af, skb_network_header(skb), &iph);
    }
    ....
    // For an IP fragment, call ip_vs_gather_frags to reassemble the full packet first;
    // see the frag/defrag code in the kernel IP layer for details.
    if (unlikely(ip_hdr(skb)->frag_off & htons(IP_MF|IP_OFFSET) &&
                 !pp->dont_defrag)) {
        if (ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT))
            return NF_STOLEN;
        ip_vs_fill_iphdr(af, skb_network_header(skb), &iph);
    }
    /*
     * Check if the packet belongs to an existing entry
     */
    cp = pp->conn_out_get(af, skb, pp, &iph, iph.len, 0); // look for an existing connection
    if (unlikely(!cp)) {
        if (sysctl_ip_vs_nat_icmp_send &&
            (pp->protocol == IPPROTO_TCP ||
             pp->protocol == IPPROTO_UDP)) {
            __be16 _ports[2], *pptr;
            pptr = skb_header_pointer(skb, iph.len,
                                      sizeof(_ports), _ports);
            if (pptr == NULL)
                return NF_ACCEPT; /* Not for me */
            // Check whether the source is a RealServer in the LVS hash table;
            // if so, reply with ICMP port unreachable.
            if (ip_vs_lookup_real_service(af, iph.protocol,
                                          &iph.saddr,
                                          pptr[0])) {
                /*
                 * Notify the real server: there is no
                 * existing entry if it is not RST
                 * packet or not TCP packet.
                 */
                if (iph.protocol != IPPROTO_TCP
                    || !is_tcp_reset(skb, iph.len)) {
                    icmp_send(skb,
                              ICMP_DEST_UNREACH,
                              ICMP_PORT_UNREACH, 0);
                    return NF_DROP;
                }
            }
        }
        IP_VS_DBG_PKT(12, pp, skb, 0,
                      "packet continues traversal as normal");
        return NF_ACCEPT;
    }
    return handle_response(af, skb, pp, cp, iph.len); // handle_response is what actually performs SNAT
}
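The branch taken when ip_vs_out finds no connection is easy to misread because of the nested conditions, so here is the same decision flattened into one small userspace function. The function name, its parameters, and the SK_OUT_ verdicts are hypothetical; the inputs stand for sysctl_ip_vs_nat_icmp_send, the protocol number, the result of ip_vs_lookup_real_service, and the result of is_tcp_reset.

```c
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP */

enum { SK_OUT_ACCEPT, SK_OUT_DROP_PORT_UNREACH };

/* Decide what to do with an outgoing packet that matches no connection.
 * A packet from a known RealServer gets an ICMP port-unreachable reply and
 * is dropped, unless it is a TCP RST; everything else continues normally. */
int sketch_out_no_conn(int nat_icmp_send, int protocol,
                       int from_real_server, int is_tcp_rst)
{
    if (!nat_icmp_send)
        return SK_OUT_ACCEPT;
    if (protocol != IPPROTO_TCP && protocol != IPPROTO_UDP)
        return SK_OUT_ACCEPT;
    if (!from_real_server)
        return SK_OUT_ACCEPT;
    if (protocol == IPPROTO_TCP && is_tcp_rst)
        return SK_OUT_ACCEPT;
    return SK_OUT_DROP_PORT_UNREACH;
}
```

For example, a UDP packet from a RealServer with no matching connection is dropped with an ICMP reply, while a TCP RST from the same server is accepted and continues through the stack.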
static unsigned int
handle_response(int af, struct sk_buff *skb, struct ip_vs_protocol *pp,
                struct ip_vs_conn *cp, int ihl)
{
    if (!skb_make_writable(skb, ihl)) /* the skb must be made writable before it is modified */
        goto drop;
    /* mangle the packet */
    if (pp->snat_handler && !pp->snat_handler(skb, pp, cp)) /* for TCP this calls tcp_snat_handler, which mainly rewrites the TCP header and updates its checksum */
        goto drop;
    ip_hdr(skb)->saddr = cp->vaddr.ip; /* SNAT: replace the packet's source IP with the virtual IP */
    ip_send_check(ip_hdr(skb)); /* recompute the IP header checksum */
    /* For policy routing, packets originating from this
     * machine itself may be routed differently to packets
     * passing through.  We want this packet to be routed as
     * if it came from this machine itself.  So re-compute
     * the routing information.
     */
    if (ip_route_me_harder(skb, RTN_LOCAL) != 0) /* the source IP is now a local address, so this is no longer a plain forwarded packet and the route must be recomputed */
        goto drop;
    ip_vs_out_stats(cp, skb);
    ip_vs_set_state(cp, IP_VS_DIR_OUTPUT, skb, pp); /* for TCP, calls tcp_state_transition */
    ip_vs_conn_put(cp);
    skb->ipvs_property = 1; /* mark this skb as already processed by LVS */
    LeaveFunction(11);
    return NF_ACCEPT;
drop:
    ip_vs_conn_put(cp);
    kfree_skb(skb);
    return NF_STOLEN;
}
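The two IP-level steps in handle_response, rewriting saddr and then redoing the header checksum, can be tried out in userspace on a raw IPv4 header. This is a sketch under assumptions: sketch_ip_checksum and sketch_snat are hypothetical helpers that work on a byte buffer, not the kernel's ip_send_check, and the transport-layer checksum update done by snat_handler is not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum over an IPv4 header: sum the 16-bit big-endian words,
 * fold the carries back in, and return the one's complement. */
uint16_t sketch_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* SNAT on a raw header: overwrite the source address (bytes 12..15) with
 * the VIP, zero the checksum field (bytes 10..11), then recompute it. */
void sketch_snat(uint8_t *hdr, const uint8_t vip[4])
{
    size_t ihl = (size_t)(hdr[0] & 0x0f) * 4;  /* header length in bytes */
    uint16_t csum;

    memcpy(hdr + 12, vip, 4);
    hdr[10] = 0;
    hdr[11] = 0;
    csum = sketch_ip_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

A quick sanity check is that running sketch_ip_checksum over a header whose checksum field is already filled in yields 0.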
LVS also registers the hook function ip_vs_post_routing on the NF_INET_POST_ROUTING chain with priority NF_IP_PRI_NAT_SRC-1. It therefore runs before iptables SNAT and checks whether LVS has already processed the skb. If so, the remaining netfilter hooks at that hook point are skipped.
/*
 * It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
 * chain, and is used for VS/NAT.
 * It detects packets for VS/NAT connections and sends the packets
 * immediately. This can avoid that iptable_nat mangles the packets
 * for VS/NAT.
 */
static unsigned int ip_vs_post_routing(unsigned int hooknum,
                                       struct sk_buff *skb,
                                       const struct net_device *in,
                                       const struct net_device *out,
                                       int (*okfn)(struct sk_buff *))
{
    if (!skb->ipvs_property)
        return NF_ACCEPT;
    /* The packet was sent from IPVS, exit this chain */
    return NF_STOP;
}
In the netfilter framework, the NF_HOOK macro calls nf_hook_slow, which in turn calls nf_iterate. That is, for a given hooknum under a given pf, all registered hook functions are traversed in priority order. The packet is accepted only if every function returns NF_ACCEPT, or as soon as any function returns NF_STOP. Unlike NF_ACCEPT, NF_STOP skips the remaining functions at that hook point.
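The difference between NF_ACCEPT and NF_STOP is easiest to see in a small simulation of the traversal. The sketch below is not the kernel's nf_iterate: the sketch_ types and functions are hypothetical, the hook array is assumed to be already sorted by ascending priority, and only the verdicts discussed here are modeled. The numeric verdict values mirror the kernel's NF_DROP (0), NF_ACCEPT (1), NF_STOLEN (2), and NF_STOP (5).

```c
#include <stddef.h>

enum { SK_NF_DROP = 0, SK_NF_ACCEPT = 1, SK_NF_STOLEN = 2, SK_NF_STOP = 5 };

struct sketch_skb {
    int ipvs_property;  /* set by the LVS sketch once it has handled the skb */
    int nat_calls;      /* how many times the NAT hook saw this skb */
};

struct sketch_hook {
    int (*fn)(struct sketch_skb *);
    int priority;       /* lower values run first */
};

/* Stand-in for ip_vs_post_routing: stop the chain for LVS packets. */
int sketch_post_routing(struct sketch_skb *skb)
{
    return skb->ipvs_property ? SK_NF_STOP : SK_NF_ACCEPT;
}

/* Stand-in for the iptables SNAT hook: just records that it ran. */
int sketch_nat_src(struct sketch_skb *skb)
{
    skb->nat_calls++;
    return SK_NF_ACCEPT;
}

/* Walk the hooks in order. ACCEPT moves on to the next hook, STOP accepts
 * immediately and skips the rest, anything else ends the traversal. */
int sketch_iterate(const struct sketch_hook *hooks, size_t n,
                   struct sketch_skb *skb)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int verdict = hooks[i].fn(skb);

        if (verdict == SK_NF_STOP)
            return SK_NF_ACCEPT;
        if (verdict != SK_NF_ACCEPT)
            return verdict;
    }
    return SK_NF_ACCEPT;
}
```

With sketch_post_routing at priority 99 and sketch_nat_src at priority 100, an skb with ipvs_property set is accepted without the NAT hook ever running, while any other skb passes through both hooks.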