This topic sounds big, but really I just want to describe a few features I personally care about, and not in much detail. As usual, the goal is mainly to clarify the ideas, not to analyze the source code. This is for the day I suddenly forget all of it: a quick skim should be enough to bring back what I understood at the time. If I wrote it in too much detail, I would not be able to follow it myself.
Lockless TCP Listener
Let's start with TCP syncookies. If the syncookie mechanism could always be used, that would be great, but it can't, because it loses a lot of option-negotiation information, and that information is critical to TCP performance. Syncookies mainly defend against half-open SYN flood attacks: a huge number of nodes send SYN packets and then simply walk away, while the attacked stack creates a request for every SYN it receives and binds it to the request queue of the listener the SYN was aimed at. This consumes a lot of memory.
But think about it: leaving option negotiation aside, for the SYN/SYN-ACK exchange the three-way handshake really only needs to find a listener. As long as one exists, the SYN-ACK can be constructed directly from the SYN packet; the listener itself is not needed beyond that. To remember the information from the second handshake packet there are two approaches. The first is the syncookie mechanism: encode the state into the SYN-ACK and let the peer echo it back; when the third handshake packet (the ACK) arrives, TCP decodes the state from its sequence number, constructs a child socket, and inserts it into the listener's accept queue. The second is to allocate memory locally and record the client's information for this connection; when the third handshake packet arrives, find this request, construct a child socket, and insert it into the listener's accept queue.
Before 4.4, a request belonged to a listener; that is, a listener had a request queue, and every request that was constructed had to operate on the listener itself. The 4.4 kernel introduces a breakthrough: a new socket is constructed from the request and inserted into the global socket hash table! This socket merely keeps a lightweight reference to its listener. When the third handshake packet (the ACK) arrives, the socket hash table lookup finds not the listener itself but the new socket constructed when the SYN arrived, so the traditional logic can be liberated from the listener:
The traditional TCP stack receives a packet like this:

    sk = lookup(skb);
    lock_sk(sk);
    if (sk is listener) then
        process_handshake(sk, skb);
    else
        process_data(skb);
    endif
    unlock_sk(sk);
As you can see, the period during which sk is locked becomes a bottleneck: all of the handshake logic is handled while the lock is held. The 4.4 kernel changes all of this. Here is the new logic:
    sk = lookup_from_global(skb);
    if (sk is listener) then
        rv = process_syn(skb);
        new_sk = build_synack_sk(skb, rv);
        new_sk.listener = sk;
        new_sk.state = SYNRECV;
        insert_sk_into_global(new_sk);
        send_synack(skb);
        goto done;
    else if (sk.state == SYNRECV) then
        listener = sk.listener;
        child_sk = build_child_sk(skb, sk);
        remove_sk_from_global(sk);
        add_sk_into_acceptq(listener, child_sk);
    fi
    lock_sk(sk);
    process_data(skb);
    unlock_sk(sk);
    done:
In this logic, only the specific queue needs a fine-grained lock; there is no need to lock the entire socket. For the syncookie logic it is even simpler: not even the SYNRECV socket has to be constructed, as long as a listener is guaranteed to exist!
I ran into this 4.4 feature on Thursday morning while squatting in the toilet, and it shocked me: this is what I accidentally thought of in 2014, but I had no environment to follow up on it back then, and now it is already in mainline. I have to say it is a good thing. My idea at the time was to construct the SYN-ACK from a SYN packet while completely ignoring the listener; the information to be negotiated could be saved elsewhere without being bound to the listener, freeing the listener from that duty. But I never thought of constructing a socket parallel to all the other sockets and inserting it into the same socket hash table.
I admit the pre-4.4 logic is more straightforward: whether it is a handshake packet or a data packet, the processing path is exactly the same, whereas 4.4 complicates the code and splits it into so many if-else branches... but this is inevitable. Conceptually, the request constructed from a SYN really does belong to the listener; once you start optimizing, the code necessarily becomes more complex. Judged purely as code, the simple version looks better, but the performance is worth the complexity.
This lockless idea is similar to nf_conntrack's, though I think conntrack can additionally play tricks with its related-connection logic.
CPU affinity and REUSEPORT for TCP listeners
After the lockless TCP listener comes optimization of the accept queue! As everyone knows, a listener has only one accept queue, and in a multi-core environment this single queue is an absolute bottleneck. How could a high-performance server endure that!
In fact, this problem was solved long ago by REUSEPORT. REUSEPORT allows multiple independent sockets to listen on the same IP/port pair at the same time, an absolute boon for today's multi-queue NICs. However, a wider road with more lanes but no traffic rules still performs poorly; the congestion is merely alleviated, not eliminated!
The 4.4 kernel introduces the SO_INCOMING_CPU socket option: if a socket sets this option to n, packets whose protocol-stack processing runs on CPU n will prefer this socket. In the code this shows up as extra points in compute_score(): besides destination IP, destination port, source IP, and source port, the CPU has become a matching criterion.
As the patch notes say, this feature, combined with REUSEPORT and a multi-queue NIC, must make a gourmet dish!
New flow-based multipath routing
There used to be a routing cache: a cache entry was an n-tuple including source information, a cache entry was created after each packet matched its FIB entry, and subsequent lookups hit the cache first, so everything was flow-based. After the routing cache was removed, however, multipath routing became packet-based, which is bound to cause out-of-order problems for TCP. The 4.4 kernel avoids this by including the source information in the hash calculation for multipath routing: as long as the calculation method stays constant, the packets of one flow always hash to the same dst.
Socket route cache with version number
This is not a feature carried by the 4.4 kernel; it is some ideas of my own. early_demux was introduced into the kernel to eliminate route lookups for inbound local traffic: since the route lookup would be followed by a socket lookup anyway, why not look up the socket directly and cache the routing information in it? For devices that serve traffic locally, turn this option on.
But for outbound traffic, a lot of overhead is still wasted on route lookups. Although IP is connectionless, a TCP socket or a connected UDP socket is clearly identified by a 5-tuple; wouldn't it be better to store the routing information in the socket? All right, many people will ask: how do you solve the synchronization problem? What happens when the routing table changes; do you notify every socket? If this question leads you to design an "efficient synchronization protocol", you have already lost! The simple approach is to introduce two counters, a per-cache counter and a global counter. The socket's route cache looks like this:
    struct sk_rt_cache {
        atomic_t version;
        struct dst_entry *dst;
    };
The global counters are as follows:
    atomic_t gversion;
Whenever the socket sets its route cache, it reads the global gversion value into the cache's version field; whenever any change occurs to the routing tables, the global gversion counter is incremented. If the cache's counter equals the global counter, the cache is usable; otherwise it is not. The dst itself is, of course, protected by reference counting.
New networking features in Linux kernel 4.4