Applying an abstracted Linux route table lookup to nf_conntrack


Standard IP route lookup gives us an excellent "match-action" routine: match a route entry, then send the packet to the next hop that the entry indicates.
If we abstract the route lookup process one level up, we find it can be reused for other purposes. Stated abstractly: query a table keyed by the packet's source or destination address, and once a result entry is found, execute the action that the entry indicates. A result entry looks like this:

struct result_node {
    uint32_t network;
    uint32_t netmask;
    void    *action;
};
Route lookup follows the "longest prefix match" rule. The rule is implicit in the lookup itself, but it is what guarantees that the most specific match wins.
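To make the abstraction concrete, here is a minimal, hypothetical user-space sketch of the "match-action" lookup over an array of result_node entries. The match_action helper and its linear scan are mine, not the kernel's (the kernel hashes by prefix length instead); the point is only to show how the longest prefix wins.

#include <stdint.h>
#include <stddef.h>

struct result_node {        /* same shape as above */
    uint32_t network;
    uint32_t netmask;
    void    *action;
};

/* Return the action of the entry whose prefix matches addr most
 * specifically (longest prefix), or NULL if nothing matches. */
static void *match_action(const struct result_node *tbl, size_t n, uint32_t addr)
{
    const struct result_node *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((addr & tbl[i].netmask) != tbl[i].network)
            continue;                        /* not covered by this prefix */
        if (!best || tbl[i].netmask > best->netmask)
            best = &tbl[i];                  /* more specific mask wins */
    }
    return best ? best->action : NULL;
}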
In my particular scenario, I want a data stream coming from a network segment (a large network, a small network, or a single host) to be associated with a string. The string can be a description, a user name, or anything else... In the past this would have been considered out of line with the UNIX philosophy: memory was scarce, and such a use of it would have been a luxury. Today memory is no longer the concern, so we can plug almost anything into the kernel, as long as it is designed properly and stays in the spirit of the UNIX philosophy.
Put simply, I want to retrieve a pre-configured string from an arriving packet.
Why doesn't Linux implement this? In fact my premise was wrong from the start, and I know it. Why have I kept "patching" the Linux kernel protocol stack and the various poor or incomplete Netfilter extensions - "NAT that takes effect immediately", bidirectional static NAT, the incomplete conntrack confirm mechanism, and so on? Are these defects invisible to the Linux kernel and Netfilter communities? Certainly not. They simply follow the "Worse is better" principle, whose core is to keep things simple above all; for the sake of simplicity, they are willing to discard whatever should be discarded.
My practice of implementing many things the protocol stack does not include seems to go against "Worse is better": I replaced the Simplicity principle with the Completeness principle. The proper statements of the two principles are:
Simplicity
The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
Completeness
The design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.
Linux community developers follow this principle strictly and do not introduce "complex and rarely used" mechanisms. So why do I insist on implementing them? Because everything I do serves my own needs, and even then I follow the simplicity-first principle: I build these things precisely to avoid introducing mechanisms that are even more complex.
Since the goal is to bind a string to a data stream, extending nf_conntrack is clearly an excellent choice, and I have already described elsewhere how to extend it. The next question is how to configure the rules. Writing an INFO iptables module is one option:
iptables -t ... -A ... -j INFO --set-info "aaaaaaaaaa"
The idea of the INFO target is to take the conntrack struct from the skb, then find its acct extension (I like to piggyback on the accounting extension), and copy the information given by --set-info into that extension.
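For illustration only, here is a minimal sketch of what such an INFO target's kernel side could look like. This is not an existing upstream module: xt_info_tginfo, info_tg and the nf_conn_priv overlay are my assumptions, the callback signature follows the newer xt_target API (it differs across kernel versions), and the overlay only works if the acct extension was registered with enough extra room behind the counters.

#include <linux/module.h>
#include <linux/string.h>
#include <linux/netfilter/x_tables.h>
#include <net/netfilter/nf_conntrack.h>
#include <net/netfilter/nf_conntrack_acct.h>

#define INFO_SIZE 128

struct xt_info_tginfo {                  /* filled by userspace from --set-info */
    char info[INFO_SIZE];
};

struct nf_conn_priv {                    /* assumed layout behind the acct counters */
    struct nf_conn_counter ncc[IP_CT_DIR_MAX];
    char info[INFO_SIZE];
};

static unsigned int info_tg(struct sk_buff *skb, const struct xt_action_param *par)
{
    const struct xt_info_tginfo *tginfo = par->targinfo;
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
    struct nf_conn_counter *acct;

    if (ct) {
        acct = nf_conn_acct_find(ct);    /* piggyback on the acct extension */
        if (acct) {
            struct nf_conn_priv *ncp = (struct nf_conn_priv *)acct;
            memcpy(ncp->info, tginfo->info, INFO_SIZE);
        }
    }
    return XT_CONTINUE;
}

static struct xt_target info_tg_reg __read_mostly = {
    .name       = "INFO",
    .revision   = 0,
    .family     = NFPROTO_IPV4,
    .target     = info_tg,
    .targetsize = sizeof(struct xt_info_tginfo),
    .me         = THIS_MODULE,
};

static int __init info_tg_init(void)  { return xt_register_target(&info_tg_reg); }
static void __exit info_tg_exit(void) { xt_unregister_target(&info_tg_reg); }

module_init(info_tg_init);
module_exit(info_tg_exit);
MODULE_LICENSE("GPL");

A matching userspace libxt_INFO.so would parse --set-info into xt_info_tginfo; that part is omitted here.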
However, if there are 10,000 pieces of info to set, I end up with 10,000 rules, and every packet has to traverse all of them in the kernel. Too many iptables rules (say, more than 5,000) seriously hurt network performance. Of course, you can arrange the rule order cleverly so that packets leave the chain's matching process early, for example:
iptables ... -m state --state ESTABLISHED -j ACCEPT
... -j INFO --set-info aaaaa
... -j INFO --set-info bbbbb
... -j INFO --set-info ccccc

This is indeed a useful trick, and anyone who has played with iptables knows it, but I still had to look for another solution because of the following four problems:
1. When targets other than INFO are not supposed to jump out of the matching chain this way, yet another custom chain has to be arranged to work around it.
2. Because of the frequent juggling in point 1, this eventually becomes "programming in iptables": the rule set grows ever more complex, like badly written C++ code, and adjusting the position of a single rule becomes extremely difficult.
3. Back to point 2: a packet in the NEW state really does have to traverse the rules (remember, iptables matching is linear traversal). I need an algorithm more efficient than O(n) (nf-HiPAC does much better once there are more than 5,000 rules).
4. Netfilter's nf-HiPAC project may well be a pioneering piece of work, but I do not have time to study it in detail.
The UNIX philosophy teaches us to split a problem into small, independent pieces and then let them cooperate to solve the big problem. But the cooperation itself sometimes becomes a new problem with a huge management cost; not every problem can be decomposed into cat file | grep key.
Once I saw that IP route lookup is really a "match-action" pattern, I simply changed the meaning of the "action": instead of "send to the next hop", it becomes "retrieve the info string". Note that the info is only retrieved here, not yet set on the conntrack. Why not do both in one step? Because the relationship between the two is just like that between cat file and grep key. Here is how this design answers the four problems above:
1. The route lookup module is completely independent of iptables; it has nothing whatsoever to do with it.
2. Using the route lookup module requires no other module's interface; you only need to call:
int nf_route_table_search(const u_int32_t src, char *info, int *len);
If src is found, you get the info associated with the network that is the longest-prefix match for that address; if nothing is associated with it, the call returns non-zero.
3. Efficiency needs no argument: this is the Linux kernel's hash-based route lookup algorithm, which has been running in every kind of environment for more than a decade.
4. Although I have no time to study nf-HiPAC, I have long been familiar with route lookup algorithms; one guiding idea is to always use the technology I know best.

The route lookup module can quickly match a packet to a "route entry". Isn't it wasteful for that entry to store only a single string? Can the information carried by the entry be made extensible? Definitely yes. I have already replaced the "next hop" with string information; the next step is to drop that fixed type altogether. Adding a void * pointer to the entry would work, but a zero-length array is the better choice, since it keeps the memory compact.
In the kernel, struct fib_node represents a route entry. To distinguish mine from the standard one, I added an nf_ prefix, meaning a "route entry" associated with Netfilter and used by nf_conntrack, and removed the fixed data-type definition:
/*
 * The following struct represents a "route node" that can carry extra data,
 * which you define as needed. For example:
 *     struct my_extra {
 *         char info[INFO_SIZE];
 *         int  policy;      // NF_ACCEPT, NF_DROP, or something else?
 *         // ... extra of the extra??
 *     };
 * It offers unlimited extensibility: this "route" node can store any data.
 */
struct nf_fib_node {
    struct hlist_node nf_fn_hash;
    u_int32_t         fn_key;
    int               len;
    /* extra_len is the length of the extra data you pass in */
    int               extra_len;
    /* zero-length array; its contents are yours to define */
    char              extra[0];
};
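As an illustration only (alloc_node and this particular my_extra are mine, not part of the code above), this is how a caller might allocate one nf_fib_node together with its zero-length-array payload in a single, compact chunk of memory:

#include <linux/slab.h>
#include <linux/string.h>

struct my_extra {
    char info[128];
    int  policy;                   /* NF_ACCEPT, NF_DROP, ... */
};

static struct nf_fib_node *alloc_node(u_int32_t key, int prefix_len,
                                      const struct my_extra *extra)
{
    /* one allocation covers both the node and its payload */
    struct nf_fib_node *f = kzalloc(sizeof(*f) + sizeof(*extra), GFP_ATOMIC);

    if (!f)
        return NULL;
    f->fn_key    = key;
    f->len       = prefix_len;
    f->extra_len = sizeof(*extra);
    memcpy(f->extra, extra, sizeof(*extra));   /* payload sits right behind the node */
    return f;
}

Compared with a void * pointer, this saves one allocation and one pointer dereference per lookup, which is exactly the compactness argument made above.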

PS: Section 15.16 in chapter 15 of Thinking in Java describes the idea of writing code that can be applied as widely as possible. To achieve this, we need various ways to loosen the restrictions on the types our code works with, without losing the benefits of static type checking.
Expanding further: ACL stands for access control list, and the word "list" tells you its structure is one-dimensional and linear. In many systems, how an ACL matches packets depends on the order in which the administrator configured the rules. That is the ACL's user-facing configuration interface, and at that level Linux iptables and Cisco ACLs are consistent; internally, however, every system differs. Taking Linux iptables as the example, consider the following rule sequence:
i=1
# Add a massive number of iptables rules
for ((; i < 255; i++)); do
    j=1
    for ((; j < 255; j++)); do
        iptables -A FORWARD -s 192.168.$i.$j -j DROP
    done
done

Now a packet arrives whose source address is in 172.16.0.0/24. Does it really have to pass through more than sixty thousand rules? Of course you could put an ACCEPT rule for everything outside 192.168.0.0/16 at the top, but the point here is the implementation level, not the configuration level: during packet matching, how do we quickly skip rules that obviously cannot match? That is one of the goals of the nf-HiPAC project. The idea completely reverses the usual direction: instead of rules matching packets by the addresses they contain, the packet's source/destination IP address is used to find the matching rule, and for that, route lookup is extremely convenient.
Given the struct nf_fib_node extension above, the extra definition for this use is already obvious:
struct my_extra {
    /* replaces the matches of an iptables rule */
    struct list_head matches_list;
    /* replaces the iptables target */
    int policy;    /* NF_ACCEPT, NF_DROP, or something else? */
};

The processing then becomes: "query a route table" with the packet's source or destination IP address; if an entry is found, take out its extra data, which tells us how the packet should be handled. A small sketch of this follows.
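A minimal sketch of that processing, assuming the my_extra layout above and the nf_route_table_search() interface introduced earlier; the hook wiring around it is omitted and the helper name verdict_by_route is mine:

#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>

static unsigned int verdict_by_route(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    struct my_extra res;
    int res_len = sizeof(res);

    /* longest-prefix match on the source address replaces chain traversal */
    if (nf_route_table_search(iph->saddr, &res, &res_len))
        return NF_ACCEPT;              /* no entry: let the packet through */

    return res.policy;                 /* NF_ACCEPT, NF_DROP, ... */
}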
The idea described in the section above only establishes feasibility; I have not yet found the proper place for it. My original purpose in writing this was to implement a tool called conntrack_table for operating on nf_conntrack, for example:
1. Bind the routes of both directions to the conntrack, so that route lookup happens per connection rather than per packet;
2. Use the longest-prefix-match IP route lookup mechanism, instead of iptables, to match the first packet of a conntrack to a route entry, then take the information out of the entry and set it in the conntrack struct.

With only limited weekend time, and getting home too late on weekdays, this version implements just one thing: use the routing mechanism to store a string per network segment and then set it in the conntrack struct.
Kernel-side implementation. The kernel part has two pieces. The first is to copy the standard route lookup algorithm into the nf_conntrack_ipv4 module; the second is to query that route table where nf_conntrack performs its confirm, take the info stored in the matching route entry, and save it into the conntrack. This is done only for the first packet of a flow; subsequent packets do not query the table. The ported route lookup code is given later; first, here is the change to confirm. Only the ipv4_confirm function is modified. It is a hook attached to the INET IPv4 protocol family, so I do not have to consider other protocols and can skip the protocol check:
#include "nf_conntrack_info.h"
#define MAX_LEN 128

static unsigned int ipv4_confirm(unsigned int hooknum,
                                 struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
{
    ...
out:
    /* We've seen it coming out the other side: confirm it */
    ret = nf_conntrack_confirm(skb);
    if (ret == NF_ACCEPT) {
        struct nf_conn_priv {
            struct nf_conn_counter ncc[IP_CT_DIR_MAX];
            char *info;
        };
        struct nf_conn_priv *ncp;
        struct nf_conn_counter *acct;

        /* ct was obtained earlier in the function */
        acct = nf_conn_acct_find(ct);
        if (acct) {
            char buf[MAX_LEN] = {0};
            int len = MAX_LEN;
            struct iphdr *iph = ip_hdr(skb);
            /* query the "route table" and fetch the result */
            int rv = nf_route_table_search(iph->saddr, buf, &len);
            if (!rv) {
                ncp = (struct nf_conn_priv *)acct;
                if (ncp->info == NULL)
                    ncp->info = (char *)kcalloc(1, len + 1, GFP_ATOMIC);
                /* copy the result into the conntrack */
                memcpy(ncp->info, buf, len);
            }
        }
    }
    return ret;
}

Note: I chose the location inside ipv4_confirm carefully. A good insertion point saves a lot of if statements; whether in an old-style language or an OO language, if should be used only when it must be. When the program reaches a given point it should already know its own context; questions like "who am I?" should not need to be asked.
Now for the ported Linux route lookup algorithm. The hash-based algorithm has stood the performance test for many years, so I did not port the trie algorithm, and I also cut out the rehash mechanism. The port is very simple: it is basically an edit of net/ipv4/fib_hash.c that strips fib_node of its type-specific definition, promoting it, so to speak, to the "generic" level. The code is as follows:
Header file: nf_conntrack_rtable.h

#include <linux/types.h>
#include <linux/list.h>

#define SIZE 128

struct nf_fn_zone {
    struct nf_fn_zone  *fz_next;     /* Next not empty zone */
    struct hlist_head  *fz_hash;     /* Hash table pointer  */
    int                 fz_nent;     /* Number of entries   */
    int                 fz_divisor;  /* Hash divisor        */
    u32                 fz_hashmask; /* (fz_divisor - 1)    */
#define FZ_HASHMASK(fz) ((fz)->fz_hashmask)
    int                 fz_order;    /* Zone order          */
    u_int32_t           fz_mask;
#define FZ_MASK(fz) ((fz)->fz_mask)
};

struct nf_fn_hash {
    struct nf_fn_zone *nf_fn_zones[33];
    struct nf_fn_zone *nf_fn_zone_list;
};

/*
 * The following struct represents a "route node" that can carry extra data,
 * which you define as needed. For example:
 *     struct my_extra {
 *         char info[128];
 *         int  policy;     // NF_ACCEPT, NF_DROP, or something else?
 *         // ... extra of the extra??
 *     };
 * It offers unlimited extensibility: this "route" node can store any data.
 */
struct nf_fib_node {
    struct hlist_node nf_fn_hash;
    u_int32_t         fn_key;
    int               len;
    /* extra_len is the length of the extra data you pass in */
    int               extra_len;
    /* zero-length array; its contents are yours to define */
    char              extra[0];
};

/* lookup interface */
int nf_route_table_search(const u_int32_t dst, void *res, int *res_len);
/* node add interface */
int nf_route_table_add(u_int32_t network, u_int32_t netmask, void *extra, int extra_len);
/* node delete interface */
int nf_route_table_delete(u_int32_t network, u_int32_t mask);
/* clear interface */
void nf_route_table_clear(void);

File C: nf_conntrack_rtable.c
# Include <linux/types. h> # include <linux/inetdevice. h> # include <linux/slab. h> # include <linux/kernel. h> # include "nf_conntrack_info.h" # ifndef NULL # define NULL 0 # endif // It is always safe to use lock. Two elements of kernel programming: 1. security; 2. efficient static DEFINE_RWLOCK (nf_hash_lock); struct nf_fn_hash * route_table = NULL; static inline u_int32_t evaluate (distinct dst, struct nf_fn_zone * fz) {return dst & FZ_MASK (fz );} static inline u32 nf_fn_hash (u_int32_t key, struct nf_fn_zone * fz) {u32 h = key >>( 32-fz-> fz_order); h ^ = (h> 20 ); h ^ = (h> 10); h ^ = (h> 5); h & = FZ_HASHMASK (fz); return h;} static struct hlist_head * fz_hash_all Oc (int divisor) {unsigned long size = divisor * sizeof (struct hlist_head); return kcalloc (1, size, GFP_ATOMIC);} static struct nf_fn_zone * fn_new_zone (struct nf_fn_hash * table, int z) {int I; struct nf_fn_zone * fz = kcalloc (1, sizeof (struct nf_fn_zone), GFP_ATOMIC); if (! Fz) return NULL; if (z) {fz-> fz_divisor = 16;} else {fz-> fz_divisor = 1 ;} fz-> fz_hashmask = (fz-> fz_divisor-1); fz-> fz_hash = fz_hash_alloc (fz-> fz_divisor); if (! Fz-> fz_hash) {kfree (fz); return NULL;} fz-> fz_order = z; fz-> fz_mask = inet_make_mask (z ); /* Find the first not empty zone with more specific mask */for (I = z + 1; I <= 32; I ++) if (table-> nf_fn_zones [I]) break; write_lock_bh (& nf_hash_lock); if (I> 32) {/* No more specific masks, we are the first. */fz-> fz_next = table-> nf_fn_zone_list; table-> nf_fn_zone_list = fz;} else {fz-> fz_next = table-> nf_f N_zones [I]-> fz_next; table-> nf_fn_zones [I]-> fz_next = fz;} table-> nf_fn_zones [z] = fz; write_unlock_bh (& nf_hash_lock ); return fz;} // route table operation interface: 1. search; 2. delete. Too many parameters, similar to Win32 API, with poor style, but convenient int nf_route_table_opt (const u_int32_t dst, const u_int32_t mask, int del_option, void * res, int * res_len) {int rv = 1; struct nf_fn_zone * fz; struct nf_fib_node * del_node = NULL; if (NULL = route_table) {printk (""); return 1 ;} read_lock (& nf_hash_lock); for (fz = route_table-> nf_fn_zone_list; fz = fz-> fz_next) {struct hlist_head * head; struct hlist_node * node; st Ruct nf_fib_node * f; u_int32_t k = nf_fz_key (dst, fz); head = & fz-> fz_hash [nf_fn_hash (k, fz)]; terminate (f, node, head, nf_fn_hash) {if (f-> fn_key = k) {if (1 = del_option & mask = FZ_MASK (fz) {del_node = f ;} else if (0 = del_option) {// copy the extra data imported by the user to the caller. 
Memcpy (res, (const void *) (f-> extra), f-> extra_len); * res_len = f-> extra_len;} rv = 0; goto out ;}}rv = 1; out: read_lock (& nf_hash_lock); if (del_node) {write_lock_bh (& nf_hash_lock); _ hlist_del (& del_node-> nf_fn_hash ); kfree (del_node); write_unlock_bh (& nf_hash_lock);} return rv;} static inline void fib_insert_node (struct nf_fn_zone * fz, struct nf_fi_node * f) {struct hlist_head * head = & fz-> fz _ Hash [nf_fn_hash (f-> fn_key, fz)]; hlist_add_head (& f-> nf_fn_hash, head);} int nf_route_table_search (u_int32_t dst, void * res, int * res_len) {return nf_route_table_opt (dst, 32, 0, res, res_len);} int nf_route_table_delete (u_int32_t network, u_int32_t mask) {return packet (network, mask, 1, NULL, NULL);} int nf_route_table_add (u_int32_t network, u_int32_t netmask, void * extra, int extra_len ){ Struct nf_fib_node * new_f; struct limit * fz; new_f = kcalloc (1, sizeof (struct nf_fib_node) + extra_len, GFP_ATOMIC); new_f-> len = netmask ); new_f-> extra_len = extra_len; new_f-> fn_key = network; memcpy (new_f-> extra, extra, extra_len); if (new_f-> len> 32) {return-1;} INIT_HLIST_NODE (& new_f-> nf_fn_hash); if (NULL = route_table) {route_table = kcalloc (1, sizeof (struct nf_fn_h Ash), GFP_ATOMIC); fz = fn_new_zone (route_table, new_f-> len);} else {fz = route_table-> nf_fn_zones [new_f-> len];} if (! Fz &&! (Fz = fn_new_zone (route_table, new_f-> len) {return-1;} fig (fz, new_f); return 0;} void nf_route_table_clear (void) {struct nf_fn_zone * fz, * old; if (NULL = route_table) {printk (""); return;} write_lock_bh (& nf_hash_lock ); for (fz = route_table-> nf_fn_zone_list; fz;) {if (fz! = NULL) {kfree (fz-> fz_hash); fz-> fz_hash = NULL; old = fz; fz = fz-> fz_next; kfree (old ); old = NULL ;}} kfree (route_table); route_table = NULL; write_unlock_bh (& nf_hash_lock); return ;}
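A hypothetical self-test (not part of the code above) showing how the ported interfaces fit together; addresses are passed in network byte order via in_aton(), which matches what ipv4_confirm later feeds in from iph->saddr:

#include <linux/kernel.h>
#include <linux/inet.h>        /* in_aton() */

static void nf_rtable_selftest(void)
{
    char info[] = "aaaaaaaaaa";
    char buf[128];
    int  len = sizeof(buf);

    /* 192.168.10.0/24 -> "aaaaaaaaaa" */
    nf_route_table_add(in_aton("192.168.10.0"), in_aton("255.255.255.0"),
                       info, sizeof(info));

    /* longest-prefix lookup for a host inside that /24 */
    if (!nf_route_table_search(in_aton("192.168.10.30"), buf, &len))
        printk(KERN_INFO "nf_rtable: matched \"%s\" (len %d)\n", buf, len);

    nf_route_table_delete(in_aton("192.168.10.0"), in_aton("255.255.255.0"));
    nf_route_table_clear();
}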

To compile, I put the code above into the net/ipv4/netfilter directory and modified the Makefile to add the new object file:
nf_conntrack_ipv4-objs  :=  nf_conntrack_l3proto_ipv4.o nf_conntrack_proto_icmp.o nf_conntrack_info.o
Then build; the resulting nf_conntrack_ipv4.ko now carries my extension.
Having abandoned iptables, I chose procfs as the user interface, because it exposes the standard file I/O interface and can be driven conveniently from all kinds of scripts. Three further thoughts led to this choice.
The first: if an existing mechanism can satisfy the requirement, do not write C code. More and more content, less and less code!
The second: use transactional operations instead of wrapping the transaction sequence yourself. If you write a file in C or Java, you have to carry out the whole open-read/write-close sequence yourself; that is programming, not "using a file". With a bash script and a single echo, the echo program wraps the complete sequence for you (strace echo something > ./a shows this), so why implement it again yourself? Which leads to the third thought.
The third: use text, not binary, for configuration. Unless the operation is purely machine-to-machine, avoid binary; binary is the machines' world, and as soon as humans are involved, text is the best way to exchange information.
So I chose procfs as the channel to the kernel mechanism, using it to configure policies and to add or delete "extended routes". A sketch of what such a procfs writer might look like follows.
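The article does not show the procfs code itself, so the following is only a sketch of what the writer behind /proc/nfrtable could look like, assuming the "+network netmask info" syntax used below and the proc_create()/file_operations API of 2.6.26+ kernels (older kernels would use create_proc_entry() instead); nfrtable_write and nfrtable_fops are my names:

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/string.h>
#include <linux/inet.h>

static ssize_t nfrtable_write(struct file *file, const char __user *ubuf,
                              size_t count, loff_t *ppos)
{
    char kbuf[256], net[16], mask[16], info[128];

    if (count >= sizeof(kbuf))
        return -EINVAL;
    if (copy_from_user(kbuf, ubuf, count))
        return -EFAULT;
    kbuf[count] = '\0';

    /* e.g. "+192.168.10.0 255.255.255.0 1234abcd" */
    if (sscanf(kbuf, "+%15s %15s %127s", net, mask, info) != 3)
        return -EINVAL;

    if (nf_route_table_add(in_aton(net), in_aton(mask),
                           info, strlen(info) + 1))
        return -ENOMEM;
    return count;
}

static const struct file_operations nfrtable_fops = {
    .owner = THIS_MODULE,
    .write = nfrtable_write,
};

/* somewhere in module init:
 *     proc_create("nfrtable", 0200, NULL, &nfrtable_fops);
 * a '-' syntax calling nf_route_table_delete() could be added the same way.
 */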
Effect
When I perform the following operations:
echo +192.168.10.0 255.255.255.0 1234abcd > /proc/nfrtable
If a packet arrives from 192.168.10.30, you will see the 1234abcd information attached to that data stream in /proc/net/nf_conntrack; if the packet comes from 192.168.20.0/24, the information will not be there. Moreover, iptables-save no longer shows any settings about info; in fact the INFO module is not needed at all.
On algorithm reuse: the Linux kernel is eclectic and its algorithms are public, so there is no copyright problem and anyone may use them. The catch is that porting an algorithm always costs something. As Linus's classic answer goes, RTFS: the Linux source code offers its users no portability interfaces, so you have to do everything yourself. For example, you cannot compile a reusable piece of the kernel into a library at build time; if you need that, you must modify the Makefile yourself.
On the Linux kernel API: the kernel API keeps changing, and the biggest benefit of that movement is flexibility. In other words, RTFS: the source is in front of you, take what you need, but the kernel community provides no shipping service. It is like a gold mine whose manager only manages it. Any miner can dig out the gold he needs, but the manager provides no vehicles to haul it, not even for rent, and as you know, gold is heavy!
The Linux gold mine has another rule: anyone may add his own code, tools, or anything else to it. One consequence is the instability of the kernel API. Do not expect a quality-control team to review your code and interface definitions; all the community cares about is whether something solves a real problem, and if it does, it gets adopted. If you use the 2.6.18 kernel, decide you do not like some of its interfaces and change them, Linus will not scold you, and your interfaces may well become the 2.6.18-x or 2.6.19 standard ones. At the same time Linux provides the LKM mechanism, and an LKM must call kernel interfaces; an LKM is therefore tied to the kernel version of its host, precisely because those interfaces can change at any time.
The Linux style buys flexibility at the price of constraints on LKMs. Windows is the opposite: it provides maximally compatible interfaces for both kernel drivers and user applications. That too has a price: the Windows kernel and its support libraries grow ever more bloated, because large numbers of adapters are needed to keep old and new interfaces compatible. Microsoft's system APIs change constantly as well; Microsoft merely hides the change from users, and implementing that hiding is exactly what Microsoft's staff are for...
