An abstract extension of the Linux routing table, applied to nf_conntrack


The standard IP routing lookup process provides us with an excellent "match-action" paradigm: a packet is matched against the routing table, and once a route entry is hit, the packet is sent to the next hop indicated by that entry.

If we abstract this IP routing lookup process up one level, we find that it can be used for other purposes as well. The abstract formulation is: use the source or destination address of a packet as the key to query a table, and once a result entry is found, run the action indicated by that entry. A result entry looks like this:

struct result_node {
        uint32_t network;
        uint32_t netmask;
        void    *action;
};
This idea is already implicit in routing lookups, thanks to the "longest prefix match" principle, which guarantees the most precise match.
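To make the abstraction concrete, here is a minimal user-space sketch of the generic "match-action" lookup. The names (result_node, lookup_action) and the flat linear scan are illustrative only; the kernel organizes entries into per-prefix-length hash zones, as shown later in this article.

#include <stdint.h>
#include <stddef.h>

struct result_node {
        uint32_t network;
        uint32_t netmask;
        void    *action;        /* a next hop, a string, a policy ... anything */
};

/* Return the action of the longest-prefix entry matching addr, or NULL. */
static void *lookup_action(struct result_node *tbl, size_t n, uint32_t addr)
{
        struct result_node *best = NULL;

        for (size_t i = 0; i < n; i++) {
                if ((addr & tbl[i].netmask) == tbl[i].network) {
                        /* a longer prefix has a numerically larger contiguous mask */
                        if (!best || tbl[i].netmask > best->netmask)
                                best = &tbl[i];
                }
        }
        return best ? best->action : NULL;
}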


The requirement comes from a special scenario of mine: I want to associate the data streams emitted by a network segment (a large network, a small subnet, or a single host) with a descriptive string. The string can be a description, a username, or even something completely arbitrary... In the past, this would have been considered incompatible with the UNIX philosophy, and back when we had to be stingy with memory, it would also have been considered far too extravagant with memory.

But nowadays there is no need to skimp on memory. We can plug into the kernel whatever can be plugged in, as long as we do it properly and keep it compatible with the UNIX philosophy.
In general terms: for an arriving packet, I want to retrieve a pre-configured string associated with it.

Why doesn't Linux implement this? Actually, my idea started out as a mistake; what is correct is that I now know where I was wrong. Why have I kept "patching" the Linux kernel stack and the various bad or incomplete NetFilter extensions? For example, "NAT taking effect immediately", bidirectional static NAT, the incomplete conntrack confirm mechanism, and so on.

Are these flaws simply not recognized by the Linux kernel and NetFilter communities? Definitely not; rather, they follow the "worse is better" principle. The core of that principle is that simplicity dominates everything, to the point that everything else may be abandoned for its sake.
My approach, which implements a great deal of what the protocol stack does not include, looks like a violation of the worse-is-better principle: I replaced the simplicity principle with the completeness principle. The correct formulation of these two principles is:

Simplicity
The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
Completeness
The design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.
As a result, Linux community developers strictly follow this principle and do not introduce "complex and infrequently used" mechanisms. So why do I insist on implementing them? My view is that everything is in service of my own needs: following the simplicity-first principle, I do this precisely so that I do not have to introduce even more complex mechanisms.

Since the analysis above amounts to binding a string to a data stream, extending nf_conntrack is obviously an excellent choice, and I have already described in detail how to extend it in a previous article. The next thing to consider is how to set the rules, and writing an INFO iptables module is clearly one option:
iptables -t ... -A ... -j INFO --set-info "AAAAAAAAAAAA"
The idea of the INFO module is to take the conntrack structure from the skb, then fetch the acct extension (I have always liked piggybacking on the acct extension), and finally copy the information given by --set-info into the acct extension.
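Such an INFO target is never actually shown in this article, but its core would look roughly like the sketch below. Everything here is an assumption for illustration: the name xt_info_tginfo, the 2.6-era xt_target_param callback signature, and the reuse of the space behind the acct extension (whose exact layout appears later in the ipv4_confirm change).

#include <linux/skbuff.h>
#include <linux/netfilter/x_tables.h>
#include <net/netfilter/nf_conntrack.h>
#include <net/netfilter/nf_conntrack_acct.h>

/* Hypothetical INFO target body -- an illustration, not a real xtables module. */
struct xt_info_tginfo {
        char info[128];                         /* string given via --set-info */
};

static unsigned int
info_tg(struct sk_buff *skb, const struct xt_target_param *par)
{
        const struct xt_info_tginfo *tinfo = par->targinfo;
        enum ip_conntrack_info ctinfo;
        struct nf_conn *ct = nf_ct_get(skb, &ctinfo);   /* conntrack attached to the skb */
        struct nf_conn_counter *acct;

        if (ct && (acct = nf_conn_acct_find(ct)) != NULL) {
                /* copy tinfo->info into the private storage hung off the acct
                 * extension; the exact layout is shown later in ipv4_confirm */
                pr_debug("INFO target would store: %s\n", tinfo->info);
        }
        return XT_CONTINUE;                             /* keep traversing the chain */
}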
However, if there are 10,000 pieces of info to set up, I have to configure 10,000 rules. In the kernel, every packet has to traverse all of those rules sequentially, and once there are too many iptables rules (more than about 5,000), network performance really does suffer. Of course, you can flexibly arrange the order of the rules so that packets leave the matching process of a chain as early as possible, for example:
iptables -m state --state ESTABLISHED -j ACCEPT   (placed first, because only the first packet of a flow, in NEW state, needs --set-info)
... -j INFO --set-info AAAAA
... -j INFO --set-info bbbbb
... -j INFO --set-info CCCCC

This is really just a trick; anyone who has played with iptables knows it. But the following four problems drove me to look for another solution:
1. Targets other than INFO do not always allow you to jump out of the rule-matching chain in the way shown above; you have to arrange user-defined chains to work around it.
2. Because of the frequent gymnastics of point 1, the rule set itself becomes more and more complex, just like badly written C++ code; "programming" in iptables like this makes it extremely difficult to adjust the position of any one rule.
3. Stepping back ten thousand steps: a packet in NEW state really does have to traverse the rules (note that iptables matching is a traversal), and I need an algorithm that is better than O(n) (nf-hipac does much better once there are 5000+ rules).
4. The NetFilter nf-hipac project may be a feat, but I don't have time to study it in depth.


UNIX thinking teaches us to split a problem into small independent problems and then let them cooperate to solve the big one, but the coordination itself sometimes becomes a new problem and carries a huge management cost. Not every kind of problem can be decomposed the way cat file | grep key can.
When I realized that the IP route lookup process is really a "match-action" pattern, I simply changed the interpretation of the "action" from "send to the next hop" to "retrieve the info string". Note: only retrieve the info, without setting it into the conntrack. Why not do both things together? Because the relationship between these two things is very much like the relationship between cat file and grep key. As for the four problems of the previous section, here is how this approach answers them:
1. The routing-lookup module is completely independent of iptables; it has nothing whatsoever to do with iptables.
2. The routing-lookup module does not need to interface with any other module; you simply call the following interface (a short usage sketch follows this list):
int nf_route_table_search(const u_int32_t src, void *res, int *res_len);
If src is found, the info associated with the longest-prefix network matching the address is copied out; if nothing is associated with it, a non-zero value is returned.
3. Efficiency goes without saying: this is exactly the Linux kernel's hash-based route lookup algorithm, which has been in use for more than ten years and has run in all kinds of environments.
4. Although I do not have time to study the nf-hipac project, I have already mastered the routing lookup algorithm, and one of my principles is: always use the technology you know best.
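As referenced in point 2 above, here is a minimal caller sketch. The buffer size, the printk, and the surrounding function are illustrative; iph is assumed to point at the IPv4 header of the packet being processed in a netfilter hook.

#include <linux/ip.h>
#include <linux/kernel.h>

static void example_lookup(const struct iphdr *iph)
{
        char buf[128] = {0};
        int len = sizeof(buf);

        if (nf_route_table_search(iph->saddr, buf, &len) == 0) {
                /* buf holds the extra data of the longest-prefix match, len its length */
                printk(KERN_DEBUG "info for %pI4: %.*s\n", &iph->saddr, len, buf);
        } else {
                /* no configured network covers this source address */
        }
}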


Extending this further: since a routing-lookup module can match a packet to a "route entry" at high speed, isn't it wasteful for the route entry to carry only a single string? Can the information carried by the route entry also be made extensible? The answer is clearly yes. Since I have already turned the "next hop" into a string, I can just as well drop the type definition altogether. Adding a void * pointer to the route entry would work, but to keep the memory layout compact, a zero-length array is the better choice.


In the kernel, struct fib_node represents a route entry. To distinguish it from the standard one, I add an nf_ prefix, indicating that this is a "route entry" associated with NetFilter and used by nf_conntrack. At the same time, I remove the definition of the carried data type:

/*
 * The following struct indicates a "routing node". It can carry extra data,
 * which you can define arbitrarily, for example:
 *      struct my_extra {
 *              char info[INFO_SIZE];
 *              int  policy;    // NF_ACCEPT, NF_DROP or something else?
 *              // ... extra of extra??
 *      };
 * It brings almost unlimited extensibility: you can use this "routing" node
 * to store whatever data you want.
 */
struct nf_fib_node {
        struct hlist_node       nf_fn_hash;
        u_int32_t               fn_key;
        int                     len;
        /* extra_len indicates the length of the extra data you pass in */
        int                     extra_len;
        /* zero-length array; you can define its contents arbitrarily */
        char                    extra[0];
};
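To make the zero-length-array choice concrete, here is a sketch of the allocation pattern (it is essentially what the add interface shown later does). The function name is illustrative; the point is that the node header and its payload share one contiguous allocation, so no extra pointer and no second allocation are needed.

static struct nf_fib_node *nf_fib_node_alloc(u_int32_t key, int prefix_len,
                                             const void *extra, int extra_len)
{
        /* one allocation covers the node header plus extra_len payload bytes */
        struct nf_fib_node *f = kcalloc(1, sizeof(*f) + extra_len, GFP_ATOMIC);

        if (!f)
                return NULL;
        f->fn_key = key;
        f->len = prefix_len;
        f->extra_len = extra_len;
        memcpy(f->extra, extra, extra_len);     /* payload lands right after the header */
        return f;
}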

PS: In chapter 15 of Thinking in Java, section 15.16 opens with the idea of writing code that can be applied as widely as possible. To achieve this, we need various ways of loosening the restrictions on the types our code works with, without losing the benefits of static type checking.
Extending further: ACL stands for access control list. Note the word "list": it indicates that the structure is one-dimensional and linear.

On very many systems, the specific matching behavior of an ACL against a packet depends on the order in which the administrator configured the rules. Note that this is the ACL's user-facing configuration interface; in this respect Linux iptables and Cisco ACLs are consistent, but their internal implementations differ. Taking Linux iptables as the example, consider the following sequence of rules:

i=1
# add a huge number of iptables rules
for ((; i < 255; i++)); do
        j=1
        for ((; j < 255; j++)); do
                iptables -A FORWARD -s 192.168.$i.$j -j DROP
        done
done

Now a packet whose source address is in the 172.16.0.0/24 segment has to pass through more than 60,000 rules. Of course, you can put an ACCEPT rule for non-192.168.0.0/16 traffic first, but the focus here is on the implementation level rather than the configuration level: how do we quickly skip the rules that obviously cannot match while a packet is being matched? This is also one of the goals of the nf-hipac project.

This idea turns things completely upside down: instead of the addresses in the rules dominating the matching of packets, the packet's source/destination IP is used to look up the rules.

The idea of using routing lookups is undoubtedly very convenient!


Following my extension of struct nf_fib_node in the previous section, the definition of its extra data for this purpose is already quite clear:

struct my_extra {
        // replaces the matches of iptables
        struct list_head matches_list;
        // replaces the target of iptables
        ...
};

That is, the packet's source or destination IP address is used to "look up a routing table"; if an entry is found, the extra data is fetched, and it indicates how the packet should be handled.
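The article stops at the idea, but a sketch of what that dispatch could look like follows. The my_extra layout, its policy field, and the hook-style function are all assumptions made only for illustration; the lookup interface is the one defined later in this article, and entries are assumed to have been added with a struct my_extra payload.

#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>

/* Hypothetical dispatch on the extra data -- an illustration of the idea only. */
struct my_extra {
        int policy;                             /* NF_ACCEPT, NF_DROP, ... */
};

static unsigned int acl_by_route_lookup(struct sk_buff *skb)
{
        struct iphdr *iph = ip_hdr(skb);
        struct my_extra e;
        int len = sizeof(e);

        /* the packet's source IP drives the "routing" lookup */
        if (nf_route_table_search(iph->saddr, &e, &len) == 0)
                return e.policy;                /* the matched entry decides the verdict */
        return NF_ACCEPT;                       /* no entry: default accept */
}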


The idea in the "extending further" section above only points out a possibility; I have not yet decided where to hook it in. The purpose of this article is to implement a facility I call conntrack_table for nf_conntrack. Specifically:
1. A route entry is bound to both directions of a conntrack, so that only the conntrack lookup consults the table; individual packets no longer look it up one by one;
2. The longest-prefix-match IP routing lookup mechanism, rather than the iptables matching mechanism, is used to match the first packet of a conntrack to a route entry; the information in that route entry is then copied into the conntrack structure.

Because weekend time is limited and I usually get home late, this version only implements using the routing mechanism to store a per-segment string, which is then set into the conntrack structure.
Implementing this kernel mechanism requires two parts. The first part is to copy the standard routing lookup algorithm into the nf_conntrack_ipv4 module; the second part is to query this routing table where nf_conntrack performs confirm, and to store the info found in the resulting route entry into the conntrack. This is done only for the first packet of a flow; subsequent packets no longer query the routing table.

I will present the ported routing-lookup code in a moment; first, here is the change to confirm. Only the ipv4_confirm function is changed. It is a hook function attached to the inet IPv4 protocol family, so I do not have to consider other protocols, which lets me omit the protocol checks:

static unsigned int ipv4_confirm(unsigned int hooknum,
                                 struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
{
        ...
out:
        /* We've seen it coming out the other side: confirm it */
        ret = nf_conntrack_confirm(skb);
        if (ret == NF_ACCEPT) {
#include "nf_conntrack_info.h"
#define MAX_LEN 128
                struct nf_conn_priv {
                        struct nf_conn_counter  ncc[IP_CT_DIR_MAX];
                        char                    *info;
                };
                struct nf_conn_priv *ncp;
                struct nf_conn_counter *acct;

                acct = nf_conn_acct_find(ct);
                if (acct) {
                        char buf[MAX_LEN] = {0};
                        int len = MAX_LEN;
                        struct iphdr *iph = ip_hdr(skb);
                        /* query the "routing table" to get the result */
                        int rv = nf_route_table_search(iph->saddr, buf, &len);

                        if (!rv) {
                                ncp = (struct nf_conn_priv *)acct;
                                if (ncp->info == NULL)
                                        ncp->info = (char *)kcalloc(1, len + 1, GFP_ATOMIC);
                                /* copy the obtained result into the conntrack */
                                memcpy(ncp->info, buf, len);
                        }
                }
        }
        return ret;
}
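One detail not visible in the excerpt: the ct pointer used above is obtained near the top of the stock ipv4_confirm. Roughly, the existing prologue (a sketch of the unmodified code, not part of the patch) looks like this:

        enum ip_conntrack_info ctinfo;
        struct nf_conn *ct;

        ct = nf_ct_get(skb, &ctinfo);
        /* if no conntrack is attached, there is nothing to confirm or to tag */
        if (!ct)
                return NF_ACCEPT;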

Note that this position inside ipv4_confirm is carefully chosen; a good insertion point reduces the number of if statements. Whether in an old-style language or an OO language, an if statement should be used only when it has to be. When the program runs to a given location, it should already know its own context; it should not have to ask "who am I?".


Well, now it is time to show the port of the Linux routing lookup algorithm. I chose the hash algorithm, which has been tested for many years, and did not port the trie algorithm. I also cut out the rehash mechanism. The port is very easy: it is basically a rework of net/ipv4/fib_hash.c, with the data-type definition removed from fib_node, that is, the structure is promoted to a "generic" level.

The code is as follows:
Header file: nf_conntrack_rtable.h

#include <linux/types.h>
#include <linux/list.h>

#define SIZE 128

struct nf_fn_zone {
        struct nf_fn_zone       *fz_next;       /* next non-empty zone */
        struct hlist_head       *fz_hash;       /* hash table pointer */
        int                     fz_nent;        /* number of entries */
        int                     fz_divisor;     /* hash divisor */
        u32                     fz_hashmask;    /* (fz_divisor - 1) */
#define FZ_HASHMASK(fz)         ((fz)->fz_hashmask)
        int                     fz_order;       /* zone order */
        u_int32_t               fz_mask;
#define FZ_MASK(fz)             ((fz)->fz_mask)
};

struct nf_fn_hash {
        struct nf_fn_zone       *nf_fn_zones[33];
        struct nf_fn_zone       *nf_fn_zone_list;
};

/*
 * The following struct indicates a "routing node", which can carry extra data.
 * You can define the extra data at will, for example:
 *      struct my_extra {
 *              char info[128];
 *              int  policy;    // NF_ACCEPT, NF_DROP or something else?
 *              // ... extra of extra??
 *      };
 * It brings almost unlimited extensibility: you can use this "routing" node
 * to store whatever data you want.
 */
struct nf_fib_node {
        struct hlist_node       nf_fn_hash;
        u_int32_t               fn_key;
        int                     len;
        /* extra_len indicates the length of the extra data you pass in */
        int                     extra_len;
        /* zero-length array; you can define its contents arbitrarily */
        char                    extra[0];
};

/* lookup interface */
int nf_route_table_search(const u_int32_t dst, void *res, int *res_len);
/* add interface */
int nf_route_table_add(u_int32_t network, u_int32_t netmask, void *extra, int extra_len);
/* node delete interface */
int nf_route_table_delete(u_int32_t network, u_int32_t mask);
/* clear interface */
void nf_route_table_clear(void);

C file: nf_conntrack_rtable.c
#include <linux/types.h>
#include <linux/inetdevice.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include "nf_conntrack_info.h"
#include "nf_conntrack_rtable.h"

#ifndef NULL
#define NULL 0
#endif

/* It is always safe to take a lock. The two elements of kernel programming: 1. safety; 2. efficiency */
static DEFINE_RWLOCK(nf_hash_lock);

struct nf_fn_hash *route_table = NULL;

static inline u_int32_t nf_fz_key(u_int32_t dst, struct nf_fn_zone *fz)
{
        return dst & FZ_MASK(fz);
}

static inline u32 nf_fn_hash(u_int32_t key, struct nf_fn_zone *fz)
{
        u32 h = key >> (32 - fz->fz_order);
        h ^= (h >> 20);
        h ^= (h >> 10);
        h ^= (h >> 5);
        h &= FZ_HASHMASK(fz);
        return h;
}

static struct hlist_head *fz_hash_alloc(int divisor)
{
        unsigned long size = divisor * sizeof(struct hlist_head);
        return kcalloc(1, size, GFP_ATOMIC);
}

static struct nf_fn_zone *fn_new_zone(struct nf_fn_hash *table, int z)
{
        int i;
        struct nf_fn_zone *fz = kcalloc(1, sizeof(struct nf_fn_zone), GFP_ATOMIC);

        if (!fz)
                return NULL;
        if (z) {
                fz->fz_divisor = 16;
        } else {
                fz->fz_divisor = 1;
        }
        fz->fz_hashmask = (fz->fz_divisor - 1);
        fz->fz_hash = fz_hash_alloc(fz->fz_divisor);
        if (!fz->fz_hash) {
                kfree(fz);
                return NULL;
        }
        fz->fz_order = z;
        fz->fz_mask = inet_make_mask(z);

        /* Find the first non-empty zone with a more specific mask */
        for (i = z + 1; i <= 32; i++)
                if (table->nf_fn_zones[i])
                        break;

        write_lock_bh(&nf_hash_lock);
        if (i > 32) {
                /* No more specific masks, we are the first. */
                fz->fz_next = table->nf_fn_zone_list;
                table->nf_fn_zone_list = fz;
        } else {
                fz->fz_next = table->nf_fn_zones[i]->fz_next;
                table->nf_fn_zones[i]->fz_next = fz;
        }
        table->nf_fn_zones[z] = fz;
        write_unlock_bh(&nf_hash_lock);
        return fz;
}

/* Routing table operation interface: 1. lookup; 2. delete.
 * Too many parameters -- Win32-API style, not pretty, but easy to use. */
int nf_route_table_opt(const u_int32_t dst, const u_int32_t mask,
                       int del_option, void *res, int *res_len)
{
        int rv = 1;
        struct nf_fn_zone *fz;
        struct nf_fib_node *del_node = NULL;

        if (NULL == route_table) {
                printk("");
                return 1;
        }

        read_lock(&nf_hash_lock);
        for (fz = route_table->nf_fn_zone_list; fz; fz = fz->fz_next) {
                struct hlist_head *head;
                struct hlist_node *node;
                struct nf_fib_node *f;
                u_int32_t k = nf_fz_key(dst, fz);

                head = &fz->fz_hash[nf_fn_hash(k, fz)];
                hlist_for_each_entry(f, node, head, nf_fn_hash) {
                        if (f->fn_key == k) {
                                if (1 == del_option && mask == FZ_MASK(fz)) {
                                        del_node = f;
                                } else if (0 == del_option) {
                                        /* copy the stored extra data back to the caller */
                                        memcpy(res, (const void *)(f->extra), f->extra_len);
                                        *res_len = f->extra_len;
                                }
                                rv = 0;
                                goto out;
                        }
                }
        }
        rv = 1;
out:
        read_unlock(&nf_hash_lock);
        if (del_node) {
                write_lock_bh(&nf_hash_lock);
                __hlist_del(&del_node->nf_fn_hash);
                kfree(del_node);
                write_unlock_bh(&nf_hash_lock);
        }
        return rv;
}

static inline void fib_insert_node(struct nf_fn_zone *fz, struct nf_fib_node *f)
{
        struct hlist_head *head = &fz->fz_hash[nf_fn_hash(f->fn_key, fz)];
        hlist_add_head(&f->nf_fn_hash, head);
}

int nf_route_table_search(u_int32_t dst, void *res, int *res_len)
{
        /* the mask argument is only used for deletion, so pass 0 here */
        return nf_route_table_opt(dst, 0, 0, res, res_len);
}

int nf_route_table_delete(u_int32_t network, u_int32_t mask)
{
        return nf_route_table_opt(network, mask, 1, NULL, NULL);
}

int nf_route_table_add(u_int32_t network, u_int32_t netmask, void *extra, int extra_len)
{
        struct nf_fib_node *new_f;
        struct nf_fn_zone *fz;

        new_f = kcalloc(1, sizeof(struct nf_fib_node) + extra_len, GFP_ATOMIC);
        new_f->len = inet_mask_len(netmask);
        new_f->extra_len = extra_len;
        new_f->fn_key = network;
        memcpy(new_f->extra, extra, extra_len);
        if (new_f->len > 32) {
                kfree(new_f);
                return -1;
        }
        INIT_HLIST_NODE(&new_f->nf_fn_hash);

        if (NULL == route_table) {
                route_table = kcalloc(1, sizeof(struct nf_fn_hash), GFP_ATOMIC);
                fz = fn_new_zone(route_table, new_f->len);
        } else {
                fz = route_table->nf_fn_zones[new_f->len];
        }

        if (!fz && !(fz = fn_new_zone(route_table, new_f->len))) {
                kfree(new_f);
                return -1;
        }
        fib_insert_node(fz, new_f);
        return 0;
}

void nf_route_table_clear(void)
{
        struct nf_fn_zone *fz, *old;

        if (NULL == route_table) {
                printk("");
                return;
        }

        write_lock_bh(&nf_hash_lock);
        for (fz = route_table->nf_fn_zone_list; fz;) {
                if (fz != NULL) {
                        kfree(fz->fz_hash);
                        fz->fz_hash = NULL;
                        old = fz;
                        fz = fz->fz_next;
                        kfree(old);
                        old = NULL;
                }
        }
        kfree(route_table);
        route_table = NULL;
        write_unlock_bh(&nf_hash_lock);
        return;
}
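Here is a short sketch of how the interfaces might be exercised from module init/exit code, under the assumption (as in this version of the patch) that the extra payload is a plain string. The 192.168.10.0/24 entry matches the procfs demonstration further below; the module wrapper itself is not part of the article's code.

#include <linux/module.h>
#include <linux/inet.h>

static int __init nfrt_demo_init(void)
{
        const char info[] = "1234ABCD";

        /* bind the string to 192.168.10.0/24; addresses are in network byte order,
         * matching iph->saddr as used in ipv4_confirm */
        return nf_route_table_add(in_aton("192.168.10.0"),
                                  in_aton("255.255.255.0"),
                                  (void *)info, sizeof(info));
}

static void __exit nfrt_demo_exit(void)
{
        nf_route_table_clear();
}

module_init(nfrt_demo_init);
module_exit(nfrt_demo_exit);
MODULE_LICENSE("GPL");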

Compilation: I put the above code into the net/ipv4/netfilter directory and changed the Makefile to add nf_conntrack_rtable.o:
nf_conntrack_ipv4-objs := nf_conntrack_l3proto_ipv4.o nf_conntrack_proto_icmp.o nf_conntrack_info.o nf_conntrack_rtable.o
You can then compile; the resulting nf_conntrack_ipv4.ko now carries the extension.
User interface: having abandoned iptables, I chose procfs as the user interface, because it uses the standard file I/O interface and can easily be driven from all kinds of scripts. Three ideas lie behind this choice.
The first idea: if a ready-made, complete mechanism already exists, do not write C code. As the content grows, the code should shrink!
The second idea: use transactional operations instead of encapsulating the transaction sequence yourself. Take writing a file. If you do it in C or Java, you have to complete a sequence like open - read info - write - write - ... - close; you are no longer manipulating a file so much as playing with the programming language itself. With a bash script a single echo is enough: the echo program encapsulates the complete sequence (try strace echo sth > ./a). Why implement something similar over and over again? This naturally leads to the third idea.
The third idea: use text, not binary, for configuration.

As long as the operation is not purely machine-to-machine, do not use binary; binary is the world of machines. As soon as human factors are involved, text is the best way to exchange information.
Therefore, I chose procfs as the channel for communicating with the kernel mechanism to configure policies, that is, to add and remove "extended routes".
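The article does not show the procfs glue, but a write handler in the spirit of the echo syntax used below might look like this sketch. The file name nfrtable, the "+" prefix for adding, the "-" form for deleting, and the parsing details are all assumptions based on the example in the Effect section; the table interfaces are the ones defined above.

#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/inet.h>
#include <linux/string.h>

/* Sketch of an old 2.6-style write_proc handler for /proc/nfrtable.
 * Assumed line format: "+<network> <netmask> <info>" to add,
 * "-<network> <netmask>" to delete. Parsing is deliberately simplistic. */
static int nfrtable_write(struct file *file, const char __user *buffer,
                          unsigned long count, void *data)
{
        char kbuf[160] = {0};
        char net[16], mask[16], info[128];

        if (count >= sizeof(kbuf))
                return -EINVAL;
        if (copy_from_user(kbuf, buffer, count))
                return -EFAULT;

        if (kbuf[0] == '+' && sscanf(kbuf + 1, "%15s %15s %127s", net, mask, info) == 3)
                nf_route_table_add(in_aton(net), in_aton(mask), info, strlen(info) + 1);
        else if (kbuf[0] == '-' && sscanf(kbuf + 1, "%15s %15s", net, mask) == 2)
                nf_route_table_delete(in_aton(net), in_aton(mask));

        return count;
}

/* registration, e.g. in module init:
 *   struct proc_dir_entry *pde = create_proc_entry("nfrtable", 0644, NULL);
 *   if (pde)
 *           pde->write_proc = nfrtable_write;
 */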


Effect
When I run the following operation:
echo +192.168.10.0 255.255.255.0 1234ABCD > /proc/nfrtable
When a packet arrives from 192.168.10.30, you will see the 1234ABCD string associated with that data stream in /proc/net/nf_conntrack, whereas packets from the 192.168.20.0/24 segment will carry no such information. Also, when you run iptables-save, there are no INFO settings to be seen anymore; in fact, the INFO module is not needed at all.
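The article does not show how the string ends up visible in /proc/net/nf_conntrack; presumably ct_seq_show() in net/netfilter/nf_conntrack_standalone.c received a small addition along the lines of the sketch below, where s is the seq_file being written and ct the conntrack being printed, and nf_conn_priv is the author's private layout from the ipv4_confirm change.

        /* inside ct_seq_show(), after the usual per-conntrack fields are printed */
        acct = nf_conn_acct_find(ct);
        if (acct) {
                struct nf_conn_priv *ncp = (struct nf_conn_priv *)acct;

                if (ncp->info)
                        seq_printf(s, "info=%s ", ncp->info);
        }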

On reusing Linux kernel algorithms, the kernel is quite eclectic: its algorithms are public, there is no copyright obstacle, so anyone can take them and use them. The catch is that porting an algorithm has a price; as Linus's classic line goes, RTFS. The Linux source does not provide any ready-made porting interface for you to use; everything is up to you. For example, you cannot compile the kernel into a library of reusable modules; if you need that, you have to change the Makefile yourself.


About the Linux kernel API: the Linux kernel API is changeable, and the biggest advantage this fluidity brings is flexibility. Again, that same line: RTFS. The source code is right in front of you; take what you need yourself. The kernel community simply does not provide any hand-holding support.

It is like a gold mine: the manager only manages it. Any gold digger can dig out the gold they need, but the manager provides no vehicle to haul it away, not even for rent. And gold, you know, is very heavy!
The Linux gold mine has one more rule: anyone may contribute their own code, tools, or anything else. The consequence is that the Linux kernel API is not stable. Do not expect a QC team to check your code and interface definitions; all the community cares about is whether it solves a real problem. Say you are using the 2.6.18 kernel and you find some interfaces bad, so you change part of them; if you manage not to get scolded by Linus, those interfaces may become the standard interfaces of 2.6.18-x or 2.6.19.

Linux does, however, provide the LKM mechanism, and an LKM has to call kernel interfaces. Therefore an LKM must be built against exactly the kernel version of the host, because the kernel interfaces can change at any time.
Linux's style determines its flexibility, at the cost of many restrictions on LKMs. Windows is just the opposite: it provides maximally compatible interfaces, whether for kernel drivers or for user applications. That too has a price, of course: the Windows kernel and its support libraries keep getting more bloated, because a large number of adapters are needed to accommodate both the new and the old interfaces.

Microsoft's APIs change as well; it just hides those changes from its users, and the ones who carry out the hiding are Microsoft's own staff...

