Proposition Composition: Dimension tree interval lookup and IP packet classification

Source: Internet
Author: User
Tags benchmark

This topic is large, so I have to keep the word count under strict control, unlike the more free-ranging "Proposition composition: in a IPv4 address tree to thoroughly understand the IP routing table of the various search process". In fact, this composition extends the interval lookup section of that earlier essay.
1. IP packet classification
The core of IP packet classification is to assign a packet to a category based on several fields of its protocol headers; these fields are also called matching domains.
In fact, IP route lookup is an extremely simple special case of IP packet classification: the only matching domain is the destination IP address, and the category is the route entry or, more simply, the next hop. Now consider source-address policy routing, which adds a second matching domain, the source IP address. How should the whole process work then? This essay focuses on a multi-dimensional interval matching scheme and does not cover other approaches such as hash matching or hardware algorithms.
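To make "several matching domains, one category" concrete, here is a minimal baseline sketch that classifies by scanning every rule linearly. It is not the interval scheme discussed in this essay. The `Rule` fields, the rule names and the convention that a larger `priority` wins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int  # assumption: the larger value wins
    src: str       # source prefix, first matching domain
    dst: str       # destination prefix, second matching domain


def classify(rules, src, dst):
    """Return the highest-priority rule whose every matching domain covers the packet."""
    s, d = ip_address(src), ip_address(dst)
    hits = [r for r in rules if s in ip_network(r.src) and d in ip_network(r.dst)]
    return max(hits, key=lambda r: r.priority, default=None)


# Hypothetical rule table; the default rule matches everything.
RULES = [
    Rule("default", 0, "0.0.0.0/0", "0.0.0.0/0"),
    Rule("to-dmz", 5, "0.0.0.0/0", "10.0.0.0/8"),
    Rule("lan-to-dmz", 10, "192.168.1.0/24", "10.0.0.0/8"),
]
```

With only the destination domain this degenerates into a route lookup. The cost grows with the number of rules, which is what interval-based matching tries to avoid.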
2. Interval search
I hope my earlier articles "HiPAC high performance rule matching algorithm discovery process" and "play to high performance super-strong firewall Nf-hipac" are of some help here. In fact, IP packet classification can be seen as combining the results of an interval lookup on each matching domain.
The general HiPAC algorithm runs very efficiently, but it can waste a lot of memory. It may be necessary to optimize the storage of grid-of-tries style data structures to eliminate redundant data.
With that, the discussion of the core algorithm for IP packet classification could end, but the theory behind packet classification is only just beginning. I will use an example and diagrams to explain it as clearly as possible, rather than a uniform mathematical derivation, in order to give an intuitive picture.
3. The problem
Often, knowing only how to locate an IP packet in a collection of multi-dimensional continuous intervals is not enough, although in practice it does not matter much if you do not understand even that. If you do understand it, then you also need to understand what lies behind it.
I now assume that, from the two nf-HiPAC articles mentioned above, you already know the multi-dimensional interval matching process. We can then step away from the specific scenario and abstract it into a general problem. First look at the abstract diagram, in which I ignore the sizes of the intervals and the question of how the rules are arranged (I will return to this question at the end):




In this diagram we see a lot of "?" marks, which mean that at this point we do not know which ruleset is associated with that interval. Why is that?
Look at the second layer, the Dimension 2 layer. It has 4 nodes, corresponding to the 4 intervals that Dimension 1 is split into. For the match to continue down to the nodes of the Dimension 3 layer, the corresponding rulesets must have a non-empty intersection. That gives us the clearer picture below, although it is a little messy. (While writing this, for some reason I thought of Harbin and wanted to cry.)




From this diagram we can identify three key points:
1). Key operation: ruleset intersection. Take the intersection of the ruleset of the current interval with the ruleset of each interval of the next dimension; the branch is extended into a child node only if the result is not empty. I added some scenario-specific constraints, such as every interval having at least one default rule, so no intersection will ever be empty. The statement above should therefore read: if the result set has exactly 1 element and that element is the default rule, the node for the next dimension under the current interval becomes a leaf. The next dimension then grows no subtree there, which means there is no need to continue matching; just take the default rule.
2). Key tradeoff: rapid success versus rapid failure. As the key operation shows, as long as the match continues, i.e. the ruleset remaining for the current interval contains more than 1 rule, or exactly 1 rule that is not the default rule, we cannot determine the final result, so rapid success is impossible. But we can quickly determine failure: only the default rule is left in the ruleset of the current interval. This is rapid failure.
3). Key question: how to lay out the tree. The diagram above shows that, because the rulesets are intersected, the order of Dimension 1/2/3 does not affect the final result, but it can affect how much space is used. As long as the rulesets intersect, creating a child node on the lower layer really means creating an interval lookup tree, and we know from rapid failure that the subtrees may differ in height depending on the ruleset layout. The question is: how do we arrange the lookup order of the dimensions?
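The key operation and the rapid-failure rule can be sketched as follows. This is only an illustration under stated assumptions: each dimension is a sorted list of `(start, ruleset)` pairs whose first start is the minimum of the domain, a rule is just a string, and every name here (`build`, `lookup`, `DIMS`, the rule names) is made up for the example.

```python
from bisect import bisect_right

DEFAULT = "default"


def build(dims, level=0, carried=None):
    """Build one node for dims[level]; `carried` is the ruleset that survived so far."""
    starts, children = [], []
    for start, rules in dims[level]:
        alive = rules if carried is None else carried & rules  # the key operation
        starts.append(start)
        if alive == {DEFAULT} or level == len(dims) - 1:
            children.append(alive)  # leaf: rapid failure, or last dimension
        else:
            children.append(build(dims, level + 1, alive))
    return (starts, children)


def lookup(node, packet):
    """Walk one dimension per level and return the ruleset stored in the leaf."""
    level = 0
    while isinstance(node, tuple):
        starts, children = node
        node = children[bisect_right(starts, packet[level]) - 1]
        level += 1
    return node


# Hypothetical three-dimensional layout: interval i starts at `start`
# and runs up to the next start.
DIMS = [
    [(0, {DEFAULT, "r1", "r2"}), (100, {DEFAULT}), (200, {DEFAULT, "r3"})],
    [(0, {DEFAULT, "r1"}), (50, {DEFAULT, "r2", "r3"})],
    [(0, {DEFAULT, "r1", "r3"}), (10, {DEFAULT, "r2"})],
]
```

In this layout the middle interval of the first dimension becomes a leaf immediately, so a packet falling there is answered without touching the other two dimensions. Reordering `DIMS` changes how many nodes are created but not the returned ruleset.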
4. The difficulty
Do not expect to perform any balancing operation on the tree above, because the factors affecting the height and width of its subtrees are interrelated: the number of rules, the number of intervals in each dimension, the internal layout of the rulesets, and so on. Any "rotation" of this tree would leave it in a complete mess.
Since there is no good way to calculate the optimal matching order of the dimensions, it is better to find a way to make this tree more balanced in a statistical sense.
5. From M-fork (M-ary) tree to binary tree
So far we have treated multidimensional interval matching as an M-fork tree, where M is the number of intervals into which a layer, that is, a dimension, is divided. From the analysis above, it is hard to find the best way to build a tree that is so wide and whose branches have such unpredictable heights, because there are too many variables: the number of rules in each ruleset, the number of intervals, and the ruleset assigned to each interval of each dimension, which involves a Cartesian product. Too much calculation is involved, and there is no simple formula.
Does the multi-dimensional interval matching tree have to be built as an M-fork tree in the first place?
Before going on, I will present the multidimensional matching problem in a more intuitive way, and then start from scratch until we meet the KD tree. The rest can be found in any book on data structures and algorithms.
5.1. Multidimensional matching cube
Fortunately my example uses three matching domains; otherwise I really would not know how to draw a four-dimensional cube.
I depict the three-domain matching problem above as a cube:




Notice the black points along each axis: the cuts through these points divide the cube into 4*3*2 sub-cubes. When the match finally completes we land in one sub-cube, and the ruleset stored in that sub-cube is the result set; the rule with the highest priority in it is the final result. Note that the rulesets were put into the sub-cubes when the cube was constructed and, as in the proposition composition on routing, I do not consider the time complexity of constructing the data structure, which is a relatively rare event. So I put the rulesets into the sub-cubes, as shown below:




So here is the problem: how do I cut this cube to finally reach the desired sub-cube? Where does the first cut go? Along the D1 axis? The D2 axis? The D3 axis? If the first cut is along the D1 axis, what about the second? Continue along D1, or cut in a different direction? ...
5.2. Cutting method
This may remind you of cutting a watermelon, but it is actually quite different; it is more like extracting a small block from the middle of a large block of wood. There are two ways. In the first, you cut the block into slabs, cut a slab into sticks, and then chop a stick into small blocks. The second resembles milling on a machine tool: you keep changing the cutting direction. We face the same choice in multidimensional matching.
5.2.1. Dimension depth first
This resembles the slab-stick-block approach, as shown below:



5.2.2. Dimension breadth first
This resembles the machine-tool milling approach, as shown below:




Now it is time to give the data structures. The structure and construction of the interval binary search tree were given in the interval lookup section of the routing lookup composition, and this essay takes them as given.
5.3. Depth-first search tree
The structure of the depth-first search tree is as follows:




As you can see, this is a direct approach, especially suitable for exact matching. This is the tree I mean when I say Dimension tree or DimTree.
5.4. Breadth-first search tree
The structure of the breadth-first search tree is as follows:




As you can see, this approach feels its way forward, approaching the final sub-cube gradually and cautiously, at the finest possible granularity in each dimension. It is really the result of rotating the Dimension tree, and it also reaches the answer: each time a leaf node of some dimension is hit, its ruleset is pushed onto a stack; when the tree walk ends, the rulesets on the stack are popped and intersected, and the best rule is taken. You can also use rapid failure to prune a branch, for example on reaching a leaf node that contains only the default rule. But how long does it take to reach a leaf node for the first time? If the search tree of the first dimension has 3 layers and those of the second and third dimensions have 2 layers each, the layers of the breadth-first tree are: first dimension, second dimension, third dimension, first dimension, second dimension (first leaf encountered), third dimension (leaf node), first dimension, and so on. Even if you start cutting from the second dimension, you cannot reach a leaf at the second layer. This means that even the rapid-failure test comes later than with depth first.
5.5. Comparison
For exact-match classification of packets, the depth-first approach is better than the breadth-first approach. This is not only because, without an exact interval match, a near match does not count (for example, when matching 192.168.1.18 exactly, 192.168.1.19 clearly does not qualify, however close it is), but also because this approach makes the best use of the system cache.
Note that a mask match on a packet is not in itself an exact match. This article matches strictly on intervals: the masks divide a domain into a number of precise intervals, and the goal is to determine exactly which interval a packet's field falls into. Note that this is an exact match.
The KD tree, or k-dimension tree, uses the breadth-first matching method described above. However, its idea for ordering the dimensions can serve as a reference for depth-first matching, and the ultimate goal is to build a tree that "fails quickly if it does not match"!
Note: the k-dimension tree may look like a B-tree, but it is not!
There is plenty of material on the KD tree on the web. What I want to discuss is how variance is used in its recursive construction. Now let's answer the question of where the first cut goes.
I use the interval length as the measure, so for each dimension we can calculate the variance of its interval lengths, namely:
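Written out as a sketch, assuming the usual population variance and using the hypothetical symbols $l_1, \dots, l_n$ for the lengths of the $n$ intervals of one dimension:

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(l_i - \bar{l}\right)^2,
\qquad
\bar{l} = \frac{1}{n}\sum_{i=1}^{n} l_i
```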



The dimension for which this value is largest becomes the first dimension. Then, using naive bisection, take the boundary point of that dimension nearest the midpoint of the whole set as the root and build the binary tree recursively. At each level, compute the variance within each subdomain and choose the roots of the two subtrees by the same maximum-variance principle. The diagram in the breadth-first section above is in fact a simple KD tree.
The idea behind using the maximum variance is to make the tree more balanced. When the variance is larger, the boundary points on that dimension are more spread out rather than concentrated near the mean, so a cut will not slice off a large chunk at once and thereby miss regions that may be closer to the final answer. (The principle for approaching the final result: cut at as fine a granularity as possible and be careful, like peeling a pear or sharpening a pencil.) Note that the goal of the KD tree is not exact matching but fuzzy best matching, such as the classic K-nearest-neighbor search.
6. Dimension tree dimension ordering
Because depth-first matching does not cut recursively across dimensions, there is no need to compute a variance at every step to decide which dimension to cut. In depth-first matching, once a dimension is chosen it is cut all the way: in the three-dimensional case the cube is cut straight down to a slab, in the two-dimensional case to a stick, and so on. Later cuts never need to consider that dimension again, because its interval has already been located exactly. So all we need is an order for the dimensions.
Should the order be based on variance, as in a KD tree? Not necessarily. What we want here is to put first the dimension whose rules are spread least evenly across its intervals, and so on. When the rules are unevenly distributed, intersecting an interval's ruleset with those of the next dimension is more likely to leave only the default rule and so trigger rapid failure, which reduces the number of child nodes. In an M-fork tree, the more children are pruned close to the root, the better the overgrowth of the lower layers is suppressed. The formula that decides whether to extend a child is:
ruleset of the current dimension's interval & (ruleset of each interval of the next dimension)
We cannot control the ruleset of each interval of the next dimension; we do not even know which dimension comes next. But we can examine the ruleset distribution of the current dimension. Take the simplest example: a dimension is divided into 4 intervals, the ruleset of the first interval has 10 rules and each of the other three has 2. An interval with only 2 rules is then very likely to be left with nothing but the default rule after intersection with the others, and that gives us a basis for calculation. Clearly, instead of the KD tree's variance of interval lengths, the weight here is a function of the number of intervals and the distribution of rules across them (variance can serve as that function), and it determines the final order. The simplest calculation is to compute, for each dimension, the variance of the number of rules per interval and give the highest weight to the dimension with the largest variance: there the rules are concentrated in one interval rather than spread over the others, so there is a higher likelihood of cutting the child's umbilical cord and getting "rapid failure"!
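A minimal sketch of this ordering, reading "least evenly distributed" as "largest variance of the per-interval rule counts" (the 10, 2, 2, 2 example has a much larger variance than an even 4, 4, 4, 4 split). The dimension names and counts are made up for illustration, and this is one possible weight function rather than a definitive one.

```python
from statistics import pvariance


def order_dimensions(rule_counts):
    """Return dimension names, least evenly distributed (largest variance) first.

    `rule_counts` maps a dimension name to the number of rules in each
    of its intervals.
    """
    return sorted(rule_counts, key=lambda d: pvariance(rule_counts[d]), reverse=True)


# Hypothetical counts; "src" is the 10/2/2/2 example from the text.
COUNTS = {
    "src": [10, 2, 2, 2],   # variance 12
    "dst": [4, 4, 4, 4],    # variance 0
    "port": [6, 3, 4, 3],   # variance 1.5
}
```

Here `order_dimensions(COUNTS)` puts `src` first and `dst` last, so the dimension most likely to produce default-only intersections sits at the root.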
7. Interval coverage problem
Is this a problem? No, it is not.
In this article the interval is the basic unit, so rulesets can be split. In the simplest example, the interval 192.168.1.0/24, carrying rules 1 and 2, is covered by the interval 192.168.0.0/16, carrying rules 1 and 3. This is not a problem; the overlap is resolved as shown below:
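One way to resolve such an overlap is to cut the address range at every prefix boundary and give each resulting piece the union of the rulesets of all prefixes that cover it. The sketch below assumes exactly that union semantics; `split_intervals` and `dotted` are hypothetical helper names, not taken from any library.

```python
from ipaddress import ip_address, ip_network


def split_intervals(prefix_rules):
    """Turn overlapping prefixes into disjoint (start, end, ruleset) intervals."""
    spans = []
    for prefix, rules in prefix_rules.items():
        net = ip_network(prefix)
        spans.append((int(net.network_address), int(net.broadcast_address), set(rules)))
    # Every prefix start, and every address just past a prefix end, is a cut point.
    cuts = sorted({s for s, _, _ in spans} | {e + 1 for _, e, _ in spans})
    result = []
    for lo, nxt in zip(cuts, cuts[1:]):
        rules = set()
        for s, e, r in spans:
            if s <= lo and nxt - 1 <= e:  # this prefix covers the whole piece
                rules |= r
        if rules:
            result.append((lo, nxt - 1, rules))
    return result


def dotted(n):
    """Format an integer address in dotted-quad form for display."""
    return str(ip_address(n))
```

For the example above this yields three disjoint intervals: 192.168.0.0 to 192.168.0.255 with rules {1, 3}, 192.168.1.0 to 192.168.1.255 with rules {1, 2, 3}, and 192.168.2.0 to 192.168.255.255 with rules {1, 3}.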



8. Dimension tree parallel match
Don't hang yourself on a single tree!
Do we have to cram the entire matching data structure into one tree? If we did, we would lose the possibility of parallel operation, because there is only one path from the root of a tree to a leaf, and given the sheer number of branches at the lower levels, predicting the branch is almost impossible. So what do we do? The final result is the intersection of the rulesets of the matched intervals in N dimensions (N is 3 in our example), so the answer is easy: build N trees for the N dimensions, run N lookups simultaneously, and intersect their results at the end. Let's go back to the original diagram with this question in mind.
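A minimal sketch of the "N trees, N lookups, one intersection" idea. Each dimension is assumed to be a pair of sorted interval starts and their rulesets, a plain binary search stands in for the per-dimension interval tree, and the priority table is made up. Python threads are used only to show the structure; they do not by themselves give a real speedup for work this small.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def find(dim, value):
    """Locate the interval of one dimension that contains `value`."""
    starts, rulesets = dim
    return rulesets[bisect_right(starts, value) - 1]


def classify_parallel(dims, packet, priority):
    """Search every dimension independently, then intersect the rulesets."""
    with ThreadPoolExecutor(max_workers=len(dims)) as pool:
        hits = list(pool.map(find, dims, packet))
    common = set.intersection(*hits)
    return max(common, key=priority.__getitem__)


# Hypothetical data: every interval contains the default rule, so the
# intersection is never empty.
DIMS = [
    ([0, 100, 200], [{"default", "r1", "r2"}, {"default"}, {"default", "r3"}]),
    ([0, 50], [{"default", "r1"}, {"default", "r2", "r3"}]),
    ([0, 10], [{"default", "r1", "r3"}, {"default", "r2"}]),
]
PRIORITY = {"default": 0, "r3": 1, "r2": 2, "r1": 3}
```

The price of this layout is that rapid failure in one dimension no longer saves the lookups in the others; what is gained is that no lookup has to wait for another.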



