Analysis of IP NAT details in the Linux bridge: the process of filling another pit

Last Update:2017-03-24 Source: Internet

Author: User


Preface

Recently the boss of the Wenzhou leather shoe factory has been busy learning about the Linux bridge and a lot of things related to virtual network cards. Old Wet gave him some guidance, but fundamentally it still depends on the Wenzhou boss himself. As if some spirit had been listening to my thoughts, the Wenzhou boss made me miss, once again, the painful and happy days I spent playing with the Linux bridge and Linux netfilter, and right at that moment another interesting problem turned up.

I haven't touched the Linux bridge for more than three years, and I do miss it. Thanks to the colleague who handed me a tricky bridge problem to diagnose!
With experience, the problem was fixed quickly. But I was afraid that years from now I would have lost the chance to boast about it, so I wrote this article as something to sigh over later.
First, a look at two Linux bridge pits I found earlier:
1. "The ebtables OUTPUT chain DNAT problem"
Roughly, ebtables did not redo the MAC-to-port mapping lookup after DNAT, so the frame was sent to the wrong port. This was on version 2.6.32, and I filled the pit by hand. I don't know whether the community fixed it in later versions, and I no longer follow it.
2. "Design issues of bridge-nf-call-iptables"
Strictly speaking this is not a pit but a small flaw, and I also filled it by hand.
So what is today's pit?
Problem description
First, look at the topology diagram:

Question: when CMD1 and CMD2 are executed on HOST1, which will succeed and which will fail? This section does not analyze the problem, so here are just the answers:
Linux 2.6.32: CMD1 and CMD2 both succeed;
Linux 3.10.9: CMD1 fails, CMD2 succeeds;
Linux 4.9.0: CMD1 and CMD2 both fail;
Mind-blowing! Why?!
Bridge process analysis
I have analyzed the Linux bridge process in detail before; see "Transparent firewall to achieve the necessary knowledge-bridge filter Half View". (I never indexed my articles, so many of them are hard to find. I remember this "half view" analysis, and there should be others too. If anyone finds them, please tell me the titles so I can add an index.) The process has changed little in later kernel versions, so I won't repeat it. I will only briefly introduce the parts relevant to this article.
We know that the bridge can directly invoke the IP layer's netfilter hooks to perform some operations on a packet. NAT is a typical one, and it is this NAT operation that causes today's problem. So next I describe how IP NAT is called from the bridge.
I won't analyze the 2.6.32 bridge, since I did that years ago. Let's go straight to 3.10 and 4.9.
The Linux bridge has a feature that lets it call iptables. The feature has always existed, but its behavior has not stayed the same. Let's start with the logic of 3.10.
Process analysis and solution for Linux 3.10 bridge/IP DNAT
I used to prefer drawing flowcharts before analyzing a packet's path through the kernel. But if you are not interested in how Linux implements it, or know nothing about it, a flowchart is hard to follow, and even if you can follow it, it is hard to extrapolate from. After all, Linux is just one implementation, not necessarily the standard, so a drawing of Linux's execution flow only helps you understand that execution logic and does little to solve the problem. When you hit a similar problem you still have to go through the code. Since it ends at the code anyway, it is better to start from the code, point directly at the piece that is the problem, and the solution then presents itself.
After call-iptables has run the IP layer's PREROUTING hook, execution enters br_nf_pre_routing_finish. If there is a DNAT rule on the IP layer's PREROUTING, the DNAT has been done by then, and the subsequent logic is as follows:

Note the red annotation marked with a red star: that is the problem. If you look with brctl showmacs br0, you will find:
P1--host1/eth0
P2--host2/eth0
According to the MAC/port mapping, the frame finally destined for HOST2's eth0 has to leave through P1, that is, eth1 of Box0. But the frame was originally received on eth1 of Box0... A frame is never sent back out of the port it was received on, not even when flooding!
In fact this is similar to the earlier "ebtables local out does not reroute after DNAT" problem. Once IP DNAT has been done and the destination IP has changed, the layer-2 metadata derived from the old destination IP should be invalidated, because the next-hop resolution for the old destination IP may be completely different from that for the new one. Before DNAT the frame might leave through Port1, and after DNAT through Port2. Consider this paragraph a complaint; the antidotes are given below.
Once the problem is understood, the cure follows naturally. There are three antidotes:
Solution One: Modify the kernel source
The idea is clear: once IP DNAT has occurred, the constraint "a frame is never sent back out of the port it was received on" no longer applies. Add a BRNF_BRIDGED_IP_DNAT bit for the mask field of the nf_bridge_info structure:

#define BRNF_BRIDGED_IP_DNAT            0x40

Then, in br_nf_pre_routing_finish, set the bit when IP DNAT is detected:

if (dnat_took_place(skb)) {
    nf_bridge->mask |= BRNF_BRIDGED_IP_DNAT;
    ...
}

Finally, take this new mask bit into account when deciding whether a frame may be forwarded to a port, by modifying should_deliver:

static inline int should_deliver(const struct net_bridge_port *p,
                                 const struct sk_buff *skb)
{
    struct nf_bridge_info *nf_bridge = skb->nf_bridge;

    /* nf_bridge can be NULL for frames that never went through br_netfilter */
    return (((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev ||
             (nf_bridge && (nf_bridge->mask & BRNF_BRIDGED_IP_DNAT))) &&
            br_allowed_egress(p->br, nbp_get_vlan_info(p), skb) &&
            p->state == BR_STATE_FORWARDING);
}

And with that, traffic gets through.
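To make the effect of the change easier to see, here is a minimal userspace sketch of the delivery rule before and after the patch. It is not kernel code: the struct names, the value of BR_HAIRPIN_MODE and the integer port index are assumptions made for illustration, and the VLAN and STP state checks of the real should_deliver are left out.

```c
#include <stdbool.h>

/* Hypothetical userspace model; names mirror the kernel but values are illustrative. */
#define BR_HAIRPIN_MODE       0x01  /* illustrative value only */
#define BRNF_BRIDGED_IP_DNAT  0x40  /* the new mask bit proposed in the text */

struct model_port {
    int ifindex;          /* stands in for p->dev */
    unsigned int flags;   /* BR_HAIRPIN_MODE or 0 */
};

struct model_frame {
    int in_ifindex;       /* stands in for skb->dev, the ingress port */
    unsigned int nf_mask; /* stands in for nf_bridge->mask */
};

/* Stock rule: never send a frame back out of its ingress port unless hairpin is on. */
static bool should_deliver_stock(const struct model_port *p,
                                 const struct model_frame *f)
{
    return (p->flags & BR_HAIRPIN_MODE) || f->in_ifindex != p->ifindex;
}

/* Patched rule: an IP DNAT recorded on the frame also lifts the restriction. */
static bool should_deliver_patched(const struct model_port *p,
                                   const struct model_frame *f)
{
    return should_deliver_stock(p, f) || (f->nf_mask & BRNF_BRIDGED_IP_DNAT);
}
```

In this model, a frame that came in on port 2 and carries the DNAT bit is refused by the stock rule for port 2 but accepted by the patched rule, while a frame without the bit is still refused by both.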
However, modifying the kernel is just a small trick. What if you cannot modify the kernel? In fact, being able to change the kernel at best shows that you are familiar with the code, so the following solutions are the real answer.
Solution Two: Configure hairpin
What is hairpin?
It is a concept often mentioned in network virtualization: the VEPA mode of a switch port. The technique uses the physical switch to forward traffic between virtual machines. Obviously, in this case the source and the destination lie in the same direction, so the frame goes back out where it came from. With hairpin configured, the problem described in this article is solved (under the 3.10 kernel, of course).
How to configure it? Very simple:
brctl hairpin br0 eth1 on
If your kernel was compiled and upgraded by hand, your userspace tools may not support all the features of the new kernel. That is, your brctl may be too old to support the hairpin command, in which case you can use sysfs instead:
echo 1 > /sys/class/net/br0/brif/eth1/hairpin_mode
Very good: one command, and the problem is solved!
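If the same setting has to be applied from a program rather than a shell, the sysfs write can be sketched in C as below. The helper names hairpin_path and set_hairpin are made up for this example; only the sysfs path layout comes from the command shown in the text, and the write needs root.

```c
#include <stdio.h>

/* Build the sysfs path of a bridge port's hairpin_mode attribute.
 * Returns 0 on success, -1 if the buffer is too small. */
static int hairpin_path(char *buf, size_t len,
                        const char *bridge, const char *port)
{
    int n = snprintf(buf, len, "/sys/class/net/%s/brif/%s/hairpin_mode",
                     bridge, port);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" or "0" to the attribute. Returns 0 on success, -1 on failure
 * (for example when the bridge or port does not exist). */
static int set_hairpin(const char *bridge, const char *port, int on)
{
    char path[256];
    FILE *f;
    int rc;

    if (hairpin_path(path, sizeof(path), bridge, port) != 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    rc = (fputs(on ? "1" : "0", f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Calling set_hairpin("br0", "eth1", 1) is intended to have the same effect as the echo command, under the assumption that br0 and eth1 are named as in the article.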
Solution Three: Configure promiscuous mode
The third method is an unconventional but simple one.
From the analysis of the source code you can see that if a port's NIC has promiscuous mode enabled, the data it receives is unconditionally passed up to the IP layer as well. So the following command also works:
ifconfig eth1 promisc
Obviously promiscuous mode helps us by accident: since IP DNAT has been done, let the IP layer decide for all such packets! The approach looks heavy-handed (the bridge would like to keep forwarding at the bridge layer rather than the IP layer even after IP DNAT), but it guarantees that the data is routed correctly and makes up for what the bridge layer does badly.
Also, when troubleshooting why traffic does not get through, capturing packets is necessary. As soon as a capture is running, traffic passes; without a capture, it does not. That naturally points to promiscuous mode! However, this bonus is not available in the 4.9 kernel: when reproducing the problem on 4.9, traffic does not pass even with a capture running... That is all for the 3.10 solutions; next, the solutions for the 4.9 kernel!

Process analysis and solution for Linux 4.9 bridge/IP DNAT
The analysis method is the same as for 3.10. Below is the same process in 4.9; the details differ:

Now the cause of the problem is understood. This time I clearly felt that the code itself has a problem. So the solutions are given in a different order from 3.10: first a configuration that solves the problem, then a thorough code modification:
Solution One: Solve the problem through complex configuration
When br_nf_pre_routing_finish is entered for the second time via br_nf_hook_thresh, the dev argument of ip_route_input has become the physical bridge port that received the frame, that is, eth1 of Box0. ip_route_input then certainly fails! This is because the route to the translated IP still goes out through br0, not eth1:
# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
1.1.1.0         0.0.0.0         255.255.255.0   U     0      0        0 br0
192.168.44.0    0.0.0.0         255.255.255.0   U     0      0        0 br0    "Here's the key."
So for this ip_route_input lookup to succeed, there must be a route to the new destination IP whose output device is eth1, not br0. So add a host route:
route add -host 192.168.44.129 dev eth1
The routing table is now:
# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
1.1.1.0         0.0.0.0         255.255.255.0   U     0      0        0 br0
192.168.44.0    0.0.0.0         255.255.255.0   U     0      0        0 br0
192.168.44.129  0.0.0.0         255.255.255.255 UH    0      0        0 eth1
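Why the extra host route changes the device chosen for 192.168.44.129 can be illustrated with a small longest-prefix-match model. This is only a sketch of the table lookup, with made-up type and function names; it does not model the rest of ip_route_input.

```c
#include <stdint.h>
#include <stddef.h>

struct model_route {
    uint32_t dest;      /* network address, host byte order */
    uint32_t mask;      /* netmask, host byte order */
    const char *iface;  /* output device name */
};

/* Pack a dotted quad into a host-order 32-bit value. */
static uint32_t ip4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a << 24) | (b << 16) | (c << 8) | d;
}

/* Longest-prefix match: among matching routes, the widest mask wins.
 * Returns the device name, or NULL if nothing matches. */
static const char *route_dev(const struct model_route *tbl, size_t n,
                             uint32_t dst)
{
    const struct model_route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((dst & tbl[i].mask) != tbl[i].dest)
            continue;
        if (!best || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best ? best->iface : NULL;
}
```

With only the two /24 routes from the first table, the lookup for 192.168.44.129 yields br0; once the /32 host route is present, the more specific entry wins and the lookup yields eth1.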
The process now looks like this:

Is that the end of it? Does traffic pass now?
Not so simple!!
The packet now reaches the upper layer and is sent back down toward HOST2 by IP routing. But before sending to HOST2, the stack must ARP for the MAC address of 192.168.44.129, and HOST2's reply must be bound to eth1. That is, we need an ARP entry like this:
192.168.44.129 ether 00:0c:29:a9:f3:d3 C eth1
Instead of:
192.168.44.129 ether 00:0c:29:a9:f3:d3 C br0
So an ebtables broute rule is needed for this ARP reply:
ebtables -t broute -A BROUTING -p ARP --arp-ip-src 192.168.44.129 -j DROP
(Note that DROP here means the frame is thrown out of the bridge.)
OK, ARP works now. Is there still a problem? Alas... yes!
Packets can now be sent to 192.168.44.129, but what about the packets coming back from 192.168.44.129? The destination MAC of those packets is obviously the MAC of Box0's eth1, but the device that receives them is br0, not eth1. What goes wrong then?
On a conventional router there is a measure called reverse path lookup, which checks that the interface a packet is received on is the same interface that would be used to send a packet back to its source. It exists for security, to prevent forged packets. Back to our case: the interface that received the packet is br0, while the output device found by the reverse route lookup is eth1 (because we statically configured a host route). The two do not match, so the reverse path check fails and the packet is dropped.
So can the reverse path check be disabled? Yes! And it is simple:
sysctl -w net.ipv4.conf.br0.rp_filter=0
Perfect! Perfect!
But there is another way, more general than changing rp_filter: a more general ebtables rule, so that all packets from 192.168.44.129 go to the IP layer:
ebtables -t broute -A BROUTING -p IPv4 --ip-src 192.168.44.129 -j DROP
We know that broute changes the device a packet is received on, which means communication succeeds even with rp_filter enabled!
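The three situations above (strict check failing on br0, check disabled, and brouting making the packet arrive on eth1) can be put side by side in a small model. This is a sketch with hypothetical names; the three modes follow the documented rp_filter values 0, 1 and 2, and the real kernel check involves more than a name comparison.

```c
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

enum rp_mode { RP_OFF = 0, RP_STRICT = 1, RP_LOOSE = 2 };

/* in_iface:  device on which the IP layer sees the packet.
 * rev_iface: device the routing table would use to reach the packet's
 *            source address (NULL if there is no route back). */
static bool rp_filter_accepts(enum rp_mode mode, const char *in_iface,
                              const char *rev_iface)
{
    switch (mode) {
    case RP_OFF:
        return true;                 /* no source validation */
    case RP_LOOSE:
        return rev_iface != NULL;    /* any route back is enough */
    case RP_STRICT:
    default:
        /* the route back must use the receiving device */
        return rev_iface != NULL && strcmp(in_iface, rev_iface) == 0;
    }
}
```

In the article's setup the route back to 192.168.44.129 points to eth1, so rp_filter_accepts(RP_STRICT, "br0", "eth1") is false, while both RP_OFF on br0 and a brouted packet seen on eth1 are accepted. One point worth checking on a real system: the kernel uses the larger of the conf/all and per-interface rp_filter values, so setting only the br0 value to 0 may not be enough if conf/all is non-zero.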
...
That completes this solution. I have to sigh: with OVS (Open vSwitch) this would be so simple, needing only a few flow table entries, whereas with the bridge you have to bring in ebtables, iptables, iproute2, arp and other tools (although these are my favorites...). Alas.
The Wenzhou shoe factory boss has recently been learning network virtualization, so the OVS configuration is left to him.
Solution Two: Modify the br_nf_pre_routing_finish function!
This is the once-and-for-all solution. I believe that without a host route on the physical NIC, the scenario in this article can never work on a Linux 4.9 kernel, which really should not be the case! So I decided to change that. The change is simple and touches one place, the dev argument of ip_route_input:

if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {

is changed to:

if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, bridge_parent(dev) ? bridge_parent(dev) : dev))) {

This ensures that if IP DNAT occurred on the first entry into br_nf_pre_routing_finish and skb's dev was changed to the physical NIC, then on the second entry via br_nf_hook_thresh, br_nf_pre_routing_finish still passes the bridge as the dev argument to ip_route_input (even though skb's dev is already the physical NIC). The route lookup is then correct, and processing naturally continues into the "route this packet to the local IP layer" logic.
In short, once IP DNAT has occurred, routing the packet to the IP layer is never wrong.
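The device selection in this change can be shown in isolation with a tiny userspace model. The struct and function names below are invented for the sketch; model_bridge_parent only imitates the idea that a port device knows its owning bridge while the bridge device itself has no parent.

```c
#include <stddef.h>

struct model_netdev {
    const char *name;
    const struct model_netdev *bridge; /* owning bridge if this is a port, else NULL */
};

/* Stand-in for bridge_parent(): the bridge a port belongs to, or NULL. */
static const struct model_netdev *
model_bridge_parent(const struct model_netdev *dev)
{
    return dev->bridge;
}

/* Device handed to the route lookup: the bridge when dev is a port,
 * otherwise dev itself. */
static const struct model_netdev *
route_input_dev(const struct model_netdev *dev)
{
    const struct model_netdev *br = model_bridge_parent(dev);

    return br ? br : dev;
}
```

Whether the function is called with eth1 (a port of br0) or with br0 itself, the lookup device comes out as br0, which is the property the modified ip_route_input call relies on.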
Solution Three: Handle the second entry into br_nf_pre_routing_finish
The third method is even simpler: when br_nf_pre_routing_finish is entered for the second time via br_nf_hook_thresh, detect that it is the second entry and jump directly to the final br_handle_frame_finish:

    nf_bridge->in_prerouting = 0;
    if (br_nf_ipv4_daddr_was_changed(skb, nf_bridge)) {
        if (bridge_parent(dev) == NULL)
            goto go;
        ...
go:
        rt = bridge_parent_rtable(nf_bridge->physindev);
        if (!rt) {
            kfree_skb(skb);
            return 0;
        }
        skb_dst_set_noref(skb, &rt->dst);
    }
    ...
    br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL,
                      br_handle_frame_finish);
    return 0;
}

After this modification, even though IP DNAT took place, the bridge still does not give up ownership of the packet and keeps trying to forward it at layer 2. The consequence is predictable: it still hits the hairpin problem (see the 3.10 analysis). The fix, of course, is to set hairpin on br0's eth1 port with brctl or sysfs.

Summary

Whether on the 3.10 kernel or the 4.9 kernel, the problems differ but the approach is the same. There are two options: send the frame onward along the original bridge path after IP DNAT, or route the frame to the local IP layer after IP DNAT. From this article you can pick up a lot of bridge knowledge, including hairpin, IP DNAT and so on. However, this article does not touch the very important topic of STP.

The rest we'll talk about in the morning.

--- Added the following day:

In Solution One for the 4.9 kernel I mentioned the broute configuration. I suspect you may not know the difference between ebtables broute and redirect. There is actually a pit here too; for details see "Ebtables brouting and prerouting redirect Difference". In short, the Linux bridge has too many pits. These are not bugs but really matters of opinion: however the authors thought about it is how it got implemented. After all, this is an area with no standard. Who has seen a switch on which you can configure IP DNAT, with a specification for how it must be implemented? There is no such specification. Cisco is the industry leader: whatever its switches and routers support basically becomes the standard, or at least its own house standard, which influences the whole industry, and it does not let you see the internals... Linux is the opposite: anyone can look at the internals. But that does not mean everyone has a voice. The Linux open source community is also a community of acquaintances, controlled by a few giants. If Google submits a patch, nine times out of ten it is applied; the same thing submitted by an unemployed drifter would, I estimate, be rejected 200% of the time... Android customization is so trivial, yet didn't it also "contribute" a lot of rubbish to the mainline? Only because it's Google's.
Finally, a word about myself. I emphasize again that my programming is not good. It is not that I can't program at all; I can do a little. But I know the data paths through the network protocol stack well and I know network protocols well, so I can quickly locate certain network problems and know how to fix them quickly.
I also want to say that from 2.4.1 to 4.10 the core of the Linux kernel has hardly changed and its goals have stayed consistent, so don't bring up major differences such as the O(n) scheduler versus CFS. To understand a technology thoroughly, the first step must be to use it, then to become familiar with how it works, and only last to read the source. The Wenzhou leather shoe factory boss always thought that to learn a technology it is enough to read the source, which is extremely wrong. If someone gives you an engine and you don't know what it can do, and you take it apart without ever running it, that is disassembly for the sake of disassembly. Even if you learn that there are cylinders and pistons inside... you still don't know what the thing is for, and you can't imagine what it looks like in motion. Nowadays too many people analyze source code, and all sorts of resumes say "familiar with the XXX source code, proficient in the YYY source code"... It sounds impressive, but is it really? Apart from being able to recite a few function names, many people (I am not saying all) do not even understand the principles. The Linux kernel code is well written and fairly readable. Reading it, given some time, is not impossible, at least for a few subsystems, but isn't it better to learn operating system principles or network protocols first?
Let me take myself as an example again. The year I graduated I went to the Jida campus job fair and applied for a programmer position at Soft Jida (it's an old story, so I no longer keep the company anonymous, hoping to leave a footprint). I took the group written test with about 10 other people and came first. Later, when the training period for all of us new hires ended, senior staff started to mentor us one-on-one into our first jobs. The person who took me was a fairly senior engineer and also a leader, which may have had something to do with my written test result. The first time I talked with him, I said I was not good at programming and was afraid I would not do well. He asked me: coding or understanding network protocols, which do you think is more important? I think he already had the answer in mind and asked only to dispel my sense of inferiority. At that moment I remembered how I had won the interviewer's favor at the job fair. On my resume, apart from saying that I knew a little Java but was not proficient, everything else was about networking: ATM, X.25, Frame Relay, IPsec, GRE, OSPF, PPP, DDN, switched networks (many of them are obsolete now...). I was allowed to take the written test probably because this resume was rather unusual. Fortunately the programming questions on the written test were not tied to a language, so my Java was enough to get by, and the other questions played to my strengths. Then I thought about when I had learned so much network technology. I read in the study rooms of Henan Agricultural University and verified things in the labs of the Zhengzhou Institute of Technology, and after two years of tinkering like that I took the Huawei HCSE exam and got the certificate. I then wanted to take CCNP or IE, but did not have that much money... That was 2004. Going further back, to the first time I touched Java: I was in Harbin then, and one weekend, I don't know why, I bought a Java book. I could not understand it, but it seems I kept it.
But the book is in Shanghai and I did not bring it to Shenzhen, otherwise I would certainly post a photo of it. That was 2002.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
