TCP/IP network problems (implementation of routing/protocol/Linux)

Last Update:2018-12-03 Source: Internet

Author: User

1. Linux virtual network card: TUN
The Linux virtual NIC driver can operate in two modes: point-to-point TUN mode and Ethernet TAP mode. In TUN mode the virtual NIC carries IP datagrams, that is, layer-3 data; in TAP mode it carries Ethernet frames, that is, layer-2 link-layer data. An IP datagram handed over in TUN mode can be sent directly to the peer: on a point-to-point link the peer is fixed and does not need to be addressed. A TUN-mode virtual NIC therefore has no MAC (link-layer) address, and TUN-mode communication needs no ARP address resolution.
TAP mode is different. It carries complete Ethernet frames in the 802.3 format. In earlier versions of the tun driver, the MTU of a tapX NIC could not be set above the Ethernet maximum; later versions of the driver remove this limit. A TAP-mode virtual NIC fully emulates a broadcast Ethernet NIC. Because the network is a broadcast network, layer-3 addresses must be resolved to link-layer addresses, so ARP is unavoidable. All TAP-mode NICs that are joined by a user-space helper process such as OpenVPN belong to one Ethernet LAN, regardless of the physical distance between them, and use ARP for address resolution. A TAP-mode virtual NIC can also carry layer-3 protocols other than IP; in essence it fully emulates 802.3 Ethernet. The NIC initialization function of the tun driver is as follows:
static void tun_net_init(struct net_device *dev)
{
    ...
    switch (tun->flags & TUN_TYPE_MASK) {
    case TUN_TUN_DEV:
        // The TUN device has no MAC address: the link is point-to-point,
        // so no ARP resolution is needed. The two hosts on a point-to-point
        // link need no layer-2 addresses besides their layer-3 addresses;
        // generally they are only told apart as primary and secondary,
        // or active and passive.
        dev->hard_header_len = 0;
        dev->addr_len = 0;
        dev->mtu = 1500;
        dev->type = ARPHRD_NONE;
        dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
        dev->tx_queue_len = 10;
        break;
    case TUN_TAP_DEV:
        dev->set_multicast_list = tun_net_mclist;
        *(u16 *)dev->dev_addr = htons(0x00FF); // the TAP device has a MAC address
        get_random_bytes(dev->dev_addr + sizeof(u16), 4);
        ether_setup(dev); // set up as an Ethernet device
        break;
    }
}
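As a minimal user-space sketch of what the TAP branch does to the MAC address: the first two bytes are the fixed prefix 00:FF and the remaining four are random. The function name tap_make_mac is hypothetical, and rand() merely stands in for the kernel's get_random_bytes().

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill a 6-byte MAC address the way the TUN_TAP_DEV branch of
 * tun_net_init does. tap_make_mac is a hypothetical name; rand() is an
 * assumed stand-in for the kernel's get_random_bytes(). */
void tap_make_mac(uint8_t mac[6])
{
    int i;

    /* htons(0x00FF) stored through a u16 pointer puts 0x00 in byte 0
     * and 0xFF in byte 1, whatever the host byte order is. */
    mac[0] = 0x00;
    mac[1] = 0xFF;

    /* Bytes 2..5 are random (dev_addr + sizeof(u16), length 4). */
    for (i = 2; i < 6; i++)
        mac[i] = (uint8_t)(rand() & 0xFF);
}
```

Because the first octet is 0x00, its lowest bit (the group bit) is clear, so the generated address is always a unicast address.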
tun_get_user, shown below, is the function called when a user-space process writes to the character device of the virtual NIC.
static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t count)
{
    ...
    switch (tun->flags & TUN_TYPE_MASK) {
    case TUN_TUN_DEV: // the TUN device only needs to take the data as is
        skb->mac.raw = skb->data;
        skb->protocol = pi.proto;
        break;
    case TUN_TAP_DEV: // the TAP device must process the Ethernet header
        skb->protocol = eth_type_trans(skb, tun->dev);
        break;
    };
    ...
    netif_rx_ni(skb);
    ...
}
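The difference between the two branches can be sketched from the user-space side: a buffer written to a TAP device starts with an Ethernet header, while a buffer written to a TUN device starts directly with the layer-3 packet. The names tun_mode and payload_ethertype are hypothetical. Inspecting the IP version nibble in TUN mode is an assumption made for this sketch; the kernel code above takes the protocol from pi.proto instead. The sketch also ignores 802.3 length fields, which eth_type_trans handles.

```c
#include <stddef.h>
#include <stdint.h>

enum tun_mode { MODE_TUN, MODE_TAP };   /* hypothetical names */

/* Return the EtherType describing the layer-3 payload of buf,
 * or 0 if it cannot be determined. */
uint16_t payload_ethertype(enum tun_mode mode, const uint8_t *buf, size_t len)
{
    if (mode == MODE_TAP) {
        /* Ethernet header: 6 bytes destination MAC, 6 bytes source MAC,
         * 2 bytes type in network byte order. */
        if (len < 14)
            return 0;
        return (uint16_t)((buf[12] << 8) | buf[13]);
    }

    /* TUN: no link-layer header at all. Assumption: guess the protocol
     * from the version nibble of the first byte. */
    if (len < 1)
        return 0;
    switch (buf[0] >> 4) {
    case 4:
        return 0x0800;   /* ETH_P_IP */
    case 6:
        return 0x86DD;   /* ETH_P_IPV6 */
    default:
        return 0;
    }
}
```

For example, a TAP frame whose bytes 12 and 13 are 0x08 0x06 is an ARP frame, something a TUN device never sees.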
2. About PACKET_OTHERHOST
This macro marks a frame whose destination is not the local machine: the destination MAC address of the frame was compared with the MAC address of the ingress NIC and found to differ. The frame is not discarded at this point, because many protocol modules can process such packets, typically Ethernet packet capture tools. So whether or not PACKET_OTHERHOST is set, the stack still walks the layer-3 protocols and delivers the frame to each of them in turn; each layer-3 protocol decides for itself whether to process a PACKET_OTHERHOST packet.
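The classification described above can be sketched as a comparison of the destination MAC address against the NIC's own address. This is a simplified model, not the kernel function; pkt_type and classify_frame are hypothetical names, and the enum values are chosen to match the PACKET_* constants in <linux/if_packet.h> so the sketch builds without kernel headers.

```c
#include <stdint.h>
#include <string.h>

/* Same numeric values as PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST
 * and PACKET_OTHERHOST in <linux/if_packet.h>. */
enum pkt_type {
    PKT_HOST = 0,
    PKT_BROADCAST = 1,
    PKT_MULTICAST = 2,
    PKT_OTHERHOST = 3
};

/* Classify a received frame by its destination MAC address. */
enum pkt_type classify_frame(const uint8_t dst[6], const uint8_t dev_mac[6])
{
    static const uint8_t bcast[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

    if (dst[0] & 1)   /* group bit set: broadcast or multicast */
        return memcmp(dst, bcast, 6) == 0 ? PKT_BROADCAST : PKT_MULTICAST;
    if (memcmp(dst, dev_mac, 6) != 0)
        return PKT_OTHERHOST;   /* unicast, but for some other host */
    return PKT_HOST;
}
```

The frame is only labelled here; dropping it is left to the individual protocol handlers, which is what lets a packet capture tool still see PKT_OTHERHOST frames.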
3. Selecting the source IP address
When one machine accesses another, for example with a simple ping, the destination address is known, but how is the source address selected? This matters especially when the host has several NICs and each NIC is configured with several IP addresses. The answer: it is chosen from the result of the route lookup. Try the following experiment:
PC1 configuration:
eth0: 192.168.1.3
eth0:1: 1.2.3.4
eth0:2: 7.6.5.3
eth1: 172.16.1.3
eth1:1: 4.3.2.1
Default gateway: 192.168.1.1
PC2 configuration:
eth0: 192.168.1.4
eth0:1: 1.2.3.5
eth1: 7.6.5.4 (no network cable plugged in)
Default gateway: 192.168.1.1
With the configuration above, ping 119.75.217.56 (one of Baidu's addresses) from PC1 and run tcpdump -i eth0 host 119.75.217.56 on PC1 to watch the source address: it is 192.168.1.3. Without stopping the ping, add a host route on PC1: route add -host 119.75.217.56 gw 1.2.3.5. In the tcpdump output the source address changes to 1.2.3.4 immediately. This only proves that the route determines which source address is chosen on a single NIC. What if we want the source address to be 4.3.2.1? Could we configure an address in the 4.3.2.0 subnet on PC2? The address can be configured, but adding the route fails: eth1 of PC1 and eth0 of PC2 are not directly connected, so data cannot be sent out of eth0 with the source address 4.3.2.1.
Conversely, how can PC1 select 7.6.5.3 as the source IP address? That is easy: set the gateway for 119.75.217.56 to the address of eth1 on PC2, 7.6.5.4. Although no network cable is plugged into that NIC, this works as long as arp_ignore on eth0 of PC2 is 0 or 3 and 7.6.5.4 is not a host-scope address. Note that 7.6.5.4, the gateway address for 119.75.217.56, can be configured on any NIC, apart from some special cases on loopback. In principle, IP addresses on loopback are not routable; however, some addresses on loopback can still be made reachable by configuring routes manually, as shown in Lab 1.
So the source address is generally filled in after routing. Unless a source address has been bound in advance, it is chosen from the route lookup result: if the packet is sent via a gateway, the address in the same subnet as the gateway is used; for a directly connected route, the address in that route's subnet is used. Incidentally, if the via parameter is given when the route is added, that is, the packet goes through a gateway, ARP requests the gateway's MAC address when data is actually sent. If there is no via parameter and only a dev parameter, the route is a direct route, and ARP requests the MAC address of the destination address itself.
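The rule summarized above can be sketched as a small model: given the addresses configured on the egress NIC and the next hop (the gateway for a via route, the destination itself for a direct route), pick the address that shares a subnet with the next hop. The names if_addr, IP4 and pick_source are hypothetical, and the /24 prefix lengths used in the usage note are an assumption, since the experiment does not state the masks. This is not the kernel's selection code, which also falls back to other addresses when nothing matches.

```c
#include <stddef.h>
#include <stdint.h>

struct if_addr {
    uint32_t addr;     /* IPv4 address, host byte order */
    unsigned prefix;   /* prefix length, 0..32 */
};

#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

static uint32_t mask_of(unsigned prefix)
{
    return prefix == 0 ? 0 : 0xFFFFFFFFu << (32 - prefix);
}

/* Return the first configured address in the same subnet as next_hop,
 * or 0 if none matches (the sketch has no fallback). */
uint32_t pick_source(const struct if_addr *addrs, size_t n, uint32_t next_hop)
{
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t m = mask_of(addrs[i].prefix);

        if ((addrs[i].addr & m) == (next_hop & m))
            return addrs[i].addr;
    }
    return 0;
}
```

With eth0 of PC1 modelled as 192.168.1.3/24, 1.2.3.4/24 and 7.6.5.3/24, a next hop of 1.2.3.5 yields 1.2.3.4 and a next hop of 7.6.5.4 yields 7.6.5.3, matching the experiment.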
Lab 1:
Configuration on PC1:
eth0: 192.168.1.3
eth0:1: 4.3.2.1/24 (the mask must be shorter than 32 bits; otherwise a route to 4.3.2.2 has to be added)
Configuration on PC2:
eth0: 192.168.1.4
eth0:1: 4.3.2.2/32 (or lo:1: 4.3.2.2/32)
eth0 of PC1 and eth0 of PC2 are directly connected
Only ARP is considered here, because the first step of any communication is ARP resolution: once ARP has resolved the MAC address and routes are configured, communication will certainly work. Ping 4.3.2.2 on PC1. The ping fails because PC2 has no return route; however, the MAC address of eth0 on PC2 has been resolved, which can be checked with the arp command. In theory, adding a return route on PC2 makes the ping work. This return route has a restriction, though: its type must be unicast, which is the default. If you configure a route such as the following, the ping not only fails, PC1 does not even receive an ARP reply:
ip route del table local local 4.3.2.1/32 dev eth0
Note that the type of this route is local. The available types are [unicast | local | broadcast | multicast | throw | unreachable | prohibit | blackhole | nat]. Why does it fail? When the system looks up a route, it generally does not consider the source IP address. The source address is validated only after the route table lookup, not during it, to avoid confusion; after all, the purpose of routing is to find the way to the destination address. If the source IP address is reachable, everything proceeds normally; otherwise some special handling takes place. In addition, the route found by looking up the source IP address as if it were a destination must not be of type local. So if the route found for the source IP address is of type local, a problem arises: the route lookup in arp_process fails, no ARP reply is sent, and communication cannot proceed. Changing the type of the route above from local to unicast fixes this.
Does this check then break a machine's access to itself? No, because the Linux protocol stack handles that case specially. In ip_route_output_slow, if the route to the destination address is of type local, the dst_entry is set up with loopback_dev as the sending device. The xmit function of loopback_dev does not touch the skb's dst_entry at all, so when the packet reaches ip_rcv_finish, skb->dst is already set and the following route lookup is skipped entirely:
    if (skb->dst == NULL) {
        if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))
            goto drop;
    }
Where in the code is it enforced that this route must be a unicast route? ip_route_input_slow makes the following call:
    if (res.type == RTN_LOCAL) {
        int result;
        result = fib_validate_source(saddr, daddr, tos,
                                     loopback_dev.ifindex,
                                     dev, &spec_dst, &itag);
        if (result < 0)
            goto martian_source;
        if (result)
            flags |= RTCF_DIRECTSRC;
        spec_dst = daddr;
        goto local_input;
    }
Let's take a look at the specific logic of fib_validate_source:
int fib_validate_source(...)
{
    struct in_device *in_dev;
    struct flowi fl = ...; // use the source address as the destination address for the route lookup
    ...
    if (fib_lookup(&fl, &res))
        goto last_resort;
    if (res.type != RTN_UNICAST) // not a unicast route: local and broadcast routes are rejected here
        goto e_inval_res;
    ...
    if (FIB_RES_DEV(res) == dev) { // egress device matches the ingress device, which is the usual requirement
        // Is the source directly connected? A route added with a gateway
        // (the via parameter, or the gw parameter of the route command)
        // has the gateway as its "next hop", so the next-hop scope must be
        // link or global. A route without a gateway may have next-hop
        // scope host. nh_scope >= RT_SCOPE_HOST therefore means a direct
        // route without a gateway, or a local route.
        ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
        fib_res_put(&res);
        return ret;
    }
    ...
    // Ingress and egress devices differ, so force the egress device to be
    // the ingress device and look up the route table again. The first
    // lookup did not restrict the egress device; several routes may match,
    // but only the first one is returned. This lookup restricts the egress
    // device and may still find a usable route.
    fl.oif = dev->ifindex;
    ret = 0;
    if (fib_lookup(&fl, &res) == 0) {
        if (res.type == RTN_UNICAST) {
            *spec_dst = FIB_RES_PREFSRC(res);
            ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
        }
        fib_res_put(&res);
    }
    return ret;
    ...
}
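The decision made by the first lookup can be condensed into a stand-alone model. The names rev_route, route_type and validate_source_sketch are hypothetical; the scope constants carry the numeric values of RT_SCOPE_* in <linux/rtnetlink.h>. As a simplification, the sketch returns 0 when the devices differ, which corresponds to the case where the second, device-restricted lookup finds nothing.

```c
#include <errno.h>

/* Same numeric values as RT_SCOPE_* in <linux/rtnetlink.h>. */
enum {
    SCOPE_UNIVERSE = 0,
    SCOPE_SITE = 200,
    SCOPE_LINK = 253,
    SCOPE_HOST = 254,
    SCOPE_NOWHERE = 255
};

enum route_type { TYPE_UNICAST, TYPE_LOCAL, TYPE_BROADCAST }; /* hypothetical subset */

/* Result of looking up the packet's source address as a destination. */
struct rev_route {
    enum route_type type;
    int dev_ifindex;   /* egress device of the reverse route */
    int nh_scope;      /* scope of its next hop */
};

/* Returns -EINVAL for a source that must be rejected, 1 for a directly
 * connected source, 0 otherwise. */
int validate_source_sketch(const struct rev_route *r, int in_ifindex)
{
    if (r->type != TYPE_UNICAST)
        return -EINVAL;               /* local and broadcast routes fail */
    if (r->dev_ifindex == in_ifindex)
        return r->nh_scope >= SCOPE_HOST;
    return 0;                         /* simplification: no second lookup */
}
```

In Lab 1, a reverse route of type local for 4.3.2.1 lands in the first branch, which is why the packet is treated as coming from a martian source.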
4. Next hop in the route table
Every route table entry has a next hop. The next hop is where the packet is sent next. In general there are two possibilities: directly to the destination, or to a gateway. In Linux, an rtable is a routing cache entry. Its rt_gateway field holds the IP address that has to be resolved to a layer-2 address. Once the route has been found, the field is initialized to the destination address of the packet:
    rth->rt_gateway = fl.fl4_dst;
Later, if a gateway exists, the field is replaced with the IP address of the gateway; this happens in rt_set_nexthop:
    if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
        rt->rt_gateway = FIB_RES_GW(*res);
The FIB_RES_GW macro tests whether a gateway exists. In addition, the scope of the next hop must be link, that is, the gateway must be directly reachable on the link so that the subsequent ARP resolution can succeed.
rt_intern_hash calls arp_bind_neighbour, which contains the following code:
    if (n == NULL) {
        u32 nexthop = ((struct rtable *)dst)->rt_gateway; // fetch the next hop and prepare to resolve it
        ...
        n = __neigh_lookup_errno(&arp_tbl, &nexthop, dev);
        ...
        dst->neighbour = n;
    }
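Put together, the address that ARP ends up resolving can be sketched as a single function. neighbour_key is a hypothetical name, a gateway value of 0 is assumed to mean "no gateway", and the scope constants use the numeric values of RT_SCOPE_LINK and RT_SCOPE_HOST.

```c
#include <stdint.h>

enum { NH_SCOPE_LINK = 253, NH_SCOPE_HOST = 254 };

/* Return the IPv4 address to be resolved to a layer-2 address.
 * dst is the packet's destination; gw is the route's gateway,
 * with 0 meaning the route has no gateway (an assumption of the sketch). */
uint32_t neighbour_key(uint32_t dst, uint32_t gw, int nh_scope)
{
    uint32_t rt_gateway = dst;        /* initial value: the destination */

    if (gw != 0 && nh_scope == NH_SCOPE_LINK)
        rt_gateway = gw;              /* on-link gateway replaces it */
    return rt_gateway;
}
```

For a route via a gateway the function returns the gateway, so ARP asks for the gateway's MAC address; for a direct route it returns the destination itself.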
When a route is added, the kernel protocol stack code imposes the following restrictions (leaving aside complex cases such as multipath and NAT):
A. A route whose scope is larger than host cannot be added; the only larger scope is nowhere;
B. If the scope of the route is host, the scope of the next hop is nowhere in most cases;
C. If a gateway is configured, the scope of the next hop is link;
D. If no gateway is specified, the scope of the next hop is host.
Finally, consider the ICMP address mask request/reply. With rule D above it is easy to understand. An ICMP address mask reply is only sent within the local subnet, so the route to the requester certainly has no gateway. In fib_validate_source, whenever the scope of the "next hop" turns out to be host (local subnet) or nowhere (local machine), 1 is returned and the caller sets the RTCF_DIRECTSRC flag. When the host is about to send the address mask reply and finds that this flag is not set, it does not send the reply.
5. Role of arp_announce
It restricts the choice of source address when an ARP request is sent, and is the counterpart of arp_ignore. If it is 0, the source address given by the route lookup result is used unconditionally. If it is 1, an address in the same subnet as the request target is selected, and the device must match the routing result. In either case, if no address is found this way, one is still selected. The selection policy is: first, look on the egress device given by the route for an address in the same subnet as the request target; if there is none, look on the egress device for an address whose scope is smaller than link; if there is still none, traverse all NICs and select an address whose scope is not link and is numerically smaller than link, which excludes host-scope addresses. The code logic is as follows:
    struct net_device *dev = neigh->dev;
    u32 target = *(u32 *)neigh->primary_key; // address to resolve, from the route lookup result
    struct in_device *in_dev = in_dev_get(dev);
    switch (IN_DEV_ARP_ANNOUNCE(in_dev)) { // yields the value of arp_announce
    case 0:
        // Use the source IP address from the route result. If no source
        // address is configured, saddr stays 0. Routes are often configured
        // with only an egress device and no source address, for example
        // ip route add 1.2.3.4/32 dev eth2, so saddr is often 0.
        if (skb && inet_addr_type(skb->nh.iph->saddr) == RTN_LOCAL)
            saddr = skb->nh.iph->saddr;
        break;
    case 1:
        // Use a source IP address in the same subnet as the requested
        // target address. As above, saddr is 0 if none is found.
        saddr = skb->nh.iph->saddr;
        if (inet_addr_type(saddr) == RTN_LOCAL) {
            if (inet_addr_onlink(in_dev, target, saddr))
                break;
        }
        saddr = 0;
        break;
    }
    // No source IP address found: select one with scope link. dev is tried
    // first; if it has none, all NICs are traversed.
    if (!saddr)
        saddr = inet_select_addr(dev, target, RT_SCOPE_LINK);
arp_announce is far less complicated than arp_ignore.
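The two levels shown above can be modelled without kernel structures. arp_source is a hypothetical name; the flags saddr_is_local and saddr_on_link stand in for the results of inet_addr_type() and inet_addr_onlink(), and fallback stands in for the address inet_select_addr() would return. Levels other than 0 and 1 are not covered by the excerpt, so the sketch simply uses the fallback for them.

```c
#include <stdint.h>

/* Choose the source address of an ARP request.
 * pkt_saddr      source address of the packet that triggered ARP
 * saddr_is_local nonzero if pkt_saddr is a local address
 * saddr_on_link  nonzero if pkt_saddr shares a subnet with the target
 * fallback       assumed result of the final address selection */
uint32_t arp_source(int arp_announce, uint32_t pkt_saddr,
                    int saddr_is_local, int saddr_on_link,
                    uint32_t fallback)
{
    uint32_t saddr = 0;

    switch (arp_announce) {
    case 0:   /* any local source address is acceptable */
        if (saddr_is_local)
            saddr = pkt_saddr;
        break;
    case 1:   /* only a local address in the target's subnet */
        if (saddr_is_local && saddr_on_link)
            saddr = pkt_saddr;
        break;
    default:  /* not covered by the excerpt: use the fallback */
        break;
    }
    if (!saddr)
        saddr = fallback;
    return saddr;
}
```

With level 0 a packet sourced from 1.2.3.4 keeps that address in the ARP request even when the target is in another subnet; with level 1 the same packet falls back to the selected address.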
6. The ip route command
It is really a good thing.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
