Understanding Docker cross-host container networking


Before Docker 1.9, there were roughly three scenarios for container communication across multiple hosts:

1. Port mapping

Port P on host A is mapped to port P', on which container C listens inside its own network namespace; this only works for applications and services exposed at layer 4 (TCP/UDP) and above. Containers on other hosts can then communicate with container C by accessing port P on host A. Clearly, this approach has limited applicability.
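A minimal sketch of this approach (the image name, ports, and host address below are placeholders, not part of the original setup):

$ sudo docker run -d -p 8080:5000 --name c some-web-image    # on host A: host port 8080 -> container port 5000
$ curl http://<hostA-ip>:8080/                                # from a container or process on another host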

2. Bridging the physical NIC onto a virtual bridge, so that containers and hosts are configured on the same network segment

On each host, create a new virtual bridge device br0, bridge the physical NIC eth0 onto br0, and move eth0's IP address to br0. Then modify the Docker daemon's DOCKER_OPTS, setting -b=br0 (replacing docker0) and restricting container IP addresses to addresses within the same physical segment (--fixed-cidr). After the Docker daemon is restarted on each host, Docker containers in the same segment as the hosts can be reached across hosts. This scheme also has problems with limitations and scalability: the physical segment's address space has to be carved into small blocks and handed out to each host to prevent IP conflicts; subnet partitioning depends on the physical switch configuration; and the address space available to a host's Docker containers depends on how the physical network is partitioned.
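A rough sketch of what this looks like on one host (the interface name, addresses, and --fixed-cidr range are placeholders; the exact steps depend on your distribution and physical network):

$ sudo brctl addbr br0
$ sudo brctl addif br0 eth0
$ sudo ip addr del 192.168.10.11/24 dev eth0
$ sudo ip addr add 192.168.10.11/24 dev br0
$ sudo ip link set br0 up
# then, in /etc/default/docker on Ubuntu:
#   DOCKER_OPTS="-b=br0 --fixed-cidr=192.168.10.64/26"
$ sudo service docker restart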

3. Third-party SDN-based solutions, such as Open vSwitch (OVS) or CoreOS flannel.

For details on these third-party solutions, you can refer to O'Reilly's "Docker Cookbook".

With the 1.9 release, Docker brought a native multi-host container networking solution based on VXLAN overlay technology. Using it comes with a few prerequisites:

1. Linux kernel version >= 3.16;
2. An external key-value store is required (Consul is used in the official examples);
3. The Docker daemon on each physical host needs some specific startup parameters;
4. The physical hosts must allow some specific TCP/UDP ports through.
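A quick way to check the first and last prerequisites on each host (a sketch; the port list follows the Docker overlay requirements as I understand them for a Consul-backed setup: the KV store port, here 8500/tcp, plus 7946/tcp and 7946/udp for the daemons' gossip traffic and 4789/udp for the VXLAN data path; adjust to your own firewall tooling):

$ uname -r
$ sudo iptables -I INPUT -p tcp --dport 8500 -j ACCEPT
$ sudo iptables -I INPUT -p tcp --dport 7946 -j ACCEPT
$ sudo iptables -I INPUT -p udp --dport 7946 -j ACCEPT
$ sudo iptables -I INPUT -p udp --dport 4789 -j ACCEPT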

This article walks you through creating a cross-host container network with Docker 1.9.1 and analyzes how containers communicate over that network.

I. Setting up the experimental environment

1. Upgrading Linux Kernel

The lab environment uses Ubuntu 14.04 Server amd64, whose kernel version does not meet the requirement for a cross-host container network, so the kernel needs to be upgraded. Download the three 3.16.7 utopic kernel files from the Ubuntu kernel mainline site:

linux-headers-3.16.7-031607_3.16.7-031607.201410301735_all.deb
linux-image-3.16.7-031607-generic_3.16.7-031607.201410301735_amd64.deb
linux-headers-3.16.7-031607-generic_3.16.7-031607.201410301735_amd64.deb

Execute the following command locally to install:

sudo dpkg -i linux-headers-3.16.7-*.deb linux-image-3.16.7-*.deb

Note that the 3.16.7 kernel from kernel mainline has no linux-image-extra package and therefore no AUFS driver, so the Docker daemon cannot use its default storage driver (--storage-driver=aufs); we need to switch the storage driver to devicemapper.

A kernel upgrade is a risky operation, and whether it succeeds is partly a matter of luck: of my two blade servers, one upgraded successfully and the other failed (it kept reporting NIC problems).

2. Upgrade Docker to version 1.9.1

Downloading Docker's official installation package from within China is slow, so use the method provided by daocloud.io to quickly install the latest version of Docker:

$ curl -sSL https://get.daocloud.io/docker | sh

3. Topology

The multi-host container network in this article is built on two physical machines in different subnets; physical machines (rather than VMs) are used to simplify the later analysis of how the network communication works.

The topology diagram is as follows:

II. Building the multi-host container network

1. Create Consul Service

Since the KV store itself is not the focus of this article and serves only as a prerequisite for bootstrapping the cross-host container network, a "cluster" with just one server node is used.

Referring to the topology, we start a Consul instance on 10.10.126.101. For details on Consul clusters, service registration, service discovery, and so on, see one of my previous articles:

$ ./consul agent -server -bootstrap-expect 1 -data-dir ./data -node=master -bind=10.10.126.101 -client=0.0.0.0 &
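To verify that the agent came up, you can optionally query cluster membership (assuming the same consul binary and a locally reachable agent):

$ ./consul members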

2. Modify the Docker daemon's DOCKER_OPTS parameters

As mentioned earlier, creating a cross-host container network with Docker 1.9 requires reconfiguring the startup parameters of the Docker daemon on each host node:

On Ubuntu this configuration lives in /etc/default/docker:

DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4  -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth0:2375 --cluster-store consul://10.10.126.101:8500/network --storage-driver=devicemapper"

A few more words about these options:

-H (or --host) configures the medium over which Docker clients (both local and remote) talk to the Docker daemon; it is the service endpoint of the Docker remote API. The default is the unix socket /var/run/docker.sock (local access only), but the daemon can also listen over TCP so that remote clients can reach it, as configured above. Unencrypted traffic uses port 2375, while TLS-encrypted connections use port 2376; both ports are registered with IANA as well-known ports. -H can be given multiple times, as above: the unix socket serves local Docker clients, and the TCP port serves remote clients. For example, docker pull ubuntu goes through docker.sock, while docker -H 10.10.126.101:2375 pull ubuntu goes through the TCP socket.

--cluster-advertise configures the address that this Docker daemon instance advertises within the cluster;
--cluster-store configures the access address of the cluster's distributed KV store.

If you have previously modified iptables rules by hand, it is recommended to clean them up before restarting the Docker daemon, e.g. sudo iptables -t nat -F, sudo iptables -t filter -F, and so on.
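Concretely, that cleanup plus the daemon restart might look like this (a sketch; note that -F flushes every rule in the given table, including any custom rules you added):

$ sudo iptables -t nat -F
$ sudo iptables -t filter -F
$ sudo service docker restart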

3. Start the Docker Daemon on each node

Take 10.10.126.101 as an example:

$ sudo service docker start
$ ps -ef | grep docker
root      2069     1  0 Feb02 ?        00:01:41 /usr/bin/docker -d --dns 8.8.8.8 --dns 8.8.4.4 --storage-driver=devicemapper -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth0:2375 --cluster-store consul://10.10.126.101:8500/network

After startup, the iptables NAT and filter rules are no longer the same as in the initial single-host Docker network case.

The initial network driver types on node 101:

$ docker network ls
NETWORK ID          NAME                DRIVER
47e57d6fdfe8        bridge              bridge
7c5715710e34        none                null
19cc2d0d76f7        host                host

4. Create the overlay networks net1 and net2

On node 101, create net1:

$ sudo docker network create -d overlay net1

On node 71, create net2:

$ sudo docker network create -d overlay net2

After that, listing the networks and their driver types on both node 71 and node 101 gives the following result:

$ docker network ls
NETWORK ID          NAME                DRIVER
283b96845cbe        net2                overlay
da3d1b5fcb8e        net1                overlay
00733ecf5065        bridge              bridge
71f3634bf562        none                null
7ff8b1007c09        host                host

At this point, the iptables rules have not changed yet.
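If you want to see which subnet Docker allocated to an overlay network, you can optionally inspect it (output omitted here; the values vary between environments):

$ docker network inspect net1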

5. Start containers under the two overlay networks

We launch containers under net1 and net2 respectively, one net1 container and one net2 container on each node:

101:
sudo docker run -itd --name net1c1 --net net1 ubuntu:14.04
sudo docker run -itd --name net2c1 --net net2 ubuntu:14.04

71:
sudo docker run -itd --name net1c2 --net net1 ubuntu:14.04
sudo docker run -itd --name net2c2 --net net2 ubuntu:14.04

After they start, we have the following network information (the containers' IP addresses may differ from the earlier topology diagram; the addresses can change each time a container starts):

net1:
    net1c1 - 10.0.0.7
    net1c2 - 10.0.0.5
net2:
    net2c1 - 10.0.0.4
    net2c2 - 10.0.0.6
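One way to read these addresses from the host rather than from inside each container is docker inspect with a Go template (a sketch against the Docker 1.9-era inspect format; adjust the container name as needed):

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' net1c1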

6. Container Connectivity

From within net1c1, let's check connectivity to containers in net1 and in net2:

root@021f14bf3924:/# ping net1c2
PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.670 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.387 ms
^C
--- 10.0.0.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.387/0.528/0.670/0.143 ms
root@021f14bf3924:/# ping 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 56(84) bytes of data.
^C
--- 10.0.0.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1008ms

As you can see, containers within net1 can reach each other, while net1 and net2, as two separate overlay networks, are isolated from one another.

III. How communication works in the multi-host container network

In the earlier article on single-host container networking, we saw that container-to-container communication and container-to-external communication are accomplished by the docker0 bridge in combination with iptables. So how is container communication implemented in the multi-host network we have just built? Let's take a look. Note: assuming familiarity with single-host container networking, many of the basic network details are not repeated here.

Let's first look at the network configuration of a container under net1, taking net1c1 on node 101 as an example:

$ sudo docker attach net1c1
root@021f14bf3924:/# ip route
default via 172.19.0.1 dev eth1
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.4
172.19.0.0/16 dev eth1  proto kernel  scope link  src 172.19.0.2
root@021f14bf3924:/# ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
8: eth0: mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:04 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.4/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:aff:fe00:4/64 scope link
       valid_lft forever preferred_lft forever
10: eth1: mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:13:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.2/16 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe13:2/64 scope link
       valid_lft forever preferred_lft forever

As you can see, net1c1 has two network interfaces: eth0 (10.0.0.4) and eth1 (172.19.0.2). According to the routing table, destinations within 172.19.0.0/16 go out via eth1, and destinations within 10.0.0.0/24 go out via eth0.

Let's step out of the container and back into the host's network view:

On 101:

$ ip a
...
5: docker_gwbridge: mtu 1500 qdisc noqueue state UP
    link/ether 02:42:52:35:c9:fc brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:52ff:fe35:c9fc/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:4b:70:68:9a brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
11: veth26f6db4: mtu 1500 qdisc noqueue master docker_gwbridge state UP
    link/ether b2:32:d7:65:dc:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b032:d7ff:fe65:dcb2/64 scope link
       valid_lft forever preferred_lft forever
16: veth54881a0: mtu 1500 qdisc noqueue master docker_gwbridge state UP
    link/ether 9e:45:fa:5f:a0:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9c45:faff:fe5f:a015/64 scope link
       valid_lft forever preferred_lft forever

We see that in addition to the familiar docker0 bridge, there is an extra bridge, docker_gwbridge:

$ brctl show
bridge name        bridge id            STP enabled    interfaces
docker0            8000.02424b70689a    no
docker_gwbridge    8000.02425235c9fc    no             veth26f6db4
                                                       veth54881a0

The brctl output also shows that the two veth interfaces are bridged onto docker_gwbridge, not docker0; docker0 is not used by the cross-host container network at all. docker_gwbridge replaces docker0 here: for the containers in the net1 or net2 networks on node 101, it handles communication with the outside world, the role docker0 plays in a single-host container network.

However, the communication between net1c1 and net1c2, two containers on different hosts but both belonging to net1, is clearly not done by docker_gwbridge. According to net1c1's routing table, when net1c1 pings net1c2, the packets go out via eth0, i.e. with the IP 10.0.0.4. Yet from the host's point of view, there seems to be no network device attached to net1c1's eth0, so how is the communication accomplished?

It all starts with the creation of the network. When we executed docker network create -d overlay net1 earlier to create the net1 overlay network, that command also created a new network namespace.

We know that every container has its own network namespace. Looking at it from inside a container, we see devices such as lo and eth0; this eth0 is one end of a veth pair whose other end (vethXXX) lives in the host's network namespace. The overlay network likewise has its own network namespace, and there is a similar correspondence between the devices in the overlay network's namespace and those in the containers' namespaces.

Let's first look at the IDs of these network namespaces. In order to manage them with the iproute2 tooling, we need to do the following:

$ cd /var/run
$ sudo ln -s /var/run/docker/netns netns

This is because iproute2 can only operate on network namespaces under /var/run/netns, while Docker places its namespaces under /var/run/docker/netns by default. Once the command above succeeds, we can view and manage the namespaces through the ip command:

$ sudo ip netns
29170076ddf6
1-283b96845c
5ae976d9dc6a
1-da3d1b5fcb

We see that there are four established network namespaces on host 101. A reasonable guess is that these are the two containers' namespaces and the two overlay networks' namespaces. Compare the namespace ID format with the network IDs in the output of the following command:

$ docker network ls
NETWORK ID          NAME                DRIVER
283b96845cbe        net2                overlay
da3d1b5fcb8e        net1                overlay
dd84da8e80bf        host                host
3295c22b22b8        docker_gwbridge     bridge
b96e2d8d4068        bridge              bridge
23749ee4292f        none                null

We can roughly guess:

1-da3d1b5fcb is net1's network namespace;
1-283b96845c is net2's network namespace;
29170076ddf6 and 5ae976d9dc6a belong to the two containers' network namespaces.

Since we are taking net1 as the example, let's analyze net1's network namespace, 1-da3d1b5fcb. With the ip command we get the following:

$ sudo ip netns exec 1-da3d1b5fcb ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: br0: mtu 1450 qdisc noqueue state UP
    link/ether 06:b0:c6:93:25:f3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::b80a:bfff:fecc:a1e0/64 scope link
       valid_lft forever preferred_lft forever
7: vxlan1: mtu 1450 qdisc noqueue master br0 state UNKNOWN
    link/ether ea:0c:e0:bc:19:c5 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e80c:e0ff:febc:19c5/64 scope link
       valid_lft forever preferred_lft forever
9: veth2: mtu 1450 qdisc noqueue master br0 state UP
    link/ether 06:b0:c6:93:25:f3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4b0:c6ff:fe93:25f3/64 scope link
       valid_lft forever preferred_lft forever

$ sudo ip netns exec 1-da3d1b5fcb ip route
10.0.0.0/24 dev br0  proto kernel  scope link  src 10.0.0.1

$ sudo ip netns exec 1-da3d1b5fcb brctl show
bridge name    bridge id            STP enabled    interfaces
br0            8000.06b0c69325f3    no             veth2
                                                   vxlan1
Seeing br0 and veth2, we finally have something to go on. We suspect that eth0 in the net1c1 container and veth2 here are a veth pair, with veth2 bridged onto br0. This can be confirmed by using ethtool to find the peer interface indexes of the veth endpoints:

$ sudo docker attach net1c1
root@021f14bf3924:/# ethtool -S eth0
NIC statistics:
     peer_ifindex: 9

On the 101 host:

$ sudo ip netns exec 1-da3d1b5fcb ip -d link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br0: mtu 1450 qdisc noqueue state UP
    link/ether 06:b0:c6:93:25:f3 brd ff:ff:ff:ff:ff:ff
    bridge
7: vxlan1: mtu 1450 qdisc noqueue master br0 state UNKNOWN
    link/ether ea:0c:e0:bc:19:c5 brd ff:ff:ff:ff:ff:ff
    vxlan
9: veth2: mtu 1450 qdisc noqueue master br0 state UP
    link/ether 06:b0:c6:93:25:f3 brd ff:ff:ff:ff:ff:ff
    veth
You can see that the peer index of net1c1's eth0 is 9, which is exactly the index of veth2 in the network namespace 1-da3d1b5fcb.

What about vxlan1? Note that vxlan1 is not a veth device; in the ip -d link output its device type is vxlan. As said earlier, Docker's cross-host container network is based on VXLAN, and the vxlan1 here is the VTEP (VXLAN Tunnel End Point) of the net1 overlay network. It is the edge device of the VXLAN network: VXLAN-related processing happens on the VTEP, such as identifying which VXLAN an Ethernet frame belongs to, layer-2 forwarding of frames based on VXLAN, and encapsulating/decapsulating packets.
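To look at the VTEP's VXLAN attributes (VNI, UDP port, and so on) directly, you can ask ip for details on just that device; output omitted here:

$ sudo ip netns exec 1-da3d1b5fcb ip -d link show vxlan1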

At this point, we can roughly draw a schematic diagram of a multi-host network:

If we ping net1c2 from net1c1, what path do the packets take?

1. When net1c1 (10.0.0.4) pings net1c2 (10.0.0.5), net1c1's routing table says the destination can be reached over a directly connected network. An ARP request obtains net1c2's MAC address (ARP over VXLAN is not detailed here), and once the MAC address is known the packet is sent out through eth0;
2. eth0's peer, veth2, is bridged onto br0 in the network namespace 1-da3d1b5fcb. br0 is a virtual bridge (switch) device; it takes the packet coming from eth0 and forwards it to the vxlan1 device. A clue to this can be seen in the namespace's ARP table:

$ sudo ip netns exec 1-da3d1b5fcb arp -a
? (10.0.0.5) at 02:42:0a:00:00:05 [ether] PERM on vxlan1

3. vxlan1 is a special device. When it receives the packet, the device handler registered when the vxlan device was created processes it: it performs the VXLAN encapsulation (consulting the net1 information stored in Consul in the process), wraps the entire ICMP packet as the payload of a UDP datagram, and sends that UDP datagram out through the host's eth0.

4. Host 71 receives the UDP datagram, recognizes it as a VXLAN packet, and uses the information in the VXLAN header (such as the VXLAN Network Identifier, VNI=256) to find the corresponding vxlan device and hand the packet over to it. The vxlan device handler decapsulates the packet, extracts the UDP payload, and delivers it through br0 to the veth port; net1c2 receives the ICMP packet on its eth0 and sends back an ICMP reply.

We can capture the VXLAN packets with Wireshark. Recent versions of Wireshark have a built-in VXLAN protocol dissector and can identify and display VXLAN packets directly; version 2.0.1 is used here (note: some older Wireshark versions, such as 1.6.7, do not support VXLAN dissection):
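If you prefer to capture on the host first and open the trace in Wireshark afterwards, a simple sketch is to grab the overlay's UDP traffic on the physical interface (4789/udp is the IANA-assigned VXLAN port that Docker's overlay driver uses by default; adjust the interface name to your host, and double-check the port if no packets show up):

$ sudo tcpdump -i eth0 -w vxlan.pcap udp port 4789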

The details of the VXLAN protocol are too involved to cover here; perhaps they can be explored in a follow-up article.

© Bigwhite. All rights reserved.
