My research on container networking was interrupted for a while due to a job change in 2016. Now that my current project digs deep into Kubernetes, I feel that my earlier, shallow understanding of container networking is no longer enough; the container network has become a hurdle sitting right in front of me. It is high time to keep digging into the Kubernetes network and the container network. This article is a starting point, and also a supplement to my earlier superficial understanding.
I will still start from the Docker container network, even though Docker and Kubernetes use different network models: Kubernetes uses the Container Network Interface (CNI) model, while Docker uses the Container Network Model (CNM). To understand the Docker container network, understanding the Linux network namespace is essential. In this article we will try to understand the concepts of the Linux network namespace and the related Linux kernel networking devices, and manually simulate parts of the Docker container network model, including container-to-host connectivity on a single host, container-to-container connectivity, and port mapping.
I. Docker's CNM network model
Docker implements the CNM model through libnetwork. Put simply, the libnetwork design doc describes the CNM model as follows:
The CNM model has three components:
- Sandbox: each sandbox holds the configuration of a container's network stack, including the container's network interfaces, routing table, and DNS settings.
- Endpoint: an endpoint is what joins a sandbox to a network.
- Network: a group of endpoints that can communicate with one another directly.
Looking only at these definitions, it is hard to connect them to real Docker containers; an abstract model without concrete counterparts always feels a bit up in the air. The design doc also gives a reference implementation of the CNM model on Linux: a sandbox can be implemented as a Linux network namespace, an endpoint as one end of a veth pair, and a network as a Linux bridge or a VXLAN.
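To make the mapping concrete right away, here is a minimal preview sketch (sandbox1, ep1/ep1p, and mynet0 are names made up just for this preview); section III below performs the same steps for real:

# Sandbox  -> a Linux network namespace
$ sudo ip netns add sandbox1
# Endpoint -> a veth pair (one end will live in the sandbox, the other on the network)
$ sudo ip link add ep1 type veth peer name ep1p
# Network  -> a Linux bridge that endpoints plug into
$ sudo brctl addbr mynet0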
These reference implementations, by contrast, feel much more down-to-earth. Before studying the model, we already knew from using Docker that container network isolation is implemented with Linux network namespaces, and that when Docker is in use a docker0 Linux bridge appears on the physical host or virtual machine; brctl show reveals many veth devices "plugged into" docker0:
# ip link show
... ...
3: docker0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:30:11:98:ef brd ff:ff:ff:ff:ff:ff
19: veth4559467@if18: mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether a6:14:99:52:78:35 brd ff:ff:ff:ff:ff:ff link-netnsid 3
... ...

$ brctl show
bridge name     bridge id               STP enabled     interfaces
... ...
docker0         8000.0242301198ef       no              veth4559467
The model and reality are finally connected! Below we take a closer look at these concepts.
II. Linux bridge, veth, and network namespace
A Linux bridge is one of the virtual network devices provided by Linux; it behaves much like a physical network switch. A Linux bridge can work at layer 2 or at layer 3, and by default it works at layer 2, forwarding Ethernet frames between the devices attached to it on the same network. Once you assign an IP address to a Linux bridge, layer-3 processing is enabled for that bridge as well. Under Linux you can manage a bridge with the iproute2 toolkit or the brctl command.
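For example, a bridge can be managed with iproute2 alone; br0, eth1, and 192.168.100.1/24 below are illustrative names and addresses, not part of the later demo:

$ sudo ip link add name br0 type bridge          # create a bridge
$ sudo ip link set br0 up
$ sudo ip addr add 192.168.100.1/24 dev br0      # assigning an IP lets the bridge take part in layer 3
$ sudo ip link set eth1 master br0               # attach an interface ("plug it into" the bridge)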
veth (virtual Ethernet) is another special network device provided by Linux, often described in Chinese as a virtual NIC pair. veth devices always come in pairs: creating one creates a pair. The two endpoints of a veth pair behave like the two ends of a network cable: data that enters one end inevitably comes out of the other. Each veth can be assigned an IP address and take part in layer-3 routing.
For more details on how Linux bridges and veth devices work, refer to the article "Basic network equipment in Linux" on IBM developerWorks.
A network namespace lets you create isolated network views on Linux. Each network namespace has its own network configuration, such as network devices and routing tables. A newly created network namespace is isolated from the host's default network namespace; the default network namespace is the one we normally work in.
Concepts are always abstract. Next, using an example of a Docker container network, we will see what roles these Linux networking concepts and devices play and how they work.
III. Simulating the Docker container network with network namespaces
To better understand the roles that network namespaces, bridges, and veth devices play in a Docker container network, let's do a demo: we will use network namespaces to simulate the Docker container network ourselves. The Docker container network is in fact also built on top of Linux network namespaces; we are simply breaking the "automatic" creation process down into individual steps, which makes it easier to understand.
1. Environment
We run this demo on a physical machine. The machine runs Ubuntu 16.04.1 with kernel 4.4.0-57-generic. The Docker version:
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64
In addition, the iproute2 and brctl tools are installed in the environment.
2. Topology
Let's simulate a bridged container network with two containers. The corresponding manually built topology is as follows (since it all runs on the same host, the simulated version uses the 172.16.0.0/16 subnet):
3. Steps to create
a) Create the Container_ns1 and Container_ns2 network namespaces
By default, all we see on the host is the view of the default network namespace. To simulate the container network, we create two new network namespaces:
$ sudo ip netns add Container_ns1
$ sudo ip netns add Container_ns2

$ sudo ip netns list
Container_ns2
Container_ns1
The newly created namespaces can also be seen under the /var/run/netns path:
$ sudo ls /var/run/netns
Container_ns1  Container_ns2
Let's explore the newly created namespaces (ip netns exec runs a program inside a specific namespace; this exec command is crucial and will play an even bigger role later):
$ sudo ip netns exec Container_ns1 ip a
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

$ sudo ip netns exec Container_ns2 ip a
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

$ sudo ip netns exec Container_ns2 ip route
As you can see, each new namespace has only a loopback device, and its routing table is empty.
b) Create MyDocker0 Bridge
We create the MyDocker0 Linux bridge in the default network namespace:
$ sudo brctl addbr MyDocker0

$ brctl show
bridge name     bridge id               STP enabled     interfaces
MyDocker0               8000.000000000000       no
Assign an IP address to MyDocker0 and bring the device up; this enables layer-3 processing and prepares it for its later role as the gateway:
$ sudo ip addr add 172.16.1.254/16 dev MyDocker0
$ sudo ip link set dev MyDocker0 up
Once it is up, we find that a route has been added to the routing table of the default network namespace:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.11.36.1      0.0.0.0         UG    100    0        0 eno1
... ...
172.16.0.0      0.0.0.0         255.255.0.0     U     0      0        0 MyDocker0
... ...
c) Create veth pairs and connect the two network namespaces
So far there has been no connection between the default namespace and Container_ns1 or Container_ns2. Now it is time to witness the miracle: we connect the namespaces with veth pairs.
First, create the veth pair that will connect the default namespace to Container_ns1: veth1 and veth1p:
$ sudo ip link add veth1 type veth peer name veth1p

$ sudo ip -d link show
... ...
21: veth1p@veth1: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:6d:e7:75:3f:43 brd ff:ff:ff:ff:ff:ff promiscuity 0
    veth addrgenmode eui64
22: veth1@veth1p: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:cd:bb:f2:10:3f brd ff:ff:ff:ff:ff:ff promiscuity 0
    veth addrgenmode eui64
... ...
"Plug veth1" into "MyDocker0" on this bridge:
$ sudo brctl addif MyDocker0 veth1
$ sudo ip link set veth1 up

$ brctl show
bridge name     bridge id               STP enabled     interfaces
MyDocker0               8000.56cdbbf2103f       no              veth1
Move veth1p "into" Container_ns1:
$ sudo ip link set veth1p netns Container_ns1

$ sudo ip netns exec Container_ns1 ip a
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
21: veth1p@if22: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:6d:e7:75:3f:43 brd ff:ff:ff:ff:ff:ff link-netnsid 0
At this point, veth1p is no longer visible in the default namespace. According to the topology above, the veth endpoint inside Container_ns1 should be renamed to eth0:
$ sudo ip netns exec Container_ns1 ip link set veth1p name eth0

$ sudo ip netns exec Container_ns1 ip a
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
21: eth0@if22: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:6d:e7:75:3f:43 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Bring eth0 up in Container_ns1 and configure its IP address:
$ sudo ip netns exec Container_ns1 ip link set eth0 up
$ sudo ip netns exec Container_ns1 ip addr add 172.16.1.1/16 dev eth0
Once the IP address is assigned, a directly connected route is generated automatically:
$ sudo ip netns exec Container_ns1 ip route
172.16.0.0/16 dev eth0  proto kernel  scope link  src 172.16.1.1
Now we can ping MyDocker0 from Container_ns1, but because there are no other routes (including a default route), pinging other addresses still fails (for example, docker0's address 172.17.0.1):
$ sudo ip netns exec Container_ns1 ping -c 3 172.16.1.254
PING 172.16.1.254 (172.16.1.254) 56(84) bytes of data.
64 bytes from 172.16.1.254: icmp_seq=1 ttl=64 time=0.074 ms
64 bytes from 172.16.1.254: icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from 172.16.1.254: icmp_seq=3 ttl=64 time=0.068 ms

--- 172.16.1.254 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.064/0.068/0.074/0.010 ms

$ sudo ip netns exec Container_ns1 ping -c 3 172.17.0.1
connect: Network is unreachable
Let's add a default route to Container_ns1 so that we can ping the addresses of other network devices and other namespaces on the physical host:
$ sudo ip netns exec Container_ns1 ip route add default via 172.16.1.254

$ sudo ip netns exec Container_ns1 ip route
default via 172.16.1.254 dev eth0
172.16.0.0/16 dev eth0  proto kernel  scope link  src 172.16.1.1

$ sudo ip netns exec Container_ns1 ping -c 3 172.17.0.1
PING 172.17.0.1 (172.17.0.1) 56(84) bytes of data.
64 bytes from 172.17.0.1: icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from 172.17.0.1: icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from 172.17.0.1: icmp_seq=3 ttl=64 time=0.069 ms

--- 172.17.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.068/0.071/0.076/0.003 ms
At this point, however, if you try to ping an address outside the physical host from Container_ns1 (google.com, for example), it still fails. Why? Because the source address of the outgoing ICMP packets is not SNATed (Docker does this by installing iptables rules), so packets that leave with 172.16.1.1 as their source address can go out but never find their way back. ^0^
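For completeness, a minimal sketch of the kind of setup Docker itself performs (the exact rules Docker generates are more involved): enabling IP forwarding and adding a MASQUERADE rule for our simulated 172.16.0.0/16 subnet in the default namespace would let those packets find their way back:

$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo iptables -t nat -A POSTROUTING -s 172.16.0.0/16 ! -o MyDocker0 -j MASQUERADE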
Next, following the same steps, we create the veth pair that connects the default namespace to Container_ns2: veth2 and veth2p. Since the steps are identical, only the key commands are listed here:
$ sudo ip link add veth2 type veth peer name veth2p
$ sudo brctl addif MyDocker0 veth2
$ sudo ip link set veth2 up
$ sudo ip link set veth2p netns Container_ns2

$ sudo ip netns exec Container_ns2 ip link set veth2p name eth0
$ sudo ip netns exec Container_ns2 ip link set eth0 up
$ sudo ip netns exec Container_ns2 ip addr add 172.16.1.2/16 dev eth0
$ sudo ip netns exec Container_ns2 ip route add default via 172.16.1.254
At this point, the simulated topology is complete! The two namespaces are connected to each other and to the default namespace!
$ sudo ip netns exec Container_ns2 ping -c 3 172.16.1.1
PING 172.16.1.1 (172.16.1.1) 56(84) bytes of data.
64 bytes from 172.16.1.1: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 172.16.1.1: icmp_seq=2 ttl=64 time=0.083 ms
64 bytes from 172.16.1.1: icmp_seq=3 ttl=64 time=0.087 ms

--- 172.16.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.083/0.090/0.101/0.010 ms

$ sudo ip netns exec Container_ns1 ping -c 3 172.16.1.2
PING 172.16.1.2 (172.16.1.2) 56(84) bytes of data.
64 bytes from 172.16.1.2: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 172.16.1.2: icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 172.16.1.2: icmp_seq=3 ttl=64 time=0.089 ms

--- 172.16.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.053/0.078/0.092/0.017 ms
Of course, the connectivity between the two namespaces relies on the directly connected network; essentially, MyDocker0 is doing its work at layer 2. Take pinging Container_ns2's eth0 address from Container_ns1 as an example.
Container_ns1's routing table at this point:
$ sudo ip netns exec Container_ns1 ip route
default via 172.16.1.254 dev eth0
172.16.0.0/16 dev eth0  proto kernel  scope link  src 172.16.1.1
When ping 172.16.1.2 runs, the routing table is consulted and the directly connected route (the second entry) matches first, which means the packet can be delivered directly without gateway forwarding. After an ARP lookup (either from the ARP cache or via MyDocker0 acting as a layer-2 switch), the MAC address of 172.16.1.2 is obtained. The IP packet's destination IP is 172.16.1.2, the layer-2 frame's destination MAC is the address just resolved, and the frame is sent out through eth0 (172.16.1.1). eth0 is in fact one end of a veth pair whose other end is "plugged into" the MyDocker0 switch, so this is a standard layer-2 frame-switching process: MyDocker0 receives the Ethernet frame on one port and sends it out another port. The same holds for the ping replies on the way back.
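If you want to see this layer-2 behavior for yourself (an extra check, not part of the build steps above), you can inspect Container_ns1's neighbor (ARP) cache and the MAC addresses MyDocker0 has learned on its ports right after a ping:

$ sudo ip netns exec Container_ns1 ip neigh show      # 172.16.1.2 should appear here with Container_ns2's MAC
$ sudo brctl showmacs MyDocker0                       # MAC addresses learned on the bridge ports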
What if you ping the address of a real Docker container from Container_ns1, say 172.17.0.2? When the ping runs, no directly connected route in Container_ns1's routing table matches, so the packet can only be sent to the gateway 172.16.1.254 via the default route. Although MyDocker0 still receives the data, this time it is more like "the data is handed directly to the bridge device rather than being switched out of one of its ports" (this differs slightly from the understanding in my earlier article). The destination MAC of the layer-2 frame is the gateway 172.16.1.254's own MAC address (the bridge's MAC address), so here MyDocker0 acts more like an ordinary NIC working at layer 3. MyDocker0 receives the packet, finds that the destination IP is not its own, looks up the host routing table, finds the directly connected route, and forwards the packet to docker0 (the destination MAC of the re-encapsulated layer-2 frame is docker0's MAC address). docker0 plays the same "NIC" role here: since the destination IP is still not docker0 itself, docker0 continues the forwarding process. This can be confirmed with traceroute:
$ sudo ip netns exec Container_ns1 traceroute 172.17.0.2
traceroute to 172.17.0.2 (172.17.0.2), 30 hops max, 60 byte packets
 1  172.16.1.254 (172.16.1.254)  0.082 ms  0.023 ms  0.019 ms
 2  172.17.0.2 (172.17.0.2)  0.054 ms  0.034 ms  0.029 ms

$ sudo ip netns exec Container_ns1 ping -c 3 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=63 time=0.084 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=63 time=0.101 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=63 time=0.098 ms

--- 172.17.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.084/0.094/0.101/0.010 ms
By now you should have a general idea of what the Docker engine does behind the scenes when it creates a single-host container network (of course, this is only a rough simulation; Docker actually does much more than that).
IV. Simulating container port mapping based on the userland proxy
Port mapping lets a service running inside a container be reached from outside the host. For example, an nginx running in a container can serve HTTP externally through the host's port 9091:
$ sudo docker run -d -p 9091:80 nginx:latest
8eef60e3d7b48140c20b11424ee8931be25bc47b5233aa42550efabd5730ac2f

$ curl 10.11.36.15:9091

Welcome to nginx!
If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.

For online documentation and support please refer to nginx.org.
Commercial support is available at nginx.com.

Thank you for using nginx.
Container port mapping is in fact implemented by the Docker engine's docker-proxy. By default (at least as of Docker 1.12.1), the Docker engine uses the userland proxy (--userland-proxy=true) and launches one proxy instance per exposed container port to forward traffic:
$ ps -ef | grep docker-proxy
root     26246  6228  0 16:18 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9091 -container-ip 172.17.0.2 -container-port 80
docker-proxy simply forwards traffic between the default namespace and the container's namespace. We can simulate this process completely ourselves.
We create a fileserver demo:
// testfileserver.go
package main

import "net/http"

func main() {
	http.ListenAndServe(":8080", http.FileServer(http.Dir(".")))
}
We start this file server inside Container_ns1:
$ sudo ip netns exec Container_ns1 ./testfileserver

$ sudo ip netns exec Container_ns1 lsof -i tcp:8080
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
testfiles 3605 root    3u  IPv4 297022      0t0  TCP *:http-alt (LISTEN)
As you can see, inside Container_ns1 port 8080 is now being listened on by testfileserver, while in the default namespace port 8080 is still available.
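To double-check this (an extra verification, not part of the original steps), run lsof in the default namespace; no output means nothing is listening on 8080 there:

$ sudo lsof -i tcp:8080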
Next, we create a simple proxy under the default NS:
// proxy.go
package main

import (
	"flag"
	"fmt"
	"io"
	"log"
	"net"
)

var (
	host          string
	port          string
	container     string
	containerport string
)

func main() {
	flag.StringVar(&host, "host", "0.0.0.0", "host addr")
	flag.StringVar(&port, "port", "", "host port")
	flag.StringVar(&container, "container", "", "container addr")
	flag.StringVar(&containerport, "containerport", "8080", "container port")
	flag.Parse()

	fmt.Printf("%s\n%s\n%s\n%s", host, port, container, containerport)

	ln, err := net.Listen("tcp", host+":"+port)
	if err != nil {
		// handle error
		log.Println("listen error:", err)
		return
	}
	log.Println("listen ok")

	for {
		conn, err := ln.Accept()
		if err != nil {
			// handle error
			log.Println("accept error:", err)
			continue
		}
		log.Println("accept conn", conn)
		go handleConnection(conn)
	}
}

func handleConnection(conn net.Conn) {
	cli, err := net.Dial("tcp", container+":"+containerport)
	if err != nil {
		log.Println("dial error:", err)
		return
	}
	log.Println("dial", container+":"+containerport, "ok")

	go io.Copy(conn, cli)
	_, err = io.Copy(cli, conn)
	fmt.Println("communication over: error:", err)
}
Run it in the default namespace:
$ ./proxy -host 0.0.0.0 -port 9090 -container 172.16.1.1 -containerport 8080
0.0.0.0
9090
172.16.1.1
8080
2017/01/11 17:26:10 listen ok
Then we make an HTTP request to the host's port 9090:
$ curl 10.11.36.15:9090
proxy
proxy.go
testfileserver
testfileserver.go
Successfully obtained file list!
The proxy's output log:
2017/01/11 17:26:16 accept conn &{{0xc4200560e0}}
2017/01/11 17:26:16 dial 172.16.1.1:8080 ok
communication over: error:
V. Iptables-based port mapping
Because every port-mapped container starts at least one docker-proxy alongside it, once the number of running containers grows, the resources consumed by docker-proxy processes become significant. Docker engine therefore also provides an iptables-based port mapping mechanism, available since Docker 1.6 (if I remember the version correctly), which removes the need to start docker-proxy processes. We only need to modify the Docker engine's startup configuration:
On systems that use the systemd init system, for how to configure --userland-proxy=false for the Docker engine you can refer to the article "When Docker Meets Systemd".
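As a rough sketch only (the exact chains and rules Docker generates are more involved, and the output below is illustrative rather than captured from this machine): with --userland-proxy=false on the daemon command line, a published port such as the nginx example above is handled by DNAT rules in the nat table's DOCKER chain (plus hairpin-NAT rules) instead of a docker-proxy process, and you can inspect them like this:

$ sudo iptables -t nat -L DOCKER -n
Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9091 to:172.17.0.2:80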
Since that mechanism is not closely related to network namespaces, I will dig into it separately later. ^0^
VI. References
1, "Docker Networking Cookbook"
2. "Docker Cookbook"
© Bigwhite. All rights reserved.