Let's get started, beginning with Linux namespaces.
Brief introduction
Linux namespaces are a kernel-level mechanism for environment isolation. You may remember that Unix has long had a system call named chroot, which changes the root directory to jail a user inside a specific directory. chroot provides a simple form of isolation: the file system inside the chroot cannot reach anything outside it. Building on this idea, Linux namespaces provide isolation for UTS, IPC, mount, PID, network, user, and so on.
For example, as we all know, the ancestor of all processes on Linux has PID 1. So, just as with chroot, if we could jail a user's process space to one branch of the process tree, and make the processes under it see PID 1 as their ancestor, we would achieve resource isolation (processes in different PID namespaces cannot see each other).
Linux has the following kinds of namespaces; the official documentation is "Namespaces in operation".
Type                 System call flag   Kernel version
Mount namespaces     CLONE_NEWNS        Linux 2.4.19
UTS namespaces       CLONE_NEWUTS       Linux 2.6.19
IPC namespaces       CLONE_NEWIPC       Linux 2.6.19
PID namespaces       CLONE_NEWPID       Linux 2.6.24
Network namespaces   CLONE_NEWNET       started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces      CLONE_NEWUSER      started in Linux 2.6.23, completed in Linux 3.8
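As a small illustration of the table, the flags can be looked up by the short names the kernel uses under /proc/<pid>/ns (mnt, uts, ipc, pid, net, user). This is only a sketch: the helper name ns_flag is hypothetical and not part of any library, while the CLONE_* constants come from <sched.h> with _GNU_SOURCE defined.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map a short namespace name to its CLONE_* flag.
 * Returns 0 for an unknown name. */
int ns_flag(const char *name)
{
    static const struct { const char *name; int flag; } table[] = {
        { "mnt",  CLONE_NEWNS   },
        { "uts",  CLONE_NEWUTS  },
        { "ipc",  CLONE_NEWIPC  },
        { "pid",  CLONE_NEWPID  },
        { "net",  CLONE_NEWNET  },
        { "user", CLONE_NEWUSER },
    };
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(name, table[i].name) == 0)
            return table[i].flag;
    }
    return 0;
}
```

Several flags can be combined with a bitwise OR, for example ns_flag("uts") | ns_flag("pid"), which is how the programs in this article pass them to clone().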
There are three main system calls:
clone() - the system call used to implement threads; it creates a new process, which can be isolated by passing the flags above.
unshare() - moves the calling process out of a namespace it currently shares
setns() - adds the calling process to an existing namespace
unshare() and setns() are relatively simple; you can read their man pages yourself, so I will not cover them here.
Let's look at some examples (the test programs below are best run on Linux kernel 3.8 or later; I use Ubuntu 14.04).
The clone() system call
First, let's look at the simplest possible example of the clone() system call (the later programs are all modifications of this one):
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
/* Define a stack for clone(); stack size is 1 MB */
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    NULL
};
int container_main(void* arg)
{
    printf("Container - inside the container!\n");
    /* Execute a shell directly so that we can check whether the resources in the process space are isolated */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    printf("Parent - start a container!\n");
    /* Call clone(), passing a function and a stack (we pass the end of the buffer because the stack grows downward) */
    int container_pid = clone(container_main, container_stack+STACK_SIZE, SIGCHLD, NULL);
    /* Wait for the child process to end */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
From the program above we can see that this is much like using pthreads. In this program, however, the parent and child processes are not isolated from each other in any way: anything the parent can access, the child can access too.
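The program passes the end of the buffer to clone() because, on the architectures Linux commonly runs on, the stack grows downward, so clone() expects a pointer to the top of the child's stack. An alternative to the static array, along the lines of the example in the clone(2) man page, is to allocate the stack with mmap(). The helper name alloc_stack_top below is hypothetical; this is a sketch, not the author's code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate `size` bytes for a child stack and return
 * a pointer to its TOP (one past the last byte), which is what clone()
 * expects when the stack grows downward. Returns NULL on failure. */
char *alloc_stack_top(size_t size)
{
    assert(size > 0);
    char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    return base + size;
}
```

With this helper, the call in main() could become clone(container_main, alloc_stack_top(STACK_SIZE), SIGCHLD, NULL), after checking the returned pointer for NULL.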
Next, let's look at a few examples of what Linux namespaces do.
UTS Namespace
In the code below I omit the header files and data definitions shown above and keep only the essential part.
int container_main(void* arg)
{
    printf("Container - inside the container!\n");
    sethostname("container", 10); /* set hostname */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    printf("Parent - start a container!\n");
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | SIGCHLD, NULL); /* enable CLONE_NEWUTS namespace isolation */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
Run the program (root privileges are required) and you will find that the hostname of the child process has become container.
hchen@ubuntu:~$ sudo ./uts
Parent - start a container!
Container - inside the container!
root@container:~# hostname
container
root@container:~# uname -n
container
IPC Namespace
IPC stands for inter-process communication, a family of mechanisms for communication between processes on Unix/Linux, including shared memory, semaphores, and message queues. To achieve isolation we therefore also need to isolate IPC, so that only processes in the same namespace can communicate with one another. If you are familiar with how IPC works, you know that each IPC object has a global ID. Because it is global, the namespace has to isolate this ID so that processes in other namespaces cannot see it.
To enable IPC isolation, we only need to add the CLONE_NEWIPC flag when calling clone().
int container_pid = clone(container_main, container_stack+STACK_SIZE,
        CLONE_NEWUTS | CLONE_NEWIPC | SIGCHLD, NULL);
First, we create an IPC message queue (its global queue ID is 0, as shown below):
hchen@ubuntu:~$ ipcmk -Q
Message queue id: 0
hchen@ubuntu:~$ ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0
If we run the program without CLONE_NEWIPC, the globally visible IPC queue can still be seen in the child process.
hchen@ubuntu:~$ sudo ./uts
Parent - start a container!
Container - inside the container!
root@container:~# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0
However, if we run the program with CLONE_NEWIPC, we will have the following result:
root@ubuntu:~$ sudo ./ipc
Parent - start a container!
Container - inside the container!
root@container:~/linux_namespace# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
We can see that IPC has been isolated.
PID Namespace
We continue to modify the program above:
int container_main(void* arg)
{
    /* Print the PID of the child process; the output shows that it is 1 */
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 10);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* Enable the PID namespace with CLONE_NEWPID */
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
The output is as follows (we can see that the PID of the child process is 1):
hchen@ubuntu:~$ sudo ./pid
Parent [ 3474] - start a container!
Container [    1] - inside the container!
root@container:~# echo $$
1
You may ask: what use is a PID of 1? In a traditional UNIX system, the process with PID 1 is init, which has a very special status. It is the ancestor of all processes and has many privileges (such as being able to mask signals). It also watches over the state of all processes: if a child process becomes detached from its parent (the parent did not wait for it), init takes over, reclaims its resources, and ends it. So to isolate a process space, we first need a process with PID 1, and, as with chroot, it is best if the PID of the process inside the container becomes 1.
However, we find that running ps or top in the child's shell still shows all processes, which means the isolation is incomplete. That is because commands such as ps and top read the /proc file system, and since /proc is the same in both parent and child, the commands display the same thing.
Therefore, we also need to isolate the file system.
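To see why a shared /proc gives the same output, here is a rough sketch of what ps-like tools do at their core: they list the numeric directory names under /proc. The function names are hypothetical, and real tools read far more than directory names.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Return 1 if `name` consists only of digits (how PIDs appear under /proc). */
int is_pid_name(const char *name)
{
    assert(name != NULL);
    if (*name == '\0')
        return 0;
    for (; *name != '\0'; name++) {
        if (!isdigit((unsigned char)*name))
            return 0;
    }
    return 1;
}

/* Count the process directories visible under `procdir` (normally "/proc").
 * Returns -1 if the directory cannot be opened. */
int count_visible_pids(const char *procdir)
{
    DIR *dir = opendir(procdir);
    if (dir == NULL)
        return -1;
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_pid_name(entry->d_name))
            count++;
    }
    closedir(dir);
    return count;
}
```

As long as the child still sees the parent's /proc mount, count_visible_pids("/proc") reports the host's processes regardless of the PID namespace the caller is in.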
Mount Namespace
In the following program, we enable the mount namespace and remount the /proc file system in the child process.
int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 10);
    /* Remount the proc file system on /proc */
    system("mount -t proc proc /proc");
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* Enable the mount namespace by adding the CLONE_NEWNS flag */
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
The results of the operation are as follows:
hchen@ubuntu:~$ sudo ./pid.mnt
Parent [ 3502] - start a container!
Container [    1] - inside the container!
root@container:~# ps -elf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1 0 0 0 - 6917 wait 19:55 pts/2 00:00:00 /bin/bash
0 R root 1 0 0 - 5671 - 19:56 pts/2 00:00:00 ps -elf
Above we can see only two processes, and the process with pid=1 is our /bin/bash. We can also see that the /proc directory is a lot cleaner:
root@container:~# ls /proc
1 dma key-users net sysvipc
driver kmsg pagetypeinfo timer_list
acpi execdomains kpagecount partitions timer_stats
asound fb kpageflags sched_debug tty
buddyinfo filesystems loadavg schedstat uptime
bus fs locks scsi version
cgroups interrupts mdstat self version_signature
cmdline iomem meminfo slabinfo vmallocinfo
consoles ioports misc softirqs vmstat
cpuinfo irq modules stat zoneinfo
crypto kallsyms mounts swaps
devices kcore mpt sys
diskstats keys mtrr sysrq-trigger
The top command in the child process likewise sees only two processes.
A little more on this. When a mount namespace is created with CLONE_NEWNS, the parent's view of the mounted file systems is copied to the child. All mount operations performed in the child's new namespace then affect only its own view and have no impact on the outside world. This makes for much stricter isolation.
You might ask: are there other file systems that need to be mounted like this? Yes.
Docker's Mount Namespace
Below I build a knock-off "image" that imitates Docker's mount namespace.
First we need a rootfs: we copy the commands that our image should contain into a rootfs directory, imitating the layout of a Linux system with the following directories:
hchen@ubuntu:~/rootfs$ ls
bin dev etc home lib lib64 mnt opt proc root run sbin sys tmp usr var
Then we copy the commands we need into rootfs/bin (the sh command must be copied in, otherwise we cannot chroot):
hchen@ubuntu:~/rootfs$ ls ./bin ./usr/bin
./bin:
bash chown gzip less mount netstat rm tabs tee top tty
cat cp hostname ln mountpoint ping sed tac test touch umount
chgrp echo ip ls mv ps sh tail timeout tr uname
chmod grep kill more nc pwd sleep tar toe truncate which
./usr/bin:
awk env groups head id mesg sort strace tail top uniq vi wc xargs
Note: you can use the ldd command to find the .so files each command depends on, then copy them into the corresponding directory:
hchen@ubuntu:~/rootfs/bin$ ldd bash
    linux-vdso.so.1 => (0x00007fffd33fc000)
    libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f4bd42c2000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bd40be000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bd3cf8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4bd4504000)
Here are some of the .so files in my rootfs:
hchen@ubuntu:~/rootfs$ ls ./lib64 ./lib/x86_64-linux-gnu/
./lib64:
ld-linux-x86-64.so.2
./lib/x86_64-linux-gnu/:
libacl.so.1 libmemusage.so libnss_files-2.19.so libpython3.4m.so.1
libacl.so.1.1.0 libmount.so.1 libnss_files.so.2 libpython3.4m.so.1.0
libattr.so.1 libmount.so.1.1.0 libnss_hesiod-2.19.so libresolv-2.19.so
libblkid.so.1 libm.so.6 libnss_hesiod.so.2 libresolv.so.2
libc-2.19.so libncurses.so.5 libnss_nis-2.19.so libselinux.so.1
libcap.a libncurses.so.5.9 libnss_nisplus-2.19.so libtinfo.so.5
libcap.so libncursesw.so.5 libnss_nisplus.so.2 libtinfo.so.5.9
libcap.so.2 libncursesw.so.5.9 libnss_nis.so.2 libutil-2.19.so
libcap.so.2.24 libnsl-2.19.so libpcre.so.3 libutil.so.1
libc.so.6 libnsl.so.1 libprocps.so.3 libuuid.so.1
libdl-2.19.so libnss_compat-2.19.so libpthread-2.19.so libz.so.1
libdl.so.2 libnss_compat.so.2 libpthread.so.0
libgpm.so.2 libnss_dns-2.19.so libpython2.7.so.1
libm-2.19.so libnss_dns.so.2 libpython2.7.so.1.0
There are also some configuration files that these commands depend on:
hchen@ubuntu:~/rootfs$ ls ./etc
bash.bashrc group hostname hosts ld.so.cache nsswitch.conf passwd profile
resolv.conf shadow
Now you may say: some of this configuration I want to set when the container starts, not hard-code in the image, for example /etc/hosts, /etc/hostname, and the DNS file /etc/resolv.conf. Fine. Outside the rootfs, we create a conf directory and put these files there.
hchen@ubuntu:~$ ls ./conf
hostname hosts resolv.conf
This way, the parent process can set the files the container needs dynamically and then mount them into the container, which makes the configuration in the container's image more flexible.
Now, finally, here is our program.
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    "-l",
    NULL
};
int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    // set hostname
    sethostname("container", 10);
    // remount "/proc" to make sure that "top" and "ps" show the container's information
    if (mount("proc", "rootfs/proc", "proc", 0, NULL) != 0) {
        perror("proc");
    }
    if (mount("sysfs", "rootfs/sys", "sysfs", 0, NULL) != 0) {
        perror("sys");
    }
    if (mount("none", "rootfs/tmp", "tmpfs", 0, NULL) != 0) {
        perror("tmp");
    }
    if (mount("udev", "rootfs/dev", "devtmpfs", 0, NULL) != 0) {
        perror("dev");
    }
    if (mount("devpts", "rootfs/dev/pts", "devpts", 0, NULL) != 0) {
        perror("dev/pts");
    }
    if (mount("shm", "rootfs/dev/shm", "tmpfs", 0, NULL) != 0) {
        perror("dev/shm");
    }
    if (mount("tmpfs", "rootfs/run", "tmpfs", 0, NULL) != 0) {
        perror("run");
    }
    /*
     * Imitate Docker by mounting the relevant configuration files from outside the container.
     * Look in the /var/lib/docker/containers/<container_id>/ directory
     * and you will see Docker's versions of these files.
     */
    if (mount("conf/hosts", "rootfs/etc/hosts", "none", MS_BIND, NULL) != 0 ||
        mount("conf/hostname", "rootfs/etc/hostname", "none", MS_BIND, NULL) != 0 ||
        mount("conf/resolv.conf", "rootfs/etc/resolv.conf", "none", MS_BIND, NULL) != 0) {
        perror("conf");
    }
    /* Imitate the -v, --volume=[] option of the docker run command */
    if (mount("/tmp/t1", "rootfs/mnt", "none", MS_BIND, NULL) != 0) {
        perror("mnt");
    }
    /* Isolate the directory with chroot */
    if (chdir("./rootfs") != 0 || chroot("./") != 0) {
        perror("chdir/chroot");
    }
    execv(container_args[0], container_args);
    perror("exec");
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
Run the program above with sudo, and you will see the following mount information and a so-called "image":
hchen@ubuntu:~$ sudo ./mount
Parent [ 4517] - start a container!
Container [    1] - inside the container!
root@container:/# mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /tmp type tmpfs (rw,relatime)
udev on /dev type devtmpfs (rw,relatime,size=493976k,nr_inodes=123494,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /run type tmpfs (rw,relatime)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/hosts type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/hostname type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/resolv.conf type ext4 (rw,relatime,errors=remount-ro,data=ordered)
root@container:/# ls /bin /usr/bin
/bin:
bash chmod echo hostname less more mv ping rm sleep tail test top truncate uname
cat chown grep ip ln mount nc ps sed tabs tar timeout touch tty which
chgrp cp gzip kill ls mountpoint netstat pwd sh tac tee toe tr umount
/usr/bin:
awk env groups head id mesg sort strace tail top uniq vi wc xargs
As for how to build a chroot directory, there is a tool called DebootstrapChroot that is worth a look (its documentation is in English).
The rest you can play with on your own; I trust your imagination. :)
User Namespace
The user namespace is enabled with the CLONE_NEWUSER flag. With this flag, the UID and GID seen inside differ from those outside and show as 65534 by default. That is because the container cannot find its real UID, so the kernel falls back to the overflow UID (the value is defined in /proc/sys/kernel/overflowuid).
To map UIDs in the container to UIDs on the real system, you need to modify two files, /proc/<pid>/uid_map and /proc/<pid>/gid_map. The format of both files is:
ID-inside-ns ID-outside-ns length
where:
The first field, ID-inside-ns, is the UID or GID shown inside the container.
The second field, ID-outside-ns, is the real UID or GID outside the container that it maps to.
The third field is the length of the mapped range; it is usually 1, meaning a one-to-one mapping.
For example, to map the real uid=1000 to uid=0 in the container:
$ cat /proc/2465/uid_map
0 1000 1
As another example, the following maps UIDs starting at 0 inside the namespace to UIDs starting at 0 outside, over the largest range an unsigned 32-bit integer allows:
$ cat /proc/$$/uid_map
0 0 4294967295
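The three fields describe one contiguous range, so translating an ID is simple arithmetic. The sketch below only illustrates the file format and is not kernel code; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One line of uid_map/gid_map: "ID-inside-ns ID-outside-ns length". */
struct id_map {
    unsigned long inside;   /* first ID as seen inside the namespace */
    unsigned long outside;  /* first ID as seen outside the namespace */
    unsigned long length;   /* number of consecutive IDs mapped */
};

/* Translate an ID seen inside the namespace to the outside ID.
 * Returns 0 and stores the result in *out, or -1 if `id` is not covered. */
int map_to_outside(const struct id_map *m, unsigned long id, unsigned long *out)
{
    assert(m != NULL && out != NULL);
    if (id < m->inside || id - m->inside >= m->length)
        return -1;
    *out = m->outside + (id - m->inside);
    return 0;
}
```

With the mapping "0 1000 1", UID 0 inside translates to UID 1000 outside, and any other inside UID has no mapping at all.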
In addition, note that:
The process writing to these two files needs the CAP_SETUID (CAP_SETGID) capability in this namespace (see capabilities).
The writing process must be in this user namespace or in its parent user namespace.
One of the following must also hold: 1) the parent process maps its own effective UID/GID into the child's user namespace, or 2) the parent process has the CAP_SETUID/CAP_SETGID capability, in which case it can map any UID/GID of the parent namespace.
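One more rule is worth knowing, although it is not part of the list above: according to user_namespaces(7), since Linux 3.19 a process that lacks CAP_SETGID in the parent user namespace must write "deny" to /proc/<pid>/setgroups before it may write gid_map. The program in this article does not do this, so on newer kernels its gid_map write may fail when run as an ordinary user. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Write `text` to the file at `path`. Returns 0 on success, -1 on error. */
int write_text(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int failed = (fputs(text, fp) == EOF);
    /* stdio buffers the data, so a write error may only surface here */
    if (fclose(fp) != 0)
        failed = 1;
    return failed ? -1 : 0;
}

/* Hypothetical helper: disable setgroups() for process `pid` so that an
 * unprivileged parent may then write /proc/<pid>/gid_map. */
int deny_setgroups(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/setgroups", (long)pid);
    return write_text(path, "deny");
}
```

In a parent process this would be called with the child's PID just before the GID mapping is written.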
These rules look tedious, so let's look at a program (it is a bit long but very simple; if you have read Volume 1 of "UNIX Network Programming", you should be able to follow it):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    NULL
};
int pipefd[2];
void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE* mapfd = fopen(file, "w");
    if (NULL == mapfd) {
        perror("open file error");
        return;
    }
    /* the three fields must be separated by spaces */
    fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
    fclose(mapfd);
}
void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}
void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/gid_map", pid);
    set_map(file, inside_id, outside_id, len);
}
int container_main(void* arg)
{
    /* Wait for the parent's notification before going on (synchronization between processes),
       so that the IDs printed below already reflect the mapping */
    char ch;
    close(pipefd[1]);
    read(pipefd[0], &ch, 1);
    printf("Container [%5d] - inside the container!\n", getpid());
    printf("Container: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());
    printf("Container [%5d] - setup hostname!\n", getpid());
    // set hostname
    sethostname("container", 10);
    // remount "/proc" to make sure that "top" and "ps" show the container's information
    mount("proc", "/proc", "proc", 0, NULL);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}
int main()
{
    const int gid = getgid(), uid = getuid();
    printf("Parent: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());
    pipe(pipefd);
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD, NULL);
    printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);
    // To map the uid/gid,
    //   we need to edit /proc/PID/uid_map (or /proc/PID/gid_map) in the parent.
    // The file format is
    //   ID-inside-ns ID-outside-ns length
    // If there is no mapping,
    //   the uid will be taken from /proc/sys/kernel/overflowuid
    //   the gid will be taken from /proc/sys/kernel/overflowgid
    set_uid_map(container_pid, 0, uid, 1);
    set_gid_map(container_pid, 0, gid, 1);
    printf("Parent [%5d] - user/group mapping done!\n", getpid());
    /* Notify the child process */
    close(pipefd[1]);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
In the program above we use a pipe to synchronize the parent and child. Why? Because the child calls execv, which replaces the child's process image. We want the user namespace UID/GID mapping to be in place before execv runs, so that the bash started by execv shows a # prompt, since we mapped the inside UID to 0.
The whole program runs as follows:
hchen@ubuntu:~$ id
uid=1000(hchen) gid=1000(hchen) groups=1000(hchen)
hchen@ubuntu:~$ ./user #<-- run as user hchen
Parent: eUID = 1000; eGID = 1000, UID=1000, GID=1000
Parent [ 3262] - start a container!
Parent [ 3262] - Container [ 3263]!
Parent [ 3262] - user/group mapping done!
Container [    1] - inside the container!
Container: eUID = 0; eGID = 0, UID=0, GID=0 #<-- the UID/GID in the container are both 0
Container [    1] - setup hostname!
root@container:~# id #<-- the user and the command prompt in the container are both root
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
Although the user inside the container is root, the container's /bin/bash process actually runs as the ordinary user hchen. This improves the security of our container.
Note that a user namespace can be created by an ordinary user, whereas the other namespaces require root privileges. So what if we want to use several namespaces at once? In general, we first create a user namespace as an ordinary user, map that user to root inside it, and then create the other namespaces as root in the container.
Network Namespace
Network namespaces are rather involved. On Linux we generally use the ip command to create a network namespace (the Docker source code does not use the ip command; it implements some of the command's functionality itself, sending some "strange" data over a raw socket, heh). Here I will still use the ip command to explain.
First, let's look at a diagram of how Docker's networking is basically laid out on the host (calling it a "physical network card" is not quite accurate, because Docker may run inside a VM, so the so-called "physical network card" is really just a network card with a routable IP).
In the diagram, Docker uses a private network segment, 172.40.1.0. Docker may also use the private segments 10.0.0.0 and 192.168.0.0; which one it picks depends on your routing table. A segment that is not in the routing table may be used, and if your routing table already covers all private segments, Docker reports an error at startup.
After starting a Docker container, you can use ip link show or ip addr show to view the host's current network (we can see a docker0, and a veth22a38e6 virtual NIC that belongs to the container):
hchen@ubuntu:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 00:0c:29:b7:67:7d brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
5: veth22a38e6: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 8e:30:2a:ac:8c:d1 brd ff:ff:ff:ff:ff:ff
So how do we build something like this? Let's look at a set of commands:
## First, we add a bridge lxcbr0, imitating docker0
brctl addbr lxcbr0
brctl stp lxcbr0 off
ifconfig lxcbr0 192.168.10.1/24 up # set an IP address on the bridge
## Next, we create a network namespace named ns1
# add a namespace named ns1 (using the ip netns add command)
ip netns add ns1
# bring up the loopback in the namespace, that is, 127.0.0.1 (ip netns exec ns1 runs a command inside ns1)
ip netns exec ns1 ip link set dev lo up
## Then we need to add a pair of virtual NICs
# add a pair of virtual NICs of type veth; one end will be pushed into the container
ip link add veth-ns1 type veth peer name lxcbr0.1
# move veth-ns1 into namespace ns1, so that the container gets a new NIC
ip link set veth-ns1 netns ns1
# rename veth-ns1 to eth0 inside the container (the name would clash outside the container, but not inside)
ip netns exec ns1 ip link set dev veth-ns1 name eth0
# assign an IP address to the NIC in the container and bring it up
ip netns exec ns1 ifconfig eth0 192.168.10.11/24 up
# veth-ns1 is now inside the container; next, add lxcbr0.1 to the bridge
brctl addif lxcbr0 lxcbr0.1
# add a default route in the container so that it can reach the outside network
ip netns exec ns1 ip route add default via 192.168.10.1
# create a directory named after the network namespace, ns1, under /etc/netns,
# then set resolv.conf for this namespace so that domain names resolve inside the container
mkdir -p /etc/netns/ns1
echo "nameserver 8.8.8.8" > /etc/netns/ns1/resolv.conf
The above is basically how Docker networking works, except that:
Docker does not set up resolv.conf this way; it uses the mount namespace approach described in the previous section.
In addition, Docker uses the PID of the process as the name of the network namespace.
Knowing this, you can even add a new NIC to a running Docker container (container_pid and routeable_ip are placeholders to fill in yourself):
ip link add peerA type veth peer name peerB
brctl addif docker0 peerA
ip link set peerA up
ip link set peerB netns ${container_pid}
ip netns exec ${container_pid} ip link set dev peerB name eth1
ip netns exec ${container_pid} ip link set eth1 up
ip netns exec ${container_pid} ip addr add ${routeable_ip} dev eth1
The example above adds an eth1 NIC to a running Docker container and gives it a static IP address that can be reached from outside.
This requires putting the external "physical network card" into promiscuous mode. The eth1 NIC announces its own MAC address via ARP, so the external switch forwards packets for that IP address to the "physical network card". Because the card is in promiscuous mode, eth1 can receive the data, see that it is addressed to itself, and accept it. In this way the Docker container's network is connected to the outside.
Of course, both Docker's NAT approach and promiscuous mode have performance problems. NAT, needless to say, has forwarding overhead. In promiscuous mode, all traffic received by the physical card is passed on to every virtual NIC, so even a virtual NIC with no data of its own is affected by traffic meant for the others.
Neither approach is perfect. The real solution to this kind of network problem requires VLAN technology, which is why engineers at Google implemented an IPVLAN driver for the Linux kernel; it is basically tailor-made for Docker.
Namespace files
That covers how Linux namespaces are used today. Now let's look at some other related things.
Let's run the pid.mnt program from earlier (the one that mounts proc inside a PID namespace) and leave it running.
$ sudo ./pid.mnt
[sudo] password for hchen:
Parent [ 4599] - start a container!
Container [    1] - inside the container!
Let's look at the PIDs of the parent and child processes in another shell:
hchen@ubuntu:~$ pstree -p 4599
pid.mnt(4599)───bash(4600)
We can look under proc (/proc/<pid>/ns) to see the ID of each namespace a process belongs to (this requires kernel 3.8 or later).
The following are for the parent process:
hchen@ubuntu:~$ sudo ls -l /proc/4599/ns
total 0
lrwxrwxrwx 1 root 0 April 7 22:01 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root 0 April 7 22:01 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root 0 April 7 22:01 net -> net:[4026531956]
lrwxrwxrwx 1 root 0 April 7 22:01 pid -> pid:[4026531836]
lrwxrwxrwx 1 root 0 April 7 22:01 user -> user:[4026531837]
lrwxrwxrwx 1 root 0 April 7 22:01 uts -> uts:[4026531838]
The following are for the child process:
hchen@ubuntu:~$ sudo ls -l /proc/4600/ns
total 0
lrwxrwxrwx 1 root 0 April 7 22:01 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root 0 April 7 22:01 mnt -> mnt:[4026532520]
lrwxrwxrwx 1 root 0 April 7 22:01 net -> net:[4026531956]
lrwxrwxrwx 1 root 0 April 7 22:01 pid -> pid:[4026532522]
lrwxrwxrwx 1 root 0 April 7 22:01 user -> user:[4026531837]
lrwxrwxrwx 1 root 0 April 7 22:01 uts -> uts:[4026532521]
We can see that the ipc, net, and user IDs are the same, while mnt, pid, and uts differ. If two processes point to the same namespace number, they are in the same namespace; otherwise they are in different namespaces.
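The same comparison can be done in code. The sketch below assumes the "type:[number]" link format shown in the listings above; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse a link target such as "uts:[4026531838]" and return the number,
 * or 0 if the text does not have that shape. */
unsigned long long parse_ns_id(const char *link)
{
    assert(link != NULL);
    const char *bracket = strstr(link, ":[");
    unsigned long long id = 0;
    if (bracket == NULL || sscanf(bracket, ":[%llu]", &id) != 1)
        return 0;
    return id;
}

/* Read the link /proc/<pid>/ns/<type> and return its namespace ID,
 * or 0 on error. */
unsigned long long ns_id_of(pid_t pid, const char *type)
{
    char path[64], target[64];
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, type);
    ssize_t n = readlink(path, target, sizeof(target) - 1);
    if (n < 0)
        return 0;
    target[n] = '\0';   /* readlink() does not NUL-terminate */
    return parse_ns_id(target);
}
```

Two processes a and b are in the same UTS namespace when ns_id_of(a, "uts") and ns_id_of(b, "uts") are equal and nonzero. Reading another user's /proc/<pid>/ns entries may require root, as the sudo in the listings suggests.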
These files have another use: once one of them is opened, the namespace continues to exist for as long as the fd is held open, even after all processes in that namespace have ended. For example, we can keep a namespace alive with: mount --bind /proc/4600/ns/uts ~/uts
In addition, we mentioned the setns system call earlier; its function declaration is:
int setns (int fd, int nstype);
The first parameter is an fd, the file descriptor returned by the open() system call after opening the file, for example:
fd = open("/proc/4600/ns/uts", O_RDONLY); // get the namespace file descriptor
setns(fd, 0); // join the namespace
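Putting the two lines together, here is a hedged sketch of a helper that joins one namespace of another process. The name join_namespace is hypothetical, and setns() itself needs sufficient privileges (typically root) to succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make the calling process join the namespace of
 * kind `type` ("uts", "ipc", "net", ...) that process `pid` belongs to.
 * Returns 0 on success, -1 on error. */
int join_namespace(pid_t pid, const char *type)
{
    char path[64];
    assert(type != NULL);
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, type);

    int fd = open(path, O_RDONLY);   /* get the namespace file descriptor */
    if (fd < 0)
        return -1;

    /* nstype 0 accepts whatever kind of namespace the fd refers to */
    int rc = setns(fd, 0);
    close(fd);                       /* the fd is no longer needed */
    return rc;
}
```

For the processes in this section, join_namespace(4600, "uts") run as root would place the caller in the container's UTS namespace, after which hostname should print container.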