Linux Namespace Introduction
We often hear that Docker is a virtualization tool built on Linux Namespace and Cgroups, but what Linux Namespace is and how Docker uses it still confuses many people. Let's start by introducing Linux Namespace and how it is used in containers.
Concept
Linux Namespace is a kernel feature that isolates a range of system resources, such as PIDs (process IDs), user IDs, and the network. This may remind many people of the chroot command: just as chroot makes a given directory appear to be the root directory (isolating the file system view), a namespace isolates a process's view of other resources, including the process tree, network interfaces, and mount points.
For example, suppose a company sells its computing resources to the outside world. The company has a powerful server on which each customer buys a Tomcat instance to run their own application. A careless or malicious customer might get into another customer's Tomcat instance and modify or shut down its resources, so customers end up interfering with each other. You might say that we can restrict permissions so that each user can access only their own Tomcat, but some operations require system-level permissions such as root. We cannot give every user root privileges on the host, and it is not practical to give every user a separate physical host, so this is where Linux namespaces come in handy. With namespaces we can isolate at the UID level: a user whose UID is N on the host can be given a virtual namespace in which that user has root privileges. On the real physical machine that user is still the user with UID N, which solves the user isolation problem. Of course, this is just one simple use of namespaces.
Besides user IDs, PIDs can also be virtualized. Namespaces establish different views of the system: from the user's point of view, each namespace should look like a separate Linux computer with its own init process (PID 1), with the PIDs of other processes increasing from there. Suppose a parent namespace contains two child namespaces, A and B. Both A and B have their own PID 1 init process. Every process in a child namespace is also mapped to a process in the parent namespace, so the parent can see the running state of each child, while the child namespaces are isolated from each other. For example, a process whose PID is 3 in the parent namespace can have PID 1 inside child namespace A. From inside A, that process looks like the init process; from the point of view of the whole host, it is simply process 3 running inside a virtualized space.
Currently Linux implements six different types of namespace.
Namespace type | System call parameter | Kernel version |
Mount namespaces | CLONE_NEWNS | 2.4.19 |
UTS namespaces | CLONE_NEWUTS | 2.6.19 |
IPC namespaces | CLONE_NEWIPC | 2.6.19 |
PID namespaces | CLONE_NEWPID | 2.6.24 |
Network namespaces | CLONE_NEWNET | 2.6.29 |
User namespaces | CLONE_NEWUSER | 3.8 |
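As a rough illustration of how these parameters are combined, the sketch below maps short names to the clone flags from Go's syscall package and ORs them into a single value of the kind later passed as Cloneflags. The short names ("mount", "uts", and so on) and the helper cloneFlags are made up for this example; only the CLONE_* constants come from the syscall package, and they are available on Linux only.

```go
package main

import (
	"fmt"
	"syscall"
)

// namespaceFlags maps an illustrative short name (chosen for this sketch)
// to the corresponding clone flag from the table above.
var namespaceFlags = map[string]uintptr{
	"mount": syscall.CLONE_NEWNS,
	"uts":   syscall.CLONE_NEWUTS,
	"ipc":   syscall.CLONE_NEWIPC,
	"pid":   syscall.CLONE_NEWPID,
	"net":   syscall.CLONE_NEWNET,
	"user":  syscall.CLONE_NEWUSER,
}

// cloneFlags ORs together the flags for the requested namespace names.
// It returns an error for a name that is not in namespaceFlags.
func cloneFlags(names ...string) (uintptr, error) {
	var flags uintptr
	for _, name := range names {
		flag, ok := namespaceFlags[name]
		if !ok {
			return 0, fmt.Errorf("unknown namespace type %q", name)
		}
		flags |= flag
	}
	return flags, nil
}

func main() {
	flags, err := cloneFlags("uts", "ipc", "pid")
	if err != nil {
		fmt.Println(err)
		return
	}
	// The result has the same type as syscall.SysProcAttr.Cloneflags.
	fmt.Printf("Cloneflags: %#x\n", flags)
}
```

Because each flag is a distinct bit, requesting several namespaces at once is just a bitwise OR of the individual flags.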
The namespace API mainly consists of three system calls:
clone() - Creates a new process. The flags passed to the call determine which types of namespace are created, and the new process and its children are placed in those namespaces.
unshare() - Moves the calling process out of a namespace it currently shares and into a newly created one.
setns() - Adds the calling process to an existing namespace.
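The examples in this article all go through clone(). For comparison, here is a minimal sketch of the unshare() route using syscall.Unshare and syscall.Sethostname from Go's standard library. The helper names unshareUTS and currentHostname are invented for this sketch, and it assumes a Linux host where the program runs with sufficient privileges (typically root); without them it only reports the error. setns() is not shown; as far as I know the standard syscall package has no wrapper for it, while golang.org/x/sys/unix offers unix.Setns.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// unshareUTS moves the calling OS thread into a new UTS namespace and sets
// the hostname there. On success the goroutine stays locked to its thread,
// because only that thread lives in the new namespace.
func unshareUTS(hostname string) error {
	// unshare(2) acts on the calling thread, so pin this goroutine to it.
	runtime.LockOSThread()
	if err := syscall.Unshare(syscall.CLONE_NEWUTS); err != nil {
		runtime.UnlockOSThread()
		return fmt.Errorf("unshare: %w", err)
	}
	if err := syscall.Sethostname([]byte(hostname)); err != nil {
		return fmt.Errorf("sethostname: %w", err)
	}
	return nil
}

// currentHostname returns the hostname as seen by the calling thread,
// or an empty string if it cannot be read.
func currentHostname() string {
	name, err := os.Hostname()
	if err != nil {
		return ""
	}
	return name
}

func main() {
	if err := unshareUTS("bird"); err != nil {
		fmt.Fprintln(os.Stderr, "could not unshare (are you root?):", err)
		return
	}
	fmt.Println("hostname inside the new UTS namespace:", currentHostname())
}
```

One design point worth noting: in Go, goroutines migrate between OS threads, so a namespace change made with unshare() has to be paired with runtime.LockOSThread. This is one reason the article's examples let exec.Command perform the clone() for a child process instead.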
UTS Namespace
The UTS namespace mainly isolates two system identifiers: nodename and domainname. Each UTS namespace is allowed to have its own hostname.
Below we use Go to build a UTS Namespace example. C is arguably the most natural language for demonstrating the namespace system calls, but the purpose of this book is to implement Docker, and since Docker is written in Go, we use Go throughout. First look at the code, which is very simple:
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // The child process will run sh.
        cmd := exec.Command("sh")
        // Ask clone() to place the child in a new UTS namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
To explain the code: exec.Command("sh") specifies the command that the forked process will run; here we use sh. Next come the system call parameters: as mentioned earlier, the CLONE_NEWUTS flag creates a UTS Namespace. Go wraps the clone() call for us, so when the program runs we end up in an sh environment inside the new namespace.
We run this program on Ubuntu 14.04 with kernel version 3.13.0-65-generic and Go version 1.7.3. After executing go run main.go, we use pstree -pl in the resulting interactive shell to look at the relationship between the processes in the system:
    |-sshd(19820)---bash(19839)---go(19901)-+-main(19912)-+-sh(19915)---pstree(19916)
Then we print the current PID:
    # echo $$
    19915
Verify that the parent and child processes are not in the same UTS namespace:
    # readlink /proc/19912/ns/uts
    uts:[4026531838]
    # readlink /proc/19915/ns/uts
    uts:[4026532193]
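The same check can be scripted. The sketch below reads a namespace link with os.Readlink and splits a target such as uts:[4026531838] into its type and inode number; two processes share a namespace exactly when both parts match. The helpers parseNsLink and sameNamespace are names invented for this example, and main inspects /proc/self only as a stand-in for the PIDs used above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "uts:[4026531838]"
// into the namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

// sameNamespace reports whether two link targets refer to the same namespace.
func sameNamespace(a, b string) (bool, error) {
	kindA, inodeA, err := parseNsLink(a)
	if err != nil {
		return false, err
	}
	kindB, inodeB, err := parseNsLink(b)
	if err != nil {
		return false, err
	}
	return kindA == kindB && inodeA == inodeB, nil
}

func main() {
	// /proc/self refers to this process; substitute a PID to inspect another one.
	target, err := os.Readlink("/proc/self/ns/uts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	kind, inode, err := parseNsLink(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%s namespace inode: %d\n", kind, inode)
}
```

Passing the two link targets from the readlink output above to sameNamespace returns false, matching the conclusion that the processes are in different UTS namespaces.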
We can see that they are indeed not in the same UTS namespace. Since the UTS namespace isolates the hostname, modifying the hostname in this environment should not affect the host outside. Let's try it.
Within this sh environment, change the hostname to bird and then print it:
    # hostname -b bird
    # hostname
    bird
Then we launch another shell on the host and run hostname to see the effect:
    root@iZ254rt8xf1Z:~# hostname
    iZ254rt8xf1Z
The external hostname is not affected by the modification made inside, which demonstrates the role of the UTS namespace.
IPC Namespace
The IPC Namespace is used to isolate System V IPC and POSIX message queues. Each IPC Namespace has its own System V IPC objects and POSIX message queues.
We change the previous code a little:
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
As you can see, we only added syscall.CLONE_NEWIPC to indicate that we also want to create an IPC Namespace. Below we need to open two shells to demonstrate the isolation.
First open a shell on the host and check the existing IPC message queues:
    root@iZ254rt8xf1Z:~# ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
Now we create a message queue:
    root@iZ254rt8xf1Z:~# ipcmk -Q
    Message queue id: 0
Then check again:
    root@iZ254rt8xf1Z:~# ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
    0x5e8f3f1e 0          root       644        0            0
Here we can see one queue. Now let's run our program in another shell:
    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    # ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
Inside the newly created namespace we cannot see the message queue created on the host, which shows that our IPC Namespace was created successfully and IPC has been isolated.
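To make the same comparison from a program, one option is to read /proc/sysvipc/msg, where the Linux kernel lists the System V message queues visible in the caller's IPC namespace. The sketch below assumes the usual layout of that file (one header line, then one row per queue with the queue ID in the second column); the helper parseMsgQueueIDs is a name invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMsgQueueIDs extracts the queue IDs from the contents of
// /proc/sysvipc/msg. Assumption of this sketch: the first line is a header
// and the second whitespace-separated column of each row is the msqid.
func parseMsgQueueIDs(content string) ([]int, error) {
	var ids []int
	lines := strings.Split(strings.TrimSpace(content), "\n")
	for _, line := range lines[1:] { // skip the header line
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		id, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad msqid in line %q: %w", line, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	data, err := os.ReadFile("/proc/sysvipc/msg")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	ids, err := parseMsgQueueIDs(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%d message queue(s) visible: %v\n", len(ids), ids)
}
```

Run on the host after ipcmk -Q, this should list the new queue; run from a process started with CLONE_NEWIPC, the list should be empty.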
PID Namespace
The PID namespace is used to isolate process IDs. The same process can have different PIDs in different PID namespaces. This explains what we often see with Docker: running ps -ef inside a container shows that the container's foreground process has PID 1, while running ps -ef on the host outside the container shows the same process with a different PID. That is the PID namespace at work.
Based on the previous code, we add syscall.CLONE_NEWPID:
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
We need two shells. In one we run our program; in the other, on the host, we look at the process tree to find the real PIDs of our processes:
    root@iZ254rt8xf1Z:~# pstree -pl
    |-sshd(894)-+-sshd(9455)---bash(9475)---bash(19619)
    |           |-sshd(19715)---bash(19734)
    |           |-sshd(19853)---bash(19872)---go(20179)-+-main(20190)-+-sh(20193)
    |           |                                       |             |-{main}(20191)
    |           |                                       |             `-{main}(20192)
    |           |                                       |-{go}(20180)
    |           |                                       |-{go}(20181)
    |           |                                       |-{go}(20182)
    |           |                                       `-{go}(20186)
    |           `-sshd(20124)---bash(20144)---pstree(20196)
As you can see, our Go main program runs with PID 20190, and the sh it started has PID 20193. Now let's look at the shell in which our code is running:
    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    # echo $$
    1
The PID printed inside the namespace is 1. In other words, the sh process that has PID 20193 on the host is mapped to PID 1 inside the namespace. We cannot use ps here, because ps and top read the contents of /proc; we will come back to this in the Mount Namespace section below.
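On newer kernels (4.1 and later, so newer than the 3.13 kernel used in this article) the mapping can also be read directly: /proc/<pid>/status contains an NSpid line that lists the process's PID in each nested PID namespace, outermost first, for example `NSpid: 20193 1`. The sketch below parses that line; parseNSpid is a helper name invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNSpid returns the PIDs from the "NSpid:" line of a /proc/<pid>/status
// file, ordered from the outermost PID namespace to the innermost.
func parseNSpid(status string) ([]int, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			pid, err := strconv.Atoi(field)
			if err != nil {
				return nil, fmt.Errorf("bad PID %q: %w", field, err)
			}
			pids = append(pids, pid)
		}
		return pids, nil
	}
	return nil, errors.New("no NSpid line found (kernel older than 4.1?)")
}

func main() {
	// Read our own status; use /proc/<pid>/status for another process.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	pids, err := parseNSpid(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("PIDs from outermost to innermost namespace:", pids)
}
```

For a process that is not inside a nested PID namespace the list has a single entry; read from the host for a process like the sh above, it would have two.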
Mount Namespace
The mount namespace is used to isolate the view of mount points that each process sees. Processes in different mount namespaces see different file system hierarchies. Calling mount() and umount() inside a mount namespace affects only the file system in the current namespace and has no effect on the global file system.
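A simple way to observe a process's own mount view is to read /proc/self/mounts, which lists the mounts visible in the caller's mount namespace, one per line (device, mount point, file system type, options, and two numeric fields). The sketch below parses the first three fields; the type mountEntry and the helper parseMounts are names invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// mountEntry holds the first three fields of one line of /proc/self/mounts.
type mountEntry struct {
	Device     string
	MountPoint string
	FSType     string
}

// parseMounts parses mount table text in the /proc/self/mounts format.
// Lines with fewer than three fields are skipped.
func parseMounts(content string) []mountEntry {
	var entries []mountEntry
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		entries = append(entries, mountEntry{
			Device:     fields[0],
			MountPoint: fields[1],
			FSType:     fields[2],
		})
	}
	return entries
}

func main() {
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, m := range parseMounts(string(data)) {
		fmt.Printf("%-20s %-10s %s\n", m.MountPoint, m.FSType, m.Device)
	}
}
```

Running it once on the host and once inside a new mount namespace after mounting something there would show the two views diverge.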
Seeing this, you may think of chroot(), which also turns a subdirectory into a root node. Mount namespaces can achieve the same thing, but in a more flexible and secure way.
The mount namespace was the first namespace type implemented by Linux, which is why its system call parameter is CLONE_NEWNS (NEWNS being short for "new namespace"). Apparently nobody realized at the time that many more namespace types would later join the Linux family.
We make a small change to the code above, adding the CLONE_NEWNS flag:
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
First we run the code and look at the contents of /proc. proc is a file system that provides a mechanism for the kernel and kernel modules to send information to processes.
    # ls /proc
    1 19872 739 865 bus filesystems kpagecount pagetypeinfo sysvipc
    10 145 2 348 866 cgroups fs kpageflags partitions timer_list
    100 1472 869 cmdline interrupts latency_stats sched_debug timer_stats
    11 1475 20124 353 894 consoles iomem loadavg schedstat tty
    1174 15 20129 6 776 9 cpuinfo ioports locks scsi uptime
    1192 154 20144 28 37 49 937 crypto ipmi mdstat self version
    12 155 20215 29 38 5 607 796 945 devices irq meminfo slabinfo version_signature
    1255 16 20226 3 39 50 61 8 94 diskstats kallsyms misc softirqs vmallocinfo
    1277 17 20229 30 391 51 62 827 967 dma kcore modules stat vmstat
    1296 20231 836 driver key-users mounts swaps xen
    13 7 860 acpi execdomains keys mtrr sys zoneinfo
    1309 19853 733 862 buddyinfo fb kmsg net sysrq-trigger
Because /proc here is still the host's, what we see is rather cluttered. Next we mount /proc inside our own namespace:
    # mount -t proc proc /proc
    # ls /proc
    1          consoles   execdomains  ipmi       kpagecount     misc          sched_debug  swaps          uptime
    5          cpuinfo    fb           irq        kpageflags     modules       schedstat    sys            version
    acpi       crypto     filesystems  kallsyms   latency_stats  mounts        scsi         sysrq-trigger  version_signature
    buddyinfo  devices    fs           kcore      loadavg        mtrr          self         sysvipc        vmallocinfo
    bus        diskstats  interrupts   key-users  locks          net           slabinfo     timer_list     vmstat
    cgroups    dma        iomem        keys       mdstat         pagetypeinfo  softirqs     timer_stats    xen
    cmdline    driver     ioports      kmsg       meminfo        partitions    stat         tty            zoneinfo
As you can see, most of the entries have disappeared. Now we can use ps to look at the processes in the system:
    # ps -ef
    UID        PID  PPID  C STIME TTY          TIME CMD
    root         1     0  0 20:15 pts/4    00:00:00 sh
    root         6     1  0 20:19 pts/4    00:00:00 ps -ef
As you can see, in the current namespace our sh process is PID 1. This shows that mounts inside our mount namespace are isolated from the outside: the mount operation we performed does not affect the host. (On distributions where the root mount is shared by default, such as those using systemd, mounts must first be made private inside the new namespace for this to hold.) Docker volumes also take advantage of this feature.
User Namespace
The user namespace mainly isolates user IDs and group IDs. In other words, the user ID and group ID of a process can be different inside and outside a user namespace. A common use is to create a user namespace on the host as a non-root user and map that user to root inside the namespace. The process then has root privileges inside the user namespace but no root privileges outside it. Starting with Linux kernel 3.8, a non-root process can also create a user namespace, be mapped to root inside it, and hold root privileges within that namespace.
Let's continue with an example.
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // No Credential and no UID/GID mappings are set: requesting a UID that
        // is not mapped in the new user namespace can make fork/exec fail.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
We have added syscall.CLONE_NEWUSER to the previous version. We run this program as root. Before running it, we look at the current user and group on the host:
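    root@iZ254rt8xf1Z:~/gocode/src/book# id
    uid=0(root) gid=0(root) groups=0(root)
We can see that we are the root user. Now we run the program:
    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    $ id
    uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
Inside the namespace the UID is no longer 0, which shows that the user namespace has taken effect. Because no ID mapping was configured, the kernel reports the unmapped user and group as 65534 (nobody/nogroup).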
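The mapping to root described at the start of this section can be requested through the UidMappings and GidMappings fields of syscall.SysProcAttr. The following is a minimal sketch, not the book's code: rootMapping and newUserNamespaceCmd are helper names invented here, the command run is id, and it assumes a kernel and system configuration that permit creating user namespaces.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// rootMapping maps ID 0 inside the user namespace to hostID outside it.
func rootMapping(hostID int) []syscall.SysProcIDMap {
	return []syscall.SysProcIDMap{
		{ContainerID: 0, HostID: hostID, Size: 1},
	}
}

// newUserNamespaceCmd builds a command that runs in a new user namespace
// in which the current user and group appear as root (ID 0).
func newUserNamespaceCmd(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:  syscall.CLONE_NEWUSER,
		UidMappings: rootMapping(os.Getuid()),
		GidMappings: rootMapping(os.Getgid()),
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	// With the mapping in place, id should report uid 0 inside the namespace.
	if err := newUserNamespaceCmd("id").Run(); err != nil {
		fmt.Fprintln(os.Stderr, "could not run in a user namespace:", err)
		os.Exit(1)
	}
}
```

Each mapping entry covers a range of Size consecutive IDs, so a single entry with Size 1 maps exactly one host ID. Every other ID stays unmapped and shows up as 65534, as in the output above.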
Network Namespace
The network namespace is used to isolate the network stack: network devices, IP addresses, ports, and so on. It gives each container its own (virtual) network devices, and applications in a container can bind to ports in their own namespace without conflicting with ports in other namespaces. Once a bridge is set up on the host, containers can easily communicate with each other, and the applications in different containers can all use the same port.
Again, we add a little to the original code, this time the syscall.CLONE_NEWNET flag:
    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // No Credential and no UID/GID mappings are set: requesting a UID that
        // is not mapped in the new user namespace can make fork/exec fail.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER | syscall.CLONE_NEWNET,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
First we look at the network devices on the host:
    root@iZ254rt8xf1Z:~/gocode/src/book# ifconfig
    docker0   Link encap:Ethernet  HWaddr 02:42:d7:5d:c3:b9
              inet addr:192.168.0.1  Bcast:0.0.0.0  Mask:255.255.240.0
              UP BROADCAST MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
    eth0      Link encap:Ethernet  HWaddr 00:16:3e:00:38:cc
              inet addr:10.170.174.187  Bcast:10.170.175.255  Mask:255.255.248.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:5605 errors:0 dropped:0 overruns:0 frame:0
              TX packets:1819 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:7129227 (7.1 MB)  TX bytes:159780 (159.7 KB)
    eth1      Link encap:Ethernet  HWaddr 00:16:3e:00:6d:4d
              inet addr:101.200.126.2  Bcast:101.200.127.255  Mask:255.255.252.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:15433 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6888 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:13287762 (13.2 MB)  TX bytes:1787482 (1.7 MB)
    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
We can see that the host has lo, eth0, eth1, and other network devices. Now we run the program and look inside the network namespace:
    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    $ ifconfig
    $
No network devices are shown inside the namespace, which demonstrates the network isolation between the namespace and the host. Note that ifconfig without arguments lists only active interfaces; a new network namespace does have its own loopback device, but it starts out down.
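The interfaces visible to a process can also be listed from Go with net.Interfaces, which, unlike plain ifconfig, includes interfaces that are down. The sketch below prints each interface with its state; describe and listInterfaces are helper names invented for this example.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// describe formats an interface name together with its up/down state.
func describe(name string, up bool) string {
	state := "down"
	if up {
		state = "up"
	}
	return fmt.Sprintf("%s (%s)", name, state)
}

// listInterfaces returns a description of every network interface visible
// in the caller's network namespace, whether or not it is up.
func listInterfaces() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(ifaces))
	for _, iface := range ifaces {
		out = append(out, describe(iface.Name, iface.Flags&net.FlagUp != 0))
	}
	return out, nil
}

func main() {
	names, err := listInterfaces()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

On the host this would list devices such as lo, eth0, eth1, and docker0; inside a freshly created network namespace one would expect to see only the loopback device, in the down state.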
Summary
In this section we introduced Linux Namespace. There are six types of namespace in total; we briefly introduced each one and used Go to build a small demo of each, so that everyone can get an intuitive understanding. We will use this knowledge in later chapters, where more complex examples of applying namespaces await.
When reprinting, please cite the original source: "Write Your Own Docker".