Write Your Own Docker, Part 1: Linux Namespace

Linux Namespace Introduction

We often hear that Docker is a virtualization tool built on Linux Namespace and Cgroups, but many people are unclear about what Linux Namespace is and how Docker uses it. Let's start by introducing Linux Namespace and how it is used in containers.

Concept

Linux Namespace is a kernel feature that isolates a range of system resources, such as PIDs (process IDs), user IDs, and the network. At this point many people think of the chroot command. Just as chroot makes a given directory appear to be the root directory (isolating the file system view), namespaces isolate processes with respect to other resources, including the process tree, network interfaces, and mount points.

For example, suppose a company sells its computing resources to outside customers. It has one high-performance server, and each customer buys a Tomcat instance on it to run their own application. A careless or mischievous customer might get into another customer's Tomcat instance and modify or shut down its resources, so customers would interfere with each other. You might say we can restrict each user's permissions so that they can access only their own Tomcat, but some operations require system-level privileges such as root. We cannot give every user real root privileges, nor can we give every user a separate physical host, and this is where Linux namespaces come in handy. With namespaces we can isolate at the UID level: a user with UID n on the host can be given a virtualized namespace in which that user has root privileges, while on the real physical machine they are still the user with UID n. This solves the user isolation problem. Of course, this is only one simple use of namespaces.

Besides user IDs, PIDs can also be virtualized. Namespaces establish different views of the system. From the user's point of view, each namespace should look like a separate Linux machine with its own init process (PID 1), with the PIDs of other processes increasing from there. Two child namespaces A and B can each have their own PID 1 init process. Processes in a child namespace are mapped to processes in the parent namespace, so the parent namespace can see the running state of every child namespace, while the child namespaces are isolated from one another. For example, a process with PID 3 in the parent namespace may have PID 1 in child namespace A. Seen from inside A, that process is the init process; seen from the host as a whole, it is just process 3 running in a virtualized space.

As of the kernel version used here, Linux implements six different types of namespace.

Namespace type        System call parameter    Kernel version
Mount namespaces      CLONE_NEWNS              2.4.19
UTS namespaces        CLONE_NEWUTS             2.6.19
IPC namespaces        CLONE_NEWIPC             2.6.19
PID namespaces        CLONE_NEWPID             2.6.24
Network namespaces    CLONE_NEWNET             2.6.29
User namespaces       CLONE_NEWUSER            3.8

The namespace API mainly consists of three system calls:

    • clone() - Creates a new process. The flags passed to the call determine which types of namespace are created, and the new process's children are also placed in those namespaces.
    • unshare() - Moves the calling process out of a namespace and into a newly created one.
    • setns() - Adds the calling process to an existing namespace.

UTS Namespace

The UTS namespace mainly isolates two system identifiers: nodename and domainname. Each UTS namespace can have its own hostname.

Below we use Go to build a UTS namespace example. C is arguably the most natural language for describing these system calls, but the goal of this book is to implement Docker, and Docker is written in Go, so we use Go throughout. Let's look at the code first; it is very simple:

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Command to run in the new process.
        cmd := exec.Command("sh")
        // Ask clone() to put the new process in a new UTS namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS,
        }
        // Connect the shell to our terminal.
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

To explain the code: exec.Command("sh") specifies the command that the new process will run; here we use sh. Next we set the parameters for the system call. As mentioned earlier, the CLONE_NEWUTS flag creates a UTS namespace. Go wraps the call to clone() for us, so when this code runs we end up in an sh environment.

We run this program on Ubuntu 14.04 with kernel 3.13.0-65-generic and Go 1.7.3. After executing go run main.go, we use pstree -pl in the resulting interactive shell to look at the relationships between processes on the system:

    |-sshd(19820)---bash(19839)---go(19901)-+-main(19912)-+-sh(19915)---pstree(19916)

Then we print the current PID:

    # echo $$
    19915

Verify that the parent and child processes are not in the same UTS namespace:

    # readlink /proc/19912/ns/uts
    uts:[4026531838]
    # readlink /proc/19915/ns/uts
    uts:[4026532193]

We can see that they are indeed in different UTS namespaces. Since the UTS namespace isolates the hostname, changing the hostname inside this environment should not affect the host outside. Let's experiment.
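
The same check can be done from Go. The sketch below reads a namespace link with os.Readlink and splits a target such as uts:[4026531838] into its type and inode number. Two processes are in the same namespace when both parts match. The helper names parseNsLink and sameNamespace are our own, introduced only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "uts:[4026531838]"
// into its type ("uts") and inode number (4026531838).
func parseNsLink(link string) (string, uint64, error) {
	i := strings.Index(link, ":[")
	if i < 0 || !strings.HasSuffix(link, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	inode, err := strconv.ParseUint(link[i+2:len(link)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return link[:i], inode, nil
}

// sameNamespace reports whether two link targets name the same namespace,
// that is, the same type and the same inode number.
func sameNamespace(a, b string) bool {
	kindA, inodeA, errA := parseNsLink(a)
	kindB, inodeB, errB := parseNsLink(b)
	return errA == nil && errB == nil && kindA == kindB && inodeA == inodeB
}

func main() {
	// Inspect the UTS namespace of the current process (Linux only).
	link, err := os.Readlink("/proc/self/ns/uts")
	if err != nil {
		fmt.Println("cannot read namespace link:", err)
		return
	}
	kind, inode, err := parseNsLink(link)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("type=%s inode=%d\n", kind, inode)
}
```

Applied to the two values printed by readlink above, sameNamespace returns false, which matches what we observed.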

Inside this sh environment, change the hostname to bird and then print it:

    # hostname -b bird
    # hostname
    bird

Then we start another shell on the host and run hostname to see the effect:

    root@iZ254rt8xf1Z:~# hostname
    iZ254rt8xf1Z

The hostname outside is not affected by the change made inside, which demonstrates what the UTS namespace does.

IPC Namespace

The IPC namespace isolates System V IPC objects and POSIX message queues. Each IPC namespace has its own System V IPC objects and POSIX message queues.

We make a small change to the previous version of the code.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // New in this version: CLONE_NEWIPC requests a new IPC namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

The only change is the addition of syscall.CLONE_NEWIPC, which asks for a new IPC namespace. To demonstrate the isolation we need two shells.

First, open a shell on the host and list the existing IPC message queues:

    root@iZ254rt8xf1Z:~# ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages

Next, create a message queue:

    root@iZ254rt8xf1Z:~# ipcmk -Q
    Message queue id: 0

Then list the queues again:

    root@iZ254rt8xf1Z:~# ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
    0x5e8f3f1e 0          root       644        0            0

Now one queue is visible. Let's use another shell to run our program:

    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    # ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages

Inside the newly created namespace we cannot see the message queue created on the host, which shows that the IPC namespace was created successfully and IPC is now isolated.
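
The check that ipcs -q performs can also be sketched in Go. This example assumes that /proc/sysvipc/msg is available and, as we understand it, lists the System V message queues of the reader's IPC namespace as one header line followed by one line per queue. The helper countQueues is a name of our own.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// countQueues counts the message queues in text laid out like
// /proc/sysvipc/msg: one header line, then one line per queue.
func countQueues(text string) int {
	scanner := bufio.NewScanner(strings.NewReader(text))
	count := 0
	header := true
	for scanner.Scan() {
		if header {
			header = false // skip the column titles
			continue
		}
		if strings.TrimSpace(scanner.Text()) != "" {
			count++
		}
	}
	return count
}

func main() {
	data, err := os.ReadFile("/proc/sysvipc/msg")
	if err != nil {
		fmt.Println("cannot read message queue list:", err)
		return
	}
	fmt.Println("message queues visible here:", countQueues(string(data)))
}
```

Run on the host after ipcmk -Q and again inside the namespaced shell, the two counts should differ in the same way as the ipcs output above.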

PID Namespace

The PID namespace isolates process IDs. The same process can have different PIDs in different PID namespaces. This explains something we often see with Docker: running ps -ef inside a container shows the container's foreground process with PID 1, while running ps -ef outside the container shows the same process with a different PID. That is the PID namespace at work.

Building on the previous code, we add syscall.CLONE_NEWPID:

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // New in this version: CLONE_NEWPID requests a new PID namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

We need two shells. First we look at the process tree on the host, while the program is running, to find the real PID of our process:

    root@iZ254rt8xf1Z:~# pstree -pl
    |-sshd(894)-+-sshd(9455)---bash(9475)---bash(19619)
    |           |-sshd(19715)---bash(19734)
    |           |-sshd(19853)---bash(19872)---go(20179)-+-main(20190)-+-sh(20193)
    |           |                                       |             |-{main}(20191)
    |           |                                       |             `-{main}(20192)
    |           |                                       |-{go}(20180)
    |           |                                       |-{go}(20181)
    |           |                                       |-{go}(20182)
    |           |                                       `-{go}(20186)
    |           `-sshd(20124)---bash(20144)---pstree(20196)

As you can see, our Go program main runs with PID 20190, and the sh it started has PID 20193. Now let's look at the other shell, in which we ran our code:

    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    # echo $$
    1

The shell prints its own PID as 1. In other words, the sh process that has PID 20193 on the host is mapped to PID 1 inside the namespace. We cannot use ps to check this yet, because ps and top read the contents of /proc; we explain this in the Mount Namespace section below.
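
On newer kernels the PID mapping can also be read directly. The sketch below assumes a kernel of version 4.1 or later, where /proc/<pid>/status contains an NSpid line listing the process's PID in each nested PID namespace, outermost first. The kernel 3.13 used in this article does not have this field. The helper parseNSpid is our own name.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNSpid extracts the PIDs from the "NSpid:" line of the contents of
// a /proc/<pid>/status file. The values are ordered from the outermost
// PID namespace to the innermost one.
func parseNSpid(status string) ([]int, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			pid, err := strconv.Atoi(field)
			if err != nil {
				return nil, err
			}
			pids = append(pids, pid)
		}
		return pids, nil
	}
	return nil, errors.New("no NSpid line found (kernel older than 4.1?)")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	pids, err := parseNSpid(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("PID in each nested namespace, outermost first:", pids)
}
```

For the sh process in our example, a status file read on the host would be expected to list 20193 followed by 1.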

Mount Namespace

The mount namespace isolates the view of mount points that each process sees. Processes in different mount namespaces see different file system hierarchies. Calls to mount() and umount() inside a mount namespace affect only the file systems in that namespace and have no effect on the global file system.

This may remind you of chroot(), which also makes a subdirectory the root. Mount namespaces can achieve the same thing, but in a more flexible and secure way.

The mount namespace was the first namespace type implemented in Linux, which is why its flag is CLONE_NEWNS (short for "new namespace"). Apparently nobody expected at the time that many more namespace types would later join the Linux family.

We make a small change to the code above and add the CLONE_NEWNS flag.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // New in this version: CLONE_NEWNS requests a new mount namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

First we run the code and look at the contents of /proc. proc is a file system through which the kernel and kernel modules expose information to processes.

    # ls /proc
    1 19872 739 865 bus filesystems kpagecount pagetypeinfo sysvipc
    10 145 2 348 866 cgroups fs kpageflags partitions timer_list
    100 1472 869 cmdline interrupts latency_stats sched_debug timer_stats
    11 1475 20124 353 894 consoles iomem loadavg schedstat tty
    1174 15 20129 6 776 9 cpuinfo ioports locks scsi uptime
    1192 154 20144 28 37 49 937 crypto ipmi mdstat self version
    12 155 20215 29 38 5 607 796 945 devices irq meminfo slabinfo version_signature
    1255 16 20226 3 39 50 61 8 94 diskstats kallsyms misc softirqs vmallocinfo
    1277 17 20229 30 391 51 62 827 967 dma kcore modules stat vmstat
    1296 20231 836 driver key-users mounts swaps xen
    13 7 860 acpi execdomains keys mtrr sys zoneinfo
    1309 19853 733 862 buddyinfo fb kmsg net sysrq-trigger

Because /proc here is still the host's, the listing is cluttered. Next we mount /proc inside our own namespace:

    # mount -t proc proc /proc
    # ls /proc
    1          consoles   execdomains  ipmi       kpagecount     misc          sched_debug  swaps          uptime
    5          cpuinfo    fb           irq        kpageflags     modules       schedstat    sys            version
    acpi       crypto     filesystems  kallsyms   latency_stats  mounts        scsi         sysrq-trigger  version_signature
    buddyinfo  devices    fs           kcore      loadavg        mtrr          self         sysvipc        vmallocinfo
    bus        diskstats  interrupts   key-users  locks          net           slabinfo     timer_list     vmstat
    cgroups    dma        iomem        keys       mdstat         pagetypeinfo  softirqs     timer_stats    xen
    cmdline    driver     ioports      kmsg       meminfo        partitions    stat         tty            zoneinfo

As you can see, many entries have disappeared at once. Now we can use ps to look at the system's processes:

    # ps -ef
    UID        PID  PPID  C STIME TTY          TIME CMD
    root         1     0  0 20:15 pts/4    00:00:00 sh
    root         6     1  0 20:19 pts/4    00:00:00 ps -ef

In the current namespace our sh process is PID 1. This also shows that mounts made inside our mount namespace are isolated from the outside: the mount operation does not affect the host. Docker volumes also take advantage of this feature.
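
The difference between the two /proc listings comes down to the numeric entries, one per process visible through that procfs mount. As a rough illustration, the Go sketch below picks those entries out of a directory listing. The helper pidEntries is a name we introduce here, and main simply applies it to whatever /proc the program can see.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

// pidEntries returns the entries of a /proc listing that are process IDs,
// that is, names consisting of a positive number, in ascending order.
func pidEntries(names []string) []int {
	var pids []int
	for _, name := range names {
		if pid, err := strconv.Atoi(name); err == nil && pid > 0 {
			pids = append(pids, pid)
		}
	}
	sort.Ints(pids)
	return pids
}

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("cannot list /proc:", err)
		return
	}
	names := make([]string, 0, len(entries))
	for _, entry := range entries {
		names = append(names, entry.Name())
	}
	fmt.Println("processes visible through /proc:", pidEntries(names))
}
```

Inside the namespace, after /proc has been remounted, only the handful of processes in the new PID namespace should appear.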

User Namespace

The user namespace mainly isolates user IDs and group IDs. In other words, a process can have different user and group IDs inside and outside a user namespace. A common use is for a non-root user on the host to create a user namespace and be mapped to root inside it. The process then has root privileges inside the user namespace but not outside it. Starting with Linux kernel 3.8, a non-root process can create a user namespace, be mapped to root inside it, and have root privileges within that namespace.

Let's continue with an example.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // New in this version: CLONE_NEWUSER requests a new user namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER,
        }
        // Run the child with UID 1 and GID 1.
        cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uint32(1), Gid: uint32(1)}
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

We have added syscall.CLONE_NEWUSER to the previous code. We run this program as root. Before running it, we check the current user and group on the host:

    root@iZ254rt8xf1Z:~/gocode/src/book# id
    uid=0(root) gid=0(root) groups=0(root)

We are the root user. Now we run the program:

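    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    $ id
    uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)

Inside the namespace, id reports a different user than on the host, so the user IDs are isolated. No UID or GID mapping has been set up for the new namespace, so the IDs show up as the overflow ID 65534 (nobody/nogroup).

Network Namespace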

The network namespace isolates the network stack: network devices, IP addresses, ports, and so on. It lets each container have its own (virtual) network devices, and applications in a container can bind to ports in their own namespace without conflicting with ports in other namespaces. Once a bridge is set up on the host, containers can easily communicate with each other, and applications in different containers can use the same port.

Again we add a little to the previous code, this time the syscall.CLONE_NEWNET flag.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        // New in this version: CLONE_NEWNET requests a new network namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER | syscall.CLONE_NEWNET,
        }
        // Run the child with UID 1 and GID 1.
        cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uint32(1), Gid: uint32(1)}
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

First we look at the network devices on the host.

    root@iZ254rt8xf1Z:~/gocode/src/book# ifconfig
    docker0   Link encap:Ethernet  HWaddr 02:42:d7:5d:c3:b9
              inet addr:192.168.0.1  Bcast:0.0.0.0  Mask:255.255.240.0
              UP BROADCAST MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

    eth0      Link encap:Ethernet  HWaddr 00:16:3e:00:38:cc
              inet addr:10.170.174.187  Bcast:10.170.175.255  Mask:255.255.248.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:5605 errors:0 dropped:0 overruns:0 frame:0
              TX packets:1819 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:7129227 (7.1 MB)  TX bytes:159780 (159.7 KB)

    eth1      Link encap:Ethernet  HWaddr 00:16:3e:00:6d:4d
              inet addr:101.200.126.2  Bcast:101.200.127.255  Mask:255.255.252.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:15433 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6888 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:13287762 (13.2 MB)  TX bytes:1787482 (1.7 MB)

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

We can see lo, eth0, eth1, and other network devices on the host. Next we run the program and look inside the network namespace:

    root@iZ254rt8xf1Z:~/gocode/src/book# go run main.go
    $ ifconfig
    $

ifconfig reports no network devices inside the namespace, which demonstrates that the network is isolated between the namespace and the host.

Summary

In this section we introduced Linux Namespace. We briefly covered all six types of namespace and gave a small Go demo of each to build an intuitive understanding. We will use this knowledge in later chapters, where more complex examples of applying these namespaces await.

When reprinting, please cite the original source: "Write Your Own Docker".
