Docker basic technology: Linux Namespace (Part 1)

Source: Internet
Author: User

Overview: Docker is the hottest technology right now. Many people think Docker is a new technology, but apart from being written in the relatively new Go language, it is not really new at all. It is old wine in a new bottle, the so-called new "old stuff". Docker and the tools derived from it build on a number of interesting techniques, and I will use a few articles to introduce them, in the hope that after reading them you can build a knockoff version of Docker yourself. Let's start with Linux namespaces.

Introduction

Linux namespaces are a kernel-level mechanism for isolating environments. You may remember that Unix has long had a system call named chroot, which confines a process to a specific directory by changing its root directory. chroot provides a simple form of isolation: the file system inside the chroot cannot access anything outside it. Building on this idea, Linux namespaces provide isolation for UTS, IPC, mount, PID, network, user, and other resources.

For example, we all know that the ancestor of all processes on Linux has PID 1. So, as with chroot, if we can jail a user's process space into one branch of the process tree, and make the processes below it see the root of that branch as PID 1, we achieve resource isolation: processes in different PID namespaces cannot see each other.

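Linux provides the following kinds of namespaces, each selected by a flag passed to clone():

Namespace   clone() flag
Mount       CLONE_NEWNS
UTS         CLONE_NEWUTS
IPC         CLONE_NEWIPC
PID         CLONE_NEWPID
Network     CLONE_NEWNET
User        CLONE_NEWUSER

Three system calls are mainly involved:

clone(): the system call that creates a new process (threads are built on it as well); passing the flags above isolates the new process in the corresponding namespaces.
unshare(): detaches the calling process from a namespace.
setns(): adds the calling process to an existing namespace.

unshare() and setns() are relatively simple; you can read their man pages yourself, so I will not cover them here.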

Let's look at some examples (the following test programs are best run on Linux kernel 3.8 or later; I use Ubuntu 14.04).

clone() system call

First, let's look at the simplest possible example of the clone() system call (our later programs will all be modifications of this one):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

/* Define a stack for clone, stack size 1M */
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

char* const container_args[] = {
    "/bin/bash",
    NULL
};

int container_main(void* arg)
{
    printf("Container - inside the container!\n");
    /* Execute a shell directly so we can see whether the resources in the process space are isolated */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent - start a container!\n");
    /* Call clone with a function and a stack (we pass the end of the stack because the stack grows downward) */
    int container_pid = clone(container_main, container_stack+STACK_SIZE, SIGCHLD, NULL);
    /* Wait for the child process to end */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

From the above program we can see that clone() is used in much the same way as pthread. However, in this program the parent and child processes share the same view of system resources: whatever the parent process can access, the child process can access too.

Below, let's look at a few examples of what Linux namespaces do.

UTS Namespace

In the following code I have omitted the header files and data structure definitions shown above and kept only the most important part.

int container_main(void* arg)
{
    printf("Container - inside the container!\n");
    sethostname("container", 10); /* Set hostname */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent - start a container!\n");
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | SIGCHLD, NULL); /* Enable CLONE_NEWUTS namespace isolation */
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

If you run the above program (it requires root permission), you will find that the hostname of the child process has become container.

$ sudo ./uts
Parent - start a container!
Container - inside the container!
# hostname
container
# uname -n
container

IPC Namespace

IPC stands for inter-process communication, the means by which Unix/Linux processes communicate with each other; it includes shared memory, semaphores, message queues, and other methods. For isolation we therefore also need to isolate IPC, so that only processes in the same namespace can communicate with each other. If you are familiar with how IPC works, you will know that each IPC object has a global ID. Since the ID is global, our namespace needs to isolate it so that processes in other namespaces cannot see it.

To enable IPC isolation, we only need to add the CLONE_NEWIPC flag when calling clone().

int container_pid = clone(container_main, container_stack+STACK_SIZE,
        CLONE_NEWUTS | CLONE_NEWIPC | SIGCHLD, NULL);

First, we create an IPC queue (as shown below, the global queue ID is 0):

$ ipcmk -Q
Message queue id: 0
$ ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0

If we run the program without CLONE_NEWIPC, we will see that this global IPC queue is still visible in the child process.

$ sudo ./uts
Parent - start a container!
Container - inside the container!
# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xd0d56eb2 0          hchen      644        0            0

However, if we run the program with CLONE_NEWIPC, we get the following result:

$ sudo ./ipc
Parent - start a container!
Container - inside the container!
~/linux_namespace# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

We can see that IPC has been isolated.

PID Namespace

We continue by modifying the program above:

int container_main(void* arg)
{
    /* Print the PID of the child process; the output shows that it is 1 */
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 10);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* Enable PID namespace - CLONE_NEWPID */
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

The result is as follows (we can see that the PID of the child process is 1):

$ sudo ./pid
Parent [ 3474] - start a container!
Container [    1] - inside the container!
# echo $$
1

A note on PID 1: in a traditional UNIX system, the process with PID 1 is init, and its status is special. As the ancestor of all processes it has many privileges (such as being able to mask signals), and it also watches over the state of all processes. If a child process is orphaned (its parent exits without waiting for it), init adopts it and is responsible for reclaiming its resources when it ends. So to isolate the process space we first need a process with PID 1, ideally, as with chroot, by making the process inside the container appear as PID 1.

However, we will find that in the shell of the child process, commands such as ps and top still show all processes, which means the isolation is not complete. This is because ps and top read the /proc file system, and since /proc is the same in the parent and the child, these commands display the same things.

Therefore, we also need to isolate the file system.

Mount Namespace

In the following program, we enable the mount namespace and remount the /proc file system in the child process.

int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 10);
    /* Remount the proc file system at /proc */
    system("mount -t proc proc /proc");
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* Enable mount namespace - add the CLONE_NEWNS flag */
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

The result is as follows:

$ sudo ./pid.mnt
Parent [ 3502] - start a container!
Container [    1] - inside the container!
# ps -elf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0       0 -  6917 wait   19:55 pts/2    00:00:00 /bin/bash
0 R root               1  0       0 -  5671 -      19:56 pts/2    00:00:00 ps -elf

Above, we can see only two processes, and the process with PID 1 is our /bin/bash. The /proc directory is also much cleaner:

# ls /proc
1          dma          key-users   net            sysvipc
16         driver       kmsg        pagetypeinfo   timer_list
acpi       execdomains  kpagecount  partitions     timer_stats
asound     fb           kpageflags  sched_debug    tty
buddyinfo  filesystems  loadavg     schedstat      uptime
bus        fs           locks       scsi           version
cgroups    interrupts   mdstat      self           version_signature
cmdline    iomem        meminfo     slabinfo       vmallocinfo
consoles   ioports      misc        softirqs       vmstat
cpuinfo    irq          modules     stat           zoneinfo
crypto     kallsyms     mounts      swaps
devices    kcore        mpt         sys
diskstats  keys         mtrr        sysrq-trigger

Likewise, the top command in the child process shows only two processes.

A few more words here. When a mount namespace is created with CLONE_NEWNS, the child process receives a copy of the parent's mount tree. Mount operations performed in the child's new namespace then affect only its own view of the file system and have no effect on the outside world. This allows for stricter isolation.

You might ask, is there any other file system that we need to mount? Yes.

Docker's Mount Namespace

Below I will demonstrate a knockoff "image" that imitates what Docker does with the mount namespace.

First of all, we need a rootfs. That is, we need to copy the commands we want in our image into a rootfs directory, which we lay out like a Linux root:

~/rootfs$ ls
bin  dev  etc  home  lib  lib64  mnt  opt  proc  root  run  sbin  sys  tmp  usr  var

Then we copy the commands we need into the rootfs/bin directory (the sh command must be copied in, otherwise we cannot chroot):

~/rootfs$ ls ./bin ./usr/bin
./bin:
bash   chown  gzip      less  mount       netstat  rm     tabs  tee      top       tty
cat    cp     hostname  ln    mountpoint  ping     sed    tac   test     touch     umount
chgrp  echo   ip        ls    mv          ps       sh     tail  timeout  tr        uname
chmod  grep   kill      more  nc          pwd      sleep  tar   toe      truncate  which

./usr/bin:
awk  env  groups  head  id  mesg  sort  strace  tail  top  uniq  vi  wc  xargs

Note: you can use the ldd command to find the .so files these commands depend on, and then copy them to the corresponding directories:

~/rootfs/bin$ ldd bash
    linux-vdso.so.1 =>  (0x00007fffd33fc000)
    libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f4bd42c2000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bd40be000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bd3cf8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4bd4504000)

Here are the .so files in my rootfs:

~/rootfs$ ls ./lib64 ./lib/x86_64-linux-gnu/
./lib64:
ld-linux-x86-64.so.2

./lib/x86_64-linux-gnu/:
libacl.so.1      libmemusage.so         libnss_files-2.19.so    libpython3.4m.so.1
libacl.so.1.1.0  libmount.so.1          libnss_files.so.2       libpython3.4m.so.1.0
libattr.so.1     libmount.so.1.1.0      libnss_hesiod-2.19.so   libresolv-2.19.so
libblkid.so.1    libm.so.6              libnss_hesiod.so.2      libresolv.so.2
libc-2.19.so     libncurses.so.5        libnss_nis-2.19.so      libselinux.so.1
libcap.a         libncurses.so.5.9      libnss_nisplus-2.19.so  libtinfo.so.5
libcap.so        libncursesw.so.5       libnss_nisplus.so.2     libtinfo.so.5.9
libcap.so.2      libncursesw.so.5.9     libnss_nis.so.2         libutil-2.19.so
libcap.so.2.24   libnsl-2.19.so         libpcre.so.3            libutil.so.1
libc.so.6        libnsl.so.1            libprocps.so.3          libuuid.so.1
libdl-2.19.so    libnss_compat-2.19.so  libpthread-2.19.so      libz.so.1
libdl.so.2       libnss_compat.so.2     libpthread.so.0
libgpm.so.2      libnss_dns-2.19.so     libpython2.7.so.1
libm-2.19.so     libnss_dns.so.2        libpython2.7.so.1.0

The rootfs also includes some of the configuration files these commands depend on:

~/rootfs$ ls ./etc
bash.bashrc  group  hostname  hosts  ld.so.cache  nsswitch.conf  passwd  profile
resolv.conf  shadow

You may say that some of these configuration files should be set when the container starts rather than hard-coded in the image, for example /etc/hosts, /etc/hostname, and the DNS configuration in /etc/resolv.conf. Fine. Outside rootfs, we create a conf directory and put those files there.

~$ ls ./conf
hostname  hosts  resolv.conf

In this way, our parent process can set up the configuration files the container needs dynamically and then mount them into the container, which makes the configuration of the container's image more flexible.

Finally, here is our program.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    "-l",
    NULL
};

int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());

    /* Set hostname */
    sethostname("container", 10);

    /* Remount "/proc" to make sure "top" and "ps" show the container's information */
    if (mount("proc", "rootfs/proc", "proc", 0, NULL) != 0) {
        perror("proc");
    }
    if (mount("sysfs", "rootfs/sys", "sysfs", 0, NULL) != 0) {
        perror("sys");
    }
    if (mount("none", "rootfs/tmp", "tmpfs", 0, NULL) != 0) {
        perror("tmp");
    }
    if (mount("udev", "rootfs/dev", "devtmpfs", 0, NULL) != 0) {
        perror("dev");
    }
    if (mount("devpts", "rootfs/dev/pts", "devpts", 0, NULL) != 0) {
        perror("dev/pts");
    }
    if (mount("shm", "rootfs/dev/shm", "tmpfs", 0, NULL) != 0) {
        perror("dev/shm");
    }
    if (mount("tmpfs", "rootfs/run", "tmpfs", 0, NULL) != 0) {
        perror("run");
    }
    /*
     * Imitate how Docker mounts configuration files from outside into the container.
     * Look in the /var/lib/docker/containers// directory
     * and you will see the files Docker uses for this.
     */
    if (mount("conf/hosts", "rootfs/etc/hosts", "none", MS_BIND, NULL) != 0 ||
        mount("conf/hostname", "rootfs/etc/hostname", "none", MS_BIND, NULL) != 0 ||
        mount("conf/resolv.conf", "rootfs/etc/resolv.conf", "none", MS_BIND, NULL) != 0) {
        perror("conf");
    }
    /* Imitate the -v, --volume=[] parameter of the docker run command */
    if (mount("/tmp/t1", "rootfs/mnt", "none", MS_BIND, NULL) != 0) {
        perror("mnt");
    }

    /* chroot to isolate the directory */
    if (chdir("./rootfs") != 0 || chroot("./") != 0) {
        perror("chdir/chroot");
    }

    execv(container_args[0], container_args);
    perror("exec");
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

Run the above program with sudo and you will see the following mount information and a so-called "image":

$ sudo ./mount
Parent [ 4517] - start a container!
Container [    1] - inside the container!
/# mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /tmp type tmpfs (rw,relatime)
udev on /dev type devtmpfs (rw,relatime,size=493976k,nr_inodes=123494,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /run type tmpfs (rw,relatime)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/hosts type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/hostname type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/18086e3b-d805-4515-9e91-7efb2fe5c0e2 on /etc/resolv.conf type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/# ls /bin /usr/bin
/bin:
bash   chmod  echo  hostname  less  more        mv       ping  rm   sleep  tail  test     top    truncate  uname
cat    chown  grep  ip        ln    mount       nc       ps    sed  tabs   tar   timeout  touch  tty       which
chgrp  cp     gzip  kill      ls    mountpoint  netstat  pwd   sh   tac    tee   toe      tr     umount

/usr/bin:
awk  env  groups  head  id  mesg  sort  strace  tail  top  uniq  vi  wc  xargs

As for how to build a chroot directory, there is a tool called DebootstrapChroot that you can look up (its documentation is in English).

The rest is for you to play with; I trust your imagination. :)

That is all for today. In Docker basic technology: Linux Namespace (Part 2), I will introduce the User Namespace, the Network Namespace, and other namespace-related topics.

Originally from: http://os.51cto.com/art/201609/517640.htm
