Docker Basic Technology: Linux Namespace (Part 2)

Source: Internet
Author: User

In the previous article, Docker Basic Technology: Linux Namespace (Part 1), we covered four namespaces (UTS, IPC, PID and Mount) and imitated Docker to build a rather crude knock-off image. In this article I would like to introduce the Linux User and Network namespaces.

User Namespace

A user namespace is created with the CLONE_NEWUSER flag. Once this flag is used, the UID and GID seen inside the namespace differ from those outside. By default, when no mapping has been set, the container does not have a UID of its own and the system automatically uses UID 65534; this default value is defined in "/proc/sys/kernel/overflowuid".
To map a UID in the container to a UID on the real system, you need to write to the two files /proc/$$/uid_map and /proc/$$/gid_map. The format of these two files is:

    ID-inside-ns ID-outside-ns length

Where:
The first field, ID-inside-ns, is the UID or GID shown inside the container.
The second field, ID-outside-ns, is the real UID or GID outside the container that it maps to.
The third field, length, is the size of the mapped range. It is usually 1, meaning a one-to-one mapping.

Example: mapping the real uid=1000 to uid=0 inside a container

    $ cat /proc/2465/uid_map
    0 1000 1

Example: mapping the whole UID range. UIDs inside the namespace, starting at 0, map to external UIDs starting at 0, over the maximum range of an unsigned 32-bit integer:

    $ cat /proc/$$/uid_map
    0 0 4294967295
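To make the three fields concrete, here is a minimal sketch in C that parses one map line and translates an ID seen inside the namespace into the ID outside it. The names struct id_map, parse_id_map and map_inside_to_outside are hypothetical helpers made up for this illustration; they are not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>

/* One line of /proc/PID/uid_map or gid_map:
 * "ID-inside-ns ID-outside-ns length". (Hypothetical helper type.) */
struct id_map {
    unsigned long inside;   /* first ID inside the namespace */
    unsigned long outside;  /* first ID outside the namespace */
    unsigned long length;   /* number of consecutive IDs mapped */
};

/* Parse one map line; returns 0 on success, -1 if the line is malformed. */
int parse_id_map(const char *line, struct id_map *m)
{
    assert(line != NULL && m != NULL);
    if (sscanf(line, "%lu %lu %lu", &m->inside, &m->outside, &m->length) != 3)
        return -1;
    return 0;
}

/* Translate an ID seen inside the namespace to the outside ID.
 * Returns 0 and stores the result in *out, or -1 if the ID is not
 * covered by this mapping line. */
int map_inside_to_outside(const struct id_map *m, unsigned long id,
                          unsigned long *out)
{
    if (id < m->inside || id - m->inside >= m->length)
        return -1;
    *out = m->outside + (id - m->inside);
    return 0;
}
```

With the line "0 1000 1", inside ID 0 translates to outside ID 1000, and inside ID 1 is not covered by the mapping at all.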

Note the following:

The process that writes these two files must have the CAP_SETUID (CAP_SETGID) capability in this user namespace (see capabilities), and it must be in either this user namespace or its parent user namespace.

In addition, one of the following conditions must hold:

1) The parent process maps its own effective uid/gid into the child process's user namespace.

2) The parent process has the CAP_SETUID/CAP_SETGID capability, in which case it can map any uid/gid of the parent namespace.

Some other rules apply as well.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/mount.h>
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static char container_stack[STACK_SIZE];
    char* const container_args[] = {
        "/bin/bash",
        NULL
    };

    int pipefd[2];

    /* Write one mapping line "inside_id outside_id len" to a map file. */
    void set_map(char* file, int inside_id, int outside_id, int len) {
        FILE* mapfd = fopen(file, "w");
        if (NULL == mapfd) {
            perror("open file error");
            return;
        }
        fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
        fclose(mapfd);
    }

    void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
        char file[256];
        sprintf(file, "/proc/%d/uid_map", pid);
        set_map(file, inside_id, outside_id, len);
    }

    void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
        char file[256];
        /* Since Linux 3.19 an unprivileged process must write "deny" to
           /proc/PID/setgroups before it may write gid_map. */
        sprintf(file, "/proc/%d/setgroups", pid);
        FILE* fp = fopen(file, "w");
        if (fp != NULL) {
            fprintf(fp, "deny");
            fclose(fp);
        }
        sprintf(file, "/proc/%d/gid_map", pid);
        set_map(file, inside_id, outside_id, len);
    }

    int container_main(void* arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        printf("Container: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

        /* Wait for the parent's notification before going on
           (synchronization between the processes). */
        char ch;
        close(pipefd[1]);
        read(pipefd[0], &ch, 1);

        printf("Container [%5d] - setup hostname!\n", getpid());
        /* set hostname */
        sethostname("container", 9);

        /* remount "/proc" to make sure "top" and "ps" show the container's information */
        mount("proc", "/proc", "proc", 0, NULL);

        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main()
    {
        const int gid = getgid(), uid = getuid();

        printf("Parent: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

        pipe(pipefd);

        printf("Parent [%5d] - start a container!\n", getpid());

        int container_pid = clone(container_main, container_stack + STACK_SIZE,
                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD, NULL);

        printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

        /* To map the uid/gid, we need to edit /proc/PID/uid_map
           (or /proc/PID/gid_map) in the parent.
           The file format is:
               ID-inside-ns   ID-outside-ns   length
           If there is no mapping,
               the uid will be taken from /proc/sys/kernel/overflowuid
               the gid will be taken from /proc/sys/kernel/overflowgid */
        set_uid_map(container_pid, 0, uid, 1);
        set_gid_map(container_pid, 0, gid, 1);

        printf("Parent [%5d] - user/group mapping done!\n", getpid());

        /* Notify the child process */
        close(pipefd[1]);

        waitpid(container_pid, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

In the program above we use a pipe to synchronize the parent and child processes. Why? Because the child process calls execv, and this system call replaces the process image of the child. The user namespace's uid/gid mapping has to be in place before execv runs, so that the /bin/bash started by execv shows the # prompt, because we mapped the inside UID to 0.
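The pipe handshake can be isolated into a small sketch. This version assumes plain fork() instead of clone() so that it runs without any namespace support, and run_synchronized_child is a hypothetical name chosen for the illustration. The child blocks in read() until the parent closes its write end, which is the point where the real program goes on to call execv.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that waits on a pipe until the parent has finished its
 * setup. Returns the child's exit status (0 if the child was released by
 * end-of-file on the pipe), or -1 on error.
 */
int run_synchronized_child(void)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (pid == 0) {
        char ch;
        /* Close our copy of the write end, otherwise read() never sees EOF. */
        close(pipefd[1]);
        /* Blocks here until the parent closes its write end. */
        ssize_t n = read(pipefd[0], &ch, 1);
        /* The real program would call execv() at this point. */
        _exit(n == 0 ? 0 : 1);
    }

    assert(pid > 0);            /* only the parent reaches this point */
    close(pipefd[0]);
    /* Parent-side setup goes here (for example writing uid_map/gid_map). */
    close(pipefd[1]);           /* release the child */

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Closing the write end rather than writing a byte mirrors the article's program: the child sees end-of-file only after every write end is closed, which is why the child must close its own copy first.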

The whole program runs as follows:

    $ id
    uid=1000(hchen) gid=1000(hchen) groups=1000(hchen)

    $ ./user    #<-- run as the user hchen
    Parent: eUID = 1000;  eGID = 1000, UID=1000, GID=1000
    Parent [ 3262] - start a container!
    Parent [ 3262] - Container [ 3263]!
    Parent [ 3262] - user/group mapping done!
    Container [    1] - inside the container!
    Container: eUID = 0;  eGID = 0, UID=0, GID=0    #<-- the UID/GID inside the container are 0
    Container [    1] - setup hostname!

    # id    #<-- the user in the container, and the command-line prompt, are now root
    uid=0(root) gid=0(root) groups=0(root),65534(nogroup)

Note that a user namespace can be created by an ordinary user, whereas the other namespaces require root privileges. So what if we want to use several namespaces at the same time? In general, we first create the user namespace as an ordinary user, then map this user to root inside it, and create the other namespaces in the container as that root. This can improve the security of the container.
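That order of operations can be sketched in C as follows. This is only an illustration under the assumption that the kernel allows unprivileged user namespaces (some distributions and container sandboxes disable them); write_file and become_root_in_userns are hypothetical helper names. The child creates a user namespace, maps itself to root, and then uses that root to create a UTS namespace and change the hostname.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a whole string to an existing file with one write(); 0 on success. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;
    ssize_t len = (ssize_t)strlen(text);
    ssize_t n = write(fd, text, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/*
 * Returns 0 if the child became uid 0 inside a new user namespace and could
 * then create a UTS namespace, 1 if the uid is unexpectedly not 0, and
 * 2 if user namespaces are unavailable or any step was refused.
 */
int become_root_in_userns(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return 2;

    if (pid == 0) {
        char map[64];
        uid_t uid = geteuid();      /* IDs as seen before unshare() */
        gid_t gid = getegid();

        if (unshare(CLONE_NEWUSER) == -1)
            _exit(2);

        /* Needed since Linux 3.19 before an unprivileged gid_map write;
         * the file is absent on older kernels, so a failure is ignored. */
        (void)write_file("/proc/self/setgroups", "deny");

        snprintf(map, sizeof(map), "0 %lu 1", (unsigned long)uid);
        if (write_file("/proc/self/uid_map", map) == -1)
            _exit(2);
        snprintf(map, sizeof(map), "0 %lu 1", (unsigned long)gid);
        if (write_file("/proc/self/gid_map", map) == -1)
            _exit(2);

        /* Now root inside the user namespace: create another namespace. */
        if (unshare(CLONE_NEWUTS) == -1 || sethostname("container", 9) == -1)
            _exit(2);

        _exit(getuid() == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return 2;
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    return WIFEXITED(status) ? WEXITSTATUS(status) : 2;
}
```

The hostname change only affects the child's new UTS namespace, so the host's hostname is untouched. A return value of 2 usually just means the environment forbids unprivileged user namespaces.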

Network Namespace

Under Linux we generally use the ip command to create a network namespace. Docker's source code, however, does not use the ip command; instead it uses a raw socket to send some "strange" data. Here I will use the ip command for the analysis.

Docker Network Analysis

First, let's look at a diagram of what the Docker network basically looks like on the host.

Docker may use one of three private network segments: 172.40.1.0, 10.0.0.0 and 192.168.0.0. If your environment already uses all three of these segments, Docker fails to start with an error. After you launch a Docker container, you can use ip link show or ip addr show to look at the host's current network (we can see a docker0 bridge, and a veth22a38e6 virtual NIC created for the container):

    $ ip link show
    1: lo: ... mtu 65536 qdisc noqueue state ...
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: eth0: ... mtu ... qdisc ...
        link/ether 00:0c:29:b7:67:7d brd ff:ff:ff:ff:ff:ff
    3: docker0: ... mtu ...
        link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    5: veth22a38e6: ... mtu ... qdisc ...
        link/ether 8e:30:2a:ac:8c:d1 brd ff:ff:ff:ff:ff:ff

So how do we build something like this ourselves? Let's look at a set of commands:

    ## First, add a bridge lxcbr0, imitating docker0
    brctl addbr lxcbr0
    brctl stp lxcbr0 off
    ifconfig lxcbr0 192.168.10.1/24 up    # set the IP address of the bridge

    ## Next, create a network namespace, ns1

    # Add a namespace named ns1 (using the ip netns add command)
    ip netns add ns1

    # Bring up loopback, i.e. 127.0.0.1, in the namespace
    # (use ip netns exec ns1 to run commands inside ns1)
    ip netns exec ns1 ip link set dev lo up

    ## Then, add a pair of virtual NICs

    # Add a pair of virtual NICs; note the veth type. One of the NICs
    # will be pushed into the container
    ip link add veth-ns1 type veth peer name lxcbr0.1

    # Move veth-ns1 into namespace ns1, so the container gets a new NIC
    ip link set veth-ns1 netns ns1

    # Rename veth-ns1 inside the container to eth0
    # (the name would conflict outside the container, but not inside)
    ip netns exec ns1 ip link set dev veth-ns1 name eth0

    # Assign an IP address to the NIC in the container and bring it up
    ip netns exec ns1 ifconfig eth0 192.168.10.11/24 up

    # Above we put veth-ns1 into the container; now attach lxcbr0.1 to the bridge
    brctl addif lxcbr0 lxcbr0.1

    # Add a routing rule to the container so it can reach the outside network
    ip netns exec ns1 ip route add default via 192.168.10.1

    # Create a directory named after the network namespace, ns1, under /etc/netns,
    # then set up resolv.conf for this namespace so that the container can
    # resolve domain names
    mkdir -p /etc/netns/ns1
    echo "nameserver 8.8.8.8" > /etc/netns/ns1/resolv.conf

The above is basically how the Docker network works, except that Docker does not handle resolv.conf this way; it uses the Mount namespace instead. In addition, Docker uses the PID of the process as the name of the network namespace.

Adding a new NIC to a running Docker container:

    ip link add peera type veth peer name peerb
    brctl addif docker0 peera
    ip link set peera up
    ip link set peerb netns ${container-pid}
    ip netns exec ${container-pid} ip link set dev peerb name eth1
    ip netns exec ${container-pid} ip link set eth1 up
    ip netns exec ${container-pid} ip addr add ${routeable_ip} dev eth1

In the example above we add an eth1 NIC to a running Docker container and give it a static IP address that can be reached from outside.

This requires putting the external "physical NIC" into promiscuous mode. The eth1 NIC then announces its own MAC address through ARP, and the external switch forwards packets for that IP address to the "physical NIC". Because the physical NIC is in promiscuous mode, eth1 can receive the data, sees that the packets are addressed to it, and accepts them. In this way the Docker container's network is connected to the outside.

Of course, both Docker's NAT and promiscuous mode have performance problems. NAT, needless to say, has a forwarding overhead. In promiscuous mode, all the load received by the physical NIC is passed on to every virtual NIC, so even a NIC with no traffic of its own is affected by the traffic of the other NICs.

Neither of these approaches is perfect. The real solution to this network problem requires VLAN technology, and Google has contributed an IPVLAN driver to the Linux kernel, which is basically tailor-made for Docker.

Namespace Files

First we run the pid.mnt program from the previous article (the program that mounts proc in a PID namespace) and leave it running without exiting.

    $ sudo ./pid.mnt
    [sudo] password for hchen:
    Parent [ 4599] - start a container!
    Container [    1] - inside the container!

Then, in another shell, we look at the PIDs of the parent and child processes:

    $ pstree -p 4599
    pid.mnt(4599)───bash(4600)

We can look under /proc (/proc/$$/ns) to see the ID of each namespace a process belongs to (this requires kernel version 3.8 or later).

The parent process:

    $ sudo ls -l /proc/4599/ns
    total 0
    lrwxrwxrwx 1 root root 0 April  7 22:01 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 root root 0 April  7 22:01 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 root root 0 April  7 22:01 net -> net:[4026531956]
    lrwxrwxrwx 1 root root 0 April  7 22:01 pid -> pid:[4026531836]
    lrwxrwxrwx 1 root root 0 April  7 22:01 user -> user:[4026531837]
    lrwxrwxrwx 1 root root 0 April  7 22:01 uts -> uts:[4026531838]

The child process:

    $ sudo ls -l /proc/4600/ns
    total 0
    lrwxrwxrwx 1 root root 0 April  7 22:01 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 root root 0 April  7 22:01 mnt -> mnt:[4026532520]
    lrwxrwxrwx 1 root root 0 April  7 22:01 net -> net:[4026531956]
    lrwxrwxrwx 1 root root 0 April  7 22:01 pid -> pid:[4026532522]
    lrwxrwxrwx 1 root root 0 April  7 22:01 user -> user:[4026531837]
    lrwxrwxrwx 1 root root 0 April  7 22:01 uts -> uts:[4026532521]

We can see that ipc, net and user have the same IDs in both processes, while mnt, pid and uts differ. If two processes point to the same namespace number, they are in the same namespace; otherwise they are in different namespaces. Once one of these files has been opened, the namespace persists for as long as the fd stays open, even if all the processes in that namespace have ended. For example, we can hold on to this namespace with: mount --bind /proc/4600/ns/uts ~/uts
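The same comparison can be done from a program by reading the symbolic links under /proc/PID/ns. The following is a minimal sketch in C; read_ns_id and same_namespace are hypothetical helper names, and reading another user's process may require root, as the sudo in the listings above suggests.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the namespace identifier (for example "uts:[4026531838]") of
 * process pid. type is one of the names under /proc/PID/ns, such as
 * "uts" or "pid". Returns 0 on success, -1 on failure.
 */
int read_ns_id(pid_t pid, const char *type, char *buf, size_t size)
{
    char path[64];

    assert(size > 0);
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, type);
    ssize_t n = readlink(path, buf, size - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}

/* Returns 1 if both processes share the namespace, 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char ida[64], idb[64];

    if (read_ns_id(a, type, ida, sizeof(ida)) == -1 ||
        read_ns_id(b, type, idb, sizeof(idb)) == -1)
        return -1;
    return strcmp(ida, idb) == 0;
}
```

For the two processes listed above, same_namespace(4599, 4600, "ipc") would report a shared namespace, while same_namespace(4599, 4600, "uts") would not.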

In addition, the previous article mentioned the setns system call. Its declaration is as follows:

    int setns(int fd, int nstype);

The first parameter is an fd: the file descriptor returned by calling open() on one of the files above, for example:

    fd = open("/proc/4600/ns/uts", O_RDONLY);  // get the namespace file descriptor
    setns(fd, 0);                              // join the namespace

Originally from: http://www.linuxprobe.com/docker-linux-namespace-2.html
