DockOne WeChat Share (85): Docker Storage Driver Selection Recommendations

"Editor's word" Docker storage provides a specific implementation of the read-write layer that manages layered mirrors and containers. Initially Docker could only run on Ubuntu releases that supported the Aufs file system, but because Aufs failed to join the Linux kernel, Docker was internally through the graphdriver mechanism for compatibility, extensibility, and scalability To implement support for different file systems. This share through a customer implementation case in-depth look at several storage methods Docker, and give some technical selection recommendations.

Docker storage drivers:
    1. Introduction and analysis of AUFS
    2. Introduction and analysis of Device Mapper
    3. Introduction and analysis of OverlayFS
    4. Introduction and analysis of Btrfs
    5. Introduction and analysis of ZFS


Part I: Problem Diagnosis

This issue came up during an implementation project in which we needed to help a customer containerize their applications and publish them on the Dataman Cloud platform. The customer's application was a traditional WAS (WebSphere Application Server) application. It had been deployed manually through the WAS console, so for the time being deployment could not be automated through a Dockerfile, and the final image was produced with docker commit. The image's startup command is startwas.sh, and the application log is sent to standard output via tail. When the container was started, the WAS server failed to start, with the following error log:

The WAS server's standard log files, startServer.log and native_stderr.log, did not contain any more detailed error message. Eventually an error message was found in the configuration directory:

It was a file-access IO exception. Checking the attributes of the corresponding directory and files:

At this point we could make an initial determination that the problem lay in the Docker storage driver's layered management of images and containers.

The current host was CentOS 7.2 with kernel 3.10.0; checking the host information showed Docker 1.12.0 using the overlay storage driver on top of an XFS host filesystem:
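For reference, a minimal sketch of the commands one would typically run to collect this host information (output abbreviated; exact fields vary slightly by Docker version):

# cat /etc/redhat-release
# uname -r
# docker version --format '{{.Server.Version}}'
# docker info | grep -iA2 'storage driver'     # shows the driver (overlay) and the backing filesystem (xfs)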

To verify this inference, we tried the following:

Try 1: Mount the entire WAS home directory into the container as a data volume. (A data volume is a directory or file on the Docker host that is mounted into the container and is not managed by the storage driver.) After rebuilding the image and starting the container this way, the WAS server started normally.
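For illustration only (the mount path, image name, and tag below are hypothetical, not the customer's actual values; only startwas.sh comes from the case above), such a data-volume mount might look like:

# docker run -d -v /opt/was/washome:/opt/was/washome was-app:v1 /opt/was/washome/startwas.sh    # -v bypasses the storage driver for that path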

Try 2: Change the Docker engine's storage driver to devicemapper, pull the image again, and start the container; the WAS server then started normally.
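A minimal sketch of how the storage driver can be switched on a systemd-managed host (existing images are not migrated between drivers, which is why the image had to be pulled again):

# systemctl stop docker
# cat /etc/docker/daemon.json
{
  "storage-driver": "devicemapper"
}
# systemctl start docker
# docker info | grep -i 'storage driver'      # should now report devicemapper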

So, is this a common problem?

Try 3: On another host, start a container from the original image; the problem could not be reproduced.

After several more tests we found that some machines with the same kernel, OS version, and Docker version had the problem while others did not, and eventually traced it to a compatibility issue between the filesystem shipped with CentOS and overlay. We also found the corresponding issue report in the Docker community: https://github.com/docker/docker/issues/9572 ; the kernel-side fix landed in kernel 4.4.6. To sum up, the root cause of this problem is a compatibility issue between OverlayFS and XFS.
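As a hedged aside (this check was not part of the original troubleshooting): an XFS filesystem created without ftype=1 does not expose d_type information, which is a frequent source of overlay-on-XFS misbehavior, and it can be checked like this:

# xfs_info / | grep ftype       # ftype=0 means d_type is not supported and overlay may misbehave on this filesystem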

That concludes the troubleshooting story. Now let's take a closer look at Docker's storage drivers and give some suggestions for technology selection.

Part II: Overview

When Docker launches a container, it needs a filesystem to provide the mount point for the container's rootfs. The lowest layer is the boot filesystem (bootfs), which mainly contains the bootloader and the kernel; the bootloader's job is to load the kernel, and once the kernel has been loaded into memory, bootfs is unmounted. The rootfs contains the standard directories and files of a typical Linux system, such as /dev, /proc, /bin, and /etc.

The core of the Docker model is the efficient use of layered images: layers can be inherited, and on top of a base image (one with no parent) you can build a variety of specific application images. Docker 1.10 introduced a new content-addressable storage model that uses secure content hashes instead of random UUIDs to identify images, and Docker provides a migration tool to move existing images to the new model. Different Docker containers can share underlying filesystem layers and each adds its own unique read-write layer on top, which greatly improves storage efficiency. The underlying mechanism is the layered model plus the ability to mount different directories into the same virtual filesystem.
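A quick way to see this layering in practice (a minimal sketch; the image name is just an example, and the inspect field name may differ slightly across Docker versions):

# docker history centos                                    # the layers that make up the image and their sizes
# docker inspect --format '{{.RootFS.Layers}}' centos      # content-addressed layer digests introduced in Docker 1.10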

Docker storage drivers provide the concrete implementation of the read-write layer used to manage layered images and containers. Initially Docker could only run on Ubuntu releases that supported the AUFS filesystem, but because AUFS was never merged into the Linux kernel, Docker implemented support for different filesystems through its internal, extensible graphdriver mechanism, for the sake of compatibility and extensibility.

Docker supports several different storage drivers:
    • AUFS
    • Device Mapper
    • Btrfs
    • Overlayfs
    • Zfs


Part III: Solution Analysis

AUFS

AUFS (AnotherUnionFS) is a union filesystem and a file-level storage driver. A union filesystem merges directories at different physical locations into a single directory; put simply, it supports mounting different directories into the same virtual filesystem. Such a filesystem can overlay and modify files layer by layer: no matter how many read-only layers sit underneath, only the topmost layer is writable. When a file needs to be modified, AUFS uses copy-on-write (CoW) to copy the file from a read-only layer up to the writable layer and modify it there, and the result is saved in the writable layer. In Docker, the read-only layers underneath are the image and the writable layer is the container. The structure looks like this:


Example

As an example, run a container whose only job is to delete the file /etc/shadow, then look at what AUFS records:
# docker run centos rm /etc/shadow
# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc

total 8
drwxr-xr-x 2 root root 4096 Sep  2 18:35 .
drwxr-xr-x 5 root root 4096 Sep  2 18:35 ..
-r--r--r-- 2 root root    0 Sep  2 18:35 .wh.shadow

Directory structure

    • Container mount point (only mounted while the container is running)
      /var/lib/docker/aufs/mnt/$CONTAINER_ID/

    • Branches (the files in each layer that differ from the layer below; read-only or read-write)
      /var/lib/docker/aufs/diff/$CONTAINER_OR_IMAGE_ID/

    • Layer index (for each layer, the list of parent layer IDs it is stacked on)
      /var/lib/docker/aufs/layers/


Other

Disk space available to the AUFS filesystem:
# df -h /var/lib/docker/

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  4.0G   15G  22% /


System mount details

Containers currently started by Docker:
# docker ps

CONTAINER ID   IMAGE                        COMMAND                CREATED      STATUS      PORTS                      NAMES
3f2e9de1d9d5   mesos/bamboo:v0.1c           "/usr/bin/bamboo-hap   5 days ago   Up 5 days                              mesos-20150825-162813-1248613158-5050-1-S0.88c909bc-6301-423a-8283-5456435f12d3
dc9a7b000300   mesos/nginx:base             "/bin/sh -c nginx"     7 days ago   Up 7 days   0.0.0.0:31967->80/tcp      mesos-20150825-162813-1248613158-5050-1-S0.42667cb2-1134-4b1a-b11d-3c565d4de418
1b466b5ad049   mesos/marathon:omega.v0.1    "/usr/bin/dataman_ma   7 days ago   Up hours                               dataman-marathon
0a01eb99c9e7   mesos/nginx:base             "/bin/sh -c nginx"     7 days ago   Up 7 days   0.0.0.0:31587->80/tcp      mesos-20150825-162813-1248613158-5050-1-S0.4f525828-1217-4b3d-a169-bc0eb901eef1
c2fb2e8bd482   mesos/dns:v0.1c              "/usr/bin/dataman_me   7 days ago   Up 7 days                              mesos-20150825-162813-1248613158-5050-1-S0.82d500eb-c3f0-4a00-9f7b-767260d1ee9a
df102527214d   mesos/zookeeper:omega.v0.1   "/data/run/dataman_z   8 days ago   Up 8 days                              dataman-zookeeper
b076a43693c1   mesos/slave:omega.v0.1       "/usr/sbin/mesos-sla   8 days ago   Up 8 days                              dataman-slave
e32e9fc9a788   mesos/master:omega.v0.1      "/usr/sbin/mesos-mas   8 days ago   Up 8 days                              dataman-master
c8454c90664e   shadowsocks_server           "/usr/local/bin/ssse   9 days ago   Up 9 days   0.0.0.0:57980->57980/tcp   shadowsocks
6dcd5bd46348   registry:v0.1                "docker-registry"      9 days ago   Up 9 days   0.0.0.0:5000->5000/tcp     dataman-registry

Corresponding system mount points:
# grep aufs /proc/mounts

/dev/mapper/ubuntu--vg-root /var/lib/docker/aufs ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
none /var/lib/docker/aufs/mnt/6dcd5bd463482edf33dc1b0324cf2ba4511c038350e745b195065522edbffb48 aufs rw,relatime,si=d9c018051ec07f56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/c8454c90664e9a2a2abbccbe31a588a1f4a5835b5741a8913df68a9e27783170 aufs rw,relatime,si=d9c018051ba00f56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/e32e9fc9a788e73fc7efc0111d7e02e538830234377d09b54ffc67363b408fca aufs rw,relatime,si=d9c018051b336f56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/b076a43693c1d5899cda7ef8244f3d7bc1d102179bc6f5cd295f2d70307e2c24 aufs rw,relatime,si=d9c018051bfecf56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/df102527214d5886505889b74c07fda5d10b10a4b46c6dab3669dcbf095b4154 aufs rw,relatime,si=d9c01807933e1f56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/c2fb2e8bd4822234633d6fd813bf9b24f9658d8d97319b1180cb119ca5ba654c aufs rw,relatime,si=d9c01806c735ff56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/0a01eb99c9e702ebf82f30ad351d5a5a283326388cd41978cab3f5c5b7528d94 aufs rw,relatime,si=d9c018051bfebf56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/1b466b5ad049d6a1747d837482264e66a87871658c1738dfd8cac80b7ddcf146 aufs rw,relatime,si=d9c018052b2b1f56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/dc9a7b000300a36c170e4e6ce77b5aac1069b2c38f424142045a5ae418164241 aufs rw,relatime,si=d9c01806d9ddff56,dio,dirperm1 0 0
none /var/lib/docker/aufs/mnt/3f2e9de1d9d51919e1b6505fd7d3f11452c5f00f17816b61e6f6e97c6648b1ab aufs rw,relatime,si=d9c01806c708ff56,dio,dirperm1 0 0

Analysis
    1. Although AUFS was the storage driver supported in Docker's first release, it has never been merged into the mainline kernel (so it cannot be used directly on CentOS).
    2. In principle, AUFS's mount() is very fast, so containers are created quickly; read and write access runs at close to native efficiency, sequential and random read/write performance is better than that of KVM, and Docker on AUFS uses storage and memory efficiently.
    3. AUFS performance is stable, and it has a large number of production deployments and extensive community support.
    4. The rename system call is not fully supported, which can cause failures when a "copy and unlink" sequence is performed instead.
    5. When writing large files dynamically (such as logs or databases), or when many directory paths are mounted and the number of branches grows, looking up files becomes slower. (Workaround: mount important data directly with the -v parameter.)


Device Mapper

Device Mapper has been supported since Linux kernel 2.6.9. It provides a mapping framework from logical devices to physical devices, on top of which users can conveniently implement their own storage-resource management strategies. Docker's devicemapper driver uses thin provisioning and snapshots to manage images and containers.

Thin-provisioning snapshots

A snapshot is a feature provided by LVM that creates a virtual snapshot of the origin (the original device) without interrupting service. Thin provisioning is a technique that uses virtualization to reduce the amount of physical storage that has to be deployed up front. A thin-provisioned snapshot combines the two technologies, allowing multiple virtual devices to be mounted on top of the same data volume so that data can be shared. The characteristics of thin-provisioned snapshots are as follows (a minimal LVM sketch follows the list):
    1. Multiple snapshots can be taken of the same origin, saving disk space.
    2. When several snapshots share an origin and a write hits the origin, a CoW operation is triggered; mounting many snapshots therefore does not in itself reduce efficiency.
    3. Thin-provisioned snapshots support recursion: one snapshot can act as the origin of another snapshot, with no depth restriction.
    4. A logical volume created on a snapshot occupies no disk space until an actual write (a CoW snapshot write) occurs.
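The following is a minimal sketch of thin provisioning and snapshots using plain LVM commands (the volume group name vg0 and the sizes are assumptions for illustration):

# lvcreate --size 10G --thin vg0/thinpool                      # create a 10G thin pool in volume group vg0
# lvcreate --virtualsize 2G --thin vg0/thinpool --name base    # thin volume: blocks are allocated only on write
# lvcreate --snapshot vg0/base --name base_snap                # snapshot of the thin volume, also allocated on write
# lvs vg0                                                      # the Data% column shows how little space is actually used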


While AUFS and OverlayFS are file-level storage, Device Mapper is block-level: all operations are performed directly on blocks rather than files. The devicemapper driver first creates a resource pool on a block device and then creates a base device with a filesystem on that pool. All images are snapshots of the base device, and a container is a snapshot of its image. So the filesystem you see inside a container is a snapshot of the base device's filesystem on the pool, and no space is pre-allocated for the container. When a new file is written, a new block is allocated inside the container's snapshot and the data is written there; this is called allocate-on-demand. When an existing file is modified, CoW allocates block space in the container's snapshot, copies the blocks to be modified into the snapshot, and makes the modification there. By default the devicemapper driver creates a 100 GB file that holds all images and containers, and each container is limited to a 10 GB volume; both sizes can be adjusted through configuration. The structure looks like this:

You can get more information from docker info or via dmsetup ls. To view Docker's devicemapper information:
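A minimal sketch of those commands (output depends on the Docker version and on whether loop-lvm or direct-lvm is in use; the thin-pool name in the last command is an assumption):

# docker info | grep -iA12 'storage driver'    # pool name, data/metadata files or devices, space used
# dmsetup ls                                   # device-mapper devices backing the pool
# dmsetup status docker-thinpool               # assumes a thin pool named docker-thinpool exists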


Analysis
    1. Device Mapper has good filesystem compatibility, and because everything is stored in a single file it reduces inode consumption.
    2. Every write from a container goes to a newly allocated block from the pool; what is actually written is a sparse file, so although space utilization is high, performance is not good because of the extra VFS overhead.
    3. Each container has its own block device, which is real disk storage, so when N containers are started the same data is loaded from disk into memory N times, consuming a lot of memory.
    4. Docker's devicemapper default mode is loop-lvm, whose performance does not meet production requirements. For production, direct-lvm mode, which writes directly to a raw block device, is recommended for better performance; a configuration sketch follows this list.
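A minimal sketch of pointing Docker 1.12 at an LVM thin pool for direct-lvm (the volume group vg0 and pool name docker-thinpool are assumptions and must match a thin pool created beforehand, for example with the lvcreate commands shown earlier):

# cat /etc/docker/daemon.json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/vg0-docker--thinpool",
    "dm.use_deferred_removal=true"
  ]
}
# systemctl restart docker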


Overlayfs

Overlay has been supported since Linux kernel 3.18. It is also a union filesystem, but unlike AUFS it has only two layers: an upper filesystem and a lower filesystem, representing Docker's container layer and image layer respectively. When a file needs to be modified, CoW copies it from the read-only lower layer into the writable upper layer for modification, and the result is saved in the upper layer. In Docker, the read-only layer underneath is the image and the writable layer is the container. The structure looks like this:
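To make the two-layer idea concrete, here is a minimal sketch of an overlay mount done by hand, outside of Docker (the directory names are arbitrary):

# mkdir lower upper work merged
# echo "from the image layer" > lower/file
# mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
# echo "modified in the container layer" > merged/file
# cat upper/file       # the CoW copy lands in upper; lower/file is left untouched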


Analysis
    1. It entered the mainline Linux kernel in 3.18. The design is simple and fast: faster than AUFS and Device Mapper, and in some cases faster than Btrfs, making it a strong candidate for Docker's storage future. Because OverlayFS has only two layers rather than many, its "copy-up" operation is faster than AUFS's, which reduces operation latency.
    2. OverlayFS supports page-cache sharing: multiple containers accessing the same file share a single page-cache entry, improving memory efficiency.
    3. OverlayFS consumes inodes, and as images and containers accumulate, inodes can become a bottleneck. Overlay2 solves this problem. Under overlay, to mitigate the inode problem, consider mounting /var/lib/docker on a separate filesystem or increasing the filesystem's inode count (a quick check is sketched after this list).
    4. There are compatibility issues. OverlayFS implements only part of the POSIX standard for open(2), and some of its operations are not POSIX compliant. For example, an application that calls fd1 = open("foo", O_RDONLY) and then fd2 = open("foo", O_RDWR) expects fd1 and fd2 to refer to the same file, but because the copy-up happens after the first open(2), they end up referring to two different files.
    5. The rename system call is not fully supported, which causes failures when a "copy and unlink" sequence is performed instead.
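A minimal sketch of checking inode pressure and switching to overlay2 (overlay2 needs Docker 1.12+ and a sufficiently new kernel; treat the kernel requirement as something to verify for your distribution):

# df -i /var/lib/docker        # an IUse% close to 100% means inode exhaustion
# cat /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}
# systemctl restart docker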


Btrfs

Btrfs is billed as the next-generation copy-on-write filesystem and has been merged into the Linux kernel. It is file-level storage, but like Device Mapper it can operate on the underlying device directly. Btrfs uses subvolumes and snapshots to manage the image and container layers. Btrfs can configure part of the filesystem as a complete sub-filesystem, called a subvolume; a snapshot is a real-time read-write copy of a subvolume; a chunk is the allocation unit, usually 1 GB. With subvolumes, a large filesystem can be divided into sub-filesystems that share the underlying device: space is allocated from the underlying device only when it is needed, much the way an application calls malloc() to allocate memory. To use device space efficiently, Btrfs divides disk space into chunks, and each chunk can use a different allocation policy; for example, some chunks store only metadata and others store only data. This model has several advantages, such as support for dynamically adding devices: after a new disk is added to the system, a Btrfs command can add it to the filesystem. Docker treats a large Btrfs filesystem as a resource pool and configures it into multiple complete sub-filesystems; new sub-filesystems can be added to the pool, the base image is a snapshot of a sub-filesystem, every image layer has its own snapshot, and a container is a snapshot of a subvolume.
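A minimal sketch of the underlying Btrfs operations (the devices /dev/xvdb and /dev/xvdc are assumptions; do not run mkfs on a disk that holds data):

# mkfs.btrfs -f /dev/xvdb
# mount /dev/xvdb /var/lib/docker
# btrfs subvolume create /var/lib/docker/base                                # a sub-filesystem sharing the pool's space
# btrfs subvolume snapshot /var/lib/docker/base /var/lib/docker/base_snap   # writable snapshot of the subvolume
# btrfs device add /dev/xvdc /var/lib/docker                                # dynamically add another disk to the pool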


Analysis
    1. Btrfs is positioned as a next-generation filesystem that could replace Device Mapper. Many of its features are still in development and it is not yet considered fully production-stable; its technical advantages over ext4 and other mature filesystems include rich features such as subvolumes, snapshots, built-in compression, and built-in RAID support.
    2. Page-cache sharing is not supported: N containers accessing the same file cache it N times, so it is not suitable for high-density container scenarios.
    3. Current Btrfs versions suffer from "small writes", which causes performance issues, and you need to use the native command btrfs filesystem show instead of df to see accurate space usage.
    4. Btrfs uses journaling when writing data to disk, which affects sequential-write performance.
    5. Btrfs filesystems fragment over time, causing performance problems. With current Btrfs versions you can pass the autodefrag mount option so random writes are detected and defragmented (see the sketch after this list).
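A minimal sketch of the mitigations mentioned in items 3 and 5 (the device path is an assumption):

# mount -o remount,autodefrag /dev/xvdb /var/lib/docker    # detect random writes and defragment in the background
# btrfs filesystem show /var/lib/docker                    # accurate space usage, instead of df
# btrfs filesystem defragment -r /var/lib/docker           # one-off recursive defragmentation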


Zfs

The ZFS filesystem is a revolutionary new filesystem that fundamentally changes how filesystems are managed. ZFS discards "volume management" entirely: instead of creating virtual volumes, it aggregates all devices into a storage pool and manages physical storage through this "storage pool" concept. Historically, filesystems were built on top of single physical devices; "volume management" was introduced to manage multiple devices and provide data redundancy by presenting the image of a single device. ZFS instead builds on top of virtual storage pools called zpools. Each pool is composed of a number of virtual devices (vdevs), which can be raw disks, RAID-1 mirrors, or multi-disk groups at non-standard RAID levels, and the filesystems on a zpool can use the total capacity of all these virtual devices. Docker's ZFS driver uses snapshots and clones: both are real-time copies in ZFS, snapshots are read-only, clones are read-write, and clones are created from snapshots.

Let's look at how ZFS is used in Docker. First, a ZFS filesystem is allocated from the zpool for the base layer of the image; each additional image layer is a clone of a snapshot of the layer below it (the snapshot is read-only, the clone is writable); and when the container starts, a writable layer is created as a clone on top of the image's topmost layer. As shown below:
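A minimal sketch of setting up ZFS for Docker (the pool name zpool-docker and the device /dev/xvdb are assumptions; ZFS on Linux must already be installed):

# zpool create -f zpool-docker /dev/xvdb
# zfs create -o mountpoint=/var/lib/docker zpool-docker/docker
# cat /etc/docker/daemon.json
{
  "storage-driver": "zfs"
}
# systemctl restart docker
# zfs list -t all | head        # the driver creates datasets, snapshots and clones under zpool-docker/docker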


Analysis
    1. Like Btrfs, ZFS is a next-generation filesystem. The ZFS on Linux (ZoL) port is maturing, but using Docker's ZFS driver in production is not recommended unless you already have experience operating ZFS filesystems.
    2. Beware of ZFS's memory consumption, because ZFS was originally designed for Sun Solaris servers with large amounts of memory.
    3. ZFS's "deduplication" feature should be turned off because it consumes a lot of memory; if you use SAN, NAS, or other disk RAID technology, you may keep using it (a tuning sketch follows this list).
    4. ZFS's caching is well suited to high-density container scenarios.
    5. ZFS's 128K block writes, intent log, and delayed (asynchronous) writes reduce fragmentation.
    6. Prefer the native Linux ZFS driver over the ZFS FUSE implementation.
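A minimal sketch of the tuning mentioned in items 2 and 3 (the pool name and the 2 GiB ARC cap are assumptions; size the cap to your hardware):

# zfs set dedup=off zpool-docker                              # item 3: keep deduplication off
# echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max    # item 2: cap the ZoL ARC at roughly 2 GiB
# arcstat 1 3                                                 # observe ARC size and hit rate, if the arcstat tool is installed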


Part IV: Summary

Finally, here is a list of the advantages and disadvantages of Docker's various storage drivers:

The above is an introduction to and analysis of the five Docker storage drivers; use it as a theoretical basis for choosing the driver that fits your own situation. You can also run validation tests, such as I/O performance tests, against your own application scenarios before deciding. Two further points are worth making:
    1. Use SSD (solid-state drive) storage to improve performance.
    2. Consider data-volume mounts to improve performance.


The above content was organized from the group share on the evening of September 29, 2016. The presenter, Fan Bin, is an architect at Dataman Cloud. He previously worked for many years at the HP CMS R&D Center and in the BEA WebLogic group, accumulating rich R&D experience in Java, J2EE, SOA, and enterprise applications. He has researched Docker and Mesos and is familiar with and enthusiastic about cloud computing, distributed systems, and related technologies. DockOne organizes weekly technology shares; if you are interested, add WeChat: liyingjiesz to join the group, and leave a message with topics you would like to hear or share.