For Mushroom Street, the annual 11.11 shopping festival has become the biggest test of the year: a test of system stability, disaster tolerance, emergency fault handling, operations, and other capabilities. The Mushroom Street private cloud platform was built from scratch, has been through nearly a year of development, and has survived three major promotions in production, so its stability has been preliminarily verified. In this article, I will discuss the Mushroom Street private cloud platform from the perspectives of architecture, technology selection, and application.
The Mushroom Street private cloud platform (hereinafter "Mushroom Street private cloud") is the basic platform that Mushroom Street provides to its internal upper-layer businesses. By turning infrastructure into services and platforms, upper-layer businesses can focus on the business itself rather than on differences in the underlying runtime environment. It provides IaaS/PaaS cloud services to the upper layers through a Docker CaaS layer and a KVM IaaS layer, improving the utilization of physical resources and the efficiency of business deployment and delivery, and making it easier to split application architectures into microservices.
In terms of architecture selection, we felt that Docker's light weight, second-level startup, standardized packaging/deployment/runtime scheme, fast image distribution, and image-based grayscale release were all well suited to our application scenarios. Docker's own cluster management capability was immature at the time, so we did not choose Swarm; instead we used the industry's most mature OpenStack, which manages both Docker containers and KVM virtual machines. Relatively speaking, Docker suits stateless, distributed services, while KVM suits services with higher security and isolation requirements.
Upper-layer businesses do not need to care whether they run in a container or in a KVM virtual machine. The future direction is application microservitization: splitting upper-layer businesses into microservices, which then dock with the PaaS layer for container-based deployment and grayscale release.
Technical architecture
Before introducing the preparations for Double 11, let me briefly describe the technical architecture of the Mushroom Street private cloud.
We use an OpenStack + novadocker + Docker architecture. novadocker is an open-source project on StackForge; as a Nova plugin, it controls containers' start and stop actions by calling the Docker RESTful API. Each container is a so-called "fat container": it has its own IP address, and supervisord manages the child processes inside the container, such as sshd and monitoring agents.
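As a minimal sketch, a fat-container supervisord setup might look like the following; the program names and paths are illustrative assumptions, not a production configuration:

```bash
# supervisord runs as the container's main process and keeps sshd and a
# monitoring agent alive; both names below are illustrative.
cat > /etc/supervisord.conf <<'EOF'
[supervisord]
nodaemon=true

[program:sshd]
command=/usr/sbin/sshd -D
autorestart=true

[program:monitor-agent]
command=/usr/local/bin/monitor-agent
autorestart=true
EOF
exec /usr/bin/supervisord -c /etc/supervisord.conf
```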
On top of IaaS, we built orchestration and scheduling for the PaaS layer, implementing elastic scaling and grayscale upgrades and supporting several scheduling policies. We implemented continuous integration (CI) with Docker and Jenkins: when a git push happens in a project's Git repository, it triggers a Jenkins job to build automatically; if the build succeeds, a Docker image is generated and pushed to the image registry. Based on the CI-generated Docker image, instances in the test environment, and eventually the production environment, can be updated through the PaaS API or web interface, achieving continuous integration and continuous delivery.
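A Jenkins build step of this kind might look like the sketch below; the registry address and the PaaS endpoint are illustrative placeholders, not real services:

```bash
# Build and publish an image tagged with the commit that triggered the job.
docker build -t registry.example.com/shop/app:${GIT_COMMIT} .
docker push registry.example.com/shop/app:${GIT_COMMIT}

# A follow-up call to a (hypothetical) PaaS API rolls the image out to
# test instances.
curl -X POST "https://paas.example.com/api/apps/app/deploy" \
     -d "image=registry.example.com/shop/app:${GIT_COMMIT}"
```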
On the network side, we did not adopt Docker's default NAT mode, because NAT incurs a certain performance loss. Through OpenStack we support Linux bridge and Open vSwitch; with no need for iptables, Docker network performance reaches about 95% of the physical machine's.
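As a hand-wired sketch of what bridged (non-NAT) container networking amounts to (OpenStack automates the equivalent; the image name, interface names, and address below are illustrative, and bridge br0 is assumed to exist):

```bash
# Start the container with no Docker-managed network, then wire in a veth pair.
CID=$(docker run -d --net=none myimage)
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")

ip link add veth-host type veth peer name veth-cont
brctl addif br0 veth-host              # attach the host end to the bridge
ip link set veth-host up
ip link set veth-cont netns "$PID"     # move the container end into its netns
nsenter -t "$PID" -n ip addr add 10.0.0.10/24 dev veth-cont
nsenter -t "$PID" -n ip link set veth-cont up
```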
Preparing for stability
For Double 11, the most important thing is of course to guarantee stability. Through nearly a year of productization and practical use, we accumulated a wealth of experience in improving stability.
Problems that have already been encountered must be promptly solved or worked around by whatever means necessary.
For example, CentOS 6.5's support for network namespaces is poor: creating a Linux bridge inside a Docker container could crash the kernel. Upstream fixed this bug in 2.6.32-504, so the kernel version of the online cluster had to be upgraded to 2.6.32-504 or above.
Another example: CentOS 6.5's device mapper had a dm-thin discard bug that caused random kernel crashes. We discovered and solved this problem as early as April; the fix is to disable discard support by adding "--storage-opt dm.mountopt=nodiscard --storage-opt dm.blkdiscard=false" to the Docker configuration, and to strictly prohibit over-allocating disk space, because once the device mapper pool can no longer allocate space, the entire file system becomes read-only, which causes serious problems.
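The corresponding daemon invocation looks roughly like this (Docker 1.x syntax):

```bash
# Disable discard on the devicemapper storage driver, per the flags above.
docker -d --storage-driver=devicemapper \
       --storage-opt dm.mountopt=nodiscard \
       --storage-opt dm.blkdiscard=false
```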
Monitoring
In the run-up to Double 11, we focused on monitoring of containers.
Before that, we had developed a set of container tools with two main functions. First, they compute a load value at container granularity and can rate-limit a container's QPS based on that load value. Second, they replace commands such as top, free, iostat, and uptime, so that operators running these common commands inside a container see the container's values rather than the whole physical machine's. After Double 11 we will also port LXCFS to our platform.
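As a minimal sketch, such container-granularity values can be read straight from cgroups (cgroup v1 paths with the cgroupfs driver; adjust the mount point for your distribution). This is what a container-aware replacement for free/top must read instead of the host-wide /proc:

```bash
CID=$(docker ps -q --no-trunc | head -1)   # any full container id
cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes
cat /sys/fs/cgroup/cpuacct/docker/$CID/cpuacct.usage   # cumulative CPU time (ns)
```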
On the host, we added multi-dimensional threshold monitoring and alerting, including liveness/semantic monitoring of key processes, kernel log monitoring, real-time monitoring of PID counts, monitoring of the network connection-tracking count, container OOM alerts, and so on.
Real-time PID count monitoring
Why monitor the real-time number of PIDs? Because the current Linux kernel's isolation of PIDs is imperfect: no Linux distribution restricts pid_max at container granularity.
There was a real case: a user's program had a bug that created threads without reclaiming them in time, generating a huge number of threads inside the container. In the end, the host could neither execute commands nor be logged into via SSH, and the reported error was "bash: fork: Cannot allocate memory", even though free memory was plentiful at the time.
Why? The root cause is that pid_max (/proc/sys/kernel/pid_max) is shared globally in the kernel. Once the number of PIDs in one container reaches 32768, the host and all other containers can no longer create new processes. The latest 4.3-rc1 kernel supports a per-container pid_max limit.
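A per-container PID watchdog can be sketched as follows (cgroup v1; the 80% threshold is an illustrative choice). Note that threads consume PIDs too, so the tasks file is the one to count:

```bash
PID_MAX=$(cat /proc/sys/kernel/pid_max)
THRESHOLD=$((PID_MAX * 80 / 100))
for cg in /sys/fs/cgroup/cpu/docker/*/; do
    n=$(wc -l < "${cg}tasks")              # one line per thread in the container
    [ "$n" -gt "$THRESHOLD" ] && echo "ALERT: ${cg} holds ${n} of ${PID_MAX} PIDs"
done
```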
Memory usage monitoring
It is worth mentioning that we found the memory usage reported by cgroup to be inaccurate, lower than the real value. Because the kernel's memcg cannot reclaim slab cache and does not limit dirty cache, it is hard to estimate a container's actual memory usage. There were cases where OOM occurred when the reported memory usage was only 70-80%. We therefore adjusted the container memory calculation: based on empirical values, 40% of the cache is counted as RSS, which makes the computed usage much more accurate than before.
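The adjusted calculation can be sketched like this, reading memory.stat from cgroup v1 (the 40% factor is the empirical value mentioned above):

```bash
CID=$(docker ps -q --no-trunc | head -1)
STAT=/sys/fs/cgroup/memory/docker/$CID/memory.stat
rss=$(awk '$1=="rss"{print $2}' "$STAT")
cache=$(awk '$1=="cache"{print $2}' "$STAT")
limit=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes)
adjusted=$((rss + cache * 40 / 100))              # rss + 0.4 * cache
echo "adjusted usage: $((adjusted * 100 / limit))% of limit"
```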
Out-of-order logs
Another problem with running Docker: the kernel logs on the host often came out garbled, which meant log monitoring could not match the right keywords to raise alerts.
Analysis showed that this was related to rsyslogd running in both the host and the Docker containers. The kernel has only one log_buf buffer, and all printk output goes into this buffer first; rsyslogd on the host and in the containers then all consume kernel logs from the same log_buf via syslog, which scrambles the log stream. The problem can be solved by modifying the rsyslog configuration inside the containers so that only the host reads the kernel log.
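On CentOS 6 with rsyslog v5, the container-side change amounts to not loading the kernel-log input module, roughly:

```bash
# Comment out imklog inside the container so only the host consumes log_buf.
sed -i 's/^\$ModLoad imklog/#&/' /etc/rsyslog.conf
service rsyslog restart
```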
Isolation switches
Normally our containers are strictly isolated, covering CPU, memory, disk IO, network IO, and so on. But Double 11 traffic can be ten or even dozens of times the usual volume. We prepared many switches for Double 11: under pressure, we can dynamically expand or shrink an individual container's CPU and memory, and adjust or even lift its disk IOPS limit and network TC rate limit.
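Such switches boil down to writing new values into the container's cgroups and adjusting tc, roughly as follows (the values, device numbers, and interface names are illustrative):

```bash
CID=$(docker ps -q --no-trunc | head -1)
CG=/sys/fs/cgroup
echo 800000 > $CG/cpu/docker/$CID/cpu.cfs_quota_us           # ~8 cores at the default 100ms period
echo $((8 * 1024**3)) > $CG/memory/docker/$CID/memory.limit_in_bytes    # raise to 8 GB
echo "253:0 0" > $CG/blkio/docker/$CID/blkio.throttle.write_iops_device # 0 lifts the IOPS cap
tc qdisc del dev veth-host root                              # drop the TC rate limit entirely
```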
Health monitoring
We also developed periodic health checks that regularly scan for potential risks in production, so that problems really are found and solved early.
Disaster preparedness and emergency fault handling
Besides stability, disaster-preparedness capability is also necessary, and we prepared many contingency plans and supporting techniques. For example, we worked out how to recover data offline from Docker's storage without starting the Docker daemon: the dmsetup create command creates a temporary DM device mapped to the device number used by the Docker instance, and mounting that temporary device recovers the original data.
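A sketch of that recovery flow (the pool name, thin device id, and size in sectors below are illustrative; the real values must be read from the container's metadata under /var/lib/docker/devicemapper/metadata):

```bash
# Create a temporary thin device over the Docker thin pool, then mount it.
dmsetup create rescue --table "0 20971520 thin /dev/mapper/docker-pool 42"
mkdir -p /mnt/rescue
mount /dev/mapper/rescue /mnt/rescue    # the container's filesystem is now readable
```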
We also support cold migration of Docker containers; through the management platform's interface, migration across physical machines can be done with one click.
Integrating with the existing operations system
The Docker cluster must connect seamlessly with the existing operations system in order to respond quickly and truly achieve second-level elastic scaling. We use a unified container management platform to manage multiple Docker clusters; from issuing an instruction to completing container creation takes only 7 seconds.
Performance optimization
We also made many optimizations from the system level down to Docker itself. For disk IO bottlenecks, for example, we tuned kernel parameters such as vm.dirty_expire_centisecs, vm.dirty_writeback_centisecs, and vm.extra_free_kbytes. We also introduced Facebook's open-source flashcache, using SSDs as a cache layer, which significantly improved Docker containers' IO performance.
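The writeback tuning is along these lines; the values shown are illustrative starting points, not tuned production settings (vm.extra_free_kbytes is a non-mainline knob present in some distribution kernels):

```bash
sysctl -w vm.dirty_expire_centisecs=500      # expire dirty pages after 5s
sysctl -w vm.dirty_writeback_centisecs=100   # wake the flusher thread every 1s
sysctl -w vm.extra_free_kbytes=131072        # keep extra free pages in reserve
```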
We also shortened Docker image pull time by merging image layers. During docker pull, verifying each layer takes a long time; with fewer layers, the image is not only smaller, but docker pull time is also greatly reduced.
| Image | Number of layers | File size | docker pull time |
| --- | --- | --- | --- |
| Original image | 13 | 1.051 GB | 2m13s |
| New image | 1 | 674.4 MB | 0m26s |
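One common way to merge layers (whether it was done exactly this way here is an assumption) is export/import, which flattens a container's filesystem into a single-layer image. Note that export/import drops image metadata such as ENV, EXPOSE, and CMD, which must be re-applied if needed:

```bash
docker run --name flat myimage true                 # myimage is illustrative
docker export flat | docker import - myimage:flattened
docker rm flat
```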
Summary
Overall, Double 11 was a year-end exam for the Mushroom Street private cloud, and we prepared for it thoroughly. As the scale of our Docker cluster deployment grows, we still have many technical problems to solve, including the container's own isolation and elastic scheduling across the cluster. At the same time, we are following Docker-related open-source software such as Kubernetes, Mesos, Hyper, CRIU, and runc; in the future we will introduce features such as live migration of containers and hot upgrades of the Docker daemon.