"The editor's words" B station has been concerned about the development of Docker, last year successfully implemented Docker on the core SLB (tengine) cluster. This year we chose Mesos after comparing the various Docker implementations. Combined with CI&CD, the Docker on-line process of the entire business was opened, and fully automatic expansion capacity was realized. This time combined with our implementation of the road, share the key points and difficulties encountered:
- An introduction to our self-developed Docker network plugin;
- CD implementation and optimization on the Bili PaaS platform;
- How fully automatic application scaling is implemented;
- Using Nginx dynamic upstreams together with Docker.
B station has long been following the development of Docker. Last year we successfully ran Docker on the core SLB (Tengine) cluster: small in scale but carrying heavy traffic, and with no CI & CD work involved. As business volume grows, the demand for application scaling keeps increasing; without Docker standardization, however, scaling an application means provisioning more servers, which is heavy manual work. At the same time, to reduce the problems caused by inconsistencies between the test and production environments, we planned to fully dockerize our business services and, together with CI & CD, open up the whole release pipeline and achieve second-level dynamic scaling.
The following describes our implementation journey. The overall architecture diagram is as follows:
Why Choose Mesos?
Kubernetes is heavyweight and covers a very broad range of functionality. What we mainly want is the scheduling function, and Mesos focuses on exactly that while being lightweight and easier to maintain. On the network side, we chose Macvlan.
Docker Network Selection
None of Docker's built-in network modes met our needs:
- Bridge: Docker assigns each container a private IP, and containers communicate through the bridge. For containers on different hosts to communicate, ports must be mapped with iptables; as the number of containers grows, port management becomes chaotic and the iptables rules keep piling up.
- Host: the container uses the host's network stack, so different containers cannot listen on the same port.
- None: Docker assigns no network to the container; it must be configured manually.
Just as we were struggling to pick a Docker networking solution, we found that the then-latest Docker 1.12 provided two additional network drivers: overlay and Macvlan.
- Overlay: packets are encapsulated in UDP on top of the original TCP/IP packets, which makes packet capture and troubleshooting much harder when network problems occur. Moreover, overlay relies on the host CPU to encapsulate and decapsulate the UDP packets, making Docker network performance unstable with serious performance loss, so it is hard to use in production.
- Macvlan: the VLAN is configured on the switch, the corresponding VLAN sub-interface is configured on the host's physical NIC, and the Macvlan driver is specified when creating the Docker network. We benchmarked Docker's Macvlan network: a container on a Macvlan network loses roughly 10-15% performance compared with one on the host network, but the performance is stable with no jitter, which is acceptable. A sketch of creating such a network follows this list.
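For illustration, here is a rough sketch of how such a network might be created with the Docker SDK for Python; the sub-interface name, subnet and gateway are placeholders, and the "consul" IPAM driver is the self-developed plugin described next.

```python
# Hedged sketch: create a macvlan network bound to a VLAN sub-interface.
# Interface, subnet and gateway values are made up for illustration.
import docker
from docker.types import IPAMConfig, IPAMPool

client = docker.from_env()
client.networks.create(
    name="vlan1062",
    driver="macvlan",
    options={"parent": "eth0.1062"},  # host sub-interface carrying VLAN 1062
    ipam=IPAMConfig(
        driver="consul",              # the self-developed IPAM plugin (see below)
        pool_configs=[IPAMPool(subnet="10.106.2.0/24", gateway="10.106.2.1")],
    ),
)
```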
Based on Macvlan, we developed our own Consul-backed IPAM driver plugin.
When Docker creates a Macvlan network, it is pointed at this self-developed Consul IPAM driver, and Consul records which IP addresses are free and which are in use.
The IPAM driver runs on every host and is exposed to Docker through a socket. When Docker creates a container, the IPAM plugin requests a free IP address from Consul; when the container is deleted, the IPAM plugin releases the IP back to Consul. Because the IPAM plugins on all hosts talk to the same Consul, the uniqueness of every container's IP address is guaranteed.
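To make the mechanism concrete, below is a minimal sketch (not our actual plugin) of how such an IPAM driver can answer Docker's address requests from Consul's KV store, using Flask and Consul's HTTP API. The key layout under ipam/free and ipam/used, the listen port and the /24 prefix are assumptions, and the other endpoints a real driver must implement (pool management and so on) are omitted.

```python
# Minimal, hypothetical consul-backed IPAM driver sketch (request path only).
import requests
from flask import Flask, jsonify

CONSUL = "http://127.0.0.1:8500/v1/kv"   # assumed local consul agent
app = Flask(__name__)

@app.route("/Plugin.Activate", methods=["POST"])
def activate():
    # Tell the Docker daemon which plugin interfaces we implement
    return jsonify({"Implements": ["IpamDriver"]})

@app.route("/IpamDriver.RequestAddress", methods=["POST"])
def request_address():
    # Take one IP off the free list and record it as used
    keys = requests.get(CONSUL + "/ipam/free/?keys").json()
    ip = keys[0].rsplit("/", 1)[-1]                       # e.g. "10.106.2.15"
    requests.delete(CONSUL + "/ipam/free/" + ip)
    requests.put(CONSUL + "/ipam/used/" + ip, data="1")
    return jsonify({"Address": ip + "/24", "Data": {}})   # Docker expects CIDR form

# ReleaseAddress is symmetric: move the IP from "used" back to "free".

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```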
Problems we encountered with the IPAM plugin:
1) The Consul IPAM plugin runs on every host, is called through a socket, and is currently started as a container.
When the Docker daemon restarts and reloads its networks, the plugin container has not started yet, so the Consul IPAM plugin's socket file cannot be found. The daemon keeps retrying the IPAM request, which greatly extends its startup time, with errors like:
level=warning msg="Unable to locate plugin: consul, retrying in 1s"
level=warning msg="Unable to locate plugin: consul, retrying in 2s"
level=warning msg="Unable to locate plugin: consul, retrying in 4s"
level=warning msg="Unable to locate plugin: consul, retrying in 8s"
level=warning msg="Failed to retrieve ipam driver for network \"Vlan1062\""
Solution: Docker discovers plugins in three ways:
- .sock files are UNIX domain sockets;
- .spec files are text files containing a URL, such as unix:///other.sock or tcp://localhost:8080;
- .json files are text files containing a full JSON specification for the plugin.
Initially we exposed the IPAM plugin through the .sock method. We now point Docker at the non-local IPAM plugin with a .spec file instead, so the Docker daemon is no longer held up by the IPAM plugin when it restarts.
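As a small illustration, the .spec approach amounts to dropping a one-line file into one of Docker's plugin discovery directories; the address below is a placeholder for wherever the consul IPAM plugin actually listens.

```python
# Hypothetical example: register a remote "consul" IPAM plugin via a .spec file.
# /etc/docker/plugins is one of Docker's standard plugin-discovery directories.
with open("/etc/docker/plugins/consul.spec", "w") as f:
    f.write("tcp://10.1.1.10:8080\n")   # placeholder plugin endpoint
```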
2) When a network created with the Consul IPAM driver is deleted via docker network rm, the gateway address is released back to Consul. The next container that requests an IP may then be handed the gateway's address, causing a gateway IP conflict.
Solution: the IP address is checked when it is released; if it is a gateway IP, it is not added back to Consul's free list.
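A sketch of that guard on the release path might look like the following; the gateway list and Consul key layout are assumptions consistent with the earlier sketch.

```python
# Hypothetical release-path guard: never return a gateway IP to the free pool.
import requests

CONSUL = "http://127.0.0.1:8500/v1/kv"     # assumed local consul agent
GATEWAY_IPS = {"10.106.2.1"}               # per-VLAN gateways, loaded from config

def release_ip(ip):
    """Called from the IPAM driver's ReleaseAddress handler."""
    requests.delete(CONSUL + "/ipam/used/" + ip)
    if ip in GATEWAY_IPS:
        return                             # keep gateway addresses out of "free"
    requests.put(CONSUL + "/ipam/free/" + ip, data="1")
```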
Against this background, when we first started the selection we tested Docker 1.11 + Swarm and Docker 1.12 with its integrated Swarmkit. Docker 1.11 + Swarm networking has no Macvlan driver, and the Swarmkit integrated in Docker 1.12 can only use the overlay network, whose performance is poor. In the end we settled on Docker 1.12 + Mesos.
CI & CD
For CI we use Jenkins, which is already widely used in the company. Jenkins splits the pipeline into multiple steps: step 1 builds the war package, step 2 builds the Docker image and pushes it to the registry.
Step 1 builds the required war package and saves it to a fixed directory. Step 2 builds the Docker image: it automatically picks up the war produced in the previous step and builds the image from a pre-written Dockerfile, with the image named after the application. Once the image builds successfully it is pushed to our private registry. Every image build carries a tag, and the tag is the release version number. Later we plan to move CI out of Jenkins and build the war packages and images from our self-built PaaS platform.
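Conceptually, step 2 boils down to something like the following sketch using the Docker SDK for Python; the registry address, build directory, application name and version are placeholders, not our real values.

```python
# Rough sketch of "build image named after the app, tag = release version, push".
import docker

client = docker.from_env()
app_name, version = "my-java-app", "1.0.3"            # tag is the release version
image_ref = "registry.example.com/%s:%s" % (app_name, version)

# The Dockerfile in the build directory copies in the war produced by step 1
client.images.build(path="/data/build/%s" % app_name, tag=image_ref)
client.images.push("registry.example.com/" + app_name, tag=version)
```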
We are developing our own Docker-based PaaS platform (still under development). Its main functions include application on-boarding, deployment, monitoring, container management, application scaling and so on. CD is done on this PaaS platform.
To deploy a new business system, you first enter the application's information in the PaaS system, such as the base image address, container resource configuration, number of containers, network, health check and so on.
For CD, you select the image version number to deploy, which is the tag mentioned above.
We also support controlling the iteration (rolling update) speed through an iteration-ratio setting.
For example, a setting may specify that each iteration replaces 20% of the containers, while the number of healthy containers never drops below 100% of the pre-iteration count.
The problem we encountered: controlling the iteration ratio.
Marathon has two parameters that control the iteration ratio:
- minimumHealthCapacity (optional, default 1.0): the minimum proportion of containers that must remain healthy;
- maximumOverCapacity (optional, default 1.0): the proportion of containers that may be iterated (started over capacity) at the same time.
Suppose a Java application deployed on Tomcat is allocated four containers. With the default configuration, Marathon can start four new containers at once during an iteration and delete the four old ones as soon as the new ones come up. The four fresh Tomcat containers start serving traffic instantly; because the request volume is high, they have to grow their thread pools and warm up on the spot, so request handling times stretch out or even time out (B station's request volume is large and the configured timeouts are very short, so in this situation requests fail with 504).
Workaround:
For applications that need significant warm-up, strictly limit the iteration ratio. For example, with maximumOverCapacity set to 0.1, an iteration may only create 10% new containers at a time; only after those 10% have started successfully and the corresponding old containers have been deleted does the next 10% proceed.
For applications with low request volume, maximumOverCapacity can be raised appropriately to speed up the iteration.
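In Marathon terms, this workaround is just an upgradeStrategy setting on the app definition; a hedged example of applying it through Marathon's REST API (placeholder URL and app id) follows.

```python
# Tighten the rolling-update ratio for a warm-up-sensitive application.
import requests

MARATHON = "http://marathon.example.com:8080"          # placeholder
update = {
    "upgradeStrategy": {
        "minimumHealthCapacity": 1.0,   # never drop below 100% healthy old containers
        "maximumOverCapacity": 0.1,     # iterate at most 10% extra containers at once
    }
}
requests.put(MARATHON + "/v2/apps/my-java-app", json=update)
```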
Dynamic Scaling
During holidays or promotional events, applications need temporary expansion to cope with short-term high QPS; an application should also expand automatically when its average resource usage exceeds a threshold. We support two ways of scaling: 1) manual scaling; 2) defining rules that trigger automatic scaling when hit. The Bili PaaS platform provides both, and the bottom layer is Marathon's scale API. This article focuses on rule-based automatic scaling.
Automatic scaling relies on several components from the overall architecture diagram: the monitoring agent, Nginx + upsync + Consul, the Marathon hook, and the Bili PaaS platform.
Monitoring agent: we developed our own Docker monitoring agent, packaged as a container and deployed on every Docker host. It collects container CPU, memory, IO and other metrics through the Docker stats interface, writes them to InfluxDB and displays them in Grafana.
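The agent itself is in-house, but the idea can be sketched roughly as follows with the Docker SDK for Python and the InfluxDB client; the measurement name, field choices and addresses are assumptions.

```python
# Rough sketch: sample docker stats for every running container, write to InfluxDB.
import docker
from influxdb import InfluxDBClient

docker_client = docker.from_env()
influx = InfluxDBClient(host="influxdb.example.com", database="docker")

points = []
for container in docker_client.containers.list():
    s = container.stats(stream=False)                    # one-shot stats snapshot
    points.append({
        "measurement": "container_stats",                # hypothetical measurement
        "tags": {"name": container.name},
        "fields": {
            "mem_usage": s["memory_stats"].get("usage", 0),
            "cpu_total": s["cpu_stats"]["cpu_usage"]["total_usage"],
        },
    })
influx.write_points(points)                              # Grafana reads from here
```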
Bili PaaS: when an application is on-boarded to the PaaS platform, scaling rules can be chosen, for example: average CPU > 300% OR MEM > 4G. The PaaS platform polls periodically to evaluate the application's load; if a scaling rule is hit, it increases the number of instances by a certain proportion. Under the hood this is a call to Marathon's scale API.
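Stripped of the PaaS plumbing, the rule evaluation reduces to a loop like the sketch below; the threshold, growth step and Marathon address are placeholders, and fetching the average CPU from InfluxDB is left out.

```python
# Simplified rule-based scale-up using Marathon's scale API.
import requests

MARATHON = "http://marathon.example.com:8080"            # placeholder

def scale_if_needed(app_id, avg_cpu_percent, threshold=300.0, factor=1.2):
    if avg_cpu_percent <= threshold:
        return
    app = requests.get(MARATHON + "/v2/apps/" + app_id).json()["app"]
    target = int(app["instances"] * factor) + 1          # grow by a fixed proportion
    # Same underlying call as manual expansion on the PaaS platform
    requests.put(MARATHON + "/v2/apps/" + app_id, json={"instances": target})
```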
Marathon hook: the Marathon event stream is consumed through the /v2/events interface Marathon provides. When manual scaling, or rule-triggered automatic scaling, happens on the Bili PaaS platform, the platform calls the Marathon API. Every Marathon operation generates events that are exposed through /v2/events. The Marathon hook program registers the information of all containers in Consul; when Marathon deletes or creates a container, the hook updates the container information in Consul, keeping Consul consistent with Marathon and always up to date.
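A bare-bones version of such a hook is sketched below: it subscribes to Marathon's /v2/events SSE stream and reacts when a task becomes healthy. The event field names follow Marathon's event bus but should be treated as assumptions, and the Consul update itself is omitted.

```python
# Minimal Marathon event-stream consumer (Server-Sent Events over HTTP).
import json
import requests

MARATHON = "http://marathon.example.com:8080"            # placeholder

resp = requests.get(MARATHON + "/v2/events",
                    headers={"Accept": "text/event-stream"}, stream=True)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    event = json.loads(line[len(b"data:"):])
    if event.get("eventType") == "health_status_changed_event" and event.get("alive"):
        # The task just passed its health check: update consul with its ip:port here
        print("healthy task:", event.get("appId"), event.get("taskId"))
```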
Nginx + upsync + Consul: once a Marathon scale-up finishes, the new containers' ip:port must be added to the SLB (Tengine/Nginx) upstream and the SLB reloaded before they can serve external traffic. But Tengine/Nginx performance degrades during a reload. To avoid the performance loss of frequently reloading the SLB, we use a dynamic upstream: Nginx + upsync + Consul. Upsync is a module open-sourced by Weibo that keeps upstream server information in Consul. Nginx pulls the latest upstream servers from Consul at startup and keeps a TCP connection hooked to Consul; whenever the data in Consul changes, Nginx is notified immediately and the Nginx worker processes update their in-memory upstream servers. The whole process needs no Nginx reload. Note: upsync only updates upstream servers dynamically; a reload is still required when vhosts change.
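On the upsync side, registering or removing a backend is just a write to Consul's KV store under the upstream's prefix, which the module watches. The key layout and value fields below follow the upsync documentation, but the addresses are placeholders.

```python
# Add/remove an upstream server for nginx-upsync-module via Consul's KV API.
import json
import requests

CONSUL = "http://consul.example.com:8500/v1/kv"          # placeholder

def register_backend(upstream, ip, port, weight=1):
    value = json.dumps({"weight": weight, "max_fails": 2, "fail_timeout": 10})
    requests.put("%s/upstreams/%s/%s:%d" % (CONSUL, upstream, ip, port), data=value)

def deregister_backend(upstream, ip, port):
    requests.delete("%s/upstreams/%s/%s:%d" % (CONSUL, upstream, ip, port))
```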
Problems we encountered:
1) Nginx + upsync leaves worker processes stuck in "shutting down" after a reload, because the Nginx connection hooked to Consul is not closed in time. We raised an issue on GitHub for this and the author replied that it had been resolved, but our own tests show the shutting-down workers still appear. In addition, upsync and Tengine's HTTP upstream check module cannot be compiled together.
Solution: Tengine + dyups. We are trying to replace Nginx + upsync with Tengine + dyups. The drawback of dyups is that upstream information is kept only in memory and is lost when Tengine is reloaded or restarted, so the upstream information has to be synchronized to disk. For this we wrapped Tengine + dyups with an agent process that hooks Consul, proactively updates Tengine when changes are detected, and provides a web management interface. It is currently under internal testing.
2) Docker hook -> Marathon hook. At first we hooked Docker's events, which required a hook service on every host. When an application iterates, Docker generates a "create container" event immediately; the hook program saw it and updated Consul, and Consul then notified Nginx to update. The problem is that the service inside the container (such as Tomcat) has not finished starting yet, but the container is already receiving external traffic. This causes many requests to fail and to be retried.
Solution: Marathon hook. Marathon has a health check configuration, for example:
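A hedged example of such a healthChecks block (shown here as a fragment of a Marathon app definition, using the COMMAND variant described below) might look like this; the /health path and port are placeholders.

```python
# Fragment of a Marathon app definition: COMMAND health check run inside the container.
health_checks = [{
    "protocol": "COMMAND",
    "command": {"value": "curl -sf http://localhost:8080/health || exit 1"},
    "gracePeriodSeconds": 60,          # allow Tomcat etc. time to start
    "intervalSeconds": 10,
    "timeoutSeconds": 5,
    "maxConsecutiveFailures": 3,
}]
```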
We require every web service to provide a health check interface that lives and dies with the service; any HTTP code other than 200 indicates an application exception. When a container first starts, Marathon shows its health state as unknown. When the health check succeeds, Marathon marks the container healthy and generates an event. By hooking this event, the Marathon hook program captures exactly when the application inside the container has started successfully, then updates Consul and synchronizes Nginx, and only then does the container serve external traffic.
3) COMMAND health checks are lost after a Marathon failover. Marathon offers three ways to health-check a container: HTTP, TCP and COMMAND.
With HTTP or TCP, the check is initiated by Marathon and you cannot choose which port to check; Marathon checks the port it assigned. Since we do not actually use the ports Marathon maps, we chose the COMMAND method, issuing a curl request inside the container to judge the state of the application. When a Marathon failover occurs, the COMMAND health checks are lost and all containers show the state "unknown"; the applications have to be restarted or iterated to recover.
Q&A
Q: Hello, is your automatic scaling aimed at applications, or do you also monitor the Mesos resource pool and scale the Mesos agents?
A: Automatic scaling currently targets applications. For Mesos agent expansion, the physical machines' information is first entered into the PaaS platform; expansion is then triggered manually on the PaaS platform, which calls Ansible in the background and adds Mesos agents within minutes.
Q: So it is confirmed that nginx + upsync + upstream check cannot be used together? Which Nginx version does your company use?
A: We tested Nginx 1.8 and 1.10 and confirmed they cannot be compiled together. The Nginx (SLB) we use most is Tengine 2.1.1, deployed on Docker.
Q: Since you do the wrapping yourselves, doesn't using Mesos at the bottom offer much less flexibility than Kubernetes?
A: For the PaaS platform, the only thing we want from the scheduler is resource scheduling; the other functions we prefer to implement ourselves. Mesos focuses on resource scheduling and has been through plenty of large-scale testing, while Kubernetes currently provides many services we do not need, so we chose Mesos.
Q: Containers are monitored with the monitoring agent, but what about what runs inside the container? Do you use in-application instrumentation, or something like EFK? Is the monitoring done with Prometheus?
A: We do not use Prometheus; we use our self-developed monitoring agent plus InfluxDB. Inside the containers there are various monitoring approaches: some use ELK, others use instrumentation such as StatsD, and we have full-link tracing implemented based on the Dapper paper.
Q: On network selection, did you also research other network solutions, such as Calico or Weave? Why did you choose Macvlan?
A: The first step of our selection was to pick a standard, either the CoreOS-led CNI or Docker's official CNM. Since our container engine is still Docker, we chose CNM. Within CNM the candidates were: 1. overlay based on VXLAN; 2. Calico based on layer-3 routing; 3. Macvlan based on layer-2 isolation. We have actually used all of the above; wanting to keep things as simple as possible in principle, we finally settled on Macvlan.
Q: On the Bili PaaS platform, which do applications use more, automatic or manual scaling? Are resources re-scheduled after automatic scaling? Will the running business be interrupted?
A: Rule-based automatic scaling with pre-defined policies is used more. Traffic is served externally through the Nginx dynamic upstream, so the business is not interrupted.
Q: About log collection: does every container run a Logstash? And with ELK you cannot search the surrounding context of a log entry, can you?
A: No Logstash runs inside the containers. Currently a remote Logstash cluster listens on a UDP port; applications push their logs directly to that UDP port, and Logstash forwards them to Kafka. Kafka has two consumers: one is Elasticsearch, the other is HDFS. ELK is usually enough; when detailed logs are needed, operations query them through HDFS.
Q: Are Nginx's dynamic configuration files packaged inside the container, or mounted in with volumes? Is there something like a configuration center? How is this implemented?
A: Nginx's upstream configuration is generated locally from Consul and persisted to the host through a volume mount. There is a configuration center: when a business is dockerized, its configuration is moved to the configuration center, and business-dependent configuration is not stored inside Docker.
The above content was compiled from the group sharing session on the evening of September 27, 2016.
About the speaker: Wu An, operations engineer at Bilibili, currently responsible for the Docker rollout in B station's operations, has long focused on the development of Docker and has implemented a container platform combining Docker and Mesos.