From Mesos to Kubernetes
Our previous scheduling framework was built in-house on top of Mesos and written in Python. It ran for about two years and was fairly stable, but as the business grew, the problems of the existing framework were gradually exposed:
- Scheduling speed hit a bottleneck, slowing down deployments for large services.
- Stateful services were not well supported.
There were two ways to address this: refactor and improve the existing system, or migrate to Kubernetes. We ultimately chose to migrate to Kubernetes for the following reasons.
- Kubernetes has a clear architectural design with good abstractions for container management, which makes it easy to reuse and build on, so there is no need to reinvent the wheel. Typical concepts such as the Pod have since been introduced in similar form in Mesos as well.
- Kubernetes has gradually become the industry mainstream. The community is active and new features are constantly added; this does make Kubernetes heavier, but the basic architecture and core functionality have remained relatively stable.
- Compared with Mesos, developing on top of Kubernetes is cheaper, especially once the team is familiar with it, which makes it easy to promote internally. Besides the main business platform Bay, our load balancing platform, Kafka platform, and scheduled task platform are all based on Kubernetes.
Overall architecture
Resource Layer
This layer provides the cluster's resources: memory, CPU, storage, network, and so on. The main components running here are the Docker daemon, kubelet, cAdvisor, network plugins, and so on; their job is to provide resources to the layers above.
Control layer (Kubernetes master)
The control layer consists of the Kubernetes master components: the Scheduler, the controllers, and the API Server. It provides control over the Kubernetes cluster's resources.
Access Layer (Watch Service, API)
This layer covers a lot of ground; it consists mainly of the platforms we developed to access the Kubernetes cluster. The main components are:
- Container platform (Bay).
Bay serves two purposes. First, it provides the deployment system with interfaces for creating, updating, deleting, and scaling containers and container groups; it is a thin wrapper over the native Kubernetes API, designed for integration with the deployment system. Second, it registers the configuration required for service discovery, business monitoring, alerting, and so on.
- Load Balancer Platform (Balance).
Our load balancing platform is based mainly on HAProxy, which also runs in containers. Because of the special requirements of load balancing, it does not run on the container platform Bay; instead we built a separate platform to manage HAProxy. It makes it easy to create and manage HAProxy clusters and to scale them automatically or manually as business traffic changes.
- Kafka platform (Kafka). It provides the ability to create and manage Kafka clusters on Kubernetes. We built a local-disk-based scheduling framework on top of Kubernetes so that pods can be scheduled according to the local disk information of nodes in the cluster (see the sketch after this list). We do not use a shared storage system, mainly for performance reasons. The newest Kubernetes seems to have added a similar local-disk scheduling feature, but it was not yet available when we built ours.
- Scheduled task platform.
This platform was developed before Kubernetes supported CronJobs. It was also the first service to be moved onto Kubernetes.
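The article does not say how the local-disk-aware scheduling framework for Kafka is implemented, so the following is only a minimal sketch of one possible approach: a scheduler-extender-style "filter" endpoint that rejects nodes without enough free local disk. The port, disk table, and threshold are hypothetical placeholders.

```python
# Hedged sketch of a "filter" HTTP endpoint in the scheduler-extender style:
# given a list of candidate node names, keep only nodes with enough free
# local disk. The FREE_DISK_GB table would be fed by a node agent in practice.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

FREE_DISK_GB = {"node-1": 800, "node-2": 40}   # hypothetical per-node free disk
REQUIRED_GB = 100                              # hypothetical per-pod requirement

class FilterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        args = json.loads(self.rfile.read(length))
        names = args.get("nodenames") or []
        fits = [n for n in names if FREE_DISK_GB.get(n, 0) >= REQUIRED_GB]
        result = {
            "nodenames": fits,
            "failedNodes": {n: "not enough local disk" for n in names if n not in fits},
            "error": "",
        }
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8888), FilterHandler).serve_forever()
```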
Management layer (Castle Black; monitor; auto scale)
This layer uses the configuration and information provided by the access layer to carry out its functions.
- Castle Black. This is a critical service. It uses the Kubernetes watch API to keep business container information synchronized in real time and, based on that information, performs service registration and deregistration, primarily against Consul and DNS. The Kafka platform and the load balancing platform also rely on it for service registration. It also exposes a query interface through which real-time container information for a business can be looked up.
- Monitor. This covers monitoring of business containers, mainly the total number of containers, the number of abnormal containers, the consistency of registration information, and so on. Resource metrics such as CPU and memory are monitored through cAdvisor and our internal monitoring system.
- Auto scale. We did not use Kubernetes' built-in autoscaling; instead we developed our own, mainly to support more flexible scaling policies (a rough sketch follows this list).
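The article only says that autoscaling was built separately for flexibility, so here is a minimal sketch of what such a control loop could look like with the official Python Kubernetes client. The metric source, thresholds, limits, and Deployment name are all hypothetical.

```python
# Hedged sketch of a custom scaling loop (not the platform's actual code):
# read a business metric, compute a target replica count, and patch the
# Deployment's scale subresource.
import time
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def current_qps(service):
    # Stand-in for a lookup in the internal metrics system.
    return 1200

def desired_replicas(current, qps):
    # Toy policy: step up or down by one replica, bounded between 2 and 50.
    return min(50, max(2, current + (1 if qps > 1000 else -1)))

while True:
    scale = apps.read_namespaced_deployment_scale("my-service", "default")
    target = desired_replicas(scale.spec.replicas, current_qps("my-service"))
    if target != scale.spec.replicas:
        apps.patch_namespaced_deployment_scale(
            "my-service", "default", {"spec": {"replicas": target}})
    time.sleep(30)
```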
Configuration Layer (ETCD)
The configuration required by the application-layer components is written to etcd by the access-layer services. The application-layer components watch etcd so they can pick up updated configuration in a timely manner.
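As a small illustration of this pattern (not the actual code), assuming etcd v3 and the Python `etcd3` client, with a made-up key prefix:

```python
# Hedged sketch of the configuration-layer pattern: the access layer writes
# a key, and another component watches the prefix to pick up changes.
import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)   # assumed etcd endpoint

# Access layer: publish a piece of configuration for a service.
etcd.put("/config/bay/my-service/replicas", "10")  # hypothetical key

# Watching component: react to every change under the prefix.
events, cancel = etcd.watch_prefix("/config/bay/")
for event in events:
    print(event.key.decode(), "->", event.value.decode())
```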
The following diagram shows how the components described above work together so that a business can serve traffic externally on our container platform. A request is sent to the Kubernetes API Server through the Bay platform; once the Deployment and pods are created successfully and the health checks pass, Castle Black watches the pod information and registers the IP, port, and other details in Consul; HAProxy watches the corresponding Consul keys, adds the pods to its backend list, and starts serving traffic to the outside.
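To make the flow above concrete, here is a minimal sketch of the Castle Black side of it, assuming the official Python Kubernetes client and a local Consul agent; the label name, service port, and Consul address are assumptions, not the platform's actual configuration.

```python
# Hedged sketch of a Castle Black-style watcher: stream pod events from the
# Kubernetes API and register/deregister each pod's address with Consul.
import requests
from kubernetes import client, config, watch

config.load_incluster_config()                 # or load_kube_config() outside the cluster
v1 = client.CoreV1Api()
CONSUL = "http://127.0.0.1:8500"               # assumed local Consul agent

def register(pod):
    requests.put(f"{CONSUL}/v1/agent/service/register", json={
        "ID": pod.metadata.uid,
        "Name": (pod.metadata.labels or {}).get("app", "unknown"),
        "Address": pod.status.pod_ip,
        "Port": 80,                            # assumed container port
    })

def deregister(pod):
    requests.put(f"{CONSUL}/v1/agent/service/deregister/{pod.metadata.uid}")

for event in watch.Watch().stream(v1.list_pod_for_all_namespaces):
    pod = event["object"]
    if event["type"] == "DELETED":
        deregister(pod)
    elif pod.status.phase == "Running" and pod.status.pod_ip:
        register(pod)
```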
Monitoring and alerting
cAdvisor
Our monitoring metrics are collected mainly with cAdvisor. The main reasons we did not use Heapster are as follows:
- We have already customized cAdvisor through secondary development, it integrates well with our internal metrics system, and it has been in use for a long time.
- Heapster uses a pull model. Although it pulls in parallel, it can become a performance bottleneck when the cluster is large, and it currently cannot scale horizontally.
- Many of the aggregated metrics Heapster provides by default are not needed, and there is no point in maintaining two monitoring systems.
Internal metrics and alerting systems
Metrics and alerts go through our relatively mature internal systems.
Log Collection
Logspout → Kafka → ES/HDFS. Our log collection is also based on ELK, but it differs from the usual ELK setup: the "L" here is Logspout, an open-source tool for collecting container logs. We customized it to support dynamic topics: the topic is injected into the container as an environment variable, and Logspout automatically discovers the container, extracts the topic, and sends the container's logs to the corresponding Kafka topic. Each business therefore gets its own log topic instead of everything being dumped into one big topic. Once the logs reach Kafka, consumers pick them up and land them in ES and HDFS. ES is used mainly for log queries, and HDFS mainly for log backups.
The entire log collection flow is shown below:
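As a text complement to the diagram, here is a minimal sketch of the consumer side of the pipeline using `kafka-python` and the Elasticsearch client; the topic name, index name, and endpoints are hypothetical, and a similar consumer would write to HDFS.

```python
# Hedged sketch of a log consumer: read one business's log topic from Kafka
# and index each line into Elasticsearch for querying.
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

consumer = KafkaConsumer("logs.my-service",            # assumed per-business topic
                         bootstrap_servers="kafka:9092",
                         group_id="log-to-es")
es = Elasticsearch("http://es:9200")                   # assumed ES endpoint

for msg in consumer:
    # elasticsearch-py 8.x style; older clients take body= instead of document=.
    es.index(index="logs-my-service", document={"line": msg.value.decode()})
```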
Network Solutions
host-local + bridge. Our networking setup is relatively simple. First, every host is assigned a class C (/24) IP pool, and every IP in that segment is routable across hosts. The segment runs from x.x.x.2 to x.x.x.255, of which x.x.x.3 to x.x.x.224 are usable by containers, which is more than enough. We then create a Linux bridge on the host for that address segment. Each container's IP is allocated from the x.x.x.3 to x.x.x.224 range, its veth pair is attached to the Linux bridge, and cross-host communication goes through the Linux bridge. There is essentially no performance loss.
Concretely, we use two CNI plugins, bridge and host-local: bridge attaches and detaches the container's veth pair to and from the Linux bridge, and host-local assigns the container's IP based on local configuration.
The process described above is illustrated below:
The IP pools themselves are allocated by our cloud service provider, so we do not need to manage the pool allocation or the routing configuration ourselves.
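To make the setup concrete, here is a hedged sketch of what a per-host CNI configuration for this bridge + host-local combination could look like, written out from Python; the network name, bridge name, and the 10.1.2.0/24 segment are placeholders for the per-host /24 described above.

```python
# Hedged sketch: generate a per-host CNI config that attaches containers to a
# Linux bridge and lets host-local hand out IPs from this host's /24 range.
import json

cni_conf = {
    "cniVersion": "0.3.1",
    "name": "bay-bridge",                  # hypothetical network name
    "type": "bridge",                      # attach veth pairs to a Linux bridge
    "bridge": "br0",
    "isGateway": True,
    "ipam": {
        "type": "host-local",              # allocate IPs from local state
        "subnet": "10.1.2.0/24",           # the /24 assigned to this host
        "rangeStart": "10.1.2.3",
        "rangeEnd": "10.1.2.224",
        "routes": [{"dst": "0.0.0.0/0"}],
    },
}

with open("/etc/cni/net.d/10-bay-bridge.conf", "w") as f:
    json.dump(cni_conf, f, indent=2)
```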
Pitfalls
The sections above describe the current state of Zhihu's use of containers and Kubernetes. Along the way we also stepped on quite a few pitfalls, which we share here.
Etcd v3 Version Issue
Newer versions of Kubernetes use etcd3 as the default storage backend, and choosing the wrong etcd version leads to trouble. In versions of etcd before 3.0.10, the v3 Delete API does not return the deleted value by default, so the Kubernetes API server never receives the delete event. Occupied resources are never released, which eventually exhausts the cluster's resources, and the Scheduler can no longer schedule any tasks. More details are in this issue (https://github.com/coreos/etcd/issues/4620).
Pod Eviction
This is a Kubernetes feature: if a node goes offline due to network or machine problems, it becomes NotReady, and the Kubernetes node controller deletes the pods on that node; this is called pod eviction. The feature is reasonable in itself, but before roughly version 1.5, when all nodes in the cluster became NotReady, the pods on all of them would be deleted. That is unreasonable, because in that situation the problem is most likely the API Server's network rather than every node, so the pods should not all be deleted. Recent versions have improved this: pod eviction is only performed on NotReady nodes when a certain number of nodes in the cluster are still Ready, which is much more sensible. It is also important to make the API Server highly available.
CNI Plugin: IP Leak on Docker Daemon Restart
When using the CNI network plugin, if the Docker daemon restarts, containers are assigned new IPs but the old IPs are not released, so IP addresses leak. Due to limited time and energy, we took a somewhat hacky approach: before the Docker daemon starts, we release all locally allocated IPs by default, via the Supervisor startup script. We hope the Kubernetes community can eventually fix this problem at the root.
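A minimal sketch of that cleanup step, assuming the host-local plugin keeps its allocations in its default state directory (/var/lib/cni/networks/&lt;network&gt;); the network name is a placeholder, and something like this would be invoked from the Supervisor startup script before the Docker daemon starts.

```python
# Hedged sketch: release all locally recorded IP reservations before the
# Docker daemon starts. host-local stores one file per allocated IP, named
# after the IP, in its state directory; removing those files frees the IPs.
import os

STATE_DIR = "/var/lib/cni/networks/bay-bridge"   # assumed network name

if os.path.isdir(STATE_DIR):
    for name in os.listdir(STATE_DIR):
        if name[0].isdigit():                    # only the IP-named reservation files
            os.remove(os.path.join(STATE_DIR, name))
```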
Docker Bug
We also ran into some Docker bugs, for example `docker ps` hanging and port leaks when using port mapping. We maintain an internal branch in which these problems are fixed. The Docker daemon is the foundation: only when its stability is guaranteed can the stability of the whole system be guaranteed.
Rate Limit
The Kubernetes Controller Manager, Scheduler, and API Server all have default rate limits. When the cluster is large, the defaults are definitely not enough and need to be raised.
https://www.kubernetes.org.cn/2508.html
The Application of Kubernetes at Zhihu