"Editor's word" Hujiang current container technology main application scenario: OCS courseware business stateless application; Based on Apache Mesos+marathon implementation Hujiang container system scheduling management; Consul + Consul Template + nginx for automatic discovery and registration of services , Prometheus + Grafana + alertmanager alarm to implement container monitoring alarm. This sharing will be explained in the following ways:
- Reasons for choosing container technology
- Container technology selection
- Container storage
- Container network
- Monitoring and alerting
- Image management
- Scheduling management
- Service registration and discovery
- Automated deployment
- Automated scaling
"Shenzhen station |3 Day burning brain-type Kubernetes training camp" Training content includes: kubernetes overview, architecture, logging and monitoring, deployment, autonomous driving, service discovery, network solutions, such as core mechanism analysis, advanced article--kubernetes scheduling work principle, Resource management and source code analysis.
Reasons for Choosing Container Technology
- Lightweight
- Fast delivery (millisecond-level startup)
- Standardized environments
- Flexible deployment and migration management
- High resource utilization
- Natural fit for CI/CD
- Multi-cloud platform support
- Open source
Container Technology Selection
Storage:
- Local storage (devicemapper direct-lvm)
- Shared storage (Ceph)
Network:
- Container-to-container interconnection (overlay)
- Container-to-physical-network interworking (bridge, host)
Monitoring:
- Prometheus + cAdvisor + Grafana (no business attributes)
- Prometheus + Mesos exporter + Grafana (adds business attributes per service, enabling business-level monitoring)
Image management:
- VMware Harbor
Scheduling management platform:
- Mesos + Marathon
Deployment and release:
- Jenkins + Marathon Deployment plugin
Automated scaling:
- marathon-lb-autoscale
Storage
Storage classification:
For both logs and shared data, Docker and Kubernetes treat storage somewhat differently. Docker promotes the volume-driver concept: all storage is accessed through a driver, and local storage and network storage differ only in which driver is used.
- Local Storage (Local disk)
- Shared storage (Ceph file system)
Storage usage:
- Local storage: log file persistence (a local volume mapped into the container)
- Shared storage: state retention and data persistence for stateful services (mounted with ceph-fuse and accessed as if it were local storage); see the sketch below
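As a rough sketch of these two usage patterns (the paths, image names and Ceph monitor address below are hypothetical, not taken from the original setup):
# local storage: map a host directory into the container so logs survive container restarts
docker run -d --name app01 -v /data/logs/app01:/var/log/app dockerhub.domain.com/ocs/app:1.0
# shared storage: mount CephFS with ceph-fuse on the host, then bind it into a stateful container
ceph-fuse -m ceph-mon01:6789 /mnt/cephfs
docker run -d --name db01 -v /mnt/cephfs/db01:/var/lib/mysql dockerhub.domain.com/ocs/mysql:5.6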
Storage FAQs:
- Problem: with devicemapper in loop-lvm mode, if container logs are not mapped out to the host and a container produces a large volume of logs, the container or the local volume's logical usage reaches 100% and the container goes down abnormally.
- Solution: store container logs on the local host, strengthen container storage monitoring (with Docker >= 1.13.0, docker system df shows space usage), choose devicemapper direct-lvm as the storage scheme, and monitor the logical volumes (Docker monitoring comes first).
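A minimal sketch of what this solution implies in configuration terms, assuming a thin pool named docker-thinpool (hypothetical): daemon.json switches devicemapper to direct-lvm, and docker system df (Docker >= 1.13.0) reports space usage.
cat /etc/docker/daemon.json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true"
  ]
}
# check image, container and volume space usage
docker system df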
Network
Network classification
There are several ways to enable container communication across hosts.
Layer 2: the physical NIC and the container bridge are attached to the same Linux bridge. The container network can be connected directly to the Linux bridge, or the containers can be connected to the Docker bridge docker0, which is in turn bridged to the Linux bridge, putting the container network and the host network on the same Layer 2 network. Common implementations: host, Macvlan.
NAT: NAT is Docker's default network mode; it uses iptables address translation so that the host IP can reach the containers. A container's external IP is the host's IP. NAT has a relatively large performance penalty, but as long as the hosts can reach each other at Layer 3, their containers can communicate.
Tunnel (overlay) mode: VPN, IPIP, VXLAN and so on are all tunneling techniques; packets between containers are wrapped in one or more additional protocol headers to achieve connectivity. Docker's libnetwork supports a VXLAN overlay mode; Weave supports UDP and VXLAN modes; Flannel, Calico and others also provide overlays. A global KV store (an SDN controller, etcd, Consul or ZooKeeper) is generally required to hold control information. With this approach, containers can interoperate as long as the hosts are reachable at Layer 3. In overlay mode each container has an independent IP, and performance is generally better than NAT, though different overlay schemes vary greatly.
Routing: routing (SDN) schemes let containers communicate with each other and with hosts by setting up routes, for example the Calico BGP routing scheme (non-IPIP). This approach generally applies within a single data center and is most commonly used within the same VLAN; different VLANs need additional routing rules. The performance loss of routing schemes is very low, close to that of the host network.
Features of mainstream network solutions
Default Docker modes (bridge, none, container, host): the network modes Docker supports by default. Configuration and management are convenient and no third-party tools are needed, which lowers operational cost; however, bridge mode has too high a performance penalty, host mode does not solve port conflicts well in many scenarios, and an IP management tool is needed for IP and port allocation.
Overlay (VXLAN): a compromise between performance and usability; it requires a global KV store and a relatively new kernel (>= 3.16). Although packets must still be encapsulated and decapsulated, this happens in the kernel, so performance is better than Flannel's.
Flannel: by default Flannel encapsulates packets over UDP; under high concurrency packets may be lost, and because encapsulation happens in user space there is some performance loss. It requires that all hosts be directly routable on the same network. IP drift forces frequent encapsulation between source and target hosts, consuming a large amount of CPU, and it shares the other characteristics of overlay networks.
Calico: Calico BGP requires the physical network to support the BGP routing protocol, and containers then have a large impact on physical network device performance, especially when sharing devices with existing core business traffic.
Macvlan: performance is second only to host mode, and the container network and the physical network are completely open to each other; but the configuration bottlenecks of existing network devices must be evaluated, especially since a growing number of containers can introduce problems that never occur in a traditional network.
Performance comparison of mainstream network scenarios
Reference: a UCloud cloud-host performance test report: http://cmgs.me/life/docker-network-cloud.
CPU pressure depends mainly on load; the final ranking (lowest overhead first) is host < Calico (BGP) < Calico (IPIP) = Flannel (VXLAN) = Docker overlay (VXLAN) < Flannel (UDP) < Weave (UDP)
We also compared Macvlan against a bare NIC on a physical machine:
Network selection
Given the current status of the Hujiang network (zero changes to the physical network) and the fact that this project is still in a trial phase, we use both host and overlay networks: host mode solves fixed IP addresses and ports for SLB containers and communication between containers and the external network, while the overlay network provides cross-host container interconnectivity.
Marathon Network configuration:
Overlay network configuration is very simple: it relies on the distributed KV store (ZooKeeper) and is configured through the Marathon framework, achieving connectivity between containers and between containers and the physical network.
Overlay Network Architecture:
To create an overlay network:
docker network create -d overlay --subnet=<value> <network_name>
The Docker daemon is started with the cluster KV store specified:
--cluster-store=zk://zk1.yeshj.com:2181,zk2.yeshj.com:2181,zk3.yeshj.com:2181,zk4.yeshj.com:2181,zk5.yeshj.com:2181/store
Marathon configuration:
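As a hedged sketch of such a configuration (the app id, image and overlay network name hj-overlay are hypothetical), a Marathon 1.x app definition attached to a user-defined overlay network looks roughly like this:
{
  "id": "/ocs/web",
  "cpus": 0.5,
  "mem": 512,
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "dockerhub.domain.com/ocs/web:1.0",
      "network": "USER"
    }
  },
  "ipAddress": {
    "networkName": "hj-overlay"
  }
}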
Note: we ran into quite a few pitfalls when using Docker networking; for example, a kernel or Docker version that is too old causes instability. We recommend raising the Linux kernel to 4.4 and Docker to 1.12 or later. For network optimization, the follow-up plan is to study Calico BGP networking and Macvlan in depth, hoping for a larger improvement in network performance.
Monitoring and Alerting
Common Docker Monitoring
cAdvisor, Datadog
Prometheus is well suited to monitoring container-based infrastructure. It has a high-dimensional data model: time series are identified by a metric name and a set of key-value pairs. Its flexible query language makes it easy to query and graph the data, and advanced metric types such as summaries can build ratios over a specified time span or alert on anomalies at any time. It has no external dependencies, which makes it a reliable system to debug with during outages. It is precisely because of its flexible query and graphing language that we ultimately chose Prometheus.
Detailed comparison of each monitoring system reference: http://dockone.io/article/397.
Prometheus Features
- Container-based basic monitoring: all components run as Docker containers, making maintenance, deployment and management convenient
- Distributed architecture
- Data collection scales out and does not depend on distributed storage
- Multiple service-discovery mechanisms supported out of the box (Kubernetes, Consul, EC2, Azure)
- Ready-made exporters for common services (HAProxy, MySQL, PostgreSQL, Memcached, Redis, etc.)
- Time-series KV storage database allowing very flexible querying and graphing
Selection of monitoring options
Selection: cAdvisor + Prometheus + Grafana
For Docker monitoring we first tried a traditional Zabbix + Python scheme that monitored containers from the physical hosts, but because container IPs change dynamically the false-alarm rate of the whole monitoring system was extremely high and historical data could not be traced back. We then tried cAdvisor + Prometheus + Grafana; today we mainly present this monitoring stack.
System Architecture:
Component functions and characteristics
cAdvisor: the container data collector; keyed by container name or ID, it collects the host's containers' CPU, memory, network, filesystem, container status and other basic information, and exposes it to Prometheus through a metrics API, making persistence and flexible query display easy.
Prometheus: a time-series KV storage database with a flexible SQL-like query syntax, used for query-based graphing.
Grafana: the data presentation platform, supporting many data sources (Prometheus, Zabbix, Elasticsearch, InfluxDB, OpenTSDB, Graphite, etc.).
Node exporter: the host data collector; the open-source deployment currently has problems, so for now the existing Zabbix solution replaces it for physical-machine monitoring.
Alertmanager: alert management; Grafana also supports alert management from version 4.0 onward, with email, Slack and other notification channels.
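To show how these components fit together, here is a minimal prometheus.yml sketch; the target addresses and rule file name are assumptions, not the production configuration:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['mesos-slave01:8080', 'mesos-slave02:8080']   # hypothetical cAdvisor endpoints
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # hypothetical Alertmanager endpoint
rule_files:
  - alert.rules.yml   # hypothetical alert rule file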
Monitoring page
Cadvisor
Grafana
Optimization
Optimization scheme: Mesos exporter/metrics + Prometheus + Grafana
By collecting the Mesos masters' metrics centrally and adding a business tag to the container name, containers can be monitored with business attributes attached, enabling business-level monitoring.
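A rough sketch of how such a business tag could be attached on the Prometheus side; the exporter address, the container_name label and the naming convention are assumptions about the setup, not the original configuration:
scrape_configs:
  - job_name: mesos-exporter
    static_configs:
      - targets: ['mesos-slave01:9105']   # hypothetical mesos-exporter endpoint
    metric_relabel_configs:
      # assume container names look like "<business>-<service>"; copy the business prefix into its own label
      - source_labels: [container_name]
        regex: '([^-.]+)[-.].*'
        target_label: business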
Image Management
Tool selection
The open-source tool VMware Harbor
System architecture
Docker hub (private image registry) back-end storage: Ceph (stores Docker images and MySQL data)
Docker hub (Harbor) functions
- Project management: create, delete, update, query
- Project member management: image management, user management
- Remote image repository replication (synchronization)
- Search across projects, repositories and images
- System management
- User management
- Replication target management
- Remote replication policy management
- Docker client push and pull of images
- Deleting repositories and images
Harbor Management Platform Display
Multi-Machine Room deployment
Each DC deploys its own Harbor; images are synchronized between DCs via Harbor remote replication:
Office Harbor => DC01 Harbor, DC02 Harbor
Split-horizon DNS resolution makes each DC pull from its own local Harbor, while push operations always go to the office Harbor.
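For illustration (the project and tag are hypothetical), the same registry hostname then behaves differently depending on where it is resolved:
# from the office network, dockerhub.domain.com resolves to the office Harbor, so builds are pushed there
docker push dockerhub.domain.com/ocs/app:1.0
# on a DC01 node, split-horizon DNS resolves the same name to the DC01 Harbor,
# so the pull is served from the locally replicated copy
docker pull dockerhub.domain.com/ocs/app:1.0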
Scheduling management
Scheduling System Selection
- Docker Swarm: Docker's official and simplest container cluster management tool. It has fewer production deployments, but it is the default choice for many developers and has official community support. From Docker 1.12 onward it is integrated into the Docker Engine by default, simplifying cluster management, and its policies and filters are easy to understand; but because it cannot handle node failures well, it is not recommended for real production environments. Swarm integrates tightly with the Docker environment, uses the same API as the Docker Engine, and works well with Docker Compose, so it suits developers who do not know much about other schedulers.
- Kubernetes: the community is the most active and is developing fastest. Kubernetes' logic differs from plain Docker, but its concepts of pod and service make developers think about what a combination of containers really is, which is genuinely interesting. Google offers a very simple way to run Kubernetes on its cluster solution, making Kubernetes a reasonable choice for developers already inside the Google ecosystem.
- Mesos: an open-source tool; Mesos and Marathon form a well-matched solution. It can schedule tasks like other Mesos frameworks, has strong compatibility, supports plug-in integration, is flexible, and has a large base of enterprise users, giving it the highest stability. It uses a JSON description file, similar to Docker Compose, to configure tasks. These features make it an excellent solution for running containers on a cluster.
By comparing the characteristics of the three scheduling frameworks, we chose Mesos + Marathon as our container cluster resource-management and scheduling scheme.
System architecture
Scheduling system Components
Mesos: Apache's open-source unified resource management and scheduling platform, known as the kernel of the distributed data center; it provides failure detection, task publishing, task tracking, task monitoring, low-level resource management and fine-grained resource sharing, and scales to thousands of nodes.
Marathon: a framework for managing long-running applications on Apache Mesos. It deploys services through a REST API with authentication and SSL, supports configuration constraints, and achieves service discovery and load balancing through HAProxy (or third-party tools such as Consul).
Chronos: a framework for managing short-lived, scheduled, one-off tasks on Apache Mesos; a fault-tolerant job scheduler that handles dependencies and ISO 8601-based schedules, an open-source replacement for cron. Jobs can be orchestrated and run with Mesos as the executor.
ZooKeeper: a distributed, open-source coordination service for distributed applications, an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization and group services.
Mesos + Marathon functional characteristics
- High availability: automatic failover among multiple master nodes (active/standby)
- Multiple container runtimes supported (Docker, Mesos containers)
- Support for stateful services such as databases
- Web management interface for configuration, operations and monitoring of system status
- Constraint rules (constraints), e.g. restricting tasks to specific nodes, port allocation, etc.
- Service discovery and load balancing (Mesos-DNS, marathon-lb)
- Health checks for fault tolerance (HTTP, TCP)
- Event subscriptions for integration with other systems
- Runtime metrics endpoint (centralized metrics monitoring)
- Complete, easy-to-use REST API (see the sketch below)
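As a hedged sketch of the REST API mentioned in the last point (the Marathon host name and file are placeholders), a deployment and a scale-up can be driven with plain HTTP calls:
# create or update an application from a JSON definition
curl -X POST -H "Content-Type: application/json" http://marathon.domain.com:8080/v2/apps -d @marathon.json
# scale an existing application to 4 instances
curl -X PUT -H "Content-Type: application/json" http://marathon.domain.com:8080/v2/apps/ocs/web -d '{"instances": 4}'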
Mesos resource management and scheduling principle
A Mesos framework is an application that runs distributed workloads and has two components:
- Scheduler: interacts with Mesos, subscribes to resource offers, and launches tasks on the Mesos slaves.
- Executor: obtains its information from the framework's environment variables and configuration and runs tasks on the Mesos slaves.
First, the Mesos master offers the available resources to the scheduler; the scheduler then sends launch-task requests to the master; the master passes them to the slaves, which order their executors to run the tasks; the executors run the tasks and report status back to the slaves, which finally inform the scheduler. Each slave manages multiple executors; each executor is a container (Mesos originally used LXC containers and now uses Docker).
Mesos failure recovery and high availability
The Mesos masters use ZooKeeper for leader election and service discovery. A registrar records all running tasks and all slave information, replicated with Multi-Paxos for log consistency.
Mesos also has a slave recovery mechanism: no matter when a slave crashes, the user's tasks keep running. The slave persists key information, such as task status updates, to local disk, and after a restart it can recover these tasks from disk (similar to passivation and wake-up in Java).
Scheduling System Management Platform
Marathon scheduling Platform
Marathon configuration JSON
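As a hedged sketch (the app id, image, port and health-check path are hypothetical), a typical Marathon app definition combining the constraint and health-check features above looks roughly like this:
{
  "id": "/ocs/api",
  "cpus": 1,
  "mem": 1024,
  "instances": 3,
  "constraints": [["hostname", "UNIQUE"]],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "dockerhub.domain.com/ocs/api:1.0",
      "network": "BRIDGE",
      "portMappings": [{"containerPort": 8080, "hostPort": 0}]
    }
  },
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/health",
      "portIndex": 0,
      "gracePeriodSeconds": 30,
      "intervalSeconds": 10,
      "maxConsecutiveFailures": 3
    }
  ]
}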
Marathon web configuration
Mesos
Chronos
Service Registration Discovery
Scheme
Nginx/marathon-lb + Consul Agent + Consul Server + Consul Template
System components
- Consul Agent: collects the status of all containers on the current machine (service name, physical node IP, service port, etc.) and registers them with the Consul server.
- Consul Server: a distributed storage system that maintains the service information.
- Consul Template: reads service data from the Consul server, renders the template to build the configuration file, and reloads the configuration once it changes.
System architecture
A user request hits the front-end application (Nginx/marathon-lb/HAProxy), which uses its configuration file to route the request to a back-end application providing the service, then returns the response to the user. The Consul agent is installed alongside the back-end applications and joins the Consul cluster. Consul Template connects to the Consul cluster servers, dynamically pulls back-end service information from the Consul service catalog and writes it into the front-end application's configuration file. After each write (that is, whenever the back-end services change), consul-template runs a command that tells the front-end application to reload, so the front end dynamically discovers back-end services and applies the new configuration file.
Taking Nginx + Consul Template dynamic service discovery and registration as an example:
- The Consul cluster is made up of 3, 5 or 7 Consul servers to ensure high availability and leader election for the whole cluster.
- A Consul agent is deployed on every Mesos slave node to capture the node's service names, IP addresses, service ports and service status, and update them to the Consul cluster in a timely manner.
- Consul Template keeps reading the service information (service name, IP, service port, service status) from the Consul cluster; whenever this information changes, it updates the Nginx configuration file and reloads Nginx.
Configuration and service management
Consul cluster (3 nodes):
docker run -d --net=host --name=consul dockerhub.domain.com/consul:0.6.4 -server -advertise consul_server01_ip -recursor dnsserver01_ip -recursor dnsserver02_ip -retry-join consul_server02_ip -retry-join consul_server03_ip
Consul Agent:
docker run -d --net=host --name=consul --restart=always dockerhub.domain.com/consul:0.6.4 -advertise consul_agent_ip -recursor dnsserver01_ip -recursor dnsserver02_ip -retry-join consul_server01_ip -retry-join consul_server02_ip -retry-join consul_server03_ip
Consul Template:
consul-template --zk=zk://zk1.yeshj.com:2181,zk2.yeshj.com:2181,zk3.yeshj.com:2181,zk4.yeshj.com:2181,zk5.yeshj.com:2181/mesos
Consul server/agent Config:
cat consul_base.json
{
  "datacenter": "shanghai",
  "data_dir": "/data",
  "ui_dir": "/webui",
  "client_addr": "0.0.0.0",
  "log_level": "INFO",
  "ports": {
    "dns": 53
  },
  "rejoin_after_leave": true
}
Consul Template config:
/bin/sh -c 'echo "upstream app { {{range service \"$SERVICE\"}} server {{.Address}}:$CONTAINER_PORT; {{else}} server 127.0.0.1:65535; {{end}} } server { listen default_server; location / { limit_rate_after $LIMIT_AFTER; limit_rate $LIMIT_RATE; proxy_pass http://app; } }" > $CT_FILE; nginx -c /etc/nginx/nginx.conf & CONSUL_TEMPLATE_LOG=debug consul-template -consul=$HOST:8500 -template "$CT_FILE:$NX_FILE:nginx -s reload"'
consul-template -consul=consul_agent_ip:8500 -template "/etc/consul-templates/nginx.conf:/etc/nginx/conf.d/app.conf:nginx -s reload"
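Pulled out of the one-liner above into a standalone template file, the rendered Nginx configuration looks roughly like this; the service name web and listen port 80 are assumptions for illustration:
upstream app {
  {{range service "web"}}
  server {{.Address}}:{{.Port}};
  {{else}}
  server 127.0.0.1:65535;   # forces a 502 when no healthy instance is registered
  {{end}}
}
server {
  listen 80 default_server;
  location / {
    proxy_pass http://app;
  }
}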
Automated deployment
Scheme
Jenkins + Git + Marathon Deployment plugin + docker image build script
System components
Git: version management for program source code, Dockerfile, marathon.json and other program and configuration files.
Jenkins: the tool implementing CI/CD, with a rich set of plugins; the Marathon Deployment plugin drives the Marathon release process from Jenkins, so a release only needs the Marathon authentication information and configuration.
Implementation logic
Create a GitLab credential account => create a Jenkins freestyle job => add the GitLab info => create a build trigger => build step (build the Docker image, push the image) => add the Docker Registry credential => post-build step (Marathon deployment)
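A minimal sketch of the build step in the Jenkins job (the registry path, image name and tag scheme are assumptions; GIT_COMMIT is injected by the Jenkins Git plugin):
#!/bin/bash
IMAGE=dockerhub.domain.com/ocs/app:${GIT_COMMIT:0:8}
docker build -t "$IMAGE" .
docker push "$IMAGE"
# the Marathon Deployment post-build step then reads marathon.json and triggers the release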
Automated scaling
Scheme
Marathon-lb-autoscale
System architecture
Implementation method
Docker container
Startup parameters
Marathon configuration file
{
  "id": "marathon-lb-autoscale",
  "args": [
    "--marathon", "http://leader.mesos:8080",
    "--haproxy", "http://marathon-lb.marathon.mesos:9090",
    "--apps", "nginx"
  ],
  "cpus": 0.5,
  "mem": 16.0,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "mesosphere/marathon-lb-autoscale",
      "network": "HOST",
      "forcePullImage": true
    }
  }
}
Q&A
Q: How exactly do you use Prometheus? For example, how do you handle hot and cold storage of metrics data? Is alerting done only with Grafana 4.0?
A: You could first go through the official documentation on your own; there is too much to cover here. Metrics data are stored in Prometheus as time series without any extra processing and queried with its SQL-like language; query statements can filter or apply conditions. Grafana supports alerting only from version 4.0 onward, and we also still have Zabbix for traditional monitoring alerts.
Q: How is QoS implemented for different applications? How do you handle tenant management and at what granularity for containers?
A: Rate limiting can be handled at the HAProxy or Nginx layer as an API gateway. The containers are currently for our own use, so tenant management has not been broken down further.
Q: What is the Marathon deployment process: does a new version replace the old one in place, or do you create a new app and delete the old one? Any suggestions for app naming conventions? After a container is deployed, how is its IP associated with the upstream domain name and registered in Nginx?
A: New instances are first published at a certain ratio (e.g. 2 at a time), then old instances are deleted, ending with the same number of instances. Naming is best standardized to prevent conflicts. Nginx + Consul server + agent + template answers the rest of your questions.
Q: Are there network problems causing system anomalies in this network environment?
A: We encountered a bug in the older Docker 1.9.1 that caused packet loss on the network; after upgrading Docker to 1.12.5 the problem was resolved. We currently run kernel 4.4.18 and Docker 1.13.0 with no network problems.
Q: How do you cluster Prometheus? Did you consider OpenTSDB as the back-end database?
A: There is no performance bottleneck at present; if performance problems arise we can consider deploying Prometheus per machine room, with display and alerting unified in Grafana. We do not use OpenTSDB.
The above content was organized from the group sharing on the evening of May 9, 2017.
Speaker: Xiu Xudong, operations architect at Hujiang Education, currently focused on container technology research, maintenance of the Ceph distributed storage system, and operation and maintenance of part of Hujiang's business. DockOne organizes weekly technology sharing; interested readers can add WeChat: liyingjiesz to join the group, and leave a message with topics you would like to hear or share.