Editor's note: The PaaS platform of a joint-stock commercial bank was developed jointly by WISE2C and Rancher, and is built on Rancher. To fit the business scenarios and special requirements of the banking industry, and to keep later Rancher version upgrades smooth, we added a logical abstraction layer on top of Rancher.
 
1. Software architecture and Deployment scenarios
The overall software architecture is as follows: 
 
The top-level DCOS acts as the unified management platform and controls the cloud platform centrally through the APIs provided by the PaaS and the IaaS. The blue part on the left is native Rancher; DCOS and the red customization section access Rancher through its API. Because no changes were made to Rancher itself, it can be upgraded smoothly to Rancher versions later than 1.2.
 
The red part is the custom logic abstraction layer. By functional responsibility, it breaks down into the following microservices (described in more detail later):

- Authentication
- Resource management
- Application orchestration
- Elastic scaling
- Log collection
- Monitoring and alarms
- Image registry

These microservices are deployed the same way Rancher deploys an infrastructure stack into an environment: they run in a separate, dedicated Rancher environment, as shown in the deployment topology:
 
Each dashed box in the figure corresponds to one Rancher environment. The "extended env" environment uses the Rancher server host as its agent and runs the customized microservices. Every other environment corresponds to a specific network of a tenant. Traffic inside a single network does not use Rancher's native overlay but a flat network implemented by WISE2C; traffic between networks is controlled by an external firewall.
 
2. Roles and Permissions model
The role and permission model of the PaaS platform inherits some concepts from Rancher and adds its own. It differs in two main ways:

- The PaaS platform adds management of the image registry, which Rancher does not have. A role's permissions therefore cover operations on the image registry as well as on Rancher, and the image registry follows the same permission model as the PaaS platform;
- The customer introduced the concept of a tenant, which Rancher does not have. A tenant can span multiple Rancher environments.
Rancher permissions model:

Platform administrator:

- Has all privileges on the entire Rancher platform.

Environment users:

- Owner: has all permissions in the environment;
- Member: has all permissions except authorizing users within the environment;
- Restricted user: has all permissions in the environment except authorizing users and operating basic resources;
- Read only: has read-only permission on the resources in the environment.
PaaS platform permissions model:

Platform administrator:

- Has privileges equivalent to the Rancher platform administrator, plus all permissions for image registry management.

Tenant internal roles:

- Tenant administrator: has all permissions to manage tenant resources and to authorize users within the tenant, plus all permissions for image registry management;
- Senior member: on the PaaS platform, has the permissions for user authorization within the tenant and for operating basic resources; in the image registry, can set image synchronization rules, create and delete registry namespaces, change image status, and so on;
- Restricted member: on the PaaS platform, has the permissions for user authorization within the tenant and for operating the underlying resources; in the image registry namespaces it belongs to, can upload and download images;
- Read only: on the PaaS platform, can view tenant resources; within its image registry namespaces, can view image registry resources.
The specific mapping relationship is as follows: 
 
The software design for the authentication section is as follows: 
 
All API requests to the PaaS pass through the API proxy for authentication control and are then proxied to the specific microservices inside the system. The PaaS does not itself add or delete tenants; the API proxy obtains user roles and tenant information by communicating with a Keystone service outside the PaaS.
 
3. Resource Management
 
Network section
 
 
  
- The financial industry has stricter network security requirements. Rancher provides an overlay network inside each environment, but an overlay inevitably produces many packets that security devices cannot inspect transparently. That is unacceptable in this industry, so a flat network must be used.
- For security reasons, the services inside a single stack must be deployable to different network partitions. The network currently managed by Rancher clearly cannot meet this requirement, so multiple networks must be supported.

Flat network support is described in detail in my previous article (implementing a flat network based on Rancher 1.2). It mainly uses ebtables to control traffic directly on the Linux bridge, which avoids the overlay, so external security devices can see the traffic between containers transparently.
 
For multi-network support, we implemented a layer of abstraction logic on top of Rancher. In the resulting model, each network maps to one Rancher environment (with a flat network running inside that environment). This part of the platform manages all networks and maintains the mapping between tenants and networks.
 
Here's an example to describe the process: 
 
The platform administrator creates a network on the PaaS and specifies its parameters (subnet mask, gateway, owning security domain, owning isolation domain, and so on), which are saved to the database;

When the platform administrator assigns the first network to a tenant, the abstraction layer actually creates the Rancher environment for that network, together with the system-level application stacks required for monitoring, logging, and the customized system;

When the platform administrator assigns a second or later network to the tenant, the abstraction layer must also establish an env link between the new Rancher environment and the Rancher environments of the tenant's other networks. Otherwise, services in an application stack that spans Rancher environments cannot reach each other through Rancher DNS.
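
The three steps above can be sketched as follows. This is a minimal illustration, not the platform's actual code: `FakeRancher`, `NetworkManager`, and all method and parameter names are hypothetical, the Rancher API is replaced by an in-memory stand-in, and the creation of system-level stacks is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRancher:
    """In-memory stand-in for the Rancher API (hypothetical, for illustration)."""
    environments: list = field(default_factory=list)
    links: set = field(default_factory=set)

    def create_environment(self, name):
        self.environments.append(name)
        return name

    def link_environments(self, a, b):
        # Store links as unordered pairs so (a, b) and (b, a) are the same link.
        self.links.add(frozenset((a, b)))


class NetworkManager:
    def __init__(self, rancher):
        self.rancher = rancher
        self.networks = {}      # network name -> parameters (stands in for the database)
        self.tenant_envs = {}   # tenant -> list of Rancher environment names

    def create_network(self, name, **params):
        # Step 1: only record the network; nothing is created in Rancher yet.
        self.networks[name] = params

    def assign_network(self, tenant, name):
        if name not in self.networks:
            raise KeyError(f"unknown network: {name}")
        # Step 2: assigning a network creates its Rancher environment.
        env = self.rancher.create_environment(f"{tenant}-{name}")
        existing = self.tenant_envs.setdefault(tenant, [])
        # Step 3: link the new environment to the tenant's other environments.
        for other in existing:
            self.rancher.link_environments(env, other)
        existing.append(env)
        return env
```

Assigning the first network to a tenant creates one environment and no links; each later assignment adds one link per environment the tenant already has.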
 
Storage section
For storage, the customer's PaaS finally selected NFS. Ceph and other options were discussed earlier; my previous article (exploring the use of block storage in containers) analyzes specifically why that option was not chosen.
 
A single tenant can have multiple networks (that is, multiple Rancher environments), but a volume created through the rancher-nfs driver in Rancher is scoped to one environment. To map volumes to the tenant level, we add this mapping in the abstraction layer.
 
The specific process is as follows:
The platform administrator specifies the parameters in the PaaS to create an NFS server. As with a network, this only saves the data to the database;

The platform administrator assigns the NFS server to a tenant. At this point the abstraction layer actually operates on the Rancher environments of the tenant's networks, adding to each environment the system stack that provides the rancher-nfs driver;

When users within the tenant create, delete, or update a volume, the abstraction layer performs the operation in the Rancher environments corresponding to the tenant's networks.
 
The reason for this abstraction is that the customer needs to deploy application stacks across networks, so storage must be shared across Rancher environments at tenant granularity. Beyond managing NFS servers, the customer has one more special requirement:

Physical storage is graded by performance, and the same tenant should be able to have gold, silver, and bronze tier NFS servers at the same time. Depending on business level, a different tier of NFS server can be specified for different levels of microservices.

The use of storage therefore differs from stock Rancher: the same tenant can be associated with multiple NFS servers, and users in the tenant can specify the NFS server when they create a volume.
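
A minimal sketch of this tenant-level storage mapping is shown below. It only models the bookkeeping in memory; `StorageManager`, its method names, and the example addresses are assumptions made for illustration, and no real rancher-nfs or Rancher API call is made.

```python
class StorageManager:
    TIERS = ("gold", "silver", "bronze")

    def __init__(self, tenant_envs):
        self.tenant_envs = tenant_envs  # tenant -> list of Rancher environment names
        self.nfs_servers = {}           # server name -> {"address", "tier"} (the database)
        self.tenant_nfs = {}            # tenant -> set of assigned server names
        self.env_drivers = {}           # environment -> servers with a driver stack deployed
        self.env_volumes = {}           # environment -> {volume name: server name}

    def create_nfs_server(self, name, address, tier):
        # Creating a server only records it; no environment is touched yet.
        if tier not in self.TIERS:
            raise ValueError(f"unknown tier: {tier}")
        self.nfs_servers[name] = {"address": address, "tier": tier}

    def assign_nfs_server(self, tenant, name):
        if name not in self.nfs_servers:
            raise KeyError(f"unknown NFS server: {name}")
        self.tenant_nfs.setdefault(tenant, set()).add(name)
        # Add the driver stack to every environment of the tenant.
        for env in self.tenant_envs.get(tenant, []):
            self.env_drivers.setdefault(env, set()).add(name)

    def create_volume(self, tenant, volume, server):
        # Users may only pick a server that was assigned to their tenant.
        if server not in self.tenant_nfs.get(tenant, set()):
            raise PermissionError(f"{server} is not assigned to {tenant}")
        # Create the volume in every environment so it is shared tenant-wide.
        for env in self.tenant_envs.get(tenant, []):
            self.env_volumes.setdefault(env, {})[volume] = server
```

Because the volume is registered in every environment of the tenant, a stack that spans networks sees the same volume name regardless of the environment a service lands in.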
 
4. Application Orchestration
In application orchestration, because of the special security requirements of the financial industry, the customer requires that a single application stack can be deployed across networks according to the security level of its microservices. That is, the microservices in one application stack may be spread over several networks in different security domains.

To deploy microservices across networks, the basic resource models (network and storage) must be abstracted, and the whole application stack deployment process must be adjusted accordingly;

In addition, application stacks are no longer described with the Rancher catalog but with the open TOSCA template standard. This is meant to ease integration with OpenStack and other platforms, so that the same templates can later describe orchestration of the entire IaaS and PaaS;

Updates to an application stack and its internal microservices must go through a unified interface. We implement this by issuing a new TOSCA template to update the application stack.
 
To support cross-network (cross-environment) deployment of application stacks and TOSCA-based orchestration, the abstraction layer works as follows:

It accepts the TOSCA template entered by the user. The translator module checks the template syntax and translates it, producing rancher-compose files that can be deployed to each Rancher environment, plus other additional information;

The orchestration module performs resource-level checks on the translator's output, such as whether the tenant has the networks (Rancher environments) needed to deploy the application stack;

Using the translator's output and the dependencies between microservices across networks, the module decides the deployment order of the rancher-compose files, then first deploys the rancher-compose files in the networks (Rancher environments) that have no dependencies;

Depending on how the application stack deployment goes in those Rancher environments, the remaining rancher-compose files are deployed in dependency order;

After the application stack has been deployed in all Rancher environments, its elastic scaling rules are issued to the elastic scaling module.
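
The ordering step can be illustrated with a topological sort. The sketch below assumes the translator output has already been reduced to a hypothetical dictionary that maps each Rancher environment to the environments its services depend on; it uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) and groups environments into waves that can be deployed together.

```python
from graphlib import TopologicalSorter


def deployment_waves(env_dependencies):
    """Group environments into deployment waves.

    env_dependencies maps an environment name to the set of environments
    that must be deployed before it.
    """
    sorter = TopologicalSorter(env_dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies deployed.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

The first wave holds the environments without dependencies; each later wave becomes deployable only after the previous one has finished, which matches the order described above.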
 
 
 
5. Elastic Scaling
Elastic scaling was customized to the customer's business scenarios. The requirements are roughly as follows:

First, elastic scaling strategies are organized by time period within a day: for each period of the day, you can set which elastic scaling strategy applies;

There are three types of elastic scaling strategies:

- Based on the average CPU utilization of all containers of a microservice;
- Based on the average memory usage of all containers of a microservice;
- Based on the time period: as soon as the time interval is entered, the number of containers is scaled directly out or in to the maximum or minimum value, and the previous number is restored when the interval ends;

An elastic scaling strategy can be enabled or disabled per microservice;
 
For CPU and memory monitoring, the following rules apply:

- The upper and lower thresholds of the monitored metric can be set;
- A duration can be set: the number of containers is increased or decreased only after the threshold has been exceeded for longer than that duration;
- The number of containers added or removed each time scaling is triggered can be set;
- The maximum and minimum number of containers that elastic scaling may reach can be set;
- The time interval that must pass after one scaling action before another can be triggered can be set.
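
To make the interaction of these five settings concrete, here is a minimal sketch of a threshold-based scaling decision. `ScalingRule`, `Scaler`, and their field names are hypothetical; the metric value passed in is assumed to be the average over all containers of the microservice, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    upper: float          # scale out when the metric stays above this
    lower: float          # scale in when the metric stays below this
    duration: float       # seconds the threshold must stay breached
    step: int             # containers added or removed per action
    min_containers: int
    max_containers: int
    retrigger: float      # seconds to wait after an action before the next one


class Scaler:
    def __init__(self, rule, containers):
        self.rule = rule
        self.containers = containers
        self.breach_since = None   # (direction, start time) of the current breach
        self.last_action = None    # time of the last scaling action

    def observe(self, now, value):
        """Feed one metric sample; return the resulting container count."""
        rule = self.rule
        direction = 1 if value > rule.upper else -1 if value < rule.lower else 0
        if direction == 0:
            self.breach_since = None
            return self.containers
        if self.breach_since is None or self.breach_since[0] != direction:
            self.breach_since = (direction, now)
        in_cooldown = (self.last_action is not None
                       and now - self.last_action < rule.retrigger)
        if now - self.breach_since[1] >= rule.duration and not in_cooldown:
            target = self.containers + direction * rule.step
            # Clamp to the configured minimum and maximum.
            target = max(rule.min_containers, min(rule.max_containers, target))
            if target != self.containers:
                self.containers = target
                self.last_action = now
            self.breach_since = None
        return self.containers
```

A short spike never scales the service because the breach has to last for `duration`, and two actions are always at least `retrigger` seconds apart.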
The implementation of elastic scaling falls broadly into two types, depending on the type of policy:

Time-based policy: the module mainly matches the current time against the policy's time intervals. Once the current time enters the interval of a time-based policy, it looks up the target microservice by its index and changes the number of its containers;

Memory- and CPU-utilization-based policies: the elastic scaling module does not monitor CPU and memory itself but relies on the monitoring module. After an elastic scaling policy for a microservice is added or updated on the application orchestration side, the elastic scaling module converts it into a monitoring alarm policy and listens for alarms from the monitoring alarm module. When an alarm arrives, the module looks up in its own mapping table which microservice triggered it, then applies a series of rules to decide whether to scale that microservice and by how many containers at a time.

Because elastic scaling strategies are set per time interval, a large number of timers must be maintained. Once the rules are set, they effectively define a time-based state machine over a 24-hour period for each microservice. When the number of microservices is large, managing all these state machines and timers without consuming too many system resources is the difficult part of the software design.

In addition, because each running instance runs its own independent state machines, how to make elastic scaling highly available (redundant) while keeping the data of the redundant instances synchronized is also worth careful thought.
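
One possible way to keep the timer count down, sketched here as an assumption rather than as the design actually used in the project, is to evaluate all time windows from a single periodic tick instead of arming one timer per rule. The functions and the `(start, end, target)` policy tuple below are hypothetical.

```python
from datetime import time


def active_policy(policies, now):
    """Return the first policy whose daily window contains `now`, else None.

    Each policy is (start, end, target_containers). A window whose start
    is later than its end wraps past midnight.
    """
    for start, end, target in policies:
        if start <= end:
            inside = start <= now < end
        else:
            inside = now >= start or now < end
        if inside:
            return (start, end, target)
    return None


def desired_containers(policies, now, baseline):
    """Container count for `now`: the window's target, or the baseline outside any window."""
    policy = active_policy(policies, now)
    return policy[2] if policy else baseline
```

Because the result depends only on the current time and the stored policies, a redundant instance can recompute the same answer without sharing timer state.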
 
 
6. Log Collection
The customer's PaaS collects three types of logs, classified by source:
 
 
  
  - Host log collection;
- Container log collection;
- Application log collection;
 
Host and container log collection is relatively simple: the contents of files in specified directories are collected, formatted, and sent uniformly to the Kafka cluster;

Application log collection is more complex: it must avoid intruding on the business container while still collecting logs promptly. We define log_files for a microservice in the TOSCA template to specify the path and extension of the application logs inside the container.

When a container is scheduled onto a host, the log collection module on that host learns the application log directory inside the container from the container's labels. By analyzing the container's details, it finds the host path to which that log directory is mapped, so collecting application logs in the container becomes collecting specific files on the host. Collection itself uses Logstash; the program automatically modifies the Logstash configuration file to add the log source.

Once all logs are in Kafka, the customer uses a third-party log analysis tool for filtering, analysis, search, and multi-dimensional presentation of the logs.
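
The label-to-host-path lookup can be sketched as below. The function takes the parsed JSON of a container's inspect output as a dictionary and relies on its `Config.Labels` and `Mounts` (`Source`, `Destination`) fields. The label names `log.dir` and `log.ext` are invented for this example; the article does not give the real label format.

```python
import posixpath

LOG_DIR_LABEL = "log.dir"   # hypothetical label name
LOG_EXT_LABEL = "log.ext"   # hypothetical label name


def host_log_glob(inspect):
    """Map the labelled in-container log directory to a host-side glob pattern."""
    labels = inspect.get("Config", {}).get("Labels") or {}
    log_dir = labels.get(LOG_DIR_LABEL)
    if not log_dir:
        return None  # container does not declare application logs
    ext = labels.get(LOG_EXT_LABEL, "log")
    log_dir = posixpath.normpath(log_dir)
    best = None
    for mount in inspect.get("Mounts", []):
        dest = posixpath.normpath(mount["Destination"])
        if log_dir == dest or log_dir.startswith(dest.rstrip("/") + "/"):
            # Prefer the most specific (longest) mount point.
            if best is None or len(dest) > len(best[0]):
                best = (dest, mount["Source"])
    if best is None:
        return None  # log directory is not backed by a host mount
    dest, source = best
    relative = posixpath.relpath(log_dir, dest)
    host_dir = source if relative == "." else posixpath.join(source, relative)
    return posixpath.join(host_dir, f"*.{ext}")
```

The returned pattern is what a collector on the host would add as a file input, for example to a generated Logstash configuration.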
 
7. Monitoring Alarms
The customer's monitoring needs are broadly as follows: 
 
Resource usage and health of the tenant's host cluster, including:

- Number of container hosts and overall resource usage of the tenant cluster
- Number of container hosts and resource usage in each network region of the tenant cluster
- Total number of containers in the cluster, and how the containers are running and distributed across container host nodes
- Resource usage and list of running containers for each container host node

Application (stack and service) monitoring data, including the application's container list (container IP, host) and the application's running state (health, resource consumption), and so on.

Each container's CPU, memory, network, storage, labels, port numbers, and similar information is monitored and exposed through RESTful APIs.

Events and similar information are written to the event audit database. Configurable event alarm rules are supported: when the alarm function is activated, information is read from the event audit database and filtered according to the preset alarm rules, converted to syslog format, and sent through the message queue to a platform outside the PaaS.

This part is implemented mainly with the Bosun platform. Container monitoring data is collected from cAdvisor, host data is read directly from the host in real time, and the Rancher audit log is obtained mainly by reading the Rancher database. Once all monitoring data is aggregated in Bosun, a wrapper layer around Bosun does two things: it lets alarm rules be set in a custom format, and it connects Bosun to ActiveMQ so that monitoring information is sent to the message queue and on to the third-party monitoring big data platform.
 
8. Image Registry
The image registry is divided into a test registry and a production registry. Both integrate with the PaaS platform's permission model to provide single sign-on and unified authentication control.
 
It is also worth mentioning that the customer's process for synchronizing images from the test registry to the production registry has a manual and an automatic mode, as follows:

- After an image is pushed to the test registry, its default status is "in development";
- When development of the image is finished, the restricted user notifies the advanced user through the external collaborative management platform to synchronize the image from the test registry to the production registry;
- After logging in to the test registry, the advanced user can modify the image synchronization rules. Before the actual synchronization, the advanced user can change an image in "pending" status to "in development";
- If the corresponding namespace already exists in the production registry and the advanced user has selected automatic synchronization, the test registry synchronizes the image during the synchronization cycle and changes its status from "pending" to "in sync". If synchronization succeeds, the status is automatically updated to "synchronized"; otherwise it returns to "pending";
- If the corresponding namespace already exists in the production registry and the advanced user has selected manual synchronization, the advanced user must click the "Sync" button in the test registry to synchronize the image to the production registry. If synchronization succeeds, the status is automatically updated to "synchronized"; otherwise it returns to "pending".
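
The status changes above form a small state machine, sketched below. The event names are invented for illustration, and the `mark_pending` transition from "in development" to "pending" is an assumption: the text implies such a step but does not spell out who performs it or how.

```python
IN_DEVELOPMENT = "in development"
PENDING = "pending"
IN_SYNC = "in sync"
SYNCHRONIZED = "synchronized"

# (current status, event) -> next status; event names are hypothetical.
TRANSITIONS = {
    (IN_DEVELOPMENT, "mark_pending"): PENDING,   # assumed step, not spelled out in the text
    (PENDING, "revert"): IN_DEVELOPMENT,         # advanced user, before synchronization
    (PENDING, "start_sync"): IN_SYNC,            # automatic cycle or manual "Sync" button
    (IN_SYNC, "sync_ok"): SYNCHRONIZED,
    (IN_SYNC, "sync_failed"): PENDING,
}


def next_state(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}") from None


def synchronize(state, push):
    """Run one synchronization attempt; `push` is a callable returning True on success."""
    state = next_state(state, "start_sync")
    return next_state(state, "sync_ok" if push() else "sync_failed")
```

Manual and automatic mode differ only in what fires `start_sync`; the status transitions are the same in both.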
Q&A
 
Q: On elastic scaling: scaling out is straightforward, but scaling in has a problem. What happens if user requests are still being handled in the container?
 
 
  
  
A: We did not handle this case specially in this project, but we have taken it into account in our other products. The correct approach is to set a grace period before destroying containers, enforced at the ingress: during this period no new traffic is routed to the containers that are about to be destroyed, and they are destroyed automatically once the grace period expires.
 
 
 
Q: Thanks for sharing. A question about elastic scaling: the scenario you described has a very obvious business periodicity, so different scaling operations are triggered by time interval. If the business has no obvious periodicity, can the scaling mechanism handle it, and how?
 
 
  
  
A: In the current project, the customer explicitly requires a period of one day. In our own PaaS product line, elastic scaling supports adjustable cycles (for example week or month), and it can also be driven by CPU, memory, or any other monitored item rather than by time period. In other words, anything that can be monitored can serve as the basis for elastic scaling.
 
 
 
Q: I am currently interested in the logging part. How do you gather the logs together? Can you be more specific?
 
 
  
  
A: After collecting the logs we send them all to the Kafka cluster; you can think of it as keeping a centralized copy in Kafka. That part is not the difficult one. The difficulty lies in collecting the logs at three levels: host, container, and application. Our approach is to deploy containerized Logstash on each host and have a program modify its configuration template so that it collects logs from different directories. These directories correspond to the host logs, the host directory to which container logs are mapped, and the host directory to which application logs are mapped.
 
 
 
Q: You get the application log directory from the container label. What is the specific format of the container label? Does the collected log information contain node information, container information, application information, and other platform- and application-related metadata fields?
 
 
  
  
A: The log labels here can be customized. A daemon on each host listens for container events such as creation and destruction. When it sees a container being created, it checks the container's labels for the custom "log directory information" and "log file extension information". These log directories have corresponding volumes mounted on the host, so by analyzing the container's inspect information you can find the host directory to which each log directory maps. As for the node information you mentioned, it is defined when the log collection service container on each host is started, and all logs it collects and sends carry the host tag.
 
 
 
Q: Regarding the timestamp in collected logs, is it the local time of the log source or the system time of the log collection point? How is consistency maintained? NTP?
 
 
  
  
A: It is the local time of the log collection point, kept in sync through NTP. Note that container time and host time (time zone) must also be kept consistent.
 
 
 
Q: Another question on elastic scaling: for non-periodic scaling, have you considered how to avoid unnecessary scaling operations caused by short-lived spikes?
 
 
  
  
A: That is why the elastic scaling rules include a "retrigger time" parameter, which can be thought of as a safe period: after one scaling action, the next one can only be triggered once this time has elapsed. Until then, the module does not respond.
 
 
The above content was compiled from the group sharing session on the evening of April 27, 2017. Speaker:
Chen Legi, architect at Rui Yun Zhihe (WISE2C), has many years of research and development experience in the software and communications industries, and previously worked at Nokia Siemens Networks and CLP Kohuaiun on cloud computing, SDN, and network equipment. He currently works in the Rui Yun Zhihe research and development center, mainly on WISE2C's container-based PaaS platform products and customer-specific PaaS platform development. DockOne organizes technology sharing sessions weekly. If you are interested, add Liyingjiesz to join the group, and leave us a message with the topics you would like to hear or share.