Hello everyone. I am a full-stack engineer at CODING, and I was fortunate to take part in developing Coding-job, CODING's container orchestration platform. You may already know CODING: Coding.net is a one-stop development platform with code hosting, task management, product demos, and a WebIDE. Taken as a whole, its functionality is fairly complex and fragmented.
This is how the Coding architecture has evolved. How do you judge whether a system is complex? Personally, I look at two indicators. The first is how long it takes the ops staff to deploy new code to production. When I was on a startup team, every deployment happened in the low-traffic window after nine in the evening. Over dinner we would say the update should be done in about half an hour, and then four or five hours later we still weren't live. Why? There might be some front-end styling issues, or some configuration files that weren't quite right. And why did these problems only surface at deploy time? In many cases, because our online and offline environments were inconsistent. That is the first indicator. The second is how long it takes a new colleague to set up the full development environment after joining the company. As a back-end programmer, for example, I had to install a virtual machine, an Nginx server, a database, and the language's build toolchain, then run npm install to set up the front-end code base. With all these steps, plus China's peculiar network conditions, we usually spent half a day to a full business day. With Coding-job, deploying the local development environment takes about half an hour.
In the beginning, the Coding website was a single Java WAR package: compile, package, upload, go live — a relatively simple routine. As the business grew rapidly, merging features such as email and SMS push into that one package became increasingly difficult. Microservices were also quite popular in those two years, so we adopted a microservice architecture, letting each developer pick a microservice and build it with the framework and language they were most proficient in. Take SMS, for instance: a colleague would write the SMS module and its interface documentation, hand them to the colleagues working on the Coding back end, and once it was hooked up, the job was done.
This let our code keep pace with Coding's rapid business growth. But then we suddenly discovered a problem: it felt like nobody in the whole company could remember how each microservice was compiled, configured, and started. So I thought: what if there were something like a black box that we could put a pile of code into? We wouldn't need to care how the box was configured or operated; we would just put it on a server and let it run. How the code inside is configured is the box's own business, and the environment outside the box doesn't affect what runs inside it. This black box seemed to solve our problem, and it is, in fact, virtualization technology. When Docker took off in early 2015, we compared traditional virtual machines with Docker and ultimately chose Docker.
The diagram shows the traditional scenario: a virtual machine architecture. After buying the hosts (infrastructure), we install the host operating system, and to get virtualization we must install a layer of virtual machine management software — a hypervisor such as VMware vSphere. The hypervisor lets us run three black boxes, which are actually three virtual machines, each running its own guest OS, and on top of those three operating systems we run what we really want to run: the three microservices App1, App2, and App3. Looking back, why did we have to go through the contortion of starting three virtual machines? Why spend so many physical resources on guest operating systems that are, strictly speaking, optional?
When people finally figured this out, they began to study lightweight virtualization. The Linux kernel's namespaces and control groups (cgroups) emerged, providing isolation of process namespaces, isolation of network environments, and control over process resources. Docker built on these technologies and suddenly took off. When we ran microservices with Docker, the hypervisor in the diagram was replaced with the Docker Engine; each microservice ran in a Docker container, and the containers shared the host's operating system, which saved a great deal of physical resources. Startup time dropped from the minutes of a virtual machine to seconds, and I/O performance is close to running directly on the host.
Speaking of namespaces: we all know that on Linux there is only one process with PID 1, init. With namespaces, the PID lists of different process namespaces are isolated from one another, so inside a container you can also have a PID 1 process — this is the isolation of process namespaces. With control groups, we can control the physical resources each container occupies, for example by capping CPU usage and the read/write bandwidth of block devices. Combining these two technologies, we can make a Docker container approach a virtual machine in isolation while keeping near-native performance.
So we moved the Coding microservices into Docker one by one and deployed them as containers. You might picture the result like this.
But in reality it looks more like this. At Coding we have more than 50 microservices, large and small. Just as you shouldn't put all your eggs in one basket, the containers can't all sit on the same cloud host; they have to be spread out neatly — otherwise how could we call it distributed? Our microservices depend on each other's file systems and networks, strung together like a spider's web, which imposes requirements on where each microservice may run and how the corresponding host must be configured. So we need something central to store each service's configuration and the code it runs — even the Docker images. That central thing is orchestration. Like the conductor at a concert, it tells one cloud host which task to run, with which configuration and which code, then tells another machine its tasks, and so on.
To find this central thing, we studied two popular open-source container management frameworks.
Apache Mesos is an operating system for clusters: it abstracts all the physical resources of a cluster into a single computer for its users. If your cluster has 100 servers with one CPU each, then, to put it bluntly, Mesos can present you with one 100-CPU computer.
What Mesos does best is allocating physical resources. As the diagram shows, Mesos has the concept of a framework, roughly what we would usually call an application; each framework consists of a scheduler and an executor. The scheduler is responsible for scheduling tasks, and the executor for running them. In the middle of the figure is the Mesos Master, and below it several Mesos Slaves. The Master is the steward — the contractor, if you like — and the Slaves do the actual work. The Slaves are our cloud hosts, one per machine. The figure walks through a resource allocation: first, Slave1 finds that it has 4 GB of memory and 4 CPUs free, so it reports this to the Mesos Master and asks whether there is any work to do. The Master then asks the scheduler of the currently available Framework1 whether it has jobs for Slave1. The scheduler picks two tasks that fit within the free resources and replies to the Master, which forwards them to Slave1, and Slave1 starts working. You will notice that Slave1 still has 1 GB of memory and one CPU idle, so the Master can offer those to Framework2 to request more tasks.
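The offer cycle above can be sketched as a small simulation. This is only an illustration of the idea — the class and method names here are my own, not the real Mesos Framework API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int

class Scheduler:
    """A toy framework scheduler: accepts pending tasks that fit an offer."""
    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offer(self, cpus, mem_gb):
        # Take tasks out of the pending list as long as they fit
        # within the offered free resources.
        accepted = []
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem_gb <= mem_gb:
                accepted.append(task)
                cpus -= task.cpus
                mem_gb -= task.mem_gb
                self.pending.remove(task)
        return accepted

# Slave1 reports 4 CPUs / 4 GB free; the master offers them to framework1 first.
framework1 = Scheduler([Task("web", 2, 2), Task("worker", 1, 1)])
launched = framework1.resource_offer(cpus=4, mem_gb=4)
# 1 CPU / 1 GB remain idle and could be offered to framework2 next.
```

In real Mesos the flow is inverted into callbacks (the master pushes offers to registered schedulers), but the fit-and-accept logic is the same idea.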
What does Mesos buy us? First, efficiency: by quantifying each host's physical resources and each task's resource requirements, multiple tasks can run on the same host and make full use of it. Second, scalability: Mesos offers more than 50 frameworks to help us handle all kinds of workloads.
The other framework, the Docker container cluster manager Kubernetes, is also particularly hot. It draws on Google's Borg and provides resource scheduling, deployment, service discovery, and other functions for containerized systems. It leans more toward container orchestration and has a self-healing mechanism, so let's look at its architecture.
Let me start with the architecture diagram. On the left is the control node; on the right are the slaves. Using kubectl, in the upper left corner, I can submit a job and let Kubernetes assign it to one of the slaves. In Kubernetes, the unit of scheduling is the pod — in Chinese, 豆荚, a pea pod. Like a pea pod, it holds a number of beans; those beans are our Docker containers. The containers in a pod form a shared container group, a concept internal to Kubernetes: all Docker containers in a pod share namespaces, which addresses some special scenarios. Best practice with Docker is to do only one thing per container rather than treating a container like a virtual machine, which means there are times when process B needs to watch the PID of process A, or to communicate with A over IPC. With one namespace per container this is obviously not directly possible — and that is exactly the pod's use case.
So let's look at how a pod is defined. The pod definition shown here is the configuration file we submit to Kubernetes through kubectl. A pod can contain multiple containers; this one defines a single container, Nginx, together with its configuration, such as the image name and container port.
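The slide itself isn't reproduced here, but a minimal pod definition of the kind described (using the Kubernetes v1 API of that era; the image tag is illustrative) looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx           # one container in the pod
    image: nginx:1.9.1    # image name
    ports:
    - containerPort: 80   # container port
```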
Next, the self-healing mechanism: Kubernetes manages one or more pods through a replication controller. This is the configuration file we submit for a replication controller. Its template section is the pod template the controller uses, and the replicas field says how many pods built from that template we want Kubernetes to keep running at the same time. This involves a Kubernetes concept called the desired state. After we submit this configuration, if any pod goes down for some reason, the original three copies become two; Kubernetes will then find a way to reach the desired state, so it starts a new copy until there are three again. That is Kubernetes's self-healing.
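The desired-state idea can be sketched as a toy reconciliation loop. This is a deliberate simplification of what the real controller manager does, with invented helper names:

```python
def reconcile(desired_replicas, running_pods, start_pod, stop_pod):
    """Drive the observed state toward the desired state."""
    while len(running_pods) < desired_replicas:
        running_pods.append(start_pod())   # too few copies: start replacements
    while len(running_pods) > desired_replicas:
        stop_pod(running_pods.pop())       # too many copies: stop extras
    return running_pods

counter = {"n": 3}
def start_pod():
    counter["n"] += 1
    return f"pod-{counter['n']}"

pods = ["pod-1", "pod-2", "pod-3"]   # replicas: 3, all healthy
pods.remove("pod-2")                 # one replica goes down
pods = reconcile(3, pods, start_pod, stop_pod=lambda p: None)
# the loop starts a replacement, restoring three copies
```

The real controller runs this comparison continuously against the cluster's actual state rather than once, but the convergence logic is the same.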
Having heard about Mesos's resource management and Kubernetes's self-healing, you may feel they are quite mature — but putting them into production can still hit some potholes. If you want to use Mesos, the first thing to consider is that the services we run at Coding are mostly long-running services, not batch tasks, so you need to install Marathon, a framework for managing long-running tasks, to handle this. And if you hit performance problems in production, tuning is a tricky business. With Kubernetes, we don't want the self-healing system to restore a service onto just any machine — several Coding microservices have stringent requirements on which machine and location they run on. On top of that, both frameworks were still iterating rapidly and their documentation had gaps, so we judged it premature to take them into production. Considering that the native Docker API already let us build most of the functionality we wanted, we started developing our own container orchestration platform: Coding-job.
Coding-job's task configuration has three tiers: service, job, and task. The service in the diagram is the core, on which many other services depend. Each job defines one thing to do, and each job is subdivided into tasks; each task can run on a different machine with a different configuration.
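As a purely hypothetical illustration of the three-tier layout — the talk does not show Coding-job's actual configuration format, so every name and field below is invented — such a configuration might look like:

```yaml
coding:                    # service: the core unit others depend on
  jobs:
    web:                   # job: one thing to do
      image: registry.example.com/coding-web:v1.2
      tasks:               # tasks: the same job placed on different machines
        - host: host-1
          env: {PORT: "8080"}
        - host: host-2
          env: {PORT: "8081"}
```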
Coding-job has a C/S architecture. On the client side there is a command-line interface whose operations are push (pushing the configuration file) and up/down/update, which start, stop, and update jobs according to the configuration.
In addition, the Coding-job server exposes a WebUI. Through it we can see each container's status; the INSPECT operation shows a container's detailed information (consistent with docker inspect), the LOG function shows container logs, and HISTORY displays the configurations of each job's historical containers. The server also periodically polls each slave's system resource usage and Docker runtime data, which are displayed in the WebUI as well.
This is the architecture of Coding-job in use; Host-1 and Host-2 are slave machines. After pulling the latest code, a developer builds a Docker image with the pre-packaged commands and uploads it to our private Docker Registry with docker push. At the same time, the developer updates the task configuration file with the latest code tag and pushes it, via the Coding-job client, to the etcd used by the Coding-job server. This etcd stores the task configuration, and since it is a watchable store, the Coding-job server can notify the operations staff in various ways as soon as new task configuration arrives. If ops confirm that the code should go live, they call the client's up or update command, the slave machines start pulling the new image and running it, and the update is complete.
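The role of the watchable store in this flow can be sketched with a tiny in-memory analogue. Real deployments use etcd's watch API over the network; the class and key names here are illustrative:

```python
class WatchableStore:
    """A toy stand-in for etcd: a key/value store that notifies watchers on change."""
    def __init__(self):
        self.data = {}
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def put(self, key, value):
        self.data[key] = value
        for callback in self.watchers:   # fire on every write, like an etcd watch
            callback(key, value)

notifications = []
store = WatchableStore()
# The Coding-job server watches for new task configurations...
store.watch(lambda key, value: notifications.append((key, value)))
# ...and the client's `push` writes the updated config, triggering the watch.
store.put("/coding-job/web", {"image": "coding-web:v1.3"})
```

This is why ops can be notified immediately after a push: the server reacts to the change event instead of polling the configuration.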
So what do we gain? First, releasing a new component in a minute is no longer a dream. On top of that, the history-container function, plus committing the task configuration to a Git repository after every update, lets us trace every component's historical online versions. Second, within five minutes we can bring up, on the corporate intranet, a staging environment consistent with production, which can be used for beta-testing new features. Finally, with the Coding-job workflow, developers are no longer confused about the online environment — it is completely transparent, and anyone can see how the production environment is put together. When ops check the status of the whole site, the WebUI gives them a starting point; this read-only WebUI can even be opened to the public network, so that everyone can see the status of Coding's services.
Happy Coding; )
Coding-job: Containerization Practice from Development to Production