Abstract: Operations is a painful topic, yet no product can do without operation and maintenance. Whatever language, platform, or technology you use, operations is likely to be what determines the maturity of your product.
Taken from a talk by CEO Zhang Hu at the ECUG conference.
Operation and maintenance in the cloud era
Operations used to be so painful, yet we made little effort to change the status quo. Why?
Because you had to build everything yourself: survey data centers, procure servers, and procure bandwidth. When anything went wrong along the way, there was a good chance the data center was to blame.
In the cloud era, especially after the advent of AWS, many American teams have undergone great changes in the way they do operations.
Why are operations in the cloud era different from what came before?
The first reason is the cloud host. The gray areas and disputes that used to surround data centers were wiped out when cloud hosts appeared. A provider dedicated to cloud hosting has its service verified by an enormous user base. The services it provides are actually simple: some basic monitoring and a portal for deploying services. If more than half of a provider's users find it stable, the probability that you will run into instability is relatively low. Its reliability is higher than building your own data center and finding your own bandwidth provider.
A professional cloud hosting provider solves infrastructure problems more competently than you can, so you no longer face so many painful problems.
Another essential change is elastic operations. Click a button or call an API to request a machine, and in less than two minutes you can provision it, test it, and even bring it online. Work that used to take months can now be done in the time it takes to invoke an API.
My personal feeling is that these changes in infrastructure are what make operations in the cloud era essentially different from traditional operation and maintenance.
How messy the underlying work is matters less than it used to, because even when individual steps are done poorly, your boss is unlikely to notice. The real change cloud infrastructure brings is elastic operations.
Once the infrastructure is elastic, virtually every service built on it can be operated elastically.
Elastic expansion
When you launch a product, you only need to provision the minimum capacity. Expand as the user base grows.
Grayscale release
In the past the preparation was lengthy, and a grayscale release might not happen for months or even years. Now the cost of preparation is very low. Almost every time we finish a small feature we put it into a grayscale release immediately, because the earlier it goes out, the earlier problems are found and the less time the system spends unstable.
Deploy every day
Expansion and grayscale release are really part of the development process. We adjust the deployment of the entire cluster every day, for example by adding a new data center, which was almost unthinkable in the early days. Today this is a lightweight operation.
Adding a data center
Now that cloud hosting providers have matured, you still need to verify a new data center thoroughly before adding it. Verification means deploying the whole stack and running some test users on top of it. Even so, the cost of adding a new data center is actually very low.
Expansion
The process of capacity expansion includes:
- Request machines
- Configure DNS
- Deploy the application
- Deploy monitoring
- Run automated tests
- Allocate traffic
The whole process is automated inside a script. We usually expand by 3 to 5 machines at a time, that is, in small increments. Adding 3 to 5 machines takes a little over 20 minutes, and you can treat the scaling process itself as a grayscale release or grayscale test. Each time a feature is completed we expand: first a grayscale release, then, once that succeeds, cut production traffic over to the new machines and expand further while reducing the traffic or capacity of the old version. That completes one round of capacity expansion.
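The expansion steps listed above can be sketched as a small pipeline that runs each step in order and stops at the first failure. This is only an illustration under stated assumptions: the step names paraphrase the list, and `run_expansion`, `stub_steps`, and the host names are hypothetical stand-ins, not the team's actual script.

```python
from typing import Callable, List, Tuple

# Step names paraphrase the expansion list; the callables are hypothetical
# stand-ins for real provisioning, DNS, deployment, and traffic calls.
STEP_NAMES = [
    "request machines",
    "configure DNS",
    "deploy the application",
    "deploy monitoring",
    "run automated tests",
    "allocate traffic",
]

Step = Tuple[str, Callable[[List[str]], bool]]


def run_expansion(hosts: List[str], steps: List[Step]) -> Tuple[List[str], bool]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the completed steps and whether all succeeded.
    Because traffic allocation comes last, a failed test keeps the new
    hosts out of production traffic.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action(hosts):
            return completed, False
        completed.append(name)
    return completed, True


def stub_steps(failing: str = "") -> List[Step]:
    """Build stub steps; the step named `failing` reports failure."""
    return [(name, (lambda hosts, n=name: n != failing)) for name in STEP_NAMES]


if __name__ == "__main__":
    batch = ["web-101", "web-102", "web-103"]  # hypothetical small batch
    print(run_expansion(batch, stub_steps()))
    print(run_expansion(batch, stub_steps(failing="run automated tests")))
```

Putting traffic allocation last is what lets a small-batch expansion double as a grayscale test: the new machines receive production traffic only after every earlier step has passed.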
The cheap allocation process
As I understand it, elastic operations come down to two points: resources are allocated when they are needed, and the allocation process is cheap.
What does this mean? In the past, allocation was costly and required a lot of preparation. Now it is very cheap: call the provider's API and the whole allocation is done, including host allocation, traffic allocation, IP binding, and domain name resolution, and possibly other services such as storage and load balancers.
The first service I used was AWS. Domestic cloud hosting providers probably still do not offer anything as systematic, well supported, and complete. They already have APIs, but the integration with the surrounding auto-deployment modules is not yet good enough.
Allocation is cheap, and elastic expansion and grayscale release are convenient. Together, these two points are the heart of elastic operations.
Users expand online at any time
One conclusion from the above: without automated operations there are no elastic operations.
Users expanding online at any time is a different concept. The difference is that everything above looks at the problem from the perspective of ops, or of the product's service provider.
Consider the users we serve. Early on, a user has very little traffic and is only validating the service, so does not pay. As that user's own user base grows, latency increases. The monitoring data shows many daily online users and many daily interactions, and the current channel can no longer support them, so the channel needs to be expanded.
The old practice was to call the sales team and say, "I'll pay a little more; give me one more server."
I think that is the traditional model. Cloud services should be purely self-service.
When users find that a channel is insufficient, they apply for more. Depending on whether their main users are in Shanghai and Nanjing or in Tibet and Xi'an, they need to expand channels in different regions according to the specific circumstances. Users should do this themselves, but they need our help. How do we help? It is actually simple: we look at where resources are available and where they are scarce, and the user's purchase of resources is what triggers automatic expansion.
Automatic expansion is not driven by operations, but by developers who use our services.
Automatic expansion
Every day we monitor the state of each module and component of the system. When a metric reaches a certain baseline, human intervention used to be required; now every link is mature enough that manual intervention is not actually needed. For example, we can configure a rule so that if any part runs into a capacity problem between midnight and 8 a.m., the automatic expansion process starts.
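A rule of this kind might look like the following minimal sketch. The `ExpansionRule` class, the utilization metric, and the 0.8 baseline are assumptions made for illustration; the talk does not say which metrics or thresholds are used.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class ExpansionRule:
    """Hypothetical rule: expand a module when it crosses a baseline
    during a configured time window."""
    module: str
    baseline: float      # utilization threshold between 0 and 1 (assumed metric)
    window_start: time
    window_end: time

    def in_window(self, now: time) -> bool:
        if self.window_start <= self.window_end:
            return self.window_start <= now < self.window_end
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return now >= self.window_start or now < self.window_end

    def should_expand(self, utilization: float, now: time) -> bool:
        return self.in_window(now) and utilization >= self.baseline


if __name__ == "__main__":
    # Midnight to 8 a.m., as in the example in the text.
    rule = ExpansionRule("broker", 0.8, time(0, 0), time(8, 0))
    print(rule.should_expand(0.9, time(3, 0)))   # inside the window, over baseline
    print(rule.should_expand(0.9, time(12, 0)))  # outside the window
```

Keeping the rule as plain data means the same check can run unattended at night, which is the point of removing the manual step.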
By contrast, people work poorly late at night, even if ops staff are watching the system every day. Much of what ops does can be standardized, expansion especially, and in many cases simply expanding lets a system survive for a long time. It doesn't matter whether your program is poorly written, the code is bad, or you chose Erlang instead of Go: whatever the reason, widening the road lets all the cars through.
Take an example from Yunba itself: once, the system suddenly hit a bottleneck, with timeouts and packet loss, and one cluster was under very heavy load. I did only one thing: expand its capacity, quickly going from the original 20 machines to 40. The load soon went down.
I told the team, "The bottleneck is in this module," and we did not go through the logs or capture packets. The reasoning was that inbound data kept growing while outbound data stayed flat, and the number of calls to dependent modules had not increased: requests were piling up at the entrance and not getting out. Put simply, the pipe was too thin.
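The reasoning above can be written down as a simple check over three monitored series. This is a rough sketch, not the team's monitoring code: the function names, the sample numbers, and the 20% growth threshold are all assumptions.

```python
from typing import Sequence


def growth(series: Sequence[float]) -> float:
    """Relative change from the first sample to the last."""
    first, last = series[0], series[-1]
    if first == 0:
        return float("inf") if last > 0 else 0.0
    return (last - first) / first


def is_local_bottleneck(
    inbound: Sequence[float],
    outbound: Sequence[float],
    downstream_calls: Sequence[float],
    threshold: float = 0.2,  # assumed cut-off for "has grown"
) -> bool:
    """True when input keeps growing while output and calls to dependent
    modules do not, which suggests this module itself is the thin pipe."""
    return (
        growth(inbound) > threshold
        and growth(outbound) <= threshold
        and growth(downstream_calls) <= threshold
    )


if __name__ == "__main__":
    # Hypothetical samples: requests pile up at the entrance, output is flat.
    print(is_local_bottleneck([100, 150, 220], [100, 102, 101], [50, 51, 50]))
```

If calls to dependent modules were growing along with the input, the check would return False, pointing the investigation downstream instead.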
The difference between cloud computing and Enterprise Services
With enterprise services, you can stop the service as soon as there is a problem. A cloud service cannot do that: thousands of users may be using it right now, and if it stops for a minute people start cursing, and your phone, your QQ group, and your mailbox fill up. The right approach is to keep the service running and track down the problem offline.
What we want to do in the future
We monitor the entrance and exit of every module in detail, down to every request and even every protocol packet, and look for patterns. We have predictions for how much goes in, how much comes out, and how much goes to module A, module B, and module C. Once we find which module is under pressure, we expand it automatically right away. As mentioned above, pressure is not only a matter of capacity; it may be a bug in the program. Whatever it is, let the traffic get through first. That is what I have been thinking about. We will put a lot of effort into this and hope to share the results with you as soon as possible.
Team division of labor and the role of the operations engineer
How was our development team divided before? Developers wrote the code, testers tested it, and ops people typed out all the commands to deploy it.
There is an interesting paradox: the people who write the code know better how to test it than the testers do, and know better how to configure and deploy it than ops does.
We learned a painful lesson here. Back when we were building Aurora Push (JPush), before each upgrade I would put two big whiteboards in the office, prepare three pens (black, blue, and red), gather the developers, ops people, and testers, and draw a very complex battle map. If a migration involved 9 modules, we drew all 9, and the process was a bit like writing a script for a play. I would say: A, you do this first; when you finish, run this check, confirm, notify B, and tell me at the same time. B then does his part, tells me when he is finished, and notifies C, and so on. Some steps had to be done by several of us at the same time.
As you can imagine, when A was done, B had to complete the next action within a few seconds, or things would go wrong. We did this about once a month, from 2 a.m. to 5 a.m., and then had to keep watching until 8 a.m. before going home to bed. It was very hard. Every time, people asked how this thing was configured, where the configuration file was, what each item meant, and so on. In the end we forced everyone to write documentation, and after they finished I ended up rewriting all of it.
Writing documentation costs a lot of time, and in most cases it is not written well. Even when a good document exists, it takes ops and testers a long time to read. Communication among these three roles is very costly and works very poorly.
How to solve this problem?
Answer: DevOps.
Allow operations to participate in development.
In fact, in our practice there are no separate test and ops roles. Everything is done by developers. The division of labor overlaps somewhat: say 5 people write a module that is divided into 5 small modules. The person building feature A might write 30% of feature A's tests, covering both test-case code and deployment code, while the person building feature B writes the other 70% of feature A's tests and part of feature A's deployment, so that everyone crosses over into each other's work.
When you write the tests and the deployment for a program, you can come to understand it more deeply than the person who wrote the code, because writing code is the simplest part of product development. In a good product, writing the code accounts for only about 30%; other factors account for the remaining 70%.
A good product is determined by the quality of your operations, not by the quality of the code you write.
The deployment scripts can actually contain much more code than the feature itself, and one code change can require adjusting a large number of test scripts, ops scripts, and deployment scripts. DevOps is not our invention; some teams abroad have practiced it for years. They have no separate testers or ops staff; everyone develops, operates, and tests.
Each member of our development team writes test cases and deployment scripts in addition to features. We hire full-stack engineers, because writing the test and deployment scripts makes you think again about the code you wrote.
A common problem in the past was that ops people would complain, "How did you write this program? The configuration is so complex, and I don't know what you're doing." Testers would say, "How did you define this interface? I can't understand it, so how am I supposed to write test cases for it?"
Now every engineer regularly reflects on whether the original code was reasonable and whether future maintenance was considered, which in turn improves the quality of the code.
The ideal way to deploy
Many traditional deployment tools appeared early on, such as Puppet, and they share a big paradox: they use a cluster, a set of things that is already deployed, to manage other things that are not yet deployed. It is a chicken-and-egg problem.
How do you set up the Puppet cluster itself? The original practice was for ops to prepare the machines by hand, installing and configuring everything, and that process was very costly.
My personal preference is that a new machine should only need to support SSH. The standard practice is to request a host through an API with the deployment machine's key installed, and then complete the deployment with sudo permission. In many cases the new machine is already configured this way when it is provisioned. Compared with a deployment system like Puppet, the cost is lower: there is no master and no agent, and it works as long as the deployment machine can connect to the target machine.
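To make the agentless idea concrete, here is a minimal sketch that builds the `ssh` command line a deployment machine could run against a freshly provisioned host. The helper name `ssh_sudo_cmd`, the `deploy` user, the key path, and the target address are hypothetical; the sketch only constructs and prints the command and does not connect to anything.

```python
import shlex
from typing import List


def ssh_sudo_cmd(host: str, remote_argv: List[str], user: str, key_path: str) -> List[str]:
    """Build an ssh invocation that runs one command with sudo on the target.

    The target needs nothing beyond sshd, the deployment key, and sudo rights:
    no master and no agent.
    """
    # ssh sends the remote command as a single string, so quote it here.
    remote = "sudo -n " + shlex.join(remote_argv)
    return [
        "ssh",
        "-i", key_path,           # deployment key (assumed path)
        "-o", "BatchMode=yes",    # fail instead of prompting for a password
        f"{user}@{host}",
        remote,
    ]


if __name__ == "__main__":
    cmd = ssh_sudo_cmd(
        "10.0.0.5",                       # hypothetical target machine
        ["apt-get", "install", "-y", "nginx"],
        user="deploy",
        key_path="~/.ssh/deploy_key",
    )
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run` would execute it; the sketch stops at printing so it can be read and checked without a real host.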
Ansible Overview
The process of finding Ansible
At first we wrote a lot of scripts that connected over SSH from one machine, so we did not have to log in to each target machine. For a considerable period this was what we actually used, and it was more effective than Puppet. But it had problems: management costs were high and the number of scripts kept growing. Many basic components had to be deployed over and over again, and the scripts became almost impossible to manage.
Later we used Rundeck, which has an interface and some management capability. We also used Fabric, which executes commands in bulk, for things like deployment. However, its management capability is not enough when the number of target machines is large. Then we looked at Salt and did not see much difference.
We chose Ansible mainly for its rich ecosystem, including many existing components and modules and open-source Ansible deployments and scripts. Our team does not like to agonize over choices. We found no essential difference between Ansible and the alternatives and simply started using it. When there is no decisive reason, go with the one that feels right.
Ansible connects to the target servers over SSH, and it has two features I had wanted for a long time. First, a complete cluster needs only one inventory to define it: you define the cluster by writing out the list of hosts. Second, the specific functions of each role are defined by a few playbooks.
Ansible Inventory
A made-up inventory example defines two host groups, one per role: Web-group and db-group.
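As an illustration of such an inventory, the sketch below expresses the two groups as an Ansible dynamic inventory script, which prints JSON when called with `--list`. The group names come from the text; the host names are hypothetical placeholders, and a static inventory file would serve equally well.

```python
import json
import sys
from typing import List

# Two groups named as in the text; the host names are made up.
INVENTORY = {
    "Web-group": {"hosts": ["web1.example.com", "web2.example.com"]},
    "db-group": {"hosts": ["db1.example.com"]},
    # Supplying _meta.hostvars lets Ansible skip a --host call per machine.
    "_meta": {"hostvars": {}},
}


def render(args: List[str]) -> str:
    """Return the JSON that Ansible expects from an inventory script."""
    if args and args[0] == "--list":
        return json.dumps(INVENTORY, indent=2)
    # --host <name>: per-host variables; none are defined in this sketch.
    return json.dumps({})


if __name__ == "__main__":
    print(render(sys.argv[1:]))
```

Saved as an executable file, a script like this can be passed to `ansible-playbook -i`, so the cluster definition lives in one place.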
A detailed playbook example: we use Ubuntu 12.04, so the playbook adds the Nginx PPA first.
Developing with Ansible mostly means writing a large number of playbooks. Ansible currently supports 251 modules, many of them for cloud services such as AWS, Docker, Rackspace, and OpenStack. Deployment scripts are placed in a subdirectory, which makes it very easy to reuse scripts or playbook definitions written by someone else. There are already a huge number of open-source Ansible scripts, more than 3,000 projects, and I believe the number will keep growing, because sharing is really simple: just copy the directory.
Demo Video
See the linked video; the demo starts at the 41-minute mark.
"This is one of the machines deployed in our data center. My personal preference is to give every machine a hostname. Each time we add a machine, we run a number of scripts, including one that updates the hosts file on the machines in the same network, so that machines are referred to by name rather than by IP. One of the more interesting points: I have a nogic here, and because I have to consider load balancing and each machine has different configuration items, you can pass parameters to specify the various configuration items, and if they are not specified you can write some scripts to fill them in.
"Let me show you the command we run almost every day, ansible-playbook, and we can look at the contents of the playbook. I've defined a number of roles, and all of my machines run common, which refreshes some basic modules on every run to avoid inconsistent dependencies.
"This is a top-level playbook. Because my roles also refer to other roles, we have split the playbooks into a few levels so the structure can be seen clearly. When I install Redis it depends only on this one; over here, emqtt is actually made up of three small roles. In a full playbook you can clearly see what each role is going to do.
"You can give each role an alias through tags. I can have all the hosts that carry a given tag execute, say, ubuntuprotobuf, just by using that tag. Before we run a command, we confirm which machines it will be executed on, then confirm which commands it will run, and only then start the execution. I'm not sure whether I should run it now; better not... well, let me show you. This is one of our production environments, and I'll run it for you to look at; there should be no big problem. It starts Playemqtt and collects some raw data. It collects a lot of data. You can see this step is called gathering facts: your system version and all kinds of platform information are collected. This information is used to determine your version so that different branches run on different platforms. For example, depending on your IP or other information, such as how many network interfaces you have, your configuration may need a different network interface setup, so all of this information is collected in the first step. Another important thing this step does is verify that the target machine can be connected to, in this protobuf process."
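The habit described in the demo, confirming the target machines and the tasks before executing, maps onto the `--list-hosts` and `--list-tasks` options of `ansible-playbook`. The sketch below only builds the command lines; the helper `playbook_cmd` and the file names `site.yml` and `hosts.ini` are assumptions, and the tag is the one mentioned in the demo.

```python
import shlex
from typing import List, Optional


def playbook_cmd(
    playbook: str,
    inventory: str,
    tags: Optional[List[str]] = None,
    preview: Optional[str] = None,
) -> List[str]:
    """Build an ansible-playbook command line.

    preview="hosts" lists the matching hosts and preview="tasks" lists the
    tasks; neither executes anything on the targets.
    """
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if tags:
        cmd += ["--tags", ",".join(tags)]
    if preview == "hosts":
        cmd.append("--list-hosts")
    elif preview == "tasks":
        cmd.append("--list-tasks")
    elif preview is not None:
        raise ValueError("preview must be 'hosts', 'tasks', or None")
    return cmd


if __name__ == "__main__":
    # Check the hosts, then the tasks, and only then build the real run.
    for mode in ("hosts", "tasks", None):
        cmd = playbook_cmd("site.yml", "hosts.ini", tags=["ubuntuprotobuf"], preview=mode)
        print(shlex.join(cmd))
```

Running the two preview commands first costs seconds and catches a wrong tag or inventory before anything touches production.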
About the author
Zhang Hu
Weibo: @Tiger_ Zhang Hu. Founder of Yunba (Yunba.io), a cloud backend service. Founder and former CTO of JPush. A member of the Oracle VM founding team.
Automated Operations and Maintenance Practice Based on Ansible