Building on many years of stable operation, and to keep the business from ever going offline, he led the operations team in developing its own operations platforms, including asset management, work order management, monitoring, domain name management, public cloud management, and private cloud management. The team also analyzes and organizes operations data to make operations work transparent and visible.
This time, I will mainly introduce how our monitoring system changed as we grew from dozens of servers to thousands. It is often said that there are a thousand Hamlets in a thousand people's eyes; likewise, a thousand operations engineers have a thousand ways of doing operations. No single method is universal or fits every scenario, so each specific problem must be analyzed on its own terms. I roughly divide my five years of experience into three stages:
The first stage: fewer than 200 machines
The second stage: 200 to 1,000 machines
The third stage: 1,000+ machines (1,000 and 2,000 are not materially different)
The boundary between stages is not precise. Each is an approximate period, and the change is gradual.
First, the stage with fewer than 200 machines
The requirements in this period are simple: monitoring is mainly used to report problems and to locate and solve them quickly. There are three main requirements:
1. Simple and easy to use;
2. Stable operation;
3. Able to send alarms by email and text message.
Given these requirements, you can use popular open source monitoring software such as Nagios, Cacti, Zabbix, or Ganglia. Popular open source products have more documentation, are quick to adopt, and come with a lot of accumulated community experience, which helps you avoid many problems and makes solutions easy to find when problems do occur. Email alarms are generally supported out of the box, while text messages require integration with an SMS platform.
We chose Nagios and Cacti in the early days. I chose Nagios mainly for a personal reason: it was the tool I knew best. I chose Cacti because it makes monitoring switches particularly convenient, almost foolproof. In fact, at this stage almost any monitoring product can meet the needs, so the choice comes down to personal preference. In this period, operations engineers can afford to indulge themselves a little.
Second, the stage with 200 to 1,000 machines
During this period, requirements began to get more complicated, but monitoring was still mainly used for notifications and alarms and for keeping the same problem from happening again. I mainly did the following:
1. Unified monitoring content: standardize basic monitoring. By default, every machine has basic monitoring of CPU, memory, and disk space;
2. Full coverage: bring every machine under monitoring. Beyond basic monitoring, the most important part is business monitoring: cover business processes as fully as possible, and use custom checks to reduce and eliminate recurring problems so that the business runs stably;
3. Timely notification with no missed alarms: classify all monitoring items and notify through different channels, such as email, WeChat, SMS, and phone, according to importance and urgency. For important business, we call repeatedly until the notification gets through.
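The level-based notification in item 3 can be sketched roughly as follows. This is only an illustration: the severity names, the channel mapping, and the `phone_attempts` helper are assumptions, not the mapping we actually used.

```python
from __future__ import annotations

from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical alarm levels, from least to most urgent."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


# Assumed mapping: each level adds a more intrusive channel.
CHANNELS_BY_SEVERITY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat"],
    Severity.CRITICAL: ["email", "wechat", "sms"],
    Severity.EMERGENCY: ["email", "wechat", "sms", "phone"],
}


def channels_for(severity: Severity) -> list[str]:
    """Return the notification channels for an alarm of the given severity."""
    return list(CHANNELS_BY_SEVERITY[severity])


def phone_attempts(severity: Severity, max_attempts: int = 5) -> int:
    """Only the most urgent alarms are phoned, and they are retried repeatedly."""
    return max_attempts if severity is Severity.EMERGENCY else 0
```

Keeping the mapping in one table makes it easy to review which kinds of alarm are allowed to wake someone up with a phone call.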
During this period, I studied Nagios in depth, wrote custom scripts, added a wide variety of monitoring items, and made full use of most Nagios add-ons and features, such as NRPE and NSCA.
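As an illustration of what such a custom script looks like, here is a minimal sketch of a disk check that follows the standard Nagios plugin convention: one line of status text on stdout, and exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The thresholds and the function names are assumptions for the example, not our production scripts.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/"):
    """Measure usage of the filesystem containing `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return evaluate(100.0 * usage.used / usage.total)


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code
```

A script deployed behind NRPE would end with `sys.exit(main(sys.argv))` so that Nagios receives the exit code.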
As the number of machines grew, more and more services needed monitoring, alarm volume exploded, and I was receiving thousands of alarm emails every day. One anecdote: I may have been the first person to break Tencent's corporate mailbox. It was not the storage capacity that ran out. The number of emails exceeded the maximum their database could hold, so for a week I could neither send nor receive email and had to find ways to delete messages.
At the end of this stage, as we approached 1,000 machines, Nagios's monitoring functions could no longer meet our needs, and its graphing had always fallen short, so I began to think about what to do beyond 1,000 machines. There were two paths ahead:
1. Continue developing Nagios in depth to fit our own needs;
2. Build our own monitoring system.
At this point, some readers will think that switching to another open source monitoring product would solve the problem. The biggest limitation of open source software is that you can use only the functions it provides; anything it lacks you must either develop yourself or give up. The flood of alarms was only the trigger. After a long period of use and accumulation, general-purpose open source monitoring products could no longer fully meet our large and complex needs.
After long and careful consideration, I decided to build a monitoring system myself. What made this realistic was that I already understood Nagios's overall architecture and operating model in depth, so doing it myself did not seem impossible.
Third, the stage with more than 1,000 machines
After this initial thinking and preparation, we started developing our own monitoring system to address the pain points and meet the requirements. The main tasks were:
1. Match every Nagios feature currently in use: compare against Nagios, cover its existing functions, fix and improve on its problems, and only upgrade further once Nagios has been replaced. (This first step is the most important. If you cannot replace what Nagios already did, the self-built route ends here.)
2. Organize alarms to simplify them and reduce duplicates: once alarms start flooding in, failing to sort them out promptly will inevitably delay the things that really need handling. Some causes, such as line problems, also produce repeated alarms, so alarm information must be processed before it is sent. Alarm volume dropped from more than 3,000 per day to about 300 per day.
3. Separate alarming from display: earlier monitoring systems generally combined the alarm and display functions, and information from different data centers had to be aggregated at a central node before it could be displayed or alarmed on. Important alarms must be handled within seconds, and that has nothing to do with the display interface, so in the design I separated the two: alarms are raised in the local data center, and display is then centralized.
4. Distributed deployment to avoid single points of failure: each data center has a sub-node, which is the alarm node mentioned above, and there is one central node. Alarms are raised first in each data center and then aggregated and displayed at the center. The sub-nodes and the central node back each other up and are switched through intelligent DNS. If the central node goes down, DNS automatically switches to a sub-node, which is promoted to central node.
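One common way to cut repeated alarms, as in item 2, is to suppress repeats of the same alarm within a time window. The sketch below shows that idea only; the `AlarmDeduplicator` class, the `(host, check)` key, and the 300-second window are assumptions, not a description of our actual processing.

```python
class AlarmDeduplicator:
    """Suppress repeats of the same alarm inside a time window."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        # (host, check) -> timestamp of the last notification actually sent
        self._last_sent = {}

    def should_send(self, host, check, now):
        """Return True if this alarm should be sent at time `now` (seconds)."""
        key = (host, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alarm was already sent recently
        self._last_sent[key] = now
        return True
```

For example, with a 300-second window, a line problem that triggers the same ping alarm every minute produces one notification every five minutes instead of five.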
Q&A
Q: Is the underlying layer still Nagios?
A: No, it was all written from scratch. It borrows ideas from Nagios, but the collection method and the aggregation method are different.
Q: How much resource does this monitoring consume?
A: Not too much. We did hit some bottlenecks in centralized data display and processing, and we keep optimizing.
Q: Is the intelligent DNS system developed by yourself?
A: We use a third-party intelligent DNS, and we also run our own as a backup.
Q: Is your database a MySQL cluster?
A: MySQL master-slave. Another reason for separating alarming from display was concern about performance. Display can lag by a few seconds or even minutes, but alarms cannot, so alarms are sent immediately, and we do not have to worry about going blind if a monitoring machine goes down. We currently have 6 nodes distributed across the country, and the chance of all of them going down at once is very small. As long as one of them is alive, alarms can still be sent.
Q: Is the alarm latency measured in seconds?
A: Yes, seconds. The slowest notification is a phone call, which takes a dozen or so seconds.
Q: Do you only use Monitor Bao? Do you also use Perspective Bao?
A: We are evaluating Perspective Bao.
Q: What metrics do you collect from switches?
A: CPU, memory, warning information, traffic, and ports.
Q: How does business monitoring work?
A: Business monitoring is actually similar to Perspective Bao, but not as fine-grained.
Q: Is it done by instrumenting the program?
A: No, there is no instrumentation in the program. It is built from monitoring data, so it works only at the symptom level, not at the code level.
Q: How many operations staff does the company have?
A: Eight in total, including me.
Q: How is day-to-day operations work divided among products?
A: Early on we divided the work by product. After automation was completed in the second stage, the division became basically arbitrary, and everything goes through the work order system. A routine release goes live automatically once it is approved in our self-built work order system, with no involvement from operations.
Q: What is your physical machine configuration?
A: Even the minimum configuration is dual 6-core with 64 GB.
Q: Have you ever encountered a situation where the server is normal, the middleware and database are normal, and the online business suddenly fails?
A: You may need Perspective Bao for that.
Q: Can Perspective Bao monitor congestion on the network's egress bandwidth?
A: Perspective Bao is mainly used for application performance monitoring. It is like a CT scanner for application systems: it collects performance data on the real user experience on mobile devices and in browsers, the application's runtime environment on the server, database access, and application code, then uses big data techniques to quickly diagnose and analyze the collected data, find the "lesions" that affect application performance, and give diagnostic recommendations. Network link monitoring is handled by Monitor Bao. Combining the two gives full-link service monitoring and problem diagnosis from the user to the server.
Q: Have you ever encountered business failures caused by intranet problems?
A: She should be able to help you; she knows this in detail. Perspective Bao can solve internal problems and Monitor Bao can solve external ones, so the two combined should cover it. You can also check the switches for STP network oscillation. I have run into this.
Q: What is STP network oscillation? If there is a network problem, shouldn't everything else be affected too?
A: Switches on the network, because of packet changes or timer timeouts, repeatedly trigger recalculation, cycling through the three processes of root bridge selection, port role switching, and port state transition. When this process repeats many times, it is called STP oscillation.
To put it simply, if a module or a network cable is faulty, the network will go up and down frequently.
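Frequent ups and downs like this can be spotted from monitoring data by counting state changes in a sliding time window. The following is a minimal sketch under assumed names and thresholds (`FlapDetector`, 4 transitions in 60 seconds); it is not the detection logic of any particular product.

```python
from collections import deque


class FlapDetector:
    """Flag a port as flapping when its state changes too often in a window."""

    def __init__(self, max_transitions=4, window_seconds=60.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._last_state = None
        self._transitions = deque()  # timestamps of recent state changes

    def observe(self, state, now):
        """Record a state sample ("up"/"down"); return True if flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(now)
        self._last_state = state
        # Forget state changes that have fallen out of the window.
        while self._transitions and now - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions
```

Feeding it the polled state of a switch port turns a stream of individual up/down alarms into a single "port is flapping" signal, which points toward a faulty module or cable.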