Dissecting the Monitoring System (III)

Source: Internet
Author: User

This installment is about high availability. You are welcome to join the DevOps development discussion group, group number 365534424.

I'll borrow a definition of high availability: High availability (HA) means increasing the availability of systems and applications by minimizing downtime caused by routine maintenance (planned) and sudden system crashes (unplanned). It differs from fault-tolerant technology, which is understood as operation without any interruption. An HA system is currently the most effective way for enterprises to guard against failures of core computer systems.

A highly available system is one with high availability. But how high is high? One of the BAT companies (Baidu, Alibaba, Tencent), where I used to work, liked to measure it by the number of 9s after the decimal point, with the part before the decimal point being 99 by default, so two 9s means 99.99%. It is generally considered that four 9s is quite good, and five 9s is what a core business should have. How availability is calculated is covered below.
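To make the "number of 9s" concrete, here is a minimal sketch that turns an availability figure into a yearly downtime budget. It follows the convention used in this article (99 before the decimal point, N nines after it). The function names are made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_from_nines(nines_after_decimal: int) -> float:
    """Availability as a fraction, using the article's convention:
    99 before the decimal point plus N nines after it (N=2 -> 99.99%)."""
    return 1.0 - 10.0 ** -(nines_after_decimal + 2)


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (2, 4, 5):
        a = availability_from_nines(n)
        print(f"{n} nines after the decimal point: {a * 100:.{n}f}% "
              f"-> {downtime_minutes_per_year(a):.2f} min/year")
```

At 99.99% the budget works out to 0.0001 × 525,600 = 52.56 minutes of downtime per year, and every additional 9 divides that budget by ten.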

I think it is a blessing for an ops engineer to run a highly available system. When the system is highly available, alarms do not have to be handled with such high priority. On receiving an alarm on a cold winter night, you do not have to crawl out of bed immediately and connect to the VPN to deal with it. The higher the availability, the more soundly the ops engineer sleeps.


Factors that affect availability and how to calculate it

What are the factors that affect availability? Let's sort them out. A service consists of software and hardware. Let's use a website as an example.

Suppose there is a website with the domain name www.51reboot.com. It runs a single Nginx instance on one server, and that server sits in a telecom data center called ZW.

What steps take place between the user typing www.51reboot.com into the browser and the page content showing up in the browser?

Step 1: Domain name resolution.

Step 2: The browser sends an HTTP request toward the server.

Step 3: The request travels over the network and reaches the switch in the data center.

Step 4: After passing through several layers of switches, the data arrives at the server's network card.

Step 5: The data passes from the network card through the OS and reaches Nginx.

Step 6: Nginx receives the HTTP request.

Step 7: Nginx calls the ThinkPHP framework (assuming this is the framework in use).

Step 8: PHP connects to the MySQL database to fetch data.

Step 9: PHP processes the data.

Step 10: Nginx returns the data to the client.

Step 11: The user's browser finishes receiving the data and renders the page.

Roughly speaking, that is 11 steps. They involve: DNS, the server, switches, the OS, Nginx, ThinkPHP, PHP, and MySQL.

The server can be further broken down into the disk and other components such as CPU and memory. The disk is singled out because, as the storage component, it is the part most prone to failure.

OK, so let's work out the availability of the 51reboot site. The site counts as available only when all 11 steps above work normally. So we need to consider the availability of DNS, of the server (the fraction of time it is not down), of software such as Nginx, of MySQL, of the disk, of the network, and so on. The product of these availabilities is the availability of the site.

Further factors are not considered here, for example MySQL and the web server not being on the same machine or even in the same data center, or more than one Nginx instance being deployed.
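The product rule above can be sketched in a few lines. The per-component figures below are invented for illustration, not measurements of any real site. The formula for redundant instances is an addition that goes beyond the simple chain and assumes that instances fail independently of one another.

```python
from math import prod


def serial_availability(components: dict[str, float]) -> float:
    """Every component must work, so the availabilities multiply."""
    return prod(components.values())


def redundant_availability(single: float, instances: int) -> float:
    """N redundant instances, assuming independent failures:
    the group is down only when every instance is down."""
    return 1.0 - (1.0 - single) ** instances


# Hypothetical per-component availabilities, for illustration only.
chain = {
    "dns": 0.9999,
    "network": 0.9999,
    "server_and_os": 0.999,
    "disk": 0.999,
    "nginx": 0.9999,
    "php_thinkphp": 0.9999,
    "mysql": 0.999,
}

if __name__ == "__main__":
    print(f"single chain: {serial_availability(chain):.4%}")
    # Replace the single Nginx with two independent instances.
    chain_ha = dict(chain, nginx=redundant_availability(chain["nginx"], 2))
    print(f"two Nginx instances: {serial_availability(chain_ha):.4%}")
```

Because every factor is below 1, the product is always lower than the weakest component in the chain, which is why each link has to be looked at.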

That is how availability is calculated. Now let's look at the high availability of the monitoring system.


First qualitative, then quantitative

This is my principle: first decide qualitatively whether something is needed, then quantify how much of it is needed. The job of a monitoring system is to check whether other services and systems are working properly. If the monitoring system itself is not available enough, monitoring quality suffers badly. One could even say it becomes useless, or outright dangerous.

A related problem is monitoring the monitoring system itself. That will be covered in a later installment.

As stated above, the monitoring system itself must be highly available. So how should this be quantified? How many 9s after the decimal point do readers think it should reach? I personally think at least two 9s, that is, 99.99%. Otherwise colleagues in the business units will get anxious: if their system is required to reach 99.999%, no one can prove it when the monitoring system itself is only at 99.99%.

How to achieve 99.99% availability

This is the other core issue for today. Achieving high availability depends on the system's architecture. Systems fall into two types according to whether they contain a single point: architectures with single points and architectures without. First, what is a single point? Simply put, it is a component that exists as exactly one instance when deployed in the system: it has no sibling instances and cannot be scaled out. Such a component is the most likely to become a chronic problem later, because when it fails there is no sibling to take over.

But let me stress one thing: having no single point does not mean there are no problems at all. For example, say we deploy two Nginx instances and balance load between them by hashing. When one instance goes down, only 50% of requests are affected, so the system as a whole cannot be counted as down. But its stability is still worrying, and we cannot claim its availability is high. If the hash calculation takes instance health into account and unhealthy instances are removed from the hash pool automatically, availability improves greatly.
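The idea of removing unhealthy instances from the hash pool can be sketched as follows. This is a minimal illustration, not how Nginx implements it: the `HashPool` class and the instance names are hypothetical, and the health check itself is assumed to happen elsewhere and simply report its result through `mark`.

```python
import hashlib


class HashPool:
    """Hash-based balancing that skips instances marked unhealthy."""

    def __init__(self, instances: list[str]) -> None:
        self.health = {name: True for name in instances}

    def mark(self, instance: str, healthy: bool) -> None:
        """Record a health-check result (the check itself is out of scope)."""
        self.health[instance] = healthy

    def pick(self, key: str) -> str:
        """Map a request key onto one of the currently healthy instances."""
        pool = sorted(name for name, ok in self.health.items() if ok)
        if not pool:
            raise RuntimeError("no healthy instance left")
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return pool[int.from_bytes(digest[:8], "big") % len(pool)]


if __name__ == "__main__":
    pool = HashPool(["nginx-1", "nginx-2"])
    keys = [f"client-{i}" for i in range(1000)]
    pool.mark("nginx-2", False)  # nginx-2 fails its health check
    assert all(pool.pick(k) == "nginx-1" for k in keys)
    print("all requests routed to nginx-1 after nginx-2 was removed")
```

With a plain modulo over the healthy pool, some keys move to a different instance whenever the pool changes. That is acceptable for stateless requests. If keys need to stay put, a consistent-hashing scheme would be the usual alternative.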

Following on from the previous article, the monitoring system must use a decentralized architecture to achieve high availability, that is, no link anywhere in the system may be a single point. A single point means a bottleneck, and it makes availability hard to improve.

Specifically, how do we remove single points? Let's analyze the data flow of the monitoring system itself.

First, data collection, which falls into two categories. One is the in-band agent that runs on the OS. Its high availability is a different topic, namely how to write a robust, highly available client. We will cover that separately.

The other is out of band, for example HTTP monitoring, port monitoring, or liveness monitoring. Take liveness monitoring: we use ping to check whether servers are alive, so we need a detector that sends ping packets in batches. This is also a data collection endpoint, just one that does not run on the monitored OS. It has robustness concerns of its own, but that is another topic we will not go into here. If this detector goes down, the servers it monitors are effectively unmonitored. So we have to improve the availability of the ping monitoring module, or rather of the whole ping call chain. The simplest way is to use two monitoring points to monitor the same batch of servers. But that raises a new problem: with two monitoring points watching the same server, under what circumstances can we conclude that the server is down? This is a monitoring data merging problem, which we will discuss in the next chapter. Another way is to deploy two ping monitoring points that do not work at the same time: the two nodes exchange heartbeats, and when one goes down the other takes over. That also works, although the switchover is slightly more complicated.
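The heartbeat-and-takeover variant might look roughly like the sketch below, seen from the standby node. The class name, the timeout parameter, and the automatic hand-back when the primary recovers are all assumptions made for illustration. How heartbeats are transported between the two nodes is left out entirely.

```python
import time
from typing import Optional


class StandbyMonitor:
    """Standby ping node that takes over when the primary's heartbeat goes stale."""

    def __init__(self, timeout_s: float, now: Optional[float] = None) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic() if now is None else now
        self.active = False

    def on_heartbeat(self, now: Optional[float] = None) -> None:
        """Called whenever a heartbeat from the primary arrives."""
        self.last_heartbeat = time.monotonic() if now is None else now
        self.active = False  # primary is alive again, hand control back

    def tick(self, now: Optional[float] = None) -> bool:
        """Periodic check; returns True while this node should do the pinging."""
        now = time.monotonic() if now is None else now
        if now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active


if __name__ == "__main__":
    standby = StandbyMonitor(timeout_s=3.0, now=0.0)
    print(standby.tick(now=2.0))   # heartbeat still fresh
    print(standby.tick(now=3.5))   # heartbeat stale, standby takes over
    standby.on_heartbeat(now=4.0)  # primary is back
    print(standby.tick(now=5.0))
```

The `now` parameter exists only so the logic can be exercised without waiting; in normal use it is omitted and `time.monotonic()` is used. A real switchover also has to handle the case where the heartbeat link fails while both nodes are alive, which is part of why the text calls the switch slightly complicated.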

Once the data is collected, it has to be processed. For the high availability of this computation stage there are many ready-made solutions. The first is to deploy two compute instances, which must either be able to back each other up or work at the same time. The second is a fully distributed approach.

For storage and the web tier there are even more distributed solutions to choose from.

One of the most critical points remains: how to make components stateless. Only a stateless component can be deployed and switched over simply enough to support high availability. We will leave that for next time.

To be continued.



This article comes from the "Reboot Operation Development" blog. Please contact the author before reproducing it.
