Baidu Network Monitoring in Practice: Netradar Makes Its Debut (Part 1)

Original: Https://mp.weixin.qq.com/s/VBShicsqReDtureKAdEgDA

Reposted from the subscription account "AIOps Intelligent Operations and Maintenance", with authorization to forward.

About the author: Yun Bei, senior R&D engineer at Baidu

Responsible for the design and development of the business side of Baidu's intranet quality monitoring platform (Netradar). Has extensive practical experience in system and network monitoring, anomaly detection on time-series metrics, and intelligent customer-service robots.

Overview

Baidu's intranet connects hundreds of thousands of servers and carries the network traffic of the whole company's business, so the importance of its communication quality is self-evident. Netradar (Network Radar), Baidu's intranet quality monitoring platform, monitors server "end-to-end" transmission quality across the entire intranet so that intranet problems can be detected, reported, and located quickly and accurately, providing a strong guarantee for the normal communication of Baidu's services.

This series introduces the Netradar platform in two parts. This first part covers why intranet quality monitoring matters, the related requirements, and Baidu's earlier intranet monitoring technologies. The second part will systematically introduce the Netradar platform in terms of its core functions, design framework, anomaly detection strategy, and visualization views.

Introduction to Baidu's Intranet

Baidu has hundreds of thousands of servers distributed across dozens of data centers (also called IDCs or machine rooms) around the country. These servers are interconnected through a hierarchical network to form a unified "resource pool" that provides reliable, powerful storage, computing, and communication services to the outside.

In terms of software architecture, Baidu's large-scale services are generally modular: a service needs a large number of upstream and downstream modules working together. To improve concurrent serving capacity and disaster tolerance, these modules are deployed on different servers in different data centers. To keep a service running normally, the intranet must provide good "end-to-end" network communication between modules. When a network failure disrupts communication between modules, the service is often affected and may even become entirely unavailable.

To provide highly reliable, high-performance end-to-end communication, the network is designed with a great deal of redundancy, in both devices and links. As a result, there can be many different paths between two servers at the same time, which protects against network failures to some extent. Even so, end-to-end communication problems remain common in real environments, including routing convergence delays, single-point failures of ToR switches, and network congestion. Moreover, even if the failure probability of any single device, cable, or server is very low, multiplying it by the huge number of components means that failures are bound to be a "normal" occurrence.
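The last point can be made concrete with a little arithmetic. The sketch below assumes that components fail independently and uses made-up numbers (a hypothetical per-component daily failure probability and component count, neither taken from Baidu) to show how quickly "at least one failure somewhere" becomes close to certain.

```python
def prob_any_failure(p_single: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_single) ** n_components


def expected_failures(p_single: float, n_components: int) -> float:
    """Expected number of failed components in the same period."""
    return p_single * n_components


if __name__ == "__main__":
    # Hypothetical values for illustration only, not Baidu's real figures.
    p = 1e-5       # assumed chance that one component fails on a given day
    n = 300_000    # assumed number of servers, cables, and switches
    print(f"P(at least one failure) = {prob_any_failure(p, n):.4f}")
    print(f"Expected failures per day = {expected_failures(p, n):.1f}")
```

With these assumed numbers, a failure somewhere in the network is expected almost every day, even though any single component is very reliable.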

In this environment of "living with failures", failures cannot be avoided, so the quality of the intranet must be monitored in a timely and accurate way. This is essential to keeping services running normally.

Requirements Research

In operations practice, what do engineers need from an intranet quality monitoring system? We surveyed the operations engineers of each business line as well as colleagues from the network team. To better illustrate users' needs, Figure 1 shows a typical operations scenario:

Figure 1: An operations scenario related to intranet issues

When an operations engineer finds that a service's key metrics are abnormal and suspects an intranet failure, troubleshooting starts by answering questions such as:
1) "Is there a problem with the network from data center A to data center B?"
2) "Is there a problem with the network from server A to server B?"

If the check confirms that the intranet is not at fault, the engineer should go on to investigate other possible causes, such as releases, operational changes, and program bugs, in order to make effective loss-stopping and recovery decisions. If, on the other hand, an intranet failure is found to have damaged the service, the network engineer must answer a series of communication questions to narrow down the scope of the failure and then diagnose and repair it, for example: "Which servers have communication problems?" and "Which link has the problem?" The most direct and effective way to answer these questions is to perform server "end-to-end" probing, for example:

1) To check "Is there a problem with the network from data center A to data center B?"

Test the network quality from most machines in data center A to most machines in data center B.

2) To check "Is there a problem with the network inside data center A?"

Test the network quality between most machines in data center A.

3) To check "Is there a problem with the network from server A to server B?"

Simply test the network quality from server A to server B.

4) To check "Which servers have communication problems?"

Run ping or ssh against the suspected servers.

5) To check "Which link has the problem?"

Run the traceroute command to view the route details.

Figure 2: Steps for measuring network quality manually

However, performing these tests manually is time-consuming and laborious. As Figure 2 shows, an end-to-end network quality measurement requires first determining the "source-destination" servers, then obtaining login permission for them, then logging in to run the various tests, and finally analyzing the data to obtain the result. Clearly, manual measurement scales very poorly and cannot meet the demand for large-scale measurement. A platform is therefore needed that performs measurement tasks automatically in real time and delivers the analysis results.
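To give a feel for the final "analyze the data" step, here is a minimal sketch that parses the summary lines printed by the Linux iputils ping command. The names `PingSummary`, `parse_ping_summary`, and `run_ping`, as well as the sample output, are illustrative assumptions. This is not how Netradar itself collects data, which this article does not describe.

```python
import re
import subprocess
from typing import NamedTuple, Optional


class PingSummary(NamedTuple):
    sent: int
    received: int
    loss_pct: float
    avg_rtt_ms: Optional[float]  # None when no reply came back


# Patterns for the summary lines printed by Linux iputils ping.
_COUNTS = re.compile(r"(\d+) packets transmitted, (\d+) received")
_LOSS = re.compile(r"([\d.]+)% packet loss")
_RTT = re.compile(r"rtt min/avg/max/mdev = [\d.]+/([\d.]+)/")


def parse_ping_summary(output: str) -> PingSummary:
    """Extract packet counts, loss, and average RTT from ping output."""
    counts = _COUNTS.search(output)
    loss = _LOSS.search(output)
    if counts is None or loss is None:
        raise ValueError("no ping summary found in output")
    rtt = _RTT.search(output)  # absent when every packet was lost
    return PingSummary(
        sent=int(counts.group(1)),
        received=int(counts.group(2)),
        loss_pct=float(loss.group(1)),
        avg_rtt_ms=float(rtt.group(1)) if rtt else None,
    )


def run_ping(host: str, count: int = 4) -> PingSummary:
    """Run ping from the local ("source") server and parse the result."""
    proc = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(proc.stdout)


# Made-up sample output in the iputils format, used instead of a live probe.
SAMPLE_OUTPUT = """\
--- 10.0.0.2 ping statistics ---
4 packets transmitted, 3 received, 25% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.210/0.305/0.412/0.071 ms
"""

if __name__ == "__main__":
    print(parse_ping_summary(SAMPLE_OUTPUT))
```

Even with a helper like this, someone still has to choose the server pairs, obtain login rights, and run the probe on each source machine, which is exactly the part that does not scale by hand.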

So what requirements must this platform meet? From our survey of business-line operations engineers and network engineers, we compiled the following requirements:

1) Continuous "end-to-end" monitoring

Because Baidu's business programs and modules are deployed on servers, and their network traffic is sent and received by those servers, the server "end-to-end" network quality reflects the actual impact of intranet conditions on business communication. From a business perspective, therefore, the platform should continuously monitor end-to-end network quality.

2) Full-coverage monitoring

In practice, operations engineers usually know which data centers their services are deployed in, but not which machines communicate with each other over the network, so what they care about is the global question of whether the network of those data centers is normal. In addition, network engineers are responsible for keeping the quality of the entire intranet reliable, so they need to monitor the performance of the whole intranet systematically, detect and repair as many network faults as possible, and reduce hidden risks.
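Full coverage does not have to mean probing every server pair: N servers form N(N-1) directed pairs, which is impractical at hundreds of thousands of servers. One possible approach, shown here purely as an assumption and not as Netradar's actual scheduling strategy, is to sample a few source-destination pairs for every ordered pair of data centers, including each data center with itself. The function name and the inventory format are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# (source data center, destination data center, source server, destination server)
ProbeTask = Tuple[str, str, str, str]


def sample_probe_pairs(
    servers_by_dc: Dict[str, List[str]],
    pairs_per_dc_pair: int,
    seed: int = 0,
) -> List[ProbeTask]:
    """Pick random server pairs so every ordered data-center pair is probed."""
    rng = random.Random(seed)
    dcs = sorted(servers_by_dc)
    tasks: List[ProbeTask] = []
    for src_dc in dcs:
        for dst_dc in dcs:
            for _ in range(pairs_per_dc_pair):
                if src_dc == dst_dc:
                    if len(servers_by_dc[src_dc]) < 2:
                        break  # one server cannot probe within its own data center
                    # Two distinct servers inside the same data center.
                    src, dst = rng.sample(servers_by_dc[src_dc], 2)
                else:
                    src = rng.choice(servers_by_dc[src_dc])
                    dst = rng.choice(servers_by_dc[dst_dc])
                tasks.append((src_dc, dst_dc, src, dst))
    return tasks


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
    for task in sample_probe_pairs(inventory, pairs_per_dc_pair=2):
        print(task)
```

Because the sampling is done with replacement, the same server pair can be drawn more than once in a cycle. Re-sampling with a different seed each cycle would spread the probes over more servers over time.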

3) On-demand monitoring tasks

In practice, specific monitoring tasks often have to be carried out according to the situation at hand, which calls for additional, targeted measurements. The monitoring platform therefore also needs to support on-demand monitoring.

4) Active alarms on detection results

Because network engineers are directly responsible for intranet quality, they want the monitoring platform, while measuring "end-to-end" communication performance, to analyze the data, judge whether the network is normal, and send alarms promptly when network anomalies are detected, so that business-line services keep running normally.
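As a rough illustration of what "analyze the data and judge whether the network is normal" could look like, the sketch below aggregates per-server-pair packet loss into a decision per data-center pair. The function name, the input format, and both thresholds are assumptions made for this example, not Netradar's actual rule. The idea is that one lossy server pair may be a server problem, while many lossy pairs on the same data-center pair point to the network.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (source data center, destination data center, packet loss in percent)
ProbeResult = Tuple[str, str, float]


def dc_pair_alarms(
    results: Iterable[ProbeResult],
    loss_threshold_pct: float = 1.0,   # assumed: a pair is "bad" above this loss
    bad_fraction: float = 0.5,         # assumed: alarm when this share is bad
) -> List[Tuple[str, str]]:
    """Return the data-center pairs where enough server pairs show loss."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for src_dc, dst_dc, loss_pct in results:
        key = (src_dc, dst_dc)
        total[key] += 1
        if loss_pct > loss_threshold_pct:
            bad[key] += 1
    return sorted(k for k in total if bad[k] / total[k] >= bad_fraction)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    sample = [
        ("A", "B", 0.0), ("A", "B", 30.0), ("A", "B", 50.0),
        ("A", "A", 0.0), ("A", "A", 0.0),
        ("B", "A", 100.0), ("B", "A", 0.0), ("B", "A", 0.0),
    ]
    for src_dc, dst_dc in dc_pair_alarms(sample):
        print(f"ALARM: network from {src_dc} to {dst_dc} looks abnormal")
```

In the sample, two of the three pairs from A to B are lossy, so that direction raises an alarm. The single lossy pair from B to A stays below the assumed fraction and does not.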

5) Customized display for product businesses

Because a product business is usually deployed in only some data centers and depends on only part of the network, operations engineers often do not care about the parts they are not responsible for. The monitoring system therefore needs to support customized display, so that operations engineers can quickly get the network status information they care about.

So, can Baidu's existing intranet monitoring technologies meet these requirements?

Existing Monitoring Technologies

In fact, Baidu already uses several intranet quality monitoring technologies. They obtain intranet quality data through different measurement methods, analyze it, and then judge whether the network is normal. Table 1 describes three existing monitoring technologies.

Table 1: Principles and shortcomings of existing monitoring technologies

Technology 1

Monitoring principle: Monitor switch-level failures using the switches' syslog.

Shortcomings:
1) Switch-level failures do not accurately reflect the network performance perceived by the business.
2) Syslog cannot record every switch failure.
3) Network anomalies not caused by switch faults cannot be detected.

Technology 2

Monitoring principle: Deploy dedicated probe servers connected to the core switch of each IDC; the servers actively measure network performance between IDCs by sending packets to each other.

Shortcomings:
1) Monitoring of network communication inside an IDC is missing.
2) The network performance measured between IDCs differs from the network performance experienced by the business.
3) Resource overhead is high, so the approach cannot be scaled out directly.

Technology 3

Monitoring principle: Deploy probes on all online servers and set up one target server in each IDC; every online server measures its network status to each target server.

Shortcomings:
1) The target server is a single point of failure and does not represent the network conditions of its data center well.
2) The internal topology of the data center is not fully covered.
3) On-demand probing is not supported.

These technologies have played a role in intranet quality monitoring and operations, but shortcomings were found in use, and they cannot meet the requirements above. Building on the experience gained from them, we therefore developed a new platform, Netradar (Network Radar). Compared with the technologies above, Netradar has the following advantages:

Wide coverage: probing agents are deployed on the Linux servers across the whole network, covering all of Baidu's intranet data centers.

Multiple levels: the network quality of the whole intranet is monitored continuously, including the network quality between data centers, between clusters within a data center, and between ToR switches within a cluster.

Complete metrics: network quality is evaluated in several ways, distinguishing QoS queue, protocol, and statistic, for a total of 27 network quality metrics; each detection cycle produces nearly a million metric values.

Accurate detection: the metrics are checked by an adaptive anomaly detection algorithm, from which network events at the data-center and region level are then generated.

In addition, Netradar supports on-demand probing and provides an "end-to-end" probing interface and a fault event interface covering the whole intranet, to help engineers quickly diagnose network problems.
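This part of the series does not describe the adaptive anomaly detection algorithm itself. To illustrate the general idea of an adaptive threshold, the sketch below substitutes a common textbook technique: compare a new value with the median of recent history, scaled by the median absolute deviation (MAD). The function name and the parameters `k` and `min_scale` are assumptions for this example and should not be read as Netradar's method.

```python
from statistics import median
from typing import Sequence


def is_anomalous(
    history: Sequence[float],
    value: float,
    k: float = 5.0,           # assumed sensitivity: allowed deviations from baseline
    min_scale: float = 0.05,  # assumed floor so a flat history is not oversensitive
) -> bool:
    """Flag value if it is far from the recent baseline of a non-empty history.

    The threshold adapts to each metric: a noisy history widens the band,
    a stable history narrows it.
    """
    baseline = median(history)
    mad = median([abs(x - baseline) for x in history])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    scale = max(1.4826 * mad, min_scale)
    return abs(value - baseline) > k * scale


if __name__ == "__main__":
    # Made-up round-trip times in milliseconds for illustration only.
    stable_rtt = [0.30, 0.31, 0.29, 0.30, 0.32, 0.30, 0.31]
    noisy_rtt = [1.0, 9.0, 3.0, 7.0, 5.0, 2.0, 8.0]
    print(is_anomalous(stable_rtt, 0.33))  # small change on a stable metric
    print(is_anomalous(stable_rtt, 5.0))   # large jump on a stable metric
    print(is_anomalous(noisy_rtt, 5.0))    # same value on a noisy metric
```

The same reading of 5.0 ms is flagged on the stable series but not on the noisy one, which is the property a fixed global threshold cannot provide across nearly a million metric values.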

Summary

We hope this article has given you some understanding of Baidu's intranet quality monitoring. Next, we will publish the second part of this series, which systematically introduces the Netradar platform. Please stay tuned to AIOps Intelligent Operations and Maintenance!
