Outlier Server Detection Technology Used by Netflix
It was early morning, and half of our technical support team was still investigating why Netflix was returning errors. The system appeared to be running normally, and we could not find anything wrong. After an hour of checking, we finally traced the problem to a single server in the data center. We had been looking for something obvious, and with tens of thousands of servers in the data center, we had overlooked this one small troublemaker.
The main character of the series "Daredevil" is blind, but his other senses are extremely sharp. This lets him pick up on subtle irregularities in a person's behavior and tell whether that person is lying. We have built a system that works in a similar way: it looks for minor differences between servers. The differences are small, but they are often exactly where the problems lie.
This article describes our automated outlier detection technology and how problematic servers are remediated. Without it, we might still be staying up all night fighting fires.
There are tens of thousands of servers running the Netflix service. Generally, no more than 1% of them are problematic. For example, a server with a network problem may add latency to user connections. The server is unhealthy, yet it still passes its health checks.
In fact, it would be better if the problematic server simply went down. If it failed outright, the existing monitoring system and our engineers would notice the crash. Instead it stays up and degrades the user experience, and our customer service team still has to take calls from unhappy users. With tens of thousands of servers, there are always a few that misbehave.
Each colored line represents the error rate of one server. Every line spikes and then drops back to zero, but the purple line shows a server whose error rate is consistently higher than that of the others. Can you tell that the purple line is an outlier server? Is there a way to use this time series data to automate outlier detection?
One simple approach is to set a threshold and trigger an alarm whenever the error rate rises above it. However, this only works for servers with very high error rates. Because every server's data contains spikes, a threshold low enough to catch a subtle outlier also produces many false alarms. In the figure below, it is difficult to find a suitable threshold. In addition, the threshold has to be adjusted regularly, because traffic patterns and server load change over time. The real breakthrough in improving reliability is to automatically detect servers that are faulty but cannot be caught by a threshold.
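A minimal sketch of the threshold problem follows. The server names and error rates are made up for illustration, and `servers_over_threshold` is a hypothetical helper, not part of any Netflix system. It shows that a high threshold misses the persistently elevated server, while a low threshold flags every server because all of them spike.

```python
# Hypothetical error rates per sampling interval; all values are invented.
error_rates = {
    "server-a": [0.0, 0.1, 0.9, 0.1, 0.0, 0.0],  # healthy, one load spike
    "server-b": [0.0, 0.2, 0.8, 0.0, 0.1, 0.0],  # healthy, one load spike
    "server-c": [0.3, 0.4, 0.9, 0.4, 0.3, 0.4],  # persistently elevated
}

def servers_over_threshold(rates_by_server, threshold):
    """Return the servers whose error rate exceeds the threshold at any sample."""
    return sorted(
        name for name, rates in rates_by_server.items()
        if any(rate > threshold for rate in rates)
    )

# A high threshold never fires, so the elevated server goes unnoticed.
high = servers_over_threshold(error_rates, 0.95)
# A low threshold fires for every server because of the shared spike.
low = servers_over_threshold(error_rates, 0.5)
print(high, low)
```

Neither setting isolates server-c, which is the server we actually care about.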
To solve this problem, we use cluster analysis. The basic idea of cluster analysis is to group highly similar samples into the same cluster. It is an unsupervised technique, so we do not need to provide labeled data. There are many clustering algorithms. Here we use Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
DBSCAN algorithm principle
The DBSCAN algorithm was proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996, and it is regarded as a classic clustering algorithm. DBSCAN iterates over all data points. Points that have many close neighbors are grouped into the same cluster, and points in sparse regions are treated as noise. To decide which data points count as neighbors, we need a way to measure the distance between them. An animated visualization of DBSCAN running is also available if you are interested.
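The principle can be sketched in a few lines of plain Python. This is an illustrative implementation of the textbook algorithm, not the code Netflix runs. It assumes Euclidean distance, and the sample points and the parameter values `eps=1.5` and `min_pts=3` are invented for the example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        cluster += 1                      # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Two dense groups and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point at (5, 5) has no close neighbors, so it receives the noise label. That noise label is what makes DBSCAN useful for finding outliers.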
Using the DBSCAN algorithm to find outlier servers
To identify an outlier server, we first need to choose a metric, such as the error rate mentioned earlier. Next, we collect a window of time series data and run DBSCAN on it to find the servers that do not belong to any cluster. For example, in the figure below, the pink part is the window collected from the Netflix time series data platform.
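As a rough sketch of this step, each server's window of error rates can be treated as one point, with Euclidean distance between windows. The function below returns the servers DBSCAN would label as noise, using the fact that a noise point is neither a core point nor within `eps` of one. The server names, data, and parameter values are assumptions for illustration, and the distance measure used in production may differ.

```python
import math

def find_outliers(series_by_server, eps, min_pts):
    """Return the servers that DBSCAN would label as noise.

    A point is noise when it is not a core point (fewer than min_pts points
    within eps, itself included) and no core point lies within eps of it.
    """
    names = list(series_by_server)
    vectors = [series_by_server[name] for name in names]
    neighbors = [
        [j for j, v in enumerate(vectors) if math.dist(u, v) <= eps]
        for u in vectors
    ]
    core = [len(ns) >= min_pts for ns in neighbors]
    return [
        names[i] for i, ns in enumerate(neighbors)
        if not core[i] and not any(core[j] for j in ns)
    ]

# Hypothetical error-rate window for five servers; all values are invented.
window = {
    "server-a": [0.0, 0.1, 0.9, 0.1, 0.0, 0.0],
    "server-b": [0.0, 0.2, 0.8, 0.0, 0.1, 0.0],
    "server-c": [0.1, 0.1, 0.9, 0.1, 0.0, 0.1],
    "server-d": [0.0, 0.1, 0.8, 0.1, 0.1, 0.0],
    "server-e": [0.3, 0.4, 0.9, 0.4, 0.3, 0.4],  # persistently elevated
}
outliers = find_outliers(window, eps=0.3, min_pts=3)
print(outliers)
```

All five servers share the same spike, so a threshold cannot separate them, but server-e sits far from the dense group formed by the other four and is reported as the only outlier.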
In addition to the metric, we also need to specify the minimum duration a server must deviate before it is marked as an outlier. When an outlier is detected, our alerting system can respond in the following ways:
- Send an email to or page the person in charge
- Take the server out of service without stopping it
- Collect data from the server for further investigation
- Stop the server and let the auto scaling system replace it
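The minimum-duration rule can be sketched as a small tracker that counts consecutive detection windows. `OutlierTracker` and its parameter `min_consecutive` are hypothetical names introduced for this example, and measuring duration in windows rather than in wall-clock time is an assumption. A server is handed to the alerting actions listed above only once its streak reaches the minimum.

```python
from collections import defaultdict

class OutlierTracker:
    """Confirms a server only after several consecutive outlier windows."""

    def __init__(self, min_consecutive):
        self.min_consecutive = min_consecutive
        self.streaks = defaultdict(int)

    def update(self, outliers):
        """Record one detection window; return servers confirmed in this window."""
        outliers = set(outliers)
        for name in list(self.streaks):
            if name not in outliers:
                del self.streaks[name]   # streak broken: start over
        confirmed = []
        for name in sorted(outliers):
            self.streaks[name] += 1
            if self.streaks[name] == self.min_consecutive:
                confirmed.append(name)   # fire once, when the streak is reached
        return confirmed

tracker = OutlierTracker(min_consecutive=3)
# Made-up detection results for four consecutive windows.
windows = [["server-e"], ["server-e", "server-b"], ["server-e"], ["server-e"]]
alerts = [tracker.update(w) for w in windows]
print(alerts)
```

In this run server-b is flagged in a single window and never confirmed, while server-e is confirmed in the third window and not reported again in the fourth.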
Parameter Selection
Two parameters must be set for DBSCAN: Eps and MinPts. Eps determines the neighborhood radius of a data point, and MinPts defines the minimum number of data points required to form a cluster. Instead of asking users to choose these values directly, we ask for the number of outlier servers they currently see and use simulated annealing to find parameters that produce that number. Working backward in this way simplifies configuration, and several Netflix projects now use our system.
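A simplified sketch of this idea is shown below. It is not the production implementation. It fixes MinPts and anneals only Eps, and the starting point, step size, cooling rate, and data are assumptions chosen for illustration. The cost is the difference between the number of noise points DBSCAN would report and the expected number of outliers.

```python
import math
import random

def count_noise(vectors, eps, min_pts):
    """Number of points DBSCAN would label as noise for these parameters."""
    neighbors = [
        [j for j, v in enumerate(vectors) if math.dist(u, v) <= eps]
        for u in vectors
    ]
    core = [len(ns) >= min_pts for ns in neighbors]
    return sum(
        1 for i, ns in enumerate(neighbors)
        if not core[i] and not any(core[j] for j in ns)
    )

def anneal_eps(vectors, expected_outliers, min_pts, seed=0, steps=200):
    """Search for an eps whose noise count matches expected_outliers."""
    rng = random.Random(seed)

    def cost(e):
        return abs(count_noise(vectors, e, min_pts) - expected_outliers)

    eps = 0.05                           # assumed starting radius
    current = cost(eps)
    best_eps, best_cost = eps, current
    temperature = 1.0
    for _ in range(steps):
        if best_cost == 0:
            break                        # exact match found
        # Propose a nearby radius, clamped to an assumed search range.
        candidate = min(2.0, max(0.01, eps + rng.uniform(-0.2, 0.2)))
        candidate_cost = cost(candidate)
        delta = candidate_cost - current
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            eps, current = candidate, candidate_cost
            if current < best_cost:
                best_eps, best_cost = eps, current
        temperature *= 0.95
    return best_eps, best_cost

# Hypothetical error-rate windows; the last server is the known outlier.
vectors = [
    [0.0, 0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.9, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.8, 0.1, 0.1, 0.0],
    [0.3, 0.4, 0.9, 0.4, 0.3, 0.4],
]
best_eps, best_cost = anneal_eps(vectors, expected_outliers=1, min_pts=3)
print(best_eps, best_cost)
```

The user supplies only the expected outlier count. The search then finds a radius at which DBSCAN reports exactly that many noise points.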
To evaluate the effectiveness of this system, we tested it in the production environment. We collected one week of data and then checked the servers that the algorithm had flagged as outliers. The test results are as follows:
The results show that although our detection system is not 100% accurate, it performs very well. In our situation it does not need to be perfect: even if we shut down a server that was running normally, the impact on the user experience is small, because the auto scaling system can immediately add a new server. An imperfect detection system is still better than none.
Our approach is to collect data over a period of time and then run detection on it. Because this is not real-time detection, its effectiveness depends on the length of the collection window: if the window is too short, the data may be noisy, and if it is too long, detection is too slow. To improve on this, you could consider real-time stream processing frameworks such as Mantis and Apache Spark Streaming. Research on data stream mining and online machine learning has also made progress, so it is worth considering if you want to build a similar system.
Parameter selection can also be improved. You could label data to build a training set and train a model on it. This would work better than the current approach of working backward from the outlier count, and the model could be retrained as the training data changes.
Summary
Netflix's infrastructure keeps growing. Automating certain operational decisions (such as stopping servers, as described here) can improve availability and reduce the burden on operations staff. Just as Daredevil's suit helps him fight, machine learning helps our technical support team work more efficiently. Detecting outlier servers is just one example of automation, and there are many other opportunities. Stay tuned.
Original article: Tracking down the Villains: Outlier Detection at Netflix