360: This Is How We Do AI-Based Network Operations

Source: Internet
Author: User

About the Author

Tan Cox is a network operations expert at Qihoo 360 with 10 years of experience in network operations and development. He joined 360 in 2012 and is responsible for AIOps algorithm R&D for the company's next-generation intelligent operations system. Running data-center networks that serve hundreds of millions of users, he has accumulated extensive operational experience, and he actively explores system development and algorithm research, applying today's AI techniques to practical operations problems.

Thanks to the efficient-operations community for providing this platform. I am a network engineer who lived through 360's architecture transformation; my own technical focus has since shifted to network monitoring, automation, network visualization, and AI applications. Today's share consists of four parts:

1. Project background
2. Time-series algorithms
3. Machine learning
4. Present and future

I. Project Background

The project focuses on the network: detecting traffic anomalies at the ISP egress of our data centers (DCs), automatically discovering and locating each anomaly, working out which business caused it, and notifying that business to handle it.

Our company's business has expanded into search, smart hardware, mobile phones, dash cams, children's watches, the Little Water Drop camera, sweeping robots, and 360 Cloud. Although we are not at BAT scale, these business lines, small as some of them are, have accumulated a great deal of cloud experience; the company also runs some entertainment businesses.

Here is our data as of the end of 2017: 515 million monthly active users on PC and 350 million on mobile, 865 million monthly active users in total. We operate 120 data centers in mainland China, one in Hong Kong, and one in Los Angeles, with ISP bandwidth reaching 3.5T.

Facing a network of this scale, we have zero tolerance for business interruption and need insight into every anomaly in the network. Even though traffic can fail over, the user experience still degrades somewhat, so we want to know at any moment whether there is an anomaly in the traffic at a DC egress, what kind of anomaly it is, and to respond and fix it immediately.

This is a traffic graph of one of our DC egresses. Overall it shows morning and evening peaks; zoom in and you see fluctuations, quite frequent ones, and magnified locally there is no strong pattern. A DC does not carry a single business: many businesses run over the same egress, so when an alarm fires we do not know which business is abnormal. To us it is a black box that fluctuates abnormally; which business caused the anomaly? Opening this black box, from defining an anomaly, to detecting it, to locating the responsible business, poses a real challenge for network operators.

By "locating the business" I mean this: if in the end you cannot say which business is responsible, telling people there is an anomaly is meaningless. An engineer paged in the middle of the night has to call around asking whose business it is, only to hear, "I released an app tonight, the high traffic is normal."

Once we locate what type of business is involved, we notify the business owner. Without that localization, much of the earlier work is likely meaningless. Network monitoring also had its traditional tools: fixed-threshold monitoring of traffic. A fixed threshold cannot catch fluctuation anomalies at all, and if the threshold is set too low, the false-alarm volume becomes large.

Last year we made a bold attempt at anomaly detection and traffic prediction. 360's networks have hundreds of thousands of ports, and we store all of their traffic data as time series. For each port we also extract dozens of dimensions of metadata: which server the port belongs to, which domain name and business owner it corresponds to, which city and region it is in. Having tagged everything with these labels, we had the time-series data that is the prerequisite for anomaly detection and analysis.

II. Time-Series Algorithms

With the data in hand, we can analyze it using time-series algorithms and machine learning. Before processing, we verify the stationarity of the data; for data that is not smooth enough, we apply transformations such as differencing.

Taking the first-order difference of the raw data, you can see that it basically floats in an interval around 0. We then compute the autocorrelation coefficients, which fall within ±0.2, and check whether the distribution matches a normal distribution. The analysis showed that most of the data is stationary.
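As a rough sketch of this check (the series below is made up for illustration), one can difference the series and look at the lag-1 autocorrelation of the result:

```python
def first_difference(series):
    # First-order difference: d[i] = x[i+1] - x[i]. A series with a
    # steady trend differences down to values floating around a constant.
    return [b - a for a, b in zip(series, series[1:])]

def autocorrelation(series, lag=1):
    # Sample autocorrelation at the given lag; values near 0 suggest
    # the series behaves like noise, a sign of stationarity.
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# A traffic-like series with a linear trend: the raw data is not
# stationary, but its first difference stays in a narrow band around 2.
traffic = [100 + 2 * t + (t % 3) for t in range(50)]
diff = first_difference(traffic)
r1 = autocorrelation(diff, lag=1)
```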

2.1 3-sigma

Having verified and smoothed the data, we can apply some algorithms. Look at this normal-distribution curve: the horizontal axis represents how the data is distributed, and each cell is one standard deviation wide. As the figure shows, only a tiny fraction of the data lies beyond 3 standard deviations. So given a new data point, we check whether it falls more than 3 standard deviations from the mean; if it does, it is considered an anomaly.
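A minimal sketch of the 3-sigma rule (the window values below are illustrative, not from the talk):

```python
import statistics

def three_sigma_anomaly(history, value, k=3.0):
    # Flag `value` as anomalous if it falls more than k standard
    # deviations away from the mean of the historical window.
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return abs(value - mean) > k * std

history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
```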

2.2 EWMA (Exponentially Weighted Moving Average)

The EWMA algorithm assumes that historical data has some influence on the current value, with the size of that influence expressed as weights. It introduces a parameter λ between 0 and 1: the larger λ is, the larger the weight of the current value and the smaller the weight of earlier moments.

In actual traffic graphs we found that the more recent the data, the more it matters. We compute EWMA over 7 days of data in 15-minute windows, which yields an EWMA trend curve; we take the last, i.e. most recent, value of that curve in place of the mean, then run the 3-sigma comparison above, and anything beyond the bound is considered an anomaly. This way the algorithm accounts for the influence of historical data on the current value.
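A sketch of combining EWMA with the 3-sigma comparison (λ here is illustrative, and the talk's 15-minute windows over 7 days are simplified to a plain list of values):

```python
import statistics

def ewma(series, lam=0.3):
    # Exponentially weighted moving average: the larger lam is, the
    # more weight the current point gets and the less history gets.
    smoothed = series[0]
    for x in series[1:]:
        smoothed = lam * x + (1 - lam) * smoothed
    return smoothed

def ewma_3sigma_anomaly(history, value, lam=0.3, k=3.0):
    # As described above: take the last value of the EWMA curve in
    # place of the plain mean, then apply the 3-sigma comparison.
    center = ewma(history, lam)
    std = statistics.pstdev(history)
    return abs(value - center) > k * std
```

On a steadily rising series, the EWMA center tracks the recent level instead of the stale overall mean, so a continuation of the trend is not flagged while a genuine jump is.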

On the data-center traffic graph we pick two moments, t and t-1, take a time window at each, compute the mean of each window, and divide the latter window's mean by the former's, times 100%: that is the fluctuation proportion. Using time windows this way effectively absorbs momentary fluctuations, though it also sacrifices some sensitivity.
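The window comparison, sketched (the window length is arbitrary here):

```python
def window_ratio(series, window=4):
    # Mean of the latest window divided by the mean of the window
    # before it, times 100%: the fluctuation proportion. Averaging
    # over windows absorbs momentary spikes, at some cost in
    # sensitivity, as noted above.
    cur = series[-window:]
    prev = series[-2 * window:-window]
    return 100.0 * (sum(cur) / window) / (sum(prev) / window)
```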

2.3 Dynamic Thresholds

As in the image above, there is a normal interval, with the anomaly regions on both sides of it: the second-lowest value in 14 days of history multiplied by 60% and the second-largest multiplied by 1.2 form the bounds, and anything outside them is considered an anomaly. The drawback is obvious: although the thresholds are now dynamic, a few fluctuations are enough to pull them too high or too low.
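The dynamic-threshold rule in code (14 daily values and the 0.6/1.2 factors as described; the sample data is made up):

```python
def dynamic_bounds(history):
    # history: the last 14 days of values for the same time slot.
    # Lower bound: second-lowest value * 60%; upper bound:
    # second-largest value * 1.2.
    s = sorted(history)
    return s[1] * 0.6, s[-2] * 1.2

def dynamic_threshold_anomaly(history, value):
    lo, hi = dynamic_bounds(history)
    return value < lo or value > hi

days = [100, 105, 95, 110, 90, 102, 98, 107, 99, 101, 104, 96, 103, 100]
```

The weakness mentioned above is visible in the rule itself: a couple of extreme days immediately drag the bounds outward for the next two weeks.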

2.4 Small-Flow Monitoring Optimization

This is an algorithmic optimization we made for real business needs, to handle some small-traffic ports. The x-axis is time and the y-axis is traffic volume. Going from 1 Mb to 9 Mb and going from 1 Gb to 9 Gb are both 9x increases, but from an operations perspective they do not mean the same thing.

We wanted a curve that grants tolerance dynamically: very steep when the flow is small, flattening out further along. Anyone with a mathematical background will recognize the logarithmic function: y = w·ln(x + b). We first tried putting b outside the logarithm, but putting it inside works better; w determines the slope.
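The tolerance curve y = w·ln(x + b) in code (units and parameter values are illustrative):

```python
import math

def tolerance(x, w=1.0, b=1.0):
    # y = w * ln(x + b): steep near zero and flattening as x grows,
    # so small flows get proportionally more tolerance than large
    # ones. w sets the slope; b sits inside the logarithm, which
    # shapes the small-flow end better than adding it outside.
    return w * math.log(x + b)
```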

With these four algorithms applied together, plain algorithms can handle about 80% of the DC egress data. There remains some data for which it is very hard, no matter how you cut it with time windows, to separate out the anomalous segments; these cases are difficult to optimize and are the biggest headache.

III. Machine Learning

Facing the hard-to-optimize cases above, engineers look for a way out, and we turned to the currently popular approach: machine learning.

3.1 Machine Learning Architecture

Considering machine learning, look first at the architecture. We wanted to design a model that updates automatically. Why? Because business traffic patterns are rarely fixed: last month and this month may differ; there may have been no frequent jitter last month yet frequent fluctuation this month, depending perhaps on how the business is scheduled. The most recent trend usually reflects a business's current traffic characteristics best.

With a trained model, we extract features from real-time traffic and feed them in as samples; the model then tells us whether the traffic is normal or abnormal. Because training takes a certain amount of time and cannot meet real-time requirements, we run that part offline, while real-time prediction runs online.

3.2 Comparison of Learning Approaches

We also experimented with the choice of learning approach. Supervised machine learning generally requires a positive-to-negative sample ratio of about 1:1, plus manual annotation; with annotation the algorithm can be improved effectively and lifted overall. Unsupervised learning needs neither a balanced sample ratio nor annotation, and can automatically learn useful information from the data, but it still has some parameters that the engineer must tune based on judgment.

3.3 Feature Extraction

The model takes features and raw data as input; how do we extract those features?

Our goal is to separate out the anomalous data in the first picture through analysis, which is really a classification problem. The picture on the right shows two normal clusters: the blue and red points represent normal data, while the crosses far away from them are the anomalous data, also called outliers.

So how are the features of this data extracted?

An anomaly necessarily comes with fluctuation; smooth data is not anomalous, and if the data always fluctuates in the same way we consider it normal. An anomaly must be a small-probability event. For the feature vector we made several attempts: normalizing the traffic size; taking the probability distribution of points within separated time periods (in general the distribution of the data is fixed); and using many year-over-year and period-over-period coefficients of variation as training features. The measured results were never ideal. At present we use the raw traffic size and the period-over-period amplitude as features.

3.4 Model Selection

Since this is a classification problem, there are many models to choose from. We tried the classic K-means clustering algorithm: given a dataset, define the number of clusters and set a maximum number of iterations; the output is the center of each cluster, with every data point labeled by the cluster it belongs to. The process is a loop: assign each point, then recompute the centers, until the final centers are obtained.

After computing the centers, how do we decide what to output as an anomaly? A threshold is usually needed here. Training gives us the center of each cluster; to judge an anomaly we set a threshold, compute a point's Euclidean distance to the nearest center, and treat the point as anomalous if the distance exceeds the threshold. In this picture, a threshold of 2.4 in our tests separated the data completely: the green part is normal data, and the fluctuating points are the red dots.
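A sketch of this procedure in plain Python (the data points, the fixed initial centers, and the reuse of the 2.4 threshold are illustrative; real implementations use smarter initialization such as k-means++):

```python
import math

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mean_point(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def kmeans(points, k=2, iters=20):
    # Classic K-means loop: assign each point to its nearest center,
    # recompute centers, repeat up to a maximum number of iterations.
    centers = points[:k]  # naive deterministic init, for the sketch only
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [mean_point(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def is_outlier(p, centers, threshold=2.4):
    # Anomalous if the Euclidean distance to the nearest cluster
    # center exceeds the threshold.
    return min(dist(p, c) for c in centers) > threshold

normal = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11)]
centers = kmeans(normal, k=2)
```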

The next thing to share is the isolation forest, from Zhou Zhihua's 2011 work "Isolation-Based Anomaly Detection." The intuition is cutting a cake and seeing which piece gets cut off first: cut randomly 100 times, and B is isolated within a few cuts, while isolating A takes many more. The implementation uses a forest algorithm: build a number of random binary trees, sending smaller values left and larger values right; points with the shortest path from the root score highest and are judged anomalous.
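A toy sketch of the idea in plain Python (a real deployment would use a library implementation such as scikit-learn's `IsolationForest`; this version rebuilds its random cuts per query and appends the query point to each sample, which is a simplification):

```python
import math
import random

def _c(n):
    # Average path length of an unsuccessful search in a binary
    # search tree of n points; used to normalize path lengths.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _path_length(x, sample, rng, depth=0, max_depth=8):
    # Random binary cuts: values below the cut go left, the rest go
    # right. An outlier ends up alone after very few cuts.
    if depth >= max_depth or len(sample) <= 1 or min(sample) == max(sample):
        return depth + _c(len(sample))
    split = rng.uniform(min(sample), max(sample))
    side = ([v for v in sample if v < split] if x < split
            else [v for v in sample if v >= split])
    return _path_length(x, side, rng, depth + 1, max_depth)

def iforest_score(x, data, n_trees=50, sample_size=64, seed=0):
    # Score in (0, 1]: near 1 means isolated quickly (anomalous),
    # around 0.5 or below means the point looks normal.
    rng = random.Random(seed)
    m = min(sample_size, len(data))
    total = sum(_path_length(x, rng.sample(data, m) + [x], rng)
                for _ in range(n_trees))
    return 2 ** (-(total / n_trees) / _c(m + 1))

# Normal traffic values in the 90-110 band, with repeats.
data = [float(v) for v in range(90, 111)] * 5
```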

Comparing the two methods: the features we can extract are still relatively few (previously just the year-over-year and period-over-period features). For classification, K-means does better when there are more features. K-means also requires removing some anomalous samples from the training set, while iForest does not. On ease of use we consider iForest better, so iForest is what we finally chose.

Because every DC scene is different, and the traffic characteristics of the in and out directions differ too, it is best to train one model per port per direction, which fits the business more closely. That is, each port in each direction is one curve, and each curve gets its own model; the window size is 10 minutes, and the models are updated daily.

If we use only the first four methods with arbitration (arbitration means the algorithms vote: a point counts as anomalous when more than two of the four algorithms flag it), the accuracy is lower. With arbitration across the algorithms plus machine learning giving the final ruling, accuracy rises above 98%.
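The arbitration rule as a sketch (the exact way the model's verdict is combined with the votes is not spelled out in the talk; the combiner below is one plausible, hypothetical reading):

```python
def arbitrate(votes, quorum=2):
    # votes: one boolean per statistical detector (3-sigma, EWMA,
    # window ratio, dynamic threshold). Anomalous when more than
    # `quorum` detectors vote yes.
    return sum(votes) > quorum

def final_decision(votes, model_flags_anomaly):
    # Hypothetical combiner: the four detectors vote first, and the
    # machine-learning model has the final say.
    return arbitrate(votes) and model_flags_anomaly
```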

IV. Present and Future

Anomalies can now be detected, but when one occurs, if we cannot locate which business is responsible, the detection is not worth much.

Building on the earlier accumulation, we tap the traffic at the data-center egress; the mirrored traffic gives us complete data, and with some C development we can tell which IPs in a segment are running high and which are running low. When an anomaly alarm fires, it can call an API to pull the top-N talkers; when a spike happens we look at the top-N, because the steep curve alone does not show who caused it.

Just a couple of days ago we shipped an improved version: knowing which business and which IP is not enough, so we also report the protocol type of the traffic, whether it is TCP/UDP/ICMP and so on, find the corresponding business owner, and email the business owner and the operators directly.

There is also a method based on the Pearson correlation coefficient, which measures the similarity of two curves: below 0.3 means no linear correlation, 0.3 to 0.5 low linear correlation, 0.5 to 0.8 significant linear correlation, and above 0.8 high linear correlation.
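The Pearson coefficient and the bands above, sketched in code:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length curves.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_band(r):
    # The similarity bands quoted in the talk.
    r = abs(r)
    if r > 0.8:
        return "high"
    if r > 0.5:
        return "significant"
    if r >= 0.3:
        return "low"
    return "not linearly correlated"
```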

When an egress traffic curve fluctuates, we fetch the curves of the related IDC ports. We built a tool we call "clairvoyance": the operator selects a region of the curve with the mouse, and the backend computes which curves inside the DC are most similar to the selection over the same period and shows which business they belong to. Before this, engineers analyzed the traffic charts by eye, one chart at a time; with a switch that has especially many ports, that is very time-consuming.

In all these years of network monitoring we have built many monitoring items in fine detail: port utilization percentages, traffic loss, quality monitoring, and so on. What these monitors lack is reasonable correlation. We hope to connect them intelligently, so that when a failure occurs the associated events can be found, alarm suppression and shielding can be done well, and beyond the logical associations we can find the root cause of the alarms in a given period, and even move toward prevention by predicting failures early.

Once a fault is found, there should be a remediation plan: we configure the playbooks in advance, and through automatic linkage of those playbooks, faults get repaired automatically.

Slides from GOPS 2018 Shenzhen (continuously updated)

Link: https://pan.baidu.com/s/1zgOGm7CabpO6lIquNVcNkg

Password: sp76




End
