At present, hundreds of Alibaba Cloud products run on the Alibaba Cloud network, and the deployment footprint has grown from a few domestic cities and regions to many countries and regions around the world. As the most basic infrastructure, the Alibaba Cloud network connects Alibaba Cloud products across all regions. It also has to provide these products with multiple high-quality Internet access methods, so that Alibaba Cloud users can reach cloud products quickly and efficiently from many locations.
Alibaba Cloud Network is designed and implemented based on SDN. It is roughly divided into two parts, the overlay and the underlay. The overlay contains self-developed virtual network components, including SDN controllers, virtual gateways, and host vswitches. The underlay contains physical network devices, including physical routers and physical switches. Our host vswitch delivers industry-leading performance and is the key to our east-west traffic. The virtualized gateway carries north-south traffic into and out of the cloud. Together, these key virtualized devices are what make cloud-to-cloud interaction possible.
This ultra-large-scale global network consists of millions of network devices, tens of millions of network instances, and more than 1,000 network indicators. The IPs allocated by the various network products all run on this network. Each indicator has its own meaning, and different indicators relate to one another in different ways. How do we manage a network this large? It is very challenging. We want to know at any moment whether anything in the network is abnormal, and we want to understand how the whole network is operating, from the global picture down to the status of each device and each instance. As the business grows rapidly, we also want to know whether our network resources can meet the plan for the next month or quarter, how good those resources are, and what Internet access capabilities our suppliers provide. All of these questions require analysis.
To solve these problems, we designed and implemented a data analysis system, based on big data technology and years of networking experience, that helps us manage this network intelligently. It can process massive amounts of network data and turn that data into visual information and decisions that help us diagnose network problems, understand the health of the network, and plan its direction.
Data-driven intelligent network - Qitian
We call this intelligent network system Qitian. The name means that the network can be viewed as if from the sky, from the entire earth down to every device. The Qitian intelligent network consists of the following four parts:
1. Network market. An overview of Alibaba Cloud network health that shows what is happening on every network and on every user instance.
2. Network anomalies. A multi-dimensional view of anomalies on the Alibaba Cloud network, with real-time monitoring of network stability.
3. Network resources. Planning for Alibaba Cloud network resources. It tells us in time when resources are running short somewhere, when Internet quality has degraded somewhere, or when the network on the leased-line side that connects users shows jitter, so that we can promptly contact partner carriers and help users resolve resource quality problems.
4. Network operations. Combining the technology and experience of the BI team with our own understanding of the network, we analyze our network products, costs, and user profiles to understand how users use our products, how the products are developing, and how users deploy networks on the cloud.
The Qitian 1.0 product architecture is shown in the figure. The bottom layer consists of the virtual and physical network components, that is, the overlay and the underlay. These components generate a large amount of network data. The data is very raw and is mostly pushed to the upper layer as indicators or logs. The data analysis system consumes these indicators and logs in real time, then cleans, aggregates, processes, and computes over multiple dimensions to produce semantically rich multi-dimensional network data. That data then feeds various offline analyses, including time-series analysis and MaxCompute data analysis. The results are handed to the upper platform, which performs secondary analysis along the exception, resource, product, and market dimensions so that we can serve users from these four directions. The final outputs include the Web, APIs, and a stream robot.
Network market
The network market covers all Alibaba Cloud regions, all virtual network components, and all core metrics, across both proprietary and public clouds. It provides multi-layer analysis, trend statistics, and topology maps for the core indicators of every network production component, so we can see the traffic and real-time operation of each region, each cluster, and even each IP. The network market combines the virtual and physical network topologies. When the network topology changes in any region, the data analysis system senses the change and incorporates it into the data aggregation algorithm, with no need to restart the program or resubmit it. Our network topology and data analysis platform are linked in real time.
To view problems from multiple angles, we split the data into multiple time granularities ranging from 1 minute to 1 year, so that we can understand the network over different time spans. For example, to find out when regional network traffic peaked or when network jitter occurred on Alibaba Cloud over the past three years, we need long-term analysis, sometimes looking back two or three years to see where the network is heading. Time-series data at multiple granularities helps us understand the health of the entire network across different time dimensions.
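The multi-granularity split described above can be illustrated with a minimal sketch. It assumes raw samples arrive as (Unix timestamp, value) pairs and rolls them up into fixed-width buckets. The granularity names, the bucket widths, and the functions `rollup` and `rollup_all` are hypothetical and are not taken from Qitian. Calendar granularities such as month or year are not fixed-width and would need calendar-aware bucketing, which is omitted here.

```python
from collections import defaultdict

# Illustrative bucket widths in seconds (assumed; not from the source).
GRANULARITIES = {"1m": 60, "1h": 3600, "1d": 86400}


def rollup(points, width_s):
    """Aggregate (unix_ts, value) points into fixed-width time buckets.

    Returns {bucket_start_ts: {"sum": ..., "max": ..., "count": ...}}.
    The mean can be derived later as sum / count.
    """
    buckets = defaultdict(lambda: {"sum": 0.0, "max": float("-inf"), "count": 0})
    for ts, value in points:
        start = ts - ts % width_s  # align the timestamp to its bucket start
        bucket = buckets[start]
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
        bucket["count"] += 1
    return dict(buckets)


def rollup_all(points):
    """Produce one rolled-up series per configured granularity."""
    return {name: rollup(points, width) for name, width in GRANULARITIES.items()}


if __name__ == "__main__":
    samples = [(0, 1.0), (30, 3.0), (60, 5.0), (3600, 7.0)]
    for name, series in rollup_all(samples).items():
        print(name, series)
```

Keeping sum, max, and count per bucket means that coarser views can answer both "how much traffic" and "what was the peak" without rescanning the raw data.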
Network anomaly
The network anomaly analysis system is a key component for understanding the stability of Alibaba Cloud's entire network, and it is also the most complex component. To extract anomalies accurately while avoiding excessive noise, so that R&D staff and users do not receive too many alarms, we work on the following four aspects:
1. Active detection. Alibaba Cloud has deployed many probe nodes around the world, in both the overlay and the underlay, that probe both layers continuously. When a device fails, the problem is detected and an alarm is raised immediately, and network operations staff handle and recover it right away.
2. Indicator fluctuations. When something goes wrong in the network, indicators inevitably become abnormal, and each indicator may show different anomalies. However, the data analysis pipeline is very long: from collection through aggregation and cleaning to processing, it passes through many stages and depends on many middleware components, so data-link jitter or indicator glitches can appear along the way. We designed and implemented a set of algorithms that filter out jitter and glitches introduced by the intermediate pipeline and extract as anomalies only the fluctuations of indicators that truly suggest a problem or failure.
3. Interval prediction. In cooperation with Zhejiang University, we designed and implemented a new machine-learning-based interval prediction algorithm. From the historical data of an indicator, it analyzes the indicator's traffic characteristics and builds a data model. Based on the model, it predicts the interval within which an instance's indicator will fluctuate over the next period of time. When that time arrives, we compute an anomaly score from the actual indicator value and its offset from the predicted interval, and use the score as a factor in generating suspected network anomalies.
4. Anomaly aggregation. Anomaly aggregation is not an algorithm as such. It aggregates exception events based on the network topology, converges them, and locates the affected scope. By combining all exceptions with the network topology and network links, it converges them into one high-level exception that accurately describes what happened in the past minute: how many devices and instances were affected, how many users and products were affected, how much traffic dropped, and how long the product business was affected.
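The source does not describe the actual algorithms, so the following is only a rough sketch of the last three ideas using simple stand-ins: a persistence filter in place of the glitch-filtering algorithms, a distance-from-interval score in place of the machine-learning model's scoring, and grouping by a topology parent in place of full topology-based convergence. All function names, the `min_run` threshold, and the device and cluster names are hypothetical.

```python
from collections import defaultdict


def filter_glitches(flags, min_run=3):
    """Keep only anomaly flags that persist for at least min_run consecutive samples.

    A one-off spike caused by pipeline jitter is dropped; a sustained run is kept.
    """
    kept = [False] * len(flags)
    run_start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                for j in range(run_start, i):
                    kept[j] = True
            run_start = None
    return kept


def interval_score(actual, low, high):
    """Return 0.0 inside the predicted interval, else the overshoot in interval widths."""
    width = max(high - low, 1e-9)  # guard against a zero-width interval
    if actual < low:
        return (low - actual) / width
    if actual > high:
        return (actual - high) / width
    return 0.0


def aggregate_by_topology(events, parent_of):
    """Group (device, score) events under their topology parent, such as a cluster."""
    groups = defaultdict(lambda: {"devices": set(), "max_score": 0.0})
    for device, score in events:
        parent = parent_of.get(device, device)  # unknown devices form their own group
        group = groups[parent]
        group["devices"].add(device)
        group["max_score"] = max(group["max_score"], score)
    return dict(groups)


if __name__ == "__main__":
    print(filter_glitches([True, False, True, True, True, False]))
    print(interval_score(15, 0, 10))
    topology = {"sw1": "clusterA", "sw2": "clusterA"}
    print(aggregate_by_topology([("sw1", 0.5), ("sw2", 1.0), ("gw9", 0.2)], topology))
```

In this sketch, many device-level events collapse into one entry per parent, which mirrors the goal of sending one high-level exception instead of many individual alarms.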
Network resources
Resource analysis is the component we use specifically for resource planning, resource quality analysis, and related tasks. We combine the sales data of all current products with the actual cluster operation indicators to determine the resource water level, analyze the average consumption rate of each indicator over the recent past, work out when the fastest-consumed indicator will be exhausted, and then predict which regional cluster will hit a capacity limit within a given future period.
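As a minimal sketch of this capacity estimate, assume each indicator has a known capacity and a daily usage history, and that consumption continues at its recent average rate. The linear extrapolation, the function names, and the sample clusters and indicators are assumptions for illustration only; the source does not state which prediction method is used.

```python
def days_until_exhausted(capacity, used_history):
    """Estimate days until capacity is reached from daily usage samples (oldest first).

    Uses the average daily growth over the history. Returns None when usage
    is flat or shrinking, because no exhaustion date can be extrapolated.
    """
    if len(used_history) < 2:
        raise ValueError("need at least two daily samples")
    rate = (used_history[-1] - used_history[0]) / (len(used_history) - 1)
    if rate <= 0:
        return None
    remaining = max(capacity - used_history[-1], 0)
    return remaining / rate


def rank_by_exhaustion(clusters):
    """Rank (days, cluster, indicator) tuples, soonest exhaustion first.

    clusters: {cluster_name: {indicator_name: (capacity, used_history)}}
    """
    ranked = []
    for cluster, indicators in clusters.items():
        for indicator, (capacity, history) in indicators.items():
            days = days_until_exhausted(capacity, history)
            if days is not None:
                ranked.append((days, cluster, indicator))
    return sorted(ranked)


if __name__ == "__main__":
    sample = {
        "A": {"bandwidth": (100, [50, 60, 70]), "eip": (1000, [100, 100, 100])},
        "B": {"bandwidth": (100, [80, 90])},
    }
    for days, cluster, indicator in rank_by_exhaustion(sample):
        print(f"{cluster}/{indicator}: about {days:.1f} days left")
```

The first entry of the ranking identifies the cluster and indicator expected to hit their limit soonest, which is the signal a planner would act on first.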
We also compute multi-dimensional statistics on global resource consumption and analyze resource quality. Through a series of network resource quality analyses, including resistance detection and packet loss and latency at edge nodes, we understand the quality of all network resources worldwide.
In addition, we do resource planning: we predict inventory consumption from historical data, determine how much will be consumed, and purchase resources to prepare for upcoming business growth.
Network operations
Network operations include the following four aspects:
1. Revenue analysis. We analyze the causes of daily revenue fluctuations: which industries and users drive growth or change in revenue, and how those users use our products.
2. User analysis. We analyze user profiles for network products and the resource usage of each user.
3. Instance analysis. We analyze the instances of network products.
4. Cost analysis. We analyze whether the cost of network products is in line with expectations.
Planning and evolution
In the future, we want to be faster, more accurate, and smarter, specifically in the following aspects:
1. Second-level monitoring: We want to analyze all indicators at second or even sub-second granularity. To achieve this, we will have to handle a much larger volume of data, for example more than 100 times the current data throughput.
2. Indicator classification: Classify the various indicators and analyze their correlations, to help users identify their network characteristics and advise them on which network products to purchase.
3. Full-link diagnostics: Work across virtual and physical networks to directly locate problems on the network.
4. Intelligent scheduling: Schedule network traffic flexibly. When a traffic problem occurs somewhere, reschedule traffic to other areas in real time.