Objectively speaking, the Double 11 (Singles' Day) experience this year was good. Take my own case: in past years, when I submitted my shopping cart at midnight on Double 11, the orders usually went straight into a queue. This year the experience was clearly different, and everything went through in one go. The technology underneath Double 11 is in fact built on our network products. So the topic I want to share with you is how we do Double 11. I describe running Alibaba's Double 11 on the cloud as an elephant dancing on the clouds. It is an elephant because the volume of Double 11 is so large. Supporting Double 11 flawlessly is not easy, especially the peak at midnight. As everyone knows, AWS's ELB needs to be pre-warmed, so it would struggle when facing a peak like this.
Here are a few numbers. The first two are results: 100% of the traffic ran on the cloud, and the peak reached 325,000 transactions per second. What does that peak mean? At the first Double 11 in 2009, the peak was 200 transactions per second. Under this volume of transactions, the cloud network stayed stable. The other three numbers are about extreme performance. The first is the number of ECS instances in a single network: 100,000. A while ago I visited a customer in Shenzhen who was running their own OpenStack cloud network in-house. The customer complained to me that with only 100 compute nodes, the controller was already paralyzed. The second is a load balancing instance with 160G of bandwidth, which I believe surpasses most hardware load balancers. The last is scaling out by thousands of ECS instances within 10 minutes. This kind of elastic scale-out played an important role, for example, on the day Lu Han announced his relationship.
Not every cloud computing company gets the chance to go through the baptism of Double 11, and no cloud vendor other than Alibaba Cloud has had this experience. But although Alibaba is a very extreme and remarkable customer, it is not the only one: there are millions of other customers on Alibaba Cloud whose needs must be met. Serving a single customer with a fully customized system may always be easy, but serving millions of customers at the same time is definitely not.
First of all, what customers on the cloud care about most is availability. Nobody wants to move to the cloud only to have the service go down all day long, so availability is the lifeline for customers on the cloud.
In addition to availability, ease of use is also a very important feature. In the past, a user who wanted to pull a leased line or build a network had to think about the machine room, power supply, rack space, switches and routers, network architecture, on-site implementation, and so on. On the cloud, the user should only need a few clicks to have the network created as intended. If it takes countless clicks, the user will lose patience.
With availability and ease of use in place, users then care about elasticity and performance. Both are closely tied to user experience, and both are powerful tools for reducing cost. Double 11 is a good example. When users need resources, they want the preparation time to be zero. When they no longer need the resources, they want the teardown time to be zero as well. As for performance, users want to get the best results for the least money.
The last one is intelligence. How should we understand this? Take the process just mentioned of acquiring and destroying resources: what users really want is for these actions to happen automatically. And when something goes wrong in the user's production environment, the user hopes the system will handle it automatically.
How do Alibaba Cloud's network products satisfy users? On the surface, we rely on an API-driven SDN scheduling system to create and delete resources. Behind these products is the Luoshen system. The name Luoshen comes from classical Chinese mythology. We hope that our network products, like a river transport channel, can connect resources smoothly.
Luoshen was not built in a day.
The first generation of the Luoshen network appeared in 2011. At that time, users perceived the network through a given public network and private network, and could not define networks themselves. Under this architecture there was actually no network virtualization: addresses from the physical network were handed directly to the VM. There was no physical isolation between one network and another, and all addresses lived in a single address plane. As a result, network isolation, ACLs, and similar functions were all implemented on the physical switches. We call this stage Luoshen's classic network period. At this stage the network was basically inelastic.
The second generation of the Luoshen network appeared in 2014. From this point, users could customize their own networks: topology and addresses were entirely user-defined. We call this the VPC stage of Luoshen. Once users had their own networks, the question naturally arose of how those networks connect to the Internet, to the user's IDC, and to other networks. In this stage Luoshen's product design was greatly enriched, and a variety of products appeared.
The third generation of Luoshen appeared in 2016. It is essentially about the evolution of intelligent scheduling and user experience. As a user's network grows larger, the user needs a better way to connect resources on the cloud with resources off the cloud. As the user accumulates more resources, the user also needs a better way to manage them, such as bandwidth and traffic. And when users deploy services around the world, they expect intelligent scheduling and nearest-point access for global traffic.
As users become more and more dependent on the cloud, they want more from it than simply running their business: continuously improving visualization, operability, and ongoing added value.
This is the third generation of Luoshen, the intelligent Luoshen.
OK, let's take a look at how the current Luoshen architecture supports these rich features. Like a traditional SDN architecture, Luoshen separates control from forwarding. The data plane is in the lower left corner. It includes the virtual switch that runs on every ordinary compute node, which is the first hop for packets on that node. The other important role in the data plane is the gateway, which has two main functions in Luoshen. One is to host common network function virtualization units, which is what we usually call NFV. The other is to connect networks in two different planes, for example the public network and an offline IDC.
If the data plane is the carrier for the flow of packets in the Luoshen system, the control plane is what directs it. Luoshen's control plane also consists of several parts. One is device management, which covers the main data plane components, the virtual switch and the gateway, and is responsible for pushing configuration from the controller to the devices. Another important component is virtual network management, which is responsible for address allocation within a VPC, routing configuration within the VPC, and so on. Beyond these, the other two key components are the local routing controller and the global routing controller. Once Luoshen is no longer a network inside one data center but a global network spanning machine rooms and even regions, these two controllers play a key role: one is responsible for network interconnection and route calculation within the data center, and the other for network interconnection and route calculation worldwide.
In addition to the data plane and the control plane, Luoshen has a third major component, the operational plane. The data plane and control plane of the virtual network continuously generate data, including runtime state, abnormal alarms, and logs. Processing and consuming this data effectively is the foundation of overall system stability, so it is a very important function of the operational plane. The other function is processing the operational data of the network products, which is also supported by the operational plane.
Overall, the data plane, control plane, and operational plane together make up the Luoshen system. Of course, Luoshen does not only support Alibaba Cloud's network products: the underlying network of all Alibaba Cloud products is supported by Luoshen. What you may be more familiar with is Alibaba Cloud's overall virtualization stack, the Apsara (Feitian) system.
Luoshen is the virtual network component of that stack.
Next, let me share in detail the key designs of the Luoshen system. We believe that for the underlying technology of a public cloud service, stability and availability have the highest priority, so I will start with how Luoshen is designed for availability.
When a new Alibaba Cloud machine room is being deployed, Luoshen is the lowest-level system and is deployed first, right after the physical facilities. At this point, as you can see, the machine room contains compute clusters, gateways, and the control platform. Our virtual switch component runs on the compute clusters.
With only one machine room, we have no way to achieve disaster recovery across machine rooms. However, the key nodes of both the data plane and the control plane are deployed as clusters, so a problem on a single service node has no impact on users. Likewise, when the host of a VM has a serious problem such as going down, the VM can be migrated within the machine room, and the migration itself has no impact on the VM's network properties or connectivity.
When the second and third machine rooms are built, clustered gateway and controller nodes are deployed in each of them, and as the number of machine rooms grows, the rooms automatically form a ring backup relationship. Whenever a new machine room is built, its Luoshen deployment automatically joins the backup chain. In this way, when the critical nodes of one machine room fail, traffic can switch automatically to the backup machine room within seconds, and the service is then provided by the Luoshen system of that backup room. This ensures, as far as possible, that the failure of a single device has no impact, and that even if every node in a machine room has problems, the user's business can be restored in a very short time.
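The ring backup relationship described above can be sketched as follows. This is only an illustration under stated assumptions: the `BackupRing` class, its method names, and the rule "each room is backed up by the next healthy room in deployment order" are hypothetical and are not taken from the Luoshen implementation.

```python
class BackupRing:
    """Hypothetical ring of machine rooms; each room is backed up by the next one."""

    def __init__(self):
        self.rooms = []      # rooms in the order they were deployed
        self.healthy = {}    # room name -> True if its critical nodes are up

    def join(self, room):
        """A newly built room joins the end of the backup chain."""
        self.rooms.append(room)
        self.healthy[room] = True

    def mark_failed(self, room):
        self.healthy[room] = False

    def backup_of(self, room):
        """Return the next healthy room after `room` in the ring, or None."""
        count = len(self.rooms)
        start = self.rooms.index(room)
        for step in range(1, count):
            candidate = self.rooms[(start + step) % count]
            if self.healthy[candidate]:
                return candidate
        return None

    def serving_room(self, room):
        """Room that actually serves traffic addressed to `room`."""
        if self.healthy[room]:
            return room
        return self.backup_of(room)
```

With a single room there is no backup (`backup_of` returns `None`), which matches the single-room case described earlier. With three rooms A, B, and C, a failure of B moves its traffic to C, and a room that joins later simply extends the ring.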
Another important design for the availability of the Luoshen system is the operational plane, which is itself the key to making Luoshen intelligent and data-driven. If you think of the Luoshen system as one big switch, then in terms of features it is a switch that supports flow tracking and a rich set of policies. Below Luoshen are the devices and switches of the physical network. Using Luoshen's flow marking capability and policy-based rules, we can perform flow coloring, mirroring of specific packets, sampling, tracing, and so on, in both the physical network and the virtual network. The traffic data and logs produced by these actions are collected through SLS and sent to JStorm for real-time computation. If a flow is abnormal, an alarm and a log are sent to the administrator, and some alarms can trigger automatic fault handling and recovery. Part of the data is also synchronized offline to ODPS for richer computation, which then produces data reports and user profiles for the user, or a cool dashboard. This is essentially the ability to turn operations into data.
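To make the real-time step more concrete, here is a minimal sketch of raising an alarm when a sampled flow rate jumps well above its recent history. The talk does not say how abnormal flows are detected, so the sliding-window rule, the `FlowMonitor` class, and every parameter below are assumptions for illustration only; the sketch does not use SLS or JStorm APIs.

```python
from collections import deque
from statistics import mean, pstdev


class FlowMonitor:
    """Hypothetical per-flow detector: alarm when a sample far exceeds recent history."""

    def __init__(self, window=5, threshold=3.0):
        self.window = window        # number of past samples kept per flow
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.history = {}           # flow id -> deque of recent samples

    def observe(self, flow_id, packets_per_sec):
        """Record one sample; return an alarm dict if it looks abnormal, else None."""
        samples = self.history.setdefault(flow_id, deque(maxlen=self.window))
        alarm = None
        if len(samples) == self.window:
            average = mean(samples)
            # Use a floor of 1.0 so a perfectly flat history still has a margin.
            spread = max(pstdev(samples), 1.0)
            limit = average + self.threshold * spread
            if packets_per_sec > limit:
                alarm = {"flow": flow_id, "rate": packets_per_sec, "limit": limit}
        samples.append(packets_per_sec)
        return alarm
```

In a pipeline like the one described, the returned alarm would go to the administrator or to an automatic recovery action, while the raw samples would also be kept for offline analysis.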
Having covered the key designs for availability, let's look at the virtual switch on the data plane. As you all know, the virtual switch is the most widely distributed component in a cloud network. Every host has one, and it is the first gate through which a VM's traffic enters and leaves the network. In the Luoshen system, the virtual switch takes on most of the complex business logic because of its distributed role. As a multi-tenant device, carrying complex business logic inevitably means greater risk, so this is the first key design point of the virtual switch.
Like an ordinary switch, Luoshen's virtual switch separates the fast path from the slow path, and this design greatly improves performance.
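The idea behind fast-path and slow-path separation can be sketched as follows: the first packet of a flow goes through the full (slow) decision logic, and the result is cached so later packets of the same flow are forwarded by a simple lookup. This is a generic illustration, not Luoshen's code; the `VSwitch` class, the packet field names, and the five-tuple cache key are assumptions.

```python
class VSwitch:
    """Hypothetical virtual switch with a flow cache in front of a slow path."""

    def __init__(self, slow_path):
        self.slow_path = slow_path  # callable: packet dict -> forwarding action
        self.flow_cache = {}        # five-tuple -> cached action (the fast path)
        self.fast_hits = 0
        self.slow_hits = 0

    @staticmethod
    def flow_key(packet):
        return (packet["src"], packet["dst"], packet["proto"],
                packet["sport"], packet["dport"])

    def forward(self, packet):
        key = self.flow_key(packet)
        if key in self.flow_cache:
            # Fast path: one dictionary lookup, no policy evaluation.
            self.fast_hits += 1
            return self.flow_cache[key]
        # Slow path: run the full business logic once, then cache the result.
        action = self.slow_path(packet)
        self.flow_cache[key] = action
        self.slow_hits += 1
        return action
```

A real switch would also need to expire cache entries and invalidate them when policy changes; that is left out here to keep the sketch short.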
Another design is the abstraction of business logic. In the Luoshen virtual switch, the common network processing logic is abstracted out, so that all kinds of complex logic can be configured simply through the match-action rules of a policy. This means that many complex functions can ultimately be implemented without modifying code, just by changing policy-based configuration.
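A match-action abstraction of this kind can be sketched as a small priority-ordered rule table. The field names, action strings, and the `PolicyTable` class are hypothetical; the point is only that new behavior is added by inserting rules as data rather than by changing the lookup code.

```python
def matches(rule_match, packet):
    """A rule matches when every field it names has the same value in the packet."""
    return all(packet.get(field) == value for field, value in rule_match.items())


class PolicyTable:
    """Hypothetical match-action table; the highest-priority matching rule wins."""

    def __init__(self, default_action="drop"):
        self.default_action = default_action
        self.rules = []  # list of (priority, match dict, action)

    def add_rule(self, priority, match, action):
        self.rules.append((priority, match, action))
        # Keep rules sorted so lookup can stop at the first match.
        self.rules.sort(key=lambda rule: rule[0], reverse=True)

    def lookup(self, packet):
        for _priority, match, action in self.rules:
            if matches(match, packet):
                return action
        return self.default_action


# Behavior is defined entirely by configuration:
table = PolicyTable()
table.add_rule(10, {"dst_port": 22}, "drop")
table.add_rule(20, {"dst_port": 22, "src_ip": "10.0.0.5"}, "allow")
table.add_rule(5, {}, "forward")  # an empty match acts as a catch-all
```

Here SSH traffic is dropped except from one address, and everything else is forwarded, all without touching `lookup`.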
Below the business logic of the virtual switch is the vswitch base layer. With this layer, the vswitch can first of all adapt to different virtualization technologies and even to physical machines. So, as you can see, KVM, Xen, containers, physical machines, and so on are all supported by a single vswitch. Second, it can adapt to different underlying platforms: it runs on x86, ARM, and MIPS. This makes version management as efficient as possible, and at the scale of our vswitch deployment, version management would otherwise be truly daunting.
After the vswitch, let's look at the data plane as a whole. The network products supported by Luoshen published several performance figures this year, including ECS performance, and they are quite strong; many of you have been following those releases. Now look at the data plane of the Luoshen system. The simplest metaphor is that Luoshen's data plane is itself one huge switch.
As we all know, the forwarding chip of a switch processes packets entirely as a pipeline, and the hardware processing never stalls. Luoshen's data plane works the same way. From the moment a packet enters the Luoshen system to the moment it leaves,