[Repost] About Data Center Clos Network Architecture

Source: Internet
Author: User

http://djt.qq.com/article/view/238

1. The Data Center Network Architecture Challenge
As technology develops, data centers keep getting bigger: the server capacity of a single data center has grown from a few thousand servers a few years ago to tens of thousands or even hundreds of thousands today. To reduce network construction and operations costs, data center network designers try to make each network module as large as possible. At the same time, traffic inside the data center network keeps increasing, and driven by the demands of certain cluster workloads, designers have even begun to discuss the feasibility of putting ten thousand gigabit-attached servers into a single network module.
A typical architecture for a data center network module is the two-tier structure of dual core switches plus ToR access switches, as shown below:

Figure 1 Typical data center network module architecture


So what challenges does this kind of data center network architecture (and, for that matter, the other typical three-tier structure) face under the trend toward ever-larger data centers?
First, the size of a single network module is directly limited by the port density of the core switch. For example, to attach 10,000 gigabit servers at a 1:1 oversubscription ratio, the access layer must pass 10,000 Gbps of server bandwidth up through the two cores, so each core switch must provide at least 500 wire-speed, non-blocking 10GE ports, which already pushes the limits of mainstream commercial switch products. What if an even larger network is needed? It seems the only option is to wait for vendors to release higher-density products.
Second, as core switches grow ever larger and their port counts keep climbing, their power consumption is anything but low, easily approaching 10 kW per chassis. Meeting such a power requirement is a piece of cake for the self-built data centers of deep-pocketed internet companies, but for most enterprises it means a rack-power upgrade cycle of two to three months or more; bear in mind that in most domestic IDC rooms, a single rack is provisioned with only about 3 kW.
As a result, data center network designers began to explore whether other architectural designs were possible.
2. The Clos Network Architecture Makes Its Debut
Before we get into the topic, let's take a look at Clos, a term constantly on the industry's lips. Clos is a multi-stage switching architecture designed to minimize the number of intermediate crosspoints as the inputs and outputs grow. A typical symmetric 3-stage Clos architecture looks like this:


Figure 2 Symmetric 3-stage Clos switching network


In fact, the Clos architecture is nothing new. As early as 1953, Dr. Charles Clos of Bell Labs presented this architecture in a research paper on non-blocking switching networks, and it was widely used in TDM networks (mostly stored-program-controlled exchanges). To commemorate this significant achievement, the structure was named after him: Clos.
Now let's return to the topic of data center network design. The design approach mentioned earlier, which relies heavily on the port density of vendor products, has long stuck in the throat of many designers who relish a challenge. Like the open-source movement in the Linux community, they would rather shake off the constraints of vendor products and find an architectural approach that can build a very large network out of inexpensive, ordinary small box switches.
This approach reached its peak in the work of a group of researchers represented by Mohammad Al-Fares of the Department of Computer Science and Engineering at the University of California, San Diego.
In their paper "A Scalable, Commodity Data Center Network Architecture," published at SIGCOMM 2008, they explicitly proposed a three-stage Clos network architecture called the fat tree. With this approach, a large-scale server access network can be built entirely from fixed-configuration box switches. Specifically, using switches with k ports each, the number of core switches is (k/2)^2; there are k pods in total, each containing k/2 aggregation switches and k/2 access switches; and the number of servers that can be attached is k^3/4. Moreover, this architecture keeps the total bandwidth of the access, aggregation, and core layers consistent, guaranteeing a 1:1 oversubscription ratio for server access bandwidth. Figure 3 shows an example of the architecture when k = 4.


Figure 3 Fat-tree Clos network at k = 4


Obviously, with k = 48, that is, 48-port gigabit box switches, this architecture has 576 core switches and 48 pods, each pod containing 24 aggregation switches and 24 access switches, for 2,880 switches in total, and it can attach 27,648 gigabit servers at a 1:1 oversubscription ratio. The device count, while seemingly astonishing, is undeniably a scalable architecture and appears to place no special demands on the devices.
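To make the arithmetic concrete, here is a minimal Python sizing sketch based on the paper's formulas (the function name and output format are my own illustration, not code from the paper):

def fat_tree_sizing(k):
    # A 3-stage fat tree built from k-port switches (k must be even):
    # (k/2)^2 core switches, k pods with k/2 aggregation and k/2 access
    # switches each, and k^3/4 attachable servers at 1:1 oversubscription.
    assert k % 2 == 0, "k must be even"
    core = (k // 2) ** 2
    agg = k * (k // 2)
    access = k * (k // 2)
    servers = k ** 3 // 4
    return core, agg, access, core + agg + access, servers

print(fat_tree_sizing(4))   # (4, 8, 8, 20, 16), the k = 4 case of Figure 3
print(fat_tree_sizing(48))  # (576, 1152, 1152, 2880, 27648), as quoted above

Running it for k = 4 reproduces Figure 3, and k = 48 reproduces the counts quoted above.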
Even more encouraging, these researchers did not stop at armchair theory but paid close attention to practice: the paper details the IP address allocation, routing strategy, traffic scheduling algorithm, and so on, and compares cost and power consumption against the traditional design method. What's more, evidently aware of the cabling problem posed by interconnecting so many switches, they even give an equipment packaging and rack-placement scheme. In a word, almost every important issue has been carefully considered: follow the paper's design strictly, and a gigabit-server data center network with tens of thousands of servers at a 1:1 bandwidth oversubscription ratio is within reach.
In addition, the above scheme can be varied. For example, replace the aggregation and core devices with all-10GE-port switches, such as the popular 64-port 10GE boxes, and the access switches with the currently popular 4 x 10GE + 48 x GE models; for a network of the same scale, the number of devices and interconnect links can then be greatly reduced. The more detailed design is left for you to work out yourself :)
3. The Fog Thickens
At this point the problem seems solved: data center network designers finally have a choice and need no longer hunger after the port density of mainstream vendors' equipment.
But the designers who love to dig deeper quickly noticed that the Clos network architecture essentially replaces the two core devices of the traditional design with the equivalent of three-stage Clos networks, as shown below:


Figure 4 Traditional network architecture vs. Clos network architecture


In other words, each core device is replaced with a three-stage Clos. Why a three-stage Clos? Because traffic between the aggregation and core layers flows in both directions, the structure is equivalent to a standard three-stage Clos whose input and output stages have been folded together and merged.
Replacing the core equipment of mainstream vendors with a pile of inexpensive small boxes is exciting news, but also a perplexing one.
The first puzzle is non-blocking operation. We all know how important a non-blocking network core is. Mainstream commercial core devices are usually rather sophisticated: to achieve non-blocking forwarding, their switching fabrics run at a certain speedup and use techniques such as VOQ, along with scheduling arbitration or self-routing designs. So how can a pile of cheap small boxes achieve non-blocking operation?
Next, an obvious question: the Clos network architecture merely keeps uplink and downlink bandwidth consistent; is that enough to achieve non-blocking operation?
Also, input and output buffer design has always been something mainstream equipment vendors refuse to skimp on; where is this reflected in the Clos network architecture?
Then, how is load balanced across the multiple equal-cost paths of the Clos network architecture? Flow-based hash algorithms cannot even out the differences between flows, and since packet lengths vary, even round-robin distribution is hard to balance. Bear in mind that core switches typically repackage packets into fixed-length cells internally precisely to distribute traffic as evenly as possible. Does the Clos network architecture therefore need a centralized control system for traffic scheduling?
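To illustrate the hashing problem just described, here is a minimal sketch of flow-based ECMP path selection (purely illustrative; the hash choice here is an assumption, not any vendor's actual algorithm):

import hashlib

def pick_uplink(five_tuple, n_uplinks):
    # Map a flow (src_ip, dst_ip, src_port, dst_port, proto) to one
    # of the equal-cost uplinks; every packet of the flow follows it.
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_uplinks

flows = [
    ("10.0.0.1", "10.1.0.1", 40001, 80, "tcp"),
    ("10.0.0.2", "10.1.0.2", 40002, 80, "tcp"),
]
for f in flows:
    print(f, "->", pick_uplink(f, 4))

Because the choice depends only on the 5-tuple and not on flow size, two elephant flows that happen to hash to the same uplink will congest it while other paths sit idle.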
If you keep thinking along these lines, you may find more questions that are not easy to answer.
It seems the Clos network architecture is not so simple after all.
4. How to Achieve Non-Blocking Operation
First, let's look at what blocking and non-blocking mean.
• Internal blocking. The phenomenon in which an outgoing line is idle, yet a connection cannot be made because an inter-stage link inside the switching network is occupied, is called internal blocking of a multi-stage switching network.
• Strict-sense non-blocking. Regardless of the network's state, a connection can be established through the switching network at any time, as long as the starting point and end point of the connection are idle, without affecting the connections already established in the network.
• Rearrangeable non-blocking. Regardless of the network's state, a connection can be established at any time, either directly or by re-routing existing connections, as long as the starting point and end point of the connection are idle and the connections already established in the network are not affected.
• Wide-sense non-blocking. The network has an inherent possibility of blocking, but there may exist an ingenious routing strategy by which all blocking can be avoided without rearranging the connections already established in the network.
So what about blocking in the Clos network architecture? Matching the data center network architecture, we can study the symmetric 3-stage Clos network of Figure 2.
In this symmetric 3-stage Clos network, each first-stage unit has n inputs, each third-stage unit has n outputs, and there are m second-stage units. To guarantee a non-blocking connection, that is, to complete an information exchange from A to B, there must be at least one idle inter-stage path, so the number of intermediate switching units must be at least (n-1) + (n-1) + 1 = 2n-1; the non-blocking condition is therefore m >= 2n-1, as shown below:


Figure 5 The non-blocking condition for a symmetric 3-stage Clos network is m >= 2n-1


In addition, the Slepian-Duguid theorem proves that when m >= n, the symmetric 3-stage Clos network is rearrangeably non-blocking. Rearrangeable non-blocking implies that end-to-end scheduling is performed across the switching network.
This also directly shows that in a strict-sense non-blocking Clos design, the bandwidth going up from the aggregation layer to the core must be at least (2n-1)/n = 2 - 1/n times the bandwidth below the aggregation layer, which is nearly double.
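As a quick sanity check, take n = 24 (say, a box switch with 24 downstream ports; the value is chosen only for illustration): strict-sense non-blocking requires m >= 2 x 24 - 1 = 47 middle-stage units, that is, an uplink-to-downlink bandwidth ratio of 47/24, about 1.96, which is indeed nearly double.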
In fact, the University of California researchers also state clearly in their paper that their fat-tree Clos design with a 1:1 bandwidth oversubscription ratio is rearrangeably non-blocking, and an appropriate scheduling algorithm must be used to satisfy the rearrangement condition as far as possible.
So if the condition m >= 2n-1 is satisfied, can you rest assured that end-to-end blocking is avoided? Of course not. Note that this non-blocking model is premised on uniformly distributed incoming and outgoing traffic; an uneven distribution can still cause input-side or output-side blocking. At this point we need to consider input buffering, output buffering, and speedup.
5. The Mysterious Speedup Ratio
The concept of speedup is straightforward: it is a means of reducing input and output blocking. If a switching network can transfer n cells from an input port to the outputs within one cell time, the speedup of that switching network is s = n. In layman's terms, the speedup ratio is the switching network's capacity to cope with a "many-against-one" pile-up. The larger the speedup ratio, the more intermediate units the switching network needs, and the more expensive it becomes.
It is easy to work out that for the switching network we studied earlier, the symmetric 3-stage Clos architecture, the strict-sense non-blocking condition m >= 2n-1 corresponds to a speedup of s = (2n-1)/n = 2 - 1/n, close to 2, while the rearrangeable non-blocking condition m = n corresponds to a speedup of 1.
With a speedup of 1, to avoid input-side blocking, the so-called head-of-line blocking, the switching network needs a sufficiently large buffer on the input side, and the end-to-end delay then becomes uncontrolled, or at least can grow very large, as shown below:


Figure 6 Switching model with input-side buffering, s = 1


As the speedup ratio increases, the probability of input-side blocking shrinks, but the likelihood of output-side blocking grows, so once the speedup exceeds 1 the switching network needs a sufficiently large buffer on the output side. When s = n (where n is the total number of input ports), no input-side buffering is needed at all, only output-side buffering; at this point the switching network is at its most expensive, but its latency is minimal, as shown below:


Figure 7 Switching model with output-side buffering, s = n


As a result, the speedup ratio is something switching network designers both love and hate, and they have turned that love-hate relationship into a question: is there a clever design that, with a speedup of 1 < s << n, delivers performance close to s = n at a cost close to s = 1?
The answer is yes. After much effort, the industry has concluded that a range of 2 < s < 5 is reasonable: paired with appropriate input and output buffers and suitable scheduling algorithms, it yields a switching network with a good balance of performance and cost.
According to the grapevine, the switching fabric speedup ratios of some products in the industry are as follows (no official confirmation, for reference only):
Cisco
· Cat6500: 4
· Nexus 7000: 3.x
· CRS-1: 2.5
H3C
· S12500: 1.8
Juniper
· T/MX series: 2.25
All right, that was quite a detour. For our Clos network architecture, how should the speedup ratio be chosen?
Let's look again at the design of the University of California researchers. With a speedup ratio of 1, a global centralized traffic scheduling system, the FlowScheduler, directly controls path selection at the aggregation and core devices to obtain a better traffic distribution, while a FlowReporter built into each access switch collects flow information. Even so, in their tests with random flows, the network's bandwidth utilization did not exceed 75%. Clearly this is not a perfect solution, and it is hard to implement on ordinary commercial products, which makes it a painful one for most designers hoping to adopt the Clos architecture. Perhaps, as SDN technologies such as OpenFlow develop, more progress can be made in this area. Juniper's QFabric is also, to some extent, a Clos network architecture with centralized control and scheduling; many of its design details are undisclosed, and its real-world effectiveness awaits large-scale commercial testing.
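For intuition, here is a minimal sketch of this kind of centralized scheduling (my own illustration of pinning large flows to the least-loaded core path; the class and method names are assumptions, not the paper's FlowScheduler implementation or QFabric's internals):

class FlowScheduler:
    # Tracks the rate assigned to each core path and pins each newly
    # reported large flow to the currently least-loaded path.
    def __init__(self, core_paths):
        self.load = {p: 0 for p in core_paths}  # assigned rate per core path
        self.assignment = {}                    # flow 5-tuple -> core path

    def report_flow(self, five_tuple, rate_bps):
        # Called when an access switch reports a large ("elephant") flow.
        if five_tuple not in self.assignment:
            path = min(self.load, key=self.load.get)
            self.load[path] += rate_bps
            self.assignment[five_tuple] = path
        return self.assignment[five_tuple]

sched = FlowScheduler(["core-1", "core-2", "core-3", "core-4"])
flow = ("10.0.0.1", "10.1.0.1", 40001, 80, "tcp")
print(sched.report_flow(flow, 800_000_000))  # the core path this flow is pinned to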
From the preceding analysis, without a traffic scheduling system, input-side and output-side blocking cannot be avoided in practice; the possibility of blocking can only be minimized. Given that load balancing across multiple equal-cost paths is imperfect, and that inexpensive box devices with their small input and output buffers cannot be orchestrated globally, a certain speedup is evidently needed. Taking cost into account, a speedup ratio of around 2 may be the ideal choice. Pinning down the exact speedup ratio requires further laboratory testing and validation, and it also depends on the devices selected.
6. Summary
In the data center, a fat-tree Clos network architecture built from ordinary commercial box switches can realize a scalable network that does not depend on any particular vendor's products. However, without a global centralized traffic scheduling and arbitration mechanism, and without globally provisioned input and output buffers, it is difficult to match the non-blocking forwarding capability of today's commercial core devices, and application performance will suffer to some degree when the network is congested. Choosing an appropriate speedup ratio, such as the roughly 2 suggested above, can reduce congestion. The design of the Clos network architecture needs further research, testing, and validation, and may become more practical in the future through SDN.
