Summary:
A growing number of service providers are building ever-larger data centers to supply the computing capacity their services require. As a result, researchers and industry practitioners have devoted considerable effort to designing network fabrics that interconnect these data centers efficiently and to traffic-engineering techniques that keep them performing well. Unfortunately, data center operators are often reluctant to share details of their applications and workloads, which makes it hard to evaluate how useful any particular design really is.
Moreover, the limited large-scale workload information that has been published, for better or worse, comes largely from a single data center operator whose use cases are not necessarily representative. In this article we report on the network traffic observed in some of Facebook's data centers. While Facebook runs classic data center services such as Hadoop, its core web service and supporting cache infrastructure behave in ways that contrast with what has been reported in the literature. We discuss the implications of the locality, stability, and predictability of Facebook's traffic for network fabric design, traffic engineering, and switch design.
Introduction:
We find that the traffic workloads studied in the literature do not represent Facebook's needs as a whole. This matters in particular when considering new network architectures, traffic engineering schemes, and switch designs.
For example, the best choice of topology for a data center interconnect depends on how hosts communicate. Lacking concrete data, however, researchers often design for a worst-case all-to-all traffic model in which every host communicates with every other host with the same frequency and intensity, which can be wasteful. If demand can be predicted, or remains stable over reasonable periods of time, non-uniform fabrics become viable: different parts of the data center can be connected in different ways, for instance through hybrid designs that incorporate optical or wireless links.
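To make the contrast concrete, here is a minimal Python sketch (with made-up rack counts and demand figures, not Facebook data) comparing the cross-bisection demand implied by a worst-case all-to-all matrix with that of a locality-skewed one:

```python
# Minimal sketch (hypothetical numbers): compare the bisection-crossing demand
# of a worst-case all-to-all traffic matrix with a locality-skewed one.
import numpy as np

racks = 8
per_rack_demand = 10.0  # Gbps of total egress per rack (made-up figure)

# Worst case: every rack sends equally to every other rack.
uniform = np.full((racks, racks), per_rack_demand / (racks - 1))
np.fill_diagonal(uniform, 0.0)

# Skewed case: 60% of each rack's traffic goes to two "partner" racks,
# the rest is spread over the remaining racks (an assumed, not measured, split).
skewed = np.full((racks, racks), per_rack_demand * 0.4 / (racks - 3))
np.fill_diagonal(skewed, 0.0)
for r in range(racks):
    for partner in ((r + 1) % racks, (r + 2) % racks):
        skewed[r, partner] = per_rack_demand * 0.3

def cross_bisection(matrix):
    """Traffic crossing a cut between racks [0, n/2) and [n/2, n)."""
    half = matrix.shape[0] // 2
    return matrix[:half, half:].sum() + matrix[half:, :half].sum()

print("uniform cross-bisection demand:", cross_bisection(uniform))
print("skewed  cross-bisection demand:", cross_bisection(skewed))
```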
Another way to improve performance is to improve the switches themselves. Some proposals make simplifications such as reducing buffering, port counts, or the complexity of the switching fabric; others replace traditional packet switches with circuit-switched or hybrid designs that exploit the locality, persistence, and predictability of traffic demand. At the extreme, host-based solutions push switching functionality onto the hosts themselves. Obviously, the value of any of these schemes depends on the offered load.
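As a rough illustration of how such hybrid designs might use demand measurements, the following toy sketch flags rack pairs whose (synthetic) demand is both heavy and stable enough to be candidates for a circuit; the thresholds and data are assumptions, not values from the article:

```python
# Toy sketch (assumed thresholds, synthetic data): flag rack pairs whose demand
# is heavy and stable enough that a hybrid design might serve them with
# circuits, leaving everything else on the packet-switched fabric.
import random
import statistics

random.seed(0)

# Synthetic per-second demand samples (Gbps) for a handful of rack pairs.
samples = {
    ("rackA", "rackB"): [4.0 + random.uniform(-0.2, 0.2) for _ in range(60)],
    ("rackA", "rackC"): [0.3 + random.uniform(-0.25, 0.25) for _ in range(60)],
    ("rackB", "rackD"): [random.choice([0.1, 6.0]) for _ in range(60)],
}

HEAVY_GBPS = 1.0        # assumed threshold for "worth a circuit"
MAX_REL_STDDEV = 0.25   # assumed threshold for "predictable enough"

for pair, series in samples.items():
    mean = statistics.mean(series)
    rel_stddev = statistics.pstdev(series) / mean if mean else float("inf")
    circuit = mean >= HEAVY_GBPS and rel_stddev <= MAX_REL_STDDEV
    print(pair, "circuit" if circuit else "packet fabric",
          f"(mean={mean:.2f} Gbps, rel_stddev={rel_stddev:.2f})")
```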
Meanwhile, some of these proposals have been hard to evaluate properly because of the scarcity of data from large-scale data centers; almost all prior studies of large-scale data centers examine Microsoft data centers. Facebook's data centers share some similarities with Microsoft's, such as avoiding virtual machines, but they also differ in important ways. Several of the key differences lead to different conclusions; we describe these differences and explain the reasons behind them.
Our study reveals several findings with important implications for data center design:
Traffic is neither rack-local nor all-to-all; locality depends on the service, but it is stable across time intervals ranging from seconds to days. An efficient fabric can therefore benefit from variable degrees of oversubscription and may need less intra-rack bandwidth than is typically deployed.
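A minimal sketch of how such a locality breakdown might be computed from flow records is shown below; the record fields and byte counts are hypothetical, not Facebook's actual schema:

```python
# Minimal sketch (assumed record format): classify flow bytes by locality level
# -- intra-rack, intra-cluster, or inter-cluster -- per service, to see how far
# a given service's traffic actually travels.
from collections import defaultdict

# Hypothetical flow records; real data would come from a sampled flow collector.
flows = [
    {"service": "hadoop", "src_rack": "r1", "dst_rack": "r1",
     "src_cluster": "c1", "dst_cluster": "c1", "bytes": 9_000_000},
    {"service": "web",    "src_rack": "r2", "dst_rack": "r7",
     "src_cluster": "c1", "dst_cluster": "c2", "bytes": 1_200_000},
    {"service": "cache",  "src_rack": "r3", "dst_rack": "r5",
     "src_cluster": "c1", "dst_cluster": "c1", "bytes": 4_500_000},
]

def locality(flow):
    if flow["src_rack"] == flow["dst_rack"]:
        return "intra-rack"
    if flow["src_cluster"] == flow["dst_cluster"]:
        return "intra-cluster"
    return "inter-cluster"

totals = defaultdict(lambda: defaultdict(int))
for flow in flows:
    totals[flow["service"]][locality(flow)] += flow["bytes"]

for service, buckets in totals.items():
    total = sum(buckets.values())
    shares = {k: f"{v / total:.0%}" for k, v in buckets.items()}
    print(service, shares)
```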
Many flows are long-lived but not especially large. Load balancing distributes traffic effectively, so demand is quite stable even over sub-second intervals. As a result, heavy-hitter flows are not much larger than the median flow, and the set of heavy hitters changes rapidly: flows that look heavy over short intervals are frequently not heavy hitters over longer periods, which confounds many traffic engineering approaches.
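The churn in heavy hitters can be quantified roughly as follows; this sketch uses synthetic per-window byte counts and an assumed top-k size, not measured data:

```python
# Sketch (synthetic data): rank the top-k "heavy hitter" flows in consecutive
# windows and measure how much the set changes from one window to the next.
import random

random.seed(1)
FLOWS = [f"flow{i}" for i in range(50)]
K = 5

def window_bytes():
    # Every flow gets a similar draw, so no flow is dramatically heavier
    # than the median -- mimicking the finding summarized above.
    return {f: random.randint(90, 110) for f in FLOWS}

def top_k(byte_counts, k=K):
    return set(sorted(byte_counts, key=byte_counts.get, reverse=True)[:k])

prev = top_k(window_bytes())
for w in range(1, 6):
    cur = top_k(window_bytes())
    overlap = len(prev & cur) / K
    print(f"window {w}: heavy-hitter overlap with previous window = {overlap:.0%}")
    prev = cur
```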
Packets are small (the median length of non-Hadoop traffic is under 200 bytes) and do not exhibit on/off arrival behavior (that is, packets arrive continuously). Servers communicate with hundreds of hosts and racks concurrently (within 5-ms intervals), but most of their traffic is often destined to only a few tens of racks.
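A simple way to reproduce these two measurements on a packet trace is sketched below, using synthetic (timestamp, destination rack, size) tuples in place of a real capture:

```python
# Sketch (synthetic packet records): compute the median packet size and the
# number of distinct destination racks a host touches in each 5-ms window.
import random
import statistics
from collections import defaultdict

random.seed(2)
WINDOW_US = 5000  # 5-ms windows, as in the measurement summarized above

# Hypothetical (timestamp_us, dst_rack, size_bytes) tuples for one host.
packets = [(random.randint(0, 50_000),
            f"rack{random.randint(0, 30)}",
            random.choice([66, 120, 180, 1500]))
           for _ in range(5_000)]

print("median packet size:",
      statistics.median(size for _, _, size in packets), "bytes")

racks_per_window = defaultdict(set)
for ts, rack, _ in packets:
    racks_per_window[ts // WINDOW_US].add(rack)

counts = [len(s) for s in racks_per_window.values()]
print("distinct destination racks per 5-ms window:",
      "min", min(counts), "median", statistics.median(counts), "max", max(counts))
```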
Section 2 of the article reviews the main findings of prior research on data center traffic, which fall into three categories: first, traffic is rack-local, with roughly 80% of server traffic staying within the rack; second, traffic is bursty and unstable at a wide range of time scales; and finally, packet sizes are bimodal, either full-size (MTU) or very small (e.g., TCP ACK segments).
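These earlier claims can be checked against a trace with two quick computations, sketched here on synthetic data with an assumed idle-gap threshold:

```python
# Sketch (assumed threshold, synthetic trace): two quick checks on a packet
# trace -- is the size distribution bimodal (MTU vs. ACK-sized), and does the
# arrival process show ON/OFF gaps longer than some idle threshold?
import random
from collections import Counter

random.seed(3)
IDLE_GAP_US = 1_000  # assumed idle threshold separating ON periods

# Hypothetical (timestamp_us, size_bytes) stream for one host's uplink.
trace, ts = [], 0
for _ in range(10_000):
    ts += random.randint(1, 50)           # closely spaced arrivals, no long gaps
    trace.append((ts, random.choice([66, 1500])))

size_histogram = Counter(size for _, size in trace)
print("packet-size histogram:", dict(size_histogram))

gaps = [b[0] - a[0] for a, b in zip(trace, trace[1:])]
idle_gaps = sum(1 for g in gaps if g > IDLE_GAP_US)
print(f"gaps longer than {IDLE_GAP_US} us: {idle_gaps} of {len(gaps)}")
```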
Section 3 describes Facebook's data centers in detail, including their physical topology (Facebook's 4-post cluster design), the services they host (how an HTTP request is served), and the data collection methods used (including Fbflow and port mirroring).
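For the port-mirroring side of the collection, a capture from a mirrored switch port could be post-processed along the lines of the following sketch; it assumes the scapy library and a hypothetical pcap file name, and uses a /24 prefix as a crude stand-in for a destination rack:

```python
# Sketch (assumes a pcap captured from a mirrored ToR port and that scapy is
# installed): walk the capture and bucket bytes by destination /24 as a crude
# stand-in for "destination rack". File name and prefix length are made up.
from collections import Counter

from scapy.all import IP, rdpcap  # pip install scapy

packets = rdpcap("mirrored_port_sample.pcap")   # hypothetical capture file

bytes_by_prefix = Counter()
for pkt in packets:
    if IP in pkt:
        prefix = ".".join(pkt[IP].dst.split(".")[:3]) + ".0/24"
        bytes_by_prefix[prefix] += len(pkt)

for prefix, nbytes in bytes_by_prefix.most_common(10):
    print(prefix, nbytes, "bytes")
```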
The design, scale, and even the suitability of a technology for a data center interconnect depend largely on the traffic demands placed on it. Section 4 quantifies traffic intensity, locality, and stability across three different types of clusters in Facebook's data centers: Hadoop, Frontend machines serving web requests, and Cache.
Previous studies have shown that the apparent stability of data center traffic depends on the time scale at which it is observed. Section 5 analyzes Facebook's traffic at finer time scales to understand the applicability of various traffic engineering and load balancing approaches in this setting.
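One way to examine stability across time scales is to aggregate the same demand at several window lengths and compare the variability, as in this sketch over synthetic samples:

```python
# Sketch (synthetic data): measure how stable one host's demand toward a single
# destination rack looks when aggregated over windows of different lengths.
import random
import statistics
from collections import defaultdict

random.seed(4)

# Hypothetical (timestamp_ms, bytes) samples over a 10-second span.
events = [(random.randint(0, 10_000), random.randint(500, 1500))
          for _ in range(50_000)]

for window_ms in (10, 100, 1000):
    buckets = defaultdict(int)
    for ts, nbytes in events:
        buckets[ts // window_ms] += nbytes
    rates = list(buckets.values())
    rel_stddev = statistics.pstdev(rates) / statistics.mean(rates)
    print(f"{window_ms:>5}-ms windows: relative stddev of demand = {rel_stddev:.2f}")
```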
Section 6 focuses on aspects of top-of-rack switch design. In particular, we consider packet sizes and arrival processes, as well as the number of concurrent destinations for any particular end host. We also examine the impact of short-timescale bursts and other effects on switch buffering.
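To illustrate how short bursts translate into buffer pressure, the following toy simulation drains a single FIFO buffer at an assumed 10-Gbps rate while feeding it a made-up arrival schedule; all sizes and rates are illustrative, not measurements from the article:

```python
# Toy sketch (made-up sizes and rates): feed a single FIFO buffer with a bursty
# arrival schedule and track peak occupancy and drops, to see how short bursts
# translate into buffer pressure on a ToR switch.
BUFFER_BYTES = 1_000_000          # assumed buffer available to this port
DRAIN_BYTES_PER_US = 1250         # ~10 Gbps expressed in bytes per microsecond

# Hypothetical arrival schedule: (time_us, burst_bytes) pairs.
arrivals = [(t, 10_000) for t in range(0, 1000, 10)]      # ~8 Gbps background
arrivals += [(t, 300_000) for t in range(500, 520, 5)]    # short burst
arrivals.sort()

occupancy, peak, dropped, last_t = 0, 0, 0, 0
for t, burst in arrivals:
    occupancy = max(0, occupancy - (t - last_t) * DRAIN_BYTES_PER_US)
    if occupancy + burst <= BUFFER_BYTES:
        occupancy += burst
    else:
        dropped += occupancy + burst - BUFFER_BYTES
        occupancy = BUFFER_BYTES
    peak = max(peak, occupancy)
    last_t = t

print("peak occupancy:", peak, "bytes; dropped:", dropped, "bytes")
```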
Conclusion:
Facebook's data center network supports a variety of distinct services that exhibit different traffic patterns, several of which deviate substantially from those reported in the literature. This application mix, combined with the scale of Facebook's data center network (hundreds of thousands of nodes) and its speed (10-Gbps edge links), yields workloads that contrast in a number of ways with previously published datasets. Space constraints prevent an exhaustive treatment, so we describe the likely impact on topology, traffic engineering, and top-of-rack switch design.
Our methodology imposes some limits on the scope of this study. Capturing and timestamping packets at end hosts ties timestamp accuracy to scheduling variations on those hosts. In addition, we can only capture traffic from a few hosts at a time; together these constraints prevent us from evaluating pathologies such as incast or microbursts that others have identified as performance contributors. Furthermore, per-host packet dumps remain anecdotal and ad hoc, since they rely on the presence of a lightly used host in the same rack as the target. Fbflow, deployed across the data center, produces an enormous volume of measurement data, which presents a different challenge: processing and storing that data. We therefore believe that effective network monitoring and analysis remains an ongoing, evolving problem.
Inside the Social Network's (Datacenter) Network