High-availability network system design for data centers
A data center can experience many types of faults, but the result is similar: a device, link, or server fails and can no longer provide normal service to the outside. The simplest way to mitigate such failures is redundancy design. By providing backups for devices, links, and servers, the impact of a fault on the business can be minimized. But is redundancy alone enough to mitigate the impact of faults? Some people simply equate network availability with redundancy.
In fact, redundancy is only one aspect of the overall availability architecture. Over-emphasizing redundancy can actually reduce availability and erode the benefits redundancy brings, because redundancy also has the following disadvantages:
- Increased network complexity
- Increased network support burden
- Greater difficulty in configuration and management
Therefore, high-availability design for a data center is a comprehensive undertaking. In addition to selecting highly reliable device components and adding network redundancy, the network architecture and protocol deployment must also be optimized to achieve true high availability. When designing a highly available data center network, you can refer to the OSI seven-layer model, ensure high availability at each layer, and ultimately achieve high availability of the data center's underlying network system, as shown in Figure 1.
Figure 1 Hierarchy model for data center high-availability design
High-availability Network Architecture Design
When planning and designing the data center architecture, enterprises should generally follow modular and hierarchical principles so as to avoid large-scale rework later, which wastes both time and investment.
Modular Design
Modular design means dividing the network into a series of functional modules, based on a functional analysis of applications that have different functions, or the same functions but different performance and specifications within a certain range. The modules are loosely coupled and, while meeting business application requirements, aim to keep the network stable and reliable, easy to expand, simple in structure, and easy to maintain.
Hierarchical Design
Hierarchical design covers two aspects: the network architecture hierarchy and the application system hierarchy. As network and security device virtualization continues to mature, the application system hierarchy can be implemented through device configuration without affecting the physical topology of the network. For the network architecture hierarchy, choosing between a three-tier and a two-tier architecture is a challenge many enterprises face when building data center networks.
From a reliability perspective, both the three-tier and the two-tier architecture can achieve a highly available data center network. In recent years, with the rise of cloud computing, the flat two-tier architecture has proven better suited to cloud computing network models: it can support large-scale server virtualization clusters and flexible virtual machine migration. Neither architecture is absolutely better than the other; enterprises can choose based on their own business characteristics, and may also adopt two-tier networking only in certain functional zones.
Device High-Availability Design
Device reliability is the most basic guarantee of system reliability, and it is especially important to ensure the reliability and stability of the equipment in the data center's core switching area. Although the failure probability and impact scope of core equipment can be reduced by adjusting and optimizing the architecture, policies, and configuration, fundamentally addressing hardware and software faults requires selecting data center-class network devices.
There is no standard industry definition of data center-class equipment. However, judging from the data center solutions offered by mainstream network equipment vendors, data center-class switches should have the following features:
1) Physical separation of the control plane and forwarding plane
The control plane and forwarding plane are physically separated in hardware, so a switchover of the main processing engine does not affect forwarding and zero packet loss can be achieved. Both planes have their own independent redundant architecture, providing separate control and forwarding redundancy for higher reliability.
2) Redundancy of more key components
In addition to redundant main processing engines and switching fabric boards, such equipment can generally be fitted with multiple power modules to achieve N+M power redundancy. Fan redundancy has also been upgraded from redundancy of individual fans to redundancy of fan trays, with multiple redundant fans inside each independent fan tray.
3) Virtualization capability
Data centers are becoming ever more complex, with more and more devices to manage. Device virtualization can combine multiple devices at the same layer (core, aggregation, or access) into a single logical device, simplifying configuration and management.
4) Large traffic buffering capacity
Data center-class equipment based on a CLOS architecture expands port buffer capacity and uses a new generation of distributed buffering that moves the traditional egress buffer to the ingress direction. With the same per-port buffer capacity, this distributed buffering handles the many-to-one congestion model better and absorbs data center traffic bursts more effectively, as the sketch below illustrates.
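To make the many-to-one congestion model concrete, here is a rough Python sketch in which the number of senders, burst sizes, and link speed are purely illustrative assumptions; it shows why holding the burst across the ingress buffers of many sending ports is easier than holding it all in one egress buffer.

# A rough, back-of-the-envelope sketch of the many-to-one congestion model
# described above. All traffic figures are assumptions chosen for illustration,
# not specifications of any product.

SENDERS = 16                        # servers bursting toward one receiver port
BURST_PER_SENDER_KB = 256           # assumed synchronized burst from each sender
EGRESS_LINE_RATE_GBPS = 10          # speed of the single receiving port

total_burst_kb = SENDERS * BURST_PER_SENDER_KB
drain_time_ms = total_burst_kb * 8 / (EGRESS_LINE_RATE_GBPS * 1e6) * 1000

print(f"Burst converging on one egress port: {total_burst_kb} KB")
print(f"Time to drain that burst at line rate: {drain_time_ms:.2f} ms")

# With egress-only buffering, a single output port's buffer must hold the whole
# burst; with distributed ingress buffering, each of the 16 ingress ports only
# holds its own share while the fabric schedules it toward the egress port.
per_ingress_share_kb = total_burst_kb // SENDERS
print(f"Buffer needed per ingress port: {per_ingress_share_kb} KB "
      f"vs {total_burst_kb} KB at a single egress buffer")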
Link Layer (L2) High-Availability Design
Virtualization technology, represented by H3C IRF2, horizontally integrates each layer of the network without changing the physical topology of the traditional design or the existing cabling: two or more physical devices at each layer of the switching network are combined into a unified switching architecture. This reduces the number of logical devices and allows links to be aggregated across devices, eliminating loops while keeping links highly available.
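As a toy illustration of this idea (names and data structures are assumptions, not an IRF implementation), the Python sketch below models two physical core switches presented as one logical device, so a downstream switch can place one uplink to each physical member into a single cross-device aggregation group:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LogicalSwitch:
    """Two or more physical chassis presented as a single logical device."""
    name: str
    members: List[str]                                  # physical chassis in the stack
    lag_groups: Dict[str, List[str]] = field(default_factory=dict)

    def add_cross_device_lag(self, group: str, links: List[str]) -> None:
        """Bundle member links that terminate on different physical chassis."""
        self.lag_groups[group] = links

core = LogicalSwitch("core", members=["core-1", "core-2"])
# The access switch sees only one neighbor, so both uplinks join one aggregation group:
core.add_cross_device_lag("agg-group-1", ["core-1:port1", "core-2:port1"])
print(core.lag_groups)

Because there is now a single logical neighbor reached over a single aggregated link, no spanning-tree blocked port is needed to break a loop, and if one physical member fails, traffic stays on the surviving member link of the same aggregation group.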
Protocol Layer (L3) High-Availability Design
Protocol-level high availability in the data center network can be considered from the following two aspects:
1) Fast detection and switchover
To reduce the impact of device faults on data center services and improve network availability, a device must detect communication faults with its adjacent devices as quickly as possible so that timely measures can be taken and the business can continue. The Hello packet mechanism of a routing protocol generally takes seconds to detect a fault; at the Gbps speeds of traffic inside a data center, a large amount of data can be lost in that time.
BFD (Bidirectional Forwarding Detection) was created for exactly this situation. It is a unified, network-wide detection mechanism used to quickly detect and monitor the connectivity of links or IP forwarding paths, so that communication faults between neighbors are detected quickly and a backup channel can be established within 50 ms to resume communication. BFD can be deployed in the WAN/MAN egress module, as shown in Figure 9: the OSPF dynamic routing protocol runs between the data center core layer and the external modules (WAN and MAN), and BFD is linked with OSPF routing on the core switches. When a WAN or MAN routing device or link fails, the core switch detects it quickly and notifies OSPF to converge rapidly, shortening the recovery time for the data center's external traffic.
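The short Python sketch below puts rough numbers on the difference between Hello-based detection and BFD-based detection. The OSPF timers are the common defaults, while the BFD interval, multiplier, and link speed are assumptions for the example (actual values depend on configuration and platform):

# Illustrative comparison of failure-detection times; not an implementation of
# either protocol.

OSPF_HELLO_INTERVAL_S = 10                                # common default hello timer
OSPF_DEAD_INTERVAL_S = 4 * OSPF_HELLO_INTERVAL_S          # neighbor declared down after 40 s

BFD_TX_INTERVAL_MS = 10                                   # assumed BFD transmit interval
BFD_DETECT_MULTIPLIER = 3                                 # missed packets before session is down
BFD_DETECTION_TIME_MS = BFD_TX_INTERVAL_MS * BFD_DETECT_MULTIPLIER

LINK_RATE_GBPS = 10                                       # assumed link speed

def traffic_at_risk_bytes(detection_time_s: float, rate_gbps: float) -> float:
    """Upper bound on traffic sent toward a failed path before the fault is noticed."""
    return detection_time_s * rate_gbps * 1e9 / 8

print(f"OSPF Hello-based detection: {OSPF_DEAD_INTERVAL_S} s, "
      f"~{traffic_at_risk_bytes(OSPF_DEAD_INTERVAL_S, LINK_RATE_GBPS) / 1e9:.0f} GB at risk")
print(f"BFD-based detection: {BFD_DETECTION_TIME_MS} ms, "
      f"~{traffic_at_risk_bytes(BFD_DETECTION_TIME_MS / 1000, LINK_RATE_GBPS) / 1e6:.1f} MB at risk")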
2) Uninterrupted forwarding
In a data center network running a dynamic routing protocol, an active/standby switchover on a device can cause its neighbor relationships to flap. Such flapping eventually triggers routing protocol reconvergence and recalculation, so the device that has just switched over may suffer a routing black hole for a period of time, or its neighbors may route traffic around it; in either case the business may be temporarily interrupted.
To achieve uninterrupted forwarding, the device itself must separate forwarding from control and provide dual main control boards. It must also preserve protocol (control plane) state and enlist the help of neighboring devices, so that during an active/standby switchover the control plane sessions are not reset and forwarding is not interrupted. The corresponding technology is the Graceful Restart (GR) extension of the routing protocols.
The core of the GR mechanism is that when a device restarts its routing protocol, it notifies the surrounding devices to keep its neighbor relationships and routes for a certain period of time. After the protocol restarts, the surrounding devices resynchronize routing information, restoring the device to its pre-restart state in the shortest possible time. Throughout the protocol restart, routing and forwarding across the network remain stable, the packet forwarding path is unchanged, and the system continues to forward IP packets.
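As a way of visualizing the helper-side behaviour described above, here is a minimal Python sketch; the data structures and addresses are simplified assumptions, not a real routing-protocol implementation. Routes learned from the restarting neighbor are kept but marked stale, refreshed routes lose the stale flag, and anything not re-advertised before the grace period expires is finally withdrawn:

from typing import Dict, Set

class GrHelper:
    """Keeps a restarting neighbor's routes alive during its grace period."""

    def __init__(self, routes_from_neighbor: Dict[str, str]):
        self.routes = dict(routes_from_neighbor)   # prefix -> next hop
        self.stale: Set[str] = set()

    def neighbor_restarting(self) -> None:
        # Nothing is withdrawn, so forwarding continues; routes are only marked stale.
        self.stale = set(self.routes)

    def route_readvertised(self, prefix: str, next_hop: str) -> None:
        # The neighbor has come back and resynchronized this prefix.
        self.routes[prefix] = next_hop
        self.stale.discard(prefix)

    def grace_period_expired(self) -> None:
        # Any route not refreshed within the grace period is finally removed.
        for prefix in self.stale:
            self.routes.pop(prefix, None)
        self.stale.clear()

helper = GrHelper({"10.1.0.0/16": "192.0.2.1", "10.2.0.0/16": "192.0.2.1"})
helper.neighbor_restarting()                       # switchover happens, routes are kept
helper.route_readvertised("10.1.0.0/16", "192.0.2.1")
helper.grace_period_expired()                      # 10.2.0.0/16 was not refreshed, so it is removed
print(helper.routes)                               # {'10.1.0.0/16': '192.0.2.1'}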
Application Layer (L4~L7) High-Availability Design
High availability at layers L4~L7 of the data center network is achieved mainly through load balancing. On one hand, L4~L7 load balancing improves server responsiveness and link bandwidth utilization; on the other hand, it ensures that when a single server or link fails, business traffic is seamlessly redistributed to the remaining servers and links, achieving high availability of the data center.
1) Link load balancing (LLB)
Link load balancing is usually deployed in the data center's WAN access and Internet access areas. Through static table matching and dynamic link probing, it monitors the status of multiple links in real time and ensures that traffic is distributed across the links in the most reasonable and fastest way, achieving efficient service delivery.
In the data center's WAN access area, the WAN egress traffic is still the enterprise's internal data flows, so different business flows can be distinguished by the L4 five-tuple of their IP packets. Load balancing, bandwidth guarantees for key services, and WAN link bundling can therefore be achieved with hierarchical CAR and cross-port traffic forwarding on the routers; no dedicated load-balancing device is required.
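As an illustration of five-tuple-based distribution, the short Python sketch below hashes a flow's five-tuple to pick one of several WAN links. The link names and hash scheme are assumptions for the example, not any router's actual algorithm:

import hashlib

WAN_LINKS = ["wan-link-1", "wan-link-2", "wan-link-3"]    # assumed egress links

def pick_link(src_ip: str, dst_ip: str, proto: str, src_port: int, dst_port: int) -> str:
    """Map a flow's five-tuple to one egress link."""
    five_tuple = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(five_tuple).digest()
    return WAN_LINKS[int.from_bytes(digest[:4], "big") % len(WAN_LINKS)]

print(pick_link("10.1.1.10", "10.8.2.20", "tcp", 51034, 443))

Because the mapping is deterministic per flow, all packets of a flow stay on one link and arrive in order, while the many flows of the enterprise spread statistically across all of the links.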
2) Server load balancing
Most application systems today adopt the B/S (browser/server) architecture, and the web servers in an enterprise data center must handle connection requests from both intranet and Internet users. A single server may therefore fall short in performance and reliability. To support more user access and provide server redundancy, load balancing can be deployed for the web servers in either of the following ways:
Server cluster software
Server cluster software (such as MSC) generally requires the server group to be in the same VLAN; other, non-special requirements are not covered here.
Server load balancing device
The server load balancing device provides a virtual service IP address (VSIP). When users request service at the VSIP, the LB device distributes the requests to the real servers according to its scheduling algorithm.
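The minimal Python sketch below shows the idea of a VSIP front-ending a pool of real servers with a simple round-robin scheduler; the addresses and the scheduling policy are illustrative assumptions, and real LB devices also offer other algorithms such as least-connections or weighted scheduling:

from itertools import cycle

VSIP = "203.0.113.10"                                     # assumed virtual service IP
REAL_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]    # assumed web server pool
_round_robin = cycle(REAL_SERVERS)

def schedule(client_ip: str) -> str:
    """Pick the real server that handles this client's request to the VSIP."""
    server = next(_round_robin)
    print(f"{client_ip} -> {VSIP} forwarded to {server}")
    return server

for client in ["198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"]:
    schedule(client)

In practice the LB device combines such a scheduler with health checks, so that a failed server is taken out of the pool and new requests to the VSIP go only to healthy members.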
Summary
Data concentration also means concentration of risk, of response, of complexity, and of investment. High-availability design and deployment is therefore a perennial topic in enterprise data center construction. As the saying goes, "do not build a high platform on sand": the network, as the data center's basic IT bearer platform, is the fundamental guarantee of a highly available IT system. Technology alone cannot solve every problem; achieving a highly available data center network also requires sound O&M procedures, rules and regulations, management systems, and other supporting measures. Following the development of the enterprise's business while continuously summarizing and accumulating experience is a long-term, gradual process.