Automated O & M practices for large-scale data centers
Hello everyone, I am Zhu Junhua, an O & M engineer at QingCloud. I worked at a customs organization for several years and have also worked at several foreign companies, including IBM and EMC.
I took a quick look at the group members and saw several former colleagues and many customers I have served. There are a lot of experts in the group, so I am a little nervous today.
Today's topic is "O & M practices for large data centers", a summary and review of my many years of work experience. My ability and energy are limited, so if anything is wrong, please correct me.
Today's talk covers:
Definition of data centers
Development and evolution of data centers
Data center levels
Definition of O & M
Data center O & M
Wikipedia describes a data center as follows: a data center, or server farm, is a facility used to house computer systems and related components, such as telecommunications and storage systems. It generally includes redundant and backup power supplies, redundant data communication connections, environmental controls (such as air conditioning and fire suppression), and various security devices.
Here is my own simple summary. A modern data center is generally a campus containing several buildings. Each building contains several rooms, called modules, which form the foundation. A complex network is built on top of them; hardware such as servers and network devices is deployed on the network; software runs on that hardware; and services are finally provided to the outside world.
In fact, this short summary touches every technical layer. Data centers are the cornerstone of modern IT systems, and I believe they will become a cornerstone of the normal operation of society in the future.
Development and Evolution of data centers
A data center usually refers to a building or a campus that contains many equipment rooms. In the early days, however, there was only one equipment room with only one machine in it, because early computer components were very large and needed a large number of cables.
This figure shows ENIAC, often described as the world's first general-purpose electronic computer, which was unveiled at the University of Pennsylvania on February 14, 1946. At that time, it was at once a computer, an equipment room, and the prototype of the data center. It is said that every time ENIAC was switched on, the lights in West Philadelphia dimmed.
This is what a data center looked like at the beginning of the 21st century. At that time it was more often called an equipment room. A building held many large rooms with undifferentiated, inefficient cooling. Servers of different customers were placed in the same room with no cabinets, no locks, and no isolation, so the security level was low.
Later, the raised-floor design emerged, which is still how many data centers are built today. The floor is raised, with the power and network cables and the cold-air cooling system underneath. Vents sit in the aisle between two rows of cabinets, and perforated floor tiles let the cold air through; the servers draw in cold air at the front and exhaust it from the back to dissipate heat. You can see a door in the picture, which closes off the aisle to some extent and improves cooling efficiency, but the top of the aisle is not closed. In addition, the cabinets in the figure have no doors or locks, so security is slightly worse.
There is also the design shown in the two figures above. The equipment room has a raised floor with the cooling system underneath. Each cabinet is enclosed, with its own door and lock, so security is high. Cold air enters each cabinet through a channel that can be shut off individually (the red line in the figure), which saves energy while still cooling well, although cooling for equipment in the upper half of the cabinet may be worse.
Nowadays, many new data centers adopt a micro-modular design. This design places fewer requirements on the building itself: no raised floor is needed, and with enclosed cooling systems and standardized cabling trays it is energy-efficient, tidy, and efficient.
Data center level
Currently, the most widely used data center levels are defined by the American ANSI/TIA-942 Telecommunications Infrastructure Standard for Data Centers, which divides data centers into the following four levels:
Tier I: basic data center
Tier II: redundant infrastructure components
Tier III: concurrently maintainable infrastructure
Tier IV: fault-tolerant infrastructure
Data centers at the highest level, Tier IV, are rare both in China and abroad. Currently, most data centers in China are Tier III. I will not go into the differences between the levels here; if you are interested, you can look them up online.
O & M Definition
I could not find a definition of O & M on Wikipedia. I don't know whether that is because it is too obvious or because it is too hard to define.
I would not dare to define O & M; I will just give my own understanding. I used to think that O & M was mostly the work done after a product or system is delivered and put into production and before the end of its lifecycle. However, with the development of the IT industry and the spread of DevOps, the requirements for O & M personnel keep rising, and they need to get involved in the lifecycle much earlier.
Taking data center O & M as an example, O & M personnel may need to take part in choosing the data center, including site selection, network provider selection, and inspection of the facilities and services in the data center, instead of starting O & M only after these decisions have been made.
In addition, I want to make it clear that today we are looking at data center O & M not only from the perspective of data center providers but also from the perspective of data center users. QingCloud currently uses services from multiple data centers, and we are also investigating building our own.
Data center O & M
Now we officially enter today's topic: data center O & M.
"Wind, fire, and water" in the data center
When talking about data center O & M, we often mention "wind, fire, water, and electricity".
Wind usually refers to air conditioning, cooling, ventilation, and filtration systems. Clean air can prolong service life and reduce the failure rate. Leaving aside the planned decommissioning time, the same machine will have a different service life and failure rate in Beijing than in Finland.
Fire refers to fire protection. It is often overlooked, but it is often the most critical part. In the event of a fire, the entire area may have to be powered off, and it is difficult to recover in a short time.
Water usually refers to humidity and moisture. If the humidity is too high, the service life of devices may be affected; if the air is too dry, static electricity may build up and damage devices.
Electricity refers to data center power. Power is regarded as the top priority of a traditional data center. Without power, the data center is an empty shell, so its power supply must be stable and backed up in multiple ways.
In addition to "wind, fire, water, and electricity", we should add "network". The data center must provide an efficient network, as close as possible to the backbone network. It also needs to provide BGP line services, which is an important criterion for many customers when choosing a data center.
Data Center Selection
The data center selection criteria fall into three categories: location, primary criteria, and secondary criteria. These criteria take different roles into account, including data center builders and users.
Location includes the city and district where the data center is located. It directly affects the budget, and at a minimum it should keep you clear of accidents such as the 2015 Tianjin explosions. It also affects whether you can recruit suitable employees, and you need to consider the response time when a fault occurs.
The primary criteria include whether there is enough space for future growth, whether the power supply is stable and affordable, whether environmentally friendly methods can provide cheap cooling (for example, in a northern location, natural cold air can be used for cooling most of the year), and whether efficient network connectivity is available.
The secondary criteria include infrastructure such as lighting and plumbing; security isolation facilities, walls, doors, windows, and equipment unloading areas on the data center campus; equipment such as carts and forklifts; staging rooms for pre-installation; monitoring and control centers; and other miscellaneous items, including security cameras, access cards, and anti-tailgating doors.
Production O & M
After a traditional data center goes into production, higher-level data centers arrange manual inspections. Cabinets leased by a customer, and the equipment in them, must be inspected by the customer's own staff. A company I used to work for had monitoring staff on three shifts, on standby 24x7, who had to go into the data center once every hour to check whether any device showed an alarm.
QingCloud is considering building its own data centers, so our O & M scope is more comprehensive. In addition to the building and infrastructure O & M of a traditional data center, it also covers physical devices such as servers and network devices, various operating systems and software, and our self-developed SDN, each of which could be a topic of its own.
Let's take a brief look at the possible scope of data center infrastructure O & M, including:
Security systems: security for campus buildings, access control systems, and monitoring systems
Fire protection systems: smoke detectors and fire suppression facilities
Environmental monitoring, such as temperature and humidity
Power supply facilities, including power distribution equipment, generators, UPS, and cabinet PDUs
Cooling systems, including air conditioning equipment, fresh air systems, and chillers
Other miscellaneous items, such as cabling (power and network cables) and checking the data center for flammable and explosive objects, which must be removed promptly
From the perspective of a data center user, we hope that the data center can provide more efficient services, such as:
An efficient entry application system for both personnel and equipment
Efficient unloading channels and convenient staging rooms
Free and efficient access to the equipment room after authentication, so that we can operate our own equipment
Data center staff who can efficiently provide the data and services customers need, such as cabinet power consumption, along with more personalized and professional services
Next, we will discuss the user's O & M for their own devices and services.
For servers and network devices, do you choose big-brand servers such as DELL or IBM, or more cost-effective custom-built machines?
QingCloud chose the latter. In the cloud computing era, we assume that physical devices such as servers are unreliable and must rely on upper-layer software for reliability.
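To make this reasoning concrete, here is a minimal sketch of why redundancy in the software layer can compensate for unreliable hardware. It assumes that replicas fail independently and that the service is up as long as one replica is up. The function name and the 99% per-server figure are illustrative assumptions, not QingCloud measurements.

```python
def combined_availability(per_node: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    Assumes replica failures are independent, which machines sharing a
    rack or a power feed do not fully satisfy in practice.
    """
    if not 0.0 <= per_node <= 1.0:
        raise ValueError("per_node must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - per_node) ** replicas


if __name__ == "__main__":
    # Illustrative figure only: each commodity server is up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, each additional replica multiplies the expected downtime by the failure probability of a single machine, which is why cheap hardware plus replication can outperform a single expensive server.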
Operating System Selection: Linux or Windows?
Without a doubt, QingCloud's systems run on Linux. But we still need to consider how to initialize servers efficiently and install the operating system quickly, as well as the file system, kernel parameter tuning, drivers for various hardware, kernel versions, and kernel panics. The application layer involves even more.
How to efficiently initialize the system
Efficient system initialization includes BIOS tuning and RAID configuration.
There are many ways to install Linux. The original approach was to burn the Linux installation ISO to a disc and install from it, but today, if your server comes with an optical drive, you have probably been fooled. Later, we wrote the ISO to a USB flash drive and installed manually. A step further, you can write Kickstart/Preseed files so that the USB installation runs automatically. This is sufficient for a small number of devices.
For large-scale deployment, we currently use the network to configure RAID, install the operating system, and tune the BIOS automatically.
Our goal for a new machine is that, once the physical connections are ready, it can go into production within half an hour of being powered on, including BIOS tuning, RAID configuration, operating system installation, network configuration, and application installation. The operating system can be installed over the network with PXE; Cobbler is a common open-source choice. I will not go into RAID configuration and BIOS tuning here, because different hardware vendors use different methods.
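As an illustration of the unattended-installation step, the sketch below renders a per-host Kickstart file from a template, the kind of file a PXE or Cobbler setup would serve to the installer. The directives are standard Kickstart commands, but the values chosen (language, time zone, partitioning, package group), the function name, and the host name in the usage note are assumptions for illustration. A real file would also need an installation source, and the vendor-specific RAID and BIOS steps are left out.

```python
from string import Template

# Minimal Kickstart template. Values such as the language, the default
# time zone, and the @core package group are illustrative assumptions.
KICKSTART_TEMPLATE = Template("""\
lang en_US.UTF-8
keyboard us
timezone $timezone
network --bootproto=dhcp --hostname=$hostname
rootpw --iscrypted $root_password_hash
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@core
%end
""")


def render_kickstart(hostname: str, root_password_hash: str,
                     timezone: str = "Asia/Shanghai") -> str:
    """Return the Kickstart text for one host."""
    if not hostname or any(c.isspace() for c in hostname):
        raise ValueError("hostname must be non-empty and contain no whitespace")
    return KICKSTART_TEMPLATE.substitute(
        hostname=hostname,
        root_password_hash=root_password_hash,
        timezone=timezone,
    )
```

For example, `render_kickstart("node-001", hashed_password)` returns the text for a hypothetical host `node-001`; generating one file per machine from an inventory list is what lets hundreds of servers install themselves without anyone at the console.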
After the operating system and network are ready, we need to configure specific applications and services on the server. For this we can use configuration management tools. Common ones include the veteran Cfengine, Puppet and Chef, which many large companies use, and newer tools such as Saltstack and Ansible. The best tool, however, is the one your engineers are comfortable and familiar with.
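Whichever tool is chosen, the idea these tools share is idempotent convergence: you describe the desired state, and applying it a second time changes nothing. The sketch below shows that idea for a simple key/value file such as a kernel parameter file. The function name `ensure_setting` is hypothetical and is not the API of any of the tools named above.

```python
from pathlib import Path


def ensure_setting(path: Path, key: str, value: str) -> bool:
    """Make sure `key = value` appears exactly once in a key/value file.

    Returns True if the file was changed and False if it already matched,
    so repeated runs are safe.
    """
    wanted = f"{key} = {value}"
    lines = path.read_text().splitlines() if path.exists() else []
    result = []
    found = False
    for line in lines:
        name = line.split("=", 1)[0].strip()
        if name == key:
            if not found:
                result.append(wanted)  # replace the first occurrence
                found = True
            continue  # drop any duplicate occurrences
        result.append(line)
    if not found:
        result.append(wanted)
    if result == lines:
        return False  # already in the desired state: do not touch the file
    path.write_text("\n".join(result) + "\n")
    return True
```

Returning whether anything changed matters in practice: it lets the caller restart a service or reload a setting only when the configuration actually moved.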
Automated O & M
What I have covered so far focuses on the first half of the product lifecycle. As scale grows, traditional O & M methods that rely on regular manual inspections and on staring at the big screen in the monitoring center to watch for alarms are outdated. The only way out is automation.
O & M automation has been discussed ever since the Internet boom. Data center O & M has become increasingly complex because data centers keep evolving and the applications they carry have become much more complex. These problems cannot be solved efficiently just by adding people; processes and tools must be introduced for standardized management.
An important part of automated O & M is a sound monitoring system. It needs to monitor every aspect of the data center, including the physical facilities and environment, but those are not our focus today. Today we will mainly discuss monitoring of networks, systems, and related parts.
Monitoring may include:
Attacks, both internal and external: we need to find the source quickly and eliminate the threat
Sensors on network and server equipment, including temperature, voltage, and power redundancy
Network traffic, broadcast storms, and network loops
Server hardware: the status of physical components, including CPU, memory, motherboard, and power supply, can usually be obtained out of band through IPMI
Server storage, including physical disks, RAID groups, RAID card battery status, and media errors; LSI RAID cards can be inspected with MegaCli, and Adaptec cards with Arcconf
Operating system: system resources (CPU, memory, file system space and inode usage, network traffic, and system load), processes and services, storage performance (throughput and IOPS), and system and application logs
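The operating-system items in this list are the easiest to illustrate. The sketch below checks file system space, inode usage, and load average using only the Python standard library on a Unix host. The function names and the default thresholds (90% and a load of 2 per CPU) are assumptions for illustration; a production system would normally rely on an existing monitoring agent.

```python
import os
import shutil


def filesystem_usage(path: str = "/") -> dict:
    """Return space and inode usage percentages for the file system at `path`."""
    space = shutil.disk_usage(path)
    vfs = os.statvfs(path)  # Unix only
    space_pct = 100.0 * space.used / space.total if space.total else 0.0
    # Some file systems report zero total inodes; treat that as 0% used.
    inode_pct = (100.0 * (vfs.f_files - vfs.f_ffree) / vfs.f_files
                 if vfs.f_files else 0.0)
    return {"space_used_pct": space_pct, "inode_used_pct": inode_pct}


def check_host(path: str = "/", space_limit: float = 90.0,
               inode_limit: float = 90.0,
               load_per_cpu_limit: float = 2.0) -> list:
    """Return problem descriptions; an empty list means no limit was crossed."""
    problems = []
    usage = filesystem_usage(path)
    if usage["space_used_pct"] > space_limit:
        problems.append(f"{path}: space {usage['space_used_pct']:.1f}% used")
    if usage["inode_used_pct"] > inode_limit:
        problems.append(f"{path}: inodes {usage['inode_used_pct']:.1f}% used")
    load1 = os.getloadavg()[0]  # 1-minute load average, Unix only
    cpus = os.cpu_count() or 1
    if load1 / cpus > load_per_cpu_limit:
        problems.append(f"load {load1:.2f} on {cpus} CPUs")
    return problems


if __name__ == "__main__":
    for problem in check_host():
        print(problem)
```

Inode usage is checked separately from space because a file system full of tiny files can run out of inodes while still showing plenty of free space.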
With a sound monitoring system in place, we also need real-time alerting (email, IM, SMS). It must neither miss real problems nor produce too many false positives; otherwise no one will pay attention to the alerts and they become useless.
Many open-source monitoring tools are available, including Nagios, Cacti, Ganglia, Zabbix, Zenoss Core, and SmokePing. Each has its own strengths, and you can combine several of them to build your own monitoring system.
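One common way to cut down on false positives is to alert only after several consecutive failed checks and to notify once per incident rather than on every failed check. The sketch below shows that technique; the class name and the default threshold of three are assumptions, and this is not a description of QingCloud's alerting system.

```python
class AlertDebouncer:
    """Turn raw check results into alert events.

    An alert fires only after `threshold` consecutive failures and fires
    once per incident. A recovery event is emitted when the check passes
    again after an alert.
    """

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def observe(self, ok: bool):
        """Feed one check result; return "ALERT", "RECOVERED", or None."""
        if ok:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERED"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "ALERT"
        return None
```

A single transient failure therefore produces no message, while a sustained failure produces exactly one alert and one recovery notice. The threshold trades detection delay against noise.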
With monitoring and alerting in place, we also need statistical reports on resource usage (daily, monthly, peak, and valley), which will be the basis for capacity planning.
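A report of this kind can be sketched in a few lines. The function below groups timestamped usage samples by calendar day and computes the peak, valley, and average for each day. The function name and the sample format, a list of `(datetime, value)` pairs, are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_report(samples):
    """Summarize (timestamp, value) samples per calendar day.

    Returns {date: {"peak": ..., "valley": ..., "average": ...}},
    with days in chronological order.
    """
    by_day = defaultdict(list)
    for ts, value in samples:
        by_day[ts.date()].append(value)
    return {
        day: {"peak": max(values), "valley": min(values), "average": mean(values)}
        for day, values in sorted(by_day.items())
    }


if __name__ == "__main__":
    # Made-up CPU utilization samples, in percent.
    demo = [
        (datetime(2016, 3, 1, 0, 0), 20.0),
        (datetime(2016, 3, 1, 12, 0), 80.0),
        (datetime(2016, 3, 2, 6, 0), 50.0),
    ]
    for day, stats in daily_report(demo).items():
        print(day, stats)
```

The same grouping works for monthly reports by keying on `(ts.year, ts.month)` instead of `ts.date()`.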
Device retirement
Next, let's talk about device retirement. After a server or network device has been running for some time, its failure rate increases significantly, and we need to consider whether to retire it.
First, we need to define when a device reaches end of life and how to handle it after it is decommissioned. We need to consider under what circumstances to extend the warranty, work out the best point in time, and get as much value out of the device as possible.
One small detail: to protect user data, QingCloud buys a specific service for its hard disks under which failed disks are not returned. After a damaged disk is reported to the manufacturer and replaced, we keep the replaced disks and destroy them centrally.
Last
Before wrapping up, let's take a look at some new trends related to the data center.
Many people in the group will have heard of mobile data centers, modular data centers, micro-modular data centers, maritime data centers, and cave data centers. Their benefits are obvious. For example, a cave data center can withstand explosions and natural disasters, save cooling energy, and resist high-power microwave and electromagnetic pulse weapon attacks.
In networking, 100G Ethernet is expected to grow strongly in the data center field. Along the way, this will drive the development of 25G and 50G networks: 25 Gbit/s and 50 Gbit/s per-lane technologies will be the foundation of future 100 Gbit/s (4 x 25 Gbit/s) and 400 Gbit/s (8 x 50 Gbit/s) Ethernet. The industry therefore generally believes that 25G networks will soon replace the existing 10G networks.
Today's sharing is almost at an end. Let me give a brief summary.
Data center O & M is both macro and micro. At the large scale, it covers building design, construction, and site selection that avoids events such as the Tianjin explosions. At the small scale, it goes down to the placement and routing of cables inside a server, so that slight vibration does not loosen a cable and cause frequent kernel panics.