How to Operate and Maintain a Large Data Center

Source: Internet
Author: User

What is a data center? Wikipedia defines it as "a complex set of facilities. It includes not only computer systems and other ancillary devices (such as communications and storage systems), but also redundant data communication connections, environmental control equipment, monitoring equipment, and various safety devices." As data centers grow in scale and new technologies emerge, they become more and more complex. A large data center is often made up of many large-scale cluster systems, and operating it requires knowledge of hardware, networking, servers, storage, security, and the business itself. Operations staff have to work across every layer of the stack.


When a data center is very large, its challenges and problems become harder: many things that are not a problem in a small environment or a small system stand out at this scale. Learning the technology stack of an entire large data center therefore takes a relatively long time. Only with a thorough understanding of the data center as a whole can you design targeted operations and maintenance plans, or even do custom development of monitoring and operations software, so that the whole data center is managed and monitored effectively, its efficiency improves, failures become less frequent, and operations work reaches a new level.

A large data center usually contains many smaller systems, and operations work revolves around these specific application systems. It can be divided into six parts: basic operations management, daily business operations, network, server, storage, and security. This article describes the operational methods and capabilities a typical large data center should have.

First, basic operations and maintenance management. This mainly covers hardware configuration management, maintainability optimization, monitoring, alarm handling, automated operations, and disaster recovery for network outages, power outages, and machine-room failures. Hardware configuration management records the model and hardware configuration of each server in every rack, and makes clear which business systems use which servers. Even in a virtualized environment, you need to know which physical machines make up the resource pool that the virtual machines run on. The number of physical and virtual machines in a data center is huge, so automated operations are necessary. Automation improves operational efficiency and reduces human involvement, letting the data center largely manage itself and freeing up staff. The data center must also monitor for failures and handle alarms so that a problem is known the moment it occurs. A large failure often starts as a small fault that gradually spreads until the whole system collapses, so small anomalies must be eliminated promptly, and detecting them depends on a sound monitoring and alarm system.
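The hardware configuration management described above comes down to answering two questions quickly: which physical machines does a business system depend on, and which physical machine carries a given VM. The following is a minimal sketch in Python, assuming a hypothetical in-memory inventory. The host names, models, racks, and business-system names are invented for illustration; a real data center would keep this information in a configuration management database.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalServer:
    """One physical machine and what runs on it (all values are illustrative)."""
    hostname: str
    model: str
    rack: str
    business_systems: set = field(default_factory=set)  # systems running directly on the host
    vms: dict = field(default_factory=dict)              # VM name -> business system


def hosts_for_business(inventory, business):
    """Return the sorted hostnames a business system depends on,
    either directly or through a VM placed on that host."""
    hosts = []
    for server in inventory:
        direct = business in server.business_systems
        via_vm = business in server.vms.values()
        if direct or via_vm:
            hosts.append(server.hostname)
    return sorted(hosts)


def host_of_vm(inventory, vm_name):
    """Return the hostname of the physical server carrying a VM, or None."""
    for server in inventory:
        if vm_name in server.vms:
            return server.hostname
    return None


# Hypothetical inventory used only to exercise the lookups.
inventory = [
    PhysicalServer("node-01", "model-a", "rack-1", {"billing"}, {"vm-101": "search"}),
    PhysicalServer("node-02", "model-b", "rack-1", set(), {"vm-102": "billing"}),
    PhysicalServer("node-03", "model-a", "rack-2", {"search"}),
]

print(hosts_for_business(inventory, "billing"))
print(host_of_vm(inventory, "vm-102"))
```

With a lookup like this, an alarm on one physical host can be translated immediately into the list of business systems that may be affected.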

Next, daily business operations. These mainly include resource and machine allocation, resource usage, network throughput, failure recovery, backup, cluster construction, traffic, load, migration and expansion, upgrades, relationships between dependent business systems, resource utilization, exception handling, and contingency plans. This daily work consumes a great deal of manpower and time. It is the bulk of operations work and the most tedious part, yet also the part that shows the least visible results. A data center runs safely and stably over the long term because this daily work accumulates; only by paying attention to subtle changes day to day can you keep optimizing. Stress testing, software upgrades, business deployment, and exception handling have become almost daily routines. Only by doing them well can you avoid major failures, deploy new business quickly, and expand equipment in time as resource usage grows.
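Expanding equipment in time according to resource usage can be reduced to a simple rule over recent utilization readings. The sketch below is one possible rule, not a prescribed method: the 80% threshold, the three-sample window, and the pool names are all assumptions chosen for illustration.

```python
from statistics import mean


def needs_expansion(samples, threshold=0.8, min_samples=3):
    """Flag a resource pool for expansion when the average of its most
    recent utilization readings reaches the threshold.

    samples: utilization readings between 0.0 and 1.0, oldest first.
    """
    if len(samples) < min_samples:
        return False  # not enough data to decide
    recent = samples[-min_samples:]
    return mean(recent) >= threshold


# Hypothetical utilization history per resource pool.
pools = {
    "web-cluster": [0.55, 0.60, 0.58, 0.62],
    "db-cluster": [0.70, 0.82, 0.85, 0.91],
    "cache-cluster": [0.95],  # too few readings to judge
}

to_expand = sorted(name for name, s in pools.items() if needs_expansion(s))
print(to_expand)
```

Averaging over a window rather than reacting to a single reading avoids ordering hardware because of one short spike.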

For the data center network, the main topics are network hardware, ACLs, OSPF, LACP, VIPs, traffic, load balancing, Layer 2/3/4/7 behavior, network monitoring, 10-Gigabit line cards, and core switches. The network is an important part of the data center and the basic guarantee for everything else; without it the data center cannot operate, so keeping the network stable is the top priority of operations. The main concerns are network hardware problems, ACL deployment, and traffic monitoring. Networking is all-encompassing and involves a great many devices and protocols, so you need to keep learning and deepening your understanding of network technology in order to operate the network well.
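Before deploying an ACL it helps to check how it would treat known traffic. The sketch below models a simplified ACL in Python, assuming top-down evaluation where the first matching rule wins and unmatched traffic is denied. That behavior is common but not universal, so check your device's documentation. The rules and addresses are hypothetical.

```python
import ipaddress

# Each rule: (action, source network, destination port or None for any port).
# Hypothetical rules, evaluated top-down; the first match wins.
ACL = [
    ("deny", ipaddress.ip_network("10.0.5.0/24"), 22),
    ("permit", ipaddress.ip_network("10.0.0.0/16"), 22),
    ("permit", ipaddress.ip_network("0.0.0.0/0"), 443),
]


def evaluate(acl, src_ip, dst_port):
    """Return "permit" or "deny" for a packet from src_ip to dst_port."""
    src = ipaddress.ip_address(src_ip)
    for action, network, port in acl:
        if src in network and (port is None or port == dst_port):
            return action
    return "deny"  # assumed implicit deny when no rule matches


print(evaluate(ACL, "10.0.5.7", 22))      # matches the first rule
print(evaluate(ACL, "10.0.1.7", 22))      # matches the second rule
print(evaluate(ACL, "203.0.113.9", 22))   # matches no rule
```

Rule order matters here: swapping the first two rules would let hosts in 10.0.5.0/24 reach port 22, which is the kind of mistake a pre-deployment check is meant to catch.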

For data center servers, the main topics are file systems, kernel parameter tuning, the variety of hard disks and their drivers, kernel versions, and kernel panics. Linux is mainstream not only on servers but also in network operating systems, so mastering it helps with operating both servers and network equipment; Linux is a basic skill for operations work. Beyond being familiar with Linux, you also need to monitor and manage the running state of each server and its kernel to reduce server failures. Large data centers generally contain tens of thousands of servers, and server problems of some kind occur almost every day; only a deep understanding of servers lets you resolve them well. To keep a server failure from interrupting business, virtualization or clustering is generally deployed on servers, so that when a server's physical hardware fails, the business switches smoothly to other servers without being affected. These technologies make operations more difficult and require continuous, in-depth study of virtualization.
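The failover behavior described above, where workloads on a failed server move to other servers, can be sketched as a small placement function. This is a simplified illustration, not how any particular virtualization or cluster product schedules workloads. It assumes each host has a fixed number of workload slots, and the host and workload names are hypothetical.

```python
def fail_over(placement, capacity, failed_host):
    """Move every workload off failed_host onto the least-loaded healthy
    host that still has a free slot.

    placement: workload name -> host name
    capacity:  host name -> maximum number of workloads
    Returns a new placement dict; workloads that cannot be placed map to None.
    """
    new_placement = dict(placement)

    # Count current load on every healthy host.
    load = {host: 0 for host in capacity if host != failed_host}
    for host in placement.values():
        if host in load:
            load[host] += 1

    # Re-place the failed host's workloads in a stable (sorted) order.
    for workload, host in sorted(placement.items()):
        if host != failed_host:
            continue
        candidates = [h for h in load if load[h] < capacity[h]]
        if not candidates:
            new_placement[workload] = None  # no capacity left anywhere
            continue
        target = min(candidates, key=lambda h: (load[h], h))
        new_placement[workload] = target
        load[target] += 1
    return new_placement


placement = {"web-1": "host-a", "web-2": "host-a", "db-1": "host-b", "cache-1": "host-c"}
capacity = {"host-a": 2, "host-b": 2, "host-c": 2}
print(fail_over(placement, capacity, "host-a"))
```

The `None` result for unplaceable workloads makes a capacity shortfall visible, which is the case operations staff most need to know about before a real failure happens.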

For data center storage, architectures are more diverse and complex. As cloud computing, virtualization, big data, and related technologies enter the data center, storage has changed enormously. Block, file, and object storage support reading and writing many types of data. Centralized storage is no longer the mainstream architecture; storing and accessing massive data requires a highly scalable, distributed storage architecture. At large scale, distributed file systems, distributed object storage, and similar technologies give applications highly scalable, elastic capacity and strong data access performance. Because these distributed technologies run on standardized hardware, large-scale data center storage can be built and operated at low cost. Distributed storage is not meant to replace existing disk arrays; it is a new form of storage system that copes with rapidly growing data volume and bandwidth. In addition, software-defined storage represents a trend: separating software from hardware in the storage architecture, that is, separating the data layer from the control layer. Data center users can manage and schedule storage resources through software, virtualizing, abstracting, and automating them. This covers deployment, management, monitoring, and adjustment of the storage system, and makes it flexible and highly available.

Corporate and Internet data are growing at a rate of 50% per year. Only a limited share of the new data is structured; most is unstructured or semi-structured. The storage architecture must adapt elastically as the business grows: low cost, massive expansion, and high-concurrency performance are the basic technical attributes of a storage architecture for a large cloud data center. Storing large, messy data sets, processing them in depth, and quickly extracting valuable information to support business decisions will become the basis of survival for all kinds of enterprises, and it is the direction in which storage and storage architectures will continue to evolve.
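To see what 50% annual growth means for capacity planning, the compound-growth arithmetic can be worked through in a few lines. The starting size of 100 TB and installed capacity of 500 TB below are assumed figures for illustration only; only the 50% rate comes from the text.

```python
def projected_size(current_tb, annual_growth, years):
    """Data volume after a number of years of compound growth."""
    return current_tb * (1 + annual_growth) ** years


def years_until_full(current_tb, capacity_tb, annual_growth):
    """Smallest whole number of years after which the data volume
    exceeds the installed capacity."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    years = 0
    size = current_tb
    while size <= capacity_tb:
        size *= 1 + annual_growth
        years += 1
    return years


# Assumed figures: 100 TB today, 500 TB installed, 50% growth per year.
print(projected_size(100, 0.5, 2))     # 100 -> 150 -> 225
print(years_until_full(100, 500, 0.5))  # 150, 225, 337.5, 506.25
```

Under these assumptions, a system with five times today's data volume in capacity is outgrown in the fourth year, which is why the text stresses architectures that can expand massively at low cost.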


Finally, data center security. It covers more than ten items, including attack protection, upgrades and backups, catching and finding bugs, scripting tools, data security, and service inspections, and each one contains a great deal. Take attack protection: it means defending the data center against both malicious and unintentional attacks from abnormal outside intruders. In a malicious attack, someone deliberately uses various techniques to break into the data center and steal or destroy important data for ulterior purposes. Unintentional attacks also occur. Because the data center is connected to the outside world and its operation is dynamic and constantly changing, some abnormal traffic will inevitably hit it. Sometimes that traffic even originates inside the data center, for example from virus-infected servers, hardware failures, or accidentally created network loops. All of these affect operation, so attack protection is a major problem. It cannot be solved by deploying a few security devices; the whole data center needs comprehensive, unified planning with targeted security measures. As attack techniques improve, security measures must improve too. This is a continuous process of learning and improvement that never stops as long as the data center is running.

To make operations easier, it is also necessary to prepare executable scripts so that problems can be dealt with quickly in an emergency. For example, when a data center's business behaves abnormally and must be restored quickly, routes need to be adjusted to move all traffic to other data centers. That adjustment is made on the core routers, and a ready-made script that runs automatically makes the switch fast. The data center should also prepare scripts for many other tasks so that they can be used quickly in an emergency.
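One piece of such an emergency script is deciding where the traffic goes once a data center is taken out of service. The sketch below covers only that calculation, assuming traffic is split by per-site weights. The site names and weights are hypothetical, and pushing the result to the core routers is device-specific and deliberately left out.

```python
def drain_site(weights, failed_site):
    """Return new traffic weights with failed_site set to 0 and its share
    spread over the remaining sites in proportion to their current weights.

    weights: site name -> current traffic weight (any positive scale)
    """
    if failed_site not in weights:
        raise KeyError(failed_site)
    remaining = {s: w for s, w in weights.items() if s != failed_site and w > 0}
    total_remaining = sum(remaining.values())
    if total_remaining == 0:
        raise ValueError("no healthy site left to receive traffic")

    total = sum(weights.values())
    new_weights = {site: 0.0 for site in weights}
    for site, weight in remaining.items():
        new_weights[site] = total * weight / total_remaining
    return new_weights


# Hypothetical three-site deployment, weights in percent.
weights = {"dc-east": 50, "dc-west": 30, "dc-north": 20}
print(drain_site(weights, "dc-east"))
```

Keeping the calculation separate from the step that applies it means the plan can be reviewed or dry-run in advance, and the same function can be reused in drills.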


The analysis above may be surprising: data center operations include dozens of items large and small, none of them simple, and each involves a lot of technical knowledge. Operations is the key to running a data center stably and efficiently. Only when this work is planned and carried out well can the data center stay stable over the long term.

This article is from the "Hollows Jie Sun" blog. Please keep this source attribution: http://xjsunjie.blog.51cto.com/999372/1695653
