Large website O & M discussions and experiences

Source: Internet
Author: User
1. What is O & M for large websites?
First, throughout this article "O & M" means O & M for large websites, which differs considerably from other kinds of O & M. We therefore need to define what separates a large website from a small one. The definition is mainly a matter of O & M complexity: the scale of the site, its popularity, the number of servers, and the page view (PV) volume. Other factors matter little. As a working definition, a large website runs more than 1,000 servers and serves at least hundreds of millions of PVs per day (roughly the top 10 in China), such as Sina, Baidu, QQ, and 51.com. Smaller websites may not have true O & M engineers at all, because their scale is insufficient and cost is a factor. What they have is more of a "compound talent" who combines network, system, and development work. Some companies, for example, include procurement contracts or IDC network planning in the O & M role. It is therefore important to understand that O & M engineers must be familiar with the adjacent jobs: networks, systems, system development, storage, security, databases, and so on. The O & M engineer I discuss here is a full-time O & M engineer.
Let's look at how a typical product is "born":
1. First, the company's management team provides the guiding direction. The product manager (PM) identifies the market demand (or copies a mature application), then carries out research, analysis, and the final detailed design.
2. The architect completes the network planning and architecture design based on the product design requirements, such as the estimated PV volume, the server scale, and the application architecture. (The network usually changes little unless the project is large.)
3. The development engineers implement the design in code, and the test engineers test the application.
4. Now the O & M engineer comes on stage. On the surface the first three steps seem to have nothing to do with O & M work. On the contrary, they are closely related to it: O & M engineers need to take part in the early architecture design of the application, the evaluation, requisition, and procurement of software and hardware resources, the assessment of performance risks in the application design, IDC matters, service performance and security tuning, and system-level server optimization (which depends on the specific application). O & M engineers are responsible for preparing servers for racking before launch, installing the server operating systems, and setting up the network, IP addresses, and the general tool set. They are also responsible for the soundness, scalability, and security risks of the online application's architecture. Finally, they take the product (the program), the network, and the systems, splice and tune them together, and launch the finished product to users. The cycle then repeats: requirement -> development (upgrade) -> test -> launch (after which unexpected performance and security problems soon follow).
Note that the website development model is completely different from traditional software development. One to five upgrades are common, and the user experience is king. If an online problem took a year to fix, as with MS, the users would have left long ago.
Once the application is launched, the O & M work has only just begun. Specific tasks may include: launching upgraded versions, service monitoring, application status statistics, daily service inspections, emergency troubleshooting, routine service changes, cluster management, service performance evaluation and optimization, database management and optimization, scaling the application architecture as PV rises or falls, security, and O & M development work, such as:
A. Use tools wherever possible (for service monitoring, application status statistics, service launches, and so on) to take over routine, mechanical manual work and improve efficiency.
B. Solve real service problems, such as high reliability and scalability.
C. Develop large-scale cluster management tools. For example, how can 10,000 machines have their passwords changed, or run a specified task, within one minute? How can the operating system be installed quickly on 2,000 servers? How can PB-level data be stored, shared, and analyzed quickly across distributed IDCs and storage clusters? These and other challenges call for the efforts of O & M engineers.
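To make the "10,000 machines within one minute" question more concrete, here is a minimal Python sketch of the usual starting point: fanning a task out to many hosts in parallel and collecting a result per host. It is only an illustration under stated assumptions, not a description of any company's tooling. The names `run_on_hosts` and `fake_task` are hypothetical, and `fake_task` merely simulates a remote command; a real tool would replace it with an SSH call or an agent request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_on_hosts(hosts: Iterable[str],
                 task: Callable[[str], str],
                 max_workers: int = 200) -> Dict[str, str]:
    """Run task(host) on every host concurrently and collect per-host results.

    A failure on one host is recorded as "ERROR: ..." instead of aborting
    the whole run, so one dead machine does not block the others.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # keep going and report the failure
                results[host] = f"ERROR: {exc}"
    return results


def fake_task(host: str) -> str:
    """Hypothetical stand-in for a real remote command (for example over SSH)."""
    if host.endswith("13"):
        raise RuntimeError("connection refused")
    return "ok"


if __name__ == "__main__":
    hosts = [f"web{n:05d}" for n in range(10_000)]
    outcome = run_on_hosts(hosts, fake_task)
    failed = sorted(h for h, r in outcome.items() if r.startswith("ERROR"))
    print(f"{len(outcome) - len(failed)} ok, {len(failed)} failed")
```

The design point is that the run reports every host individually, so failed machines can be retried or handed to a person while the rest of the fleet is already done.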
Now for how the other roles cooperate. Across the whole project, front-end applications are a black box to the network and system engineers. Development engineers, for their part, are responsible only for developing the application's functions and for the performance and security of the application itself; they are neither responsible for nor concerned with the network or system architecture. Colleagues in software and hardware procurement do not care about these issues either. Everyone performs their own duties, but the core of the project is the O & M engineer, who is the bridge between all the other departments.
I have said a lot above, so you should now have some idea of what O & M is. As an analogy, suppose the website is a vehicle traveling at high speed on a highway, and the O & M engineer is both its driver and its maintenance engineer. Being the driver is not simple. Sometimes the tires must be changed while the car is moving at high speed, and gears shifted according to road conditions. As the car goes faster and faster and can no longer meet the demands of that speed, it needs performance tuning or parts upgrades, again while moving. The engineer must fix faults and performance problems during high-speed travel, always watch for safety problems ahead, and anticipate them in order to take preventive measures. This is O & M work!
Finally, the responsibility of O & M engineers: "ensuring online stability". It sounds simple, but it is not easy. O & M engineers must weigh many negative factors: the impact of new product models on existing architecture and technology, the risk of online bugs caused by frequent product upgrades, human error caused by low adoption of automated O & M management, processes that go unimplemented because the IT industry pursues speed, performance and architecture pressure from user growth, the loose technical management culture of the IT industry, the risks of innovation, and Internet security problems. All of these are enemies of website stability. O & M engineers are the last line of defense. They need a strong sense of responsibility, firm principles, and the ability to coordinate. An engineer who can strike the best balance among these factors is a good O & M engineer.
In addition, many people would like the engineers at Sina, QQ, Baidu, 51.com, and similar companies to share their O & M experience. In fact, this is a bit difficult for them:
A. A company's network architecture and scale are, to a greater or lesser degree, core company secrets and must be kept confidential. In addition, for the general-purpose software and architectures everyone knows (such as Apache, PHP, and MySQL), many companies carry out secondary development based on their actual business needs and on the performance, security, known bugs, and functionality of the original versions. The operating system kernel is also customized for different business types: some applications are compute-intensive, some are I/O-intensive, and some need large amounts of memory. Kernels are optimized and customized according to these characteristics. For example, Sina did secondary development on memcache and produced MemcacheDB. I won't discuss how it was done, but it is commendable that it was open-sourced, since Chinese companies have made almost no contribution to open source applications. Servers, too, are often not standard brand-name models; most are DELL, HP, or IBM machines customized to the company's business characteristics. For distributed storage, a company may use the existing open source Hadoop or another solution, or develop its own. However, 90% of them draw on the ideas of Google GFS: distributed storage, distributed computing, and large tables.
B. Each company's business direction is different, which leads to different O & M modes and methods. For example, the O & M modes of 51.com and Baidu must be very different, because the business model determines the architecture, the number of servers, the IDC distribution, the network structure, and the technology used. Sina, which focuses on news portals, differs greatly in O & M mode from 51.com, which focuses on SNS; even the responsibilities are not the same. The general technology and general architecture, however, are broadly similar, so there is no need to be overly precise here. Many companies are just playing with building blocks, with little technical depth.
C. As noted above, O & M for large websites is still at an early stage, and its concepts and experience are scattered. There is no mature body of knowledge. What is O & M? Everyone should think about that first, and many may never have. What actually gets discussed is only the tip of the iceberg, limited to specific technical details or to the architecture of some famous website. This may be why there is so little O & M material online. It is also why O & M engineers are hard to recruit in China: they are rare.
2. What skills and qualities do O & M engineers need?
First, the skills. As shown above, O & M combines many IT jobs and skills. You need a working knowledge of systems, networks, storage, protocols, requirements, development, testing, security, and more, and you must be familiar with or even proficient in some of them: systems (the basic operating systems, such as Linux/Unix and Windows), protocols, system development (the most important daily work is development related to automated O & M, plus the development and management of large-scale cluster tools), general applications (such as LVS, HA, web servers, DB, middleware, and storage), networks, and IDC topologies.
The skills can be summarized as follows:
1. Development capability. This is very important, because O & M tools have to be developed in-house. Languages: C/C++ (one of them is essential), one of Perl, Python, or PHP, and shell (awk, sed, expect, ...). You must have real development experience; otherwise the work will be very painful.
2. General applications. You need to know the operating system (mainly Linux and FreeBSD in China), web servers and related software (Nginx, Apache, PHP, Lighttpd, Java, ...), databases (Oracle, MySQL), and other miscellaneous things. System optimization, high reliability, and the like are bonus points rather than requirements; you can learn them gradually on the job, and they are not hard to learn. Of course, within an O & M team different people take on different areas.
3. Systems, networks, security, storage, CDN, and DB: you should understand these well and know the principles behind them.
Personal qualities:
1. Communication and teamwork. O & M involves a great deal of work across departments and job types, so it requires good communication skills and strong teamwork. This is a basic requirement in modern companies, so I won't say more.
2. Be bold but careful. Be bold in order to innovate and leave the ordinary path; a new type of work such as O & M especially needs innovation to develop. Be careful because O & M engineers are the website's administrators, the people with the highest permissions online. A careless mistake can bring lifelong regret, or a trip to the eighteenth level of hell.
3. Initiative, execution, energy, and resistance to stress. The IT industry changes fast, and in O & M the way plans fail to keep up with change is especially obvious. For example, the servers of major Chinese companies are often spread across the country. When a cheaper, more cost-effective location becomes available, services must be migrated on a large scale (involving hundreds or even thousands of servers). This is a headache, and the deadline is usually very short, such as one week. This places high demands on the O & M engineer's initiative and execution: planning, the migration solution, seamless service migration, machine relocation and installation, environment preparation, security assessment, performance evaluation, infrastructure construction, coordination with all relevant departments, and emergency response.
4. Other basic qualities: a clear mind, strong logical thinking, modesty and steadiness, approachability and helpfulness, and a sense of the big picture.
5. Finally, website O & M requires a spirit of exploration and innovation, solving practical problems through innovative thinking. This is a young profession (abroad as well, though it started earlier there than in China), and there is no mature system or methodology to draw on. You can only rely on your own efforts.
3. How to be a qualified O & M Engineer
1. Ensure that the service meets the required online standard, such as 99.9% availability. Ensuring online stability is the basic responsibility of O & M engineers.
2. Continuously improve application reliability, robustness, performance, and security. This tests initiative and innovative thinking.
3. Monitoring and statistics coverage at every level of the website. Software, hardware, and running status must all be monitored, leaving no blind spots, so that application operation is understood in real time.
4. Solve O & M efficiency problems through innovative thinking. At present, most companies' O & M work still depends on manual operation; hands should be freed as much as possible.
5. Accumulate O & M knowledge and keep documentation complete. O & M is an experience-driven job. Good practices and pitfalls need to be recorded to avoid repeating mistakes.
6. Planning and execution. Plan the work, then do everything possible to reach the goal once the plan is set. Do not make excuses.
7. Automated O & M. Refine, design, and develop daily mechanical work into tools and systems so that it is completed automatically as far as possible, leaving everyone more time for thinking, innovating, and doing what they like.
The above covers only the technical side. Of course, personal awareness is also very important.
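The 99.9% standard in the first point is easier to reason about as a downtime budget. The Python sketch below does the arithmetic. It assumes the figure refers to availability over a chosen period, and the function name `downtime_budget_minutes` is only illustrative.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime allowed in a period for an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

At 99.9% the budget is roughly 43 minutes per 30 days, so a single slow emergency response can use up the whole month's allowance.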
IV. Confusion, current situation, and development prospects of the O & M profession
Other positions, such as R & D engineers and test engineers, have clearly defined responsibilities and career plans, and a sense of professional identity and accomplishment. O & M is different. It can leave people feeling that they know a bit of everything but are less proficient than the specialists in each area, and that nobody pays attention to them (unless something fails online). Gradually people become confused about their career development. Why? Apart from the characteristics of the profession, the main reasons are a lack of deep understanding of O & M and a lack of depth in the work itself. This problem occurs in other positions too, but I have found it more typical, and more likely, in O & M.
To address this, I will talk about the current situation and development prospects of website O & M. (I am still thinking this through, so it may not be thorough or complete. Corrections and additions are welcome.)
O & M status:
1. The profession is in its initial stage. Major companies have full-time O & M positions, but the importance attached to them is low and the people in them are seen as easy to replace. In small companies the work is mostly done on the side by people in other positions; there is no full-time role, and the work cannot go deep.
2. The technical level is relatively low. The field is mainly in a stage of technical exploration and accumulation, with no systematic concepts or technology.
3. Too much manual labor. This is related to the second point: many things are still done by hand, and there is no mature automated management method for large-scale clusters. Large-scale clusters and O & M are closely linked; with only a hundred or so machines, there is little room for O & M to develop.
4. Excellent O & M talent is extremely scarce. At present all the major companies train their own, which leads to extremely low mobility of O & M talent across the industry. A lot of good technology stays locked inside the major companies, such as Google's methodical management of 500,000 machines, or the O & M experience of the top 10 domestic Internet companies. This experience is valuable and determines a company's core competitiveness. As a result, advanced O & M technology cannot circulate, be integrated, or gain recognition across the industry, which will ultimately limit the development of O & M.
5. Much of the best O & M experience is in the hands of large companies. This is not because of their technical strength as such, but because of their scale: massive PV and very large hardware fleets, such as Baidu's enormous traffic or 51.com's massive data volume. These factors mean that large companies run into problems that small and medium-sized companies have not met, or are only about to meet, and the large companies may already have good solutions or systems for them.
Development Prospects:
1. From an industry perspective: with the rapid development of China's Internet (China now has the most Internet users in the world), websites are growing in scale and their architectures are becoming more complex. The need for full-time website O & M engineers and website architects will become increasingly urgent. In particular, demand for experienced, excellent O & M talent will rise, and veterans will become more valuable. At present, Chinese companies (the large ones, at least) mostly hire new graduates. Training costs are high, and the absence of experienced people slows technical renewal and holds back the company's technical development. Of course, graduates also have advantages: they are a blank sheet of paper, highly adaptable, and more likely to identify with and integrate into the corporate culture.
2. From a personal perspective: the technical content of, and requirements on, O & M engineers keep rising. They are also the people most familiar with the company's applications and architecture, and they are getting more and more attention.
3. Website O & M will become a comprehensive technical position that integrates multiple disciplines (networks, systems, development, security, application architecture, storage, and so on), giving individuals plenty of room to develop their abilities and technical breadth.
4. O & M experience will become very important and will be an individual's core competitiveness: strong problem-solving ability, the ability to propose solutions, and the ability to think about the whole system at every level.
5. Developing strengths and interests. O & M covers a wide range of knowledge, which makes it easier to cultivate or make full use of personal strengths and interests. In areas such as the kernel, networking, development, or databases, you can become highly proficient and an expert in the field.
6. If you later decide not to stay in O & M, it is relatively easy to move to other positions without many restrictions. Of course, you have to put your heart into the work first.
7. Technical career direction: website or system architect.
V. Anatomy of key O & M technical points
1. Large-scale cluster management
First, we need to be clear about what a cluster is. A cluster is not just any collection of servers with some function. It is the integration of server and disk resources for a particular purpose or function (with more than two machines), and to the application it appears as a single whole. Conventional clusters currently fall into four types: high availability clusters (HA); load balancing clusters (such as LVS); distributed storage and computing clusters (DFS, such as Google GFS and Yahoo Hadoop); and application-specific clusters (combinations of servers with specific functions, such as the DB and cache layers). The Internet industry relies mainly on these four.
If the business is simple and the application has few POST operations, a layer-4 switch (such as F5) is enough to provide high availability and load balancing. For companies with tight budgets there are open source solutions such as LVS + HA, which are very flexible. The latter two types test a company's technical strength and depend on the characteristics of the application. The third type, DFS, is used mainly in massive-data applications such as email and search. Beyond plain mass storage, it also supports data mining and user behavior analysis. Google and Yahoo, for example, can store and analyze nearly a year of user records, while Baidu probably keeps fewer than 30 days and Sogou even fewer. This data is essential to search and to user experience.
Next, how to manage clusters scientifically. The key points are as follows:
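As a rough illustration of what a load balancing cluster does for the application, the Python sketch below hands out backends in round-robin order and skips any backend marked unhealthy. It is a teaching sketch under stated assumptions, not how LVS or an F5 device is implemented. The class `RoundRobinBalancer` and its methods are hypothetical names.

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Pick backends in rotation, skipping those marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy: Dict[str, bool] = {b: True for b in self._backends}
        self._next = 0  # index of the backend to try next

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a health check for one backend."""
        self._healthy[backend] = healthy

    def pick(self) -> str:
        """Return the next healthy backend, or fail if none is left."""
        count = len(self._backends)
        for _ in range(count):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % count
            if self._healthy[backend]:
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    lb.mark("web2", False)  # pretend a health check failed
    print([lb.pick() for _ in range(4)])
```

To the caller the group of servers behaves as one service: a failed machine simply stops receiving requests, which is the "single whole" property described above.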
I. Monitoring
Monitoring mainly covers fault monitoring plus monitoring of performance, traffic, load, and other status. It underpins the healthy operation of the cluster and the timely detection of, and intervention in, potential problems.
A. Service fault and status monitoring. This mainly watches the server itself, the upper-layer applications, and the data exchanged with associated services. For a front-end web server, for example, there are many kinds of monitoring: application port monitoring, which promptly detects a crash of the server or of the application itself, and ICMP probes, which check the health of the server. Higher up, the services of each channel may also be monitored. A common method is to check for a characteristic code in the response, or to sign key pages to detect website tampering (raising an alarm and automatically restoring the tampered data). There are many more monitoring methods, depending on the characteristics of the application. Some problems remain open: when the cluster is very large, how to monitor it with high performance is itself a real problem.
B. Monitoring or statistics on cluster status. These provide reference data for managing and optimizing the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, and attacks.
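Two of the checks mentioned above, port status monitoring and page signing, can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production monitor. The function names are hypothetical, and SHA-256 is an assumed choice of signature, since the text does not say which algorithm is used.

```python
import hashlib
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def page_signature(content: bytes) -> str:
    """Compute a signature (here a SHA-256 hex digest) for a page body."""
    return hashlib.sha256(content).hexdigest()


def page_is_tampered(content: bytes, expected_signature: str) -> bool:
    """Compare the current page body against the signature recorded at release."""
    return page_signature(content) != expected_signature


if __name__ == "__main__":
    released = b"<html><body>home page</body></html>"
    known_good = page_signature(released)  # stored when the page is published
    current = b"<html><body>defaced</body></html>"
    if page_is_tampered(current, known_good):
        print("ALERT: page content changed, restore from the released copy")
```

A real monitor would run such checks on a schedule against every server and feed the results into alarms. Signing only works for pages whose content is meant to stay fixed between releases.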
II. Fault Management
A. Hardware faults. In clusters of hundreds, thousands, or tens of thousands of machines, the probability of server crashes and hardware faults is very high. Some hardware problem occurs at almost every moment: crashes, or failed hard disks, power supplies, memory, or switches. We therefore need to take these problems fully into account when designing the website architecture and treat them as the norm. Mostly we avoid the risk by relying on redundancy in the application, which also gives system engineers enough time to deal with the fault. (Doesn't Google claim that its service is unaffected even if 800 machines die at the same time?) This is where the skill of O & M engineers and website architects is tested. A good design achieves the self-recovery that Google describes for systems such as GFS. A bad design means that one server crash can set off a chain reaction of failures across many services, to the point of refusing user requests outright.
B. Application faults. A bug may be triggered, a performance threshold exceeded, or the application attacked. The important point is to have preventive measures in place for these problems, and never to assume that nothing will go wrong. And when something does go wrong, how do we handle it? This requires O & M engineers to do their job well: fast emergency response, methodical troubleshooting, and effective fallback plans.
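The claim that hardware faults are "the norm" at this scale, and that redundancy is the answer, can be checked with simple arithmetic. The Python sketch below assumes machines fail independently of one another. The 5% annual failure rate and the 1% per-replica unavailability are made-up illustrative inputs, not measured figures.

```python
def expected_daily_failures(machines: int, annual_failure_rate: float) -> float:
    """Expected number of machine failures per day, assuming independent failures."""
    return machines * annual_failure_rate / 365.0


def prob_all_replicas_down(p_down: float, replicas: int) -> float:
    """Probability that every replica is down at once, assuming independence."""
    return p_down ** replicas


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 machines, 5% of them failing per year.
    per_day = expected_daily_failures(10_000, 0.05)
    print(f"expected failures per day: {per_day:.2f}")

    # Hypothetical inputs: each replica unavailable 1% of the time.
    for replicas in (1, 2, 3):
        risk = prob_all_replicas_down(0.01, replicas)
        print(f"{replicas} replica(s): chance all are down = {risk:.6%}")
```

Under these assumptions more than one machine fails every day, yet three independent replicas are all down only about one time in a million. Correlated failures (a shared switch, rack, or power feed) break the independence assumption, which is why the chain-failure case above is so dangerous.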
III. Automation
In short, automation means using tools and systems to do our daily manual work, freeing our hands from boring, repetitive tasks. For example, before tools existed, installing the operating system on bare-metal machines, say 2,000 of them, might take 10 people 10 days and a stack of burned CDs, at a high labor cost. Now, with automated tools, a few simple commands are enough. Programs act like robots, completing work that used to need daily manual intervention and reporting the results. They can even have some expert system capability, making simple choices such as yes/no decisions and optimizations. The benefits are obvious and need no elaboration. Automated O & M is a goal every professional O & M engineer pursues, and it benefits everyone. It is, however, an extremely arduous task: constantly changing businesses, nonstandard application design, and changes in development models, network architecture, IDCs, and specifications can all affect an existing automation system. Automation must therefore be modular, interface-based, and parameterized. Automation is one of the key tasks of O & M engineers and a demonstration of their value.
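One way to read "modular, interface-based, and parameterized" is sketched below in Python: each job is a small module registered under a name, all jobs share one calling interface, and everything that varies is passed in as a parameter. The task names, the `task` decorator, and the `run` function are all hypothetical, and the task bodies only return a description of what a real module would do.

```python
from typing import Callable, Dict, Iterable

# Registry of task modules, keyed by name (the shared "interface").
TASKS: Dict[str, Callable[..., str]] = {}


def task(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Decorator that registers a function as a named task module."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        TASKS[name] = func
        return func
    return register


@task("install_os")
def install_os(host: str, image: str = "base-linux") -> str:
    # Hypothetical: a real module would trigger a network install here.
    return f"{host}: would install image {image}"


@task("rotate_password")
def rotate_password(host: str, user: str = "root") -> str:
    # Hypothetical: a real module would change the password remotely.
    return f"{host}: would rotate password for {user}"


def run(name: str, hosts: Iterable[str], **params: str) -> Dict[str, str]:
    """Run one registered task on each host and return a per-host report."""
    if name not in TASKS:
        raise KeyError(f"unknown task: {name}")
    func = TASKS[name]
    return {host: func(host, **params) for host in hosts}


if __name__ == "__main__":
    report = run("install_os", ["node1", "node2"], image="db-node")
    for host, line in report.items():
        print(line)
```

When the business, the IDC, or a specification changes, only a parameter or a single module has to change, and the runner and the other modules stay as they are.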
5. Anatomy of key technical points in O & M (more hands-on practice and cases)
1. Design of high-traffic, high-concurrency websites
2. Highly reliable and highly scalable network architecture design
3. Website security: how to avoid being hacked
4. North-south interconnection issues and dynamic CDN solutions
5. Massive data storage architecture
If you have better suggestions, you are welcome to discuss them.
"O & M" simply means operation and maintenance. The O & M department is responsible for keeping the infrastructure secure, stable, and efficient, and for supporting the enterprise's upper-layer IT architecture.
In terms of environment, O & M covers many kinds of IT management objects: hosts, networks, databases, middleware, storage, and applications.
From the perspective of people, O & M is in a difficult position. We answer to leadership, where every question demands an answer; we face complaints and misunderstandings from technical support and customer service staff; and we also have to deal with colleagues, developers, and vendors.
I have worked with some large environments, such as banks, telecommunications, and funds and brokerages with high real-time and reliability requirements. To be honest, O & M work there is no easier; it is hard.