Responsibilities and prospects of O & M engineers

Source: Internet
Author: User

Key technical points in O & M: 1. design solutions for massive, high-concurrency websites; 2. design of highly reliable, highly scalable network architectures; 3. website security: how to avoid being hacked; 4. north-south interconnection issues and dynamic CDN solutions; 5. massive data storage architecture.

1. What is O & M for large websites?

First of all, "O & M" throughout this article means O & M for large websites, which is quite different from other kinds of operations work. So we need to define what separates large websites from small ones. The definition is made mainly from the perspective of O & M complexity: website scale, popularity, number of servers, and PV volume. Other factors matter little. Here a large website is one with more than 1,000 servers and at least hundreds of millions of PVs per day (at least the top 10 in China), such as sina, baidu, QQ, and 51.com. Smaller websites may not have true O & M engineers at all, because of their limited scale and cost constraints. There the role is more of a "generalist" who combines network, system, and development work; some companies even put procurement contracts or IDC network planning under O & M. One point is therefore very important: O & M engineers must be familiar with the associated jobs, including networks, systems, system development, storage, security, and DB. The O & M engineer I am talking about here is a full-time O & M engineer.

Let's look at how a typical product is "born":

1. First, the company's management team sets the direction. The PM researches and analyzes market demand (or copies a mature application) and produces the final detailed design.

2. The architect completes network planning and architecture design based on the product design requirements, such as PV estimates, server scale, and application architecture. (The network usually does not change much unless it is a large project.)

3. The development engineers implement the design in code, and the test engineers test the application.

4. Now the O & M engineers come on stage. It may look as if the first three steps have nothing to do with O & M. On the contrary, they are closely related to it: up-front application architecture design, evaluation and procurement requests for software and hardware resources, assessment of performance risks in the application design, IDC selection, service performance and security tuning, and system-level server optimization (tied to the specific application). O & M engineers handle server preparation for the product: server racking, operating system installation, network and IP setup, and installation of the common tool set. They are also responsible for the soundness of the online application architecture, its scalability, and its security risks. Finally, they splice the product (program), network, and system together, optimize them, and deliver the launched product to users.

The cycle then repeats: requirements -> development (upgrade) -> test -> launch (followed soon by unexpected problems such as performance and security issues). Note that website development is completely different from traditional software development: one to five upgrades are routine, and user experience is king. If an online problem took a year to fix, as with M$, the users would be long gone.

Once the application is launched, the O & M work has only just begun. Specific work may include: version upgrades, service monitoring, application status statistics, daily service status inspection, handling of sudden faults, routine service changes, cluster management, service performance evaluation and optimization, database management and optimization, scaling as the application's PV rises or falls, security, and O & M of the application architecture:

A. Use tools wherever possible (for service monitoring, application status statistics, service launches, and so on) to take over routine, mechanical manual work and improve efficiency.

B. Solve real-world service problems, such as high reliability and scalability.

C. Develop large-scale cluster management tools. For example, how can 10,000 machines have their passwords changed or run a specified task within one minute? How can the operating system be installed quickly on 2,000 servers? How can PB-level data be stored, shared, and analyzed quickly across distributed IDCs and storage clusters? A whole series of such challenges calls for the efforts of O & M engineers.
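The question in point C about running one task on 10,000 machines within a minute usually comes down to doing the work concurrently rather than host by host. The following is a minimal sketch of that fan-out pattern, not a description of any particular company's tool. The function run_on_host is a hypothetical stand-in for a real remote call (for example over SSH), and the host names and the task name are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_host(host, task):
    """Hypothetical stand-in for a real remote call (for example over SSH).

    It only returns a string so that the sketch stays runnable offline.
    """
    return f"{host}: {task} ok"


def run_everywhere(hosts, task, max_workers=64):
    """Run `task` on every host concurrently and collect per-host results."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_on_host, host, task): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not stop the rest
                failures[host] = exc
    return results, failures


# Invented host names; a real tool would read them from an inventory.
hosts = [f"web{n:05d}" for n in range(10000)]
results, failures = run_everywhere(hosts, "rotate-password")
print(len(results), "succeeded,", len(failures), "failed")
```

Failures are collected per host instead of aborting the run, because at this scale some machines are always unreachable and the operator needs a list to retry.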

Now a word on how the other roles cooperate. Across the whole project, the front-end application is a black box to network and system engineers. Development engineers, meanwhile, are responsible only for the application's functional development and for the application's own performance and security; they are not responsible for, and do not concern themselves with, network or system architecture. Colleagues in software and hardware procurement naturally do not care about these issues either. Everyone performs their own duties, and the O & M engineer is the core of the project: the bridge between all the other departments.

I have said a lot above, and you should now have some idea of what O & M is. As an analogy, if the website is a car running at high speed on a highway, the O & M engineer is both its driver and its maintenance engineer. And driving is not simple: sometimes the tires must be changed while the car is moving at speed, and gears shifted according to road conditions. As the car goes faster and faster and the vehicle itself can no longer keep up, it needs performance tuning or part upgrades, again at high speed. The driver has to fix faults and performance problems while under way, keep watching for safety hazards ahead, and anticipate them so as to take preventive measures. That is O & M work!

Finally, let's talk about the responsibility of O & M engineers: "ensure online stability". It sounds simple, but it is not easy. O & M engineers must weigh the following adverse factors: the impact of new product models on existing architecture and technology; the risk of online bugs caused by frequent product upgrades; human error caused by the low adoption of automated O & M management; processes that go unenforced because the IT industry pursues efficiency; the performance and architecture pressure caused by user growth; the loose technology management culture of the IT industry; the risks of innovation; and Internet security issues. All of these are enemies of website stability. O & M engineers are the last line of defense. They need a strong sense of responsibility, firm principles, and coordination skills. An engineer who can strike the best balance among all these factors is a good O & M engineer.

In addition, I want to add a few remarks here. I have seen many people discuss the O & M experience of sites such as sina, QQ, baidu, and 51.com. In fact, sharing it is somewhat difficult for the people involved:

A. A company's network architecture and scale are, to a greater or lesser degree, core company secrets and must be kept confidential. In addition, even for the general-purpose software and architecture that everyone knows, many companies do secondary development (on apache, php, mysql, and so on) to fit their actual business needs, because of the performance, security, known bugs, or feature set of the stock version. The operating system kernel is also customized according to business type: some applications are compute-bound, some are I/O-heavy, and some need large amounts of memory, and the kernel is optimized and customized for these characteristics. For example, sina did secondary development on memcache and produced MemcacheDB. How well it was done aside, it is commendable that it was open-sourced; domestic companies mostly take from open source and contribute nothing back. Servers, too, are not the well-known stock models; most are custom-built by DELL, HP, and IBM to match the business. For distributed storage, companies either have their own solutions or use existing open-source ones such as hadoop. However, 90% of them draw on the ideas of google GFS: distributed storage, distributed computing, and big tables.

B. Each company's business direction is different, which can lead to different O & M modes and methods. For example, the O & M modes of 51.com and baidu must be very different, because the business model determines the architecture, the number of servers, IDC distribution, network structure, and the general technologies used. Sina, which focuses on news portals, differs greatly in O & M mode from 51.com, which focuses on SNS; even the roles are not the same. But one thing holds: the technologies used and the general architecture are much alike, so we should not mythologize them. Most companies are just stacking building blocks, with little deep technology involved.

C. As mentioned above, large-website O & M is still in its early years, and its concepts and experience are scattered. There is no mature body of knowledge. What is O & M? Most people have not thought about it carefully, or have never thought about it at all. What gets discussed is only the tip of the iceberg of O & M work, limited to technical details or to the architecture of some famous website. This may be why there is so little O & M material online. It is also why O & M staff are hard to recruit in China: O & M engineers are rare.

2. What skills and qualities do O & M engineers need?

What skills and qualities do O & M engineers need? First, the skills. As shown above, O & M combines multiple IT jobs and skills: you need to know something about systems, networks, storage, protocols, requirements, development, testing, and security. In some areas, however, you must be familiar or even proficient: systems (the basic operating systems, *nix, windows, ...), protocols, system development (the most important daily work is development for automated O & M and the development and management of large-scale cluster tools), general applications (such as lvs, ha, web server, db, middleware, and storage), networks, and IDC topology.

The skills are summarized as follows:

1. Development capability. This is very important, because O & M tools have to be developed in-house. Development languages: c/c++ (one of the two is essential), perl, python, php (one of them), and shell (awk, sed, expect, ...). You need real development experience; otherwise the work will be very painful.

2. General applications: operating systems (mainly linux and bsd in China), web servers (nginx, apache, php, lighttpd, java, ...), databases (mysql, oracle), and other odds and ends, plus system optimization, high availability, and so on. These are only a plus, not a requirement; you can pick them up gradually on the job, and they are not hard to learn. Of course, within O & M there is also a division of labor across these areas.

3. You should understand systems, networks, security, storage, CDN, and DB well and know the relevant principles.

Personal qualities:

1. Communication and teamwork: O & M work cuts across many departments and job types, so it requires good communication skills and strong teamwork. This is a basic requirement of any modern enterprise, so I will not say more.

2. Be bold yet careful: be bold in order to innovate and take unconventional paths. For a new kind of job such as O & M in particular, more innovation is needed to move the field forward. Be careful because O & M engineers are the website's administrators and hold the highest online permissions; one careless slip can bring lifelong regret or land you in the eighteenth level of hell.

3. Initiative, execution, energy, and resistance to stress: the IT industry changes fast, and in O & M it is especially true that plans cannot keep up with changes. For example, the servers of large companies in China are often spread all over the country. When a cheaper, more cost-effective location turns up, the servers are moved, which means a large-scale service migration (involving hundreds or even thousands of servers). This is a headache, and the deadline is often very tight, such as within one week. In this situation a great deal is demanded of the O & M engineers' initiative and execution: planning, solution design, seamless service migration, machine relocation and racking, environment preparation, security assessment, performance assessment, infrastructure construction, coordination with the departments involved, and handling of emergencies along the way.

4. Other basic qualities: a clear mind, strong logical thinking, a modest and steady manner, approachability and helpfulness, and a sense of the big picture.

5. Finally, website O & M requires a spirit of exploration and innovation, solving practical problems through innovative thinking, because this is a young profession (the same is true abroad, though it started earlier there than in China). There is no mature system or methodology to draw on; you can only rely on your own efforts.

3. How to be a qualified O & M Engineer

1. Ensure that services meet the required online availability standard, such as 99.9%. This is the basic responsibility of O & M engineers.

2. Continuously improve application reliability and robustness, optimize performance, and strengthen security. This tests initiative and innovative thinking.

3. Monitoring and statistics coverage at every level of the website: software, hardware, and running status must all be monitored, leaving no blind spots, so that application operation is understood in real time.

4. Solve O & M efficiency problems through innovative thinking. At present most O & M work at most companies still relies on manual intervention; try to free your hands.

5. Accumulate O & M knowledge and keep documentation complete. O & M is an experience-driven position: good practices and pitfalls must be recorded to avoid repeating mistakes.

6. Planning and execution: plan the work, then do everything possible to achieve the plan without making excuses.

7. Automated O & M: distill, design, and develop daily mechanical work into tools and systems, so that the system completes as much as possible automatically and everyone has more time to think and to do what they like.

The above covers only the technical aspects. Of course, personal awareness is also very important.
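To make the 99.9% figure in item 1 concrete, it helps to convert an availability target into the downtime it actually permits. The sketch below does only that arithmetic; the function name and the choice of 30-day and 365-day periods are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime permitted in `period_days` at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    per_month = allowed_downtime_minutes(target, 30)
    per_year = allowed_downtime_minutes(target, 365)
    print(f"{target}%: {per_month:.1f} min per 30 days, {per_year / 60:.2f} h per year")
```

At 99.9% the budget is roughly 43 minutes in a 30-day month, which is why a single slow manual recovery can use up the whole allowance.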

4. Confusion, Current Situation and Development Prospects of the O & M profession

Other positions, such as R & D engineers and test engineers, have clearly defined responsibilities and career plans and bring a sense of professional identity and accomplishment. O & M is different: the work can leave people feeling that they know a bit of everything but are not as proficient as the full-time engineers in any one area, and that they get little attention (unless there is an online fault). Gradually they become confused about their career development. Why? Apart from the nature of the profession, the main reasons are a lack of deep understanding of O & M and a lack of depth in the work. This problem occurs in other positions too, but I find it more typical and more likely to appear in O & M.

To address this issue, I will talk about the current situation and development prospects of website O & M. (I am still thinking this through, so it may not be thorough or complete; corrections and additions are welcome.)

O & M status:


1. The field is at an early stage. Large companies have full-time O & M staff, but the role receives little attention and is seen as easily replaceable. In small companies the work is mostly done on the side by people in other positions; there is no full-time role, and the work cannot go deep.

2. The technical level is relatively low. The field is still in a stage of technological exploration and accumulation, with no systematic concepts or technology.

3. The share of manual labor is too high. This is mainly related to the second point: much is still done by hand, and there are no mature automated management methods for large-scale clusters. Large-scale clusters and O & M are closely linked; with only a hundred or so machines, there is little room for O & M to develop.

4. Excellent O & M talent is extremely scarce. At present the large companies all train their own, which leads to very low mobility of O & M talent within the industry. Many good technologies stay confined within the large companies, such as google's scientific management of 500,000 machines or the O & M experience of the top 10 domestic Internet companies. This experience is valuable and determines a company's core competitiveness. As a result, advanced O & M technology is not circulated, integrated, or validated across the industry, which ultimately limits the development of O & M.

5. Much excellent O & M experience is in the hands of large companies. This is not because of those companies' technical strength, but because of their scale: massive PV and a sufficient hardware base, such as the enormous traffic of baidu and the massive data volume of 51.com. These factors mean the problems they encounter are ones that small and medium-sized companies have not met, or are only about to meet, whereas the large companies may already have good solutions or systems.


Development Prospects:

1. From an industry perspective, with the rapid development of China's Internet (China's Internet user count has risen to No. 1 in the world), websites keep growing in scale and their architectures keep getting more complex. The demand for specialized website O & M engineers and website architects will become increasingly urgent, especially for experienced, excellent O & M talent. The more senior they are, the more valuable they become. At present, companies in China (large ones only) mostly select graduates for training, and the training cost is high. In addition, the absence of experienced talent slows a company's technical updates and affects its technological development. Of course, graduates have advantages too: they are a blank sheet of paper, highly malleable, and relatively quick to accept and fit into the corporate culture.

2. From a personal perspective, the technical content of and requirements for O & M work are getting higher and higher. As the people most familiar with the company's applications and architecture, O & M engineers are getting more and more attention.

3. Website O & M will become an integrated, multi-disciplinary field (network, system, development, security, application architecture, storage, and so on), offering good room for personal ability and technical development.

4. Experience in O & M work will become very important and will be an individual's core competitiveness: good problem-solving ability, the ability to propose solutions, and a global view across all levels.

5. Developing expertise and interests: because O & M covers a wide range of knowledge, it is easier to develop or make full use of a personal specialty or interest, such as the kernel, networking, development, or databases, go deep in it, and become an expert in that field.

6. If you do not want to stay in O & M later, it is relatively easy to move to other positions without too many limitations. Of course, you have to have put your heart into the work.

7. Technology Development Direction: website/System Architect.

5. Anatomy of key O & M technical points

1. Large-scale cluster management

First, we need to clarify the concept of a cluster. A cluster is not just any group of functional servers; it is the pooling of server and disk resources (two or more machines) for a particular purpose or function, so that to the application it appears as a single whole. Conventional clusters currently fall into four types: high availability clusters (HA), load balancing clusters (such as lvs), distributed storage and computing clusters (DFS, such as google gfs and yahoo hadoop), and application-specific clusters (groups of servers with a specific function, such as the db and cache layers). These four types dominate in the Internet industry. If the business is simple and the application needs little back-end handling, layer-4 switches (such as f5) are enough to provide high availability and load balancing; for companies with tight resources there are open-source solutions such as lvs + ha, which are very flexible. The latter two types test a company's technical strength and depend on the characteristics of its applications. The third type, DFS, is mainly used in massive-data applications such as email and search. Search is especially demanding: beyond plain mass storage it also involves data mining and user behavior analysis. For example, google and yahoo can store and analyze nearly one year of user log data, baidu probably less than 30 days, and sogou even less... These matter for search quality and user experience.

Next, let's talk about how to manage clusters scientifically. The key points are as follows:

I. Monitoring

It mainly includes fault monitoring and the monitoring of performance, traffic, load, and other status indicators. These bear on the healthy operation of the cluster and on detecting and intervening in potential problems in time.

A. Service fault and status monitoring: this mainly monitors the server itself, the upper-layer applications, and the data exchanged with associated services. For front-end web servers, for example, many kinds of monitoring are possible. Application port monitoring detects a crash of the server or the application promptly, and icmp packets check whether the server is alive. At the upper layer there is also monitoring of the individual channels' services: a common method is to judge by a feature code (a characteristic string in the response), or to sign key pages so that hacking and tampering are detected (raising an alarm and automatically restoring the tampered data). This is only a part; there are many other monitoring methods, depending on the characteristics of the application. Some problems remain to be solved as well; for example, when the cluster is very large, how to monitor it with high performance is a real problem.

B. The rest is monitoring of or statistics on cluster status, which provide reference data for managing and optimizing the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, and attacks.
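Two of the checks described in point A, application port monitoring and signing key pages to detect tampering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full monitoring system: the function names, the page paths, and the page contents are all invented, and SHA-256 is simply one reasonable choice of signature.

```python
import hashlib
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_signature(content):
    """SHA-256 hex digest of a page body given as bytes."""
    return hashlib.sha256(content).hexdigest()


def tampered_pages(known_signatures, fetched_pages):
    """Return the paths whose current content no longer matches its signature."""
    return sorted(
        path
        for path, body in fetched_pages.items()
        if page_signature(body) != known_signatures.get(path)
    )


# Hypothetical data: signatures recorded at release time, and pages as fetched now.
release = {"/index.html": b"<h1>welcome</h1>", "/about.html": b"<h1>about</h1>"}
known = {path: page_signature(body) for path, body in release.items()}
fetched = {**release, "/index.html": b"<h1>hacked</h1>"}

print(tampered_pages(known, fetched))
```

In a real deployment the list returned by tampered_pages would feed the alarm and the automatic restore step, and port_is_open would be called on a schedule against each server's service port.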

II. Fault Management

A. Hardware faults. In clusters of hundreds, thousands, or tens of thousands of machines, the probability of server crashes and hardware faults is very high; at almost any moment some server has a hardware problem: crashes, failed hard disks, power supplies, memory, switches. We therefore need to take these issues fully into account when designing the website architecture and treat them as the norm. The usual way to contain the risk is the application's own redundancy mechanisms, which also give the system engineers enough time to deal with the fault. (Doesn't google claim that its service is unaffected even if 800 machines go down at the same time?) This is where the skill of O & M engineers and website architects is tested. A good design achieves the self-healing capability google describes, as in gfs. A bad design means the crash of one server can set off a chain reaction of failures across a wide range of services and directly deny users a response.

B. Application faults. These may be caused by a bug, by a performance threshold being exceeded, or by an attack. The important point is to take preventive measures against them in advance. We cannot take it for granted that nothing will go wrong; we must ask what we will do if it does. This requires O & M engineers to prepare thoroughly, including fast emergency response, a methodical approach to troubleshooting, and effective fallback plans.
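The idea of treating hardware failure as the norm and relying on application-level redundancy can be illustrated with a simple failover loop: the caller tries each redundant copy in turn, so a dead server costs one failed attempt instead of a failed request. This is only a sketch of the general technique, not how gfs or any specific system does it. The replica names, the fake_send backend, and the AllReplicasFailed exception are assumptions made for illustration.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas, request, send):
    """Try each replica in turn and return the first successful response.

    `send(replica, request)` is a caller-supplied function that raises on failure.
    """
    errors = {}
    for replica in replicas:
        try:
            return send(replica, request)
        except Exception as exc:
            errors[replica] = exc  # remember the failure, move to the next copy
    raise AllReplicasFailed(errors)


# Hypothetical backend: two of the three cache servers are down.
DOWN = {"cache-1", "cache-2"}


def fake_send(replica, request):
    if replica in DOWN:
        raise ConnectionError(f"{replica} is unreachable")
    return f"{replica} served {request}"


print(call_with_failover(["cache-1", "cache-2", "cache-3"], "GET user:42", fake_send))
```

The errors collected along the way are worth logging even when the request succeeds, since they tell the system engineers which machines need attention.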

III. Automation

Automation: in short, we use tools and systems to do part of our daily manual work, freeing our hands from boring, repetitive tasks. For example, before such tools existed, installing the operating system on, say, 2,000 bare-metal machines might take 10 people 10 days and a pile of burned CDs, at a high labor cost... Now, with automated tools, a few simple commands are enough. Programs take the place of people: tasks that used to need daily manual intervention now complete automatically and report their results. Such systems also have some expert-system ability and can make simple yes/no judgments, optimizations, and selections... The benefits are obvious and need no elaboration. Automated O & M is a goal that professional O & M engineers pursue, to everyone's benefit, though it is an extremely arduous task: constantly changing business, nonstandard application designs, development models, network architecture changes, IDC changes, specification changes, and other factors can all affect the existing automation system. The system therefore needs to be modular, interface-based, and parameterized around whatever varies. Automation is one of the key tasks of O & M engineers and the embodiment of their value.
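The point about parameterizing whatever varies can be shown with a small example: one shared template plus a table of per-machine parameters, so that a change in the business touches data rather than the tool. This is a minimal sketch under stated assumptions. The profile format, the role names, and the package lists are invented for illustration and are not the syntax of any real installer.

```python
from string import Template

# Hypothetical install profile; the keys and layout are illustrative only,
# not the syntax of any real installer.
PROFILE = Template(
    "hostname=$hostname\n"
    "ip=$ip\n"
    "role=$role\n"
    "packages=$packages\n"
)

# What varies between roles lives in data, not in the rendering code.
ROLE_PACKAGES = {"web": ["nginx", "php"], "db": ["mysql"]}


def render_profile(hostname, ip, role):
    """Fill the shared template with the parameters of one machine."""
    return PROFILE.substitute(
        hostname=hostname,
        ip=ip,
        role=role,
        packages=",".join(ROLE_PACKAGES[role]),
    )


machines = [("web001", "10.0.0.11", "web"), ("db001", "10.0.1.11", "db")]
for machine in machines:
    print(render_profile(*machine))
```

Adding a new role or a new batch of 2,000 machines then means extending ROLE_PACKAGES or the machines list, while the rendering code stays unchanged.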

