1. What is O&M for large websites?
First of all, "O&M" here means O&M (operations and maintenance) for large websites, which is quite different from other kinds of O&M work. We therefore need to define what separates a large website from a small one. The definition is made mainly from the perspective of O&M complexity: the scale of the site, its popularity, the number of servers, and the PV (page view) volume. Other factors matter little. As a working definition, a large website has more than 1,000 servers and at least hundreds of millions of PV per day (roughly the top 10 in China), such as Sina, Baidu, QQ, and 51.com. Smaller websites may not have true O&M engineers at all, because of their limited scale and cost constraints. There the role is more of a "generalist" that combines network, system, and development work. Some companies, for example, put procurement contracts or IDC network planning under O&M responsibilities. It is therefore important to understand that O&M engineers must be familiar with the neighboring disciplines: network, system, system development, storage, security, DB, and so on. The O&M engineer discussed here is a full-time O&M engineer.
Let's look at how a typical product is "born":
1. First, the company's management team sets the direction. The PM studies and analyzes market demand (or copies a mature application) and produces the final detailed design.
2. The architect completes network planning and architecture design based on the product design requirements, such as PV estimates, server scale, and application architecture. (The network usually changes little unless it is a large project.)
3. The development engineers implement the design in code, and the test engineers test the application.
4. Now the O&M engineer comes on stage. On the surface, the first three steps seem to have nothing to do with O&M work. On the contrary, they are closely related to it: up-front architecture design of the application, evaluation, requisition, and procurement of software and hardware resources, assessment of performance risks in the application design, IDC, service performance and security tuning, and server system-level optimization (specific to the application). O&M engineers are responsible for server installation preparation, system installation, network, IP, and installation of the general tool set. They are also responsible for the soundness of the online application architecture, its scalability, and its security risks. Finally, they assemble and tune the product (the program), the network, and the system into one whole and launch the final product to users. The cycle then repeats: requirements -> development (upgrade) -> test -> launch (with unexpected performance and security problems following close behind).
Note that the website development model is completely different from traditional software development: 1 to 5 upgrades are common, and user experience is king. If an online problem took a year to fix, as with M$, the users would have left long ago.
Once the application is launched, the O&M work has only just begun. Specific work may include: launching upgraded versions, service monitoring, application status statistics, daily service inspection, emergency troubleshooting, daily service changes and adjustments, cluster management, service performance evaluation and optimization, database management and optimization, scaling the application architecture as PV rises or falls, security, and O&M development:
A, Use tools wherever possible (such as service monitoring, application status statistics, and service launch tools) to replace routine mechanical and manual work and improve efficiency.
B, Solve real-world service problems, such as high reliability and scalability.
C, Develop large-scale cluster management tools. For example, how can 10 thousand machines change their passwords or run a specified task within one minute? How can the operating system be installed quickly on 2,000 servers? How can PB-level data be stored, shared, and analyzed quickly across distributed IDCs and storage clusters? This series of challenges calls for the efforts of O&M engineers.
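The "run a task on 10 thousand machines within one minute" question in item C is, at its core, a parallel fan-out problem. Here is a minimal Python sketch of that idea, not a description of any particular company's tool. The host names and the fake_task function are hypothetical stand-ins; a real tool would replace fake_task with an actual remote action (for example an SSH command) and would need batching, retries, and timeouts on top.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_host(host, task):
    """Run one task against one host and return (host, ok, detail)."""
    try:
        return host, True, task(host)
    except Exception as exc:  # one bad host must not stop the whole batch
        return host, False, str(exc)


def fan_out(hosts, task, max_workers=200):
    """Apply task to every host concurrently and collect per-host results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_on_host, h, task) for h in hosts]
        for fut in as_completed(futures):
            host, ok, detail = fut.result()
            results[host] = (ok, detail)
    return results


def fake_task(host):
    """Hypothetical stand-in for a real remote action on the host."""
    if host.endswith("-13"):
        raise RuntimeError("unreachable")
    return "done"


# Hypothetical host names, for illustration only.
hosts = [f"web-{i}" for i in range(1, 21)]
results = fan_out(hosts, fake_task, max_workers=8)
failed = sorted(h for h, (ok, _) in results.items() if not ok)
print(len(results), "hosts processed; failed:", failed)
```

The point of collecting a per-host result instead of stopping at the first error is that, at this scale, some machines will always be unreachable, and the operator needs the list of failures to retry.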
Now for cooperation with other roles. Throughout a project, front-end applications are black boxes to network and system engineers. Development engineers, meanwhile, are responsible only for the functional development of the application and for the performance and security of the application itself. They are neither responsible for nor concerned with network or system architecture. Colleagues in software/hardware procurement do not care about these issues either. Everyone performs their own duties, but the core of the project is the O&M engineer, who acts as the bridge between all other departments.
I have said a lot above, and you should now have some idea of what O&M is. As an analogy, if our website is a vehicle running at high speed on a highway, the O&M engineer is both its driver and its maintenance engineer. Being this driver is not simple. Sometimes the tires must be changed at speed and the gears shifted to match road conditions. When the car has to go faster and faster and the vehicle itself can no longer keep up, the driver must tune performance or upgrade parts while moving, fix faults and performance problems at speed, watch constantly for hazards ahead, and foresee problems in order to take preventive measures. This is O&M work!
Finally, let's talk about the responsibility of O&M engineers: ensuring online stability. It sounds simple, but it is not easy. O&M engineers must weigh many negative factors: the impact of new product models on existing architectures and technologies, online bug risks caused by frequent product upgrades, human errors caused by the low adoption of O&M automation, processes left unenforced because the IT industry pursues efficiency, performance and architecture pressure caused by user growth, the loose technical management culture of the IT industry, the risks of innovation, and Internet security issues. All of these are enemies of website stability. O&M engineers are the last line of defense. They must have a strong sense of responsibility, firm principles, and good coordination skills. An engineer who can strike the best balance among these factors is a good O&M engineer.
In addition, I want to add a few remarks. I have seen many requests for people at companies such as Sina, QQ, Baidu, and 51.com to share their O&M experience. In fact, this is a little difficult for them:
A. A company's own network architecture and scale are, to a greater or lesser extent, core company secrets and must be kept confidential. In addition, even for general-purpose software and architecture that everyone knows, many companies do secondary development (for example on Apache, PHP, and MySQL), driven by their actual business needs and by the performance, security, known bugs, and functionality of the original version. The operating system kernel is also customized by business type: some applications are compute-bound, some are I/O-heavy, and some need large memory, and the kernel is tuned and customized accordingly. For example, Sina did secondary development on memcache and produced memcachedb. We will not go into how it was done, but it is commendable that it was open-sourced, since Chinese companies have made almost no contribution to open-source applications. Servers are also often not off-the-shelf models; most are customized by Dell, HP, or IBM to fit the business. Finally, for distributed storage, companies either have their own solutions or use existing open-source ones such as Hadoop. However, 90% of them draw on the ideas of Google GFS: distributed storage, distributed computing, and big tables.
B. Each company's business direction is different, which leads to different O&M modes and methods. For example, the O&M modes of 51.com and Baidu must be very different, because the business model determines the architecture, server scale, IDC distribution, network structure, and technology used. Sina, which focuses on news portals, differs greatly in O&M mode from 51.com, which focuses on SNS; even the responsibilities are not the same. The general technology and general architecture are similar, however, so there is no need to dwell too much on the specifics. Most companies are just playing with building blocks, with little real technical depth.
C. As mentioned above, O&M for large websites is still in its early years, and the concepts and experience are scattered. There is no mature body of knowledge. What is O&M? Most people have probably never really thought about it. What actually gets discussed is only the tip of the iceberg, limited to specific technical details or the architecture of some famous website. This may be why there is so little O&M material online. It is also why O&M staff are hard to recruit in China: good O&M engineers are rare.
2. What skills and qualities do O&M engineers need?
First, the skills. As shown above, O&M combines many IT roles and skills. You need to be familiar with system, network, storage, protocols, requirements, development, testing, and security. In some areas, however, you must be very familiar or even proficient: systems (the basic operating systems, *nix, Windows ...), protocols, system development (the most important daily work is automation-related development, that is, developing and managing large-scale cluster tools), general applications (such as LVS, HA, web servers, DB, middleware, and storage), networks, and IDC topology.
The skills are summarized as follows:
1. Development capability. This is very important because O&M tools must be developed by yourself. Development languages: C/C++ (one of them is required), one of Perl, Python, and PHP, and shell (awk, sed, expect ...). You must have actual development experience; otherwise, the work will be very painful.
2. General applications. You need to know: operating systems (currently mainly Linux and BSD in China), web servers and related software (Nginx, Apache, PHP, Lighttpd, Java ...), databases (MySQL, Oracle), and other miscellaneous things such as system optimization and high reliability. These are a plus rather than a requirement; you can learn them gradually on the job, and they are not difficult. Of course, within O&M there is also some division of labor among these areas.
3. System, network, security, storage, CDN, DB. You need a good understanding of the underlying principles.
Personal qualities:
1. Communication and team collaboration: O&M work cuts across many departments and roles, so it requires good communication skills and strong teamwork. This is a basic requirement in any modern company, so I will not say more.
2. Be bold but careful in your work: be bold in order to innovate and take unconventional paths. O&M is a new kind of work, so innovation is especially needed to move it forward. Be careful because O&M engineers are the website's administrators and hold the highest online permissions. One careless mistake can be regretted for a lifetime, or land you in the eighteenth level of hell.
3. Initiative, execution, energy, and resilience under pressure: the IT industry changes fast, and in O&M it is especially common for plans to fall behind changes. For example, the servers of major Chinese companies are often spread all over the country, wherever hosting is cheap and cost-effective. Migrating services on a large scale is a headache (it can involve hundreds or even thousands of servers), and the time allowed is usually very short, such as one week. This places high demands on the O&M engineer's initiative and execution: planning, solution design, seamless service migration, machine relocation and installation, environment preparation, security assessment, performance evaluation, infrastructure construction, coordination with related departments, emergency response, and so on.
4. Other basic qualities: a clear mind, strong logical thinking, a modest and steady character, approachability and willingness to help, and a sense of the big picture.
5. Finally, website O&M requires a spirit of exploration and innovation, solving practical problems through innovative thinking. This is a young profession (the same is true abroad, although it started earlier there than in China), and there is no mature system or methodology to draw on. You can only rely on your own efforts.
3. How to be a qualified O&M engineer
1. Ensure that the service meets its online standard, such as 99.9% availability. Ensuring online stability is the basic responsibility of O&M engineers.
2. Continuously improve application reliability and robustness, optimize performance, and improve security. This is a test of initiative and innovative thinking.
3. Monitor every level of the website. Coverage must include software, hardware, and running status, with statistics collected, so that there are no monitoring blind spots and the operation of the application is understood in real time.
4. Solve O&M efficiency problems through innovative thinking. At present, most companies' O&M work still depends on manual intervention, and hands need to be freed as much as possible.
5. Accumulate O&M knowledge and keep documentation complete. O&M is an experience-driven job. Good practices and pitfalls must be recorded to avoid repeating mistakes.
6. Planning and execution. Work should be planned. Once the plan is made, do your best to reach the goal without making excuses.
7. Automated O&M. Be able to distill daily mechanical work and design and develop it into tools and systems, so that the work is completed automatically by the system as far as possible. This lets everyone spend more time thinking and doing what they like.
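To make the 99.9% figure in item 1 concrete, it helps to convert an availability target into a downtime budget. This is plain arithmetic rather than anything specific to one company; the function name and the 30-day month are my own assumptions for illustration.

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of downtime allowed in period_days at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    per_month = downtime_budget_minutes(target, 30)   # 30-day month assumed
    per_year = downtime_budget_minutes(target, 365)
    print(f"{target}%: {per_month:.1f} min/month, {per_year / 60:.1f} h/year")
```

At 99.9%, the budget is about 43 minutes in a 30-day month, or roughly 8.8 hours a year, and that budget has to cover every fault and every upgrade.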
The above are only the technical aspects. Of course, personal attitude is also very important.
4. Confusion, current situation, and development prospects of the O&M profession
Other positions, such as R&D engineers and test engineers, have clearly defined responsibilities and career plans, with a sense of professional identity and accomplishment. O&M is different. The work can give the impression of knowing a bit of everything but being less proficient than the full-time engineers in each area, and O&M staff usually get little attention (unless there is an online fault). Gradually people become confused about their career development. Why? Besides the nature of the profession, the main reasons are a lack of in-depth understanding of O&M and a lack of in-depth work. The same problem occurs in other positions, but I find that it is more typical and more likely in O&M.
To address this, I will talk about the current situation and development prospects of website O&M. (I am still thinking this through, so it may not be thorough or comprehensive. Corrections and additions are welcome.)
O&M status:
1. O&M is at an early stage. Major companies have full-time O&M staff, but the role is not highly valued and is seen as easily replaceable. In small companies the work is mostly done part-time by people in other positions; there is no full-time role, and no in-depth work can be done.
2. The technical level is relatively low. The field is mainly in a stage of technical exploration and accumulation, with no systematic concepts or technology.
3. There is too much manual labor. This is mainly related to the second point. Much of the work is still done by hand, and there is no mature automated management method for large-scale clusters. Large-scale clusters and O&M are closely linked: with only a hundred machines, there is little room for O&M to develop.
4. Excellent O&M talent is extremely scarce. At present, major companies rely on training their own. As a result, O&M talent moves very little within the industry, and a lot of good technology stays confined inside major companies, such as Google's management of 0.5 million machines or the O&M experience of China's top 10 Internet companies. This experience is valuable and determines a company's core competitiveness. These problems hinder the circulation, integration, and recognition of advanced O&M technology in the industry and will ultimately limit the development of O&M.
5. Much of the best O&M experience is in the hands of large companies. This is not because of their technical strength as such, but because of their scale: massive PV and a sufficiently large hardware base, such as Baidu's enormous traffic and 51.com's massive data volume. Because of these factors, they run into problems that small and medium-sized companies have not met, or are only about to meet, and large companies may already have good solutions or systems for them.
Development Prospects:
1. From an industry perspective, with the rapid development of China's Internet (China now has the most Internet users in the world), websites keep growing in scale and their architectures are becoming more complex. The need for full-time website O&M engineers and website architects will become increasingly urgent. In particular, the demand for experienced, excellent O&M talent will increase, and veterans will become more valuable. At present, large companies in China mostly recruit graduates. The training cost is high, and the absence of experienced people slows technical renewal and affects the company's technical development. Of course, graduates have advantages too: they are a blank sheet, highly malleable, and quicker to identify with and integrate into the corporate culture.
2. From a personal perspective, the technical content and requirements of O&M engineering keep rising. O&M engineers are also the people most familiar with the company's applications and architecture, and they are getting more and more attention.
3. Website O&M will become a comprehensive technical position that integrates multiple disciplines (networks, systems, development, security, application architecture, and storage), offering good room to develop personal abilities and technical breadth.
4. O&M experience will become very important and will be a core personal competitive advantage. Experienced engineers have strong problem-solving skills, can provide solutions, and can think globally across all levels.
5. Room to develop specialties and interests. Because O&M covers a wide range of knowledge, it is easier to cultivate or exercise personal specialties or interests, such as the kernel, networking, development, or databases. You can become very proficient and an expert in that field.
6. If you no longer want to do O&M in the future, it is relatively easy to transfer to other positions without many restrictions. Of course, you have to put your heart into it.
7. Technical career direction: website or system architect.
5. Anatomy of key O&M technical points
1. Large-scale cluster management
First, we need to clarify the concept of a cluster. A cluster is not just any group of functional servers. It is the integration of server and disk resources for a certain purpose or function (the number of machines is greater than two), and to the application it appears as a whole. Conventional clusters can currently be divided into four types: high availability clusters (HA), load balancing clusters (such as LVS), distributed storage and computing clusters (DFS, such as Google GFS and Yahoo Hadoop), and application-specific clusters (combinations of servers with specific functions, such as the DB and cache layers). The Internet industry mainly uses these four types. If the business is simple and the application changes little after launch, you can simply use layer-4 switches (such as F5) to achieve high availability and load balancing. For companies with tight budgets there are also open-source solutions such as LVS + HA, which are very flexible. The latter two types test a company's technical strength and depend on the characteristics of its applications. The third type, DFS, is mainly used in massive-data applications such as email and search. Beyond plain mass storage, it also supports data mining and user behavior analysis. For example, Google and Yahoo can store and analyze nearly one year of user record data, while Baidu probably keeps less than 30 days, and Sogou less still.
...
These are essential for search preparation and user experience.
Next, let's talk about how to manage clusters scientifically. The key points are as follows:
I. Monitoring
Monitoring mainly covers fault monitoring and status monitoring (performance, traffic, load, and so on). It underpins the healthy operation of the cluster and the timely detection of, and intervention in, potential problems.
A. Service failure and status monitoring: this mainly monitors the server itself, the upper-layer applications, and the data exchanged with associated services. For front-end web servers, for example, many kinds of monitoring are possible. Application port monitoring detects a crash of the server or the application itself in time, and ICMP packets can be used to check the health of the server. At the upper layer, the services of individual channels can be monitored as well. Common methods are to check for an expected signature in the response, or to sign key pages so that hacking and tampering are detected (raising an alarm and automatically restoring the tampered data). These are only a few of the many possible monitoring methods; which ones apply depends on the characteristics of the application. Some problems also remain open, for example how to monitor with high performance when the cluster is very large.
B. Other monitoring and statistics on cluster status, which provide reference data for the rational management and optimization of clusters, including service bottlenecks, performance problems, abnormal traffic, and attacks.
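Two of the checks in item A, port status monitoring and signing key pages to detect tampering, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are made up for the example, SHA-256 is one possible choice of signature, and a real monitor would add scheduling, alarms, and recovery of the tampered page.

```python
import hashlib
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_fingerprint(content: bytes) -> str:
    """Signature of a page body; SHA-256 is an assumed choice here."""
    return hashlib.sha256(content).hexdigest()


def page_was_tampered(content: bytes, expected_fingerprint: str) -> bool:
    """Compare the current page body against the signature recorded earlier."""
    return page_fingerprint(content) != expected_fingerprint


# Record a baseline when the page is published, then compare on every check.
baseline = page_fingerprint(b"<html>home page v1</html>")
print(page_was_tampered(b"<html>home page v1</html>", baseline))  # False
print(page_was_tampered(b"<html>hacked</html>", baseline))        # True

# Port check usage (hypothetical host): port_is_open("web-1.example.com", 80)
```

A signature check like this only suits pages whose content is stable between releases; pages with dynamic content need a different approach.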
II. Fault Management
A. Hardware faults. For clusters of many machines, from hundreds to tens of thousands, the probability of server crashes and hardware faults is very high. Hardware problems occur almost every moment: crashes, damaged disks, and failures of power supplies, memory, and switches. We therefore need to take these problems fully into account when designing the website architecture and treat them as the norm. We mostly avoid this risk through the redundancy mechanisms of the application, which also gives the system engineers enough time to handle the fault. (Doesn't Google claim that its service is unaffected even if 800 machines go down at the same time?) This is where the skill of O&M engineers and website architects is tested. A good design can achieve the self-recovery capability that Google describes, as in GFS. A bad design means that the crash of one server can set off a chain reaction across a wide range of services and leave users with no response at all.
B. Application faults. These may be caused by a bug, by exceeding a performance threshold, or by an attack. The important point is to have preventive measures in place. Do not take it for granted that nothing will go wrong. And when a problem does occur, how should it be handled? This depends on the skill of the O&M engineers, including emergency response speed, systematic troubleshooting, and effective fallback plans.
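To see why hardware faults in item A have to be treated as the norm, a rough estimate helps. The following sketch assumes that machines fail independently and at a uniform rate, and it uses a hypothetical annual failure rate of 3% per machine. That figure is an assumption for illustration only, not a measurement from any company mentioned here.

```python
def expected_failures_per_day(machines, annual_failure_rate):
    """Average number of machines failing per day (uniform rate assumed)."""
    return machines * annual_failure_rate / 365


def prob_any_failure_today(machines, annual_failure_rate):
    """Probability that at least one machine fails on a given day,
    assuming failures are independent."""
    p_day = annual_failure_rate / 365
    return 1 - (1 - p_day) ** machines


ASSUMED_ANNUAL_FAILURE_RATE = 0.03  # hypothetical: 3% of machines fail per year

for machines in (100, 1_000, 10_000):
    avg = expected_failures_per_day(machines, ASSUMED_ANNUAL_FAILURE_RATE)
    prob = prob_any_failure_today(machines, ASSUMED_ANNUAL_FAILURE_RATE)
    print(f"{machines:>6} machines: {avg:.2f} failures/day on average, "
          f"{prob:.0%} chance of at least one today")
```

Even with this modest assumed rate, a 10,000-machine cluster averages close to one failed machine per day, which is why the redundancy has to live in the application design rather than in any single server.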
III. Automation
Automation, in short, means using tools and systems to do part of our daily manual work, freeing our hands from boring, repetitive tasks. For example, before tools existed, installing the operating system on, say, 2,000 bare-metal machines might take 10 people 10 days and n burned CDs, at a high labor cost... Now, with automated tools, a few simple commands are enough. Robot-like programs can also take over the daily work that used to need manual intervention, completing it automatically and reporting the results. They can even have some expert-system capability and make simple decisions, such as yes/no judgments and optimizations... The benefits are obvious and need no further mention... Automated O&M is something every professional O&M engineer should pursue, and it benefits everyone. It is, however, an extremely arduous task: constantly changing businesses, nonstandard application design, development models, network architecture changes, IDC changes, specification changes, and other factors can all affect an existing automation system. The system must therefore be modular, interface-based, and parameterized. Automation is thus one of the key tasks of O&M engineers and a demonstration of their value.
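As a small illustration of what "parameterized" means in practice, here is a Python sketch that renders a per-machine install profile from one template. Everything in it is hypothetical: the key=value format, the role names, and the package lists are invented for the example and do not correspond to any real installer's configuration format.

```python
from string import Template

# Hypothetical install profile format, for illustration only.
INSTALL_TEMPLATE = Template(
    "hostname=$hostname\n"
    "ip=$ip\n"
    "role=$role\n"
    "packages=$packages\n"
)

# Hypothetical mapping from server role to the packages it needs.
ROLE_PACKAGES = {
    "web": ["nginx", "php"],
    "db": ["mysql"],
}


def render_install_config(hostname, ip, role):
    """Render the install profile for one machine from its parameters."""
    packages = ",".join(ROLE_PACKAGES[role])  # unknown role raises KeyError
    return INSTALL_TEMPLATE.substitute(
        hostname=hostname, ip=ip, role=role, packages=packages
    )


print(render_install_config("web-1", "10.0.0.11", "web"))
```

When the business or the IDC changes, only the parameters or the role table change, while the template and the code that applies it stay the same. That separation is what keeps an automation system usable as conditions shift.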
5. Anatomy of key technical points in O & M
1. Design of large-scale, high-concurrency websites
2. High-reliability, high-scalability network architecture design
3. Website security: how to avoid being hacked
4. North-south interconnection issues and dynamic CDN solutions
5. Massive Data Storage Architecture
From: http://bbs.chinaunix.net/thread-3675820-1-1.html