Operation and Maintenance technology planning

Last Update: 2015-02-23 Source: Internet

Author: User

Key technical points in operations: 1. design of large-scale, high-concurrency websites; 2. highly reliable, highly scalable network architecture design; 3. website security, and how to avoid being hacked; 4. the north-south interconnection problem and dynamic CDN solutions; 5. massive data storage architecture.

First, what is large-scale website operations?

Throughout this article, "operations" means the operation and maintenance of large-scale websites, which differs considerably from other kinds of operations work. We first need to separate large sites from small ones. The distinction is drawn here by operational complexity, measured by factors such as site scale, visibility, number of servers, and page view (PV) volume; other factors are not the focus. For this article, a large site has more than 1,000 servers and at least a billion PV per day (roughly the domestic top 10), such as Sina, Baidu, QQ, and 51.com. Smaller sites may have no real operations engineer at all, because of their limited scale and cost constraints. There the work usually falls to one generalist who handles networking, systems, and development together. Some companies also fold contract procurement into operations, and IDC network planning as well. It is therefore important to understand that people in operations and related jobs must be very familiar with networking, systems, system development, storage, security, databases, and so on. The operations (OPS) engineer discussed here is a dedicated, full-time one.

Let's walk through how a typical product is "born":

1, Company management first sets the direction, and the PM researches and analyzes market demand (or copies a mature application) and finally produces a detailed design.

2, Based on the product design requirements, the architect estimates PV volume and server count, chooses the application architecture, and completes the network planning and architecture design (the network rarely changes, except for major projects).

3, Development engineers implement the design in code, and test engineers test the application.

4, Now it is the operations engineer's turn. To be clear, this does not mean the first three steps have nothing to do with operations. On the contrary, they are closely tied to it: up-front architecture design, assessment and procurement requests for software and hardware resources, evaluation of hidden performance risks in the application design, IDC matters, service performance and security tuning, and server system-level optimization (specific to the application) all need operations involved from start to finish, and operations leads the whole go-live project. Operations engineers are responsible for racking the servers, installing the operating system, configuring the network and IPs, and installing the common toolset. They are also responsible for whether the architecture of the launched application is sound and scalable and whether it carries security risks, and for assembling and tuning the final product (the program), the network, and the systems into a whole that goes online for users.

The cycle then repeats: requirements, development (upgrade), testing, release (at which point performance problems, security problems, and other issues that were not anticipated gradually surface). Website development is completely different from traditional software development. Shipping an upgraded version within a day is commonplace, and user experience is king; if a production problem took a year to fix, as with M$, users would have been in an uproar long before.

Once the application is online, operations work has only just begun. Specific tasks may include: releasing upgraded versions, service monitoring, application status statistics, daily service inspection, handling sudden failures, routine service changes, cluster management, service performance evaluation and optimization, database management and optimization, scaling the application architecture as PV rises and falls, security, and operations development work:

A, Turn routine, mechanical manual work into tools wherever possible (service monitoring, application status statistics, and other service tasks) to improve efficiency.

B, Solve the problems that arise in real services, such as high reliability and scalability.

C, Develop large-scale cluster management tools. How do you change the password on 10,000 machines, or run a specified task on them, within one minute? How do you install the operating system on 2,000 servers quickly? How do you store, share, and analyze PB/TB-scale data spread across distributed IDCs and storage clusters? This series of challenges calls for the efforts of operations engineers.
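
As a minimal sketch of the fan-out idea behind such cluster tools, the following Python runs one task against many hosts in parallel and collects a per-host result. The names run_on_hosts and fake_task, the host names, and the worker count are all hypothetical; a real tool would replace fake_task with an actual remote call (SSH or an agent), which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_on_hosts(hosts: Iterable[str],
                 task: Callable[[str], str],
                 max_workers: int = 200) -> Dict[str, str]:
    """Run task(host) for every host in parallel and collect the results.

    A failure on one host is recorded and does not stop the others.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # keep going; report per-host errors
                results[host] = f"ERROR: {exc}"
    return results


def fake_task(host: str) -> str:
    # Hypothetical stand-in for "connect to the host and run a command".
    if host.endswith("-13"):
        raise RuntimeError("unreachable")
    return "ok"


if __name__ == "__main__":
    hosts = [f"web-{i}" for i in range(1, 21)]
    report = run_on_hosts(hosts, fake_task, max_workers=8)
    failed = sorted(h for h, r in report.items() if r.startswith("ERROR"))
    print(f"{len(report)} hosts contacted, failed: {failed}")
```

The point of the design is that total time is bounded by the slowest host and the worker count rather than by the number of hosts, and that one dead machine shows up in the report instead of aborting the run.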

A note on the other roles. Throughout a project, the front-end application is a black box to the network and system engineers. The development engineer is responsible only for implementing the application's functionality and for the application's own performance and security; they are not responsible for, and do not concern themselves with, the network or system architecture. Hardware procurement staff and colleagues in other business departments do not concern themselves with these issues either; everyone has their own duties. The operations engineer, however, is at the core of the project and is the bridge to all the other departments.

After all of the above, you should have some idea of what operations is. An analogy: if the website is a car speeding down the highway, the operations engineer is both its driver and its mechanic. This is no ordinary driver: sometimes the tires must be changed while the car is moving at speed, and the gears shifted to suit the road. As the car goes faster and faster and can no longer meet the demands of high speed, the engineer must tune its performance or upgrade its parts, fix breakdowns and performance problems while still driving, keep watching for hazards ahead, and take evasive action in good time. That is operations work!

Finally, the responsibility of the operations engineer is "to keep the production service stable." It sounds simple, but it is not easy. The operations engineer must weigh a number of adverse factors: the impact of new product types on the existing architecture and technology, production bugs brought by frequent product upgrades, human error caused by shortcomings in operations automation and management, processes left unfollowed in the IT industry's pursuit of speed, growing pressure on performance and architecture as the user base grows, the IT industry's loose culture of technical management, the risks of innovation, Internet security problems, and more. All of these are enemies of site stability, and the operations engineer must hold this last line of defense. That takes a strong sense of responsibility, principle, and the ability to coordinate. An engineer who can strike the best balance among these factors is a good operations engineer.

As an aside, I see many people asking engineers from Sina, QQ, Baidu, 51.com, and others to share their own operations experience. In fact, that puts them in a somewhat difficult position:

A, A company's network structure and scale are more or less core company secrets and must be kept confidential. Moreover, the general-purpose software and architectures everyone knows are often not used as-is: many companies carry out secondary development on them (Apache, PHP, and MySQL, for example) to fit their actual business needs, or because of limitations in the original's performance, security, known bugs, or features. The operating system kernel is also customized for different business types: some applications are compute-intensive, some are I/O-intensive, and some need large storage or large memory, and the kernel is tuned and customized to match. For example, Sina did secondary development on memcache and produced memcachedb. We will not go into how, but open-sourcing it is commendable; domestic companies mostly take from open source and contribute nothing back. Server models are also unfamiliar to outsiders, since most of these companies have DELL, HP, or IBM customize hardware for their business characteristics. In distributed storage, too, each company has its own solution, either using open-source solutions such as Hadoop or developing its own. But 90% follow the ideas of Google's GFS: distributed storage, distributed computing, and big tables.

B, Different business directions lead to different operations models and methods. Operations at 51.com and Baidu are certainly different, because the business model determines the architecture, server scale, IDC distribution, and network structure, and the specific technologies in use will differ too. Sina, mainly a news portal, and 51.com, mainly an SNS, have very different operations models, and even the job duties are not the same. Still, one thing holds: the general-purpose technologies and the broad architecture are similar, so we should not deify these companies too much. Many of them are just playing a brick-stacking game with little technical content.

C, As mentioned above, large-scale website operations is still in its early days. Ideas and experience are fragmented and there is no mature body of knowledge. People may not yet have thought through what operations actually is, or have not thought about it at all. What actually gets discussed is only the tip of the iceberg of operations work, limited to specific technical details or the high-level architecture of some famous website; a systematic treatment of operations itself is missing. This may be why so little operations material is available online. It may also be one reason why operations staff are hard to recruit domestically and why really strong operations engineers are rare.

Second, what skills and qualities does operations require?

What skills and qualities do you need as an operations engineer? Skills first. As you can see from the above, operations is a post that combines many IT skills. You need some knowledge of systems, networking, storage, security, and other areas, and in some of them you need to be familiar or even proficient: systems (everyday use of the basic operating systems, *nix, Windows, ...), protocols, system development (an important part of daily work is development for operations automation, and the development and management of large-scale cluster tools), general-purpose applications (such as LVS, HA, web servers, DB, middleware, and storage), networking, and IDC topology and architecture.

The skills can be summarized as follows:

1, Development ability. This is very important, because operations tools all have to be developed. Languages: one of Perl, Python, or PHP, plus shell (awk, sed, expect, and so on). You need real project development experience; otherwise the work will be very painful.

2, General-purpose applications you need to know: operating systems (currently mainly Linux and BSD), web server-related software (Nginx, Apache, PHP, lighttpd, Java, ...), databases (MySQL, Oracle), and assorted other odds and ends, as well as system optimization and high reliability. These are only a bonus, not a requirement; you can learn them on the job, and they are not difficult. Of course, the emphasis differs with the division of labor within operations.

3, You need to understand systems, networking, security, storage, CDN, DB, and so on, and know the principles behind them.

Personal qualities:

1, Communication skills and teamwork. Operations work cuts across many departments and roles, so you need to communicate well and work well in a team. This is a basic requirement in any modern enterprise, so there is not much more to say.

2, Be bold but careful. Boldness lets you innovate and leave the beaten path; a new kind of job like operations especially needs innovation to move forward. Carefulness matters because the operations engineer is the site's admin and holds the highest privileges in production: one careless moment can leave you with lifelong regret, or in the eighteenth level of hell.

3, Initiative, execution, energy, and resilience under pressure. The IT industry changes fast, and plans often cannot keep up with change; this is even more pronounced in operations. For example, major domestic companies have servers spread across the country and move them to wherever hosting is cheapest and most cost-effective, which means large-scale service migrations involving hundreds of servers. This is a real headache, and time is often tight, for example a one-week deadline. In such cases a great deal is demanded of the operations engineer's initiative and execution: planning, the migration plan, seamless service migration, relocating and racking machines, environment preparation, security assessment, performance assessment, infrastructure, wrangling with the relevant departments, 7x24 emergency response, and so on.

4, Then there are some basic qualities: a quick mind, logical thinking, modesty and steadiness, approachability, helpfulness, and a sense of the bigger picture.

5, Finally, site operations requires a spirit of exploration and innovation, solving real problems through innovative thinking. The profession is still in its early days (abroad as well, although it started earlier there than domestically), and there is no mature system or methodology to draw on; practitioners can only feel their own way forward.

Third, what makes a qualified operations engineer?

1, Ensure the service meets its online availability standard, such as 99.9%, and keep production stable. This is the basic responsibility of the operations engineer position.

2, Continuously improve the application's reliability and robustness, optimize performance, and strengthen security. This tests initiative and innovative thinking.

3, Cover every level of the site with monitoring and statistics: software, hardware, and running state. Monitor and record everything that can be, avoid monitoring blind spots, and know the state of the application in real time.

4, Solve operational efficiency problems through innovative thinking. At present, most operations work at major companies still relies on manual intervention; hands need to be freed as much as possible.

5, Accumulate and consolidate operations knowledge and keep documentation complete. Operations is a post that depends heavily on experience; good practices and pitfalls need to be recorded to avoid repeating mistakes.

6, Planning and execution. Work to a plan, and once the plan is set, find a way to reach the goal rather than looking for excuses.

7, Automate operations. Distill daily mechanical work into tools and systems through design and development, and let the system do automatically whatever it can, so that everyone has more time for thinking, innovation, and the things they want to do.

These are only some of the technical aspects; personal awareness is, of course, also very important.
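
To make the 99.9% target in point 1 concrete, the arithmetic below converts an availability percentage into a downtime budget. The function name and the 30-day and 365-day periods are chosen for illustration only; how a given company measures availability is an assumption outside this sketch.

```python
def downtime_budget_minutes(availability_percent: float, days: float) -> float:
    """Minutes of downtime allowed over `days` at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        per_month = downtime_budget_minutes(target, 30)
        per_year = downtime_budget_minutes(target, 365)
        print(f"{target}%: {per_month:.1f} min per 30 days, "
              f"{per_year / 60:.2f} hours per year")
```

At 99.9%, the budget works out to about 43.2 minutes per 30 days, or about 8.76 hours per year, which is why a single badly handled failure can consume the whole allowance.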

Fourth, the confusion, current state, and prospects of the operations profession

Operations is unlike other positions, such as R&D engineer or test engineer, which have clear role definitions and career paths and a stronger sense of professional identity and accomplishment. Operations work can leave people feeling that they know a bit of everything but none of it as well as the full-time specialists, and that they usually get little attention (unless production fails). Over time people become confused about their career development. Why does this happen? Apart from the nature of the profession itself, the main cause is an understanding of operations that is not deep enough. The same can happen in other positions, but I have found operations to be the more typical case and more prone to this problem.

In response to this question, here are my views on the current state and prospects of website operations (I am still thinking this through and my view may not be thorough; additions and corrections are welcome).

Current state of operations:

1, The field is in its infancy. Major companies have full-time operations posts, but the importance attached to them is not high and the roles are seen as easily replaceable. At small companies the work is mostly done part-time by people in other positions, with no full-time staff, so it cannot be done in depth.

2, The technical level is relatively low. The field is mainly in a stage of technical exploration and accumulation, and no systematic body of ideas and technology has taken shape.

3, There is a lot of manual labor. This is mainly related to point 2: many things still depend on manpower, good practices have not been established, and there are no mature automated methods for managing large clusters. Note that operations work is closely tied to large clusters; with only a handful of machines there is not much room for operations to develop.

4, Outstanding operations staff are extremely scarce. Major companies currently train their own, and as a result talent mobility within the industry is very low. Much good technology stays locked inside the major companies, such as Google's scientific management of 500,000 machines, or the operations experience of the top 10 domestic Internet companies. This experience is very valuable and shapes a company's core competitiveness, which in turn hinders the circulation, spread, and borrowing of advanced operations techniques across the industry and ultimately limits the development of operations.

5, Much of the best operations experience is in the hands of large companies. This is not because of their technical strength but because of their scale: massive PV and hardware fleets that are large enough, such as Baidu's enormous traffic or 51.com's huge data volumes. These factors mean the problems they encounter are ones that small companies have not yet met or are about to meet, while the large companies may already have a good solution or system.

Development prospects:

1, From the industry's point of view, with the rapid development of China's Internet (the number of Chinese Internet users is now the largest in the world), websites are becoming larger and more complex, and the need for dedicated website operations engineers and website architects will become more and more urgent, especially for experienced, excellent operations talent; the more experienced, the more valuable. At present, domestic companies mostly hire and train graduates (which only large companies can do). Training costs are high, and a shortage of experienced hires slows technical renewal and affects the company's technical development. Graduates do have advantages, of course: they are a blank sheet, malleable, and identify with and integrate into the corporate culture more easily.

2, From the individual's point of view, the technical content and requirements of the operations engineer role will keep rising. Operations engineers are also the people most familiar with the company's applications and architecture, and they will receive more and more attention.

3, Website operations will become a comprehensive technical post spanning many disciplines (networking, systems, development, security, application architecture, storage, and so on), which gives individuals plenty of room to develop their abilities and technical breadth.

4, Operations experience will become very important and will become an individual's core competitiveness: the ability to solve problems and provide solutions in every area, and the ability to think globally.

5, Room for special strengths and interests. Because operations touches such a wide range of knowledge, it is easier to develop a personal specialty or interest in one area, such as the kernel, networking, development, or databases, go very deep, and become an expert in it.

6, If you later decide you no longer want to do operations, moving to another position is relatively easy, with few limitations. Of course, you have to have really put your heart into the work.

7, Technical career direction: website/system architect.

Fifth, key technical points of operations

1, Large-scale cluster management

First, we need to be clear about what a cluster is. A cluster is not just a collection of servers; it is an integration of server and disk resources (two or more machines) assembled to serve a particular purpose or function, and to the application it appears as a single whole. Conventional clusters can be divided into: high-availability (HA) clusters; load-balancing clusters (such as LVS); distributed storage and compute clusters (DFS, such as Google GFS and Yahoo's Hadoop); and application-specific clusters (a combination of servers for one function, such as a DB or cache tier). The Internet industry relies mainly on these four types.

The first two are similar. If the business is simple and the application needs little additional handling, you can simply adopt a layer-4 switch solution (such as F5) to provide high availability and load balancing. Companies with tight resources can use open-source solutions such as LVS+HA, which are very flexible. The latter two test a company's technical strength and depend on the application's characteristics. The third type, DFS, is mainly used for applications with very large data volumes, such as mail and search. Search is especially demanding: beyond plain mass storage it also involves data mining and user behavior analysis. For example, Google and Yahoo can retain and analyze nearly a year of user record data, while Baidu probably keeps less than 30 days, and Sogou even less. These matter greatly for search accuracy and user experience.
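
As a rough illustration of what a load-balancing cluster does for the application, the sketch below rotates requests across backends and skips any that have been marked unhealthy. It is a toy round-robin picker written for this article, not the implementation used by LVS or F5; the class name RoundRobinBalancer and the backend names are hypothetical.

```python
from itertools import cycle
from typing import Iterable, Set


class RoundRobinBalancer:
    """Pick backends in rotation, skipping those currently marked down."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._ring = cycle(self._backends)
        self._down: Set[str] = set()

    def mark_down(self, backend: str) -> None:
        self._down.add(backend)

    def mark_up(self, backend: str) -> None:
        self._down.discard(backend)

    def pick(self) -> str:
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            backend = next(self._ring)
            if backend not in self._down:
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-a", "web-b", "web-c"])
    print([lb.pick() for _ in range(4)])
    lb.mark_down("web-b")  # e.g. a health check failed
    print([lb.pick() for _ in range(3)])
```

To the caller the three servers look like one service: a failed backend simply stops receiving requests until a health check marks it up again.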

Next, we'll talk about how to manage the cluster scientifically, with the following key points:

I. Monitoring

This mainly covers fault monitoring and status monitoring of performance, traffic, load, and so on. Monitoring underpins the healthy operation of the cluster and the timely detection of, and intervention in, potential problems.

A, Service fault and status monitoring. This mainly covers the server itself, the applications running on it, and the data exchanged between related services. For a front-end web server, for example, many kinds of monitoring are possible: monitoring the application port, so that a crash of the server or of the application itself is detected promptly, and probing server health with ICMP packets. At a higher layer, the application's individual business channels can also be monitored. A common method is to check for a characteristic marker on the page, or to check the signatures of key pages, to guard against the site being defaced (raising an alarm and automatically restoring the tampered data). These are only some of the options; there are many more, depending on the characteristics of the application. Some problems also remain to be solved: when the cluster is very large, for instance, how to monitor it efficiently is a real problem.

B, The other kind is cluster status monitoring and statistics, which provide reference data for managing and tuning the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, attacks, and so on.
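
Two of the checks described in point A, port monitoring and page-signature checking, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, SHA-256 is chosen here as the signature (the original does not name an algorithm), and alarming and automatic restoration are left out.

```python
import hashlib
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def contains_marker(body: bytes, marker: bytes) -> bool:
    """Check that a known marker string is still present in the page."""
    return marker in body


def page_signature(body: bytes) -> str:
    """Signature of a page body; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(body).hexdigest()


def page_is_tampered(body: bytes, expected_signature: str) -> bool:
    """True if the page no longer matches the signature recorded at release."""
    return page_signature(body) != expected_signature


if __name__ == "__main__":
    released = b"<html><!-- site-ok --><body>home</body></html>"
    baseline = page_signature(released)          # recorded when the page ships
    fetched = b"<html><body>defaced</body></html>"  # what a later probe sees
    print("marker present:", contains_marker(fetched, b"site-ok"))
    print("tampered:", page_is_tampered(fetched, baseline))
```

A whole-page signature only suits pages whose content is static between releases; for dynamic pages the marker check is the more practical of the two.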

II. Fault Management

A, Hardware failures. With hundreds to tens of thousands of machines spread across many clusters, the probability of server crashes and hardware failures is very high. Almost at every moment some server has a hardware problem: a crash, a damaged disk, a failed power supply, memory, or switch. We need to take this into account when designing the site architecture and rely as far as possible on the application's redundancy mechanisms to contain the risk, which also gives the system engineers ample time to deal with the fault. (Google reportedly claims that even if 800 machines die at the same time, its service is not affected.) This is where the operations engineer and the site architect prove their worth. A good design achieves the self-recovery that Google describes, as in GFS. With a bad design, the crash of a single server can set off a cascading failure across a wide range of services that users see directly as refused requests.

B, Application failures. A bug may have been triggered, a performance threshold exceeded, an attack launched; the situations vary. The important point is to have preventive measures for these problems and not to take it for granted that nothing will go wrong. When a real problem occurs, how will it be handled? This requires operations engineers to prepare well in ordinary times, including the speed of emergency response, sound fault-handling procedures, and effective fallback plans.
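
The claim that hardware is failing "almost at every moment" in a large fleet follows from simple arithmetic, sketched below. The 5% annual failure rate and the fleet sizes are illustrative assumptions, not figures from this article, and the model assumes failures are independent and spread evenly over the year.

```python
def expected_daily_failures(fleet_size: int, annual_failure_rate: float) -> float:
    """Average number of machines failing per day."""
    return fleet_size * annual_failure_rate / 365


def prob_any_failure_per_day(fleet_size: int, annual_failure_rate: float) -> float:
    """Probability that at least one machine fails on a given day."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** fleet_size


if __name__ == "__main__":
    assumed_rate = 0.05  # hypothetical: 5% of machines fail per year
    for fleet in (100, 1_000, 10_000):
        print(f"{fleet:>6} machines: "
              f"{expected_daily_failures(fleet, assumed_rate):.2f} failures/day, "
              f"P(at least one today) = "
              f"{prob_any_failure_per_day(fleet, assumed_rate):.0%}")
```

Under these assumptions a 10,000-machine fleet loses more than one machine a day on average, so the architecture has to treat failure as the normal case rather than the exception.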

III. Automation

Automation, in short, means having tools and systems automatically complete work that we otherwise do by hand every day, freeing our hands from dull, repetitive labor. For example, before tools existed, each system had to be installed on bare metal by hand; 2,000 machines might take 10 people 10 days, wear out countless CDs, and cost even more in labor. With automation tools, a few simple commands get it done. There are also robot-like programs that automatically complete work that used to need daily manual intervention, report the results, and have some expert-system capability: they can make simple yes/no judgments, choose the better option, and so on. The benefits are obvious and need no elaboration. Automation is the professional goal that operations engineers pursue, though it is extremely hard to achieve: changing business requirements, non-standardized application design and development models, network architecture changes, IDC changes, specification changes, and other factors can all affect an existing automation system. Such systems therefore need to be modular, interface-based, and parameterized wherever things vary. Automation work is one of the core focuses of an operations engineer's job and also where their value shows.
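
The advice to parameterize whatever varies can be illustrated with a small template-driven generator for per-host install profiles. Everything specific here is hypothetical: the profile keys, the role-to-package mapping, and the host names are invented for the example and do not correspond to any real installer's format.

```python
from string import Template

# Hypothetical install profile; the keys are illustrative only.
INSTALL_TEMPLATE = Template(
    "hostname=$hostname\n"
    "ip=$ip\n"
    "role=$role\n"
    "packages=$packages\n"
)

# Hypothetical mapping from server role to the packages that role needs.
ROLE_PACKAGES = {
    "web": ["nginx", "php"],
    "db": ["mysql"],
}


def render_install_config(hostname: str, ip: str, role: str) -> str:
    """Fill the shared template with the parameters of one host."""
    if role not in ROLE_PACKAGES:
        raise ValueError(f"unknown role: {role}")
    return INSTALL_TEMPLATE.substitute(
        hostname=hostname,
        ip=ip,
        role=role,
        packages=",".join(ROLE_PACKAGES[role]),
    )


if __name__ == "__main__":
    fleet = [("web-001", "10.0.0.11", "web"), ("db-001", "10.0.0.21", "db")]
    for hostname, ip, role in fleet:
        print(render_install_config(hostname, ip, role))
```

Because the template and the role table are data rather than code, a change in IDC, network layout, or package set means editing one table instead of rewriting the tool, which is the kind of resilience to change the paragraph above asks for.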

Http://bbs.chinaunix.net/thread-3667292-1-1.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
