Responsibilities and prospects for operations engineers

Source: Internet
Author: User

Key technical points in operations and maintenance, examined:
1 Design schemes for large, high-concurrency websites;
2 Highly reliable, highly scalable network architecture design;
3 Website security: how to avoid being hacked;
4 The North-South interconnection problem and dynamic CDN solutions;
5 Storage architecture for massive data

First, what is large-scale website operations?
To be clear from the start, "operations" throughout this article means the operation and maintenance of large-scale websites, which differs considerably from other kinds of operations work. Next, we need to draw the line between large sites and small sites. The definition rests mainly on operational complexity: how standardized the site is, how well known it is, the number of servers, and the daily page views (PV); other factors are secondary. For this article, a large site has more than 1,000 servers and daily PV in at least the hundreds of millions (roughly the domestic top 10), such as Sina, Baidu, QQ, and 51.com. Smaller sites may have no operations engineer in the true sense, because of looser standards and cost constraints; more often they have a "generalist" who combines network, system, and development work. Some companies even fold contract procurement or IDC network planning into the operations role. It is therefore important to be clear that operations engineers must be very familiar with the related fields: networking, systems, system development, storage, security, databases, and so on. The operations engineer discussed here is a full-time, dedicated operations engineer.

Let's walk through how a typical product is "born":
1, Company management sets the direction, and the PM researches and analyzes market demand (or copies a mature application), then produces a detailed design.

2, The architect completes the network planning, architecture design, and so on, based on the product design requirements: estimated PV, number of servers, application architecture, and other factors. (The network rarely changes, except for major projects.)

3, Development engineers implement the design in code, and test engineers test the application.

4, Now it is the operations engineer's turn. To be clear, this does not mean the first three steps have nothing to do with operations. On the contrary, they are closely tied to it. Operations must take part throughout in the application's early architecture design, the evaluation and purchase requests for software and hardware resources, the assessment of performance risks in the application design, IDC matters, the tuning of service performance and security, and system-level server optimization (which depends on the specific application), and must lead the whole launch project. The operations engineer is responsible for racking the product's servers and for installing the operating system, network, IPs, and the common toolset. The operations engineer is also responsible for whether the architecture of the launched application is sound, whether it can scale, and whether it carries security risks, and finally for fitting the product (the program), the network, and the system together into an optimized whole so that the product can go live for users. The cycle then repeats: requirements, development (upgrade), testing, launch, at which point performance problems, security problems, and other issues nobody predicted gradually surface. One point worth noting: website development is completely different from traditional software development. Shipping upgraded versions day after day is commonplace, because user experience is king; if an online problem took a year to fix, as with M$, the users would be long gone. Once the application is online, operations work has only just begun. Specific work may include: releasing upgraded versions, service monitoring, application status statistics, daily service inspections, handling sudden failures, routine service changes, cluster management, service performance evaluation and optimization, database management and optimization, scaling the application architecture as PV rises and falls, security, and operations development work:
A, Automate routine, mechanical manual work with tools wherever possible (service monitoring, application status statistics, service releases, and so on) to improve efficiency.
B, Solve the problems that real services face, such as high reliability and scalability.
C, Develop large-scale cluster management tools. How do you change the password on 10,000 machines, or run a given task on them, within 1 minute? How do you install the operating system on 2,000 servers quickly? How do you quickly store, share, and analyze petabyte-scale data spread across distributed IDCs and storage clusters? Challenges like these all depend on the efforts of operations engineers.
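To make the "run a task on 10,000 machines within a minute" question concrete, here is a minimal sketch of the fan-out idea in Python: run the same task against many hosts in parallel and collect a per-host result. The function name run_on_hosts, the host names, and fake_task are all hypothetical; a real tool would replace fake_task with an SSH call or an agent request, and would also need batching, timeouts, and retries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_hosts(hosts, task, max_workers=64):
    """Run task(host) on every host concurrently.

    Returns {host: (ok, result_or_error_message)} so that one failing
    host never hides the results of the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = (True, future.result())
            except Exception as exc:  # record the failure and keep going
                results[host] = (False, str(exc))
    return results


def fake_task(host):
    """Hypothetical stand-in for a remote command (e.g. a password change)."""
    if host.endswith("7"):
        raise RuntimeError("unreachable")
    return f"done on {host}"


hosts = [f"web{i:03d}" for i in range(20)]
report = run_on_hosts(hosts, fake_task)
failed = sorted(h for h, (ok, _) in report.items() if not ok)
print(f"{len(report) - len(failed)} ok, failed: {failed}")
```

The point of the design is that the wall-clock time is bounded by the slowest batch rather than by the sum of all hosts, and that failures come back as data to be retried instead of aborting the run.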

A brief note on the other roles. Across the whole project, the front-end application is a black box to the network and system engineers. The development engineer, meanwhile, is responsible only for building the application's functionality and for the performance and security of the application itself, not for network or system architecture. Software and hardware purchasers and other colleagues do not concern themselves with these issues either; everyone has their own duties. The operations engineer, however, is at the core of the project and is the bridge to all the other departments.

After all of the above, you should have some concept of operations. Here is an analogy: if the website is a car speeding along a highway, the operations engineer is both the driver and the mechanic. This is no simple driver. Sometimes the tires have to be changed while the car is moving at speed, and the gears shifted to suit the road. As the car goes faster and can no longer keep up, its performance has to be tuned or its parts upgraded. Breakdowns and performance problems have to be solved in motion, and the driver must always watch for safety hazards ahead and take evasive action before they arrive. That is operations work!

Finally, the responsibility of the operations engineer: "keep production stable". It sounds simple, but it is not easy. The operations engineer has to weigh many adverse factors: the impact of new product models on the existing architecture and technology, the hidden production bugs that come with frequent product upgrades, human error caused by immature automation, the gaps in process that result from the IT industry's pursuit of speed, the pressure that user growth puts on performance and architecture, the IT industry's loose technical management culture, the risks of innovation, and Internet security problems. All of these are enemies of site stability. The operations engineer is the last line of defense and needs a strong sense of responsibility, firm principles, and the ability to coordinate. Whoever can strike the best balance among these factors is an excellent operations engineer.

One digression. I see many people here asking engineers from Sina, QQ, Baidu, 51.com, and similar companies to share their operations experience. In fact, this puts them in a difficult position:
A, Each company's network architecture and scale are, to a greater or lesser degree, core company secrets and must be kept confidential. In addition, even the general-purpose software and architectures everyone knows have often been modified: many companies do secondary development on them (for example Apache, PHP, and MySQL) to suit their actual business needs and to address the original's performance, security, known bugs, and features. The operating system kernel is also customized for different business types, since some applications are compute-intensive, some are I/O-intensive, and some need large storage or large memory, and the kernel is tuned to those characteristics. Sina, for example, did secondary development on memcache and produced MemcacheDB. We won't discuss how well it was done, but they open-sourced it, which is commendable; domestic companies mostly only take from open source. Servers are not the well-known stock models either: depending on business characteristics, most are custom orders from DELL/HP/IBM. Companies also have their own solutions for distributed storage, either using open-source solutions such as Hadoop or developing their own. But 90% borrow the ideas of Google GFS: distributed storage, distributed computation, and big tables.

B, Companies have different business directions, which leads to different operations models and methods. Operations at 51.com and at Baidu are certainly very different, because the business model determines the architecture, the number of servers, the IDC distribution, the network structure, and the general-purpose technology in use. The operations models of Sina, mainly a news portal, and 51.com, mainly an SNS, differ greatly; even the duties are not the same. Still, the general-purpose technology and the overall architecture are broadly similar everywhere, so there is no need to deify any of it. Most companies are just playing with building blocks, with little real technical content.

C, As mentioned above, large-scale website operations is still in its infancy. Its ideas and experience are fragmented, and there is no mature body of knowledge. Asked what operations actually is, most people have to stop and think first, or have never thought about it at all. What does get discussed is only the tip of the operations iceberg, limited to specific technical details or to the high-level framework of some famous website; there is nothing on operations as a real system. This may be why so little operations material is available online. It may also be one reason why operations people are hard to recruit domestically and why really strong operations engineers are rare.

Second, what skills and qualities does operations work require?
What skills and qualities does an operations engineer need? Let's start with skills. As the sections above show, operations is a position that combines the skills of several IT roles. It requires knowledge spanning systems, networking, requirements, and development: systems (familiarity with the basic operating systems, *nix and Windows), protocols, system development (an important part of the daily work is development for automated operations, and the development and management of large-scale cluster tools), general-purpose applications (such as LVS, HA, web servers, DB, middleware, and storage), networking, and IDC topology and architecture.

In terms of skills, the main points are:
1, Development ability. This is very important, because operations tools all have to be developed in-house. Languages: one of Perl, Python, or PHP, plus shell (awk, sed, expect, and so on). Real project development experience is needed; otherwise the work will be very painful.

2, General-purpose applications to understand: operating systems (currently mainly Linux and BSD), web-server-related software (Nginx, Apache, PHP, lighttpd, Java, ...), databases (MySQL, Oracle), and other miscellaneous things; also system optimization and high reliability. These are bonus items rather than prerequisites. They are easy to pick up and can be learned on the job. Of course, within operations the work is divided, and different roles have different emphases.

3, Systems, networking, security, storage, CDN, DB, and so on need to be well understood, including the principles behind them.

In terms of personal qualities:
1, Communication and teamwork. Operations work cuts across many departments and roles, so you need to be good at communicating and strong at teamwork. This is a basic requirement of any modern company and needs no further comment.

2, Be bold but careful. Boldness is what allows innovation and leaving the beaten path, and a new kind of job like operations especially needs innovation to move forward. Care is needed because the operations engineer is the website admin, the holder of the highest privileges in production; one careless slip can mean lifelong regret or a trip to the eighteenth level of hell.

3, Initiative, execution, energy, and resilience under pressure. The IT industry changes quickly, and plans often cannot keep up with change; this is even more pronounced in operations. For example, the servers of the major domestic companies are often spread across the country, and they move to wherever is cheap and cost-effective, which means large-scale service migrations involving hundreds of servers. This is a real headache, and time is usually very tight, for example one week to complete the move. In this situation a great deal is demanded of the operations engineer's initiative and execution: planning, solution design, seamless service migration, machine relocation, environment preparation, security assessment, performance assessment, infrastructure, wrangling with the various related departments, and 7x24 emergency response.

4, Other basic qualities: a quick mind, strong logical thinking, modesty and steadiness, approachability, helpfulness, and a sense of the bigger picture.

5, Finally, website operations calls for a spirit of exploration and innovation, using innovative thinking to solve real problems. This is a profession in its infancy (abroad as well, although it started earlier there than domestically), with no mature system or methodology to draw on, so progress depends on everyone's own exploration and effort.

Third, what makes a qualified operations engineer?
1, Ensure that the service meets the required production standard, such as 99.9% availability, and keep production stable. This is the basic responsibility of the position.

2, Continuously improve the reliability and robustness of the application, optimize its performance, and strengthen its security. This tests initiative and innovative thinking.

3, Have monitoring and statistics in place at every level of the site: software, hardware, and running state. Everything that can be monitored should be monitored and measured, leaving no blind spots, so that the state of the application is known in real time.

4, Solve operational efficiency problems through innovative thinking. At present most operations work at the major companies still relies on manual intervention; hands need to be freed up as much as possible.

5, Accumulate and distill operations knowledge, and keep documentation complete. Operations is a position that depends heavily on experience; good practices and pitfalls need to be recorded so that mistakes are not repeated.

6, Planning and execution. Work to a plan, and once the plan is set, do whatever it takes to reach the goal without making excuses.

7, Automated operations. Distill routine, mechanical work and design and develop it into tools and systems. Whatever the system can do automatically should be left to the system, so that everyone has more time for thinking, for innovation, and for the things they enjoy.

These are only the technical aspects; personal attitude is, of course, also very important.
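A target such as the 99.9% in item 1 is easier to reason about once it is converted into a downtime budget. The following is a small worked sketch; the function names and the 30-day window are assumptions made for illustration, not a standard defined in this article.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Minutes of downtime allowed in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def achieved_availability(total_minutes, downtime_minutes):
    """Availability in percent, given the observed downtime."""
    return 100 * (1 - downtime_minutes / total_minutes)


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

At 99.9% the 30-day budget is roughly 43 minutes, which is why a single slow manual recovery can consume a whole month's allowance.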

Fourth, the confusion, current state, and prospects of the operations profession
Operations is unlike other positions, such as R&D engineer or test engineer, which have very clear responsibilities and career paths and offer a stronger sense of professional identity and achievement. Operations work can feel like knowing a little about everything while being less proficient in any one area than the dedicated engineers, and it tends to receive little attention (unless production fails). Over time people become confused about their career development. Why does this happen? Apart from the nature of the profession itself, the main cause is that operations is not understood deeply or practiced deeply. The same thing happens in other positions, but I find that it is more typical of operations, which is more prone to this problem.

In response, I will describe the current state and the prospects of website operations. (I am still thinking this through and my view may not be thorough; additions and criticism are welcome.)

Current state of operations:
1, It is at a very early stage. The major companies have full-time operations positions, but they are not valued highly and are seen as easily replaceable. At small companies the work is mostly covered part-time by people in other positions; with no dedicated staff it cannot be done in depth.

2, The technical level is relatively low. The field is mainly in a phase of technical exploration and accumulation, and no systematic body of ideas and techniques has formed.

3, There is a lot of manual labor. This is mainly related to point 2: much of the work still depends on people and has not been distilled into good practice, and there are no mature automated methods for managing large-scale clusters. Note that operations work is closely tied to large-scale clusters; with only a handful of machines, there is little room for operations to exist.

4, Excellent operations people are in extremely short supply. The major companies currently train their own, so the mobility of operations talent within the industry is very low, and a great deal of good technology stays locked inside the major companies, such as Google's scientific management of 500,000 machines or the experience of the top 10 domestic Internet companies. This experience is valuable and determines a company's core competitiveness. The result is that advanced operations technology does not flow, spread, or get borrowed across the industry, which ultimately limits the development of operations.

5, Much of the best operations experience is held by large companies. This is not a matter of their technical strength but of their scale: massive PV and very large hardware fleets, such as Baidu's frightening traffic or 51.com's massive data. These factors mean that they run into problems that small and medium companies have not yet met, or are only about to meet, and the large companies may already have good solutions or systems for them.

Prospects:
1, From the industry's point of view, with the rapid development of China's Internet (China now has the largest number of Internet users in the world), websites are growing larger and their architectures more complex. The demand for dedicated website operations engineers and website architects will become more and more urgent, especially for experienced people, who become more valuable with age. At present, domestic companies mostly choose to train graduates (and only large companies can), which is expensive; without experienced people joining, a company's technology updates slowly and its technical development suffers. Of course, graduates have their advantages too: they are a blank sheet, highly adaptable, and more likely to identify with and fit into the corporate culture.

2, From the individual's point of view, the technical content of the operations role and the demands on it will keep rising. The operations engineer is also the person who knows the company's applications and architecture best, and will be valued more and more.

3, Website operations will become a comprehensive technical position that combines several disciplines (networking, systems, development, security, application architecture, storage, and so on), offering plenty of room to develop both personal ability and technical breadth.

4, Operations experience will become very important and will become an individual's core competitiveness: the ability to solve problems at every level, to propose solutions, and to think globally.

5, Developing strengths and interests. Because operations touches such a wide range of knowledge, it is easier to develop or apply a personal strength or interest in a particular area, such as the kernel, networking, development, or databases, to go very deep in it, and to become an expert.

6, If you later decide you no longer want to do operations, moving to another position is relatively easy, with few limitations. Of course, you have to have genuinely put your heart into the work.

7, Direction of technical development: website or system architect.

Fifth, key technical points in operations, examined
1, The problem of managing large-scale clusters
First we need to be clear about what a cluster is. A cluster is not simply the sum of all servers of every function; it is a consolidation of server and disk resources (more than two machines) that serves one purpose or function, and from the application's point of view it is a single whole. Clusters today generally fall into four types: high availability clusters (HA); load balancing clusters (such as LVS); distributed storage and compute clusters (DFS, such as Google GFS and Yahoo Hadoop); and application-specific clusters (a group of servers with one particular function, such as the DB or cache layer). The Internet industry is mainly built on these four types. The first two are similar. If the business is simple and the application performs few POST operations, a layer-4 switch solution (such as F5) is an easy way to achieve high availability and load balancing; for companies with tight resources there are open-source solutions such as LVS+HA, which are very flexible. The last two test a company's technical strength and depend on the characteristics of the application. The third type, DFS, is used mainly for applications with massive data, such as mail and search. Search is especially demanding: beyond simple mass storage, it also involves data mining and user behavior analysis. Google and Yahoo, for example, can retain and analyze nearly a year of user records, while Baidu probably keeps less than 30 days, and Sogou even less. This matters a great deal for search accuracy and for the user experience.
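As a rough illustration of what the first two cluster types do for an application, the sketch below combines the two ideas in a few lines of Python: requests rotate across backends (load balancing), and a backend marked as down is skipped (high availability). The class RoundRobinBalancer and the backend names are hypothetical, and this is not how LVS or F5 are implemented; it only shows the behavior the application sees.

```python
class RoundRobinBalancer:
    """Rotate across backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting at the cursor.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-a", "web-b", "web-c"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("web-b")  # e.g. a health check failed
after_failure = [lb.pick() for _ in range(4)]
print(first_round, after_failure)
```

In a real deployment the mark_down and mark_up calls would be driven by health checks rather than made by hand.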

Next, let's look at how to manage a cluster scientifically. The key points are as follows:
I. Monitoring
This mainly covers fault monitoring and status monitoring of performance, traffic, load, and so on. Monitoring determines whether the cluster runs healthily and whether potential problems are found and dealt with in time.
a, Service fault and status monitoring: this mainly monitors the server itself, the application on top of it, and the data of related services. For a front-end web server, for example, many kinds of monitoring are possible. Application port checks reveal whether the server or the application itself has crashed, and ICMP probes check server health. Higher up, there may also be monitoring of each of the application's business channels. A common method is to check pages by signature, or to sign key pages, to guard against the site being hacked (raise an alarm and automatically restore the tampered data). These are only a few examples; there are many more ways to monitor, depending on the characteristics of the application. Some problems remain to be solved, such as how to monitor efficiently when the cluster is very large.
b, The other kind is cluster status monitoring and statistics, which provide reference data for managing and tuning the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, attacks, and similar issues.
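Two of the checks described in item a can be sketched with the Python standard library alone: a TCP port check, and a page signature comparison for tamper detection. The function names are hypothetical, SHA-256 is an assumption (the article does not say which signature method is used), and alerting and automatic restoration are left out.

```python
import hashlib
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_signature(content: bytes) -> str:
    """Fingerprint of a page body; SHA-256 is an assumption, not the source's choice."""
    return hashlib.sha256(content).hexdigest()


def is_tampered(content: bytes, expected_signature: str) -> bool:
    """True if the page no longer matches the signature recorded at release time."""
    return page_signature(content) != expected_signature


# Record the signature when the page is published...
baseline = page_signature(b"<html>home page</html>")
# ...and compare on every monitoring pass.
print(is_tampered(b"<html>home page</html>", baseline))
print(is_tampered(b"<html>defaced</html>", baseline))
```

A port check only proves that something is listening, which is why the article pairs it with higher-level checks on the content itself.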

II. Fault Management
a, Hardware failures. Across many clusters with hundreds, thousands, or tens of thousands of machines, the probability of server crashes and hardware failures is very high. At almost every moment some piece of hardware has a problem: crashes, damaged disks, power supplies, memory, switches. When designing the site architecture we need to take these problems fully into account and treat them as the normal state of affairs, relying on redundancy in the application to absorb the risk so that system engineers have ample time to deal with the failures. (Doesn't Google claim that 800 machines can die at once without any effect on the service?) This is where the operations engineer and the site architect come in. A good design can achieve the self-recovery that Google describes, as in GFS. With a bad design, the crash of one server can set off a chain of failures across a large part of the service and leave users with no response at all.
b, Application failures. The cause may be a triggered bug, an exceeded performance threshold, an attack, or something else. The important point is to have preventive measures in place and not to take for granted that nothing will go wrong. If something does go wrong, how will you respond? This requires operations engineers to do their homework in advance, covering the speed of emergency response, the soundness of fault handling, and the effectiveness of fallback plans.
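The claim that hardware failure is the normal state at scale, and that redundancy absorbs it, can be checked with simple arithmetic. The sketch below assumes independent failures and uses a purely hypothetical 3% annual failure rate per machine; neither the rate nor the function names come from the article.

```python
def expected_failures_per_day(machines, annual_failure_rate):
    """Average number of machine failures per day across the fleet."""
    return machines * annual_failure_rate / 365


def prob_any_failure_today(machines, annual_failure_rate):
    """Probability that at least one machine fails today (independent failures)."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** machines


def replicated_availability(single_availability, replicas):
    """Availability when the service is up if any one of the replicas is up."""
    return 1 - (1 - single_availability) ** replicas


# Hypothetical inputs for illustration only.
fleet = 10_000
afr = 0.03
print(f"expected failures per day: {expected_failures_per_day(fleet, afr):.2f}")
print(f"chance of a failure today: {prob_any_failure_today(fleet, afr):.0%}")
print(f"3 replicas at 99% each:    {replicated_availability(0.99, 3):.6f}")
```

Under these assumptions a fleet of this size sees a failure on most days, while three independent replicas that are each only 99% available still give about six nines together, which is the argument for designing redundancy into the application.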

III. Automation
Automation, in short, means having tools and systems do the work we would otherwise do by hand every day, freeing our hands from dull, repetitive labor. For example, before there were tools, operating systems had to be installed on bare metal one machine at a time; 2,000 machines might take 10 people 10 days and wear out any number of discs, at a high labor cost. With automation tools, a few simple commands are enough. There are also robot-like programs that take over work that used to need manual intervention every day: they complete it automatically, report the results, and have some expert-system capability, making simple yes/no judgments and choosing among options. The benefits are obvious. Automated operations is a goal that professional operations pursues, although it is an exceptionally difficult task: changing business needs, non-standard application design and development practices, network architecture changes, IDC changes, and changes to standards can all affect existing automation systems. Automation systems therefore need to be modular, expose interfaces, and turn the things that vary into parameters. Automation-related work is the core of the operations engineer's job and is where the role's value shows.
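The advice to turn the things that vary into parameters can be illustrated with a template: the install configuration is written once, and only the per-host values change. In this sketch the template fields (hostname, ip, role, idc) and the function render_install_configs are hypothetical; a real unattended installation would feed similar values into its own configuration format.

```python
from string import Template

# One template for the whole fleet; only the parameters vary per host.
INSTALL_TEMPLATE = Template(
    "hostname=$hostname\n"
    "ip=$ip\n"
    "role=$role\n"
    "idc=$idc\n"
)


def render_install_configs(hosts, defaults):
    """Return {hostname: rendered config}; per-host values override defaults."""
    configs = {}
    for host in hosts:
        params = {**defaults, **host}
        # substitute() raises KeyError if a required parameter is missing,
        # so an incomplete host record fails loudly instead of silently.
        configs[params["hostname"]] = INSTALL_TEMPLATE.substitute(params)
    return configs


defaults = {"role": "web", "idc": "idc-a"}
hosts = [
    {"hostname": "web001", "ip": "10.0.0.1"},
    {"hostname": "db001", "ip": "10.0.1.1", "role": "db"},
]
configs = render_install_configs(hosts, defaults)
print(configs["db001"])
```

When the IDC or a standard changes, only the defaults or the template change; the per-host data and the rendering code stay as they are.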
