Many people are still confused about website O & M. It is, after all, a fairly new position. I have had some free time recently, so, drawing on past experience, I would like to discuss with you what "portal website O & M" is. What follows is some of my own experience and impressions; I hope we can discuss them and make progress together.
1. What is portal website O & M?
First of all, "O & M" throughout this text refers to application O & M for portal websites, which is quite different from other kinds of O & M such as network and system O & M. Next, we need to distinguish large websites from small ones. The distinction here is drawn mainly from the perspective of O & M complexity: website scale, popularity, number of servers, and PV volume; other factors matter less. For this discussion, a large website has more than 1,000 servers and at least 10 million PV per day (roughly the top 20 in China), such as sina, alibaba, sohu, baidu, and Netease. Smaller websites may not have O & M engineers in the true sense. This comes down to insufficient scale and to cost; what they have is more of a "compound talent" who covers network, system, and development work. For example, some colleagues on this board have counted the company's procurement contracts among O & M responsibilities, and others have included IDC network planning. That is the work of network engineers, and we should not compete for it. It is, however, very important to understand that a website application O & M engineer must be very familiar with the associated types of work: network O & M, system O & M, application development, and so on. These are not the O & M engineer's own job, though. The O & M engineer I am talking about here is a full-time application O & M engineer.
Let's look at how a typical product is "born":
1. First, the company's leadership sets the overall direction. Product managers (PMs) research and analyze market demand (or copy a mature application) and produce the final detailed design.
2. Development engineers implement the design in code, and test engineers test the application (both within the same product department).
3. Network/system engineers plan the network and adjust equipment according to the product design requirements, such as estimated PV, server scale, and application architecture (in most cases the network changes little). System engineers (SAs) prepare the product's servers: server system installation, network and IP configuration, and installation of the general tool set.
4. Now it is the O & M engineers' turn. To be clear, this does not mean that the first three steps have nothing to do with O & M. On the contrary, they are closely related to it: the up-front architecture design of the application, evaluation and procurement requests for software and hardware resources, assessment of performance risks in the application design, IDC matters, service performance and security tuning, and server system-level optimization (as it relates to the specific application) all require O & M engineers to participate throughout and to lead the whole application launch project. O & M engineers are responsible for the soundness, scalability, and security risks of the online application's architecture, and finally for assembling and tuning the product (programs), network, and systems so that the product can be launched for users. The cycle then repeats: requirement → development (upgrade) → test → launch (at which point the performance, security, and other problems that were only estimated earlier arrive in force). The website development model is completely different from traditional software development: one to five upgrades are common, and user experience is king. If, like MS, a site took a year to fix a live problem, its users would be long gone.
After the application is launched, the O & M work is only just beginning. Specific work may include: launching upgraded versions, service monitoring, application status statistics, daily service status inspection, emergency troubleshooting, routine service changes and adjustments, cluster management, service performance evaluation and optimization, database management and optimization (more than 50 servers), scaling, security, and adjusting the application architecture as the application's PV rises or falls. Along the way, O & M engineers aim to:
A. Use tools wherever possible (for service monitoring, application status statistics, service launches, and so on) to take over routine mechanical and manual work and improve efficiency;
B. Solve real-world service problems such as high reliability and scalability;
C. Develop large-scale cluster management tools. For example, how can 10,000 machines have their passwords changed, or run a specified task, within one minute? How can an operating system be installed quickly on 2,000 servers? Across distributed IDCs and storage clusters, how can data at the BT level be stored, shared, and analyzed quickly?
Meeting this series of challenges takes the efforts of O & M engineers.
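To make item C more concrete, here is a minimal sketch of the fan-out pattern behind "run a specified task on many machines quickly": dispatch the task to all hosts in parallel and collect successes and failures separately. Everything named here is an assumption for illustration: `run_everywhere`, the host names, and `fake_task`, which merely stands in for a real remote command (for example over SSH). It is not any particular company's tool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_everywhere(hosts, task, max_workers=200):
    """Run task(host) on every host in parallel.

    Returns (results, errors): two dicts keyed by host name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not stop the rest
                errors[host] = exc
    return results, errors


def fake_task(host):
    """Hypothetical stand-in for a remote command; fails on some hosts."""
    if host.endswith("13"):
        raise RuntimeError("host unreachable")
    return "ok"


if __name__ == "__main__":
    hosts = [f"web{i:04d}" for i in range(1000)]
    ok, failed = run_everywhere(hosts, fake_task)
    print(len(ok), "succeeded;", len(failed), "failed:", sorted(failed))
```

The point of the design is that failures are expected and reported per host rather than aborting the whole run, so the failed hosts can be retried or handed to a system engineer.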
Next, how the other roles fit in. Across the whole project, front-end applications are a black box to network/system engineers. Development engineers, meanwhile, are responsible only for the application's functional development and for application-level performance and security; they are neither responsible for nor concerned with network/system architecture. Software/hardware purchasers and colleagues in other business departments do not concern themselves with these issues either; everyone performs their own duties. The core of the project is the O & M engineer, who is the bridge between all the other departments.
I have said a lot above, and you should now have some idea of what O & M is. By analogy, if the website is a vehicle running at high speed on a highway, the O & M engineer is both its driver and its maintenance engineer. Being this driver is not simple: sometimes the tires must be changed while driving at speed, and the course adjusted according to road conditions. As the car goes faster and faster and the vehicle itself can no longer meet the demand, the engineer must tune performance or upgrade parts, fix faults and performance problems while still traveling at speed, always watch for safety hazards ahead, and anticipate them in order to take preventive measures. This is O & M work.
Finally, let's talk about the responsibility of O & M engineers: "ensuring online stability". It sounds simple, but it is not easy. O & M engineers must weigh many adverse factors: the impact of new product models on existing architectures and technologies, the risk of online bugs from frequent product upgrades, human error caused by a low level of automation in O & M management, the gap between the high efficiency the IT industry pursues and the infrastructure available to support it, performance and architecture pressure from user growth, the IT industry's loose technical management culture, the risks of innovation, Internet security issues, and so on. All of these are enemies of website stability. O & M engineers are the last line of defense and need a strong sense of responsibility, principle, and coordination. An engineer who can strike the best balance among these factors is a good O & M engineer.
In addition, I have heard many people ask engineers at sina, Netease, sohu, baidu, and similar companies to talk about their O & M experience. In fact, that is a difficult thing to ask of them:
A. A company's own network architecture and scale are, to a greater or lesser degree, core company secrets and must be kept confidential. In addition, for the general-purpose software and architectures everyone is familiar with (such as apache, php, mysql ...), many companies do secondary development to suit their actual business needs and to address the performance, security, known bugs, and functionality of the original versions. Operating system kernels are also customized by business type: some applications are compute-bound, some are I/O-heavy, and some need large storage or memory, and kernels are optimized and customized to match. For example, sina did secondary development on memcache and produced memcachedb. Leaving aside how well it was done, it is commendable that it was open-sourced; Chinese companies for the most part only take from open source and contribute nothing back. Servers, too, are often not the well-known stock models: most are customized by DELL, HP, sun, or ibm to suit the business. Likewise, each company has its own solution for distributed storage, either an off-the-shelf open-source one such as hadoop or one developed in-house. However, 90% of them draw on the ideas of google GFS: distributed storage, computing, and big tables.
B. Each company's business direction is different, which can lead to different O & M modes and methods. For example, O & M at alibaba and at baidu must be very different, because the business model determines the architecture, number of servers, IDC distribution, network structure, and general-purpose technology. Sina, which focuses on news portals, differs greatly from Shanda (Grand), which focuses on online games; even the responsibilities are not the same. Still, the general technologies and general architectures are broadly similar, so there is no need to get too hung up on specifics. Many companies are simply playing with building blocks, with little technical depth.
C. As mentioned above, portal website O & M is still in its infancy; its concepts and experience are scattered, and there is no mature body of knowledge. I suspect no one can explain it fully (I am squeezing these words out of my own head as well). What is O & M? Perhaps you have to think about it first, or perhaps you have never thought about it at all. What actually gets discussed is only the tip of the O & M iceberg, limited to specific technical details or to the broad framework of some famous website. This may be why so little O & M material is available online.
2. What skills and qualities do O & M engineers need?
Let's start with skills. As you can see from the above, O & M combines many IT skills: you need to know something about systems, networks, storage, protocols, requirements, development, testing, and security. In some areas, however, you need to be familiar or even proficient: systems (the basic operating systems: *nix, windows ...), protocols, development (the most important daily work is development for automated O & M, including building and maintaining large-scale cluster tools), general applications (such as lvs, ha, web server, db, middleware, and storage), and networks (at a minimum, a good understanding of the network environment your application runs in).
The skills are summarized as follows:
1. Development capability. This is very important, because O & M tools have to be developed in-house. Languages: c/c++ (one of them is essential), perl, python, php (one of them), and shell (awk, sed, expect ...). You must have real development experience; otherwise the work will be very painful.
2. General applications: operating systems (mainly linux and bsd in China), web servers and related software (lighttpd, nginx, apache, php, tomcat, java, etc.), databases (mysql, oracle), and other miscellaneous software; system optimization and high reliability. These are bonus points rather than requirements: you can pick them up gradually on the job, and they are not hard to learn. Of course, some O & M teams divide the work further; there may be a dedicated O & M DBA.
3. Systems, networks, and security: at least a working understanding.
Personal qualities:
1. Communication and teamwork: O & M involves a lot of work across departments and job types, so it requires good communication skills and strong teamwork. This is a basic requirement in any modern company, so I will not say more.
2. Be bold but careful: bold enough to innovate and take the unusual path, which matters especially for a new type of work such as O & M, where innovation drives progress; and careful because O & M engineers are the website's administrators and hold the highest online privileges, so a single slip can bring lifelong regret or a trip to the 18th level of hell.
3. Initiative, execution, energy, and resistance to stress: the IT industry changes fast, and in O & M the saying that plans cannot keep up with change is especially apt. For example, the servers of major Chinese companies are often spread across the country. When a cheaper, more cost-effective IDC turns up, services have to be migrated on a large scale, which is a headache: it involves hundreds or thousands of servers and usually has to be finished in a short time, such as one week. This places high demands on an O & M engineer's initiative and execution: planning, solution design, seamless service migration, machine relocation and installation, environment preparation, security evaluation, performance evaluation, infrastructure construction, and coordination with related departments, on top of 24x7 emergency response.
4. Other basic qualities: a clear mind, strong logical thinking, modesty and steadiness, approachability and a willingness to help, and a sense of the big picture.
5. Finally, website O & M requires a spirit of exploration and innovation, solving practical problems through fresh thinking. This is a young profession (the same is true abroad, though it started earlier there than in China), and there is no mature system or methodology to draw on. You can only rely on your own efforts.
3. Duties of O & M engineers
1. Ensure that services meet the required availability standard, such as 99.9%, and ensure online stability. Just as network/system O & M engineers are responsible for the stability of networks and systems, application O & M engineers are responsible for the stability of online applications.
2. Continuously improve application reliability and robustness, optimize performance, and strengthen security. This tests initiative and innovative thinking.
3. Monitoring and statistics covering the real-time status of every level of the website: software, hardware, and running status must all be monitored, with no blind spots, so that the running state of applications is known in real time.
4. Solve O & M efficiency problems through innovative thinking. At present most companies' O & M work still depends on manual intervention; hands need to be freed as much as possible.
5. Accumulate O & M knowledge and keep documentation complete. O & M is an experience-driven position; good practices and pitfalls need to be recorded to avoid repeating mistakes.
6. Cost control: use technical means, such as architecture optimization and virtualization, to increase the load each piece of hardware can carry and so save hardware costs.
7. Automated O & M: distill routine mechanical work and design and develop it into tools and systems, so that as much as possible is completed automatically by the system, leaving everyone more time to think and to do what they enjoy.
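To put the availability standard in duty 1 into perspective, the arithmetic is simple: the allowed downtime is the length of the period multiplied by (100% minus the target). The short sketch below works this out; the function name and the 30-day period are assumptions chosen for illustration. At 99.9%, a 30-day month leaves roughly 43 minutes of total downtime.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```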
4. Confusion, current situation, and development prospects of the O & M profession
Other positions, such as network, system, and security O & M, R & D engineers, and test engineers, have clear responsibilities, career paths, social recognition, and a sense of professional accomplishment. Application O & M is different: it can leave people feeling that they know a little about many systems and applications but are not as proficient as the full-time specialists, and that they get little attention in ordinary times (unless there is an online fault). Gradually people become confused about their career development. Why does this happen? Apart from the nature of the profession itself, the main reason is a lack of deep understanding of O & M, and a lack of understanding and recognition of a new position. The same problem occurs in other positions too, but I find O & M is the more typical case and more prone to it.
To address this, I will talk about the current situation and development prospects of website O & M (I am still thinking this through, so it may not be thorough or complete; corrections and additions are welcome).
O & M status:
1. The profession is in its initial stage. Major companies have full-time positions, but these are not given much importance, are considered easily replaceable, and have responsibilities that differ from company to company. In small companies the work is mostly done by people who also hold other positions; there is no full-time role, and the work cannot go deep.
2. The technical level is relatively low. The field is mainly at the stage of technical exploration and accumulation, with no systematic concepts or techniques.
3. Too much manual labor. This is closely related to the second point: much is still done by hand, and there are no mature automated management methods for large-scale clusters. Large-scale clusters and O & M go hand in hand; with only a hundred machines there is little room for O & M as a discipline.
4. Excellent O & M talent is extremely scarce. At present all the major companies rely on training their own, which leads to extremely low mobility of O & M talent within the industry. Many good techniques stay locked inside major companies: for example, how does google manage its 500,000 machines scientifically? Or the experience of some of the top 10 sites in China. Such experience is very valuable and determines a company's core competitiveness. These problems hinder the circulation, exchange, and borrowing of advanced O & M techniques within the industry and will ultimately limit the development of O & M.
5. Much excellent O & M experience is in the hands of large companies. This is not a matter of the company's technical strength but of scale: massive PV and sufficiently large hardware fleets, such as baidu's enormous traffic and data volumes. These factors mean that large companies have already met problems that small and medium-sized companies have not yet met, or are only about to meet, and large companies may already have good solutions or systems for them.
Development Prospects:
1. From an industry perspective, with the rapid development of China's Internet (China currently has the largest number of Internet users in the world), websites are growing in scale and architectures are becoming more complex, so the demand for full-time website O & M engineers and website architects will become increasingly urgent. Demand for experienced, excellent O & M talent in particular will rise, and veterans will become more valuable. At present, large companies in China basically recruit fresh graduates. Training them is costly, and the absence of experienced people slows technical renewal and affects the company's technological development. Graduates do have advantages, of course: they are a blank sheet of paper, highly malleable, and relatively quick to identify with and integrate into the corporate culture.
2. From a personal perspective, the technical content of and requirements for O & M engineers keep rising. They are also the people most familiar with the company's applications and architecture, and they are receiving more and more attention.
3. Website O & M will become a comprehensive technical position integrating multiple disciplines (networks, systems, development, security, application architecture, and storage), which gives you plenty of room to develop your personal abilities and technical breadth.
4. Experience in O & M work will become very important and will be an individual's core competitiveness: good problem-solving ability, the ability to propose solutions, and the ability to think globally across all levels.
5. Develop strengths and interests. Because O & M touches such a wide range of knowledge, it is easier to cultivate or exercise a personal specialty or interest, such as the kernel, networking, development, or databases; you can become highly proficient and an expert in that field.
6. If you do not want to do O & M in the future, it is relatively easy to move to other positions, without too many limitations. Of course, you have to put your heart into the work first.
7. Technical career direction: website/system architect.
5. Anatomy of key O & M technical points (with more practical cases; these are the items I have come up with today, and if you are interested in other cases, ask and I can add them)
1. Large-scale cluster management
First, we need to clarify the concept of a cluster. A cluster is not just any group of servers with assorted functions; it is an integration of server and disk resources (more than two machines) for a particular purpose or function, and to the application it appears as a single whole. Common clusters today can be divided into high-availability clusters (HA), load-balancing clusters (such as lvs), distributed storage and computing clusters (DFS, such as google gfs and yahoo hadoop), and application-specific clusters (a combination of servers with a specific function, such as the db and cache layers). The Internet industry currently relies mainly on these four types. The first two are similar: if the business is simple and the application makes few POST operations, layer-4 switches (such as f5 and foundry) are enough to provide high availability and load balancing, and companies with tight resources can use open-source solutions such as lvs + ha, which are very flexible. The latter two test a company's technical strength and depend on the characteristics of its applications. The third type, DFS, is mainly used in applications with massive data, such as email and search. Search is the more demanding: besides plain mass storage it also involves data mining and user behavior analysis. For example, google and yahoo can store and analyze nearly one year of user record data, baidu probably less than 30 days, and sogou less still. This is essential to search accuracy and user experience.
Next, how to manage clusters scientifically. The key points are as follows:
I. Monitoring
This mainly covers fault monitoring and monitoring of performance, traffic, load, and other status. These bear on the healthy operation of the cluster and on detecting and dealing with potential problems in time.
A. Service fault and status monitoring: this mainly monitors the server itself, the upper-layer applications, and the data exchanged with associated services. For a front-end web server, for example, there can be many kinds of monitoring: application port checks, which detect a crash of the server or of the application itself in time, and icmp probes, which check the health of the server. At the upper layer there may also be monitoring of each channel's service; common methods are to check a page for a known signature string, or to sign key pages to guard against the site being tampered with (raising an alarm and automatically restoring the tampered data). These are only a part: there are many more monitoring methods, depending on the characteristics of the application, and some problems remain open. For instance, when the cluster is very large, how to monitor it with high performance is itself a real problem.
B. The rest is monitoring or statistics on cluster status, which provide reference data for managing and optimizing the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, attacks, and so on.
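Two of the checks described under A can be sketched in a few lines: an application port check and a page-signature check for tamper detection. This is a minimal illustration under stated assumptions, not a production monitor: the function names are hypothetical, SHA-256 is just one reasonable choice of digest, and the reference signature is assumed to be recorded at publish time and stored somewhere the web server cannot overwrite.

```python
import hashlib
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_signature(content: bytes) -> str:
    """Digest of a page body, recorded when the page is published."""
    return hashlib.sha256(content).hexdigest()


def is_tampered(content: bytes, known_signature: str) -> bool:
    """Compare the page as currently served against its recorded signature."""
    return page_signature(content) != known_signature


if __name__ == "__main__":
    published = b"<html><body>front page</body></html>"
    reference = page_signature(published)
    served = b"<html><body>front page, defaced</body></html>"
    print("tampered:", is_tampered(served, reference))
```

In a real deployment the alarm and automatic restoration mentioned above would hang off `is_tampered`, and the port check would run from several vantage points so that a single network fault is not mistaken for a server crash.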
II. Fault Management
A. Hardware faults. In clusters of hundreds, thousands, or tens of thousands of machines, the probability of server crashes and hardware faults is very high; at almost any moment something is failing: crashes, damaged hard disks, power supplies, memory, switches. We therefore need to take these problems fully into account when designing the website architecture and treat them as the norm. Mostly we avoid the risk through the application's own redundancy mechanisms, which also gives system engineers enough time to deal with the fault. (Doesn't google claim that its service is unaffected even when 800 machines go down at the same time?) This is where the skill of O & M engineers and website architects is tested. A good design can achieve the automatic recovery that google describes for systems such as gfs. A bad design means that the crash of one server can set off a chain reaction that takes down a wide area of service and leaves the site refusing user requests outright.
B. Application faults. A bug may be triggered, a performance threshold exceeded, or the application attacked. The situations vary, but the important thing is to have contingency plans for them in advance; they cannot be left to chance. When a problem occurs, how do we deal with it? This depends on the O & M engineer's preparation: speed of emergency response, systematic troubleshooting, and an effective fallback plan.
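Why hardware faults must be treated as the norm at scale can be shown with a little arithmetic. If each machine fails independently with probability p on a given day, the chance that at least one of N machines fails is 1 - (1 - p)^N, and the expected number of failures is N * p. The sketch below computes both; the independence assumption and the per-machine rate of 0.1% per day are illustrative assumptions, not measured figures.

```python
def prob_at_least_one_failure(n_machines, p_single):
    """P(at least one of n independent machines fails), each with prob p_single."""
    return 1.0 - (1.0 - p_single) ** n_machines


def expected_failures(n_machines, p_single):
    """Expected number of failed machines under the same assumptions."""
    return n_machines * p_single


if __name__ == "__main__":
    p = 0.001  # assumed daily failure probability per machine
    for n in (100, 1000, 10000):
        print(n, "machines:",
              f"P(any failure) = {prob_at_least_one_failure(n, p):.4f},",
              f"expected failures/day = {expected_failures(n, p):.1f}")
```

Under these assumed numbers a 100-machine site sees a failure on about one day in ten, while a 10,000-machine site should expect several every day, which is why the redundancy has to live in the application design rather than in the hope that hardware stays up.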
III. Automation
Automation, in short, means using tools and systems to do some of our routine manual work, freeing our hands from boring, repetitive tasks. For example, before tools existed, installing the operating system on bare machines, say 2,000 of them, might take 10 people 10 days and a pile of CDs, at a high labor cost. With automated tools, a few simple commands are enough. Programs can stand in for people and carry out routine work that used to need manual intervention, completing it automatically and reporting the results. They can also have some expert-system ability and make simple choices, such as yes/no judgments and optimizations. The benefits are too obvious to need listing. Automated O & M is what professional O & M engineers strive for, and they benefit from it themselves. It is nonetheless an extremely arduous task: constantly changing businesses, nonstandard application design, development models, network architecture changes, IDC changes, changes to specifications, and other factors can all affect an existing automation system, so it must be modular, interface-based, and parameterized around the things that vary. Automation is therefore one of the core tasks of O & M engineers and a demonstration of their value.
2. Design of highly concurrent websites
An important element of website architecture design is ensuring that the architecture is scalable; this is the cornerstone of a high-concurrency website. A website's high traffic usually does not arrive overnight but accumulates gradually until the site becomes a giant; this includes global traffic giants such as google and yahoo. The experience accumulated during that growth is what is most worth learning: the ways of thinking, how problems were solved, and how things were improved. There is no best architecture design, only a better one. So I will not offer an ultimate solution here. The aim is rather to help everyone grasp the methods, ideas, and spirit of architecture design and put them to real use. To make this easier to follow, I will work through some classic cases, such as the google architecture and the youtube architecture, and discuss some general principles and techniques.
Factors and key points for a high-concurrency architecture:
I. Load-balancing architecture
First, the website's front end needs a load-balancing cluster to handle highly concurrent user requests. Common methods today include:
A. squid reverse proxy, a common approach at many websites, including sohu, sina ...;
B. DNS round robin;
C. Layer-4 hardware devices, used by sites including google and baidu.
As for lvs, it can be used for small channels or unimportant applications; it is not yet mature enough for websites with high traffic and real-time requirements.
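All three methods above share one core idea: spread requests across a pool of equivalent back ends and stop sending traffic to any that are down. The sketch below shows that idea as a round-robin picker with health marking. It is only an illustration of the scheduling logic; the class and method names are hypothetical, and real load balancers (DNS, squid, or layer-4 devices) add health probing, weighting, and session handling on top.

```python
import itertools


class RoundRobinBalancer:
    """Hand out back ends in rotation, skipping those marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Look at each back end at most once per call.
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    print([lb.pick() for _ in range(4)])
    lb.mark_down("web2")
    print([lb.pick() for _ in range(4)])
```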
II. Selection and optimization of high-performance Middleware
Middleware selection and optimization are very important. Once service traffic passes a certain level, even a slight performance improvement pays off greatly in overall hardware cost and overall service performance. Apache is the usual web server, but apache's multi-process (thread pool) architecture has shortcomings: processes are created and destroyed frequently, causing high system overhead, which is most noticeable under heavy traffic. If the application logic is simple, consider lighttpd, which uses a single process + epoll concurrency model and is highly efficient. It has a problem with multi-CPU support, but this can be solved by running multiple service instances. If apache must be used for architectural reasons, consider writing the application as an apache module, whose performance is a multiple of ordinary CGI's. Other principles include testing the various middleware versions for performance and security and finding a balance point; do not focus so much on one factor that you introduce hidden risks into the overall architecture. Tuning middleware parameters is also very important. Recommended parameters can be found through google or baidu, but much of the tuning must be based on the actual server resources. For example, how high should the maximum number of httpd processes be set? Some friends set it straight to 2048, reasoning that a threshold set too low could cause denial of service and that a high one would prevent it. But this is a risk in its own right, because each process consumes hardware resources. When the number of processes reaches a certain level, server memory can be exhausted and the server can crash, especially with memory-hungry applications. There are many such cases; keep them in mind.
III. Scalability problems
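The httpd process limit discussed above can be estimated from the server's memory rather than guessed. A rough rule is: (total RAM minus what the OS and other services need) divided by the memory of one worker process. The sketch below applies that rule; all figures in it (8 GB of RAM, 1 GB reserved, 20 MB per process) are assumptions for illustration, and the per-process figure should be measured on the real application.

```python
def max_worker_processes(total_ram_mb, reserved_mb, per_process_mb):
    """Largest process count that fits in memory left after the reservation."""
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0 or per_process_mb <= 0:
        raise ValueError("need positive usable memory and per-process size")
    return usable_mb // per_process_mb


def memory_needed_mb(process_count, per_process_mb):
    """Memory consumed if process_count workers are all running."""
    return process_count * per_process_mb


if __name__ == "__main__":
    limit = max_worker_processes(total_ram_mb=8192, reserved_mb=1024,
                                 per_process_mb=20)
    print("safe process limit:", limit)
    print("memory needed for 2048 processes (MB):", memory_needed_mb(2048, 20))
```

With these assumed numbers, a limit of 2048 would need far more memory than the machine has, which is exactly the crash scenario described above.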
Scalability is very important for websites in a period of rapid growth. Accounts of how XX website grew are easy to find online; each reads like a history of hardship, and the process is tortuous and painful. Mature experience is therefore very valuable. Scalability can be viewed from two sides: the scalability of the network system and the scalability of the application itself. On the network side, the network should be layered and as flat as possible, with redundancy throughout and no single point of failure. Try to partition the network by service type (PV size and priority) to prevent mutual interference. An important point is to keep the network design simple: as long as scalability is not affected, do not make it too complex. Network hardware, rack space, and IDC capacity must be planned at least half a year ahead. An important basis for these plans is the company's business outlook, which reflects its strategic vision, including whether local data centers are needed (depending on the user base). It is also necessary to choose a good IDC; otherwise you will wear yourself out migrating between IDCs. There are quite a few good IDCs in Beijing: Zaojunmiao (a little old), Tucheng, Unicom, Jiuxianqiao, and Ericsson, and the official Olympic data center, Digital Beijing, is said to be ready soon. Of course, if you have the money you can build your own IDC like google. But who in China has that kind of strength?
Another point is the scalability of the application. The principle is actually very simple, when designing an application, we should try to ensure that the application is layered, high-performance middleware, complex logic, and large data interaction functions should be used as independent modules, backend, cache layer, and database layer (read/write operation separation ). ), do not simply drag all functions into front-end CGI in the early stage. This is fatal and may cause performance bottlenecks at any time without scalability.
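The read/write separation mentioned for the database layer can be sketched as a small router: writes always go to the primary, and reads are spread over replicas. This is a minimal illustration with hypothetical names (`ReadWriteRouter`, the server labels); it decides by the first SQL keyword only and ignores transactions and replication lag, both of which a real deployment has to handle.

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to replicas in rotation, everything else to primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        replicas = list(replicas)
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    router = ReadWriteRouter("db-primary", ["db-replica1", "db-replica2"])
    for statement in ("SELECT * FROM news", "UPDATE news SET hits = hits + 1",
                      "select title from news"):
        print(router.route(statement), "<-", statement)
```

Because the front end talks only to the router, replicas can be added as read traffic grows without touching the application code, which is the scalability property the paragraph above is after.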
Once these two points are handled well, the only remaining task is to purchase servers in advance, every six months, based on PV growth and new business development. Of course, the more fundamental solution is to optimize the existing architecture and improve its performance, especially during a global economic downturn. Optimizing the existing architecture is the O & M engineer's responsibility.
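The six-month purchase estimate is simple arithmetic, sketched below in Python. Only the starting figure of 10 million PV per day comes from this article. The requests per PV, the peak-to-average ratio, the per-server throughput, the headroom, and the monthly growth rate are all made-up assumptions, as are the function names; substitute your own measurements.

```python
import math


def servers_needed(daily_pv, requests_per_pv, peak_to_average, qps_per_server, headroom):
    """Estimate the server count for a given daily PV (all inputs are assumptions)."""
    average_qps = daily_pv * requests_per_pv / 86_400   # 86,400 seconds per day
    peak_qps = average_qps * peak_to_average
    # headroom is the fraction of each server's capacity kept free for failures and bursts
    usable_qps = qps_per_server * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


def projected_pv(daily_pv, monthly_growth, months):
    """Compound the current PV forward by a monthly growth rate."""
    return daily_pv * (1 + monthly_growth) ** months


# Assumed: 5 requests per PV, peak is 3x average, 500 QPS per server, 30% headroom.
current = servers_needed(10_000_000, 5, 3, 500, 0.3)
# Assumed: 5% PV growth per month over the six-month planning window.
future = servers_needed(projected_pv(10_000_000, 0.05, 6), 5, 3, 500, 0.3)
print(current, future, future - current)
```

The difference between the two estimates is the number of servers to order now. Raising `qps_per_server` through optimization shrinks both numbers, which is why architecture work comes before purchasing.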
IV. Considerations in Application Design and Development
After the architecture layer is designed, the application-layer design is what we focus on. It is also the key to a project's success. A good design shows mainly in performance (capacity for high concurrency), scalability, maintainability, and security (data integrity, application stability, and front-end application security such as protection against SQL injection). The technical points include: thread pools, epoll, the choice between long and short TCP connections, refining functional modules and moving them to the back end, module redundancy and load balancing (for scalability), caching of high-frequency data, data layering, and solutions for application single points of failure (the data uniqueness problem). There are two points to note:
1. When designing an application, fully account for the unreliability of servers and hardware, especially in IDC-hosted environments. That is, assume that at any moment while the application is running, one, two, or more servers may fail (network faults, disasters, attacks, or a power outage across an entire IDC). Google's GFS is a typical example of this approach. We cannot pin application stability on hardware stability. In particular, most portal companies run commodity x86 servers, and once the fleet reaches a certain size, server crashes are routine. So the application architecture must include countermeasures for these failures and achieve an acceptable level of redundancy and load balancing (the two can be handled together), such as traffic steering across multiple IDCs through an intelligent CDN, or multi-node redundancy and load balancing for application modules within a single IDC. Even if an application cannot do this for special reasons, a separate contingency plan should be prepared. A good design handles these emergencies without human intervention, although that is of course very difficult. I remember Li Kaifu saying in his speech at Peking University the year before that 800 machines failing at the same time in a Google IDC would not affect the normal response of any application (I am a bit skeptical; perhaps he was referring to one particular class of servers).
2. In high-traffic applications and modules, avoid using a database wherever you can do without one. This issue is discussed in the next section.
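The multi-node redundancy and load balancing mentioned above can be unified in one small mechanism: rotate requests over the nodes and skip any node known to be down. The Python sketch below is only an illustration under stated assumptions. The `NodePool` class and the node names are hypothetical, and failure detection (health checks, timeouts) is left out; a node is simply marked as failed by the caller.

```python
class NodePool:
    """Round-robin over redundant nodes, skipping any marked as failed."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.failed = set()
        self._index = 0

    def mark_failed(self, node):
        self.failed.add(node)

    def mark_recovered(self, node):
        self.failed.discard(node)

    def pick(self):
        # Try each node at most once per call, continuing from where the last call stopped.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._index]
            self._index = (self._index + 1) % len(self.nodes)
            if node not in self.failed:
                return node
        raise RuntimeError("no healthy node available")


pool = NodePool(["node-a", "node-b", "node-c"])
pool.mark_failed("node-b")          # simulate one server crashing
picks = [pool.pick() for _ in range(4)]
print(picks)
```

With one of three nodes down, traffic keeps flowing to the other two without anyone stepping in. The `RuntimeError` branch is where the separate contingency plan would take over.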
V. Database Problems
VI. User region optimization
3. High reliability problem solving
4. Website Security Problems
Website security is a systematic undertaking. Many factors affect it, such as DDoS attacks (the most common), application vulnerabilities, system-level vulnerabilities, and weaknesses in internal security processes (human error). You can consider the following aspects:
I. Network Layer
First, security should be taken into account when the network is designed. At the backbone exit, block non-service ports (for example, everything other than port 80) and restrict unusual packet types such as ICMP and UDP. The performance of the core devices must be considered, however: security restrictions must not significantly degrade device performance. A balance is needed; otherwise you create a new hidden risk. In addition, backbone bandwidth must be sufficient, with redundancy and mutual backup (VRRP, HSRP), to withstand the bandwidth consumed by DDoS attacks (a large website can face DDoS at any time; only the scale varies). Layer-7 hardware offers some SYN proxy capability and can defend against floods of a certain scale, but in the end it is a contest of resources, bandwidth, and hardware performance. It is also necessary to analyze mirrored traffic from the backbone, so that regular attacks with identifiable signatures, or even identifiable sources, can be defended against in a targeted way. For the company's key business, physical isolation at the network layer can strengthen robustness, and business redundancy can even be distributed across different IDCs for cross-region disaster tolerance (against earthquakes, for example).
II. System Layer
The system layer mainly covers hardening the operating system, patching system security bugs, closing non-service ports, removing non-service software, following the latest security news for system tools, and updating promptly. In particular, servers that provide services directly to the Internet must undergo regular security reviews and assessments. A company's servers are generally interconnected over the intranet, so compromising one Internet-facing machine may expose the company's entire intranet.
III. Application Layer
I will not say much about application security. It mainly comes down to getting the development details right and leaving no logic flaws: strict control of upload interfaces, bounds checking, SQL safety, and so on. Applications with upload interfaces (such as mail, BBS, blogs, and cloud computing services) are especially prone to vulnerabilities. System software such as middleware also needs an appropriately secure configuration. There is plenty of material on this on the Internet, so I will not go into more detail. Do pay attention to reports on the Internet about security vulnerabilities in your own website (or search for them regularly). Vulnerabilities in applications are often discovered first by users, who are the best testers. Once a problem is found, fix it immediately and check similar services thoroughly for the same issue. You can also monitor specific key pages and have a program restore the main pages automatically (if a function is broken, it can display a "service upgrading" notice) to limit the damage to the company's image after an application is compromised.
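As one concrete case of the SQL safety mentioned above, the Python sketch below compares a query built by string concatenation with one that uses a placeholder. It uses the standard `sqlite3` module with an in-memory database. The table, the data, and the function names are invented for illustration, and other databases have their own placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_user_unsafe(name):
    # Vulnerable: user input is pasted straight into the SQL text.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # The ? placeholder passes the input as data, never as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


attack = "nobody' OR '1'='1"
print(len(find_user_unsafe(attack)))   # the injected OR clause matches every row
print(len(find_user_safe(attack)))     # no user has this literal name
```

The unsafe version returns every row for the crafted input, while the placeholder version returns none, because the whole input is compared as a single name.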
IV. Intranet Security Management
Daily intranet access requires strict procedures and a unified entry point, such as a VPN with RSA SecurID (the dynamic tokens that sina uses, for example). If that is not possible, you must at least change the entry password periodically.
V. Security Inspection
Sometimes vulnerabilities arise from human error. For example, a security setting is changed temporarily for work reasons and then never restored. This is actually the biggest problem: most security issues come down to human error. Key security points across the whole network need regular inspection, which is also a focus of Sarbanes-Oxley Section 404 audits. I think colleagues at sohu, sina, Netease, and other US-listed companies will know this all too well.
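A regular inspection can be as simple as comparing each server's current settings with an approved baseline, which catches exactly the temporary change that was never reverted. The Python sketch below assumes the settings have already been collected into dicts. The setting names and values are hypothetical, and how they are gathered from real servers is outside its scope.

```python
def find_drift(baseline, current):
    """Return settings whose current value differs from the approved baseline."""
    drift = {}
    for key in baseline.keys() | current.keys():
        expected = baseline.get(key)
        actual = current.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical baseline and a server where two settings were changed "temporarily".
baseline = {"ssh_password_login": "off", "open_ports": (80,), "debug_mode": "off"}
current = {"ssh_password_login": "off", "open_ports": (80, 8080), "debug_mode": "on"}

for key, (expected, actual) in sorted(find_drift(baseline, current).items()):
    print(f"{key}: expected {expected!r}, found {actual!r}")
```

Run on a schedule, a check like this turns a forgotten temporary change into a report item instead of a standing hole.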
5. Massive Data Storage and statistical analysis solutions and architecture