85 O & M rules
1) Capacity first, optimization second. Ignoring this rule will inevitably lead to downtime. Do not optimize under the pressure of an outage; focus on increasing load capacity first.
2) Taking IPVS as an example, make sure that each of your network tiers can match your WAL files, Slony replication, snapshot technology, and disk-based DB versions (snapshot derivatives).
3) Do not optimize problems into your architecture. Things newly added to solve a problem can become a heavy O & M burden. Make sure that tools developed by O & M engineering are handed over completely; follow-up development is often ineffective. More importantly, a change request may disrupt the project plan.
4) Keep it simple. Keep it simple precisely because you are smart; do not make things complicated just because you can.
(Translator's note: the KISS principle, "Keep It Simple, Stupid.")
5) Use caches with caution. Keeping cached resources consistent makes horizontal scaling difficult.
If you are building something that scales horizontally, it is prudent not to add a cache layer. If you do need one, it should be there to improve performance for the end user, not to gain capacity for the site.
6) Do not write all the code yourself, and do not outsource everything. Use the appropriate tool at the right time to get the job done.
(Translator's note: do not reinvent the wheel.)
7) Negotiation: the only way to negotiate effectively is to do some research first and develop feasible alternatives. In this way, you can select your chief developer. If you really need something, do not be embarrassed to ask for it.
8) Keep N + 1. If N = 1, do not put the +1 to use lightly under any circumstances; it is there only for when N is down. When redundant servers share the load, do not let the system load exceed 49%. Use an N + 2 architecture whenever you have the opportunity.
9) Data loss is not a risk that any company can bear; this is a universal truth. The loss caused by losing data is far greater than the cost of keeping it intact.
10) Parallelize operations whenever and wherever possible. This is the most important way to repeat the road. For example, if you use MogileFS with location awareness and need to copy data in real time, one feasible approach is for each MogileFS server to copy its data to the MogileFS server load-balancer center. Enable as many parallel operations as possible.
11) Read the manual. To this day, I still insist on reading the RAID card manual to check for minor differences. The devil is in the details. Do your homework!
12) Identify your bottlenecks and know how to locate them. Check whether disk, memory, or CPU is the constraint. This is usually very simple.
13) Perform capacity management regularly, and be proactive. Without capacity trend curves, it is hard to know where your system is weak.
14) Do not cause failures, but do not fear change.
15) Do not dig traps for yourself. Do not assume that your past achievements will be the motivation for future work.
16) Code written by O & M personnel should be O & M tooling, not application software.
17) Do not underestimate the value of project management, documentation, and financial analysis staff in the O & M team. They are worth more than they cost.
18) Monitor everything. Alert on exceptions. Record all other data for trend analysis.
19) Make it part of your process to review trend data across all areas regularly.
20) Do not let monitoring become a mess; otherwise it is meaningless.
21) Make sure the monitoring system is simple enough for everyone in the company to use. You may be surprised how often metrics are turned into business, marketing, and sales indicators.
22) Inspect only where improvements can be made; otherwise, do not waste the time.
23) Publish your inspection reports with the relevant data attached, so that others can easily read the key points and relate them to the corresponding data.
24) Assign manpower to each technical point.
25) Reserve critical personnel.
26) Recruit continuously. Even if you have no open position, keep hiring.
27) Be strict with yourself. No matter how smart you are or how great you think you are, you must constantly improve yourself.
28) Compare yourself with other companies as much as possible. Look outside.
29) Pick a trade show or conference and attend it once a year. If it is held several times a year, attend once.
30) Purchase what you need, not what you want. Never take off your company hat and ask "what is the simplest and safest thing for me?"
31) Always do what is best for the business, even if it means you have to leave.
32) Formal accountability mechanisms: record commitments, track them, and make outstanding commitments visible.
33) You should not fail more than twice. A little fear is good, but you need to know the difference between chronic mistakes and unintentional ones.
34) Be ruthless; your competitors are.
35) Sign your name when you finish a job. Signing also means the work is truly complete.
36) Become useful to others.
37) Work with startups: offer your expertise and scale, and you will get a free product in return, sometimes for life.
38) Capacity is a business/product problem. The network cost of every page, every request, and every login must therefore be transparent so that correct business/product decisions can be made.
39) Keep breaking the budget. The O & M team is usually the biggest spender. There is usually no budget for revenue, but the O & M team has many ways to postpone procurement.
40) What worked in the past may not work now or in the future. When in doubt, test it with a tool.
41) Document everything. Write everything down, and ask new people, one by one, to learn how to do things from the documents.
42) Draw the network topology of your data center as one large diagram.
43) Draw a business flow chart for each product.
44) Set up a Faq-O-Matic or wiki where people can easily publish "How to fix this" articles and find them easily. This is where technical writers come in handy, but the most important thing is to make writing documents easy, even informal.
45) Ensure that everyone, without exception, can be replaced.
46) Most people get more done at home than in the office, but some do not.
47) Bundle your orders: when you purchase hardware in bulk, you can ask for bigger discounts, better terms, and so on. You can ask for everything (lowest price, spare-parts kits, lease terms) as long as the supplier has not got the order yet.
48) Maintain long-term relationships with your suppliers; make sure you can still contact them at your next job.
49) Equip every member of the O & M team with top-grade gear that can be used remotely: handheld computers, wireless NICs, 24-inch LCD monitors, and so on. The reward for hiring danale is far greater than that for remotely hired local staff. Remember that O & M engineers are power users who make full use of every pixel on the screen.
50) IT standards. Until the Mac runs Office 2007 and Outlook, you must run Windows. Otherwise, office efficiency suffers in meeting scheduling, contacts, and mailing lists. An employee who is willing to work in an XP environment is very rare. This rule is not necessarily the best approach, since it may be outdated or not widely accepted; this list is very much a product of 2007.
51) Have a sound procurement process. Know your budget and manage it. Get the actual figures from finance. There is often a gap between a technology-driven budget/report and a finance-driven one. An O & M manager must be able to build a model that factors these gaps into the total cost of sales. A CFO who understands these things can help drive business decisions.
52) Hold the weekly meeting without fail. Follow up, item by item, on the results of and accountability for the items from the previous meeting.
53) Establish a separate escalation system to eliminate the adverse effects of developers' code problems on production systems. O & M problems and code problems often get lost in the development tracking system or the O & M tracking system and are eventually ignored. An independent tracking system for these problems keeps them simple and visible.
54) Product development should involve O & M at every design stage, so that scalability, monitoring, and reliability are built into the product. This also ensures that hardware procurement and monitoring are in place on time, that the O & M manual is kept up to date, and that products launch and run as expected and comply with O & M standards.
55) Real-world compliance: Sarbanes-Oxley, WebTrust security audit certification, SAS 70 audit standards, Visa and banking requirements, and so on. If you are successful, you will have to deal with them. Starting these preparations early is actually easy and does not require much expertise. Deploy a ticket/task tracking tool and use it. Put change control and change management in the same system and use it; other information can go in as well. The system can then answer questions like "What was changed last week?"
56) Simplify the process of redundancy and multi-point logon. It may be difficult at first, but a system without real scalability and reliability will seriously delay your success.
57) Oracle Standard Edition (or SQL Server Standard Edition) is worth buying. If you can keep your requirements within what the Standard Edition offers, it is definitely worth buying, even if you do not need it when you are just starting out.
58) S and MySQL are free options. If you are not particularly concerned about transactional integrity, MySQL is a good choice. Until the force chain of "vacuum" and S s words is interrupted, Postgres remains an unpredictable, usually negative, and strange database.
59) Base capacity design on the daily peak, then add another 20% to 30% of headroom. Unless you are a migration technical leader.
60) Read as many business magazines as possible. They are usually free; you only need to fill in a few surveys. The value of the news is enormous. Have them delivered to your home, because the chance of reading a magazine at work is almost zero.
61) Security assurance. Developers should not be authorized to access the live environment, though they should review code. This is separation of duties from O & M. Someone in the O & M team should have the authority to oversee other O & M personnel. Write an employee handbook that spells out the serious consequences of violating security rules. From the very beginning, protect customers' data security and privacy physically, logically, and functionally. If a customer challenges you and finds that you rely only on courage and diligence to protect their data, you will look foolish.
62) Control the access points. First, ensure that everyone can do their work normally. Second, make sure you know where they log in from. Enable two-factor authentication.
63) Keystroke logging is important on the bastion and gateway hosts that people must pass through to reach the production environment. This may be a little difficult on Windows, but some gateways provide automatic screenshots.
64) Whatever happens, make sure redundant logon points into the production environment exist. Do not expect the company VPN to reach production during a network outage. Set up a VPN directly in the production environment.
65) Use LDAP authentication. Even if you have only 10 machines and manage them by copying passwd and shadow files around, you still need LDAP authentication.
66) Do not underestimate the role of a Windows Server 2003 (2008) machine in a UNIX environment. If you do not understand Windows, go and learn it instead of rejecting it.
67) Do not waste your time on unreliable wireless solutions. People are mobile: they want to get online on the sofa, in the conference room, at the door. Ensure that wireless access is reliable.
68) Some people always put extra energy and time into their work; approve their leave requests directly. Others, by contrast, focus only on getting their requests approved. O & M personnel always make huge sacrifices in their personal schedules: they are ready to get up in the small hours to respond quickly to troubleshooting needs.
69) Manage all your production information in a centralized relational database. Then replicate that data (assets, personnel, networks, contracts, and so on) to a remote location. Yes, you need a live copy that is available online in real time, not just a nightly tape backup.
70) Use automated processes to ensure safety, including operating system or product rollout, file distribution, and log analysis.
71) Drive automation from the O & M database (the source of truth).
72) Servers generally have three states: offline, online, and production. Online means that configuration is being completed through cfengine, rsync, or whatever tools you use. Production means that the server is taking traffic. You also need a state in which devices can collect or test data without providing production services.
73) Take care of log data. Before a device is retired or rebuilt, export its logs first.
74) If you are growing so fast that there is little time for optimization, try to freeze everything: if a process still works, do not change it until there is an absolutely necessary reason. In short, lock in the defaults and change them only when necessary.
75) You can never fully prevent O & M engineers from making mistakes in the most critical parts of your infrastructure, for example accidentally running rm -rf / on the wrong machine.
76) Keep the team atmosphere fun and interesting; if people no longer enjoy their work, they will find other ways to entertain themselves. Give the team a sense of ownership: O & M is not the manager's job alone.
77) The true value of 99.999% availability is that it lets us stay flexible. It means you can make full use of redundancy when necessary, which allows for physical changes, device logon point changes, code modifications, and rollbacks. This is of huge value to the company itself, even more than to the customer.
78) If you can achieve 99.999%, you can give customers a 100% service commitment.
79) Do not lose the ability to release software through process. What you should lose is your ability to roll back or transfer the code to the old version. We should not "handle" this futile transfer of failure. When things go wrong, what you need most is something big to cover your fat ass. CYA = agile = successful companies.
80) Be clear about why you are doing what you do and for whom, and build each specific step of the product for the customer. Whatever you deploy to end users, put these considerations first: all of your designs (infrastructure, process, and personnel) exist to provide the best services and products.
81) Get it right the first time. There are very few opportunities to go back and do it again, and rework is a huge waste of company resources. To increase your hit rate, aim to succeed on the first attempt.
82) Contact industry insiders, allies, and similar companies to see how their O & M works. It is very likely that they have faced the same challenges as you and found a better solution. Do not be afraid to share your own experiences and processes, because others will give back too. As the saying goes, stones from other hills can polish jade!
83) Recruit people good enough to make you worry about your own seat, recruit people you admire and can learn from, and recruit people you would want to work with. This goes beyond recruiting employees with an "A" performance rating.
84) IT and O & M are completely different things. A good O & M manager should be able to manage enterprise IT, but a traditional IT engineer will struggle with Internet O & M work.
85) When you start a new job, or at the start of each year, you should get a budget. It should not simply carry the old numbers forward, but should be the best recommendation based on historical data. If you are evaluating a new job, make sure you fully understand the budget and where it comes from. You should also have the authority to adjust the budget.
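
The arithmetic behind rules 8 and 59 can be sketched in a few lines. This is only an illustration: the function names, the 25% default headroom, and the sample figures are assumptions chosen for the example, not part of the original list.

```python
def provisioned_capacity(daily_peak, headroom=0.25):
    """Rule 59: size for the daily peak plus 20% to 30% headroom.

    `headroom` is a fraction (0.25 means 25%); the default is an assumption.
    """
    if daily_peak < 0 or headroom < 0:
        raise ValueError("daily_peak and headroom must be non-negative")
    return daily_peak * (1 + headroom)


def max_safe_utilization(servers, tolerated_failures=1):
    """Rule 8: highest per-server load that still survives the failures.

    If `tolerated_failures` servers go down, the survivors must absorb the
    whole load, so each server may run at most (n - k) / n of its capacity.
    """
    if servers <= tolerated_failures:
        raise ValueError("need more servers than tolerated failures")
    return (servers - tolerated_failures) / servers


if __name__ == "__main__":
    # Hypothetical figures: 1000 requests/s at the daily peak.
    print("capacity to provision:", provisioned_capacity(1000, 0.25))
    # A redundant pair must stay under half load, hence the 49% in rule 8.
    print("limit for a pair:", max_safe_utilization(2))
    # Three servers tolerating one failure can run somewhat hotter.
    print("limit for three servers:", max_safe_utilization(3))
```

For a redundant pair the limit works out to one half, which is why rule 8 tells you to stay at or below 49%.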