1) Build load capacity first, then optimize. Breaking this rule will inevitably lead to downtime. Do not optimize under the pressure of an outage; focus on adding load capacity first.
2) Taking Postgres as an example, make sure each of your networks can support your WAL files, Slony replication, snapshot technology, and disk-based DB versioning (snapshot derivatives).
3) Do not "optimize" the problem into your architecture. Things added to solve a problem eventually become a heavy operations and maintenance burden. Make sure the tools developed along the way are fully integrated into operations engineering, because it is rarely possible to go back and develop them further later. More importantly, a change request can disrupt a project that is already scheduled.
4) Keep it simple. Keep things simple because you are smart; don't make them complicated just because you can.
(Translator's note: this is the KISS principle, "Keep It Simple, Stupid".)
5) Use caches with great care: it is hard to scale them horizontally while keeping resources consistent.
If you are building something that can scale horizontally, it is prudent not to add a cache layer. If you must use one, it should be to improve performance for the end user, not to gain capacity for the site.
6) Do not write all the code yourself, and do not outsource everything. Use the right tool at the right time to get the work done.
(Translator's note: do not reinvent the wheel.)
7) Negotiation: the only way to be truly effective is to do your research first and prepare some feasible plans. That way you can choose your lead developer if you really need to. Don't bluff.
8) Always maintain N+1. If N=1, never put the +1 to work under any circumstances: it exists only for the case where the N machine goes down. When redundant servers share the load, do not let the system exceed 49% load, so that the surviving server can absorb everything after a failure. Use an N+2 architecture whenever you have the opportunity.
9) Data loss is not a risk any company can afford; this is understood everywhere. The cost of losing data far exceeds the cost of protecting it.
10) Parallelize whenever and wherever possible. This is the most important thing to consider when planning for recovery. For example, if you use MogileFS with location awareness and need to replicate data in real time, a viable approach is for each MogileFS server to replicate its data to the central MogileFS load balancer. Enable as much parallelism as possible.
11) Read the manual. To this day I insist on reading RAID card manuals all the way through to catch any subtleties. The devil is in the details. Do your homework!
12) Know your bottlenecks and know how to locate them: troubleshoot layer by layer to find out whether disk, memory, or CPU is the constraint. Usually this is quite simple.
13) Run capacity management regularly, and be proactive about it. Without capacity trend curves, it is hard to know where your system is weak.
14) Do not contribute to failure, and do not fear change.
15) Don't dig a pit for yourself to fall into. Don't assume that the results of today's work will carry future work forward.
16) Operations staff should write code for operations tools, not application software.
17) On the ops team, don't underestimate the value of project management, documentation writing, and financial analysts. They are worth more than they cost.
18) Monitor everything. Alert on anomalies. Record the rest of the data for trend analysis.
19) Make it a regular process to review trend data from across your systems.
20) Do not let monitoring become cluttered, or it will lose all meaning.
21) Make sure the monitoring system is simple enough for anyone in the company to pick up. You may be surprised how often monitoring metrics end up being turned into business, marketing, and sales metrics.
22) Inspect only the areas where real improvements can be made. Otherwise, don't waste your time.
23) Publish your inspection reports with the relevant data attached, so that others can easily read the key points and correlate the data with the response.
24) Assign an owner to every technical area.
25) Give key personnel backups.
26) Keep recruiting. Even when you have no open headcount, keep recruiting.
27) Stay disciplined. No matter how smart you are or how brilliant you think you are, you have to keep improving yourself.
28) Compare yourself with other companies as much as possible. Look outward.
29) Pick one trade show or conference, only one, and attend it once a year. If the event is held several times a year, attend once.
30) Buy what you need, not what you want. Never take off your company hat. To me, that hat means asking "what is the simplest and safest option?"
31) Always do only what is most beneficial, even if it means you need to leave.
32) Formalize accountability: record commitments, track them, and make delivery against those promises visible.
33) You should not fail more than twice. A little fear is healthy. But know the difference between a persistent mistake and an unintentional one.
34) Be ruthless; your competitors are.
35) Treat your work as something you will have to sign your name to when it is done. That also means actually finishing it.
36) Make yourself useful to others.
37) Partner with startups: offer your expertise and scale, and you will get free products in return, sometimes for life.
38) Capacity is a business/product issue. It may come as a surprise, but the cost of every page, every request, every login, the network, and so on must be transparent in order to make the right business/product decisions.
39) Always beat the budget. The ops team is usually the biggest cost center. Revenue often falls short of the budget, but the ops team has many ways to defer purchases.
40) What worked normally in the past may not work now or in the future. So before you do it, test it with a tool first.
41) Documentation. Write everything down. When newcomers ask how to do something, there should be a document to point them to.
42) Map your data center's network topology in a large-format diagram.
43) Draw a flow chart describing the business flow of each of your products.
44) Set up a FAQ-O-Matic or wiki where people can easily publish "this is how to fix this" articles and others can easily find them. Technical writers can come in handy here, but the most important thing is to make documenting easy, even if it stays informal.
45) Make sure everyone, without exception, can be replaced.
46) Most people get more done at home than in the office; others do not.
47) Bundle your orders: when you buy hardware in bulk, you can ask for bigger discounts, better terms, and more. You can ask for everything (the lowest price, a spare parts kit, lease terms) as long as the vendor has not yet received the order.
48) Maintain long-term relationships with your suppliers, and make sure you can still contact them in your next job.
49) Equip every member of the ops team with top-end gear for working remotely: a PDA, a wireless Internet card, a 24-inch LCD monitor, and so on. Hiring talented people remotely is far more rewarding than hiring only locally. Remember that ops engineers are power users and will make full use of every pixel on the screen.
50) Be completely strict about IT standards. Until Office 2007 and Outlook run on the Mac, you must run Windows. Period. Otherwise you disrupt the productivity of meeting calendars, contacts, and mailing lists, unless everyone uses Macs throughout. Very few employees are willing to work in an XP environment. This rule is now outdated and no longer accepted; being stuck on a standard is not necessarily the best approach. This list is very 2007.
51) Have a sound procurement process. Know your budget and make sure you can stay within it. Get the actual figures from finance. There is often a gap between technology-driven budgeting/reporting and finance-driven budgeting/reporting. As an operations manager, you can build models that account for these gaps in the total cost of sales. A CFO who understands this can help drive business decisions.
52) Hold the weekly meeting without fail. Go through the outcomes of, and accountability for, each item from the previous meeting one by one.
53) Establish a separate escalation system to eliminate the negative impact of developer code issues on production. This is mainly an operations problem: code issues filed in the development tracker or the operations tracker often get lost and end up with nobody attending to them. Setting up an independent tracking system for these issues keeps them simple and visible.
54) Involve operations in product development from the start, at every design phase. That way scalability, monitoring, and reliability are built into the product. It also ensures that hardware is procured and monitoring is in place on time, that operations manuals are updated, and that the final product goes live on schedule and meets operations standards.
55) Follow real corporate compliance practice: Sarbanes-Oxley, WebTrust security audit certification, SAS 70 audit standards, Visa and banking requirements, and so on. If you truly succeed, you will have to deal with these. Starting early is easy and does not take much expertise. Deploy a work order/task tracking tool, and use it. Put change control and change management in the same system, and use it. Other information can go in as well. The system helps answer questions such as "what changed last week?"
56) Streamline the process for redundancy and multi-point logins. It may be difficult at first, but a system without real scalability and reliability can seriously delay your success.
57) Oracle Standard Edition (or SQL Server Standard Edition) is worth buying. If you can keep your needs within what the standard edition offers, it is excellent value, even though a company that has only just started may not need it yet.
58) Postgres and MySQL are free options worth considering. If you are not particularly concerned about transactional integrity, MySQL is a good choice. Until the forced association between the word "vacuum" and Postgres is broken, Postgres remains an unpredictable, usually negative, and bizarre database.
59) Design capacity with 20% to 30% headroom above the daily peak, unless you happen to enjoy doing migrations.
60) Read as many business magazines as possible. They are usually free; you can get them by filling out a few questionnaires. The value of the news is immense. Have them delivered to your home, since the chance of reading a magazine at work is close to zero.
61) Ensure security. Developers should not have access to the production environment, and code should be reviewed. This is a separation of duties from operations. Someone on the ops team should control the permissions of the other ops staff. Write an employee handbook that spells out the serious consequences of violating security rules. Protect customer data security and privacy from the start, at the physical, logical, and functional levels. If a customer ever confronts you and you find you have been relying only on courage and diligence to protect their data, you have been foolish.
62) Control access at the entry points. First make sure people can get their work done, then make sure you know where they are logging in from. Enable two-factor authentication.
63) Keystroke logging is important on the bastion and gateway hosts that people must pass through to reach the production environment. It may be somewhat harder on Windows, but some gateways provide automatic screenshot capture.
64) Make sure there is a redundant access point into the production environment in case of an incident. Do not count on the corporate VPN to reach production when the network is down. Set up a VPN directly in the production environment.
65) Use LDAP authentication. Even if you have only 10 machines and could manage them by copying passwd and shadow files, you still need LDAP authentication.
66) Do not underestimate the role of a Windows Server 2003 (2008) machine in a UNIX environment. If you avoid it only because you don't know Windows, learn it instead of rejecting it.
67) Do not waste your time on an unreliable wireless setup. People are mobile: they want to get online from the couch, the boardroom, the doorway, everywhere. Make sure the wireless network is reliable.
68) Some people always put extra energy and time into their work; just look at their leave records. Others, by contrast, focus only on getting their own leave requests approved. Ops people constantly make huge sacrifices in their personal schedules, and they are ready to get up at 3 a.m. and respond quickly to an outage.
69) Manage all your production information (assets, people, networks, contracts, and so on) in a centralized relational database, then replicate all of that data to an offsite location. Yes, that means an online, real-time, usable copy, not a nightly backup to tape.
70) Use automated processes to stay safe, including for operating system and product rollout, file distribution, log analysis, and so on.
71) Drive automation from the operations database (the source of truth).
72) Servers typically have three states: offline, online, and production. Online means the server is being configured via cfengine, rsync, or whatever other tools you use. Production means it is taking traffic. You also need a state in which a device can collect or test data without providing production services.
73) Pay attention to log data. Before a device is taken offline or rebuilt, be sure to export its logs first.
74) If you are growing so fast that there is little time to optimize, try to lock everything down: once a process works, don't change it unless there is an absolutely necessary reason. In short, lock in the defaults and revisit them when growth requires it.
75) You will never be able to completely avoid mistakes in the most critical parts of your infrastructure, such as accidentally running rm -rf / on some machine.
76) Keep the team atmosphere fun. If people no longer enjoy their work, they will find something else to entertain themselves. Give the team a sense of ownership; ops is not the manager's job alone.
77) The real value of delivering 99.999% availability is the flexibility it gives us. It means you can draw on redundancy when you need it, which allows physical changes, device login changes, code changes and rollbacks, and so on. This is of great value to the company itself, even more than to the customer.
78) If you can achieve 99.999%, offer the customer a 100% service commitment.
79) Do not let process cost you the ability to release software. What you must not lose is the ability to roll back or fail over to an older version of your code. A failover that accomplishes nothing should never be something you simply "deal with". When things go wrong, you need something solid to cover yourself. CYA = staying agile = a successful company.
80) Keep in mind why you are doing each specific step and what it contributes to building the product for the customer. Whatever you deploy to the end user, remember first that everything you have (infrastructure, processes, and people) exists to deliver the best services and products.
81) Get it right the first time. There is little chance you will get to go back and redo it, and rework is a huge waste of company resources. Raise your hit rate: aim to succeed on the first attempt.
82) Get in touch with peers, allies, and similar businesses to see how they run operations. They have probably met the same challenges as you and may have better solutions. Don't be afraid to share your own experience and processes, because others will give back. As the saying goes, stones from other hills can polish jade.
83) Recruit people good enough to make you worry about your own seat, people you admire and can learn from, and people you want to work with. That matters even more than hiring someone with an "A" performance rating.
84) IT and operations are two completely different things. A good operations manager should be able to manage corporate IT, but a traditional IT engineer will struggle with Internet operations work.
85) When you start a new job, and at the start of every year, work out the budget afresh. This does not mean simply carrying the old numbers forward; base your best recommendation on historical data. If you are evaluating a new job, make sure you know exactly what your budget is and where it comes from. You should also have the authority to refine the budget.
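As a rough illustration of the capacity arithmetic behind rules 8 and 59, here is a minimal Python sketch. The function names, the default 25% headroom, and the assumption that load spreads evenly across servers (and redistributes evenly after a failure) are illustrative choices, not part of the original list.

```python
import math


def max_safe_utilization(servers: int, failures_to_survive: int = 1) -> float:
    """Per-server utilization ceiling that still survives the given failures.

    Assumption: load is spread evenly and redistributes evenly onto the
    surviving servers when some of them go down.
    """
    if servers <= failures_to_survive:
        raise ValueError("need more servers than failures to survive")
    return (servers - failures_to_survive) / servers


def servers_needed(peak_load: float, per_server_capacity: float,
                   headroom: float = 0.25, spares: int = 1) -> int:
    """Servers required for the daily peak plus headroom, plus spare machines.

    headroom=0.25 sits inside the 20% to 30% range from rule 59;
    spares=1 is N+1 and spares=2 is N+2 (rule 8).
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    base = math.ceil(peak_load * (1 + headroom) / per_server_capacity)
    return base + spares


if __name__ == "__main__":
    # Two servers sharing load: each must stay at or below half capacity.
    print(max_safe_utilization(2))
    # Hypothetical figures: 9000 req/s peak, 1000 req/s per server, N+1.
    print(servers_needed(9000, 1000))
```

For a two-server pair the ceiling comes out at 0.5, which is why rule 8 keeps the load at 49% or below. The request rates in the example are made-up numbers chosen only to show the calculation.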
The 85 Rules of Operations and Maintenance