1. Capacity first, optimization second. This rule applies when a failure occurs: when you are down, do not study optimizations; restore service first.
2. Keep every record you can capture. Taking PostgreSQL as an example, that includes WAL files, Slony replication, snapshots, and disk-level copies of the database (included with the snapshots).
3. Do not introduce more problems through optimization. The things we do while solving a problem often turn into a burden on later operations. Confirm that tools developed for the operations team are fully delivered; they often do not work as expected, and the result is that they go back to the development team. More importantly, that change request usually disrupts the work plan the team had already arranged.
4. Keep it simple. Do not let things get too complicated; someone as smart as you can certainly manage that.
5. Use caching sparingly, to protect resources that are difficult to scale horizontally. Of course, if a resource can scale horizontally, giving it a caching layer is not excessive either. Once a cache layer is in use, its aim should be to improve end-user performance rather than to increase the capacity of the site. Otherwise you are simply adding a new, very unreliable bottleneck whose negative effects may endanger the entire system. In fact, a cache layer failure is often what sets off an avalanche of cascading failures.
6. Do not write everything yourself, and do not buy everything from a vendor: use the appropriate tool at the appropriate time.
7. Negotiation: the only way to negotiate with a truly powerful vendor is to do your homework in advance and have everything ready. If necessary, be prepared to walk away from your preferred vendor. It must not be a bluff.
8. Always have an n+1 server ready. If n equals 1, do not use the +1 machine under any circumstances; keep it waiting to take over when n fails. When you use the redundant server to balance the load, only 49% or less of each server's capacity is usable. Usually we get a chance at n+2; make good use of it.
9. Data loss is a risk no company can afford; this is a universal truth. The cost of losing data far exceeds the cost of ensuring that it is not lost.
10. Parallelize anytime and anywhere; this is a very important way of thinking. For example, if MogileFS is set to be location-aware and requires real-time replication, then each MogileFS server must be able to replicate its own data to the peer specified by the load balancer. As far as possible, aim for this many-to-many approach.
11. RTFM. Even today I would read a pair of RAID card manuals to compare their subtle differences. The devil is in the details. Read the documentation the way you would do your homework!
12. Understand the bottlenecks at each layer and how to identify them. You have to know whether you are limited by disk, memory, or CPU, and this is really easy to figure out.
13. Have a fixed capacity management process, and make it proactive, not reactive. Not knowing where the system's weak points are, and letting the actual load curve run above the capacity curve, is extremely dangerous.
14. Do not contribute to failure, and do not fear change.
15. Do not breathe your own exhaust. Do not assume that the results of your current job should drive how you work in the future.
16. The code that operations staff should write is operational tooling, not application software.
17. Do not underestimate the value of project managers, technical writers, and financial analysts on the operations team. These people are usually worth much more than the salary you pay them.
18. Monitor everything. Raise alarms only when action is needed; record everything else for trend analysis.
19. Have a fixed process for reviewing the trend data of every component.
20. Do not let monitoring become too noisy, or it will quickly stop having any effect.
21. Make sure your monitoring system is easy to use and that everyone in the company can get started with it. Monitoring metrics get converted into business, marketing, and sales metrics surprisingly often.
22. A post-incident summary that does not lead to the corresponding changes is a waste of time.
23. Publish each summary together with the event-related data, so that readers can easily find the key points and jump to the corresponding data.
24. Make a person responsible for every piece of the technology.
25. Arrange backup staff for these responsible people as well.
26. Keep recruiting, even when there are no open positions.
27. Be your own harshest critic. No matter how smart you are or who you think you are, there is always room for improvement.
28. Look outward more: compare your own level with that of as many other companies as possible.
29. Attend a technical conference every year. If there are several in a year, choosing the best one is enough.
30. Buy what you need, not what you want. Never take off your company hat and replace it with "what is simplest and safest for me".
31. Just do what is best for the business, even if it means you are no longer needed ...
32. Formalize accountability: document commitments, and follow up with those who have not met them.
33. Repeated failure is not allowed. That may sound a bit harsh, but do distinguish between irreversible errors and ordinary mistakes.
34. Be ruthless, because your competitors are.
35. Work is something you put your name on when it is complete. Putting your name on it also means the task is truly finished.
36. Keep your external contacts active.
37. Startup partners: tell them about your expertise and skills. You will sometimes get free products in return, occasionally for life.
38. Capacity is a business and product issue. That is, the network cost of each page view, upload, or login request must be visible, so that it can support correct business and product decisions.
39. Be sure to beat the budget! The operations team is always the biggest spender. Company revenue targets are often missed, so the operations team should have many ways to defer its own costs.
40. Past experience does not necessarily apply to the present, let alone the future. Test it properly, and have the right test tools to do so.
41. Documentation: everything should be well documented, so that new team members do not have to go around asking the whole team in order to understand the work.
42. Draw an oversized network topology diagram of your data center.
43. Draw a logical flowchart for each of your products.
44. Wiki: it makes "how to fix this problem" documents easy to publish and easy to find. This is where the technical writers work, but a wiki can also make informal documents, or small paragraphs that have been added to and changed over time, look better.
45. Ensure that every member of the team, yes, every one, can be replaced.
46. Some people work better at home than in the office, and some do not.
47. Bundle your orders: package hardware requirements into large orders to negotiate the biggest discount, and remember to include everything in the order, such as spare parts, lease terms, and so on.
48. Maintain long-term relationships with suppliers, so that you can still contact them in your next job.
49. Equip everyone on the operations team with everything they need to work remotely: handheld computer, 3G network card, 24-inch LCD screen ... The return you get from talented people equipped this way far outweighs the cost of hiring remote field engineers. Remember, operations engineers are power users who know, and can make the most of, every pixel on the screen.
50. The team will always need a few Windows machines unless the Mac can run Office 2007 and Outlook. Going without them badly damages the team's meeting scheduling, contact management, mailing lists, and so on.
51. Have a streamlined procurement process if you want to understand your budget and manage it well. The facts can be taken from the financial statements. There is usually a gap between technology-driven reporting and finance-driven reporting. A good operations manager can build models that account for these differences in the total cost of sales. A CFO who understands this can help drive business decisions.
52. Hold the weekly meeting without fail, to review the previous week's events and establish accountability for them.
53. Create an independent escalation system to manage code development issues that adversely affect operations. The reasoning behind this idea: a problem that involves both development and operations is mostly ignored in either the operations or the development tracking system, and in the end nobody attends to it, so creating a separate tracking system for these problems is simpler and clearer.
54. From the very start of design, every phase of product development should involve operations. That way scalability, monitoring, and reliability are built into the product. It also ensures that operations procures hardware and has monitoring in place on time, that the operations manual is updated, and that the final product goes live on schedule and meets operational standards.
55. Operate like a real company: the Sarbanes-Oxley (SOX) Act, WebTrust security audit certification, SAS 70 auditing standards, Visa and the banks, and so on. If you really succeed, these are the things you will have to deal with. Getting started early is actually very simple and does not require much knowledge. Just set up a ticket/task tracking tool and use it well. Put change control and change management into the same system and use it well. Put other information in as well. The system can then answer questions like "What changed last week?"
56. Leave room for redundancy. It may be difficult at first, but a system without real scalability and reliability can seriously delay your success.
57. Oracle Standard Edition (or Microsoft SQL Server Standard Edition) is worth buying. If you can keep your requirements within what the standard edition offers, then absolutely buy it, even if you are just starting a business.
58. Postgres and MySQL are good free options. If you are not particularly concerned about transactional integrity, MySQL is actually pretty good.
59. Design capacity around the daily peak and then add 20% to 30% headroom, unless you are using VMotion (VMware's live migration technology).
60. Read as many trade magazines as you can. They are usually free as long as you fill out a few questionnaires, and the value of the news is enormous. By the way, remember to have them sent to your home; the chance of reading magazines at work approaches zero.
61. Pay attention to security. Developers should not have access to production, but should do code reviews. This is the separation of duties between development and operations. Within operations, there should in turn be someone who controls the granting of permissions to other operations staff. Create an employee handbook that warns of the serious consequences of violating security rules. From the outset, remember to protect customers' data security and privacy at the physical, logical, and functional levels. If a client ever takes you to court, it will not feel good to realize that you were relying only on courage and diligence to protect customer data.
62. Control access at the entry points. First make sure access control is done properly, and then make sure you know where each user is coming from. Implement two-factor authentication now.
63. Keystroke logging is critical for people who access the bastion hosts and gateways to the production environment. It may be a bit harder on Windows, but some gateways can provide automatic screen capture.
64. Ensure that there are several ways to log in to the production environment. Do not expect the company VPN to work when the network is down. Set up a VPN directly in the production environment.
65. Use LDAP for authentication, even if you have only 10 machines. If you are managing accounts by copying passwd and shadow files around, you want LDAP authentication too.
66. Do not underestimate how useful a Windows Server 2008 machine is in a UNIX environment. If you belittle it just because you do not know Windows, then learn it instead.
67. Do not waste your time on ineffective wireless solutions. Everyone in the company is mobile: on the sofa, in the conference room, in the doorway, and online everywhere. Be sure to maintain your wireless network properly.
68. There are always people who devote extra energy and time to their work; approve their leave requests directly. Others, on the contrary, focus only on getting their own leave approved. Operations staff always make huge sacrifices in their personal schedules; they are ready to get up at 3 o'clock in the morning to respond quickly to an outage.
69. Manage all your equipment assets in a centralized RDBMS. Then replicate assets, people, networks, contracts, and all other data offsite. Yes, that means online, real-time replication, not a nightly backup to tape.
70. Automate as many processes as possible to ensure safety, including operating system and product rollout, file pushes, log analysis, and so on.
71. Automation must be tied to the data in the operations RDBMS.
72. Equipment usually has three states: offline, in service, and in preparation. The preparation state means that configuration is being completed through Cfengine, rsync, or whatever other tools you use. In service means the machine is already carrying traffic. You also need a state in which a machine can collect or test data without providing production service.
73. Respect log data. Always export the logs before a machine is taken offline or rebuilt.
74. If the rapid growth of your business leaves you little time to optimize, then try to lock everything down: if a process works, do not change it until there is an absolutely necessary reason. In short, lock in the defaults and review them when growth makes it necessary.
75. You can never fully prevent operator error in the most critical areas of your infrastructure, for example, an rm -rf / command accidentally executed on some machine.
76. Maintain a fun atmosphere for the team; if they do not enjoy their work, they will find something else to entertain them. Give the team a sense of ownership: operations is not the manager's personal task.
77. The real value of providing 99.999% availability is the ability to remain flexible. It means you can take full advantage of system redundancy when you need it: physical changes, device migrations, code changes, and rollbacks all go smoothly. This is of huge value to the company itself, even more than to the customer.
78. If you can do 99.999%, give the customer a 100% SLA commitment.
79. Do not give up the ability to hot-update software, and above all do not give up the ability to roll back or move to an older version of your code. There should be no futile, hand-managed failover. When things get worse, what you need is something big enough to cover yourself with. CYA (Cover Your Ass) = staying agile = the success of the company.
80. Remember the reason and purpose behind each step of building a product for your customers. Whatever you deploy to end users, put these considerations first: everything you have (infrastructure, processes, and people) exists to provide the best services and products.
81. Get it right the first time. There is very little chance you will get to go back to the beginning, and redoing work is a huge waste of company resources.
82. Reach out to partners, allies, and similar companies in the industry to see how they run operations. They may well have met the same challenges as you and found more ingenious solutions. Do not be afraid to share your experience and processes, because others will give back.
83. Hire people good enough that you worry they could push you out of your current job, hire people you admire and can learn from, and hire people you want to work with. This matters even more than the formal job evaluation you run when hiring an employee.
84. IT and operations are two completely different things. A good operations manager should be able to manage enterprise IT, but a traditional IT engineer may not be able to handle Internet operations tasks.
85. When you start a new job, or at the start of every year, you should work out the budget. This does not mean mechanically following the routine, but rather producing a sound budget based on historical data. If you are evaluating a new job, make sure you know exactly what the budget is and where it comes from. You should also have the authority to increase the budget.
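
Rules 8 and 59 both come down to simple capacity arithmetic, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 25% default headroom (a midpoint of the 20% to 30% range in rule 59), and the idea of a single "per-server capacity" number are all hypothetical and do not come from the list itself.

```python
import math


def required_capacity(daily_peak, headroom=0.25):
    """Rule 59: provision for the daily peak plus a headroom fraction.

    headroom=0.25 is an assumed midpoint of the 20%-30% range.
    """
    if daily_peak < 0 or headroom < 0:
        raise ValueError("daily_peak and headroom must be non-negative")
    return daily_peak * (1.0 + headroom)


def servers_needed(daily_peak, per_server_capacity, headroom=0.25, spares=1):
    """Rule 8: n servers to carry the padded peak, plus `spares` extra (n+1)."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    n = math.ceil(required_capacity(daily_peak, headroom) / per_server_capacity)
    return n + spares


def max_safe_utilization(active_servers):
    """Upper bound on per-server utilization if one server may fail.

    With all servers sharing the load, the survivors must absorb the
    failed server's share, so each one has to stay below (k - 1) / k.
    For a load-balanced pair this bound is 0.5, which is why rule 8
    says only 49% or less of the capacity is usable.
    """
    if active_servers < 2:
        raise ValueError("need at least two servers to survive a failure")
    return (active_servers - 1) / active_servers


if __name__ == "__main__":
    # Hypothetical numbers: 8000 requests/s at peak, 1000 requests/s per server.
    print(servers_needed(8000, 1000))   # 10 servers for the padded peak, plus 1 spare
    print(max_safe_utilization(2))      # bound for a load-balanced pair
```

The point of keeping the spare outside `n` is the same as in rule 8: the extra machine is there to take over on failure, not to be counted as everyday capacity.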