At the end of last year, Netflix published an article entitled "5 Lessons We've Learned Using AWS". Amazon Web Services (AWS) is undoubtedly an outstanding representative of so-called "cloud computing", so the article can also be read as essential advice for any website that wants to move to the cloud. And the advice really is great! Here is the part that struck me the most:
Note: Netflix is an American company that provides on-demand Internet streaming, along with online rental of DVDs and Blu-ray discs, in the United States and Canada. Founded in 1997, the company is headquartered in Los Gatos, California. Its subscription service started on January 1, 1999. By 2009, the company offered about 100,000 DVD titles and had 10 million subscribers.
We often refer to Netflix's software architecture in AWS as our "Rambo Architecture". Each system has to be able to succeed, no matter what, even all on its own. We design each distributed system to expect and tolerate failure in the other systems it depends on.
Note: Rambo is the hero of the movie "First Blood", an invincible lone warrior.
If our recommendation system is down, the quality of our responses to customers degrades, but we still respond. We cannot make personalized recommendations in that case, so we show the most popular titles instead. If our search system suddenly becomes slow, that must never stop users from streaming movies.
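The fallback described above can be sketched in a few lines. This is only an illustration, not Netflix's code: `recommend`, `POPULAR_TITLES`, and the two example recommenders are hypothetical names invented for the sketch.

```python
from typing import Callable, List

# Hypothetical stand-in for a non-personalized "most popular" list.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def recommend(user_id: str, personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized picks, or popular titles if the recommender fails."""
    try:
        return personalized(user_id)
    except Exception:
        # Degrade the quality of the answer instead of failing the request.
        return list(POPULAR_TITLES)


def healthy_recommender(user_id: str) -> List[str]:
    """Example dependency that works."""
    return [f"Pick for {user_id}"]


def broken_recommender(user_id: str) -> List[str]:
    """Example dependency that is down."""
    raise ConnectionError("recommendation service is down")


if __name__ == "__main__":
    print(recommend("alice", healthy_recommender))
    print(recommend("alice", broken_recommender))
```

The caller always gets a list of titles back. Only the quality of the list changes when the dependency is unavailable.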
One of the first systems our engineers built in AWS is called the "Chaos Monkey". The monkey's job is to randomly kill instances and services within our architecture. If we are not constantly testing our ability to recover, or even to succeed, in the face of failure, then the system is likely to let us down when it matters most.
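To make the idea concrete, here is a toy version of such a monkey. It is a sketch under stated assumptions, not the real Chaos Monkey: the `Instance` class, the in-memory fleet, and the rule that a role stays available while at least one of its instances is running are all invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical in-memory stand-in for a running server or service."""
    name: str
    role: str
    running: bool = True


def unleash_monkey(fleet: List[Instance], rng: random.Random) -> Optional[Instance]:
    """Kill one randomly chosen running instance; return it, or None if none run."""
    alive = [inst for inst in fleet if inst.running]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.running = False
    return victim


def roles_without_capacity(fleet: List[Instance]) -> List[str]:
    """List the roles that no longer have a single running instance."""
    all_roles = {inst.role for inst in fleet}
    covered = {inst.role for inst in fleet if inst.running}
    return sorted(all_roles - covered)


if __name__ == "__main__":
    fleet = [
        Instance("web-1", "web"),
        Instance("web-2", "web"),
        Instance("search-1", "search"),  # deliberately not redundant
    ]
    victim = unleash_monkey(fleet, random.Random())
    print("killed:", victim.name, "| roles now down:", roles_without_capacity(fleet))
```

Run it a few times: whenever the monkey happens to pick `search-1`, the `search` role goes down, which is exactly the kind of single point of failure this exercise is meant to expose.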
At first glance, this advice sounds crazy! But we have to face it. I'm not sure how many companies would agree with this practice, let alone actually try it. Raise your hand if someone where you work has deployed a daemon or service whose sole purpose is to randomly kill services and processes in your server farm!
If that person has not been fired by your company, raise your other hand!
Who in their right mind would willingly choose to work with a "Chaos Monkey"?
In fact, sometimes you don't have a choice: the monkey finds you! The Stack Exchange network ran into a strange problem that we struggled with for months. Every few days, a server in our Oregon data center would suddenly stop responding to all requests from the external network. There was no apparent reason, and the server could be recovered only through a slow shutdown and restart, during which it would also bluescreen ...
It took us months to track the problem down. We drew up a long list of possible causes and ruled them out one by one:
- Switch network ports
- Replace network cables
- Use a different router
- Try multiple versions of the NIC drivers
- Try different network settings at both the operating system and driver levels
- Simplify our network configuration and turn off the tproxy service in favor of the more traditional X-FORWARDED-FOR
- Change virtualization software vendors
- Change our TCP/IP host model
- Install kernel updates
- Seek help from our vendors' senior customer support teams
- Other attempts (which I have forgotten, now that I am free of this pain)
Throughout the incident, our team members were so frustrated that they nearly came to blows. (Our team works remotely, so how would we even fight? Over Skype, of course ...) Can you blame us? Every few days, one of our servers would randomly go down. The "Chaos Monkey" kept striking, again and again!
However, even at our most frustrated, I realized that the experience had pushed us to make some positive changes:
- Where we used to run an important function on a single server, we now run it on two.
- Where a service had no reasonable fault tolerance or fallback plan, we added one.
- We audited the whole system and removed unnecessary components until only the minimal set of services needed to operate remained.
- We put contingency measures in place so that the system keeps running even when a service we considered critical suddenly fails.
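The first two changes, a second server plus a fallback path, boil down to trying redundant backends in turn. Below is a minimal sketch of that pattern. It does not describe Stack Exchange's actual setup: `call_with_failover` and the example backends are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each redundant backend in order; return the first successful result."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except Exception as exc:
            # Remember the failure and move on to the next redundant backend.
            last_error = exc
    raise RuntimeError("all redundant backends failed") from last_error


def primary() -> str:
    """Example backend that the monkey has just killed."""
    raise ConnectionError("primary server is not responding")


def secondary() -> str:
    """Example backend that is still healthy."""
    return "served by secondary"


if __name__ == "__main__":
    print(call_with_failover([primary, secondary]))
```

With only `primary` in the list, the same call raises `RuntimeError`, which is the single-server situation the list above set out to eliminate.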
Day after day, week after week, we made our system a little more redundant, because we had to. The whole process was painful, but the facts speak for themselves: the "monkey" helped us a great deal by forcing us to become far more resilient. So do it now! Not tomorrow, not at some indeterminate point in the future, and certainly not "let's talk about it later". Act now!
When you work with the "monkey", you quickly learn that everything happens for a reason (except for the things that happen completely at random). "The best way to avoid failure is to fail constantly." Take that sentence to heart, however crazy it sounds!