"How do you explain why databases are so hard to scale?" This question comes from Quora. The asker added that he has some basic knowledge of databases but still does not understand why scaling one is so difficult. Bole Online has compiled excerpts from two popular replies to the question.
Paul King, Facebook data scientist (3.6K likes)
There are four big challenges to scaling a database: search, concurrency, consistency, and speed.
Suppose you have a list with 10 names on it. If you want to find someone, just take a glance at the list.
But what if there are 1 million names on the list? At this point, you need some strategy. The phone book lists the names in alphabetical order so you can skip over the unwanted ones. This is a solution to the search problem.
What if 1 million people were using the phone book at the same time? This is the problem of concurrency. Either everyone waits in a long queue at city hall to use the one phone book, or you make 1 million copies of it: the "master-slave replication" strategy. If you put those 1 million copies in people's homes (the "distributed" strategy), everyone also gets a quick response.
What if someone's phone number changes? The master-slave replication strategy now poses a problem: you must update 1 million phone books. And they are still in use, so when can they be changed? If you change them one copy at a time, you create data consistency problems. If you recall them all and issue new ones, you create availability problems.
What if hundreds of thousands of people change their phone numbers every hour? Now you face serious congestion caused by "resource contention," which can also lead to "race conditions" (unpredictable results) and "deadlocks" (the database locks up).
All of these problems have solutions, but the solutions can be very complex. For example, instead of reprinting the phone books you can issue an appendix of changes (called a "changelog"), but then everyone must always check the appendix. Or you can publish dated editions of the phone book that everyone swaps at the same time, which improves consistency, but then the phone book is always a little out of date.
Now scale to millions of users, with billions of records distributed across data centers around the world.
The basic goal of a database is to maintain the illusion that there is only one copy, that only one person modifies it at a time, that everyone sees the latest data, and that it responds immediately. This goal cannot be achieved once the database grows to millions of people around the world using and modifying trillions of records.
So the task of database design is to use interlocking algorithmic techniques to come as close to this illusion as possible.
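The appendix idea can be sketched in a few lines of Python. This is only an illustration of the analogy, not how any particular database implements its changelog; the class name `PhoneBook` and its methods are made up for the example.

```python
class PhoneBook:
    """A printed edition (snapshot) plus an appendix of changes (changelog)."""

    def __init__(self, entries):
        self.snapshot = dict(entries)  # the printed book, never edited in place
        self.changelog = []            # (name, new_number) pairs, oldest first

    def record_change(self, name, number):
        # Issuing an appendix entry is cheap: nothing gets reprinted.
        self.changelog.append((name, number))

    def lookup(self, name):
        # Readers must check the appendix first; the newest entry wins.
        for changed_name, number in reversed(self.changelog):
            if changed_name == name:
                return number
        return self.snapshot.get(name)

    def reprint(self):
        # Publish a new edition: fold the appendix in and start a fresh one.
        for name, number in self.changelog:
            self.snapshot[name] = number
        self.changelog = []
```

The trade-off from the text shows up directly: `record_change` is fast, but every `lookup` gets slower as the appendix grows, until a `reprint` folds the changes back into the book.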
Yishan Wong, former CEO of Reddit (2K likes)
Here is an explanation for non-technical people, aimed at laymen who, for example, do not understand databases at all.
(If you do understand databases, please ignore this, along with some minor technical errors in the analogy that follows.)
Scaling is difficult in many ways, but first I want to explain at a fundamental level why it is difficult: "scaling" is not an activity in a single direction. In general, it means making a complex system "better," usually larger or more capable, and often it has to be done quickly. The key point is that a complex system cannot be made larger, more productive, or more efficient in a simple way. The parts of the system usually interact with each other, so if you scale up one part, other parts will usually fail, and you do not get the scaling you wanted. You almost always have to do some redesign.
To make an analogy:
Think of the database as a library, a place where you store a book or a set of books (such as the complete Harry Potter series). More precisely, your web application is also a library: it stores books and gives people easy access to read them. Imagine that this library was featured on TechCrunch and became very popular, so you suddenly face a series of scaling problems. Let's list a few of them and explain them in simple terms:
Example one: Many, many books
Your library is becoming more and more popular, so you have far more books than when you started. For a long time you simply put them on new shelves in your room, but now the building cannot hold all the books. They exceed what your small library can carry. You have to buy or rent an adjoining building and put books there. This can be a problem, because the buildings near you are limited, or real estate prices are so high that renting the expensive building next door is unsustainable. So you have to think carefully about which buildings to rent and how to find ones that are both close by and cost-effective for storing books.
This is a direct analogy: a database is usually stored on hard disks, a hard disk has limited space, and a single computer (or data center, or rack) can only hold so many hard disks. Data centers are designed to cope with this common problem, but if you are a very large library you may still hit the limit: the racks in the data center cannot hold that many drives, and you may even need to abandon the data center and build a new one (this is rare). Still, the core of the problem is that no matter how much space you start with, you will always need more, and you cannot keep adding "units" to the space forever (like adding bookshelves to the library). Eventually you need to make a step change, such as renting the building next door, renting another rack, or building another data center.
Example two: Finding books in the ocean of books
When your library occupies only one room, you just put all the books in alphabetical order, each in its place. If someone wants a book, he simply goes through the room in order until he finds it. This takes about 30 minutes.
Now your library is very big, and you have rented many buildings. If someone wants a book, he may have to walk through all of them. People simply will not accept spending that long to find a book (much like the load times people put up with on web pages). They want to go straight to the right building, the right floor, and the right bookshelf, and pick up the book they want. They do not want to spend more than half an hour.
To achieve this, you need to create a new system called an index. Real-world libraries do run into this problem, and the solution they use is the card catalogue.
The younger generation may not know what this is, because it is a relic of the pre-computer age. The card catalogue is simply a database made of small drawers and small pieces of paper (cards). Because they were so cumbersome, we digitized them and put them on computers. If you are under 25, you may never have seen one.
The card catalogue (that is, the index) holds one card for each book, filed in drawers and sorted by title, author, subject, and so on. If you want to find a book, you go straight to the card catalogue, which is in a separate room, find the card for the book you want, and the card gives the book's exact location. So it takes only 10 minutes to look up the card, 10 minutes to reach the right building, 5 minutes to reach the right floor and bookshelf, and 5 minutes to find the exact book.
Once a library gets too big, it has to introduce a card catalogue to keep the time needed to find a book within a reasonable range (half an hour, say). Otherwise it takes days to search every building for a book, and people stop using the library. This is fundamentally different from just renting more buildings and searching all of them. The example shows that when you hit a threshold while scaling, you have to come up with a new kind of solution to get past it. It is not just a matter of adding bookshelves: you have to compile the bibliography and print all the cards (which is difficult, because you have to go through every book, and sorting them by author, title, and subject is painful when your books fill several buildings), and then you have to set aside a special room near the library entrance to hold the card catalogue and tell everyone to check the cards first.
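As a rough sketch of the card catalogue idea in Python: the book records, the `(building, floor, shelf)` location tuples, and the function names below are all invented for illustration. Without a catalogue you scan every book; with one, you pay once to build a lookup table per sort key and then jump straight to the location.

```python
# Hypothetical sample data: each location is (building, floor, shelf).
books = [
    {"title": "Dune", "author": "Frank Herbert", "location": (3, 2, 14)},
    {"title": "Emma", "author": "Jane Austen", "location": (1, 1, 2)},
    {"title": "Persuasion", "author": "Jane Austen", "location": (4, 3, 7)},
]


def scan_all_buildings(title):
    """No catalogue: walk past every book until the title matches."""
    for book in books:
        if book["title"] == title:
            return book["location"]
    return None


def build_catalogue(books, key):
    """Write one 'card' per book, filed under the chosen key."""
    catalogue = {}
    for book in books:
        catalogue.setdefault(book[key], []).append(book["location"])
    return catalogue


# One set of drawers per sort key, just like title and author cards.
by_title = build_catalogue(books, "title")
by_author = build_catalogue(books, "author")
```

Here `by_title["Dune"]` returns the location directly, while `scan_all_buildings("Dune")` has to pass every book on the way. The cost is that each catalogue has to be built once and then kept up to date whenever a book is added or moved.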
Example three: Many people looking for the same book at the same time
Let's take a very simple example to illustrate the problem: your library is now very popular, and thousands of people are arriving at the same time every day. It is packed. That is not as ridiculous as it sounds: it is the most common problem when a web application suddenly takes off.
Now so many people want to read your books that they get stuck at the door. This sounds ridiculous, because it rarely happens in reality. But think about it: a real door lets about one person through per second. So if 20 people want to enter your library every second, a long queue soon forms at the door. More people keep coming, and the queue keeps growing. Eventually the number of people queuing at the door exceeds the number of people in the library, and most of the crowd can only wait outside without ever reading a book. The people who hate you because they could not get to a book far outnumber the satisfied ones who could, and bad word of mouth spreads.
An obvious solution is to open more doors in the walls. Open a second door, and twice as many people can get in! You open more doors, eventually hundreds of them, until every wall is full of doors. At that point you might as well remove the walls altogether! Now far more people can use the library: an order-of-magnitude jump!
But soon you have another problem. There is only a limited amount of space in front of each bookshelf, so only a limited number of people can stand there. Maybe they are fast: they scan the shelf, find the book they want, and leave. But they still need to stand in front of the shelf for a few seconds, and because your library is so popular, hundreds of people are looking for the same book (or for two books in the same vertical section), and they cannot all squeeze into the small space in front of the shelf.
Once again, long queues form, this time at the shelves: perhaps at every bookshelf, perhaps just at the one holding a bestseller. Here are some possible solutions:
If the crowd is only around the bestseller shelf, you just need to scatter the bestsellers throughout the library. But then the books are no longer in order; they are distributed randomly, so you need to rework your card catalogue so that people can still find them quickly. That is not too painful, because you only need to update the cards for the bestsellers.
If people are crowded everywhere, because all the books are popular or there are simply too many people, you can try adding replicas. That is, copy your entire library, rent new buildings on the other side of the city (or on the next block), and send a share of the visitors to the new library. You can do this several times to add more replicas. To make this work, you need to keep the replicas up to date, so every new book must have a current copy in every library. One solution is to designate one library as the "main" library: all new books arrive only there, and each time a new book comes in, you have someone copy it and send the copies to the other libraries by courier. The couriers also take up a lot of traffic capacity, which limits how many people can use your libraries, so you need to cap the number of visitors at each library, and when the number gets too high you have to create another library.
To summarize: at the very beginning, when you want to scale your library to handle more people, you just add a new door and double the capacity instantly. You can add another door to increase throughput again. For a while you can keep scaling by adding doors, until every wall is full of doors, that is, until you have removed all the walls. To keep scaling after that (remember, the number of people coming to your library never stops growing), you have to think of a new solution, such as building several replica libraries. That involves a lot of work: you have to copy all your books, rent new buildings, and come up with a sensible way of routing visitors so that each library gets a reasonable number. All of this is new infrastructure, and you cannot wait until you realize that adding doors no longer helps before you build it, because the number of visitors is still growing and the waiting crowd is unhappy. So you have to forecast traffic growth and start adding libraries early, before the strategy of opening new doors stops working.
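The arithmetic behind the growing queue is easy to check with a tiny simulation. This is a deliberately crude sketch: it assumes constant arrival and service rates in whole people per second, and the function name is invented for the example.

```python
def queue_length(arrivals_per_second, served_per_second, seconds):
    """Return how many people are still waiting after the given time."""
    waiting = 0
    for _ in range(seconds):
        waiting += arrivals_per_second             # new people join the queue
        waiting -= min(waiting, served_per_second)  # the door lets some through
    return waiting


# One door passing 1 person per second, 20 arrivals per second, one minute.
after_one_minute = queue_length(20, 1, 60)
```

With 20 arrivals and 1 person served per second, the queue grows by 19 people every second, so `after_one_minute` is 1140. As long as arrivals exceed what the door can pass, the queue never shrinks; only when the service rate matches the arrival rate does it stay empty.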
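A minimal sketch of the main-library-plus-couriers arrangement might look like the following. The class and method names are invented, and real replication protocols are far more involved; the point is only that new books reach the main library first and arrive at the other libraries later, one courier delivery at a time.

```python
from collections import deque


class Library:
    def __init__(self):
        self.books = set()

    def has(self, title):
        return title in self.books


class MainLibrary(Library):
    def __init__(self, branches):
        super().__init__()
        self.branches = branches
        self.in_transit = deque()  # copies the couriers have not delivered yet

    def add_book(self, title):
        # New books are accepted only here, then copied for every branch.
        self.books.add(title)
        for branch in self.branches:
            self.in_transit.append((branch, title))

    def run_couriers(self, deliveries):
        # Couriers have limited capacity: only so many drop-offs per round.
        for _ in range(min(deliveries, len(self.in_transit))):
            branch, title = self.in_transit.popleft()
            branch.books.add(title)
```

If the main library has two branches and only one delivery has been made after `add_book("Dune")`, the first branch answers `has("Dune")` with `True` while the second still answers `False`. Until the couriers catch up, visitors to different libraries see different collections.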
Example four: Adding many, many new books
All libraries have to keep up with the times, which means they must constantly add new books. Let's say you have a very active library, and hundreds of new books are being added all the time.
So you arrange for people to buy new books, copy them at the main library, and distribute them to the shelves. Traffic flows smoothly and you have enough libraries, so even with couriers traveling between libraries to deliver new books, access to your libraries is not slowed down. Great, everything looks fine, and you decide to take a vacation today.
Now, all the books in your library are kept in order. They are not placed on the shelves at random but arranged by author, title, or some other ordering (in reality, by the Dewey Decimal Classification, which must have seemed absurd when you first met it in primary school, but, strangely, once you understand these database problems you can see that it actually makes sense). In any case, the books are shelved in sequence, so when someone has used the card catalogue to locate the right bookshelf, he does not need to scan the whole shelf to find the book. Because every book on the shelf is in order, he can look at the book in the middle of the shelf, decide whether to continue to its left or to its right, and repeat the process recursively until he finds the book. The books must stay in order at all times so that users can find books this way.
The books on the shelves are arranged in order, and the shelves are full. That means that when a courier wants to put a new book on a shelf, he has to move the book at the end of that shelf to the next shelf, and then the book at the end of the next shelf has to move to the shelf after that, and so on. Eventually you have to redesign the shelves so that they have some gaps for new books and books no longer need to be moved to other shelves. That is annoying and time-consuming, and, more importantly, while you are rearranging a shelf you do not want other users or couriers taking books from or putting books on it. So you have to "lock" that shelf, and all the surrounding shelves too, in case books have to be moved onto them, and books from those shelves onto others, and so on.
This locking creates a huge traffic problem. The problem changes from someone looking for a book having to wait for the people in front of the shelf to leave, to many people having to wait outside a whole area of shelves while a courier inserts a new book and shifts the others. If there are many couriers, this happens often. Large numbers of people have to wait for a courier to finish, and, even worse, one courier may have to wait for another. In a real library new books do not arrive that often, but if you run a site where many people upload a lot of content, that is the equivalent of many couriers delivering new books to the library all the time, and you will run into exactly these problems.
At the same time, if the couriers are not fast enough, for example if they cannot deliver all the copies to all the libraries at the same time, then some people will be unable to find a book that others can find, and they will get conflicting information.
The solution to this problem is left as an exercise for the reader.
Keep in mind that solving this problem requires either a whole new way of working for the couriers (perhaps they can batch books destined for nearby positions, so that many books on one shelf are updated in a single trip) or a change in how the library is laid out (perhaps you can leave free space around all the books). But what happens when the free space fills up again? All in all, the lesson is this: if you do not have many couriers and do not receive many new books per day, none of this is a big deal, but once you cross a certain threshold (say, a flood of couriers and new books), you have to change your whole solution and physical layout, and that often requires imagination and creativity.
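The middle-of-the-shelf procedure described here is binary search. A small recursive sketch in Python, with an invented function name and a shelf modeled as a sorted list of titles:

```python
def find_book(shelf, title, low=0, high=None):
    """Return the position of title on a sorted shelf, or None if absent."""
    if high is None:
        high = len(shelf) - 1
    if low > high:
        return None                      # nothing left to check
    middle = (low + high) // 2
    if shelf[middle] == title:
        return middle
    if title < shelf[middle]:
        # The book can only be to the left of the middle one.
        return find_book(shelf, title, low, middle - 1)
    # Otherwise it can only be to the right.
    return find_book(shelf, title, middle + 1, high)


shelf = ["Dracula", "Dune", "Emma", "Hamlet", "Ulysses"]
position = find_book(shelf, "Ulysses")
```

Each step discards half of the remaining books, which is why the shelf has to stay sorted at all times: on an unsorted shelf, looking at the middle book tells you nothing about where to look next.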
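To see why full shelves are so costly, here is a small sketch that counts how many shelves one insertion touches, and would therefore have to be locked. It assumes fixed-capacity shelves that are sorted both within and across shelves; the function name and the sample shelves in the note below are invented for illustration.

```python
import bisect


def insert_book(shelves, capacity, title):
    """Insert title in order; return the number of shelves that were touched."""
    # Pick the first shelf whose last book sorts at or after the new title.
    index = 0
    while (index < len(shelves) - 1 and shelves[index]
           and shelves[index][-1] < title):
        index += 1
    bisect.insort(shelves[index], title)
    touched = 1
    # If the shelf overflows, push its last book onto the next shelf, and so on.
    while len(shelves[index]) > capacity:
        overflow = shelves[index].pop()
        if index + 1 == len(shelves):
            shelves.append([])           # rent a new shelf at the very end
        shelves[index + 1].insert(0, overflow)
        index += 1
        touched += 1
    return touched
```

With three full shelves of capacity 3, such as `[["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]`, inserting `"AA"` touches 4 shelves, because the overflow cascades all the way to a newly added shelf. With one free slot per shelf, as in `[["A", "B"], ["D", "E"], ["G", "H"]]`, the same insertion touches only 1. That is the motivation for leaving gaps, and also why the gaps eventually run out.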
This is why databases are difficult to scale.
In fact, this is why it is difficult to scale any moderately complex system (such as a web application, which comprises databases, other servers, and their interactions). Think of an entire research university, with labs, classrooms, dormitories, and so on: the library and its uses are just one component of the whole university. Rapid growth in the use of any one of these components can cause scaling problems. To overcome them, you usually have to rework the whole arrangement, redesigning your operating procedures (that is, the algorithms in your code) or your resource layout. Doing so always takes creative effort, and the underlying details differ from system to system, so there are only a few general-purpose solutions, some of which are mentioned above, and you often have to adapt them to your situation. Maybe it is hard for you to rent buildings for replica libraries. Maybe your books are huge encyclopedias, so a courier cannot carry many at once. Maybe your climate is cold, so you cannot open many doors or your customers will freeze. A solution that works in one environment does not fully carry over to another, so every time you have to adapt the solution to the environment at hand. Keen observation is essential for spotting the problem and finding the solution that fits it best.
Finally, remember that while you are redesigning the solution, more and more customers will be knocking at your door and shouting at you, because the queue is too long and they cannot get to the books they want to read!
Oh, and one more thing. Your library may empty your wallet and make your heart bleed, because you have to buy enough shelves and rent enough buildings, so how are you going to start making money? No VC wants to fund you, because you let customers borrow books for free. And you are not even willing to put ads in the centerfold?
Thanks for watching! ^_^