How does Baidu give everyone 2TB storage space for free?

Last Update:2018-07-26 Source: Internet

Author: User

Tags md5

A while ago, while using the Baidu network disk, I suddenly noticed that it was offering 2TB of space for free.

Most people have used a network hard disk at one time or another. In an era when everything is moving to the cloud, it is a very handy tool. For us penniless free users, though, disk space has always been a sore point. When I first started, I jumped through all kinds of hoops for space (doing the provider's so-called tasks), and in the end I expanded my quota by only about 5 GB. Now 2 TB is handed out just like that.

So how is this sudden 2 TB of space made possible?

Here is how it actually works.

Suppose I want to provide each user with 1 GB of network storage space.

Suppose the server has a 1000 GB hard disk that can be devoted entirely to user data. If each user is allocated a maximum of 1 GB, how many users can the server support?

You would surely say 1000 / 1 = 1000 users.

But if you allocate it this way, you will find that hardly any user fills the full 1 GB. A few do, but the average user uploads only about 50 MB of files. So if you share the 1000 GB disk among 1000 people, only 50 MB × 1000 = 50 GB is actually used, and the remaining 950 GB is completely wasted.

So how do we solve this?

You can instead allocate this 1000 GB to 20,000 users. Each user's upload limit is still 1 GB, but each still uploads only about 50 MB on average, so 20,000 × 50 MB = 1000 GB and the valuable server storage is fully used. The worry is that, with 20,000 users, a sudden burst of uploads would fill the disk and users would discover that the 1 GB they were promised is not really there. So you do not sign up quite that many: you allocate to only 19,000 people and keep some space in reserve for emergencies.
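The arithmetic so far can be checked with a few lines of Python. This is only a sketch of the article's numbers: the function names are made up for illustration, the units follow the article (1 GB = 1000 MB), and the 50 GB reserve is an assumption chosen so that the result matches the 19,000 users mentioned above.

```python
# Units follow the article's arithmetic: 1 GB = 1000 MB.
DISK_MB = 1000 * 1000   # one server with a 1000 GB disk
QUOTA_MB = 1000         # advertised limit per user (1 GB)
AVG_USED_MB = 50        # what a typical user actually stores


def users_by_quota(disk_mb, quota_mb):
    """Users supported if every user's full quota is reserved up front."""
    return disk_mb // quota_mb


def users_by_usage(disk_mb, avg_used_mb, reserve_mb=0):
    """Users supported if only the average real usage is budgeted."""
    return (disk_mb - reserve_mb) // avg_used_mb


naive = users_by_quota(DISK_MB, QUOTA_MB)
oversold = users_by_usage(DISK_MB, AVG_USED_MB)
# Assumed reserve of 50 GB kept free for sudden bursts of uploads.
with_reserve = users_by_usage(DISK_MB, AVG_USED_MB, reserve_mb=50 * 1000)

print(naive, oversold, with_reserve)
```

The three values correspond to reserving the full quota, budgeting only for average usage, and budgeting for average usage while holding back an emergency reserve.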

Suddenly the number of users you can serve has grown to 19 times what it was. Great. Is there a way to use the space even more effectively?

Suppose I have 1000 servers, each with 1000 GB of space, and on each server I leave 50 GB empty in case users suddenly upload a lot of data. Then 1000 × 50 GB = 50,000 GB sits idle across the 1000 servers, which is a pity. So the engineers invented the storage cluster, in which one user's data can be spread across multiple servers while the user still sees a single contiguous 1 GB space. There is then no need to keep a reserve on every server; you can even fill one server completely and then put new data on the next one. This maximizes the utilization of server space. If at some point the administrator finds that users are uploading like crazy (unlikely with a large user base) and the space is running short, that is fine too: just add a few hard disks or servers.
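The "fill one server, then continue on the next" idea can be sketched as a toy model. This is an illustration only, not how any real storage cluster is implemented: the `Cluster` class and its methods are hypothetical, and data is represented by its size in MB rather than by real bytes.

```python
class Cluster:
    """Toy storage cluster: fill one server, then spill onto the next."""

    def __init__(self, server_count, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = [0] * server_count

    def add_server(self):
        """Grow the cluster when space runs short."""
        self.used_mb.append(0)

    def store(self, size_mb):
        """Place size_mb of data; return a list of (server_index, mb) pieces."""
        free_total = sum(self.capacity_mb - used for used in self.used_mb)
        if size_mb > free_total:
            raise RuntimeError("cluster full: add disks or servers")
        pieces = []
        remaining = size_mb
        for index, used in enumerate(self.used_mb):
            if remaining == 0:
                break
            take = min(self.capacity_mb - used, remaining)
            if take > 0:
                self.used_mb[index] += take
                pieces.append((index, take))
                remaining -= take
        return pieces


cluster = Cluster(server_count=2, capacity_mb=100)
first = cluster.store(80)    # fits entirely on server 0
second = cluster.store(50)   # 20 MB tops up server 0, 30 MB spills to server 1
print(first, second, cluster.used_mb)
```

No server keeps its own reserve here; the only reserve is whatever is still free across the whole cluster, and `add_server` stands in for the administrator plugging in more hardware.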

All right. Our server space utilization is now much higher, and a given amount of space can serve far more users. But is there an even better scheme?

One day the administrator noticed that even though each user stores 50 MB on average, that 50 MB does not arrive overnight; it accumulates slowly over one to two years of use. A newly registered user uploads nothing at first, or only something very small. If I set aside 50 MB for every user from the start, then even if they do fill it over the next two years, much of that space is wasted in the meantime. So the clever engineers said: since we already have distributed, clustered storage and a user's data can be spread across multiple servers, let us give a newly registered user 0 MB of actual space to begin with and provide storage only as they use it. That way none of the disk is held idle. The user's front end, however, still displays 1 GB.
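The gap between what the front end displays and what the disk actually holds can be sketched as follows. The `ThinAccount` class is hypothetical and the upload sizes are invented for illustration; the point is only that the quota is a number shown to the user, while real space is consumed on upload.

```python
class ThinAccount:
    """Account whose displayed quota is separate from real disk usage."""

    def __init__(self, quota_mb=1000):
        self.quota_mb = quota_mb   # what the front end displays (1 GB)
        self.used_mb = 0           # what is actually consumed on disk

    def upload(self, size_mb):
        """Consume real space only when data arrives."""
        if self.used_mb + size_mb > self.quota_mb:
            raise ValueError("quota exceeded")
        self.used_mb += size_mb


accounts = [ThinAccount() for _ in range(5)]
accounts[0].upload(30)   # an active user
accounts[1].upload(5)    # a user who uploaded one small file

advertised_mb = sum(account.quota_mb for account in accounts)
consumed_mb = sum(account.used_mb for account in accounts)
print(advertised_mb, consumed_mb)
```

Five new accounts advertise 5000 MB between them but cost the provider only what has really been uploaded.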

With the engineers' idea, when I first set up the network disk I can use a single 1000 GB server to let about 1,000,000 people register and use the service. As registrations grow, I make money and can keep adding servers to cover their later storage needs. And because some of those servers are bought more than a year later, my purchase cost per server has come down too.

So... is that the end of it?

For an email provider, this level of utilization would be high enough. But a network disk is different.

The clever engineers noticed that, with email, the vast majority of message content and attachments are created by the users themselves and differ from one another, whereas much of what we upload to a network disk is duplicated.

For example, Zhang San downloads a video called "Tokyo Hot" today and uploads it to his network disk, and three days later John Doe downloads the same "Tokyo Hot" and uploads it to his. As the user base grows, you will find that 1000 people have uploaded 1000 identical copies of the file to your valuable server space. So the engineers came up with a solution: since the files are identical, store just one copy, and have the front end show each user as having their own copy. When a user deletes the file, it is not really deleted; it merely disappears from that user's front end, while the back end keeps it so that the other users who own it can still download it. Only when every user who holds the file has deleted it is the file actually removed.
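The "store one copy, delete only when the last owner deletes" behaviour is essentially reference counting. A minimal sketch, assuming for simplicity that identical files are recognized by comparing their raw bytes; the `SharedFileStore` class and the user names in the usage lines are hypothetical.

```python
class SharedFileStore:
    """Keep one stored copy per distinct content, shared by all its owners."""

    def __init__(self):
        self.ref_count = {}   # content -> how many users hold it
        self.listing = {}     # user -> contents shown in that user's front end

    def upload(self, user, content):
        held = self.listing.setdefault(user, set())
        if content in held:
            return
        held.add(content)
        self.ref_count[content] = self.ref_count.get(content, 0) + 1

    def delete(self, user, content):
        held = self.listing.get(user, set())
        if content not in held:
            return
        held.remove(content)              # disappears from this user's view
        self.ref_count[content] -= 1
        if self.ref_count[content] == 0:
            del self.ref_count[content]   # last owner gone: free the space

    def stored_copies(self):
        return len(self.ref_count)


store = SharedFileStore()
video = b"the same video bytes"
store.upload("zhang_san", video)
store.upload("john_doe", video)
copies_after_uploads = store.stored_copies()

store.delete("zhang_san", video)
copies_after_first_delete = store.stored_copies()

store.delete("john_doe", video)
copies_after_last_delete = store.stored_copies()
print(copies_after_uploads, copies_after_first_delete, copies_after_last_delete)
```

Two uploads occupy one stored copy, the first delete only hides the file from one user, and the space is released only after the second delete.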

As more data is stored and more users register, more and more of the uploaded data is duplicate, so this duplicate detection becomes more and more effective. By this reckoning, the non-duplicate data each user uploads averages only about 1 MB. That lets you serve more than 50 times as many users with your limited space.

But as the service is used, you notice a pattern:

Zhang San uploads "TOKYO hot N0124" and John Doe uploads "TH n124". They are the same file, but because the names differ I cannot tell that they are the same. Could I store the file once and simply keep a different file name for each user? Yes, but this relies on an algorithm for recognizing identical content, such as comparing MD5 values. As long as two files have the same MD5 value and the same file size, I treat them as the same file, store a single copy, and record a different file name for each user.
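Identifying a file by its MD5 value plus its size, independent of its name, might look like the sketch below. It uses Python's standard `hashlib.md5`; the `Catalog` class, the user names, and the file names are hypothetical. Note that MD5 is not collision-resistant, so this mirrors the article's description rather than a recommendation; a real service might well use a stronger hash.

```python
import hashlib


def fingerprint(data: bytes):
    """Identify content by (MD5 hex digest, size in bytes), ignoring names."""
    return hashlib.md5(data).hexdigest(), len(data)


class Catalog:
    """Store each distinct content once; keep per-user file names separately."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> bytes, stored once
        self.names = {}   # (user, file_name) -> fingerprint

    def upload(self, user, file_name, data):
        fp = fingerprint(data)
        self.blobs.setdefault(fp, data)      # keep only the first copy
        self.names[(user, file_name)] = fp   # each user keeps their own name
        return fp

    def download(self, user, file_name):
        return self.blobs[self.names[(user, file_name)]]


catalog = Catalog()
payload = b"identical video content"
fp_a = catalog.upload("zhang_san", "Holiday Video N0124.mp4", payload)
fp_b = catalog.upload("john_doe", "HV n124.mp4", payload)
print(fp_a == fp_b, len(catalog.blobs), len(catalog.names))
```

Both uploads map to the same fingerprint, so one blob is stored while each user still sees the file under the name they chose.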

One day you find that computing the MD5 value of every file puts a heavy load on the server's CPU, and that uploading an entire file just to check whether it is a duplicate wastes bandwidth. Can this be improved?

The clever engineers wrote a small program or plug-in, grandly named the "Upload Control", which moves the MD5 calculation onto the uploading user's computer. If the data the user is about to upload matches data already stored on the server, nothing is uploaded at all; the server simply records against the user's account that the file was uploaded successfully under file name XX. The whole process is almost instantaneous, and it was given a fancy name: "second pass" (an upload that completes in about a second).
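The exchange between the upload plug-in and the server can be sketched as below. Everything here is hypothetical: the `Server` class, its methods, and `client_upload` are stand-ins for illustration, and the server simply trusts the fingerprint the client reports, which a real service would presumably need to guard against.

```python
import hashlib


class Server:
    """Toy back end that tracks how many bytes were really transferred."""

    def __init__(self):
        self.blobs = {}          # (md5, size) -> bytes
        self.names = {}          # (user, file_name) -> (md5, size)
        self.bytes_received = 0

    def has(self, fp):
        return fp in self.blobs

    def link(self, user, file_name, fp):
        """Record that the user owns this content under this file name."""
        self.names[(user, file_name)] = fp

    def receive(self, user, file_name, fp, data):
        self.bytes_received += len(data)
        self.blobs[fp] = data
        self.link(user, file_name, fp)


def client_upload(server, user, file_name, data):
    # The hash is computed on the user's machine, not on the server.
    fp = (hashlib.md5(data).hexdigest(), len(data))
    if server.has(fp):
        server.link(user, file_name, fp)   # no file bytes are sent
        return "instant"
    server.receive(user, file_name, fp, data)
    return "full"


server = Server()
data = b"x" * 1000
first_result = client_upload(server, "zhang_san", "a.bin", data)
second_result = client_upload(server, "john_doe", "b.bin", data)
print(first_result, second_result, server.bytes_received)
```

Only the first upload transfers the file; the second sends just the fingerprint and file name, which is why it appears to finish almost immediately.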

After all the steps above, you find that space which originally could serve only 1000 users can now, with the 1 GB displayed to each user unchanged, serve nearly 1 million users.

So, if you are in a good mood, you announce: I am raising the maximum storage space per user to 1 TB. Since each user still uploads only about 50 MB on average, and only a few users ever upload more than the original 1 GB, you will find that the cost is almost negligible.

The hard-working engineers are still tirelessly digging for ways to use the servers' disk space even more efficiently...

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
