Why can domestic web-disk companies compete at the TB level when the cost seems so high?

Source: Zhihu
Author: Zhihu user
Link: http://www.zhihu.com/question/21591490/answer/18762821
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.

@Jerry has already explained a very critical step; I'll add some detail and make it easier to understand.

Suppose I want to provide 1 GB of network storage space to each user.

If the server has a 1000 GB hard disk that can be used entirely for user data, and each user is allocated at most 1 GB of storage, how many users can it serve?

You'll probably say 1000/1 = 1000 users.

But if you actually allocate it that way, you'll find that almost no user uploads a full 1 GB; very few do, and the average user uploads only about 50 MB. So you've given the 1000 GB disk to 1000 people, but you're effectively using only 50 MB × 1000 = 50 GB of it, and the remaining 950 GB is basically wasted.

So how do we solve this?

You can allocate this 1000 GB to 20,000 users instead. Each user's upload limit is still 1 GB, but each typically uploads only 50 MB, and 20,000 × 50 MB = 1000 GB, so the valuable server storage is fully utilized. But you'd be afraid to hand it out to all 20,000: if a burst of users suddenly uploaded a lot of data at once, wouldn't they discover that the 1 GB you promised them is partly fictitious? So you allocate it to only 19,000 users, leaving some space in reserve for emergencies.
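The oversubscription arithmetic above can be sketched in a few lines. This is just the article's own numbers restated; the 50 MB average and the reserve size come straight from the text (the article rounds 1 GB to 1000 MB, so the sketch does too):

```python
# Back-of-the-envelope oversubscription math, using the article's numbers.
DISK_MB = 1_000_000    # one 1000 GB disk, with 1 GB taken as 1000 MB
QUOTA_MB = 1000        # advertised per-user quota (1 GB)
AVG_USAGE_MB = 50      # what a typical user actually uploads
RESERVE_MB = 50_000    # 50 GB of headroom kept for upload bursts

naive_users = DISK_MB // QUOTA_MB                       # allocate by quota
oversold_users = (DISK_MB - RESERVE_MB) // AVG_USAGE_MB # allocate by real usage
print(naive_users, oversold_users)  # 1000 19000
```

The ratio between the two numbers is exactly the "19 times" improvement the next paragraph celebrates.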

Suddenly you find you can serve 19 times as many users. Great. Is there an even more efficient way?

Suppose I have 1000 servers, each with 1000 GB of space. If every server reserves 50 GB of emergency headroom for sudden large uploads, then across 1000 servers I'm wasting 1000 × 50 GB = 50,000 GB. What a pity. So we invented the storage cluster: one user's data can be spread across multiple servers while still appearing to the user as a single contiguous 1 GB space. Then there's no need to keep a reserve on every server; you can even fill one server completely before spilling data onto the next. This maximizes space utilization, and if the administrator ever finds users uploading like crazy (with a large user base, this has a small probability) and space running low, no problem: just add a few hard drives or a few more servers.
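The "fill one server completely, then spill onto the next" placement can be illustrated with a toy allocator. The server names and chunk sizes here are invented for the example; a real cluster would track placement in a metadata service:

```python
# Toy first-fit placement: fill each server before touching the next,
# so no per-server emergency reserve is needed.
servers = [{"name": f"srv{i}", "capacity_gb": 1000, "used_gb": 0} for i in range(3)]

def place_chunk(size_gb):
    """Store a chunk on the first server that still has room."""
    for s in servers:
        if s["used_gb"] + size_gb <= s["capacity_gb"]:
            s["used_gb"] += size_gb
            return s["name"]
    raise RuntimeError("cluster full -- time to add disks or servers")

# Six 400 GB chunks: srv0 fills up first, then data spills to srv1, srv2.
locations = [place_chunk(400) for _ in range(6)]
print(locations)
```

To the user who owns these chunks, the front end still presents one contiguous space; only the back end knows the data lives on several machines.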

Okay, so far so good. Server space utilization is now much higher, and a given amount of space can be allocated to the largest possible number of users. Is there any further improvement?

One day, the administrator notices that although each user ends up storing only about 50 MB, they don't reach that amount overnight; it accumulates slowly over one or two years of use. A newly registered user uploads nothing at all, or only something tiny. So pre-allocating 50 MB per user still wastes space, even if they will fill it over the next two years. The clever engineer says: since we already have distributed, clustered storage and a user's data can be spread across many servers, let's give every newly registered user 0 MB of physical space to start with, and grow their physical allocation only as they actually upload. That keeps the disks fully used. Of course the user's front end still displays 1 GB, but that 1 GB of "space" is just a number.
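This is thin provisioning: the quota is bookkeeping, and real space is committed only on write. A minimal sketch, with the class and field names invented for illustration:

```python
# Thin provisioning sketch: the advertised quota is just a number shown
# in the UI; physical space is allocated only when data actually arrives.
QUOTA_MB = 1024  # what the front end shows every user

class Account:
    def __init__(self):
        self.used_mb = 0            # physically allocated so far: nothing

    def upload(self, size_mb):
        if self.used_mb + size_mb > QUOTA_MB:
            raise ValueError("quota exceeded")
        self.used_mb += size_mb     # allocate only on write

u = Account()
print(QUOTA_MB - u.used_mb)  # the UI shows 1024 MB free...
u.upload(50)
print(u.used_mb)             # ...but only 50 MB of real disk is consumed
```

Summing `used_mb` across all accounts, rather than summing quotas, is what tells the administrator when it's genuinely time to buy more disks.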

With this engineer's idea, I can launch my network disk with a single 1000 GB server and support roughly a million registered users. As registrations grow, money comes in, and I can keep adding servers to store their later uploads. Even better, because part of the hardware is bought a year or more down the road, its purchase price has already come down by then.

So... is that the end of it? For an email provider, this level of utilization would be high enough. But a network disk is different.


The clever engineers noticed a key difference from email: the messages and attachments uploaded to a mail server are mostly self-created and rarely alike, but much of what people upload to a network disk is duplicated.

For example: Zhang San downloads a copy of "Tokyo Hot" today and uploads it to his network disk; three days later, Li Si downloads the same "Tokyo Hot" and uploads it to his. As the user base grows, you find that 1000 people have uploaded 1000 copies of the same file into your precious server space. So the engineer comes up with a plan: since it's the same file, store only one copy, and simply show each user on the front end that they have their own. When a user wants to delete the file, don't really delete it; make it disappear from that user's front-end view, but keep the back-end copy so the other users who own it can still download it. Only when every user who has the file has deleted it do I truly delete it.
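This "one physical copy, many logical owners, delete only when the last owner leaves" scheme is reference counting. A minimal sketch, keyed directly by file content for simplicity (the article's next step replaces the content key with a hash); the usernames and file names are invented:

```python
# Reference-counted deduplicated store: one physical copy per distinct
# content, deleted only when its last logical owner removes it.
store = {}  # file content -> {"refs": {(user, filename), ...}}

def upload(user, name, content):
    entry = store.setdefault(content, {"refs": set()})
    entry["refs"].add((user, name))    # the owner just gets a pointer

def delete(user, name, content):
    entry = store[content]
    entry["refs"].discard((user, name))
    if not entry["refs"]:              # last owner gone: really delete
        del store[content]

upload("zhangsan", "tokyo_hot.avi", b"...movie bytes...")
upload("lisi", "th.avi", b"...movie bytes...")
print(len(store))   # one physical copy serves both owners
delete("zhangsan", "tokyo_hot.avi", b"...movie bytes...")
print(len(store))   # still stored: lisi hasn't deleted his copy yet
```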

In this way, as more data is stored, more users register, and more duplicates are uploaded, you find that this duplicate-detecting storage grows ever more efficient. Effectively, each user now consumes only about 1 MB of real storage on average, which lets your limited space serve more than 50 times as many users again.

But as usage continues, you notice a pattern:

Zhang San uploads "TOKYO HOT N0124" and Li Si uploads "TH N124": the same file under different names, so I can't tell they're one file. Could I store just one copy and simply record a different file name for each user? Yes, but that requires a way to recognize identical content, for example comparing MD5 values. If two files have the same MD5 value and the same size, I treat them as the same file, store one copy, and give each user their own file name for it.
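With hashing, the store becomes content-addressed: files are keyed by (MD5, size) instead of by name, so the same bytes under different names land on one stored copy. A sketch using Python's standard `hashlib`:

```python
# Content-addressed storage: identical bytes map to one (md5, size) key,
# regardless of what each user named the file.
import hashlib

blobs = {}  # (md5_hex, size) -> bytes

def store_file(content: bytes):
    key = (hashlib.md5(content).hexdigest(), len(content))
    blobs.setdefault(key, content)  # physically store only the first copy
    return key                      # each user's file name maps to this key

k1 = store_file(b"same movie bytes")  # Zhang San's "TOKYO HOT N0124"
k2 = store_file(b"same movie bytes")  # Li Si's "TH N124"
print(k1 == k2, len(blobs))           # True 1
```

(MD5 is what the article names; modern systems prefer a stronger hash such as SHA-256, since deliberate MD5 collisions are feasible.)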

One day you find that computing the MD5 value of every file puts a heavy load on the server's CPU, and files the server already has are still wastefully uploaded over the network just to be checked for consistency. Can this be improved?

The clever engineer writes a small program, a plug-in grandly named the "Upload Control," which moves the MD5 computation onto the user's own computer. If the data a user is about to upload turns out to be identical to data the server already stores, nothing is uploaded at all; the file is simply marked as successfully uploaded under the user's chosen file name. The whole process takes almost no time, and it gets a suitably flashy name: "instant upload."
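The "instant upload" handshake can be sketched as a two-step protocol: the client hashes locally, asks the server whether it already has that digest, and only transfers bytes the server has never seen. Function names here are invented for illustration:

```python
# "Instant upload" sketch: hash on the client, transfer only new content.
import hashlib

server_blobs = {}  # md5_hex -> bytes already stored on the server

def server_has(md5_hex):
    """Cheap metadata lookup -- no file data crosses the network."""
    return md5_hex in server_blobs

def client_upload(content: bytes):
    digest = hashlib.md5(content).hexdigest()  # computed on the client's CPU
    if server_has(digest):
        return "instant"            # just record this user as an owner
    server_blobs[digest] = content  # only genuinely new data uses bandwidth
    return "full upload"

print(client_upload(b"popular movie"))  # full upload
print(client_upload(b"popular movie"))  # instant
```

This is why uploading a popular file to a real network disk can finish in a second: the bytes were already there, and only the hash traveled.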


Through all the steps above, a setup that could originally serve only 1,000 users can, after these many improvements, serve nearly a million, while the 1 GB of space shown on the user side never changes.

So one day, if you're in a good mood, you announce: I'm raising the maximum storage space per user to 1 TB. Each user still uploads only about 50 MB on average, and only a very few individual users ever exceed 1 GB of genuinely new data, so you find the cost of the announcement is almost negligible.

And the hard-working engineers are still digging for even more efficient ways to use the disk space the servers provide...
