Storage mechanism of network cloud disk

Source: Internet
Author: User

I had already guessed at the storage mechanism behind network cloud disks; today I read an analysis on a blog (original text below) that largely confirmed my conjecture. In summary, the storage mechanisms of current domestic network cloud disks rest mainly on the following three points:

    1. Study and analyze users' habits and behavior patterns, which lets you roughly estimate how much storage an average user actually needs per day/month/year;
    2. Use storage clusters (a cloud computing solution) and distributed storage to hold user files, which handles both sudden spikes in demand and the slow, gradual growth of each user's data;
    3. Solve the problem of users storing duplicate files by computing an MD5 value for each stored file and keeping only one copy of each unique file, offloading the MD5 computation by inducing users to run the client software.

The storage principle of the network disk

Original link: http://club.mil.news.sohu.com/it/thread/1r0702kmyo0

A while back, when the big domestic companies were making a huge fuss over network disks, plenty of people wrote on their blogs things like "claimed xxT of capacity today", "xxxx for 1 yuan", "free upgrade to xxT" and so on. In all the excitement, did you never notice that Dropbox, the originator of the network disk, still offers only a tiny amount of capacity? Have you ever wondered why? I just want to say: too young, too simple~
The truth is this ~
Say I want to provide 1 GB of network storage space for each user.
If a server has a 1000 GB hard disk that can be devoted entirely to user data, and each user is allocated at most 1 GB of storage, how many users can it serve?
You would say 1000 / 1 = 1000 users.
But in fact, if you allocate it that way, you will find that hardly any user uploads a full 1 GB. Usage varies, but the average user uploads only about 50 MB of files. In other words, you have handed the 1000 GB disk to 1000 people, yet only 50 MB × 1000 = 50 GB of it is effectively used; the remaining 950 GB is essentially wasted.
So how do you solve this?
You could instead allocate this 1000 GB of space to 20,000 users. Each person's upload limit is still 1 GB, but since each person typically uploads only 50 MB of data, 20,000 × 50 MB = 1000 GB, and the valuable server storage is fully utilized. But you worry: if you hand it out to 20,000 people and at some moment they suddenly upload more data, won't users discover that the 1 GB of space you promised them is fake? So you can't allocate to quite that many people; you sign up only 19,000 and keep some space in reserve for emergencies.
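A rough Python sketch of the arithmetic above (the 1000 GB disk, 1 GB quota, 50 MB average usage and 50 GB reserve are the figures from this text; nothing here is any provider's real policy, and the text treats 1 GB as 1000 MB):

    # Back-of-the-envelope oversubscription, using the numbers in the text.
    disk_mb = 1000 * 1000     # one 1000 GB server
    quota_mb = 1000           # 1 GB quota advertised to each user
    avg_use_mb = 50           # what an average user actually uploads
    reserve_mb = 50 * 1000    # 50 GB held back for sudden upload spikes

    naive_users = disk_mb // quota_mb                        # 1000 users
    oversold_users = (disk_mb - reserve_mb) // avg_use_mb    # 19000 users
    print(naive_users, oversold_users)
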
Suddenly you find you can serve 19 times as many users. Great. Is there an even more efficient way to use the space?
Suppose I have more than 1000 servers, each with 1000 GB of space. If every server has to keep 50 GB empty as a buffer for the moment when users suddenly upload a lot of data, then across 1000 servers I am leaving 1000 × 50 GB = 50,000 GB idle. What a waste. So the storage cluster was invented: a single user's data can be spread across multiple servers, while to the user it still looks like one contiguous 1 GB of space. Then there is no need to set aside a contingency buffer on every server; you can even fill one server completely and simply continue writing data onto the next one. This keeps server space at maximum utilization, and if the administrator ever finds that users are uploading like crazy (with a large user base, a fairly unlikely event) and space is running out, no problem: just add a few more hard disks or servers.
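A toy illustration of that clustering idea in Python: servers are filled one after another, a single user's files may land on different machines, and the user only ever sees one logical volume. The class and method names are invented for illustration.

    # Toy placement for a storage cluster: fill servers sequentially instead of
    # reserving an emergency buffer on every one of them.
    class Cluster:
        def __init__(self, num_servers, capacity_mb):
            self.free = [capacity_mb] * num_servers    # free space per server
            self.placement = {}                        # (user, filename) -> server index

        def store(self, user, filename, size_mb):
            for i, free_mb in enumerate(self.free):
                if free_mb >= size_mb:                 # first server with room wins
                    self.free[i] -= size_mb
                    self.placement[(user, filename)] = i
                    return i
            raise RuntimeError("cluster full: add more disks or servers")

    cluster = Cluster(num_servers=3, capacity_mb=1000)
    cluster.store("zhangsan", "a.avi", 700)    # lands on server 0
    cluster.store("zhangsan", "b.avi", 700)    # lands on server 1: one user, two machines
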
All right. Now our server space utilization is much higher, and a fixed amount of space can be allocated to far more users. Is there any further improvement?
One day, the administrator notices that even though each user ends up with only about 50 MB of data, that 50 MB does not arrive overnight; it accumulates slowly over one or two years of use. In other words, a newly registered user uploads nothing at first, or only a few very small things. So if I pre-allocate 50 MB for every user, even though they will fill it over the next two years, that space sits wasted for a long time. The clever engineer therefore says: since we already have distributed, clustered storage, and a user's data can live on any servers, let's give a newly registered user 0 MB to start with and hand out physical space only as they actually use it, so the disks stay fully utilized. The user's front end, of course, still displays 1 GB.
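A minimal Python sketch of that allocate-only-when-used scheme (essentially what is usually called thin provisioning; the 1 GB quota and single 1000 GB server follow the text, everything else is illustrative):

    # Every account *displays* a 1 GB quota, but physical space is claimed
    # only when the user actually uploads something.
    QUOTA_MB = 1024

    class Pool:
        def __init__(self, physical_mb):
            self.free_mb = physical_mb

        def allocate(self, size_mb):
            if size_mb > self.free_mb:
                raise RuntimeError("pool exhausted: time to buy more servers")
            self.free_mb -= size_mb

    class Account:
        def __init__(self):
            self.used_mb = 0            # a fresh account consumes no real disk at all

        def upload(self, size_mb, pool):
            if self.used_mb + size_mb > QUOTA_MB:
                raise ValueError("over the displayed quota")
            pool.allocate(size_mb)      # physical space is taken only now
            self.used_mb += size_mb

    pool = Pool(physical_mb=1000 * 1000)                # one 1000 GB server
    accounts = [Account() for _ in range(1_000_000)]    # a million sign-ups cost nothing yet
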
Thanks to the engineer's idea, when I first launch the network disk, a single 1000 GB server can support roughly 1,000,000 people registering and using it. As registrations grow, I earn money and can keep adding servers to provide the storage they will need later. And since part of the servers are bought a year or more down the road, my purchase costs have come down by then as well.
So... Is this the end of it?
For an email provider, this level of utilization would already be high enough. But a network disk is different.
The clever engineers noticed: unlike email, where most attachments are things people created themselves and are therefore all different, a great deal of what we upload to a network disk is duplicated.
For example: Zhang San downloads a copy of "Tokyo Hot" today and uploads it to his network disk; three days later Li Si downloads the same "Tokyo Hot" and uploads it to his. As users multiply, you find that 1000 people have uploaded 1000 copies of the same file into your precious server space. So the engineer comes up with a trick: since it is the same file, I will store only one copy, and each user's front end will simply show that they have their own. When a user wants to delete the file, I do not really delete it; I only make it disappear from that user's view, while the backend keeps it for the other users who still own it. Only when every user who holds this file has deleted it do I actually delete it.
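A minimal sketch of that "store once, really delete only when the last owner deletes" idea, using a reference count per stored file (the structure is illustrative, not any provider's actual design):

    # One physical copy per unique file; a user's "delete" only drops their
    # reference. The bytes disappear when the last reference does.
    class DedupStore:
        def __init__(self):
            self.blobs = {}                      # content key -> [data, reference count]

        def add(self, key, data):
            if key in self.blobs:                # someone already uploaded this content
                self.blobs[key][1] += 1
            else:
                self.blobs[key] = [data, 1]

        def delete(self, key):
            self.blobs[key][1] -= 1
            if self.blobs[key][1] == 0:          # last owner is gone: really delete
                del self.blobs[key]
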
In this way, as more data is stored and more users register, the uploads become more and more repetitive, and this duplicate detection grows ever more efficient. It works out as if each user uploads, on average, only about 1 MB of genuinely unique files, which lets your limited space serve more than 50 times as many users again.
But as this runs, you notice a pattern:
The "TOKYO HOT N0124" uploaded by Zhang San and the "TH N124" uploaded by Li Si are the same file, just under different names, so I fail to recognize them as one file. Can't I store just one copy and simply record a different file name for each user? Yes, but that requires some way of identifying identical content, such as an MD5 value. As long as two files have the same MD5 value and the same size, I treat them as the same file: I only need to keep one copy and give different users their own file names for it.
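A sketch of how such a content key might be computed in Python; the criterion (same MD5 value plus same file size) is the one stated above, and hashing in chunks is just a practical detail:

    import hashlib
    import os

    def content_key(path):
        # Identify a file by (MD5 digest, size), per the criterion in the text.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):   # hash 1 MB at a time
                md5.update(chunk)
        return (md5.hexdigest(), os.path.getsize(path))

    # "TOKYO HOT N0124.avi" and "TH N124.avi" with identical bytes yield the
    # same key, so only one physical copy needs to be stored.
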
One day you find that computing the MD5 value of every file puts a heavy load on the CPU, and that files which turn out to be duplicates have still wasted bandwidth being uploaded just so their consistency could be checked. Can this be improved?
So the clever engineer writes a small piece of software, or a small plug-in, grandly named the "upload control", which pushes the work of computing the MD5 onto the uploading user's own machine. Once the computation shows that the data the user is about to upload is identical to something already stored on the server, nothing is uploaded at all; the file is simply marked in the user's account as successfully uploaded under the given file name. The whole process feels almost instantaneous, and it is given the impressive-sounding name "instant upload" (the "second pass")!
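A sketch of that handshake: the client hashes the file locally and asks the server whether the content already exists before sending any bytes. The server methods (has_blob, put_blob, link) are hypothetical names used only for illustration:

    import hashlib
    import os

    def upload(server, user_id, path, filename):
        # The "upload control": compute the content key on the user's machine.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                md5.update(chunk)
        key = (md5.hexdigest(), os.path.getsize(path))

        if server.has_blob(key):                 # identical bytes already stored
            server.link(user_id, filename, key)  # just record the new owner and name
            return "instant upload: no bytes transferred"

        with open(path, "rb") as f:
            server.put_blob(key, f.read())       # genuinely new content: send it
        server.link(user_id, filename, key)
        return "normal upload"
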
Through all the steps above, the space that could originally serve only 1000 users can, after these improvements, serve nearly 1,000,000 users, while the 1 GB displayed on the user side never changes.
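Roughly how the gains compound, using only the figures from the steps above:

    # Back-of-the-envelope compounding of the improvements described above.
    base_users = 1000      # naive allocation: 1000 GB / 1 GB quota
    oversell_gain = 20     # ~50 MB of real use against a 1 GB quota (the ~19-20x step)
    dedup_gain = 50        # ~1 MB of unique data per user after deduplication
    print(base_users * oversell_gain * dedup_gain)   # ~1,000,000 users on the same disks
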
So one day, in a good mood, you announce to the world: I am raising every user's maximum storage space to 1 TB. Each user still uploads only about 50 MB on average, and only a very few individual users ever exceed the original 1 GB of raw space, so you find the extra cost is almost negligible.
And the hardworking engineers keep on digging for even more efficient ways to use the disk space the servers provide...

