How to deploy storage infrastructure for big data

Source: Internet
Author: User

Recently, there has been much discussion of the value of big data analytics and the business intelligence it brings, but before companies can mine that data, they have to figure out how to store it. Managing big data (a petabyte or more) is completely different from managing traditional large datasets, and the online photo-sharing company Shutterfly knows this well.

Shutterfly is an online photo-sharing site that lets users upload an unlimited number of photos and stores them at the resolution at which they were uploaded, never downscaling them. This sets it apart from other photo-sharing platforms, and Shutterfly says it never deletes a picture.

"Our photo archive is about 30 PB of data," said Neil Day, senior vice president and chief technology officer at Shutterfly. "Our storage pool grows faster than our customer base. When we get a customer, the first thing they do is upload a bunch of photos to us. Then they fall in love with our service, and they upload another bunch of photos."

To get a sense of the scale, consider the following: 1 PB is equivalent to 1,000 TB or 1 million GB; the image data from the first 20 years of observations by NASA's Hubble Space Telescope comes to approximately 45 TB; and 1 TB of compressed audio recorded at 128 kbit/s contains approximately 17,000 hours of audio.
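The size comparisons above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 PB is 1,000 TB, and assumes the audio figure refers to a constant bitrate of 128 kbit/s; the function name `audio_hours` is made up for illustration.

```python
# Decimal (SI) storage units, in bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def audio_hours(size_bytes, bitrate_kbps):
    """Hours of audio that fit in size_bytes at a constant bitrate in kbit/s."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return size_bytes / bytes_per_second / 3600


print(PB // TB)                     # terabytes per petabyte: 1000
print(PB // GB)                     # gigabytes per petabyte: 1000000
print(round(audio_hours(TB, 128)))  # hours of 128 kbit/s audio in 1 TB: 17361
print(round(30 * PB / (45 * TB)))   # 30 PB archive vs. the 45 TB Hubble figure: 667
```

Under these assumptions, a 30 PB archive is several hundred times the size of the Hubble figure quoted above.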

PB-scale infrastructure is completely different

"PB-scale infrastructure is a completely different thing," Day says. "It is difficult to build and maintain. The difference between a petabyte-scale infrastructure and a traditional large dataset is like the difference between day and night, like the difference between processing data on a laptop and processing data on a RAID array."

When Day joined Shutterfly in 2009, storage had already become the company's biggest expense and was growing rapidly.

"Every n PB of extra storage means we need another storage administrator to support the physical and logical infrastructure," Day says. "With massive data stores, systems are more prone to problems, and anyone managing a very large store will often have to deal with hardware failures." The fundamental question that everyone is trying to solve is this: when you know that some part of your storage will fail over time, how do you keep the data available without degrading performance?

RAID issues

The standard answer to hardware failure is replication, usually in the form of a RAID array. However, Day says that at very large scale, RAID may create more problems than it solves. In traditional RAID storage schemes, each piece of data is mirrored and stored on different disks in the array to ensure integrity and availability. But this means that each piece of mirrored and stored data requires more than five times its own size in storage space. And as the disks used in RAID arrays become larger (3 TB disks are very attractive from a density and power perspective), the time needed to rebuild after a drive failure becomes longer.

"In fact, we don't have any operational problems with RAID," Day says. "What we see is that as disks get larger and larger, the time it takes to get back to a fully redundant system after any component fails goes up. The work of generating parity is proportional to the size of the dataset. When we started using 1 TB and 2 TB disks, the time to get back to a fully redundant system became very long. This trend is not moving in the right direction."
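The scaling Day describes can be illustrated with a back-of-the-envelope estimate. This is only a sketch: it assumes a rebuild has to process the full capacity of the replaced disk at a constant rate, and the 100 MB/s figure (`ASSUMED_RATE`) is a hypothetical value chosen for illustration, not a number from Shutterfly.

```python
TB = 10**12  # bytes, decimal units

# Hypothetical sustained rebuild throughput in bytes per second (assumption).
ASSUMED_RATE = 100 * 10**6


def rebuild_hours(disk_bytes, rebuild_bytes_per_sec):
    """Lower-bound rebuild time in hours if the whole disk must be rewritten."""
    return disk_bytes / rebuild_bytes_per_sec / 3600


for size_tb in (1, 2, 3):
    hours = rebuild_hours(size_tb * TB, ASSUMED_RATE)
    print(f"{size_tb} TB disk: about {hours:.1f} hours to rebuild")
```

At a fixed rebuild rate the estimate grows linearly with disk size, so a 3 TB disk leaves the array without full redundancy for three times as long as a 1 TB disk.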

For Shutterfly, reliability and availability are critical, and both are enterprise-class storage requirements. But Day says the company's rapidly growing storage costs made commodity systems more attractive. While Day and his team were studying potential technologies to help control storage costs, they became very interested in a technology called erasure codes.
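To give a rough idea of what an erasure code does, here is a minimal sketch of the simplest case: a single XOR parity block, which can rebuild any one lost block from the survivors. This is an illustration only, not the scheme Shutterfly evaluated; production erasure codes such as Reed-Solomon tolerate several simultaneous losses. The function names and sample data are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return xor_blocks(survivors)


data = [b"phot", b"o-ar", b"chiv", b"e..."]  # four equal-length data blocks
stored = encode(data)                        # five blocks on five disks

print(recover(stored, 2) == data[2])  # a lost data block is rebuilt: True
print(len(stored) / len(data))        # storage overhead factor: 1.25
```

In this sketch, four data blocks plus one parity block cost 1.25 times the original size, compared with 2 times for a simple mirror, which hints at why such codes are attractive when storage cost is the main concern.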

