Erasure codes save data recovery bandwidth for Hadoop


Seven authors from the University of Southern California and Facebook have jointly written the paper "XORing Elephants: Novel Erasure Codes for Big Data." They developed a new member of the erasure code family, Locally Repairable Codes (hereinafter LRC), which are based on XOR and significantly reduce I/O and network traffic when repairing data. They applied these codes in a new Hadoop component called HDFS-Xorbas and tested it on Amazon AWS and at Facebook.

From Reed-Solomon codes to LRC

About 10 years ago, the industry began using Reed-Solomon codes to protect data distributed in two or three copies, replacing traditional RAID5 or RAID6. Because inexpensive disks take the place of expensive storage arrays, this approach is very economical. Reed-Solomon codes and XOR are both branches of erasure coding: XOR tolerates the loss of only one piece of data, while Reed-Solomon codes can tolerate the loss of several.
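
To make the XOR limitation concrete, here is a minimal Python sketch (my own illustration, not code from the paper): one XOR parity block can rebuild exactly one lost block, which is why plain XOR tolerates only a single failure.

```python
# Minimal illustration: XOR parity rebuilds exactly one missing block.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"block-A1", b"block-B2", b"block-C3"]   # three data blocks
parity = xor_blocks(data)                        # one XOR parity block

# Lose any single block, e.g. data[1]; XOR of all survivors restores it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Losing two blocks is unrecoverable with a single XOR parity -- the gap
# that Reed-Solomon codes close by generating several parity blocks.
```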

But standard Reed-Solomon codes do not cope well with hyper-scale Hadoop workloads, because repairing data is expensive in both time and resources (mainly I/O and network traffic). Meanwhile, data volumes have for some time been growing at a rate that exceeds the infrastructure capacity of Internet companies, and three replicas sometimes fail to meet the higher reliability requirements.

Now these Internet giants are designing their storage systems to a standard like this: even if four storage objects fail at the same time (an object can be a disk, a server, a node, or even an entire data center), no data may be lost. (The Reed-Solomon deployment currently in use follows a (10,4) policy: 10 data blocks generate 4 parity blocks, so the loss of any 4 blocks can be tolerated.) According to this paper, the erasure code approach Facebook adopts requires only 60% of the I/O and network traffic of Reed-Solomon codes.
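
Back-of-the-envelope arithmetic for the (10,4) policy helps show where the repair cost comes from. The sketch below is my own illustration; the 256 MB block size is an assumption for concreteness, not a figure from the article.

```python
# Rough arithmetic for a (10,4) Reed-Solomon stripe (illustrative only;
# the 256 MB block size is an assumed value, not from the article).
data_blocks, parity_blocks = 10, 4
block_mb = 256

storage_overhead = (data_blocks + parity_blocks) / data_blocks
print(f"storage overhead: {storage_overhead:.1f}x (vs. 3.0x for three replicas)")

# Any 4 of the 14 blocks in a stripe can be lost without losing data,
print(f"simultaneous losses tolerated: {parity_blocks}")

# but repairing even one lost block requires reading a full complement of
# 10 surviving blocks -- the source of the I/O and network cost.
print(f"data read to repair one {block_mb} MB block: {data_blocks * block_mb} MB")
```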

The authors analyzed a 3,000-node Facebook Hadoop cluster holding 45 PB of data. On average 22 nodes fail each day, and on some days failures exceed 100, as shown in Figure 1.

Figure 1: Node failures per day

The network of a Hadoop cluster is often heavily loaded; a few active disks are enough to saturate a 1 Gbps link, so the congestion generated by repairing failed data cannot be ignored. An ideal storage scheme must not only ensure storage efficiency but also reduce the amount of traffic required to repair data.

Main results of the LRC tests:

Disk I/O and network traffic are roughly half those of Reed-Solomon codes; LRC consumes 14% more storage space than Reed-Solomon codes; repair time is significantly shorter; reliability is stronger; and the physical placement of data can be adapted to network traffic requirements, even spread across data centers.
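
The 14% figure is consistent with adding two local parity blocks to each (10,4) Reed-Solomon stripe, which is how HDFS-Xorbas is commonly described; the arithmetic below is my own check rather than code from the paper.

```python
# Where "14% more storage" can come from, assuming the LRC adds two local
# parity blocks on top of a (10,4) Reed-Solomon stripe (my assumption).
rs_blocks  = 10 + 4          # 10 data blocks + 4 Reed-Solomon parity blocks
lrc_blocks = rs_blocks + 2   # plus 2 local (XOR) parity blocks

rs_overhead  = rs_blocks / 10      # 1.4x
lrc_overhead = lrc_blocks / 10     # 1.6x
extra = lrc_overhead / rs_overhead - 1
print(f"RS {rs_overhead:.1f}x, LRC {lrc_overhead:.1f}x, extra storage {extra:.0%}")

# The payoff: a single lost data block is rebuilt from its local group
# (5 data blocks + 1 local parity) instead of a full 10-block stripe,
# which is what roughly halves repair I/O and network traffic.
```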

Table 1: LRC compared with Reed-Solomon codes and the traditional Hadoop three-replica strategy. LRC repairs up to two orders of magnitude faster than Reed-Solomon codes and cuts repair traffic in half.

Figure 2: Comparison of HDFS bytes read, network traffic, and repair time for LRC and Reed-Solomon codes under failures of multiple nodes and multiple blocks; LRC saves roughly half the resources compared with Reed-Solomon codes.

The industry is constantly pursuing highly reliable approaches that save network bandwidth, HDFS-3544 among them; their value to Internet giants and cloud infrastructure providers is self-evident. Simple Regenerating Codes, a joint effort by the University of Southern California, Wayne State University, and Microsoft, works in the same direction. Notably, LRC, HDFS-3544, and Simple Regenerating Codes all trade additional locally stored data for a reduction in the network traffic needed to repair data.

At ATC 2012, Microsoft Azure engineer Cheng Huang and his colleagues shared their work on erasure coding in Windows Azure Storage. Cheng Huang said that Microsoft also uses LRC technology in Azure. A video of Cheng Huang's talk can be viewed online. In addition, Cheng Huang participated in the Simple Regenerating Codes work.

In China, Azure has deployed services in two data centers, in Beijing and Shanghai, operated by 21Vianet. In an interview with CSDN, Zhi Jiqing, general manager of Microsoft's Cloud Computing and Server Division, revealed:

Data stored on Azure is kept in six copies; even the local storage of virtual machines is no exception. In China, no public cloud computing company has been willing to commit to an SLA of three nines, but Microsoft promises three nines or better.
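
For reference, "three nines" translates into a concrete downtime budget; the arithmetic below is standard availability math, not a figure from the interview.

```python
# What a "three nines" (99.9%) SLA allows per year -- standard availability
# arithmetic, not a number quoted in the interview.
availability = 0.999
hours_per_year = 365 * 24
allowed_downtime = (1 - availability) * hours_per_year
print(f"allowed downtime per year: {allowed_downtime:.2f} hours")   # ~8.76 hours
```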

About HDFS-Xorbas, LRC, and GFS2

Currently, HDFS-Xorbas is built on Facebook's HDFS-RAID version of Hadoop (available on GitHub and from Apache), and its code is hosted on GitHub.

The HDFS-Xorbas project is maintained by Maheswaran Sathiamoorthy, a doctoral candidate in the Ming Hsieh Department of Electrical Engineering at the University of Southern California. "Several of the paper's authors have founded a company," noted Robin Harris, founder of the consultancy TechnoQWAN, in his article.

Dhruba Borthakur, one of the authors of the paper, is a Facebook engineer who introduced erasure codes in a 2009 blog post:

As far as I know, the idea of using erasure codes came from DiskReduce, from a group at Carnegie Mellon University. I borrowed the idea and added the feature to Apache Hadoop as HDFS-503.

Dhruba stressed that this HDFS erasure coding sits on top of HDFS and does not modify HDFS internals, because the HDFS code is already very complex and he did not want to make it any more complicated.

Dhruba also discussed how HDFS-RAID operates inside Facebook during an HDFS session at Hadoop Summit 2012. Liang, a data engineer, summarized Dhruba's views on his blog:

Data stored on HDFS falls into two kinds: hot data and cold data. Hot data is usually kept in three replicas; because it is accessed frequently, the extra copies provide load balancing as well as redundancy. For cold data, keeping 3 copies in HDFS is not essential. Dhruba described two RAID schemes. For data blocks a/b/c that are not too cold, an XOR parity block is generated; the original blocks a/b/c each keep 2 copies, and the parity block also keeps 2 copies. In this way, the replication factor drops from 3 to 2.6 (a theoretical value).
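
A quick check of that replication factor, under exactly the assumptions described above (my own arithmetic, not code from the talk):

```python
# Sanity check of the "not too cold" scheme: 3 data blocks (a/b/c) kept in
# 2 copies each, plus one XOR parity block also kept in 2 copies.
data_blocks, data_copies = 3, 2
parity_blocks, parity_copies = 1, 2

stored_blocks = data_blocks * data_copies + parity_blocks * parity_copies  # 8
effective_replication = stored_blocks / data_blocks
print(f"effective replication factor: {effective_replication:.2f}")
# ~2.67, i.e. roughly the theoretical 2.6 quoted above, down from 3.0.
```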

For very cold data, the scheme is more aggressive: 10 data blocks generate 4 parity blocks through Reed-Solomon coding; the original data blocks keep only one copy each, while the parity blocks keep 2 copies. In this way, the replication factor drops to 1.2.

Liang also relayed Dhruba's explanation of how the Distributed RAID File System is implemented; Dhruba's 2009 blog post covers this as well and can be consulted separately.

Of course, Hadoop is essentially an open-source implementation of GFS, so how does Google deal with the high cost of data repair? Google's GFS2 (Colossus) uses Reed-Solomon codes in place of plain replication. The paper "Spanner: Google's Globally-Distributed Database," published by Google last year (a CSDN translation is available), revealed:

Reed-Solomon codes can reduce the original 3 copies to the equivalent of 1.5 copies while improving write performance and reducing latency.
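
As a rough illustration of how a 1.5x figure can arise: a Reed-Solomon code that adds 3 parity blocks to every 6 data blocks has exactly that footprint. The (6,3) parameters below are an assumption chosen to match the quoted number, not something the Spanner paper specifies.

```python
# Illustration only: (6,3) is an assumed parameter choice that yields the
# quoted 1.5x footprint; the Spanner paper does not give the parameters.
data_blocks, parity_blocks = 6, 3
footprint = (data_blocks + parity_blocks) / data_blocks
print(f"storage footprint: {footprint:.1f}x of raw data (vs. 3.0x for triple replication)")
```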

But Google has revealed very little about GFS2. Google principal engineer Andrew Fikes shared Google's storage architecture challenges at the 2010 Faculty Summit and talked about why Google uses Reed-Solomon codes, citing the following reasons:

Cost, in particular the cost of copying data across clusters; improved mean time to failure (MTTF); and more flexible cost control and availability options.

References: Ruisheng's blog, Yan Kai's blog (EMC China Academy), High Scalability, StorageMojo
