Recently, seven authors from the University of Southern California and Facebook published the paper "XORing Elephants: Novel Erasure Codes for Big Data." It describes a new member of the erasure-code family, Locally Repairable Codes (hereinafter LRC), which are built on XOR; the technique significantly reduces the I/O and network traffic needed to repair data. The authors applied these codes to a new Hadoop component they call HDFS-Xorbas, and tested it on Amazon AWS and inside Facebook.
About ten years ago, the industry began replacing traditional RAID5 and RAID6 with Reed-Solomon codes and with keeping two or three distributed copies of the data. This approach is very economical because inexpensive disks take the place of expensive storage arrays. Both Reed-Solomon codes and XOR are branches of the erasure-code family: XOR tolerates the loss of only one piece of data, while Reed-Solomon codes can tolerate the loss of several.
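As a minimal illustration of the XOR branch mentioned above (not code from the paper; the block names and contents are made up), the sketch below derives a single parity block from a stripe of data blocks and uses it to rebuild one lost block:

```python
# Minimal XOR-parity sketch (illustrative only, not from the paper):
# one parity block protects a stripe against the loss of any single block.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks into one block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Hypothetical stripe of three equally sized data blocks.
a, b, c = b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"
parity = xor_blocks([a, b, c])

# Suppose block b is lost: XORing the survivors with the parity restores it.
recovered_b = xor_blocks([a, c, parity])
assert recovered_b == b
```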
Standard Reed-Solomon codes, however, do not handle huge Hadoop workloads well, because repairing data is expensive in both time and money (mainly I/O and network traffic). Meanwhile, data volumes have been growing by orders of magnitude, outstripping the infrastructure of Internet companies, and three replicas sometimes fail to meet the higher reliability requirements.
These Internet giants now design their storage systems to a standard like this: even if four storage objects fail at the same time (objects here include disks, servers, nodes, and even an entire data center), no data may be lost. Reed-Solomon codes are currently used with a (10, 4) policy: 10 data blocks generate 4 parity blocks, so the loss of any 4 blocks can be tolerated. According to the paper, the erasure-code method Facebook uses needs only 60% of the I/O and network traffic of Reed-Solomon codes.
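To make the (10, 4) policy concrete, here is its arithmetic as a quick sketch; the parameters come from the text above, everything else is illustrative:

```python
# What the (10, 4) policy above means, as a quick sanity check
# (parameters from the text; illustrative, not Facebook's implementation).
data_blocks, parity_blocks = 10, 4
stripe = data_blocks + parity_blocks    # 14 blocks are stored per stripe

# Reed-Solomon is MDS: the stripe survives the loss of any `parity_blocks`
# blocks, because any `data_blocks` survivors can rebuild the rest.
assert stripe - parity_blocks == data_blocks
print(f"store {stripe} blocks, tolerate any {parity_blocks} failures")
```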
The authors analyzed a 3,000-node Facebook Hadoop cluster holding 45 PB of data. On average 22 nodes fail per day, and some days see more than 100 failures, as shown in Figure 1.
The network in a Hadoop cluster is often heavily loaded: a few active disks can fill 1 Gb of bandwidth, so the congestion generated by repairing failed data cannot be ignored. An ideal storage scheme must not only be storage-efficient but also reduce the traffic needed to repair data.
The main results of the LRC tests:
--Disk I/O and network traffic are halved compared with Reed-Solomon codes (see the sketch after this list);
--LRC uses 14% more storage space than Reed-Solomon codes;
--Repair times are greatly shortened;
--Reliability is stronger;
--The reduced demand for network traffic allows data to be placed more freely, even distributed across data centers.
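A back-of-the-envelope sketch of the first two bullets, assuming the layout described in the paper (an RS(10, 4) baseline plus two local XOR parities, one per group of five data blocks); the arithmetic is illustrative, not the paper's evaluation code:

```python
# Back-of-the-envelope arithmetic behind the first two bullets, assuming the
# layout described in the paper: RS(10, 4) as the baseline, with LRC adding
# 2 local XOR parities, one per group of 5 data blocks (illustrative only).

DATA = 10
RS_PARITY = 4
LOCAL_PARITY = 2          # one local parity per group of 5 data blocks
GROUP = 5

rs_storage  = (DATA + RS_PARITY) / DATA                 # 1.4x raw data
lrc_storage = (DATA + RS_PARITY + LOCAL_PARITY) / DATA  # 1.6x raw data
print(f"extra storage: {lrc_storage / rs_storage - 1:.0%}")   # ~14%

# Repairing a single lost data block:
rs_reads  = DATA    # RS must read 10 surviving blocks to rebuild one block
lrc_reads = GROUP   # LRC reads the 4 group members plus the local parity
print(f"repair traffic: {lrc_reads / rs_reads:.0%} of RS")    # 50%
```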
With efforts such as HDFS-3544, the industry keeps pursuing highly reliable ways to save network bandwidth, whose value to Internet giants and cloud-infrastructure providers is self-evident. Simple Regenerating Codes, a joint effort of the University of Southern California, Wayne State University and Microsoft, work in the same direction. Notably, LRC, HDFS-3544 and Simple Regenerating Codes all reduce the network traffic needed for repair by storing additional local data.
At ATC 2012, Microsoft Azure engineer Cheng Huang and his colleagues presented "Erasure Coding in Windows Azure Storage," and Cheng Huang said Microsoft also uses LRC technology in Azure. A video of his talk is available. In addition, Cheng Huang was involved in the Simple Regenerating Codes work.
In China, Azure is deployed in two data centers, in Beijing and Shanghai, operated by 21Vianet. In an interview with CSDN, Zhi Jiqing, general manager of Microsoft's Cloud Computing and Server division, revealed:
Data on Azure is stored in six copies, and even the local storage of virtual machines is no exception. In China, no public cloud computing company has been willing to commit to a three-nines SLA, but Microsoft promises three nines or better.
About HDFS-Xorbas, LRC and GFS2
Currently, HDFS-Xorbas is based on Facebook's HDFS-RAID version of Hadoop (see the GitHub and Apache portals), and its code is hosted on GitHub.
The HDFS-Xorbas project is maintained by Maheswaran Sathiamoorthy, a doctoral candidate in the Ming Hsieh Department of Electrical Engineering at the University of Southern California. "Several of the paper's authors have founded a company," noted Robin Harris, founder of the consultancy TechnoQWAN, in his article.
Dhruba Borthakur, one of the paper's authors and a Facebook engineer, introduced erasure codes in a 2009 blog post:
As far as I know, the idea of using erasure codes came from DiskReduce, a group at Carnegie Mellon University. I borrowed the idea and added the feature to Apache Hadoop as HDFS-503.
Dhruba stressed that this erasure coding sits on top of HDFS and does not modify HDFS internals, because the HDFS code is already very complex and he did not want to risk making it even more complicated.
Dhruba also discussed how HDFS-RAID operates inside Facebook at an HDFS session of Hadoop Summit 2012. Liang, a data engineer, summarized Dhruba's points on his blog:
Data stored in HDFS falls into two kinds: hot data and cold data. Hot data is usually kept in three replicas; since it is accessed frequently, the extra replicas provide load balancing as well as redundancy. For cold data, keeping three replicas in HDFS is not essential. Dhruba described two RAID schemes. For data blocks a/b/c that are not too cold, a parity block is generated by XOR; the original blocks a/b/c each keep two copies, and the parity block also has two copies. This reduces the replication factor from 3 to 2.6 (a theoretical value).
For very cold data the scheme is more aggressive: 10 data blocks are encoded with Reed-Solomon into 4 parity blocks, the original data blocks keep only one copy, and the parity blocks have two copies. The replication factor thus drops to 1.2.
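The replication-factor arithmetic quoted above can be reproduced with a small helper; the block and copy counts come from the text, while the function itself is only an illustration:

```python
# A small helper to reproduce the "effective replication factor" arithmetic
# quoted above (illustrative; the block counts are taken from the text).

def effective_replication(data_blocks, data_copies, parity_blocks, parity_copies):
    """Total stored blocks divided by the number of original data blocks."""
    stored = data_blocks * data_copies + parity_blocks * parity_copies
    return stored / data_blocks

# Warm data: blocks a/b/c (2 copies each) plus one XOR parity block (2 copies).
print(effective_replication(3, 2, 1, 2))   # ~2.67, quoted as about 2.6 above

# The cold-data RS(10, 4) scheme follows the same formula with its own
# per-block replica counts.
```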
Liang also relays Dhruba's explanation of how the distributed RAID file system is implemented; Dhruba's 2009 blog post covers this as well and can be consulted separately.
Of course, Hadoop is essentially an open-source implementation of GFS, so how does Google deal with the high cost of data repair? Google's GFS2, Colossus, uses Reed-Solomon codes. The paper Google published last year, "Spanner: Google's Globally-Distributed Database" (a CSDN translation is available), revealed:
Reed-Solomon codes can reduce the original three copies to 1.5 copies, improving write performance and reducing latency.
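As an illustration only (the Spanner paper does not state the parameters Colossus uses), one Reed-Solomon layout that yields the quoted 1.5x figure is 6 data blocks plus 3 parity blocks:

```python
# Illustrative arithmetic only: the source does not state Colossus's RS
# parameters. An RS(6, 3) layout (6 data + 3 parity blocks) is one
# hypothetical configuration that gives the quoted 1.5x storage factor.
data_blocks, parity_blocks = 6, 3
print((data_blocks + parity_blocks) / data_blocks)  # 1.5
```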
But Google has revealed very little about GFS2. Andrew Fikes, a principal engineer at Google, spoke about Google's storage-architecture challenges at the 2010 Faculty Summit and gave the following reasons why Google uses Reed-Solomon codes:
--Cost, particularly of data replicas across clusters;
--Mean time to failure (MTTF);
--More flexible cost control and availability choices.
"Edit Recommendation"
Hadoop is currently just "poor ETL" Hadoop 2.0 will release a new breakthrough in big data soon Tcloud hand WANdisco accelerate Hadoop China development Big Data start-ups Wibidata "pack" the Hadoop "executive editor: Xiao Yun TEL: (010) 68476606 "