Hadoop in Depth (VI): HDFS Data Integrity

Last Update:2017-02-27 Source: Internet

Author: User

Tags command line crc32


Data integrity

I/O operations inevitably risk losing or corrupting data, and the more data is transferred, the higher the probability of an error. The most common way to detect errors is to compute a checksum before transmission and again after transmission; if the two checksums differ, the data is corrupt. The most widely used error-detecting code is CRC32.
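The before-and-after comparison can be sketched with the JDK's own java.util.zip.CRC32. This is only an illustration of the idea, not HDFS code; the class and method names (TransferCheck, arrivedIntact) are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class TransferCheck {
    // Compute the CRC32 checksum of a byte array.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True when the checksum recomputed after transfer matches the one computed before.
    static boolean arrivedIntact(long checksumBefore, byte[] received) {
        return checksumBefore == crc32(received);
    }

    public static void main(String[] args) {
        byte[] sent = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        long before = crc32(sent);

        byte[] received = sent.clone();
        System.out.println(arrivedIntact(before, received)); // true: the copy is identical

        received[0] ^= 0x01; // simulate a single bit flipped in transit
        System.out.println(arrivedIntact(before, received)); // false: the checksums differ
    }
}
```

A checksum of this kind only detects corruption; it cannot repair the data, which is why HDFS relies on replicas for recovery.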

HDFS data integrity

HDFS computes checksums when data is written and verifies them each time the data is read. Note that HDFS computes one checksum for every fixed-length chunk of data; the chunk size is set by io.bytes.per.checksum and defaults to 512 bytes. Because a CRC32 checksum is 32 bits (4 bytes), the checksums take up less than 1% of the space of the original data (4/512 is about 0.78%). The figure of 1% turns up often in Hadoop and may be worth an article of its own.
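The per-chunk scheme and the overhead arithmetic can be illustrated as follows. This is a simplified sketch that assumes the 512-byte default quoted above; ChunkedChecksum and its methods are hypothetical names, not part of Hadoop.

```java
import java.util.zip.CRC32;

class ChunkedChecksum {
    static final int BYTES_PER_CHECKSUM = 512; // default quoted in the text
    static final int CHECKSUM_SIZE = 4;        // a CRC32 value is 32 bits

    // One CRC32 value per chunk of up to bytesPerChecksum bytes.
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Storage overhead of the checksums as a fraction of the data size.
    static double overhead(int bytesPerChecksum) {
        return (double) CHECKSUM_SIZE / bytesPerChecksum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // 512 + 512 + 276 bytes, so three chunks
        System.out.println(checksums(data, BYTES_PER_CHECKSUM).length); // 3
        System.out.println(overhead(BYTES_PER_CHECKSUM) * 100 + "%");   // 0.78125%
    }
}
```

Checksumming small chunks means a mismatch pinpoints which 512-byte range is damaged, at a cost of well under 1% extra storage.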

A datanode verifies the checksum of any data it receives before storing it, whether the data comes from a client or from another datanode during replication. Recall the write pipeline described in an earlier article in this series, Hadoop in-depth study (III): HDFS data streams. When a client writes data to HDFS, the last datanode in the pipeline verifies the checksum, and if it detects an error, a ChecksumException is thrown to the client.

When a client reads data from a datanode, it also verifies the checksum. Each datanode keeps a log of checksum verifications, and every verification performed by a client is recorded in that log.

In addition to the checks on read and write, each datanode runs a background process (DataBlockScanner) that periodically verifies the blocks stored on it. Data errors arise not only during reads and writes but also in the hardware itself, for example through bit rot on the storage media.
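The idea behind such a background scan can be sketched as a pass that recomputes each block's checksum and compares it with the stored one. This is a toy in-memory model, not the actual DataBlockScanner; BlockScanSketch, StoredBlock, and the block names are invented for illustration, and a real scanner would run on a schedule against data on disk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

class BlockScanSketch {
    // Hypothetical stand-in for a block and the checksum recorded when it was written.
    static final class StoredBlock {
        final byte[] data;
        final long storedChecksum;

        StoredBlock(byte[] data) {
            this.data = data;
            this.storedChecksum = crc32(data);
        }
    }

    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // One scan pass: return the ids of blocks whose data no longer matches the stored checksum.
    static List<String> scan(Map<String, StoredBlock> blocks) {
        List<String> corrupt = new ArrayList<>();
        for (Map.Entry<String, StoredBlock> entry : blocks.entrySet()) {
            StoredBlock block = entry.getValue();
            if (crc32(block.data) != block.storedChecksum) {
                corrupt.add(entry.getKey());
            }
        }
        return corrupt;
    }

    public static void main(String[] args) {
        Map<String, StoredBlock> blocks = new LinkedHashMap<>();
        blocks.put("blk_1", new StoredBlock(new byte[] {10, 20, 30}));
        blocks.put("blk_2", new StoredBlock(new byte[] {40, 50, 60}));

        blocks.get("blk_2").data[1] ^= 0x04; // simulate bit rot after the block was stored
        System.out.println(scan(blocks));    // [blk_2]
    }
}
```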

If a client finds that a block is corrupt, the block is recovered in the following steps:

1. Before throwing the ChecksumException, the client reports the bad block and the datanode it was read from to the namenode.

2. The namenode marks that copy of the block as corrupt, so that it neither directs clients to it nor copies it to other datanodes.

3. The namenode has a good copy of the block replicated to another datanode.

4. The namenode deletes the bad copy.
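The four steps above can be modelled with a small bookkeeping sketch. It tracks only which datanodes hold healthy or corrupt copies of a single block; the class CorruptReplicaSketch, its fields, and the datanode names are all hypothetical, and the real namenode logic is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

class CorruptReplicaSketch {
    // Hypothetical bookkeeping for one block: datanodes with healthy and corrupt copies.
    final Set<String> healthy = new TreeSet<>();
    final Set<String> corrupt = new TreeSet<>();
    // Datanodes that do not yet hold a copy and can receive one.
    final Deque<String> spareNodes = new ArrayDeque<>();

    // Steps 1 and 2: a client reports a bad copy, and it is marked corrupt.
    void reportBadReplica(String datanode) {
        if (healthy.remove(datanode)) {
            corrupt.add(datanode);
        }
    }

    // Clients are only ever directed to healthy copies.
    Set<String> readableLocations() {
        return Collections.unmodifiableSet(healthy);
    }

    // Steps 3 and 4: replicate a good copy to a spare node, then drop the bad copy.
    void repair() {
        if (healthy.isEmpty()) {
            return; // no good copy left to replicate from
        }
        Iterator<String> it = corrupt.iterator();
        while (it.hasNext() && !spareNodes.isEmpty()) {
            it.next();
            healthy.add(spareNodes.poll()); // step 3: new copy on another datanode
            it.remove();                    // step 4: bad copy deleted
        }
    }

    public static void main(String[] args) {
        CorruptReplicaSketch block = new CorruptReplicaSketch();
        Collections.addAll(block.healthy, "dn1", "dn2", "dn3");
        block.spareNodes.add("dn4");

        block.reportBadReplica("dn2");
        System.out.println(block.readableLocations()); // [dn1, dn3]

        block.repair();
        System.out.println(block.readableLocations()); // [dn1, dn3, dn4]
    }
}
```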

If for some reason you do not want HDFS to verify checksums when reading, call setVerifyChecksum(false) on the FileSystem before calling its open method. On the command line, the same effect is achieved with the -ignoreCrc option.

Implementation

LocalFileSystem extends ChecksumFileSystem and so already implements checksumming. The checksums are stored in a hidden .crc file named after the data file (.filename.crc), and files found to be corrupt are moved to a bad_files directory. If you are sure that the underlying system already performs its own checksumming, you do not need LocalFileSystem and can use RawLocalFileSystem instead. This can be set globally with fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem, or a RawLocalFileSystem can be instantiated directly in code:

Configuration conf = ...  
FileSystem fs = new RawLocalFileSystem();  
fs.initialize(null, conf);
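The sidecar-file approach that LocalFileSystem inherits from ChecksumFileSystem can be illustrated with plain JDK file I/O. This sketch stores a single CRC32 for the whole file in a hidden .name.crc file; it does not reproduce Hadoop's actual .crc file format, and SidecarChecksum and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

class SidecarChecksum {
    // Hidden sidecar next to the data file, e.g. data.txt -> .data.txt.crc
    static Path sidecarFor(Path file) {
        return file.resolveSibling("." + file.getFileName() + ".crc");
    }

    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Write the data file and, alongside it, a sidecar holding its checksum.
    static void write(Path file, byte[] data) throws IOException {
        Files.write(file, data);
        byte[] sum = ByteBuffer.allocate(Long.BYTES).putLong(crc32(data)).array();
        Files.write(sidecarFor(file), sum);
    }

    // Read the data file and fail if it no longer matches the stored checksum.
    static byte[] readVerified(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        long expected = ByteBuffer.wrap(Files.readAllBytes(sidecarFor(file))).getLong();
        if (crc32(data) != expected) {
            throw new IOException("Checksum mismatch for " + file);
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crc-sketch");
        Path file = dir.resolve("data.txt");

        write(file, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readVerified(file), StandardCharsets.UTF_8)); // hello

        // Corrupt the data file but leave the sidecar untouched.
        Files.write(file, "jello".getBytes(StandardCharsets.UTF_8));
        try {
            readVerified(file);
        } catch (IOException e) {
            System.out.println("detected: " + e.getMessage());
        }
    }
}
```

Skipping the sidecar entirely, as RawLocalFileSystem does, avoids this extra read and comparison when another layer already guarantees integrity.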

If another FileSystem needs checksumming, simply wrap it in a ChecksumFileSystem:

FileSystem rawFs = ...  
FileSystem checksummedFs = new ChecksumFileSystem(rawFs) {};

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
