Hadoop in-depth research: (vi)--HDFS data integrity

Source: Internet
Author: User
Tags: crc32

Reprint: please credit the source: Hadoop in-depth research: (vi)--HDFS data integrity

Data Integrity

During I/O operations, data loss or corruption is unavoidable, and the higher the data transfer rate, the higher the probability of error. The most common way to detect errors is to compute a checksum before transmission and compute it again after transmission: if the two checksums differ, the data has been corrupted. The most commonly used error-detecting code is CRC-32.

HDFS Data Integrity

HDFS computes checksums when data is written and verifies them every time the data is read. Note that HDFS computes a checksum for each fixed-length chunk of data; the chunk size is specified by io.bytes.per.checksum and defaults to 512 bytes. Because a CRC-32 checksum is 32 bits (4 bytes), the checksums occupy less than 1% of the original data: with the defaults, 4 bytes per 512-byte chunk is roughly 0.8% overhead. This 1% figure comes up often in Hadoop; when I have time I will put together a list of Hadoop and 1% stories.

A datanode verifies the checksum of data before storing it, whether the data comes from a client or from another replica. Recall from the earlier article, Hadoop in-depth research: (c)--HDFS data flow, that when a client writes data to HDFS, the last datanode in the pipeline verifies the checksum and, if it finds an error, throws a ChecksumException back to the client. The client likewise verifies checksums when reading data from a datanode, and each datanode keeps a persistent log of checksum verifications in which every client verification is recorded. In addition to the checks performed on reads and writes, each datanode runs a background process, the DataBlockScanner, which periodically verifies the blocks stored on it, because besides the errors introduced during reads and writes, the hardware itself can corrupt data, for example through bit rot.

If a client finds a corrupt block, recovery proceeds in a few steps:
1. Before throwing a ChecksumException, the client reports the bad block and the datanode it was read from to the namenode.
2. The namenode marks that replica of the block as corrupt, so it neither directs clients to it nor copies it to other datanodes.
3. The namenode schedules a copy of a good replica of the block to another datanode.
4. The namenode deletes the bad replica.

If for some reason you do not want HDFS to verify checksums, pass false to FileSystem's setVerifyChecksum method before calling its open method, or use the -ignoreCrc option with the -get (or equivalent -copyToLocal) shell command.
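For illustration, here is a minimal sketch of the programmatic route; the class name, the /tmp/example path, and the use of the default Configuration are assumptions for the example, not part of the original article:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // must be set before open(), or the returned stream will still verify
        fs.setVerifyChecksum(false);
        try (InputStream in = fs.open(new Path("/tmp/example"))) {
            // reads succeed without a ChecksumException, even for a corrupt block
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}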
Implementation

LocalFileSystem inherits from ChecksumFileSystem and already implements checksumming: the checksum information is stored in a hidden .crc file in the same directory as the data file, and files in which an error is detected are moved to a bad_files directory. If you are sure the underlying filesystem already does its own checksumming, you do not need this layer and can use RawLocalFileSystem instead of LocalFileSystem. You can either set fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem to change this globally, or instantiate it directly in code:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
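For the global alternative mentioned above, here is a sketch of setting the property in code (it can equally go in a configuration file; the class name and the file:/// URI are my own choices for the example):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RawLocalViaConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // map the file:// scheme to the raw, non-checksumming implementation
        conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(fs.getClass().getName()); // expect ...RawLocalFileSystem
    }
}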
If you want any other filesystem to gain checksumming, you only need to wrap it in a ChecksumFileSystem:
FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs) {};
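Putting it together, a self-contained sketch of the wrapping pattern; the class name, the /tmp/checksum-demo path, and the sample data are assumptions for the example (ChecksumFileSystem is abstract, hence the empty anonymous subclass):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumFileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class ChecksumWrapDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem rawFs = new RawLocalFileSystem();
        rawFs.initialize(URI.create("file:///"), conf);
        // wrap the raw filesystem so that writes produce .crc sidecar files
        ChecksumFileSystem checksummedFs = new ChecksumFileSystem(rawFs) {};
        checksummedFs.setConf(conf);
        Path file = new Path("/tmp/checksum-demo/data.txt");
        try (FSDataOutputStream out = checksummedFs.create(file)) {
            out.writeUTF("hello, checksums");
        }
        // the wrapper also wrote the hidden checksum file /tmp/checksum-demo/.data.txt.crc
        System.out.println("wrote " + file + ", exists: " + checksummedFs.exists(file));
    }
}

LocalFileSystem is exactly this pattern applied to RawLocalFileSystem, with verification and the bad_files handling added on the read path.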


Thanks to Tom White: most of this article comes from his Definitive Guide. The Chinese translation is quite poor, so I worked from the English original and some of the official documentation, adding some understanding of my own. These are just reading notes; please bear with any redundancy.