Data integrity
Any I/O operation can introduce lost or corrupted data, and the more data that is transferred, the higher the probability of an error. The most common way to detect such errors is to compute a checksum before the data is transmitted and compute it again after it arrives; if the two checksums differ, the data has been corrupted. The most commonly used error-detecting code is CRC32.
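As a minimal sketch of the idea, outside of Hadoop entirely and using only the standard java.util.zip.CRC32 class, a sender and a receiver can each compute a CRC32 over the same bytes and compare the results:

import java.util.zip.CRC32;

public class CrcDemo {
    // Compute a CRC32 checksum over a byte array.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "some data to transfer".getBytes();
        long before = checksum(original);

        // Simulate a single-bit error during transmission.
        byte[] received = original.clone();
        received[3] ^= 0x01;
        long after = checksum(received);

        // A mismatch reveals that the data was corrupted in transit.
        System.out.println(before == after ? "data intact" : "data corrupted");
    }
}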
HDFS data integrity
HDFS computes a checksum when data is written and verifies it every time the data is read. Note that HDFS computes one checksum per fixed-length chunk of data; the chunk size is set by io.bytes.per.checksum and defaults to 512 bytes. Since a CRC32 checksum is 32 bits (4 bytes), the storage overhead is less than 1% of the original data (4 / 512 ≈ 0.8%). This 1% figure comes up again and again in Hadoop; at some point I will put together the various "Hadoop and 1%" stories.
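The chunk size can be changed through the configuration if needed; a minimal sketch, where the 512-byte value shown is simply the default rather than a recommendation:

import org.apache.hadoop.conf.Configuration;

public class ChecksumChunkSize {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One 4-byte CRC32 checksum is stored for every io.bytes.per.checksum bytes of data.
        conf.setInt("io.bytes.per.checksum", 512);
        System.out.println(conf.getInt("io.bytes.per.checksum", 512));
    }
}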
A datanode verifies the checksum of data before storing it, whether the data comes from a client or from another replica. Recall from the earlier post in this series, Hadoop in-depth study: (iii)--HDFS data flow, that when a client writes data to HDFS, the last datanode in the write pipeline verifies the checksum; if it finds a mismatch, a ChecksumException is thrown back to the client.
When the client reads data from a datanode it also verifies the checksums, and each datanode keeps a persistent log of checksum verifications; every time a client verifies a block, it reports this to the datanode, which records it in that log.
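From the application's point of view this verification is transparent: you simply read the file through the FileSystem API, and a corrupted block shows up as a ChecksumException (a subclass of IOException). A minimal sketch, where /tmp/example.txt is just a placeholder path:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class VerifiedRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path("/tmp/example.txt")); // placeholder path
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (ChecksumException e) {
            // The data read from the datanode did not match its checksum.
            System.err.println("Corrupt block detected: " + e.getMessage());
        } finally {
            IOUtils.closeStream(in);
        }
    }
}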
Besides verifying checksums on reads and writes, each datanode also runs a background thread, the DataBlockScanner, which periodically verifies the blocks stored on it. This catches errors that are introduced not by the read/write path but by the hardware itself, such as "bit rot" in the storage media.
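On older Hadoop releases the scanner's results can be inspected through the datanode's embedded web server at /blockScannerReport; both that path and the default HTTP port of 50075 used below are assumptions that depend on your version and configuration. A sketch that fetches the report with plain Java:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class BlockScannerReport {
    public static void main(String[] args) throws Exception {
        // Assumes the default datanode HTTP port (50075) and the
        // /blockScannerReport page of older Hadoop releases.
        URL url = new URL("http://datanode-host:50075/blockScannerReport");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}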
If the client finds that a block is corrupt, the bad block is recovered in a few steps:
1. Before throwing the ChecksumException, the client reports the bad block and the datanode it was read from to the namenode.
2. The namenode marks that replica of the block as corrupt, so it no longer directs clients to it and no longer replicates it to other datanodes.
3. The namenode copies a good replica of the block to another datanode, restoring the replication factor.
4. The namenode deletes the corrupt replica.
If for some reason you do not want HDFS to verify checksums when reading, call the FileSystem's setVerifyChecksum() method with false before calling its open() method; the same effect is available on the command line with the -ignoreCrc option (for example with hadoop fs -get).
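A minimal sketch of disabling verification for a read, where /tmp/example.txt is again just a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Must be called before open(); later reads skip checksum verification.
        fs.setVerifyChecksum(false);
        FSDataInputStream in = fs.open(new Path("/tmp/example.txt")); // placeholder path
        IOUtils.copyBytes(in, System.out, 4096, true); // true = close the stream when done
    }
}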
Implementation
LocalFileSystem extends ChecksumFileSystem and already implements checksumming: for each file it stores the checksum information in a hidden file named .filename.crc in the same directory, and files found to be corrupt are moved to a bad_files directory. If you are sure that the underlying filesystem already provides its own checksumming, there is no need to use LocalFileSystem; use RawLocalFileSystem instead. It can be selected globally by setting fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem, or instantiated directly in code:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
If you want any other FileSystem to gain checksum support, you only need to wrap it in a ChecksumFileSystem:
FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs) {};
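As a usage sketch of this wrapping, here with RawLocalFileSystem as the underlying filesystem, /tmp/checksummed.txt as a placeholder path, and the assumption that getChecksumFile() is available on your Hadoop version: writing through the wrapped filesystem also produces the hidden .crc checksum file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumFileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class WrappedChecksumWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        RawLocalFileSystem rawFs = new RawLocalFileSystem();
        rawFs.initialize(null, conf);

        // Wrap the raw filesystem so that writes also produce .crc checksum files.
        ChecksumFileSystem checksummedFs = new ChecksumFileSystem(rawFs) {};
        checksummedFs.setConf(conf); // picks up io.bytes.per.checksum

        Path file = new Path("/tmp/checksummed.txt"); // placeholder path
        FSDataOutputStream out = checksummedFs.create(file);
        out.writeUTF("hello");
        out.close();

        // The checksum lives in a hidden sibling file, e.g. /tmp/.checksummed.txt.crc
        System.out.println(checksummedFs.getChecksumFile(file));
    }
}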