This post originally appeared on my other blog: http://hadoopforcloud.javaeye.com
4. Hadoop I/O
4.1. Data Integrity
In general, a checksum is used to verify data integrity, but it can only detect corruption; it provides no way to repair the data, and the checksum value itself may be corrupted. Hadoop applies checksums with several policies to mitigate these shortcomings.
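The detect-but-not-repair property is easy to see with a few lines of plain Java. This is only an illustration: the class name ChecksumDemo and the sample bytes are made up, and java.util.zip.CRC32 stands in for the CRC-32 checksums HDFS computes internally.

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC-32 checksum over a byte array.
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "hello, hdfs".getBytes();
        long original = checksumOf(data);

        // Flip one bit to simulate corruption on disk or in transit.
        data[3] ^= 0x01;
        long corrupted = checksumOf(data);

        // The mismatch reveals the corruption, but tells us nothing
        // about how to repair the data -- the limitation noted above.
        System.out.println(original != corrupted); // prints true
    }
}
```

CRC-32 is guaranteed to detect any single-bit error, which is why the mismatch above is certain; repairing the data requires a second copy, which is where HDFS's block replication comes in.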
4.1.1. Data Integrity in HDFS
1) HDFS transparently computes checksums for the data it stores and verifies them when the data is read.
2) HDFS creates a checksum for every io.bytes.per.checksum bytes of data; the default is 512 bytes.
3) Datanodes verify data against its checksum before storing the data and the checksum. In a write pipeline, the last datanode is responsible for verifying the checksum; if an error is found, a ChecksumException is thrown.
4) When a client reads data, it also verifies checksums, comparing the checksum it computes with the one stored on the datanode. (Each datanode keeps a persistent log of checksum verifications; after a client successfully verifies a block, it tells the datanode, which updates the log. Keeping these records is valuable for detecting bad disks.)
5) In addition to client-side verification, each datanode runs a background thread, the DataBlockScanner, that periodically verifies the data it stores.
6) Because HDFS replicates blocks, it can repair a corrupt block by copying an undamaged replica from another datanode.
7) You can disable checksum verification by passing false to FileSystem's setVerifyChecksum() method before reading a file with open(). In the shell, use the -ignoreCrc option with the -get or -copyToLocal command.
4.1.2. LocalFileSystem
LocalFileSystem performs client-side checksumming. When you write a file named filename, the file system transparently creates a hidden .filename.crc file in the same directory, containing the checksum of each chunk of filename. The chunk size is controlled by io.bytes.per.checksum (512 bytes by default) and is stored as metadata in the .crc file, so the data can still be read correctly even if the chunk size setting later changes.
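The per-chunk scheme can be sketched in plain Java: one checksum per fixed-size chunk, mirroring io.bytes.per.checksum. The class name ChunkedChecksum is hypothetical and this is not Hadoop's actual implementation, just the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkedChecksum {
    // One CRC-32 per chunk, mirroring io.bytes.per.checksum
    // (512 bytes by default). Illustration only, not Hadoop code.
    static List<Long> chunkChecksums(byte[] data, int chunkSize) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // a bit over 2.5 default chunks
        // 1300 bytes at 512 bytes per chunk -> 3 checksums
        System.out.println(chunkChecksums(data, 512).size()); // prints 3
    }
}
```

Per-chunk checksums localize an error to one small chunk instead of invalidating the whole file, which also keeps re-verification cheap.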
Checksum overhead is low, but it can be disabled. To disable it globally, set fs.file.impl to org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, create a RawLocalFileSystem instance directly, which is useful when you only want to disable checksums for certain reads:
Configuration conf = ...;
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
4.1.3. ChecksumFileSystem
LocalFileSystem uses ChecksumFileSystem to do its work. The ChecksumFileSystem class makes it easy to add checksumming to any FileSystem that lacks it. Typical usage:
FileSystem rawFs = ...;
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);
The underlying file system is called the raw file system and can be obtained with ChecksumFileSystem's getRawFileSystem() method. ChecksumFileSystem has several other useful methods; for example, getChecksumFile() returns the path of the checksum file for a given file.
4.2. Compression
Benefits of file compression: it saves storage space and speeds up transfers across the network.
Several compression formats are commonly used with Hadoop: DEFLATE (.deflate), gzip (.gz), bzip2 (.bz2), and LZO (.lzo). Of these, only bzip2 is splittable.
Note the trade-off between compression ratio and compression speed: faster compression generally yields a lower compression ratio, while better ratios take longer to produce.
Splittable compression formats are particularly well suited to MapReduce, because a reader can seek to any point in the stream and start reading from there.
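The speed-versus-ratio trade-off can be observed directly with the standard java.util.zip.Deflater, which exposes compression levels. The class name CompressionTradeoff and the sample input are made up for illustration.

```java
import java.util.zip.Deflater;

public class CompressionTradeoff {
    // Compress with the given level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // buf is scratch space only
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Highly repetitive input compresses well at any level.
        byte[] input = "hadoop ".repeat(1000).getBytes();
        int fast = compressedSize(input, Deflater.BEST_SPEED);
        int small = compressedSize(input, Deflater.BEST_COMPRESSION);
        System.out.println(fast < input.length);  // prints true
        System.out.println(small <= fast);        // usually true: slower but smaller
    }
}
```

The same dial exists per codec in Hadoop: gzip at level 1 runs faster than at level 9, but produces larger files.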
4.2.1. Codecs
A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface; examples include GzipCodec, BZip2Codec, and DefaultCodec (for DEFLATE), with an LZO codec available as a separate library.
4.2.1.1. Using CompressionCodec to compress and decompress streams
Use CompressionCodec's createOutputStream(OutputStream out) method to create a CompressionOutputStream; data you write to it uncompressed is written to the underlying stream in compressed form.
Use CompressionCodec's createInputStream(InputStream in) to obtain a CompressionInputStream, from which you read decompressed data.
CompressionOutputStream and CompressionInputStream also provide the ability to reset the underlying compressor or decompressor.
The following example reads uncompressed data from standard input, compresses it, and writes it to standard output:
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(codecClass, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();
  }
}
ReflectionUtils creates a new codec instance; createOutputStream() then wraps System.out in a CompressionOutputStream, and IOUtils.copyBytes() copies standard input into it. Finally, finish() tells the codec to flush the compressed stream without closing it.
To run it:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
Text
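The same compress-then-decompress round trip can be shown without Hadoop using the standard java.util.zip gzip streams, which produce output compatible with gunzip. The class name GzipRoundTrip is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data); // write uncompressed; the stream compresses
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] data) throws Exception {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return in.readAllBytes(); // the stream decompresses as we read
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] restored = gunzip(gzip("Text".getBytes()));
        System.out.println(new String(restored)); // prints Text
    }
}
```

Hadoop's CompressionOutputStream/CompressionInputStream follow the same decorator pattern, with the codec choosing which compression algorithm wraps the underlying stream.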
4.2.1.2. Using CompressionCodecFactory to infer CompressionCodecs
CompressionCodecFactory provides a mapping from file extensions to CompressionCodec implementations. An example:
public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}
To run it:
% hadoop FileDecompressor file.gz
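The extension-based lookup and suffix removal that CompressionCodecFactory performs amount to simple string logic, sketched below. The class CodecByExtension is a toy stand-in, not Hadoop's implementation, and the codec names are just labels here.

```java
import java.util.Map;

public class CodecByExtension {
    // A toy version of CompressionCodecFactory's extension lookup.
    static final Map<String, String> CODECS = Map.of(
        ".gz", "GzipCodec",
        ".bz2", "BZip2Codec",
        ".deflate", "DefaultCodec");

    static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no codec found: treat the file as uncompressed
    }

    // Strip the codec's extension to form the output file name,
    // like CompressionCodecFactory.removeSuffix().
    static String removeSuffix(String path, String suffix) {
        return path.substring(0, path.length() - suffix.length());
    }

    public static void main(String[] args) {
        System.out.println(codecFor("file.gz"));            // prints GzipCodec
        System.out.println(removeSuffix("file.gz", ".gz")); // prints file
    }
}
```

This is why FileDecompressor above writes its output to "file": the codec's default extension is stripped from the input URI.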
4.2.1.3. Native Libraries
Using native libraries can improve performance, but note that not every codec has both a Java and a native implementation; some are available only as native implementations. Roughly: DEFLATE and gzip have both Java and native implementations, bzip2 has only a Java implementation, and LZO has only a native implementation.
By default, the native libraries are used automatically if they are present; no configuration change is needed. You can disable them by setting hadoop.native.lib to false.
4.2.1.4. CodecPool
As the name suggests, CodecPool reduces the overhead of repeatedly creating compressor and decompressor instances. The following example demonstrates how to use it:
public class PooledStreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(codecClass, conf);
    Compressor compressor = null;
    try {
      compressor = CodecPool.getCompressor(codec);
      CompressionOutputStream out =
          codec.createOutputStream(System.out, compressor);
      IOUtils.copyBytes(System.in, out, 4096, false);
      out.finish();
    } finally {
      CodecPool.returnCompressor(compressor);
    }
  }
}
Note: after you are done with a compressor, return it to the pool; doing so in a finally block ensures it is returned even if an exception occurs.
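The pooling idea itself is small enough to sketch with the standard java.util.zip.Deflater, whose reset() method clears its state for reuse. The class DeflaterPool is a made-up illustration (and not thread-safe), not CodecPool's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;

public class DeflaterPool {
    private final Deque<Deflater> pool = new ArrayDeque<>();

    // Borrow a compressor, creating one only when the pool is empty.
    Deflater get() {
        Deflater d = pool.poll();
        return (d != null) ? d : new Deflater();
    }

    // Reset the compressor's state and make it available again,
    // analogous to CodecPool.returnCompressor().
    void release(Deflater d) {
        d.reset();
        pool.push(d);
    }

    public static void main(String[] args) {
        DeflaterPool pool = new DeflaterPool();
        Deflater first = pool.get();
        pool.release(first);
        Deflater second = pool.get();        // the same instance, reused
        System.out.println(first == second); // prints true
    }
}
```

Pooling matters because native compressors hold off-heap buffers that are expensive to allocate; reusing one instance per task avoids that cost on every stream.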
4.3. Compression and Input Splits
When deciding how to compress data that MapReduce will process, it is important to consider whether the compression format supports splitting. A gzip stream cannot be read starting from an arbitrary point, so MapReduce will not split a gzip file (it recognizes the format from the .gz extension) and instead processes the whole file with a single map task. This is inefficient: the file's blocks are generally not all stored on the same datanode, so much of the data has to be read across the network.
Bzip2 does support split input.
ZIP is an archive format that stores multiple files, listed in a central directory at the end of the archive. In theory this would allow split input, but Hadoop does not yet support splitting ZIP files.
How to choose a compression format for large files with no internal boundaries, such as log files:
1) Store them uncompressed.
2) Use a format that supports splitting, such as bzip2.
3) Split the file into chunks in your application and compress each chunk separately (any compression format can be used). Choose a chunk size close to the HDFS block size.
4) Use a SequenceFile, which supports both compression and splitting.
4.4. Using Compression in MapReduce
As described in section 4.2.1.2, if the input files are compressed, they are decompressed automatically as MapReduce reads them, using the file extension to determine the codec.
To compress the output of a MapReduce job:
1) Set the mapred.output.compress property to true in the job configuration.
2) Set the mapred.output.compression.codec property to the class name of the codec to use.
The following is an example:
public class MaxTemperatureWithCompression {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCompression "
          + "<input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperatureWithCompression.class);
    conf.setJobName("Max temperature with output compression");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec", GzipCodec.class,
        CompressionCodec.class);
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    JobClient.runJob(conf);
  }
}
To run it:
% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output
The final output is compressed. For this example:
% gunzip -c output/part-00000.gz
1949 111
1950 22
If the output is a SequenceFile, you can set the mapred.output.compression.type property to control the compression type. The default is RECORD, which compresses each record individually; BLOCK, which compresses groups of records together, achieves better compression and is recommended.
Compressing map output
The intermediate output of the map phase can also be compressed. Because this data is transferred across the network to the reducers, compressing it can reduce transfer cost. It is controlled by the mapred.compress.map.output and mapred.map.output.compression.codec properties, or equivalently by adding two lines to the program:
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
4.5. Serialization
4.6. File-based Data Structures