Compression and decompression in Hadoop
Hadoop has built-in support for a number of common compression algorithms that MapReduce can use without any extra work on our part. After the map phase, the map output goes through the shuffle, and the shuffle is where network resources are consumed: the less data it has to transmit, the shorter the job's run time, so this is where compressing the output pays off. The reducer receives the compressed data, decompresses it, processes it, and can compress its own output in turn. As far as our program is concerned, the input is whatever the source provides and is not under the program's control, but we can decide whether to compress the map and reduce output.
Once the map output is compressed, less data has to be transferred to the reduce side. If the map output is large, enabling map-side output compression benefits the whole job.
Key points when choosing a compression algorithm:
1. Does it support splitting? If the format cannot be split, the entire input file is treated as a single input split, which means that, say, a 1 TB file would be handed to a single map task.
2. How fast are compression and decompression? Hadoop jobs are typically disk-I/O-intensive: the bottleneck is usually disk I/O rather than CPU, because the algorithm of a MapReduce job is fixed and rarely involves anything like deep recursion, so CPU utilization stays low. MapReduce is built for processing massive data sets, so a large amount of data has to be read from disk, which is why MapReduce jobs are generally disk-I/O-bound.
Compressing the output of a MapReduce job:
Compress the map-side output:
conf.setBoolean("mapred.compress.map.output", true);
Compress the reduce-side output:
conf.setBoolean("mapred.output.compress", true);
The codec class used for the reduce-side output compression:
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
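Putting the three settings together, here is a minimal driver sketch. It keeps the article's older property names, assumes a Hadoop 2.x-style driver, and leaves the mapper/reducer at their defaults; the class name CompressedJobDriver and the input/output arguments are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress the intermediate map output before it is shuffled to the reducers.
        conf.setBoolean("mapred.compress.map.output", true);
        // Compress the final reduce output as well.
        conf.setBoolean("mapred.output.compress", true);
        // Use gzip for the reduce output; any available CompressionCodec would do.
        conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-job");
        job.setJarByClass(CompressedJobDriver.class);
        // Set your own Mapper/Reducer and output key/value classes here as in any ordinary job.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}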
Reduce-side join:
During data processing, the records being processed may come not from one batch of files but from several; when two batches of files have to be connected, a join is involved.
In the map phase, the map function reads both files, File1 and File2. To distinguish key/value pairs coming from the two sources, it attaches a tag to each record, for example tag=0 for records from File1 and tag=2 for records from File2. The main task of the map phase is simply to label the data according to which file it came from.
In the reduce phase, the reduce function receives, for each key, the value list gathered from both File1 and File2, and then joins the File1 and File2 records that share that key (a Cartesian product). In other words, the reduce side performs the actual join.
When the map reads the original file, can it tell whether a record comes from File1 or File2?
Yes:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String path = fileSplit.getPath().toString();
How do we mark the records when the map outputs them?
Once we know a record comes from File1, we tag its value v2, for example setting v2 to "#zhangsan"; if it comes from File2, we set v2 to something like "*45". The prefix tells the reducer which file each value came from; a sketch of such a tagging mapper and joining reducer follows.
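A minimal sketch of the reduce-side join described above, assuming tab-separated lines whose first field is the join key and input file names containing "file1"/"file2"; the class names are illustrative, and the "#"/"*" prefixes follow the example in the text.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: tag each record with its source file so the reducer can tell them apart.
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String path = fileSplit.getPath().toString();
        String[] fields = value.toString().split("\t", 2);   // assumed "key\trest" layout
        if (path.contains("file1")) {
            context.write(new Text(fields[0]), new Text("#" + fields[1]));
        } else {
            context.write(new Text(fields[0]), new Text("*" + fields[1]));
        }
    }
}

// Reducer: split the tagged values back into two lists and emit their Cartesian product.
class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromFile1 = new ArrayList<String>();
        List<String> fromFile2 = new ArrayList<String>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("#")) fromFile1.add(s.substring(1));
            else fromFile2.add(s.substring(1));
        }
        for (String left : fromFile1) {
            for (String right : fromFile2) {
                context.write(key, new Text(left + "\t" + right));
            }
        }
    }
}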
Map-side join:
The reason for doing the join on the reduce side is that the map phase cannot see all of the fields needed for the join: the fields belonging to the same key may land in different map tasks. A reduce-side join is very inefficient, though, because the shuffle phase has to transfer a large amount of data.
The reduce-side join is inefficient, but the map-side join does not completely replace it, because the map-side join only fits a particular scenario.
The map-side join is an optimization for the following scenario:
Of the two tables being joined, one is very large and the other is small enough to be held entirely in memory. In that case we can do the following.
Load the records of the small table into memory on the map side, so that the only file the map tasks need to read as input is the large table, and perform the join on the map side. Both tables' data still has to be read, but we keep the two kinds of records apart: since the user-information table file is not large, we can pass its path separately and, instead of reading it through map(), read its records in setup(..) with FileSystem, parsing them into a global Map<Integer,String>. What map() then reads are the sales records, and for each one we look up its table ID in the in-memory map and perform the join. A sketch of such a mapper follows.
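A minimal sketch of this map-side join, assuming the small table's HDFS path is passed in under the hypothetical property name join.small.table.path, that the small table's lines look like "id\tname", and that the large table's lines look like "userId\tamount"; the class and field names are illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Small user-information table cached in memory: user id -> user name.
    private final Map<Integer, String> users = new HashMap<Integer, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver passes the small table's path under this hypothetical property name.
        Path smallTable = new Path(context.getConfiguration().get("join.small.table.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(smallTable)));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");   // assumed "id\tname" layout
            users.put(Integer.parseInt(fields[0]), fields[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // assumed "userId\tamount" layout
        String name = users.get(Integer.parseInt(fields[0]));
        if (name != null) {
            // Emit the joined record; no reduce phase is needed.
            context.write(new Text(name + "\t" + fields[1]), NullWritable.get());
        }
    }
}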
Applicable scenarios:
The small table must fit completely into memory. If both tables are too large to hold in memory, the map-side join is not suitable.
A TaskTracker can run multiple map tasks, and each map task is a separate Java process.
Reading the data from HDFS in setup() during the map phase works, but if there are many map tasks, every one of them has to read the same file from HDFS in its own setup(), opening many streams against HDFS, which is somewhat wasteful.
If every map task reads the same small table from HDFS, that is wasteful. With DistributedCache, the contents of the small table can be placed on the local Linux disk of each TaskTracker, so at run time each map task loads the data from the local disk rather than from HDFS every time.
(1) Before the job is submitted, the user specifies the file to be copied with the static method DistributedCache.addCacheFile(), whose parameter is the file's URI. The JobTracker obtains this list of URIs before the job starts and copies the corresponding files to the local disk of each TaskTracker.
(2) In the Mapper's setup(), the user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths, reads the file contents, and caches the data in memory.
The copy to the local disk is performed before the tasks run. A sketch of both steps follows.
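A minimal sketch of these two steps, assuming the same "id\tname" / "userId\tamount" layouts as above; the class name, the helper method addSmallTable(), and the cached file path /cache/userinfo.txt are illustrative.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<Integer, String> users = new HashMap<Integer, String>();

    // (1) Driver side, before the job is submitted: register the small table to be copied.
    public static void addSmallTable(Job job) throws Exception {
        DistributedCache.addCacheFile(new URI("/cache/userinfo.txt"), job.getConfiguration());
    }

    // (2) Mapper side: read the copy that was shipped to the TaskTracker's local disk.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");   // assumed "id\tname" layout
            users.put(Integer.parseInt(fields[0]), fields[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // assumed "userId\tamount" layout
        String name = users.get(Integer.parseInt(fields[0]));
        if (name != null) {
            context.write(new Text(name + "\t" + fields[1]), NullWritable.get());
        }
    }
}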