Before using MapReduce to solve any problem, we need to think about how to design the job. A map phase and a reduce phase are not both required in every case.
1 MapReduce design patterns (MapReduce)
1.1 Input-map-reduce-output
1.2 Input-map-output
1.3 Input-multiple Maps-reduce-output
1.4 Input-map-combiner-reduce-output
MapReduce design patterns (MapReduce)
The whole MapReduce operation can be divided into the following four patterns:
1. Input-map-reduce-output
2. Input-map-output
3. Input-multiple Maps-reduce-output
4. Input-map-combiner-reduce-output
Coupling between the reader and the electronic tag takes two forms: (Ⅰ) Inductive coupling: a transformer model in which the coupling is realized through a high-frequency alternating magnetic field in space, based on the law of electromagnetic induction. (Ⅱ) Electromagnetic backscatter coupling: a radar-principle model in which the emitted electromagnetic wave is reflected by the target and carries the target's information back, based on the laws of electromagnetic wave propagation in space.
last_key. Now we can again use a Unix pipe to simulate the whole MapReduce process:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_performance.rb
1949 111
1950 22
As you can see, this output is the same as the Java version's. Now let's run it with Hadoop. Because the hadoop command does not have a streaming option, you must use the jar option and point it at the Streaming JAR file. As follows:
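A typical invocation looks like the following. The path to the Streaming JAR varies by Hadoop version and installation, and the input and output paths shown are only illustrative; the mapper and reducer scripts are the same Ruby scripts used in the pipe simulation above.

% hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -reducer ch02/src/main/ruby/max_temperature_performance.rb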
The shuffle process has a very big impact on the total running time of a job, so general MapReduce tuning is mainly about adjusting the parameters of the shuffle stage (see: data flow with multiple reduce tasks).
IV. How to reduce the amount of data transferred from map to reduce
The available bandwidth on the cluster limits MapReduce jobs, because the intermediate map output is transferred to the reducers over the network, so the most important point is to minimize the amount of data transferred between map and reduce.
the JobConf, and in some applications a combiner class as well; the combiner is also an implementation of Reducer.
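As a concrete illustration, here is a minimal sketch of setting a combiner on a JobConf, assuming the classic org.apache.hadoop.mapred API; TokenCountMapper and LongSumReducer are Hadoop's built-in library classes, and the class name CombinerDriver is illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class CombinerDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CombinerDriver.class);
        conf.setJobName("word count with combiner");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);   // emits (word, 1)
        // The combiner is itself a Reducer implementation; here the same class
        // serves as both combiner (map side) and reducer (reduce side).
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Because summation is associative and commutative, running the same reduce function as a combiner does not change the final result; it only cuts down the data written to disk and sent over the network.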
2.1.2 JobTracker and TaskTracker
MapReduce jobs are all scheduled by one master service, the JobTracker, together with multiple slave services, the TaskTrackers, running on the cluster nodes. The master is responsible for scheduling each sub-task of a job onto a slave and monitoring it; if it finds a failed task, it re-runs it. The slaves are responsible for directly executing each task.
A type check is done here: if the argument is not a primitive type (that is, if it is a compound type such as a struct, array, or map), an exception is thrown. Operator overloading is also implemented here: for integer types, GenericUDAFSumLong implements the UDAF logic, and for floating-point types, GenericUDAFSumDouble implements it.
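A sketch of that type dispatch, loosely modeled on Hive's sum UDAF and assuming the org.apache.hadoop.hive.ql.udf.generic resolver API (the class name GenericUDAFMySum is illustrative, not Hive's actual source):

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

public class GenericUDAFMySum extends AbstractGenericUDAFResolver {
    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1, "Exactly one argument is expected.");
        }
        // Reject complex types (struct, array, map, ...): only primitives can be summed.
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0, "Only primitive type arguments are accepted.");
        }
        // Operator overloading: pick the evaluator by the primitive category.
        switch (((PrimitiveTypeInfo) parameters[0]).getPrimitiveCategory()) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
                return new GenericUDAFSum.GenericUDAFSumLong();
            case FLOAT:
            case DOUBLE:
                return new GenericUDAFSum.GenericUDAFSumDouble();
            default:
                throw new UDFArgumentTypeException(0, "Numeric argument is expected.");
        }
    }
}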
Implement Evaluator
All evaluators must inherit from the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. Subclasses must implement a number of its abstract methods.
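A minimal sketch of such an evaluator, written against that API as a long sum; this is a simplified illustration, not Hive's actual GenericUDAFSumLong source, and LongSumEvaluator / LongSumBuffer are illustrative names.

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.io.LongWritable;

public class LongSumEvaluator extends GenericUDAFEvaluator {
    private PrimitiveObjectInspector inputOI;          // inspector for the incoming values
    private final LongWritable result = new LongWritable(0);

    // Per-group aggregation state.
    static class LongSumBuffer implements AggregationBuffer {
        long sum;
        boolean empty = true;
    }

    @Override
    public ObjectInspector init(Mode mode, ObjectInspector[] parameters) throws HiveException {
        super.init(mode, parameters);
        inputOI = (PrimitiveObjectInspector) parameters[0];
        // Both the partial and the final result are long values.
        return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() {
        return new LongSumBuffer();
    }

    @Override
    public void reset(AggregationBuffer agg) {
        LongSumBuffer buf = (LongSumBuffer) agg;
        buf.sum = 0;
        buf.empty = true;
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
        if (parameters[0] != null) {
            LongSumBuffer buf = (LongSumBuffer) agg;
            buf.sum += PrimitiveObjectInspectorUtils.getLong(parameters[0], inputOI);
            buf.empty = false;
        }
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) {
        return terminate(agg);
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
        if (partial != null) {
            // In the merge phases the "input" is the partial sum produced above.
            iterate(agg, new Object[] { partial });
        }
    }

    @Override
    public Object terminate(AggregationBuffer agg) {
        LongSumBuffer buf = (LongSumBuffer) agg;
        if (buf.empty) {
            return null;
        }
        result.set(buf.sum);
        return result;
    }
}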
The sum and product procedures that we completed before are, as 1.32 points out, special cases of a more general procedure named accumulate.
In other words, we need to abstract the sum and product procedures into a single, more general procedure.
As discussed in the summary of exercise 1.31, the sum procedure differs only slightly from the product procedure: the accumulation operations are different, and so are the initial values.
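To make the abstraction concrete, here is a minimal sketch of accumulate in Java (an iterative version, with the combining operation, initial value, term, and step passed in); sum and product then fall out as special cases.

import java.util.function.LongBinaryOperator;
import java.util.function.LongUnaryOperator;

public class Accumulate {
    // combiner: how two values are combined; initial: the identity value;
    // term: the value contributed by each index; next: how to step from a toward b.
    static long accumulate(LongBinaryOperator combiner, long initial,
                           LongUnaryOperator term, long a, LongUnaryOperator next, long b) {
        long result = initial;
        for (long i = a; i <= b; i = next.applyAsLong(i)) {
            result = combiner.applyAsLong(result, term.applyAsLong(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sum of 1..5 and product of 1..5 as special cases of accumulate.
        long sum = accumulate(Long::sum, 0, x -> x, 1, x -> x + 1, 5);           // 15
        long product = accumulate((x, y) -> x * y, 1, x -> x, 1, x -> x + 1, 5); // 120
        System.out.println(sum + " " + product);
    }
}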
at one time, but will merge in multiple passes, each pass handling at most 10 streams. This means that when the intermediate map output is very large, it helps to reduce the number of merge passes and the number of times the map output is read from disk, so it can pay to tune io.sort.factor for the job.
When the job specifies a combiner, we know that the map output is merged on the map side using the function the combiner defines.
Increase mapreduce.task.io.sort.factor (default: 10) to reduce the number of merge passes, thereby reducing disk operations;
Spilling, an important step, is handled by the spill thread. The spill thread starts working in earnest when the map task "commands" it to; its job is called sortAndSpill, because it does more than spill: before the spill there is also a (somewhat debated) sort.
When a combiner is present, the map output is merged according to the function the combiner defines.
interact with external resources.
Three. Reducer
1. A reducer can also choose to extend the base class MapReduceBase, which serves the same purpose as it does for a mapper.
2. The reducer must implement the Reducer interface, which is also a generic interface whose type parameters have a meaning similar to Mapper's.
3. It must implement the reduce method, which also takes four parameters: the first is the input key; the second is an iterator over the input values, which you can traverse like a list; the third is the OutputCollector used to collect the output; and the fourth is the Reporter. A sketch is shown below.
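A minimal sketch of such a reducer, assuming the classic org.apache.hadoop.mapred API and a word-count style job (the class name SumReducer is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {                    // traverse all values for this key
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));    // emit (word, total count)
    }
}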
When a job specifies a combiner, we all know that the map output is merged on the map side according to the function the combiner defines. The combiner function may run either before or after the merge completes; this timing can be controlled by a parameter, min.num.spills.for.combine (default: 3): when the number of spill files reaches this value, the combiner is also run during the merge.
Key1    Value1
0       Hello Hadoop GoodBye Hadoop

2. Map output / Combine input
The output result of map1 is as follows:
Key2    Value2
Hello   1
World   1
Bye     1
World   1
The output result of map2 is as follows:
Key2    Value2
Hello   1
Hadoop  1
GoodBye 1
Hadoop  1

3. Combine output
The Combiner class combines the values of the same key; it is also a Reducer implementation.
The output of combine1 is as follows:
Key2    Value2
Hello   1
World   2
Bye     1
The output of combine2 is as follows:
Key2    Value2
Hello   1
Hadoop  2
GoodBye 1
the number of spill-file streams that can be merged into the merge file at one time. For example, if the map produces a large amount of data and generates more than 10 spill files while io.sort.factor keeps its default value of 10, then when the map finishes and merges its output it cannot merge all the spill files in a single pass; it has to merge in multiple passes, at most 10 streams at a time. This means that when the intermediate map output is very large, raising io.sort.factor reduces the number of merge passes and the disk reads they cause.
one partition per reduce task. This is done to avoid the awkward situation where some reduce tasks are allocated large amounts of data while others get little or none. In fact, partitioning is essentially hashing the data. The data in each partition is then sorted, and if a combiner has been set, the combiner is run on the sorted output; the goal is to write as little data to disk as possible. 3. When the map task outputs its last record, there may be many spill files, which then have to be merged.
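To make the hashing idea concrete, here is a minimal sketch of a partitioner, assuming the new org.apache.hadoop.mapreduce API; it mirrors what Hadoop's default HashPartitioner does, and the class name WordHashPartitioner is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the
        // remainder by the number of reducers: each partition maps to one reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}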
I added a check so that tokens equal to "" or "\t" are not output, but the final result still contained blank words, which is baffling. 2. Mapper output: if you use the form ((term:docid), tf), with ":" separating the term and the docid, then in the combiner, if you split the key on ":" (that is, the incorrect mapper approach below), the number of strings you get is sometimes not the two parts you expect.
public static class InverseIndexMapper extends Mapper
Use of
name as the key, will we then achieve our original goal? The map output would become a.txt -> word word ... word
This is obviously not the result we want.
So the format of the map output should be
word->text 1 (a single word, the text it appears in, and a count of 1)
Such as:
Hello->a.txt 1
Here "->" is used as the separator between the word and the text in which it resides.
This will not affect our results when merging according to Key.
The map code is as follows:
public static class MyMapper extends Mapper
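Filling in the body, a minimal sketch of this mapper, assuming the new org.apache.hadoop.mapreduce API, line-oriented text input with LongWritable offsets, and the FileSplit lookup as one common way to obtain the file name (written as a top-level class for brevity):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Name of the file this split comes from, e.g. "a.txt".
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            // Key is "word->file", value is the count "1", matching Hello->a.txt 1.
            outKey.set(tokenizer.nextToken() + "->" + fileName);
            context.write(outKey, outValue);
        }
    }
}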
After map execution is complete, we need a
MapReduce design patterns (MapReduce)
The entire MapReduce operation can be divided into the following four types:
1. Input-map-reduce-output
2. Input-map-output
3. Input-multiple Maps-reduce-output
4. Input-map-combiner-reduce-output
I'll show you which design pattern to use in each scenario.
Input-map-reduce-output
Input -> Map -> Reduce -> Output
If we need to do some aggregation operations (aggregation), we need to use this pattern.
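For instance, a minimal sketch of a job driver for this pattern, assuming the new org.apache.hadoop.mapreduce API and using Hadoop's library classes TokenCounterMapper and IntSumReducer for the aggregation (word counting); the class name AggregationDriver and the command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class AggregationDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-map-reduce-output");
        job.setJarByClass(AggregationDriver.class);
        job.setMapperClass(TokenCounterMapper.class); // input -> map: emits (word, 1)
        job.setReducerClass(IntSumReducer.class);     // reduce: sums counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}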
Scenario
The buffer-occupancy ratio that triggers a spill defaults to 0.80 and can be configured via mapreduce.map.sort.spill.percent. While the background thread is writing, the map continues to write output into the ring buffer; if the buffer fills up, the map blocks until the spill completes, so that existing data in the buffer is not overwritten. Before writing, the background thread divides the data according to the reducer it will eventually be sent to; by invoking the Partitioner's getPartition() method it knows which partition, and hence which reducer, each record belongs to.
adjustment. Note: the result of the merge sort is two files, one an index file and the other a data file; the index file records the offset of each different key (that is, each partition) in the data file. On a map node, if you find that the machines running the map's child tasks have heavy I/O, the reason may be that io.sort.factor is set too small: if io.sort.factor is small and there are many spill files, merging them into one file requires many read operations, which increases the I/O load.
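As a small illustration, these two knobs could be set programmatically like this (a sketch assuming MRv2-style property names; the values shown are just the defaults):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start spilling when the ring buffer is 80% full (the default).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Merge at most 10 spill streams per pass (the default); consider raising
        // this when a map task produces many spill files.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        Job job = Job.getInstance(conf, "spill tuning example");
        // ... the rest of the job setup (mapper, reducer, paths) goes here
    }
}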