The MapReduce execution pipeline consists, in chronological order, of the input split, map, combiner, shuffle, and reduce stages.
The partition a record goes to is deterministic: simply a number from 0 to n-1, one per reduce task.
A custom combiner can be defined.
1. Input split: before the map computation starts, MapReduce computes input splits from the input files; each input split corresponds to one map task.
task's output is partitioned: one partition is created for each reduce task. Each partition may contain many keys (and their corresponding values), but all key/value records for a given key land in the same partition. Partitioning is controlled by a user-definable partition function, but by default a hash-based partitioner is used, which is efficient.
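The default hash partitioning described above can be sketched in plain Java. The class below is an illustrative stand-in, not Hadoop's actual class, but the arithmetic mirrors the logic of Hadoop's default HashPartitioner: mask off the sign bit of the key's hash, then take it modulo the number of reduce tasks.

```java
// Sketch of default hash partitioning (illustrative stand-in, not the
// org.apache.hadoop.mapreduce.lib.partition.HashPartitioner class itself).
public class HashPartitionDemo {
    // Mask the sign bit so the result is non-negative, then mod by the
    // number of reduce tasks; the same key always maps to the same partition.
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // A key is assigned to exactly one of the n partitions, deterministically.
        System.out.println(getPartition("Hello", 4));
        System.out.println(getPartition("Hello", 4)); // same value again
    }
}
```

Because the assignment is a pure function of the key, every record for a given key reaches the same reduce task, which is what makes grouping by key in the reducer possible.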
In general, the data flow for multiple reduce tasks is shown in the following figure. This also shows why the data flow between map and reduce tasks is colloquially known as "the shuffle."
content.
The following is the input data for map1:
Key1	Value1
0	Hello World Bye World
The following is the input data for map2:
Key1	Value1
0	Hello Hadoop GoodBye Hadoop
2. Map output / combine input
The following is the output of map1:
Key2	Value2
Hello	1
World	1
Bye	1
World	1
The following is the output of map2:
Key2	Value2
Hello	1
Hadoop	1
GoodBye	1
Hadoop	1
3. Combine output
The Combiner class implementation combines the values of the same key; it is also a Reducer implementation.
I. Overview
1. Current status and progress of domestic CATV network
Radio and television is the largest medium in China: it provides broadcasting services from wireless to cable, has the largest information audience, and is the biggest collector, producer, disseminator, and provider of advertising, economic, market, entertainment, and cultural information. By 1999, CCTV and more than 20 provincial-level television stations
MapReduce is a distributed computing model proposed by Google, originally for the search field. A MapReduce program is in essence run in parallel, so it can solve computational problems over massive data. A MapReduce job is divided into two processing stages: the map phase and the reduce phase. Each stage takes key/value pairs as input and output. Users only need to implement two functions, map() and reduce(), to achieve distributed computing. Execution proceeds in two steps: the map tasks run first, then the reduce tasks.
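The two user-supplied functions can be sketched in plain Java. The following is a minimal single-process simulation of the classic word-count job, with ordinary collections standing in for Hadoop's framework types; the class and method names are illustrative, not Hadoop APIs.

```java
import java.util.*;

// Single-process simulation of the map and reduce phases (illustrative only;
// ordinary Java collections stand in for the Hadoop framework).
public class WordCountSim {
    // map(): one input line -> a list of (word, 1) pairs
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // reduce(): (word, [counts]) -> total count for that word
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Drives the whole flow: map every line, group values by key
    // (the "shuffle"), then reduce each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>(); // keys come out sorted
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet())
            result.put(e.getKey(), reduce(e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
            "Hello World Bye World", "Hello Hadoop GoodBye Hadoop")));
    }
}
```

In a real job, map() and reduce() run on different machines and the grouping step is performed by the framework's shuffle; the simulation only shows the contract each function must satisfy.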
Before using MapReduce to solve any problem, we need to consider how to design it; map and reduce jobs are not both required in every case.
1 MapReduce design patterns
1.1 Input-Map-Reduce-Output
1.2 Input-Map-Output
1.3 Input-Multiple Maps-Reduce-Output
1.4 Input-Map-Combiner-Reduce-Output
MapReduce design patterns
The whole MapReduce operation can be divided into the following four patterns:
1. Input-Map-Reduce-Output
2. Input-Map-Output
2. Several optical access methods
2.1 Optical Fiber Distributed Network (FDN). FDN is divided into active and passive optical fiber networks. The difference between the two is that a passive optical fiber network replaces the remote optical terminal device (ROLT) of an active network with a passive optical splitter. The transmission protocols used by the two also differ: the active optical fiber network adopts PDH or SDH transmission protocols
nodes may still be performing several more map tasks. But they also begin exchanging the intermediate outputs of the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
-Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the reducer.
Q9. If no custom partitioner is used, how is data partitioned before being sent to the reducers? By default, the HashPartitioner hashes the key and takes the result modulo the number of reduce tasks.
last_key. Now we use a Unix pipe to simulate the entire MapReduce process:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949	111
1950	22
As you can see, this output is the same as the Java version's. Now let us run it with Hadoop. Because the hadoop command does not have a streaming option, you must use the jar option and point it at the Streaming JAR file. As follows:
When "1" and "0" digital signals are transmitted over copper wire, high and low voltage states usually correspond to "1" and "0" respectively. When the transmitted signal voltage is applied to a laser diode to convert it into light, the light turns on or off. If the light-on and light-off states are taken to be "1" and "0" respectively, digital data can be transmitted through optical fiber.
If the optical signal transmitted through the optical fiber is restored to an electrical signal at the receiving end, the original digital data can be recovered.
by efficient use of fiber and copper wire. CATV is one representative. To deliver TV programs, coaxial cable made of copper wire connects subscribers with the CATV television station. Recently, Internet connections have started over it, and it has been applied to local telephony, offering higher speed, more channels, and two-way communication.
Coaxial cable has difficulty carrying high-frequency signals
time, but will be divided into multiple passes, each merging at most 10 streams. This means that when the intermediate result of the map is very large, raising io.sort.factor helps reduce the number of merge passes and the number of disk reads performed by the map, so tuning io.sort.factor can optimize this part of the job.
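The effect of this parameter can be sketched with a small calculation. The function below is an illustrative simplification (Hadoop's real merger optimizes the sizes of the first pass): it assumes each merge pass combines at most `factor` spill files into one and counts how many passes are needed.

```java
// Rough model of multi-pass merging: each pass merges at most `factor`
// spill files into one new file, until a single file remains.
// Illustrative simplification; Hadoop's actual merge planning is more subtle.
public class MergeRounds {
    public static int rounds(int spills, int factor) {
        int r = 0;
        while (spills > 1) {
            // One pass consumes up to `factor` inputs and emits 1 merged file.
            spills = spills - Math.min(spills, factor) + 1;
            r++;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(rounds(15, 10)); // 15 spills, factor 10 -> 2 passes
        System.out.println(rounds(10, 10)); // fits in a single pass
    }
}
```

With 15 spill files and the default factor of 10, a first pass merges 10 files into 1, leaving 6, and a second pass finishes the job; raising the factor to 15 would cut this to a single pass, halving the re-read of spilled data.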
When the job specifies a combiner, we know that map results are merged on the map side based on the function the combiner defines
mapreduce.task.io.sort.factor (default: 10) to reduce the number of merges, thereby reducing disk operations;
This important spill process is carried out by the spill thread, which formally starts work on "command" from the map task. The operation is called sortAndSpill because it does not only spill: before spilling there is a sort step.
When a combiner is present, the results of the map are merged according to the function the combiner defines.
interact with external resources.
III. Reducer
1. A reducer can also inherit the base class MapReduceBase, which serves the same purpose as it does for a mapper.
2. The reducer must implement the Reducer interface, which is also a generic interface with a meaning similar to Mapper's.
3. It must implement the reduce method, which likewise has four parameters: the first is the input key; the second is an iterator over the input values, which can be traversed like a list; the third is the OutputCollector used to collect the output; the fourth is the Reporter.
a job specifies a combiner, we all know that the map results will be merged on the map side based on the function the combiner defines. The combiner function may run before or after the merge completes. This timing can be controlled by a parameter, min.num.spills.for.combine (default 3): when the number of spill files reaches this threshold, the combiner is run during the merge.
1. Map input
The input of map1 is as follows:
Key1	Value1
0	Hello World Bye World
The input of map2 is as follows:
Key1	Value1
0	Hello Hadoop GoodBye Hadoop
2. Map output / combine input
The output of map1 is as follows:
Key2	Value2
Hello	1
World	1
Bye	1
World	1
The output of map2 is as follows:
Key2	Value2
Hello	1
Hadoop	1
GoodBye	1
Hadoop	1
3. Combine output
The Combiner class combines the values of the same key; it is also a Reducer implementation.
The output of combine1 is as follows:
Key2	Value2
Hello	1
World	2
Bye	1
The output of combine2 is as follows:
Key2	Value2
Hello	1
Hadoop	2
GoodBye	1
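The combine step above can be sketched in plain Java. The class below is a hypothetical stand-in, not Hadoop's combiner machinery: it sums the counts of identical keys within one map task's local output, which is exactly the transformation from the map-output tables to the combine-output tables.

```java
import java.util.*;

// Hypothetical stand-in for the combine step: locally sums the counts of
// identical keys in one map task's output before the shuffle.
public class CombinerDemo {
    public static LinkedHashMap<String, Integer> combine(
            List<? extends Map.Entry<String, Integer>> mapOutput) {
        LinkedHashMap<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput)
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum); // same key -> summed value
        return combined;
    }

    public static void main(String[] args) {
        // map1's output from the tables above
        List<Map.Entry<String, Integer>> map1 = Arrays.asList(
            new AbstractMap.SimpleEntry<>("Hello", 1),
            new AbstractMap.SimpleEntry<>("World", 1),
            new AbstractMap.SimpleEntry<>("Bye", 1),
            new AbstractMap.SimpleEntry<>("World", 1));
        System.out.println(combine(map1)); // {Hello=1, World=2, Bye=1}
    }
}
```

Because the combiner runs on each map task's local output, it shrinks the data before it crosses the network, which is why it is often just the reduce function applied early.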
the number of parallel streams that can be merged at once when merging spill files. For example, if the data produced by a map is very large, more than 10 spill files are generated, and io.sort.factor keeps its default value of 10, the map cannot merge all spill files in a single pass when it finishes; it must merge in multiple passes, with at most 10 streams per pass. This means that when the intermediate result of a map is very large, raising io.sort.factor reduces the number of merge passes and disk reads.
partition for each reduce task. This is done to avoid the embarrassment of some reduce tasks being allocated large amounts of data while others get little or none. In fact, partitioning is a hashing of the data. The data in each partition is then sorted; if a combiner is set at this point, the sorted result is combined, the purpose being to write as little data as possible to disk. 3. When the map task outputs its last record, there may be many spill files, which need to be merged into a single file.
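The partition-then-sort sequence at spill time can be sketched as follows. This is an illustrative simplification, not Hadoop code: keys in the in-memory buffer are split by hash partition, then each partition is sorted before it would be written to the spill file.

```java
import java.util.*;

// Illustrative sketch (not Hadoop code) of what happens to the in-memory
// buffer at spill time: records are split by partition, then sorted by key
// within each partition before being written out.
public class SpillSketch {
    public static Map<Integer, List<String>> spill(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String k : keys) {
            // hash partitioning: one bucket per reduce task
            int p = (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
            partitions.computeIfAbsent(p, x -> new ArrayList<>()).add(k);
        }
        for (List<String> part : partitions.values())
            Collections.sort(part); // sort within each partition before spilling
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(spill(Arrays.asList("Bye", "Hello", "World", "World"), 2));
    }
}
```

A combiner, if configured, would run on each sorted partition here, which is exactly why the text says the sorted result is combined before hitting the disk.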
1. Even after adding a check that skips tokens equal to "" or "\t" so they are not output, the final result still contained blank words, which was puzzling. 2. Mapper output: if the output takes the form ((term:docid), tf), using ":" to separate the term and docid, then splitting the key on ":" in the combiner (the incorrect mapper approach shown below) sometimes yields an unexpected number of strings.
public static class InverseIndexMapper extends Mapper
If we use only the file name as the key, we will not achieve our original goal, because the map output would become a.txt -> word, word, ..., word.
This is obviously not the result we want.
So the format of the map output should be: a single word plus the text it resides in, with value 1. For example:
Hello->a.txt 1
The "->" is used here as the separator between the word and the text in which it resides.
This will not affect our results when merging according to Key.
The map code is as follows:
public static class MyMapper extends Mapper
After map execution is complete, the outputs are merged by key in the combine and reduce stages, producing the inverted index.
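The map step of the inverted index described above can be sketched in plain Java. This is an illustrative stand-in: the file name is passed in directly (in a real Hadoop mapper it would be obtained from the input split), and a tab separates the composite key from the count, matching the "Hello->a.txt 1" format.

```java
import java.util.*;

// Sketch of the inverted-index map step (illustrative stand-in; in real
// Hadoop code the filename comes from the task's input split).
public class InvertedIndexMapSim {
    // Emits "word->filename" as the key and 1 as the value,
    // matching the "Hello->a.txt 1" format described above.
    public static List<String> map(String filename, String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            out.add(word + "->" + filename + "\t1");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("a.txt", "Hello World"));
    }
}
```

Because the word is the leading part of the key, records for the same word from different files sort next to each other, and the reduce step can then fold the per-file counts into one posting list per word.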