MapReduce is a distributed computing model proposed by Google, originally for the search field. A MapReduce program is inherently parallel, which makes it suitable for computation over massive data sets.
A MapReduce job is divided into two processing stages: the map phase and the reduce phase. Each stage takes key/value pairs as input and produces key/value pairs as output. Users only need to implement the two functions map() and reduce() to achieve distributed computing.
The execution steps are as follows:
Map Task Processing:
1. Read the contents of the input file and parse them into key/value pairs: each line of the input file becomes one key/value pair, and the map function is called once for each pair.
2. Apply your own logic to the input key/value pair and emit new key/value pairs as output.
3. Partition the output key/value pairs. (partition)
4. Within each partition, sort the data by key and group it, putting values with the same key into one collection. (shuffle)
5. Locally reduce the grouped data. (combiner, optional)
Reduce Task Processing:
1. Copy the output of the multiple map tasks over the network to the different reduce nodes, according to partition.
2. Merge and sort the outputs of the multiple map tasks. Apply your own reduce logic to the input key/value pairs and emit new key/value pairs as output.
3. Save the reduce output to a file (written to HDFS).
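The two functions described above can be illustrated with the classic word-count example. This is a minimal sketch against the org.apache.hadoop.mapreduce ("new") API; the class names WordCountMapper/WordCountReducer are illustrative, and the job wiring appears in the driver sketch later in this article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): called once per input key/value pair (here: byte offset / line of text),
// emits a new key/value pair (word, 1) for every word on the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// reduce(): called once per key after the shuffle; all values for the same key
// arrive grouped together, so summing them yields the word's total count.
// (Package-private and in the same file only for brevity of the sketch.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}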
MapReduce Job Flow:
1. Code writing
2. Job configuration (input/output paths, number of reduce tasks, etc.; a minimal driver sketch appears at the end of this flow)
3. Job submission
3.1 The job is submitted through JobClient, which communicates with the JobTracker to obtain a storage path for the jar and a job ID.
3.2 Check the input and output paths.
3.3 Compute the input split information.
3.4 Copy the resources the job needs (jar, configuration file, computed input splits) to HDFS, into a directory named after the job ID.
3.5 Inform the JobTracker that the job is ready for execution.
4. Job initialization
When the JobTracker receives the submitted job, it places it in an internal queue, where it is dispatched by the job scheduler (FIFO by default) and initialized.
Initialization creates an object representing the running job, which encapsulates its tasks and bookkeeping information so that the status and progress of each task can be tracked.
To create the list of tasks, the job scheduler first retrieves the computed input splits from the shared file system. It then creates one map task per split, plus the configured number of reduce tasks. At this point each task is assigned an ID.
5. Task Assignment
Each TaskTracker runs a simple loop that periodically sends a "heartbeat" to the JobTracker. The heartbeat tells the JobTracker that the TaskTracker is still alive and indicates whether it is ready to run a new task; if so, the JobTracker assigns it one.
6. Task execution
When the TaskTracker receives a task:
1. It copies all the required information to the local node (jar, code, configuration, input split information, etc.).
2. It creates a new local working directory for the task and extracts the contents of the jar file into this directory.
3. It creates a new TaskRunner instance to run the task.
The TaskRunner starts a new JVM to run each task (so that a misbehaving task cannot affect the TaskTracker itself), although the JVM may be reused between different tasks.
7. Updates to progress and status
Each task reports its progress to its TaskTracker on a regular basis, and each TaskTracker periodically collects the information for all of its tasks and reports it to the JobTracker. The JobTracker aggregates the information reported by all TaskTrackers into an overall view of the job.
8. Job completion
After the JobTracker receives notification that the last task has completed, it marks the job as "successful", and the job's results have been written to HDFS.
PS:
JobTracker: responsible for receiving jobs submitted by users, and for starting and tracking task execution.
TaskTracker: responsible for executing tasks.
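A minimal driver sketch for steps 1-3 above (code writing, job configuration, job submission). It assumes the WordCountMapper/WordCountReducer classes from the earlier sketch, uses the Hadoop 2.x Job.getInstance() style (Hadoop 1.x used new Job(conf, name)), and takes the input/output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job configuration
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                         // "number of reduce tasks"

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path (must not already exist)

        // waitForCompletion(true) submits the job, prints progress, and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}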
Job failure:
1. JobTracker failure
This is the most serious kind of failure: the JobTracker is a single point of failure, so when it goes down the job is doomed to fail. (Solved in Hadoop 2.0.)
2. TaskTracker failure
A crashed TaskTracker stops sending heartbeats to the JobTracker. The JobTracker removes that TaskTracker from the pool of workers waiting for tasks and moves its tasks elsewhere for execution; the JobTracker may also add the TaskTracker to a blacklist.
3. Task failure
A map or reduce task fails to run, throws an exception to the TaskTracker, and the task attempt is aborted.
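For case 3, the framework retries a failed task attempt a limited number of times (four by default) before failing the whole job. A hedged sketch of tuning this on the job's Configuration, assuming the Hadoop 2.x property names (older releases used mapred.map.max.attempts / mapred.reduce.max.attempts):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Maximum attempts per map / reduce task before the whole job is declared failed (default 4).
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);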
MapReduce startup process:
start-mapred.sh --> hadoop-daemon.sh --> hadoop --> org.apache.hadoop.mapred.JobTracker
JobTracker Call Order:
main() --> startTracker() --> new JobTracker(). The constructor first creates a scheduler and then creates an RPC server (interTrackerServer), through which the TaskTrackers communicate with the JobTracker via RPC. It then calls the offerService() method to provide services externally: offerService() starts the RPC server, initializes the JobTracker, and calls the start() method of the TaskScheduler, which calls the start() method of the EagerTaskInitializationListener, which in turn starts the JobInitManagerThread. Since this is a thread, the JobInitManager's run() method executes: it takes the first job from the jobInitQueue task queue and hands it to a thread pool, which calls InitJob's run() method, which calls the JobTracker's initJob() method
--> JobInProgress.initTasks()
--> maps = new TaskInProgress[numMapTasks] and
reduces = new TaskInProgress[numReduceTasks];
TaskTracker Call Order:
main() --> new TaskTracker(), whose constructor calls the initialize() method. initialize() calls RPC.waitForProxy() to obtain a proxy object for the JobTracker, and then the TaskTracker calls its own run() method
--> offerService() method --> transmitHeartbeat(), whose return value (HeartbeatResponse) carries the JobTracker's instructions. Inside transmitHeartbeat(), the InterTrackerProtocol's heartbeat() method sends the TaskTracker's state to the JobTracker through the RPC mechanism, and the return value comes back from the JobTracker. HeartbeatResponse.getActions() yields the specific instructions, which are then examined by type; for a launch-task instruction, addToTaskQueue() enqueues it and the TaskLauncher puts the task into its task queue
--> TaskLauncher's run() method --> startNewTask() method
--> localizeJob() downloads resources --> launchTaskForJob() starts loading the task
--> launchTask() --> runner.start() starts the thread;
the TaskRunner's run() method then calls launchJvmAndWait() to start the Java child process.
MapReduce Details
Serialization Concepts
Serialization: Refers to converting a structured object into a byte stream.
Deserialization: The inverse of serialization, i.e. turning a byte stream back into a structured object.
Features of the Hadoop serialization format:
1. Compact: Efficient use of storage space
2. Fast: Little extra overhead for reading and writing data
3. Extensible: Can transparently read data written in an older format
4. Interoperable: Supports interaction across multiple languages.
The role of Hadoop serialization:
Serialization plays two important roles in a distributed environment: interprocess communication and permanent storage. In Hadoop it is used for communication between nodes.
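A minimal sketch of Hadoop's serialization contract, org.apache.hadoop.io.Writable: an object writes its fields to a byte stream and reads them back in the same order. The class and field names here are illustrative, not from the original article.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Serialization: write the object's fields to a byte stream (write).
// Deserialization: rebuild the object from that stream (readFields).
public class TrafficWritable implements Writable {
    private long upBytes;
    private long downBytes;

    public TrafficWritable() { }                 // no-arg constructor required by the framework

    public TrafficWritable(long upBytes, long downBytes) {
        this.upBytes = upBytes;
        this.downBytes = downBytes;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upBytes);
        out.writeLong(downBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        upBytes = in.readLong();                  // fields must be read in the order they were written
        downBytes = in.readLong();
    }
}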
Partitioner programming
Data that shares some common characteristic is written to the same output file.
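A minimal partitioner sketch, assuming Text/IntWritable map output and an illustrative rule (the first character of the key): records that share the characteristic land in the same partition, hence the same reduce task and the same output file. It is wired in with job.setPartitionerClass(FirstCharPartitioner.class) and a matching job.setNumReduceTasks(2).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys starting with a digit go to partition 0, everything else to partition 1,
// so each output file (part-r-00000, part-r-00001) holds one category of keys.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0;
        }
        String s = key.toString();
        return (!s.isEmpty() && Character.isDigit(s.charAt(0))) ? 0 : 1;
    }
}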
Sorting and Grouping
The sorting that happens between the map and reduce phases compares only K2; V2 does not take part in the comparison. If you want V2 to be sorted as well, you need to assemble K2 and V2 into a new class and use that class as K2, so that both take part in the comparison. To customize the collation, have the sorted object implement the WritableComparable interface and put the ordering logic in its compareTo() method; this object is then used as K2, and both sorting and grouping are done by K2.
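A sketch of such a composite key, assuming the original K2 was a word and V2 a count (the names are illustrative): the pair is wrapped into one class implementing WritableComparable, so the count also participates in the shuffle sort.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: sorts by word first, then by count descending, so the value (V2)
// takes part in the sort instead of only the original key (K2).
public class WordCountPair implements WritableComparable<WordCountPair> {
    private String word;
    private int count;

    public WordCountPair() { }

    public WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    @Override
    public int compareTo(WordCountPair other) {   // the collation used during the shuffle sort
        int c = word.compareTo(other.word);
        return (c != 0) ? c : Integer.compare(other.count, this.count);
    }
}

When such a class is used as K2, hashCode() and equals() should also be overridden so that the default HashPartitioner and the grouping behave consistently.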
Combiners programming
1. Each map task generates a lot of output; the combiner's role is to merge that output on the map side first, reducing the amount of data transferred to the reducers.
2. At its most basic, the combiner merges values for the same key locally, acting like a local reduce function. Without a combiner, all the map output goes to the reducers unmerged and efficiency is relatively low.
3. With a combiner, each map's output is first aggregated locally, which increases speed.
PS: The combiner's output becomes the reducer's input, and the combiner absolutely must not change the final result. So, in my view, a combiner only applies to cases where the reducer's input key/value and output key/value types are exactly the same and the final result is unaffected, for example summation or maximum.
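A sketch of wiring in a combiner, assuming the word-count classes from the earlier sketches: because addition is associative and commutative, the reducer class itself can double as the combiner without changing the final result. This fragment belongs in the driver, after setMapperClass()/setReducerClass():

// Local, map-side "reduce": (word, [1,1,1]) becomes (word, 3) before the shuffle,
// shrinking the data sent to the reducers without changing the final counts.
job.setCombinerClass(WordCountReducer.class);

The same trick would not be valid for an average, because averaging partial averages changes the result.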
Shuffle
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map output to the reducers as their input is known as the shuffle.
1. When the map function starts producing output, it is not simply written to disk. For efficiency, it is buffered in memory and pre-sorted. Each map task has a circular memory buffer that stores the task's output. By default the buffer is 100 MB; once the buffered content reaches a threshold (80% by default), a background thread starts writing (spilling) the content to a new spill file in the designated directory on disk. While the spill is in progress, map output continues to be written to the buffer, but if the buffer fills up during this time the map blocks until the spill completes.
2. Before writing to disk, the data is partitioned and sorted; if there is a combiner, it runs on the sorted data.
3. When the last record has been written, all spill files are merged into a single partitioned and sorted file.
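The buffer size and spill threshold described in step 1 are configurable. A hedged sketch on the job's Configuration, assuming the Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);             // ring buffer size in MB (default 100)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // start spilling at 80% full (default)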
How does a reducer know which TaskTrackers to fetch map output from?
When a map task completes successfully, it notifies its parent TaskTracker of the status update, and the TaskTracker in turn informs the JobTracker. These notifications travel over the heartbeat mechanism. Therefore, for a given job, the JobTracker knows the mapping between map outputs and TaskTrackers. A thread in the reducer periodically asks the JobTracker for the locations of map output until it has obtained all of them.
http://m.oschina.net/blog/213034
One of the two cores of Hadoop: a MapReduce summary