Architecture of MapReduce:
-Distributed programming architecture
-Data-centric, with an emphasis on throughput
-Divide and conquer: an operation on a large-scale data set is distributed by a master node to the nodes it manages, which complete it together; the intermediate results of each node are then merged into the final output
-map: breaks a task into multiple subtasks
-reduce: takes the results of the decomposed subtasks and summarizes them into the final result (see the sketch below)
Examples: counting books in a library, counting occurrences of words, making mixed chili sauce, and so on.
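For the word-counting example, here is a minimal sketch of the two phases using the classic org.apache.hadoop.mapred API (the class names WordCountMap/WordCountReduce are our own): map() emits (word, 1) for every word in its input split, and reduce() sums the counts collected for each word.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // map: the "divide" step -- each mapper counts words in its own input split
    public class WordCountMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);   // emit (word, 1)
            }
        }
    }

    // reduce: the "conquer" step -- sum all counts emitted for the same word
    class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            output.collect(word, new IntWritable(sum)); // emit (word, total)
        }
    }

Each mapper works on one split independently, which is exactly the divide step; the reducers then merge the per-split results.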
Structure Chart:
MapReduce likewise uses a master-slave structure.
4 entities:
-Client (writes and submits the MapReduce job)
-JobTracker (coordinates and schedules the job)
-TaskTracker (task node; runs the map and reduce tasks)
-HDFS (stores the input data, output data, configuration information, etc.)
Basic concepts:
Job: in Hadoop, a job represents the collection of all the jar files and classes needed to run a MapReduce program. These are eventually consolidated into a single jar file, which is submitted to the JobTracker; the MapReduce program is then executed.
Tasks (Task): MapTask and ReduceTask
Key-value pairs (Key/value pair)
The input and output of the map() and reduce() functions take the form of <key, value> pairs.
After parsing, the input data stored in HDFS enters the map() function as key-value pairs; map() outputs a series of key-value pairs as intermediate results. In the reduce phase, the intermediate values with the same key are merged to form the final result.
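A small hand-worked trace of that flow for word counting (the two input lines are invented for illustration):

    input splits:   "hello world"            |  "hello hadoop"
    map output:     ("hello",1) ("world",1)  |  ("hello",1) ("hadoop",1)
    shuffle/group:  ("hello",[1,1])  ("world",[1])  ("hadoop",[1])
    reduce output:  ("hello",2)  ("world",1)  ("hadoop",1)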
Life cycle:
1. Job submission
-The job must be configured before it is submitted;
-Program code: mainly the user-written MapReduce program;
-Configuration: input and output paths, output compression, etc.;
-Once configuration is complete, the job is submitted through JobClient (see the driver sketch after the scheduler list);
Job scheduling algorithms:
FIFO Scheduler (default), Fair Scheduler, Capacity Scheduler
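A minimal driver sketch for the submission step, again with the classic mapred API (it reuses the WordCountMap/WordCountReduce classes from the earlier sketch; the input and output paths come from the command line): it sets the program code, the input/output paths, optional output compression, and then submits through JobClient.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Program code: the user-written map and reduce classes
            conf.setMapperClass(WordCountMap.class);
            conf.setReducerClass(WordCountReduce.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Input/output paths in HDFS, plus optional output compression
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

            // JobClient packages the job (jar, classes, configuration)
            // and submits it to the JobTracker
            JobClient.runJob(conf);
        }
    }

JobClient.runJob() blocks and polls the job's progress until it completes, which makes it a convenient entry point for simple jobs.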
2. Task assignment
-Communication and task assignment between the TaskTracker and the JobTracker happen through the heartbeat mechanism;
-The TaskTracker actively asks the JobTracker whether there is work for it; if it can take work, it requests a task, which may be either a map task or a reduce task (an illustrative sketch follows);
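Hadoop's actual heartbeat RPC is internal to the JobTracker/TaskTracker, so the sketch below only illustrates the pull-based idea; every name in it (JobTrackerStub, TaskSpec, TaskTrackerLoop, the 3-second interval) is invented for the example.

    // Illustrative only: these names are NOT Hadoop's internal API.
    import java.util.Optional;

    interface JobTrackerStub {
        // The tracker reports liveness and free slots; the reply may carry a task.
        Optional<TaskSpec> heartbeat(String trackerId, int freeMapSlots, int freeReduceSlots);
    }

    record TaskSpec(String taskId, boolean isMap) {}

    class TaskTrackerLoop {
        private final JobTrackerStub jobTracker;
        private final String trackerId;

        TaskTrackerLoop(JobTrackerStub jobTracker, String trackerId) {
            this.jobTracker = jobTracker;
            this.trackerId = trackerId;
        }

        void run() throws InterruptedException {
            while (true) {
                // Each heartbeat both signals "I am alive" and asks for work.
                Optional<TaskSpec> assigned = jobTracker.heartbeat(trackerId, 2, 2);
                assigned.ifPresent(t ->
                    System.out.println("launching " + (t.isMap() ? "map" : "reduce")
                                       + " task " + t.taskId()));
                Thread.sleep(3000); // invented interval; Hadoop tunes this by cluster size
            }
        }
    }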
3. Task execution
-The TaskTracker copies the code and configuration information to the local machine;
-A separate JVM is started to run each task (a sketch of this isolation idea follows);
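A toy illustration of the one-JVM-per-task idea (the ChildTaskRunner entry point and its arguments are invented; only the heap option mirrors the real mapred.child.java.opts knob): the TaskTracker survives even if the child JVM crashes.

    // Illustrative only: Hadoop's real child-JVM launch is internal.
    import java.io.IOException;

    public class TaskLauncher {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                "java", "-Xmx200m",          // per-task heap, cf. mapred.child.java.opts
                "-cp", "job.jar",            // code localized in step 3
                "ChildTaskRunner",           // invented entry point that runs one task
                "attempt_201309121733_0001_m_000001_0");
            pb.inheritIO();                  // stream the task's output to our console
            int exit = pb.start().waitFor();
            // A nonzero exit kills only the child JVM, not the launcher itself.
            System.out.println("task exited with " + exit);
        }
    }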
4. Status updates
-While a task is running, it first reports its status to the TaskTracker, which then aggregates the reports and forwards them to the JobTracker;
-Task progress is tracked through counters;
-The JobTracker does not mark the job as successful until the last task has finished running.
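As a sketch of how a task feeds those status updates and counters, the classic mapred API hands each map()/reduce() call a Reporter; the counter group and name strings below are made up for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            if (value.toString().isEmpty()) {
                // Custom counter; totals flow to the TaskTracker, then the JobTracker.
                reporter.incrCounter("MyApp", "EMPTY_LINES", 1);
                return;
            }
            reporter.setStatus("processing offset " + key.get()); // shown in the web UI
            output.collect(value, new LongWritable(1));
        }
    }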