package com.felix;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

/**
 * Description: WordCount explained by Felix
 * @author Hadoop Dev Group
 */
public class WordCount {

    /**
     * MapReduceBase is a base class that implements the Mapper and Reducer interfaces
     * (its methods only implement the interfaces and do nothing).
     * WritableComparable: classes implementing WritableComparable can be compared with
     * each other; every class used as a key should implement this interface.
     * Reporter can be used to report the running progress of the whole application;
     * it is not used in this example.
     */
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        /**
         * LongWritable, IntWritable and Text are classes provided by Hadoop that wrap Java
         * data types and implement WritableComparable, so they can be serialized for data
         * exchange in a distributed environment; think of them as substitutes for long, int
         * and String.
         */
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /**
         * The map method of the Mapper interface:
         * void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
         * maps a single input k/v pair to intermediate k/v pairs; the output pairs need not
         * have the same types as the input pair, and one input pair may map to zero or more
         * output pairs.
         * OutputCollector collects the <k,v> pairs output by mapper and reducer;
         * its collect(k, v) method adds one (k,v) pair to the output.
         */
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        /**
         * JobConf is the map/reduce job configuration class; it describes to the Hadoop
         * framework the work a map-reduce job performs.
         * Constructors: JobConf(), JobConf(Class exampleClass), JobConf(Configuration conf), etc.
         */
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");                 // set a user-defined job name
        conf.setOutputKeyClass(Text.class);           // set the key class for the job's output data
        conf.setOutputValueClass(IntWritable.class);  // set the value class for the job's output data
        conf.setMapperClass(Map.class);               // set the Mapper class for the job
        conf.setCombinerClass(Reduce.class);          // set the Combiner class for the job
        conf.setReducerClass(Reduce.class);           // set the Reducer class for the job
        conf.setInputFormat(TextInputFormat.class);   // set the InputFormat implementation for the job
        conf.setOutputFormat(TextOutputFormat.class); // set the OutputFormat implementation for the job

        // FileInputFormat.setInputPaths(): set the array of paths used as the input list of the job.
        // FileOutputFormat.setOutputPath(): set the path used as the output of the job.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);                       // run the job
    }
}
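As a worked example (the input file and its contents are hypothetical, not from the original article), running this job on a two-line text file would produce the following word counts in the job's output file:

Input file:
    hello world
    hello hadoop

Output (part-00000):
    hadoop  1
    hello   2
    world   1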
(1) The map-reduce process mainly involves the following four parts:
Client: submits the map-reduce job.
JobTracker: coordinates the running of the whole job; it is a Java process whose main class is JobTracker.
TaskTracker: runs the tasks of this job and processes the input splits; it is a Java process whose main class is TaskTracker.
HDFS: the Hadoop Distributed File System, used to share job-related files among the various processes.
(2) The JobTracker is responsible for controlling and scheduling a MapReduce job; the TaskTrackers are responsible for running it.
(3) When the client calls runJob(), it asks Hadoop to process a job; Hadoop divides the job into two kinds of tasks, namely map tasks and reduce tasks.
(4) Hadoop divides the input data into fixed-size pieces on the DataNodes (a piece is no larger than an HDFS block, 64 MB by default); this process is called input splitting. Note: text input is split along line boundaries. Hadoop creates one task for each input split, and each record in the split is processed by that task in turn; a small worked example follows.
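To illustrate line-based splitting with the same hypothetical two-line file used above, TextInputFormat hands the mapper one record per line, with the key being the byte offset of the line within the file (file contents and offsets are illustrative only):

Input file (2 lines):
    hello world
    hello hadoop

Records passed to map(key, value):
    map(0,  "hello world")
    map(12, "hello hadoop")    (12 = 11 bytes of "hello world" plus the newline)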
(5) Hadoop tries to run each task on the DataNode that holds its input block (every DataNode also runs a TaskTracker), which improves efficiency; this is why the input split size is usually the same as the HDFS block size.
(6) The output of a map task is usually the input of a reduce task, and the output of the reduce tasks is the output of the whole job, which is normally stored in HDFS.
(7) All records with the same key are guaranteed to be processed by the same reduce task on the same TaskTracker, while records with different keys may be processed on different TaskTrackers; this is what we call partitioning.
The partition rule is (K2,V2) -> integer: a partition id is generated from K2, keys with the same id go into the same partition, and that partition is handled by the same reducer on the same TaskTracker; see the sketch after this paragraph.
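A minimal sketch of such a partition rule, written against the old mapred API and mirroring the default hash-style partitioning; the class name WordPartitioner and the driver lines are illustrative, not part of the original WordCount example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// The partition id is derived from the key (K2) only, so equal keys always
// land in the same partition and therefore reach the same reducer.
public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    public void configure(JobConf job) { }   // no per-job configuration needed
}

In the driver, the number of partitions equals the number of reduce tasks, e.g. conf.setNumReduceTasks(4) together with conf.setPartitionerClass(WordPartitioner.class).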
(8) Task Submission
JobClient.runJob() creates a new JobClient instance and calls its submitJob() method, which:
requests a new job ID from the JobTracker;
checks the output specification of this job;
computes the input splits for this job;
copies the resources needed to run the job (the job JAR file, the job.xml configuration file and the computed input splits) to a folder in the JobTracker's file system;
notifies the JobTracker that this job is ready to run.
After the job is submitted, runJob() polls the job's progress every second and reports it on the command line until the job has finished, as in the sketch below.
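What runJob() does can also be written out by hand with the old mapred client API; the sketch below assumes a fully configured JobConf named conf (as in the WordCount driver above), and the one-second interval and printed format are just illustrative.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.RunningJob;

// Inside a method that declares "throws Exception":
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);        // hand the job to the JobTracker
while (!job.isComplete()) {                     // poll until the job finishes
    System.out.printf("map %.0f%%  reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(1000);
}
System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");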
(9) Initialization of the task
When the JobTracker receives the submitJob() call, it places the job in an internal queue, and the job scheduler picks the job up from the queue and initializes it.
Initialization first creates an object that encapsulates the job's tasks, status and progress.
Before creating the tasks, the job scheduler first retrieves the input splits computed by the JobClient from the shared file system.
It creates a map task for each input split.
Each task is assigned an ID.
(10) Task assignment
Each TaskTracker periodically sends a heartbeat to the JobTracker.
In the heartbeat, the TaskTracker tells the JobTracker that it is ready to run a new task, and the JobTracker assigns one to it.
Before the JobTracker selects a task for a TaskTracker, it must first choose a job by priority and then pick a task from the highest-priority job.
A TaskTracker has a fixed number of slots for running map tasks or reduce tasks.
The default scheduler fills the map task slots before the reduce task slots.
When selecting a reduce task, the JobTracker does not choose among candidates but simply takes the next one, because reduce tasks have no data-locality considerations.
(11) Task Execution
Once a TaskTracker has been assigned a task, it runs the task as follows.
First, the TaskTracker copies the job's JAR file from the shared file system to its local file system.
It also copies the files the job needs from the distributed cache to the local disk.
Second, it creates a local working directory for the task and unpacks the JAR into that directory.
Third, it creates a TaskRunner to run the task.
The TaskRunner launches a new JVM in which the task runs.
The child JVM communicates with the TaskTracker to report the progress of the task.
3.4.1. The map process
MapRunnable reads records from the input split one by one and calls the Mapper's map function to produce the output, as sketched below.
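A simplified sketch of what the standard MapRunner (the default MapRunnable implementation in the old mapred API) does; mapper configuration, cleanup and error handling are omitted, and the method name runMapTask is illustrative.

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Corresponds to MapRunner.run(): read each record of the input split and
// hand it to the mapper's map() method.
public static <K1, V1, K2, V2> void runMapTask(Mapper<K1, V1, K2, V2> mapper,
        RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
        Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {          // one record from the input split
        mapper.map(key, value, output, reporter);
    }
}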
The map's output is not written directly to disk; instead it is written to an in-memory buffer.
When the data in the buffer reaches a certain threshold, a background thread begins to spill the data to disk.
Before it is written to disk, the in-memory data is divided into multiple partitions by the Partitioner.
Within each partition, the background thread sorts the in-memory data by key.
Each time data is flushed from memory to disk, a new spill file is generated.
When the map task ends, all spill files are merged into a single, partitioned and sorted file.
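In Hadoop 1.x this buffer-and-spill behaviour can be tuned through the job configuration; the property names below are the standard ones, while the values are only illustrative (the defaults are 100 MB, 0.80 and 10).

// Map-side spill tuning on the JobConf used in the driver:
conf.setInt("io.sort.mb", 200);                 // size of the in-memory sort buffer, in MB
conf.setFloat("io.sort.spill.percent", 0.80f);  // buffer fill level at which spilling to disk starts
conf.setInt("io.sort.factor", 10);              // number of spill files merged in a single pass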
Reducers can request a map's output file over HTTP; tasktracker.http.threads sets the number of HTTP server threads.
3.4.2. The reduce process
When a map task ends, it notifies its TaskTracker, and the TaskTracker notifies the JobTracker.
For a given job, the JobTracker therefore knows the correspondence between TaskTrackers and map outputs.
In the reducer, a thread periodically asks the JobTracker for the locations of map outputs until it has obtained all of them.
A reduce task needs the map output of its own partition from every map task.
The copy phase of the reduce task starts copying a map's output as soon as that map task ends, because different map tasks finish at different times.
The reduce task has multiple copy threads, so it can copy map outputs in parallel.
As more and more map output is copied to the reduce task, a background thread merges it into larger, sorted files.
When all the map outputs have been copied to the reduce task, the sort phase begins, merging all the map outputs into a large, sorted file.
Finally, the reduce phase runs: the Reducer's reduce function is called for each key of the sorted output, and the final result is written to HDFS.
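The copy and merge phases on the reduce side can be tuned in the same way; the Hadoop 1.x property names below are the standard ones, and the values are illustrative (the defaults are 5 and 0.70).

// Reduce-side shuffle tuning on the JobConf used in the driver:
conf.setInt("mapred.reduce.parallel.copies", 10);                  // copy threads fetching map outputs in parallel
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);   // share of the heap used to buffer copied map outputs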
3.5. End of Task
When the JobTracker receives a report that the last task has run successfully, it changes the job's status to successful.
When the JobClient next polls the JobTracker, it discovers that the job has finished successfully, prints a message to the user and returns from the runJob() function.