Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be distributed automatically across a large cluster of ordinary machines. Just as Java programmers need not worry about memory leaks, MapReduce's run-time system takes care of distributing the input data, scheduling execution across the cluster, handling machine failures, and managing communication between machines. This model lets programmers harness the resources of a large distributed system without needing any experience with concurrent processing or distributed systems.
I. Introduction
As a Hadoop programmer, all you have to do is:
1. Define a Mapper, which processes the input key/value pairs and emits intermediate results.
2. Define a Reducer (optional), which reduces the intermediate results and emits the final results.
3. Define an InputFormat and OutputFormat (optional); the InputFormat converts each line of an input file into the Java type the map function consumes, and if it is not defined each line is treated as a string by default.
4. Define a main function that creates a job from these pieces and runs it.
Everything else is then handed over to the system.
1. Basic concepts: Hadoop's HDFS implements Google's GFS file system: the NameNode, responsible for scheduling the file system, runs on the master, and a DataNode runs on each machine. Hadoop likewise implements Google's MapReduce: the JobTracker, the overall scheduler of MapReduce jobs, runs on the master, and a TaskTracker runs on each machine to execute tasks.
2. The main() function creates a JobConf, defines the Mapper, Reducer, Input/OutputFormat and the input and output directories, and finally submits the job to the JobTracker and waits for it to finish.
3. The JobTracker creates an InputFormat instance, calls its getSplits() method, splits the files in the input directory into FileSplits that become the input of the Mapper tasks, and puts the Mapper tasks into the queue.
4. Each TaskTracker asks the JobTracker for its next map/reduce task.
5. A Mapper task first creates a RecordReader from the InputFormat, loops over the contents of its FileSplit to generate keys and values, passes them to the map function, and finally writes the intermediate results out as a SequenceFile (see the sketch after this list).
6. A Reducer task fetches the intermediate results it needs from the Jetty server of the TaskTrackers that ran the Mappers (33%), sorts/merges them (66%), executes the reduce function, and finally writes the results to the output directory according to the OutputFormat.
7. A TaskTracker reports its progress to the JobTracker every 10 seconds, and 10 seconds after a task completes it asks the JobTracker for the next one.
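To make step 5 concrete, the inner loop of a map task behaves roughly like the following self-contained sketch. It only illustrates the flow (a line-oriented "record reader" feeding key/value pairs to a map function and buffering the emitted pairs); it is not the actual Hadoop source, and the class and method names here are invented for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a map task reads its split record by record, hands each
// (key, value) pair to the map function, and buffers the emitted pairs as the
// intermediate result (which Hadoop would write out as a SequenceFile).
public class MapTaskSketch {

    // Stand-in for a Mapper: (line offset, line text) -> emitted pairs
    interface SimpleMapper {
        void map(long key, String value, List<String[]> output);
    }

    static void runMapTask(String splitFile, SimpleMapper mapper) throws IOException {
        List<String[]> intermediate = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(splitFile))) {
            String line;
            long offset = 0;
            while ((line = reader.readLine()) != null) {   // the "record reader" loop
                mapper.map(offset, line, intermediate);
                offset += line.length() + 1;
            }
        }
        for (String[] kv : intermediate) {                 // pretend this is the SequenceFile
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical mapper: emit only the lines containing "ERROR".
        runMapTask(args[0], (key, value, output) -> {
            if (value.contains("ERROR")) {
                output.add(new String[] { Long.toString(key), value });
            }
        });
    }
}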
All data processing in the Nutch project is built on Hadoop; see Scalable Computing with Hadoop for details.
II. Writing the program
Let's write a simple distributed grep: it simply matches the input files line by line and, if a line matches, writes it to an output file. Because it simply outputs every match, we only write the map function; we write no reduce function and define no Input/OutputFormat.
package demo.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HadoopGrep {

    public static class RegMapper extends MapReduceBase implements Mapper {

        private Pattern pattern;

        public void configure(JobConf job) {
            // read the regular expression passed in from main()
            pattern = Pattern.compile(job.get("mapred.mapper.regex"));
        }

        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter) throws IOException {
            String text = ((Text) value).toString();
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                // matching lines go into the intermediate results unchanged
                output.collect(key, value);
            }
        }
    }

    private HadoopGrep() {
    } // singleton

    public static void main(String[] args) throws Exception {
        JobConf grepJob = new JobConf(HadoopGrep.class);
        grepJob.setJobName("grep-search");
        grepJob.set("mapred.mapper.regex", args[2]);
        grepJob.setInputPath(new Path(args[0]));
        grepJob.setOutputPath(new Path(args[1]));
        grepJob.setMapperClass(RegMapper.class);
        grepJob.setReducerClass(IdentityReducer.class);
        JobClient.runJob(grepJob);
    }
}
The configure() function of the RegMapper class receives the search string passed in from the main function; the map() function performs the regular-expression match, where the key is the line's position in the file and the value is the line's content, and writes matching lines into the intermediate results.
The main() function reads the input and output directories and the search string from the command-line arguments, sets the map function to the RegMapper class and the reduce function to the IdentityReducer class, which does nothing and simply copies the intermediate results through to the final results (see the sketch below), and then runs the job.
The whole program is very simple, without a single detail of distributed programming in it.
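For reference, an "identity" reducer of the kind used above does nothing except forward every key together with each of its values to the output, so the final result is simply the sorted intermediate result. Below is a rough sketch in the same old-API style as RegMapper; it is not the actual org.apache.hadoop.mapred.lib.IdentityReducer source, and exact signatures vary between Hadoop versions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of an identity reduce step: pass every (key, value) pair straight
// through to the output without touching it.
public class PassThroughReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, (Writable) values.next());
        }
    }
}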
III. Running the Hadoop program
Hadoop's documentation in this area is not very complete; the most comprehensive references are Getting Started With Hadoop and the Nutch Hadoop Tutorial. It took hitting quite a few snags to finally get everything running; the steps are recorded as follows:
3.1 Local Run mode
No distributed computing is involved at all, and no NameNode or DataNode takes part; this mode is suitable for the first round of debugging the code.
Unzip Hadoop; the conf directory inside it is the configuration directory. Hadoop's configuration file is hadoop-default.xml. If you want to change the configuration, do not modify that file directly; instead redefine the property in hadoop-site.xml, whose values override hadoop-default.xml.
The defaults in hadoop-default.xml already correspond to local mode and need no modification; the only thing in the configuration directory that must be changed is the JAVA_HOME setting in hadoop-env.sh.
Put the compiled HadoopGrep and RegMapper.class into the hadoop/build/classes/demo/hadoop/ directory, find a reasonably large log file and put it into some directory, and then run:
hadoop/bin/hadoop demo.hadoop.HadoopGrep <log file directory> <any output directory> <grep string>
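For example, assuming the logs sit in /tmp/apache-logs and you want all lines containing ERROR written to /tmp/grep-output (the two paths and the search string here are only placeholders):
hadoop/bin/hadoop demo.hadoop.HadoopGrep /tmp/apache-logs /tmp/grep-output ERROR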
Check the results in the output directory, and check the run log under hadoop/logs/.
Delete the output directory before running again.
3.2 Single-machine cluster mode
Now let's set up a single-machine cluster. Assume the setup in 3.1 is complete and this machine is named hadoopserver.
Step 1. Modify hadoop-site.xml and add the following:
<property>
  <name>fs.default.name</name>
  <value>hadoopserver:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hadoopserver:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
From this point on, the file system changes from the local file system to Hadoop's HDFS, and the MapReduce JobTracker changes from a local process into a distributed task system; 9000 and 9001 are just two free port numbers chosen at random.
Also, if your /tmp directory is not large enough, you will probably want to modify the hadoop.tmp.dir property, as shown below.
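For example, to point Hadoop's temporary directory somewhere with more space (the path below is only an illustration), add another property block to hadoop-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>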
Step 2. Set up passwordless SSH login.
Because Hadoop needs to use SSH without a password prompt for scheduling, run ssh-keygen -t rsa in your own home directory (not under su), press Enter through all the prompts to generate the key pair, then enter the .ssh directory and run cp id_rsa.pub authorized_keys.
See man ssh for the details. At this point, executing ssh hadoopserver should log you in without asking for any password.
Step 3. Format the NameNode by executing:
bin/hadoop namenode -format
Step 4. Start Hadoop.
Execute hadoop/bin/start-all.sh, which starts the NameNode, DataNode, JobTracker and TaskTracker on the local machine.
Step 5. Put the log file to be searched into HDFS.
Executing hadoop/bin/hadoop dfs with no arguments lists the file operations it supports.
Execute hadoop/bin/hadoop dfs -put <log file> in, which places the log file into the in directory under /user/<user-name>/ in HDFS.
Step 6. Now run the grep job:
hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>
Check the run log in hadoop/logs/. Before running again, execute hadoop/bin/hadoop dfs -rmr out to delete the out directory.
Step 7. Run hadoop/bin/stop-all.sh to shut everything down.
3.3 Cluster run mode
Assume the second machine is named hadoopserver2.
1. Create the same execution user as on hadoopserver, and extract Hadoop into the same directory.
2. Modify JAVA_HOME in hadoop-env.sh in the same way, and use the same hadoop-site.xml as in 3.2.
3. Copy /home/username/.ssh/authorized_keys from hadoopserver to hadoopserver2, so that hadoopserver can log into hadoopserver2 without a password:
scp /home/username/.ssh/authorized_keys username@hadoopserver2:/home/username/.ssh/authorized_keys
4. Modify the hadoop/conf/slaves file on hadoopserver to add the cluster nodes, changing localhost to:
hadoopserver
hadoopserver2
5. On hadoopserver, execute hadoop/bin/start-all.sh.
This starts the NameNode, DataNode, JobTracker and TaskTracker on hadoopserver, and starts the DataNode and TaskTracker on hadoopserver2.
6. Now run the grep job:
hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>
Before running again, execute hadoop/bin/hadoop dfs -rmr out to delete the out directory.
7. Run hadoop/bin/stop-all.sh to shut everything down.