Getting Started with Hadoop Programming


Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be distributed automatically across a large cluster of ordinary machines. Just as Java programmers need not worry about memory management, MapReduce's run-time system takes care of partitioning the input data, scheduling execution across the cluster, handling machine failures, and managing communication between machines. This model lets programmers harness the resources of a large distributed system without any experience in concurrent or distributed programming.

I. Introduction

What a Hadoop programmer has to do is:

Define a Mapper, which processes the input key/value pairs and emits intermediate results.

Define a Reducer (optional), which reduces the intermediate results and emits the final results.

Define an InputFormat and an OutputFormat (optional). The InputFormat converts the contents of each line of the input files into a Java type for the map function to use; if not defined, each line is passed as a string (Text).

Write a main function that builds a job from the pieces above and submits it; from that point on, everything is handled by the system. (A minimal wiring sketch follows this list.)
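As a rough sketch of how those pieces wire together (using the same old org.apache.hadoop.mapred API as the grep example in section II; MyDriver, MyMapper, MyReducer and the in/out paths are placeholders, not part of the original example):

    JobConf job = new JobConf(MyDriver.class);      // MyDriver: the class containing main() (placeholder)
    job.setJobName("my-job");

    job.setInputPath(new Path("in"));               // input directory
    job.setOutputPath(new Path("out"));             // output directory; must not already exist

    job.setInputFormat(TextInputFormat.class);      // optional; line-oriented text input is the default
    job.setOutputFormat(TextOutputFormat.class);    // optional; text output is the default

    job.setMapperClass(MyMapper.class);             // the Mapper you defined (placeholder)
    job.setReducerClass(MyReducer.class);           // optional; defaults to IdentityReducer

    JobClient.runJob(job);                          // submit the job and wait for it to finish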

Basic concepts: Hadoop's HDFS implements Google's GFS file system; the NameNode, which does the file system's scheduling, runs on the master, and a DataNode runs on every machine. Hadoop likewise implements Google's MapReduce: the JobTracker, the overall scheduler of MapReduce, runs on the master, while a TaskTracker runs on every machine to execute tasks.

The main() function creates a JobConf; defines the Mapper, Reducer, Input/OutputFormat, and the input and output file directories; and finally submits the job to the JobTracker and waits for the job to finish.

The JobTracker creates an InputFormat instance and calls its getSplits() method to split the files in the input directory into FileSplits, which become the input of the Mapper tasks; the generated Mapper tasks are added to a queue. TaskTrackers then ask the JobTracker for the next map/reduce task.

A Mapper task first creates a RecordReader from the InputFormat, loops over the contents of its FileSplit to produce keys and values, and passes them to the map function; when it is done, the intermediate results are written out as a SequenceFile. A Reducer task fetches the intermediate data it needs from the Jetty server on each TaskTracker that ran a mapper (the first 33% of its progress), sorts/merges it (up to 66%), executes the reduce function, and finally writes the results into the output directory according to the OutputFormat.
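To make the reduce side of this flow concrete, here is a minimal sketch in the same raw-typed org.apache.hadoop.mapred style as the grep example in the next section. SumReducer and its LongWritable counts are illustrative only (the classic word-count reduce step) and assume java.util.Iterator and org.apache.hadoop.io.LongWritable are imported alongside the mapred classes:

    public static class SumReducer extends MapReduceBase implements Reducer {
        // After the sort/merge phase, reduce() is called once per key with an
        // iterator over every intermediate value the mappers emitted for that key.
        public void reduce(WritableComparable key, Iterator values,
                           OutputCollector output, Reporter reporter) throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += ((LongWritable) values.next()).get();
            }
            output.collect(key, new LongWritable(sum));
        }
    }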

The TaskTracker reports its status to the JobTracker every 10 seconds, and 10 seconds after a task completes, it asks the JobTracker for the next task.

All data processing for the Nutch project is built on Hadoop; see Scalable Computing with Hadoop for details.

II. Code Written by the Programmer

Let's write a simple distributed grep that just matches the input files line by line and writes a line to the output file if it matches. Because it simply outputs everything that matches, we only write the map function, write no reduce function, and do not define an Input/OutputFormat.

package demo.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HadoopGrep {

    public static class RegMapper extends MapReduceBase implements Mapper {

        private Pattern pattern;

        // Read the regular expression from the job configuration.
        public void configure(JobConf job) {
            pattern = Pattern.compile(job.get("mapred.mapper.regex"));
        }

        // Emit the line unchanged if it matches the pattern.
        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter) throws IOException {
            String text = ((Text) value).toString();
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                output.collect(key, value);
            }
        }
    }

    private HadoopGrep() {
    } // singleton

    public static void main(String[] args) throws Exception {
        JobConf grepJob = new JobConf(HadoopGrep.class);
        grepJob.setJobName("grep-search");
        grepJob.set("mapred.mapper.regex", args[2]);

        grepJob.setInputPath(new Path(args[0]));
        grepJob.setOutputPath(new Path(args[1]));
        grepJob.setMapperClass(RegMapper.class);
        grepJob.setReducerClass(IdentityReducer.class);

        JobClient.runJob(grepJob);
    }
}

The configure() function of the RegMapper class receives the search string passed in by the main function; the map() function does the regular-expression match, with the key being the line's position (byte offset) in the file and the value being the contents of the line, and matching lines are collected into the intermediate results. The main() function takes the input and output directories and the match string from the command-line arguments, sets the Mapper to the RegMapper class and the Reducer to the IdentityReducer class, which does nothing but pass the intermediate results straight through to the final results, and runs the job. The whole program is very simple, without a single detail of distributed programming.

III. Running the Hadoop Program

Hadoop's documentation in this area is not very complete; after cross-referencing the two articles GettingStartedWithHadoop and the Nutch Hadoop Tutorial, and hitting quite a few snags, I finally got everything running. The steps are recorded below:

3.1 Local Run Mode

There is no distributed computing at all in this mode, and no NameNode or DataNode is involved, so it is suitable for debugging the code first. Unpack Hadoop; the conf directory is the configuration directory, and Hadoop's configuration lives in hadoop-default.xml. If you want to change the configuration, do not edit that file directly; instead, put the property with its new value in hadoop-site.xml. The defaults in hadoop-default.xml are already set up for local mode and need no modification; the only thing in the configuration directory that must be changed is the JAVA_HOME setting in hadoop-env.sh.
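For example, the relevant line in conf/hadoop-env.sh looks like the following (the JDK path is only an illustration; point it at your own installation):

    export JAVA_HOME=/usr/java/jdk1.5.0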

Put the compiled HadoopGrep and RegMapper.class into the hadoop/build/classes/demo/hadoop/ directory, find a reasonably large log file and put it into a directory, and then run: hadoop/bin/hadoop demo.hadoop.HadoopGrep <directory of log files> <any output directory> <grep string>

Check the results in the output directory, and check the run logs in hadoop/logs/. Delete the output directory before running again.

3.2 Single-Machine Cluster Mode

Now let's set up a single-machine cluster. Assume the setup in 3.1 is complete and that this machine is named hadoopserver.

1. Modify hadoop-site.xml and add the following:

<property>
  <name>fs.default.name</name>
  <value>hadoopserver:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>hadoopserver:9001</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

With this, the file system switches from the local file system to Hadoop's HDFS, and MapReduce's JobTracker changes from a local in-process runner into a distributed task system; 9000 and 9001 are just two arbitrarily chosen free port numbers. Also, if your /tmp directory is not large enough, you may want to modify the hadoop.tmp.dir property.
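For example (the directory is only an illustration; pick a disk with enough free space), hadoop.tmp.dir is overridden in hadoop-site.xml in the same form as the properties above:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>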

2. Set up passwordless SSH login. Hadoop needs passwordless SSH for its scheduling. Without su'ing to another user, run ssh-keygen -t rsa in your own home directory and press Enter at every prompt to generate the key pair; then enter the .ssh directory and run cp id_rsa.pub authorized_keys. See man ssh for details. After this, ssh hadoopserver should log you in without asking for any password.

3. Format the NameNode: run bin/hadoop namenode -format.

4. Start Hadoop: run hadoop/bin/start-all.sh, which starts the NameNode, DataNode, JobTracker, and TaskTracker on this machine.

5. Now put the log files to be searched into HDFS. Running hadoop/bin/hadoop dfs shows the file operations it supports. Run hadoop/bin/hadoop dfs -put <log directory> in, and the log file directory will be stored in HDFS under /user/<user-name>/in.

6. Now run the grep: hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>. Check the run logs in hadoop/logs/. Before running again, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

7. Run hadoop/bin/stop-all.sh to stop.

3.3 Cluster Mode

Assume the second machine is named hadoopserver2 and that the configuration from 3.2 is complete.

Create the same execution user on hadoopserver2 as on hadoopserver, and extract Hadoop into the same directory.

Set JAVA_HOME in hadoop-env.sh in the same way, and make the same hadoop-site.xml modifications as in 3.2.

Copy /home/username/.ssh/authorized_keys from hadoopserver to hadoopserver2 so that hadoopserver can log in to hadoopserver2 without a password:

scp /home/username/.ssh/authorized_keys username@hadoopserver2:/home/username/.ssh/authorized_keys

Modify the hadoop/conf/slaves file on hadoopserver to add the cluster's nodes, i.e. change localhost to hadoopserver and hadoopserver2 (one per line, as shown below).
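The resulting conf/slaves file simply lists one worker hostname per line:

hadoopserver
hadoopserver2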

Running hadoop/bin/start-all.sh on hadoopserver will start the NameNode, DataNode, JobTracker, and TaskTracker on hadoopserver, and a DataNode and TaskTracker on hadoopserver2.

Now run the grep: hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>

Before running again, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

Run hadoop/bin/stop-all.sh to stop.


IV. Efficiency

Testing shows that Hadoop is not a panacea: the outcome depends on the size and number of files, the complexity of the processing, the number of machines in the cluster, and the bandwidth connecting them. When none of these is large, Hadoop's advantage is not obvious. For example, a simple grep written in plain Java, without Hadoop, takes only 4 seconds to process a 100 MB log file; the same job takes 14 seconds in Hadoop's local mode, 30 seconds on a single-machine cluster, and is even slower on a two-machine cluster over a 10M network port, embarrassingly slow.
