Debugging a MapReduce program in Hadoop standalone mode under Eclipse


Hadoop does not use HDFS in standalone mode, nor does it start any Hadoop daemons; all programs run in a single JVM, and at most one reducer is allowed.

Create a new Java project named Hadoop-test in Eclipse (note that Hadoop requires JDK 1.6 or later).

Download hadoop-1.2.1.tar.gz from a Hadoop mirror: http://apache.fayea.com/apache-mirror/hadoop/common/

Unzip hadoop-1.2.1.tar.gz to obtain the hadoop-1.2.1 directory.

Add the jar packages under the hadoop-1.2.1 directory and the hadoop-1.2.1\lib directory to the Hadoop-test project's build path.

Next, write a MapReduce program (this one totals account balances by month).

Map:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapBus extends MapReduceBase
		implements Mapper<LongWritable, Text, Text, LongWritable> {
	@Override
	public void map(LongWritable key, Text date,
			OutputCollector<Text, LongWritable> output,
			Reporter reporter) throws IOException {
		// input line format: 2013-01-11,-200
		String line = date.toString();
		if (line.contains(",")) {
			String[] tmp = line.split(",");
			String month = tmp[0].substring(5, 7); // month portion of yyyy-MM-dd
			int money = Integer.valueOf(tmp[1]).intValue();
			output.collect(new Text(month), new LongWritable(money));
		}
	}
}
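Because the mapper body is plain string handling, it can be sanity-checked without Hadoop on the classpath. The following is a minimal sketch of the same parsing logic; `LineParser` is a hypothetical helper for illustration, not part of the job:

```java
public class LineParser {
    // Mirrors the mapper: turns "2013-01-11,-200" into the (month, amount)
    // pair it would emit, rendered here as "01\t-200". Returns null for
    // lines without a comma, which the mapper silently skips.
    public static String parse(String line) {
        if (!line.contains(",")) {
            return null;
        }
        String[] tmp = line.split(",");
        String month = tmp[0].substring(5, 7); // characters 5..6 of "yyyy-MM-dd"
        long money = Long.parseLong(tmp[1]);
        return month + "\t" + money;
    }

    public static void main(String[] args) {
        System.out.println(parse("2013-01-11,-200")); // prints: 01	-200
    }
}
```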

Reduce:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReduceBus extends MapReduceBase
		implements Reducer<Text, LongWritable, Text, LongWritable> {
	@Override
	public void reduce(Text month, Iterator<LongWritable> money,
			OutputCollector<Text, LongWritable> output, Reporter reporter)
			throws IOException {
		long totalMoney = 0;
		while (money.hasNext()) {
			totalMoney += money.next().get();
		}
		output.collect(month, new LongWritable(totalMoney));
	}
}
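The reducer is just an accumulation over the iterator of values that the framework grouped under one key. A standalone sketch of that accumulation (hypothetical `SumDemo` class, plain `Long` in place of `LongWritable`):

```java
import java.util.Arrays;
import java.util.Iterator;

public class SumDemo {
    // Same accumulation the reducer performs over its Iterator of values.
    public static long sum(Iterator<Long> money) {
        long total = 0;
        while (money.hasNext()) {
            total += money.next();
        }
        return total;
    }

    public static void main(String[] args) {
        // e.g. all amounts the framework grouped under one month key
        System.out.println(sum(Arrays.asList(100L, -100L, 500L).iterator())); // prints: 500
    }
}
```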

Main:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Wallet {
	public static void main(String[] args) {
		if (args.length != 2) {
			System.err.println("param error!");
			System.exit(-1);
		}

		JobConf jobConf = new JobConf(Wallet.class);
		jobConf.setJobName("My Wallet");

		FileInputFormat.addInputPath(jobConf, new Path(args[0]));
		FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

		jobConf.setMapperClass(MapBus.class);
		jobConf.setReducerClass(ReduceBus.class);
		jobConf.setOutputKeyClass(Text.class);
		jobConf.setOutputValueClass(LongWritable.class);

		try {
			JobClient.runJob(jobConf);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
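Since standalone mode runs everything in one JVM anyway, the whole job can be mimicked in memory: the map phase parses each line, and a map keyed by month plays the role of the shuffle plus reduce. A plain-Java sketch (hypothetical `LocalWallet` class, no Hadoop on the classpath):

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class LocalWallet {
    // Mimics map -> group-by-key -> reduce for the wallet job, entirely in memory.
    public static TreeMap<String, Long> run(List<String> lines) {
        TreeMap<String, Long> totals = new TreeMap<>(); // stands in for shuffle + reduce
        for (String line : lines) {
            if (!line.contains(",")) continue;      // map phase: parse and emit
            String[] tmp = line.split(",");
            String month = tmp[0].substring(5, 7);
            long money = Long.parseLong(tmp[1]);
            totals.merge(month, money, Long::sum);  // reduce phase: sum per month
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2013-01-01,100", "2013-01-02,-100", "2013-02-01,100");
        System.out.println(run(lines)); // prints: {01=0, 02=100}
    }
}
```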

You also need to prepare the files to be analyzed. Create two files under the E:\cygwin_root\home\input path: one named 2013-01.txt and another named 2013-02.txt.

2013-01.txt:

2013-01-01,100
2013-01-02,-100
2013-01-07,100
2013-01-10,-100
2013-01-11,100
2013-01-21,-
2013-01-22,100
2013-01-25,-100
2013-01-27,100
2013-01-18,-100
2013-01-09,500

2013-02.txt:

2013-02-01,100

Set the program arguments (input and output paths) in the run configuration, then run the MapReduce program as a Java application. You may first hit the following error:

java.io.IOException: Failed to set permissions of path:
\tmp\hadoop-linkage\mapred\staging\linkage1150562408\.staging to 0700

The main reason for this error is that later versions of Hadoop check the permissions of the local staging path, which fails on Windows. The easiest workaround is to replace hadoop-core-1.2.1.jar with hadoop-0.20.2-core.jar.

The following is the log that the MapReduce program prints when it runs:

14/02/11 10:54:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/02/11 10:54:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/02/11 10:54:16 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/02/11 10:54:16 INFO mapred.FileInputFormat: Total input paths to process : 2
14/02/11 10:54:17 INFO mapred.JobClient: Running job: job_local_0001
14/02/11 10:54:17 INFO mapred.FileInputFormat: Total input paths to process : 2
14/02/11 10:54:17 INFO mapred.MapTask: numReduceTasks: 1
14/02/11 10:54:17 INFO mapred.MapTask: io.sort.mb =
14/02/11 10:54:17 INFO mapred.MapTask: data buffer = 79691776/99614720
14/02/11 10:54:17 INFO mapred.MapTask: record buffer = 262144/327680
14/02/11 10:54:17 INFO mapred.MapTask: Starting flush of map output
14/02/11 10:54:18 INFO mapred.MapTask: Finished spill 0
14/02/11 10:54:18 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/02/11 10:54:18 INFO mapred.LocalJobRunner: file:/E:/cygwin_root/home/input/2013-01.txt:0+179
14/02/11 10:54:18 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
14/02/11 10:54:18 INFO mapred.MapTask: numReduceTasks: 1
14/02/11 10:54:18 INFO mapred.MapTask: io.sort.mb =
14/02/11 10:54:18 INFO mapred.MapTask: data buffer = 79691776/99614720
14/02/11 10:54:18 INFO mapred.MapTask: record buffer = 262144/327680
14/02/11 10:54:18 INFO mapred.MapTask: Starting flush of map output
14/02/11 10:54:18 INFO mapred.MapTask: Finished spill 0
14/02/11 10:54:18 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
14/02/11 10:54:18 INFO mapred.LocalJobRunner: file:/E:/cygwin_root/home/input/2013-02.txt:0+16
14/02/11 10:54:18 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
14/02/11 10:54:18 INFO mapred.LocalJobRunner:
14/02/11 10:54:18 INFO mapred.Merger: Merging 2 sorted segments
14/02/11 10:54:18 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 160 bytes
14/02/11 10:54:18 INFO mapred.LocalJobRunner:
14/02/11 10:54:18 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
14/02/11 10:54:18 INFO mapred.LocalJobRunner:
14/02/11 10:54:18 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
14/02/11 10:54:18 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/E:/cygwin_root/home/output
14/02/11 10:54:18 INFO mapred.LocalJobRunner: reduce > reduce
14/02/11 10:54:18 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
14/02/11 10:54:18 INFO mapred.JobClient:  map 100% reduce 100%
14/02/11 10:54:18 INFO mapred.JobClient: Job complete: job_local_0001
14/02/11 10:54:18 INFO mapred.JobClient: Counters: 13
14/02/11 10:54:18 INFO mapred.JobClient:   FileSystemCounters
14/02/11 10:54:18 INFO mapred.JobClient:     FILE_BYTES_READ=39797
14/02/11 10:54:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=80473
14/02/11 10:54:18 INFO mapred.JobClient:   Map-Reduce Framework
14/02/11 10:54:18 INFO mapred.JobClient:     Reduce input groups=2
14/02/11 10:54:18 INFO mapred.JobClient:     Combine output records=0
14/02/11 10:54:18 INFO mapred.JobClient:     Map input records=12
14/02/11 10:54:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/02/11 10:54:18 INFO mapred.JobClient:     Reduce output records=2
14/02/11 10:54:18 INFO mapred.JobClient:     Spilled Records=24
14/02/11 10:54:18 INFO mapred.JobClient:     Map output bytes=132
14/02/11 10:54:18 INFO mapred.JobClient:     Map input bytes=195
14/02/11 10:54:18 INFO mapred.JobClient:     Combine input records=0
14/02/11 10:54:18 INFO mapred.JobClient:     Map output records=12
14/02/11 10:54:18 INFO mapred.JobClient:     Reduce input records=12

After the run completes, two files are generated under the E:\cygwin_root\home\output path: .part-00000.crc and part-00000. The .part-00000.crc file is a binary file internal to Hadoop that holds the checksum of the part-00000 file; the final statistics are saved in the part-00000 file.

	100

Note that the output path must be deleted before each run; otherwise the following exception is thrown:

org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory file:/E:/cygwin_root/home/output already exists

Hadoop performs this check so that, if a previous MapReduce run did not complete, its output files are not silently overwritten by the files generated when the program is run again.
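Instead of deleting the output directory by hand before each run, you can delete it programmatically. A sketch using only java.io.File, which is enough for the local file system that standalone mode writes to (`OutputCleaner` and the path are illustrative; use with care, as this removes previous results without confirmation):

```java
import java.io.File;

public class OutputCleaner {
    // Recursively deletes a directory so the job can recreate it.
    public static void deleteRecursively(File dir) {
        File[] children = dir.listFiles(); // null if dir is a file or missing
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        dir.delete();
    }

    public static void main(String[] args) {
        // example path from this article
        deleteRecursively(new File("E:/cygwin_root/home/output"));
    }
}
```

In a real job you would more commonly call Hadoop's own `FileSystem.delete(outputPath, true)` on the configured file system before `JobClient.runJob`, which also works once the job moves to HDFS.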
