Big Data Learning -- MapReduce Configuration and Implementing the WordCount Algorithm in Java


Configuring MapReduce requires two additional XML files on top of the previous configuration: yarn-site.xml and mapred-site.xml, both of which can be found under the etc/hadoop directory of the previously configured Hadoop installation.

The configuration process is as follows.

1. Configure yarn-site.xml

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.98.141</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

It is important to explain that YARN's basic idea is to separate the two main functions of the JobTracker (resource management and job scheduling/monitoring). The primary approach is to create a global ResourceManager (RM) and a per-application ApplicationMaster (AM), where "application" means either a traditional MapReduce job or a DAG of jobs. YARN can be loosely understood as something like Tomcat: just as a web project runs on the Tomcat platform, MapReduce jobs run on YARN. The core of YARN's layered structure is the ResourceManager. This entity controls the entire cluster and manages the allocation of applications to the underlying computing resources; the ResourceManager assigns these resources to the NodeManagers (YARN's per-node agents).

The value of the first property (yarn.resourcemanager.hostname) should be the hostname or IP address that matches your system configuration.

2. Configure mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

So the configuration is complete.

Start the virtual machine, start the YARN service, and run jps to check whether the two processes ResourceManager and NodeManager are present. If they are, the configuration is successful.
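A minimal command sketch of this check, assuming the Hadoop sbin directory is on the PATH (adjust paths to your own installation):

start-yarn.sh    # start the YARN daemons (ResourceManager and NodeManager)
jps              # list running Java processes; ResourceManager and NodeManager should appear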

Running the WordCount algorithm on the virtual machine

The WordCount example lives in the hadoop --> share --> hadoop --> mapreduce directory; execute hadoop-mapreduce-examples-2.7.3.jar from there.

It should be noted here that the first path after wordcount is the directory containing the file whose words are to be counted, and the second is the output directory; the output directory must not already exist, otherwise an error will be reported.
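A minimal sketch of the invocation, assuming the text to be counted has already been uploaded to /input on HDFS and /output does not yet exist (both paths are placeholders):

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output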

An introduction to the MapReduce workflow; roughly, it can be divided into the following steps:

1. Code Writing

2. Job Configuration

3. Submit the Job

4. Initialize the operation

5. Assign Tasks

6. Perform Tasks

7. Update Tasks and Status

MapReduce processes data in the form of key-value pairs:

1. The MapReduce framework reads the contents of the file in the map stage and parses each line into a key-value pair <key, value>. The map function is called once for each key-value pair; you write your own logic there to transform the input key and value into a new key-value output, and the intermediate key-value pairs that are emitted are passed on to reduce;

2. Before reduce runs, a shuffle process merges and sorts the output of the multiple map tasks;

3. Write your own logic in the reduce function to process the input key and values and convert them into a new key-value output;

4. Save the output of reduce to a file (a small worked example of this key-value flow follows below).
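For example, for an input line "hello world hello", the flow looks roughly like this:

map output:    <hello, 1>, <world, 1>, <hello, 1>
after shuffle: <hello, [1, 1]>, <world, [1]>
reduce output: <hello, 2>, <world, 1>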

The above is my understanding of the MapReduce workflow after finishing my study. Below, the WordCount algorithm is implemented in Java code.

Start by creating a Maven project and add the following dependencies to pom.xml:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.8</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>

Set up the Map class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN (byte offset of the line), VALUEIN (the line of text read from the file),
 * KEYOUT (output key type), VALUEOUT (output value type)
 */
public class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String line = value.toString();      // get the file content line by line
        String[] words = line.split(" ");    // split each line on spaces
        for (String word : words) {
            // emit <word, 1>; the map output is spilled into an in-memory ring buffer
            context.write(new Text(word.trim()), new IntWritable(1));
        }
    }
}

Create a reduce class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * The key type matches the map output key type, and the Iterable holds the values
 * emitted by the map stage for that key; the iterator lets us process each value once.
 */
public class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // sum up the counts for this key
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Create Job Class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        MyJob myJob = new MyJob();
        ToolRunner.run(myJob, args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();                // create the configuration object
        conf.set("fs.defaultFS", "hdfs://192.168.80.142:9000");

        // assign the task
        Job job = Job.getInstance(conf);
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // set the file input/output paths
        FileInputFormat.addInputPath(job, new Path("/hadoop/hadoop.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/hadoop/out"));

        job.waitForCompletion(true);
        return 0;
    }
}
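One way to run the job (a sketch only; the jar name wordcount.jar is just an example, and MyJob is assumed to be in the default package) is to package the project with Maven and submit it with the hadoop jar command:

mvn package
hadoop jar wordcount.jar MyJob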
