Hadoop WordCount example code

Source: Internet
Author: User

A simple example illustrates what MapReduce is:

Suppose we need to count how many times each word appears in a very large file. Because the file is too large for one person to handle, we split it into smaller files and assign several people to count the words in each one. This step is "Map". We then combine each person's counts into a single result. This step is "Reduce".

To do the same with MapReduce, we create a job that splits the file into independent data blocks and distributes them across different machine nodes. Map tasks on those nodes then process the blocks fully in parallel. The MapReduce framework collects the Map output and sends it to Reduce tasks for further processing.
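
The split, count, and combine idea described above can be sketched in plain Java without any Hadoop classes. This is only an illustration: the class name SplitCountMergeSketch and its methods are hypothetical, and a parallel stream stands in for Map tasks running on different nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of Hadoop.
class SplitCountMergeSketch {

    // "Map" step: one worker counts the words in its own chunk only.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(chunk);
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // "Reduce" step: combine the per-chunk statistics into one result.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> wordCount(List<String> chunks) {
        // The parallel stream stands in for Map tasks on separate nodes.
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(SplitCountMergeSketch::countChunk)
                .collect(Collectors.toList());
        return mergeCounts(partials);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("hello world", "hello hadoop", "world of hadoop");
        System.out.println(wordCount(chunks)); // {hadoop=2, hello=2, of=1, world=2}
    }
}
```

Each chunk is counted independently, so the order in which the chunks finish does not affect the merged result.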

During execution, a process named JobTracker coordinates all the tasks of the MapReduce job. Several TaskTracker processes run the individual Map tasks and periodically report their progress to the JobTracker. If a TaskTracker reports that a task failed, or does not report for a long time, the JobTracker assigns the task to another TaskTracker for re-execution. (JobTracker and TaskTracker are the Hadoop 1.x daemons; in Hadoop 2.x, which the code below targets, YARN's ResourceManager, NodeManagers, and a per-job ApplicationMaster take over these roles.)

The implementation is as follows.

1. Write the WordCount job

(1) Create a Maven project in Eclipse with the following jar dependencies (you can also refer to the pom configuration of the hadoop-mapreduce-examples project in the Hadoop source package).

Note: configure the maven-jar-plugin Maven plug-in and specify mainClass.

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.5.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.5.2</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.xxx.demo.hadoop.wordcount.WordCount</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>

(2) Given how MapReduce runs, a job needs at least three pieces of code, one for each of three tasks: the Map logic, the Reduce logic, and job scheduling.

The Map code can extend the org.apache.hadoop.mapreduce.Mapper class.

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Because the key parameter is not used in this example, its type is simply declared as Object.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
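
To see what the Map code emits for a single line of input, the tokenizing loop can be reproduced without Hadoop. The sketch below is an assumption-laden stand-in: MapperEmitSketch and mapLine are hypothetical names, and a list of (word, 1) entries takes the place of context.write.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the Map logic; it does not use the Hadoop API.
class MapperEmitSketch {

    // Mirrors the loop in the Map code: one (token, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [Hello=1, hello,=1, world=1]
        System.out.println(mapLine("Hello hello, world"));
    }
}
```

StringTokenizer splits on whitespace only by default, so "Hello" and "hello," are counted as different words. A job that should ignore case or punctuation would need to normalize tokens before emitting them.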

The Reduce code can extend the org.apache.hadoop.mapreduce.Reducer class.

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
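
Between Map and Reduce, the framework groups all values emitted for the same key, which is why reduce receives one key together with an Iterable of its values. A minimal plain-Java sketch of that grouping followed by the summing step is shown below; ShuffleReduceSketch, groupByKey, and reduce are hypothetical names, and a sorted map stands in for the framework's shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of grouping by key and summing; not the Hadoop API.
class ShuffleReduceSketch {

    // Stand-in for the shuffle: gather all values emitted for the same key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Same arithmetic as the Reduce code: add up the values for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("hello", 1));
        pairs.add(new SimpleEntry<>("world", 1));
        pairs.add(new SimpleEntry<>("hello", 1));
        // One output line per key: the key, a tab, and the sum of its values.
        groupByKey(pairs).forEach((word, ones) ->
                System.out.println(word + "\t" + reduce(ones)));
    }
}
```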

Write the main method for job scheduling.

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
  // System.exit(job.waitForCompletion(true) ? 0 : 1);
}
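
The main method registers IntSumReducer as both the combiner and the reducer. This works because addition is associative and commutative, so pre-summing each Map task's output locally does not change the final totals; it only reduces the number of records sent to Reduce. The plain-Java sketch below illustrates this under stated assumptions: CombinerSketch and its methods are hypothetical names, and each inner list stands in for the output of one Map task.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of why a summing reducer is safe as a combiner.
class CombinerSketch {

    // The summing logic shared by the combine step and the reduce step.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Builds the (word, 1) pairs one Map task would emit.
    static List<Map.Entry<String, Integer>> ones(String... words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) {
            out.add(new SimpleEntry<>(w, 1));
        }
        return out;
    }

    // All raw pairs go straight to the reduce step.
    static Map<String, Integer> reduceWithoutCombiner(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> out : mapOutputs) {
            shuffled.addAll(out);
        }
        return sumByKey(shuffled);
    }

    // Each Map task's output is summed locally first, then reduced.
    static Map<String, Integer> reduceWithCombiner(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> out : mapOutputs) {
            shuffled.addAll(sumByKey(out).entrySet());
        }
        return sumByKey(shuffled);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> mapOutputs =
                Arrays.asList(ones("a", "b", "a"), ones("b", "a"));
        System.out.println(reduceWithoutCombiner(mapOutputs)); // {a=3, b=2}
        System.out.println(reduceWithCombiner(mapOutputs));    // {a=3, b=2}
    }
}
```

Hadoop may run a combiner zero, one, or several times for a given Map output, so a combiner must never change the final result. A reducer that computed an average, for example, could not be reused as a combiner in this way.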

2. Upload the data files to the Hadoop cluster environment

Run mvn install to package the project into a jar file and upload it to the Linux cluster environment. Use the hdfs dfs -mkdir command to create the corresponding directory in the HDFS file system, then use hdfs dfs -put to upload the data files to be processed to HDFS. For example: hdfs dfs -put ${linux_path}/<data file> ${hdfs_path}

3. Run the job

Run the following command in the cluster environment: hadoop jar ${linux_path}/wordcount.jar ${hdfs_input_path} ${hdfs_output_path}

4. View the statistical results

hdfs dfs -cat ${hdfs_output_path}/<output file name>
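
With the default text output format, each line of the result file holds a word, a tab character, and its count, and a single-reducer job normally names the file part-r-00000. Assuming that layout, the sketch below parses such content into a map. OutputLineSketch and parse are hypothetical names, and the sample text is made up for illustration rather than taken from a real run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for reading WordCount output; assumes "<word>\t<count>" lines.
class OutputLineSketch {

    // Parses the file contents into a map that preserves the file's line order.
    static Map<String, Integer> parse(String fileContents) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : fileContents.split("\n")) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample content, not real job output.
        String sample = "hadoop\t2\nhello\t2\nworld\t1\n";
        System.out.println(parse(sample)); // {hadoop=2, hello=2, world=1}
    }
}
```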

If the Hadoop cluster environment has not been started, the job above runs in local mode, and HDFS and YARN are not used. The following describes what needs to be done to run a MapReduce job in pseudo-distributed mode. The steps are excerpted from the official website:

Configure the host name

# vi /etc/sysconfig/network

For example:

NETWORKING=yes
HOSTNAME=master

# vi /etc/hosts

Add an entry for localhost.

Configure passwordless SSH login

# ssh-keygen -t rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Configure the core-site.xml file (in ${HADOOP_HOME}/etc/hadoop/)

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Configure the hdfs-site.xml file

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The following commands run a MapReduce job in single-node pseudo-distributed mode:

1. Format the filesystem:
$ bin/hdfs namenode -format
2. Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

3. Browse the web interface for the NameNode; by default it is available at:
NameNode - http://localhost:50070/
4. Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
5. Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put etc/hadoop input
6. Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
7. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hdfs dfs -get output
$ cat output/*

Or view the output files on the distributed filesystem:

$ bin/hdfs dfs -cat output/*
8. When you're done, stop the daemons:
$ sbin/stop-dfs.sh


That concludes this walkthrough of the Hadoop WordCount example code. I hope it is helpful to you.
