Cloud Computing (i): Data Processing Using Hadoop MapReduce

Source: Internet
Author: User
Tags: hadoop, mapreduce, hadoop fs

Using Hadoop MapReduce for data processing

1. Overview

Use HDP (download: http://zh.hortonworks.com/products/releases/hdp-2-3/#install) to build the environment for distributed data processing.

Download the project file and extract it to get the project folder. The program reads the four text files in the cloudmr/internal_use/tmp/dataset/titles directory; each line of text in these files is a title from Wikipedia. For each title, the program splits it into individual words using the special symbols specified in cloudmr/internal_use/tmp/dataset/misc/delimiters.txt, converts every word to lowercase, deletes all words that appear in cloudmr/internal_use/tmp/dataset/misc/stopwords.txt, and finally counts and outputs the number of occurrences of the remaining words.
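As an illustration with hypothetical sample data: if a titles file contains the two lines A_Clockwork_Orange and The_Orange_Tree, delimiters.txt specifies _ as a delimiter, and stopwords.txt contains a and the, the program would output:

clockwork 1
orange 2
tree 1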

Before compiling (Step (3) below), the HADOOP_CLASSPATH environment variable must be set so the compiler can find the Hadoop libraries:

export HADOOP_CLASSPATH="/usr/hdp/2.3.2.0-2950/hadoop/conf:/usr/hdp/2.3.2.0-2950/hadoop/conf:/usr/hdp/2.3.2.0-2950/hadoop/conf:/usr/hdp/2.3.2.0-2950/hadoop/lib/*:/usr/hdp/2.3.2.0-2950/hadoop/.//*:/usr/hdp/2.3.2.0-2950/hadoop-hdfs/./:/usr/hdp/2.3.2.0-2950/hadoop-hdfs/lib/*:/usr/hdp/2.3.2.0-2950/hadoop-hdfs/.//*:/usr/hdp/2.3.2.0-2950/hadoop-yarn/lib/*:/usr/hdp/2.3.2.0-2950/hadoop-yarn/.//*:/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/lib/*:/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java-5.1.31-bin.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/2.3.2.0-2950/tez/*:/usr/hdp/2.3.2.0-2950/tez/lib/*:/usr/hdp/2.3.2.0-2950/tez/conf:/usr/hdp/current/hadoop-yarn-client/.//*:/usr/hdp/current/hadoop-yarn-client/lib/*"
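If the version paths on your image differ, you do not have to type this value by hand; the hadoop client can generate it. A minimal alternative, assuming hadoop is on your PATH:

export HADOOP_CLASSPATH="$(hadoop classpath)"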

2. Operation Process

Step (1): Copy the project folder into the HDP virtual machine, enter the cloudmr folder, and run the following command to start:

./start.sh

When prompted, enter an account number (10 random digits will do). Then run the following command to check whether Hadoop is working correctly:

hadoop version

Step (2): Write the TitleCount.java file and complete the corresponding functions. The completed TitleCount.java is as follows:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.*;
import java.util.*;

/**
 * Classic "Word Count": split each title on the configured delimiters,
 * lowercase the tokens, drop stopwords, and count what remains.
 */
public class TitleCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TitleCount(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(this.getConf(), "Title Count");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setMapperClass(TitleCountMap.class);
        job.setReducerClass(TitleCountReduce.class);

        // args[0] is the input directory, args[1] the output directory.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(TitleCount.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Reads an entire file from HDFS into a single newline-separated string.
    public static String readHDFSFile(String path, Configuration conf) throws IOException {
        Path pt = new Path(path);
        FileSystem fs = FileSystem.get(pt.toUri(), conf);
        FSDataInputStream file = fs.open(pt);
        BufferedReader buffIn = new BufferedReader(new InputStreamReader(file));

        StringBuilder everything = new StringBuilder();
        String line;
        while ((line = buffIn.readLine()) != null) {
            everything.append(line);
            everything.append("\n");
        }
        return everything.toString();
    }

    public static class TitleCountMap extends Mapper<Object, Text, Text, IntWritable> {
        Set<String> stopWords = new HashSet<String>();
        String delimiters;

        // Load the delimiter characters and the stopword list once per mapper.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();

            String delimitersPath = conf.get("delimiters");
            delimiters = readHDFSFile(delimitersPath, conf);

            String stopWordsPath = conf.get("stopwords");
            List<String> stopWordsList = Arrays.asList(readHDFSFile(stopWordsPath, conf).split("\n"));
            for (String e : stopWordsList) {
                stopWords.add(e);
            }
        }

        // Tokenize each title, lowercase the tokens, and emit (word, 1)
        // for every token that is not a stopword.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer stk = new StringTokenizer(value.toString(), delimiters);
            while (stk.hasMoreTokens()) {
                String e = stk.nextToken().trim().toLowerCase();
                if (!stopWords.contains(e)) {
                    context.write(new Text(e), new IntWritable(1));
                }
            }
        }
    }

    public static class TitleCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts for each word.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable e : values) {
                sum += e.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
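Note how the mapper's setup() obtains the paths of the delimiter and stopword files with conf.get("delimiters") and conf.get("stopwords"). These values are not hard-coded: they are passed on the command line with -D options when the job is submitted (see Step (5)), and ToolRunner's option parsing places them into the job Configuration.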

Step (3): Compile the Java source file. For convenience, create a new output folder in the cloudmr folder to hold the .class files generated by the compilation. Use the following commands (executed in the cloudmr folder):

mkdir output
javac -classpath $HADOOP_CLASSPATH -d output TitleCount.java

Entering the output folder, you will see three .class files: TitleCount.class plus the two nested classes TitleCount$TitleCountMap.class and TitleCount$TitleCountReduce.class.

Step (4): Package the compiled .class files.

First create a new text file named MANIFEST.MF in the cloudmr folder, using the following command (executed in the cloudmr folder):

touch MANIFEST.MF

Edit its content to be:

Main-Class: TitleCount

MANIFEST.MF holds metadata about the package; here it specifies the main class. (Note that the manifest format requires a space after the colon and a newline at the end of the file, and that the value is the class name, not a .class file name.)

Then use the following command to package (executed in the cloudmr folder):

jar cvfm TitleCount.jar MANIFEST.MF -C output/ .

The meaning of this command is:

jar — the packaging command

cvfm — create a new archive (c), with verbose output (v), writing to the named archive file (f), and including the named manifest file (m)

TitleCount.jar — the name of the package to create; it is placed in the current folder

MANIFEST.MF — the manifest file to include in the package

-C output/ — change to the output folder before adding files

. — add all files in that folder to the package
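A quick way to check the result (the jar tool is part of the JDK): list the archive contents and confirm that the three .class files and META-INF/MANIFEST.MF are present:

jar tf TitleCount.jar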

Note: The packaging process is important and error-prone, so be sure to follow the steps described above.

Step (5): Submit TitleCount.jar to YARN.

Before submitting the job to YARN, you need to complete some preparation work.

Upload the relevant files to HDFS: the four text files in the cloudmr/internal_use/tmp/dataset/titles directory, cloudmr/internal_use/tmp/dataset/misc/delimiters.txt, and cloudmr/internal_use/tmp/dataset/misc/stopwords.txt.

Create a new data folder under /user/root/ in HDFS and put delimiters.txt and stopwords.txt into it; then create a new titles folder inside the data folder and put the four text files from the cloudmr/internal_use/tmp/dataset/titles directory into the titles folder.

The relevant commands are described below:

hadoop fs -ls — lists an HDFS directory; with no path argument, the current user's home directory is listed

hadoop fs -ls / — lists the HDFS root directory

hadoop fs -mkdir data — creates a new data directory in the home directory

hadoop fs -mkdir data/titles — creates a new titles directory inside the data directory

hadoop fs -copyFromLocal ./abc.txt data — uploads abc.txt from the current (local) directory to the data directory on HDFS
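Putting these together, the preparation for this job might look like the following (a sketch, assuming you run it from the cloudmr folder and that your HDFS home directory is /user/root):

hadoop fs -mkdir data
hadoop fs -mkdir data/titles
hadoop fs -copyFromLocal internal_use/tmp/dataset/misc/delimiters.txt data
hadoop fs -copyFromLocal internal_use/tmp/dataset/misc/stopwords.txt data
hadoop fs -copyFromLocal internal_use/tmp/dataset/titles/* data/titles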

You can then submit the job with the following command:

yarn jar TitleCount.jar TitleCount -D delimiters="/user/root/data/delimiters.txt" -D stopwords="/user/root/data/stopwords.txt" data/titles output

The meaning of this command is:

yarn — submits and runs the job on YARN

jar — the content to run is a jar package

TitleCount.jar — the jar package to run

TitleCount — the entry (main) class inside TitleCount.jar

-D delimiters="/user/root/data/delimiters.txt" -D stopwords="/user/root/data/stopwords.txt" — each -D defines a configuration parameter; the two parameters defined here are the ones the mapper reads in its setup() method

data/titles — the input folder; its files are the input to the map phase

output — the folder where the output files are stored (it must not already exist)
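While the job is running, you can check its progress from another terminal with the YARN CLI, for example:

yarn application -list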

After the yarn command finishes executing, you can view the results of the run.
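For example, assuming the default output file naming of a single-reducer job, you can list and print the results with:

hadoop fs -ls output
hadoop fs -cat output/part-r-00000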
