Learn a few basics before going through the API operations.
First, Hadoop's basic data types are different from Java's basic data types, but there is a direct correspondence between them.
For example, LongWritable corresponds to long, IntWritable to int, and Text to String.
If you need to define your own data type, you must implement the Writable interface; a minimal sketch of such a type follows.
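Below is a minimal, hypothetical sketch of a custom type implementing Writable. The PointWritable name and its fields are made up for illustration only; a type used as a map output key would additionally need to implement WritableComparable so it can be sorted.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical custom data type holding an x and y coordinate
public class PointWritable implements Writable {
    private int x;
    private int y;

    // Hadoop creates instances by reflection, so a no-argument constructor is required
    public PointWritable() {
    }

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Serialize the fields in a fixed order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    // Deserialize the fields in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}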
The Java value inside a Hadoop data type can be obtained with its get() method (Text uses toString()).
A Java value can be converted to the Hadoop type through the Hadoop class's constructor or its set() method, as in the small sketch below.
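A minimal sketch of these conversions; the class and variable names are just for demonstration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class TypeConversionDemo {
    public static void main(String[] args) {
        // Java -> Hadoop: through the constructor or the set() method
        LongWritable count = new LongWritable(1L);
        IntWritable year = new IntWritable();
        year.set(2014);
        Text word = new Text("hello");

        // Hadoop -> Java: through get() (Text uses toString())
        long c = count.get();
        int y = year.get();
        String w = word.toString();

        System.out.println(c + " " + y + " " + w);
    }
}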
Second, the work Hadoop does for a submitted job is divided into eight steps (jokingly, the "Tianlong eight steps").
As follows:
Map-side work:
1.1 Read the input file. This step formats the contents of the file into key-value pairs, where the key is the offset of each line from the start of the file and the value is the content of that line.
1.2 Call map on each pair. This step uses the custom Mapper class to implement your own logic; the input is the key-value pairs formatted in 1.1, and the output is also key-value pairs.
1.3 Partition the map output. After map processing, the key-value pairs can be partitioned according to your business requirements, for example to save different kinds of results in different files. However many partitions are set up, that is how many reducers will handle the contents of the corresponding partitions (a sketch of a custom partitioner follows this list).
1.4 After partitioning, the data within each partition is sorted and grouped. Sorting is from small to large; after sorting, the values of pairs with the same key are grouped together. For example, the map output may contain the key-value pairs
hello 1
hello 1
Both keys are hello, so after grouping they become
hello {1, 1}
(the summing to hello 2 happens later, in the combine or reduce step). Both sorting and grouping can be customized according to your business requirements.
1.5 Combine (the combiner). This is essentially a reduce performed on the map side, but unlike the real reduce step, a combiner can only process local data, not data gathered across the network. Combining on the map side shrinks the output before it is transmitted over the network, which reduces the pressure on network transmission and the amount of work left for reduce; it does not replace reduce (see the note after this list).
Reduce-side work:
2.1 Copy the data to each reducer over the network.
2.2 Call reduce to process the data. The data reduce receives is the key-value pairs produced by the whole map side; its output is also a set of key-value pairs, which is the final result.
2.3 Write the results to a path on the HDFS file system.
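As a concrete illustration of step 1.3, here is a minimal, hypothetical partitioner that sends words starting with a through m to one partition and everything else to another. The FirstLetterPartitioner name and the splitting rule are assumptions made for this sketch only; they are not part of the job built later in this article.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words beginning with a-m go to partition 0, the rest to partition 1,
// so the two groups end up in different output files handled by different reducers.
public class FirstLetterPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        if (numPartitions < 2 || key.getLength() == 0) {
            return 0; // with a single reducer there is only one partition
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

It would be wired in with job.setPartitionerClass(FirstLetterPartitioner.class) and job.setNumReduceTasks(2). As for step 1.5, because summing counts is associative, a word count job can simply reuse its reducer as the combiner with job.setCombinerClass(JReducer.class); the job below relies on the defaults and does neither.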
Create a new Java project and import the Hadoop jars: right-click the project and open the dialog for adding external jars to the build path,
navigate to the Hadoop installation directory and select all of the jar packages there,
then open the lib folder under the Hadoop installation directory and import all of the jar packages in it as well.
Create a new JMapper class as the custom Mapper class:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The custom Mapper class must extend Mapper and override the map method to implement its own logic
public class JMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // The map method is called once per line of the input file: a file with N lines produces N calls
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        // key: the starting offset of the line within the file
        // value: the content of the line
        // Split the line on tab characters; e.g. "hello\tworld" becomes an array with two elements: hello and world
        String[] ss = value.toString().split("\t");
        // Loop over the array, emitting each word as the output key with the value 1,
        // meaning the word appeared once
        for (String s : ss) {
            // context.write outputs a key-value pair
            context.write(new Text(s), new LongWritable(1));
        }
    }
}
Create a new JReducer class as the custom Reducer class:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The custom Reducer class must extend Reducer and override the reduce method to implement its own logic.
// The generic parameters are the input key type, input value type, output key type and output value type.
public class JReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    // The reduce method is called once for each distinct key and its group of values
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        // key: an individual word, such as hello, world, you, me
        // values: the collection of counts for this word, e.g. {1,1,1} means it appeared three times in the text
        long sum = 0;
        // Loop over the values, adding them up to get the total count
        for (LongWritable v : values) {
            sum += v.get();
        }
        // context.write outputs the new key-value pair (the result)
        context.write(key, new LongWritable(sum));
    }
}
Create a new class named JSubmit to build and submit the job:
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class JSubmit {
    public static void main(String[] args) throws IOException,
            URISyntaxException, InterruptedException, ClassNotFoundException {
        // Path is defined by the Hadoop API; create two Path objects,
        // one for the input file and one for the output directory
        Path outPath = new Path("hdfs://localhost:9000/out");
        // The input file path is a path on the local Linux file system
        Path inPath = new Path("/home/hadoop/word");
        // Create a default Configuration object
        Configuration conf = new Configuration();
        // Get the Hadoop file system from the address and conf;
        // if the output path already exists, delete it
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // Create a new Job object from conf, representing the job to be submitted,
        // named JSubmit.class.getSimpleName()
        Job job = new Job(conf, JSubmit.class.getSimpleName());
        // 1.1
        // FileInputFormat sets the file path to read
        FileInputFormat.setInputPaths(job, inPath);
        // setInputFormatClass sets the format class used to read the file
        job.setInputFormatClass(TextInputFormat.class);
        // 1.2 call the map method of the custom Mapper class
        // Set the Mapper class that does the processing
        job.setMapperClass(JMapper.class);
        // Set the data types of the key-value pairs output by the Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 1.3 partitioning: the following two lines are the default settings,
        // so writing them or leaving them out makes no difference
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
        // 1.4 sorting and grouping, 1.5 combining: these steps have default settings
        // and can be left as-is if there are no special requirements
        // 2.1 the data is transferred to the corresponding reducer over the network
        // 2.2 process it with the custom Reducer class
        // Set the Reducer class
        job.setReducerClass(JReducer.class);
        // Set the data types of the key-value pairs output by the Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // 2.3 output the result
        // FileOutputFormat sets the output path
        FileOutputFormat.setOutputPath(job, outPath);
        // setOutputFormatClass sets the format class used when writing the output
        job.setOutputFormatClass(TextOutputFormat.class);
        // Submit the current job and wait for it to complete
        job.waitForCompletion(true);
    }
}
Run the Java program, and you can see the job submission messages in the console.
View the output file in HDFS.
The job ran successfully!
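To make the result concrete, here is a purely hypothetical illustration (this input is not from the original run): if /home/hadoop/word contained the two tab-separated lines "hello	world" and "hello	me", the output file under hdfs://localhost:9000/out (part-r-00000) would contain one word per line with its total count, separated by a tab:
hello	2
me	1
world	1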
Submitting custom Hadoop jobs through the Java API