A detailed look at MapReduce application scenarios: data deduplication and the inverted index

Source: Internet
Author: User

Hadoop written interview question: find the common friends of different people (taking data deduplication into account)

Example:

Zhang San: John Doe, Harry, Zhao Liu

John Doe: Zhang San, Tian Qi, Harry
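The question itself is usually answered with two MapReduce passes: the first pass inverts each "person → friend list" record into "friend → people who list that friend", and the second pass emits each pair of those people with the shared friend, then groups by pair. The sketch below simulates that two-pass logic in plain Java (no Hadoop); the class and method names are illustrative, not from the original article.

```java
import java.util.*;

// Plain-Java simulation of the two-pass common-friends MapReduce idea.
public class CommonFriendsSketch {

    // Pass 1: invert "person -> friends" into "friend -> people who list that friend"
    // (this is what the first map/reduce round produces after grouping by friend).
    static Map<String, Set<String>> invert(Map<String, Set<String>> friendLists) {
        Map<String, Set<String>> byFriend = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : friendLists.entrySet()) {
            for (String friend : e.getValue()) {
                byFriend.computeIfAbsent(friend, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return byFriend;
    }

    // Pass 2: for every pair of people sharing a friend, emit "pair -> common friends".
    // Sorting the pair members before joining deduplicates (A,B) vs (B,A) keys.
    static Map<String, Set<String>> commonFriends(Map<String, Set<String>> friendLists) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : invert(friendLists).entrySet()) {
            List<String> people = new ArrayList<>(e.getValue()); // already sorted (TreeSet)
            for (int i = 0; i < people.size(); i++) {
                for (int j = i + 1; j < people.size(); j++) {
                    String pair = people.get(i) + "-" + people.get(j);
                    result.computeIfAbsent(pair, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }
        return result;
    }
}
```

With the two records above, the only pair sharing a friend is (John Doe, Zhang San), whose common friend is Harry.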

In actual work there is quite a lot of data that needs deduplication, including filtering out empty values and so on. This article explains data deduplication and the inverted index in detail.

First, data deduplication [simulating deduplication of a carrier's call detail records]

Counting distinct items in a data set, computing site visits from log files, and similar project tasks all involve data deduplication, which is also a frequently used technique for reducing stored data. A simple case illustrates how MapReduce implements it.

① Raw simulated data [C = outgoing call, b = incoming call]

13711111111 C
13611111111 b
13711111111 b
13722222222 C
13611111111 C

13711111111 b
13611111111 b
13711111111 b
13722222222 b
13611111111 C

The idea: send all records of the same piece of data to the same reduce task; that record will then appear exactly once in the final output.

In the map stage, with Hadoop's default job input format, the value that is read in is used as the output key.

// Mapper task: emit each whole line as the key with an empty value,
// so the shuffle phase groups identical records under one key
static class DdMap extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text EMPTY = new Text("");

    @Override
    protected void map(LongWritable k1, Text v1, Context context)
            throws IOException, InterruptedException {
        context.write(v1, EMPTY);
    }
}

// Reducer task: each distinct key arrives exactly once, so writing
// the key emits a single copy of every duplicated record
static class DdReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text k2, Iterable<Text> v2s, Context context)
            throws IOException, InterruptedException {
        context.write(k2, new Text(""));
    }
}

// Initialize parameters
public static final String HOST_PATH = "hdfs://v:9000";
// Input path (must be created manually)
public static final String INPUT_PATH = HOST_PATH + "/ddin";
// Output path
public static final String OUTPUT_PATH = HOST_PATH + "/ddout";

// Driver: configure and run the MapReduce job
public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();

    // Delete the output path if it already exists
    FileSystem fs = FileSystem.get(new URI(HOST_PATH), conf);
    if (fs.exists(new Path(OUTPUT_PATH))) {
        fs.delete(new Path(OUTPUT_PATH), true);
    }

    // Create a Job object
    final Job job = new Job(conf);

    // Tell the job the file input path
    FileInputFormat.setInputPaths(job, INPUT_PATH);
    // Tell the job the file output path
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

    // How the input file is parsed into key-value pairs (default, can be omitted)
    job.setInputFormatClass(TextInputFormat.class);

    // Register the custom mapper
    job.setMapperClass(DdMap.class);
    // Set the <k2,v2> types; can be omitted when they match <k3,v3>
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // Register the custom reducer
    job.setReducerClass(DdReduce.class);
    // Set the <k3,v3> types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // How <k3,v3> is written to HDFS (default, can be omitted)
    job.setOutputFormatClass(TextOutputFormat.class);

    // Run the job and wait for completion
    job.waitForCompletion(true);
}

Deduplicated output:

13611111111 b
13611111111 C
13711111111 b
13711111111 C
13722222222 b
13722222222 C

Second, the inverted index

The inverted index is the most commonly used data structure in document retrieval systems and is widely applied in full-text search engines.

It is mainly used to store a mapping from a word or phrase to the document or set of documents containing it, which provides a way to look up documents by their content.

Because the lookup direction is reversed, from content to document rather than from document to content, it is called an inverted index.

In practice, each document also carries a weight value, which indicates how relevant the document is to the search terms.

The most common weight is the term frequency, i.e., the number of times the word appears in the document.

More complex weighting also records how many documents a word appears in, in order to implement the TF-IDF (term frequency-inverse document frequency) algorithm, or takes the position of the word within the document into account, and so on.
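To make the weighting concrete, here is a minimal plain-Java sketch of the TF-IDF idea just described. The formula used, tf × ln(N / df), is one common variant; real systems differ in smoothing and normalization, and the class and method names here are illustrative.

```java
import java.util.*;

// Minimal TF-IDF sketch: term frequency weighted by inverse document frequency.
public class TfIdfSketch {

    // tf(word, doc): how many times the word occurs in the document
    static int termFrequency(String word, String document) {
        int tf = 0;
        for (String token : document.split("\\s+")) {
            if (token.equals(word)) tf++;
        }
        return tf;
    }

    // tf-idf weight of a word in one document of a corpus:
    // frequent in this document, rare across the corpus => high weight
    static double tfIdf(String word, String document, List<String> corpus) {
        int df = 0; // number of documents containing the word
        for (String doc : corpus) {
            if (termFrequency(word, doc) > 0) df++;
        }
        if (df == 0) return 0.0;
        return termFrequency(word, document) * Math.log((double) corpus.size() / df);
    }
}
```

On the three sample files below, "MapReduce" appears in every document, so its idf term ln(3/3) is zero, while "simple" appears in only one document and gets weight ln(3).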

file1.txt: MapReduce is simple

file2.txt: MapReduce is powerful

file3.txt: Hello MapReduce Bye MapReduce

Information to record: the word, the document URL, and the word frequency.

In the map stage, the emitted <key,value> pair looks like <"MapReduce:file1.txt", "1">.

Since a <key,value> pair can hold only two values, the document name and the word frequency must be packed together as needed.

The advantage of putting the word in the key: the MR framework's default sorting groups the occurrences of the same word in the same document together, so their frequencies can be summed in the combine step.

The benefit of combining the URL and the word frequency into the value: the MR framework's default HashPartitioner class completes the shuffle, sending all records with the same word (key) to the same reducer.
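Why the same word always reaches the same reducer: Hadoop's default HashPartitioner chooses the reduce task purely from the key's hash. A plain-Java sketch of that assignment logic (class name illustrative; the masking trick mirrors what the real partitioner does):

```java
// Sketch of the default hash-partitioning logic: a record's reduce task is
// determined only by its key, so equal keys always land on the same reducer.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so negative hashCodes still map to a valid partition
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

The call is deterministic: the same key with the same number of reduce tasks always yields the same partition index.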

The reducer alone cannot both finish the word-frequency count and build the document list, so a combine step is added to complete the frequency count.

The combine step sums the values that share the same key, giving that word's frequency in that document.

It then reshapes each record into the format the inverted index file needs: the word alone as the key, and "URL:frequency" as the value.

The input files should not be too large, so that each file corresponds to exactly one split; otherwise, because the reduce step does no further frequency counting, some words may be under-counted in the final result. You can force each file to be a single split by overriding the InputFormat class, or use a composite key-value type to build an inverted index that carries more information.

package sort;

import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Test {
    // Input:
    //   file1.txt: MapReduce is simple
    //   file2.txt: MapReduce is powerful
    //   file3.txt: Hello MapReduce Bye MapReduce
    // Expected output:
    //   Bye        file3.txt:1;
    //   Hello      file3.txt:1;
    //   MapReduce  file1.txt:1;file2.txt:1;file3.txt:2;
    //   is         file1.txt:1;file2.txt:1;
    //   powerful   file2.txt:1;
    //   simple     file1.txt:1;

    // Mapper task: emit <"word:filename", "1"> for every token
    static class IiMap extends Mapper<LongWritable, Text, Text, Text> {
        private static Text k2 = new Text(); // word + ":" + file name
        private static Text v2 = new Text(); // word frequency

        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit this <k1,v1> belongs to
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            StringTokenizer tokenizer = new StringTokenizer(v1.toString());
            while (tokenizer.hasMoreTokens()) {
                // Extract the file name from the split's path
                int indexOf = fileSplit.getPath().toString().indexOf("file");
                k2.set(tokenizer.nextToken() + ":"
                        + fileSplit.getPath().toString().substring(indexOf));
                v2.set("1");
                context.write(k2, v2);
            }
        }
    }

    // Combiner task: sum the frequency, then move the file name from key to value
    static class IiCombiner extends Reducer<Text, Text, Text, Text> {
        private Text text = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count the word frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // Reset value to "filename:frequency"
            text.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset key to the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, text);
        }
    }

    // Reducer task: concatenate the document list for each word
    static class IiReduce extends Reducer<Text, Text, Text, Text> {
        private Text v3 = new Text();

        @Override
        protected void reduce(Text k2, Iterable<Text> v2s, Context context)
                throws IOException, InterruptedException {
            // Generate the document list
            StringBuilder fileList = new StringBuilder();
            for (Text value : v2s) {
                fileList.append(value.toString()).append(";");
            }
            v3.set(fileList.toString());
            context.write(k2, v3);
        }
    }

    // Initialize parameters
    public static final String HOST_PATH = "hdfs://v:9000";
    // Input path (must be created manually)
    public static final String INPUT_PATH = HOST_PATH + "/iiin";
    // Output path
    public static final String OUTPUT_PATH = HOST_PATH + "/iiout";

    // Driver: configure and run the MapReduce job
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();

        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(new URI(HOST_PATH), conf);
        if (fs.exists(new Path(OUTPUT_PATH))) {
            fs.delete(new Path(OUTPUT_PATH), true);
        }

        // Create a Job object
        final Job job = new Job(conf);

        // Tell the job the file input and output paths
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        // How the input file is parsed into key-value pairs (default, can be omitted)
        job.setInputFormatClass(TextInputFormat.class);

        // Register the custom mapper; set the <k2,v2> types
        job.setMapperClass(IiMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Register the combiner
        job.setCombinerClass(IiCombiner.class);

        // Register the custom reducer; set the <k3,v3> types
        job.setReducerClass(IiReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // How <k3,v3> is written to HDFS (default, can be omitted)
        job.setOutputFormatClass(TextOutputFormat.class);

        // Run the job and wait for completion
        job.waitForCompletion(true);
    }
}

The resulting inverted index:

Bye	file3.txt:1;
Hello	file3.txt:1;
MapReduce	file1.txt:1;file2.txt:1;file3.txt:2;
is	file1.txt:1;file2.txt:1;
powerful	file2.txt:1;
simple	file1.txt:1;
