Hot Word Extraction: IKAnalyzer + Lucene + MapReduce (Hadoop)

This post records my recent work on extracting Chinese hot words.

First, a Chinese tokenizer is needed; I chose IKAnalyzer. Second, Lucene is used to handle synonyms. Third, given the amount of data, MapReduce is used.

After the IKAnalyzer and Lucene processing, the test text is cut into a set of words with synonyms unified. MapReduce then does the word-frequency statistics, essentially a WordCount pass; this is what the first job does. The first job produces an intermediate result in which each line is a word/frequency key-value pair, sorted by the dictionary order of the words.
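For illustration only, each intermediate line is a tab-separated (word, frequency) pair in dictionary order; with hypothetical words and counts it might look like:

云计算	3
数据	7
算法	5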

The intermediate result is one step away from the ideal goal: we want the result sorted in descending order of word frequency. So a second job is needed. It uses InverseMapper, takes the intermediate result as input, and swaps each key and value to produce (frequency, word) pairs. A custom IntWritableDescComparator then provides the descending order, so that after the reduce phase the output is sorted by frequency in descending order.

This article does not fully sort the MapReduce output, because in the current setting we only need the top X hot words, which simplifies the problem: assuming there are n reducers producing n result files, take the top X rows from each of the n files, then compare them together to pick out the final top X, as sketched below.
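A minimal sketch of that final merge step, which is not part of the original jobs; the class name, the helper method, the "frequency<TAB>word" line format, and X are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TopXMerger {

    // Each of the n reducer output files of job 2 is already sorted by frequency
    // in descending order, so only its first x lines can contain global top-x words.
    public static List<String[]> topX(List<String> partFiles, int x) throws IOException {
        List<String[]> candidates = new ArrayList<String[]>();
        for (String file : partFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                int taken = 0;
                // job 2 writes lines of the form "frequency<TAB>word"
                while (taken < x && (line = reader.readLine()) != null) {
                    candidates.add(line.split("\t", 2));
                    taken++;
                }
            } finally {
                reader.close();
            }
        }
        // compare the collected candidates and keep the global top x
        Collections.sort(candidates, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Integer.compare(Integer.parseInt(b[0]), Integer.parseInt(a[0]));
            }
        });
        return candidates.subList(0, Math.min(x, candidates.size()));
    }
}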

Each part is described in turn below.


1. IKAnalyzer

The intelligent (smart) word-segmentation mode is used.

In addition, to remove irrelevant words, a custom stopword.dic is defined. Register the custom stop-word dictionary's file name in IKAnalyzer.cfg.xml, and place both IKAnalyzer's default stop-word dictionary and the custom one in the root directory of the project's class files.
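As a sketch, an IKAnalyzer.cfg.xml that registers the custom stop-word dictionary might look like this (the entry key follows the common IKAnalyzer 2012 distribution and may differ in other versions):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom stop-word dictionary on the classpath root; multiple files are separated by ";" -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>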

IKAnalyzer is used inside map to segment each line of the test text (the map's InputFormat is TextInputFormat). Each resulting token is then passed in turn through the Lucene synonym handling, so that words with similar meanings are replaced by one unified word.

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            try {
                byte[] bt = tokenizer.nextToken().getBytes();
                InputStream ip = new ByteArrayInputStream(bt);
                Reader read = new InputStreamReader(ip);
                IKSegmenter iks = new IKSegmenter(read, true); // intelligent word-segmentation mode

                Lexeme t;
                while ((t = iks.next()) != null) {
                    // replace the token with its unified synonym before emitting (word, 1)
                    word.set(Synonym.getSynmWord(t.getLexemeText()));
                    output.collect(word, one);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}



2. Lucene

SynonymFilterFactory and a custom thesaurus are used to replace each synonym with a single unified word.

private static class Synonym {
    static SynonymFilterFactory factory;
    static WhitespaceAnalyzer whitespaceAnalyzer;

    static {
        Version ver = Version.LUCENE_5_5_0;
        java.util.Map<String, String> filterArgs = new HashMap<String, String>();
        filterArgs.put("luceneMatchVersion", ver.toString());
        filterArgs.put("synonyms", "sdic"); // path to the synonym file
        filterArgs.put("expand", "true");
        factory = new SynonymFilterFactory(filterArgs);
        try {
            factory.inform(new FilesystemResourceLoader(new File(".").toPath()));
        } catch (IOException e) {
            e.printStackTrace();
        }
        whitespaceAnalyzer = new WhitespaceAnalyzer();
    }

    public static String getSynmWord(String str) throws IOException {
        TokenStream ts = factory.create(whitespaceAnalyzer.tokenStream("synm", str));
        String synmWord = null;
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        // take the first token emitted by the synonym filter as the unified word
        if (ts.incrementToken()) {
            synmWord = termAttr.toString();
        }
        ts.end();
        ts.close();
        System.out.println(synmWord);
        return synmWord;
    }
}
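The synonym file ("sdic" above) follows the Solr synonyms format that SynonymFilterFactory reads; a hypothetical sketch, since the actual entries depend on your thesaurus:

# every term on the left of "=>" is rewritten to the unified word on the right
移动电话, 手机 => 手机
电脑, 计算机 => 计算机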



3. MapReduce

3.1 Reduce

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        Text outputKey = new Text();
        outputKey.set(key);

        output.collect(outputKey, new IntWritable(sum));
    }
}

3.2 IntWritableDescComparator

private static class IntWritableDescComparator extends IntWritable.Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}

3.3 Job 1

JobConf conf = new JobConf(HotWords.class);
conf.setJobName("Job1");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path("iktest"));
FileOutputFormat.setOutputPath(conf, new Path("tmp"));

JobClient.runJob(conf);

3.4 Job 2

JobConf sortConf = new JobConf(HotWords.class);
sortConf.setJobName("Job2");
sortConf.setOutputKeyClass(IntWritable.class);
sortConf.setOutputValueClass(Text.class);
sortConf.setMapperClass(InverseMapper.class);
sortConf.setNumReduceTasks(2);
// KeyValueTextInputFormat lets InverseMapper receive the original (word, frequency) pairs correctly
sortConf.setInputFormat(KeyValueTextInputFormat.class);
sortConf.setOutputFormat(TextOutputFormat.class);
sortConf.setOutputKeyComparatorClass(IntWritableDescComparator.class);

FileInputFormat.setInputPaths(sortConf, new Path("tmp"));
FileOutputFormat.setOutputPath(sortConf, new Path("result"));

JobClient.runJob(sortConf);

		

References:

http://www.cnblogs.com/xwdreamer/archive/2011/01/07/2297044.html

http://www.hankcs.com/program/java/lucene-synonymfilterfactory.html

