Hot Word Extraction: IKAnalyzer + Lucene + MapReduce (Hadoop)

This post records my recent work on extracting Chinese hot words.

First, a Chinese tokenizer is needed; I chose IKAnalyzer. Second, Lucene is used to handle synonyms. Third, given the amount of data, MapReduce is used.

After the IKAnalyzer and Lucene processing, the test text is cut into a set of words with synonyms unified. MapReduce then does the word-frequency statistics, essentially a WordCount pass; this is what the first job does. The first job produces an intermediate result in which each line is a word/frequency key-value pair, sorted by the dictionary order of the words.
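For illustration only, each intermediate line is a tab-separated (word, frequency) pair in dictionary order; with hypothetical words and counts it might look like:

云计算	3
数据	7
算法	5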

The intermediate result is one step away from the ideal goal: we want the result sorted in descending order of word frequency. So a second job is needed. It uses InverseMapper, takes the intermediate result as input, and swaps each key and value to produce (frequency, word) pairs. A custom IntWritableDescComparator then provides the descending order, so that after the reduce phase the output is sorted by frequency in descending order.

This article does not fully sort the MapReduce output, because in the current setting we only need the top X hot words, which simplifies the problem: assuming there are n reducers producing n result files, take the top X rows from each of the n files, then compare them together to pick out the final top X, as sketched below.
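A minimal sketch of that final merge step, which is not part of the original jobs; the class name, the helper method, the "frequency<TAB>word" line format, and X are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TopXMerger {

    // Each of the n reducer output files of job 2 is already sorted by frequency
    // in descending order, so only its first x lines can contain global top-x words.
    public static List<String[]> topX(List<String> partFiles, int x) throws IOException {
        List<String[]> candidates = new ArrayList<String[]>();
        for (String file : partFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                int taken = 0;
                // job 2 writes lines of the form "frequency<TAB>word"
                while (taken < x && (line = reader.readLine()) != null) {
                    candidates.add(line.split("\t", 2));
                    taken++;
                }
            } finally {
                reader.close();
            }
        }
        // compare the collected candidates and keep the global top x
        Collections.sort(candidates, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Integer.compare(Integer.parseInt(b[0]), Integer.parseInt(a[0]));
            }
        });
        return candidates.subList(0, Math.min(x, candidates.size()));
    }
}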

Each part is described in turn below.


1. IKAnalyzer

The intelligent (smart) word-segmentation mode is used.

In addition, to remove irrelevant words, a custom stopword.dic is defined. Register the custom stop-word dictionary's file name in IKAnalyzer.cfg.xml, and place both IKAnalyzer's default stop-word dictionary and the custom one in the root directory of the project's class files.
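As a sketch, an IKAnalyzer.cfg.xml that registers the custom stop-word dictionary might look like this (the entry key follows the common IKAnalyzer 2012 distribution and may differ in other versions):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom stop-word dictionary on the classpath root; multiple files are separated by ";" -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>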

IKAnalyzer is used inside map to segment each line of the test text (the map's InputFormat is TextInputFormat). Each resulting token is then passed in turn through the Lucene synonym handling, so that words with similar meanings are replaced by one unified word.

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            try {
                byte[] bt = tokenizer.nextToken().getBytes();
                InputStream ip = new ByteArrayInputStream(bt);
                Reader read = new InputStreamReader(ip);
                IKSegmenter iks = new IKSegmenter(read, true); // intelligent word-segmentation mode

                Lexeme t;
                while ((t = iks.next()) != null) {
                    // replace the token with its unified synonym before emitting (word, 1)
                    word.set(Synonym.getSynmWord(t.getLexemeText()));
                    output.collect(word, one);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}



2. Lucene

SynonymFilterFactory and a custom thesaurus are used to replace each synonym with a single unified word.

private static class Synonym {
    static SynonymFilterFactory factory;
    static WhitespaceAnalyzer whitespaceAnalyzer;

    static {
        Version ver = Version.LUCENE_5_5_0;
        java.util.Map<String, String> filterArgs = new HashMap<String, String>();
        filterArgs.put("luceneMatchVersion", ver.toString());
        filterArgs.put("synonyms", "sdic"); // path to the synonym file
        filterArgs.put("expand", "true");
        factory = new SynonymFilterFactory(filterArgs);
        try {
            factory.inform(new FilesystemResourceLoader(new File(".").toPath()));
        } catch (IOException e) {
            e.printStackTrace();
        }
        whitespaceAnalyzer = new WhitespaceAnalyzer();
    }

    public static String getSynmWord(String str) throws IOException {
        TokenStream ts = factory.create(whitespaceAnalyzer.tokenStream("synm", str));
        String synmWord = null;
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        // take the first token emitted by the synonym filter as the unified word
        if (ts.incrementToken()) {
            synmWord = termAttr.toString();
        }
        ts.end();
        ts.close();
        System.out.println(synmWord);
        return synmWord;
    }
}
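The synonym file ("sdic" above) follows the Solr synonyms format that SynonymFilterFactory reads; a hypothetical sketch, since the actual entries depend on your thesaurus:

# every term on the left of "=>" is rewritten to the unified word on the right
移动电话, 手机 => 手机
电脑, 计算机 => 计算机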



3. MapReduce

3.1 Reduce

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        Text outputKey = new Text();
        outputKey.set(key);

        output.collect(outputKey, new IntWritable(sum));
    }
}

3.2 IntWritableDescComparator

private static class IntWritableDescComparator extends IntWritable.Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}

3.3 Job 1

JobConf conf = new JobConf(HotWords.class);
conf.setJobName("Job1");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path("iktest"));
FileOutputFormat.setOutputPath(conf, new Path("tmp"));

JobClient.runJob(conf);

3.4 Job 2

JobConf sortConf = new JobConf(HotWords.class);
sortConf.setJobName("Job2");
sortConf.setOutputKeyClass(IntWritable.class);
sortConf.setOutputValueClass(Text.class);
sortConf.setMapperClass(InverseMapper.class);
sortConf.setNumReduceTasks(2);
// KeyValueTextInputFormat lets InverseMapper receive the original (word, frequency) pairs correctly
sortConf.setInputFormat(KeyValueTextInputFormat.class);
sortConf.setOutputFormat(TextOutputFormat.class);
sortConf.setOutputKeyComparatorClass(IntWritableDescComparator.class);

FileInputFormat.setInputPaths(sortConf, new Path("tmp"));
FileOutputFormat.setOutputPath(sortConf, new Path("result"));

JobClient.runJob(sortConf);

		

References:

http://www.cnblogs.com/xwdreamer/archive/2011/01/07/2297044.html

http://www.hankcs.com/program/java/lucene-synonymfilterfactory.html

