Using the Nagao algorithm in Java for new word discovery and hot word mining

The Nagao algorithm is used to count the frequency of every substring. From these frequencies we then compute, for each string, its term frequency, the number of left and right neighbors, the left and right neighbor entropy, and the mutual information (internal cohesion).

Terminology:

Nagao algorithm: a fast algorithm for counting the frequency of all substrings in a text. For details of the algorithm, see http://www.doc88.com/p-664123446503.html
Term frequency: the number of times the string appears in the document. The more often it occurs, the more important it is.
Left and right neighbors: the number of distinct characters that appear immediately to the left (or right) of the string in the document. The more distinct left and right neighbors a string has, the more likely it is to be a word.
Left and right entropy: the entropy of the count distribution of the distinct characters that appear immediately to the left (or right) of the string in the document. It is similar to the neighbor count, but it also reflects how evenly the occurrences are spread over the neighbors.
Mutual information: split the string into a left part and a right part at each possible position, divide the probability of the whole string by the product of the probabilities of the two parts occurring independently, and take the minimum of this ratio over all splits. The larger the value, the higher the cohesion within the string and the more likely it is to be a word.
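
The cohesion measure in the last item can be tried on a toy example. The following is a minimal sketch, not the article's implementation: the class name MutualInfoSketch, the frequency table and the total of 1000 characters are all made up for illustration, and Latin letters stand in for Chinese characters.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the counts below are invented, not taken from a corpus.
class MutualInfoSketch {

    // Minimum over all split points of P(word) / (P(left) * P(right)).
    static double minMutualInfo(String word, Map<String, Integer> tf, double totalChars) {
        if (word.length() <= 1) {
            return 0; // a single character cannot be split
        }
        double pWord = tf.get(word) / totalChars;
        double min = Double.MAX_VALUE;
        for (int pos = 1; pos < word.length(); pos++) {
            double pLeft = tf.get(word.substring(0, pos)) / totalChars;
            double pRight = tf.get(word.substring(pos)) / totalChars;
            min = Math.min(min, pWord / (pLeft * pRight));
        }
        return min;
    }

    public static void main(String[] args) {
        // Hypothetical frequencies in a text of 1000 characters.
        Map<String, Integer> tf = new HashMap<>();
        tf.put("abc", 10);
        tf.put("a", 20);
        tf.put("bc", 10);
        tf.put("ab", 10);
        tf.put("c", 50);
        // Split a|bc gives a ratio of about 50, split ab|c about 20; the minimum is kept.
        System.out.println(minMutualInfo("abc", tf, 1000.0));
    }
}
```

Taking the minimum means a string is only rated as cohesive if every way of cutting it looks unlikely to have arisen from two independent parts.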

Algorithm steps:

1. Read the input file line by line and split each line into strings on runs of non-Chinese characters ([^\u4e00-\u9fa5]+) and on the single-character stop words (common Chinese function characters). The code is as follows:
String[] phrases = line.split("[^\u4e00-\u9fa5]+|[" + stopwords + "]");
The stop words can be modified.
2. For every split string, take all of its right substrings (its suffixes) and all of its left substrings (the suffixes of the reversed string) and add them to the right and left PTable respectively.
3. Sort each PTable and compute its LTable. LTable[i] records how many leading characters the i-th substring in the sorted PTable has in common with the substring before it.
4. Traverse the PTable and LTable to obtain the term frequency and the left and right neighbors of every substring.
5. From the term frequencies and neighbor results, output for each string its term frequency, its number of left and right neighbors, its left and right entropy, and its mutual information.
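
Steps 2 to 4 can be illustrated on a very small input. The following is a simplified sketch, assuming a single PTable (right substrings only) and counting frequencies without neighbors; the class name PTableSketch and its method names are hypothetical, and Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the PTable / LTable idea; not the full algorithm.
class PTableSketch {

    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Counts every substring of length <= maxLen using sorted suffixes.
    static Map<String, Integer> countSubstrings(String text, int maxLen) {
        // PTable: all suffixes of the text, sorted.
        List<String> pTable = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            pTable.add(text.substring(i));
        }
        Collections.sort(pTable);

        // LTable: common prefix length of each row with the previous row.
        int[] lTable = new int[pTable.size()];
        for (int i = 1; i < pTable.size(); i++) {
            lTable[i] = commonPrefixLength(pTable.get(i - 1), pTable.get(i));
        }

        Map<String, Integer> tf = new TreeMap<>();
        for (int p = 0; p < pTable.size(); p++) {
            String suffix = pTable.get(p);
            // Prefixes no longer than lTable[p] were already counted at an earlier row.
            for (int len = lTable[p] + 1; len <= maxLen && len <= suffix.length(); len++) {
                int count = 1;
                // Following rows share this prefix as long as their LTable entry is >= len.
                for (int q = p + 1; q < pTable.size() && lTable[q] >= len; q++) {
                    count++;
                }
                tf.put(suffix.substring(0, len), count);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Sorted PTable: ab, abab, b, bab; LTable: 0, 2, 0, 1
        System.out.println(countSubstrings("abab", 2)); // prints {a=2, ab=2, b=2, ba=1}
    }
}
```

Because the PTable is sorted, all suffixes that start with the same substring sit in consecutive rows, so a frequency is simply the length of a run in the LTable.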

1. NagaoAlgorithm.java

package com.algo.word;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NagaoAlgorithm {

  // maximum length of the substrings that are counted
  private int N;

  private List<String> leftPTable;
  private int[] leftLTable;
  private List<String> rightPTable;
  private int[] rightLTable;

  // total number of characters added to the PTable
  private double wordNumber;

  private Map<String, TFNeighbor> wordTFNeighbor;

  // single-character stop words used as split points (assumed list; adjust as needed)
  private final static String stopwords = "的很了么呢是嘛个都也比还这于不与才上用就好在和对挺去后没说";

  private NagaoAlgorithm() {
    // default N = 5
    N = 5;
    leftPTable = new ArrayList<String>();
    rightPTable = new ArrayList<String>();
    wordTFNeighbor = new HashMap<String, TFNeighbor>();
  }

  // reverse phrase
  private String reverse(String phrase) {
    return new StringBuilder(phrase).reverse().toString();
  }

  // co-prefix length of s1 and s2
  private int coPrefixLength(String s1, String s2) {
    int coPrefixLength = 0;
    for (int i = 0; i < Math.min(s1.length(), s2.length()); i++) {
      if (s1.charAt(i) == s2.charAt(i)) coPrefixLength++;
      else break;
    }
    return coPrefixLength;
  }

  // add substrings of line to the PTables
  private void addToPTable(String line) {
    // split line on runs of non-Chinese characters and on stop words
    String[] phrases = line.split("[^\u4e00-\u9fa5]+|[" + stopwords + "]");
    for (String phrase : phrases) {
      for (int i = 0; i < phrase.length(); i++)
        rightPTable.add(phrase.substring(i));
      String reversePhrase = reverse(phrase);
      for (int i = 0; i < reversePhrase.length(); i++)
        leftPTable.add(reversePhrase.substring(i));
      wordNumber += phrase.length();
    }
  }

  // sort the PTables and count the LTables
  private void countLTable() {
    Collections.sort(rightPTable);
    rightLTable = new int[rightPTable.size()];
    for (int i = 1; i < rightPTable.size(); i++)
      rightLTable[i] = coPrefixLength(rightPTable.get(i - 1), rightPTable.get(i));

    Collections.sort(leftPTable);
    leftLTable = new int[leftPTable.size()];
    for (int i = 1; i < leftPTable.size(); i++)
      leftLTable[i] = coPrefixLength(leftPTable.get(i - 1), leftPTable.get(i));

    System.out.println("Info: [Nagao Algorithm Step 2]: have sorted PTable and counted left and right LTable");
  }

  // according to PTable and LTable, count statistical result: TF, neighbor distribution
  private void countTFNeighbor() {
    // get TF and right neighbor
    for (int pIndex = 0; pIndex < rightPTable.size(); pIndex++) {
      String phrase = rightPTable.get(pIndex);
      for (int length = 1 + rightLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
        String word = phrase.substring(0, length);
        TFNeighbor tfNeighbor = new TFNeighbor();
        tfNeighbor.incrementTF();
        if (phrase.length() > length)
          tfNeighbor.addToRightNeighbor(phrase.charAt(length));
        // following rows share this prefix while their LTable entry is >= length
        for (int lIndex = pIndex + 1; lIndex < rightLTable.length; lIndex++) {
          if (rightLTable[lIndex] >= length) {
            tfNeighbor.incrementTF();
            String coPhrase = rightPTable.get(lIndex);
            if (coPhrase.length() > length)
              tfNeighbor.addToRightNeighbor(coPhrase.charAt(length));
          } else break;
        }
        wordTFNeighbor.put(word, tfNeighbor);
      }
    }
    // get left neighbor
    for (int pIndex = 0; pIndex < leftPTable.size(); pIndex++) {
      String phrase = leftPTable.get(pIndex);
      for (int length = 1 + leftLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
        String word = reverse(phrase.substring(0, length));
        TFNeighbor tfNeighbor = wordTFNeighbor.get(word);
        if (phrase.length() > length)
          tfNeighbor.addToLeftNeighbor(phrase.charAt(length));
        for (int lIndex = pIndex + 1; lIndex < leftLTable.length; lIndex++) {
          if (leftLTable[lIndex] >= length) {
            String coPhrase = leftPTable.get(lIndex);
            if (coPhrase.length() > length)
              tfNeighbor.addToLeftNeighbor(coPhrase.charAt(length));
          } else break;
        }
      }
    }
    System.out.println("Info: [Nagao Algorithm Step 3]: have counted TF and neighbor");
  }

  // according to wordTFNeighbor, count MI of word
  private double countMI(String word) {
    if (word.length() <= 1)
      return 0;
    double coProbability = wordTFNeighbor.get(word).getTF() / wordNumber;
    List<Double> mi = new ArrayList<Double>(word.length());
    for (int pos = 1; pos < word.length(); pos++) {
      String leftPart = word.substring(0, pos);
      String rightPart = word.substring(pos);
      double leftProbability = wordTFNeighbor.get(leftPart).getTF() / wordNumber;
      double rightProbability = wordTFNeighbor.get(rightPart).getTF() / wordNumber;
      mi.add(coProbability / (leftProbability * rightProbability));
    }
    return Collections.min(mi);
  }

  // save TF, (left and right) neighbor number, neighbor entropy, mutual information
  private void saveTFNeighborInfoMI(String out, String stopList, String[] threshold) {
    int minTF = Integer.parseInt(threshold[0]);
    int minLeft = Integer.parseInt(threshold[1]);
    int minRight = Integer.parseInt(threshold[2]);
    int minMI = Integer.parseInt(threshold[3]);
    try {
      // read stop words file
      Set<String> stopWords = new HashSet<String>();
      try (BufferedReader br = new BufferedReader(new FileReader(stopList))) {
        String line;
        while ((line = br.readLine()) != null) {
          if (line.length() > 1)
            stopWords.add(line);
        }
      }
      // output words TF, neighbor info, MI
      try (BufferedWriter bw = new BufferedWriter(new FileWriter(out))) {
        for (Map.Entry<String, TFNeighbor> entry : wordTFNeighbor.entrySet()) {
          if (entry.getKey().length() <= 1 || stopWords.contains(entry.getKey()))
            continue;
          TFNeighbor tfNeighbor = entry.getValue();
          int tf = tfNeighbor.getTF();
          int leftNeighborNumber = tfNeighbor.getLeftNeighborNumber();
          int rightNeighborNumber = tfNeighbor.getRightNeighborNumber();
          double mi = countMI(entry.getKey());
          if (tf > minTF && leftNeighborNumber > minLeft
              && rightNeighborNumber > minRight && mi > minMI) {
            StringBuilder sb = new StringBuilder();
            sb.append(entry.getKey());
            sb.append(",").append(tf);
            sb.append(",").append(leftNeighborNumber);
            sb.append(",").append(rightNeighborNumber);
            sb.append(",").append(tfNeighbor.getLeftNeighborEntropy());
            sb.append(",").append(tfNeighbor.getRightNeighborEntropy());
            sb.append(",").append(mi).append("\n");
            bw.write(sb.toString());
          }
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    System.out.println("Info: [Nagao Algorithm Step 4]: have saved to file");
  }

  // apply Nagao algorithm to input files with the defaults N = 5 and filter "20,3,3,5"
  public static void applyNagao(String[] inputs, String out, String stopList) {
    applyNagao(inputs, out, stopList, 5, "20,3,3,5");
  }

  public static void applyNagao(String[] inputs, String out, String stopList, int n, String filter) {
    NagaoAlgorithm nagao = new NagaoAlgorithm();
    nagao.setN(n);
    String[] threshold = filter.split(",");
    if (threshold.length != 4) {
      System.out.println("ERROR: filter must have 4 numbers, separated with ','");
      return;
    }
    // step 1: add phrases to PTable
    for (String in : inputs) {
      try (BufferedReader br = new BufferedReader(new FileReader(in))) {
        String line;
        while ((line = br.readLine()) != null) {
          nagao.addToPTable(line);
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
    System.out.println("Info: [Nagao Algorithm Step 1]: have added all left and right substrings to PTable");
    // step 2: sort PTable and count LTable
    nagao.countLTable();
    // step 3: count TF and neighbor
    nagao.countTFNeighbor();
    // step 4: save TF, neighbor info and MI
    nagao.saveTFNeighborInfoMI(out, stopList, threshold);
  }

  private void setN(int n) {
    N = n;
  }

  public static void main(String[] args) {
    String[] ins = {"E://test//ganfen.txt"};
    applyNagao(ins, "E://test//out.txt", "E://test//stoplist.txt");
  }
}

2. TFNeighbor.java

package com.algo.word;

import java.util.HashMap;
import java.util.Map;

public class TFNeighbor {

  private int tf;
  // neighbor character -> number of times it appears next to the word
  private Map<Character, Integer> leftNeighbor;
  private Map<Character, Integer> rightNeighbor;

  TFNeighbor() {
    leftNeighbor = new HashMap<Character, Integer>();
    rightNeighbor = new HashMap<Character, Integer>();
  }

  // add a character to leftNeighbor
  public void addToLeftNeighbor(char word) {
    // leftNeighbor.put(word, 1 + leftNeighbor.getOrDefault(word, 0));
    Integer number = leftNeighbor.get(word);
    leftNeighbor.put(word, number == null ? 1 : 1 + number);
  }

  // add a character to rightNeighbor
  public void addToRightNeighbor(char word) {
    // rightNeighbor.put(word, 1 + rightNeighbor.getOrDefault(word, 0));
    Integer number = rightNeighbor.get(word);
    rightNeighbor.put(word, number == null ? 1 : 1 + number);
  }

  // increment tf
  public void incrementTF() {
    tf++;
  }

  public int getLeftNeighborNumber() {
    return leftNeighbor.size();
  }

  public int getRightNeighborNumber() {
    return rightNeighbor.size();
  }

  public double getLeftNeighborEntropy() {
    return entropy(leftNeighbor);
  }

  public double getRightNeighborEntropy() {
    return entropy(rightNeighbor);
  }

  // entropy of a neighbor count distribution: log(sum) - sum(n * log(n)) / sum
  private static double entropy(Map<Character, Integer> neighbor) {
    double entropy = 0;
    int sum = 0;
    for (int number : neighbor.values()) {
      entropy += number * Math.log(number);
      sum += number;
    }
    if (sum == 0) return 0;
    return Math.log(sum) - entropy / sum;
  }

  public int getTF() {
    return tf;
  }
}
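
The entropy methods in this class return log(sum) minus the count-weighted average of log(count), which is an algebraic rearrangement of the usual definition, the negative sum of p * log(p) with p = count / sum. The following is a small sketch that checks this numerically and shows how entropy differs from the plain neighbor count; the class name NeighborEntropySketch and the example counts are made up, and all counts are assumed to be positive.

```java
// Illustrative sketch only: compares two equivalent ways of computing neighbor entropy.
class NeighborEntropySketch {

    // Same form as the article's code: log(sum) - sum(n * log(n)) / sum.
    static double entropyFromCounts(int[] counts) {
        double weighted = 0;
        int sum = 0;
        for (int n : counts) {
            weighted += n * Math.log(n);
            sum += n;
        }
        return sum == 0 ? 0 : Math.log(sum) - weighted / sum;
    }

    // Textbook definition: -sum(p * log(p)) with p = n / sum.
    static double entropyFromProbabilities(int[] counts) {
        int sum = 0;
        for (int n : counts) {
            sum += n;
        }
        double h = 0;
        for (int n : counts) {
            double p = (double) n / sum;
            h -= p * Math.log(p);
        }
        return h;
    }

    public static void main(String[] args) {
        int[] even = {5, 5, 5, 5};    // four distinct neighbors, evenly spread
        int[] skewed = {17, 1, 1, 1}; // four distinct neighbors, one dominant
        System.out.println(entropyFromCounts(even) + " vs " + entropyFromProbabilities(even));
        System.out.println(entropyFromCounts(skewed) + " vs " + entropyFromProbabilities(skewed));
    }
}
```

Both arrays have a neighbor number of 4, but the evenly spread one has the higher entropy (the natural logarithm of 4), so entropy penalizes strings whose neighbors are dominated by a single character.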

3. Main.java

package com.algo.word;

public class Main {

  public static void main(String[] args) {
    // if 3 arguments, the first argument is the input files, separated with ','
    // second argument is the output file
    // output 7 columns separated with ',', like below:
    // word, term frequency, left neighbor number, right neighbor number, left neighbor entropy, right neighbor entropy, mutual information
    // third argument is the stop words list
    if (args.length == 3)
      NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2]);

    // if 5 arguments, fourth argument is the NGram parameter N
    // 5th argument is the threshold to output words, default is "20,3,3,5"
    // output TF > 20 && (left | right) neighbor number > 3 && MI > 5
    else if (args.length == 5)
      NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2], Integer.parseInt(args[3]), args[4]);
  }
}

That is all for this article; I hope you find it useful.
