Recently, in a text-mining project, we needed to match the similarity of vectors built from an N-gram model.
The N-gram tokenization procedure
A reader asked me about this, so I decided to write it up here. As for the jar packages, they are easy to find: search for the Lucene jars and you will find them.
```java
package edu.fjnu.huanghong;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Alternative tokenizers from the same package:
// org.apache.lucene.analysis.ngram.Lucene43EdgeNGramTokenizer
// org.apache.lucene.analysis.ngram.Lucene43NGramTokenizer

public class Ngram {
    public static void main(String[] args) {
        String s = "Pick up white iphone6 phone case transparent owner way 15659119418";
        // Remove the spaces by splitting and re-joining the pieces.
        String[] str = s.split(" ");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length; i++) {
            sb.append(str[i]);
        }
        System.out.println(sb.toString());
        StringReader sr = new StringReader(sb.toString());
        // N-gram model tokenizer
        Tokenizer tokenizer = new NGramTokenizer(Version.LUCENE_45, sr);
        testTokenizer(tokenizer);
    }

    private static void testTokenizer(Tokenizer tokenizer) {
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                CharTermAttribute charTermAttribute =
                        tokenizer.addAttribute(CharTermAttribute.class);
                System.out.print(charTermAttribute.toString() + "|");
            }
            tokenizer.end();
            tokenizer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
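The tokens produced above can then be turned into count vectors for the similarity matching mentioned at the start. Below is a minimal sketch of that step, independent of Lucene; the class and method names (`NgramSimilarity`, `ngrams`, `cosine`) are my own, and it uses plain character bigrams with cosine similarity over the counts:

```java
import java.util.HashMap;
import java.util.Map;

public class NgramSimilarity {
    // Collect the character n-grams of the given size into a count map.
    static Map<String, Integer> ngrams(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            nb += v * v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> v1 = ngrams("iphone6 case", 2);
        Map<String, Integer> v2 = ngrams("iphone6 cover", 2);
        System.out.println(cosine(v1, v2));
    }
}
```

Identical strings score 1.0 and strings sharing no bigrams score 0, so the value works as a similarity between short texts such as product listings.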
I do not know whether anyone here has experience with q-grams; I could not find much material on them (some resources cannot be reached from behind the firewall).
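For what it is worth, a q-gram is just an n-gram of length q, and the standard q-gram distance counts, over all q-grams, the difference between the two strings' occurrence counts. A rough sketch of that definition (class and method names are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class QgramDistance {
    // Multiset of q-grams (substrings of length q) with their counts.
    static Map<String, Integer> qgrams(String s, int q) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= s.length(); i++) {
            counts.merge(s.substring(i, i + q), 1, Integer::sum);
        }
        return counts;
    }

    // q-gram distance: sum over all q-grams of |count in x - count in y|.
    static int distance(String x, String y, int q) {
        Map<String, Integer> gx = qgrams(x, q);
        Map<String, Integer> gy = qgrams(y, q);
        int d = 0;
        for (Map.Entry<String, Integer> e : gx.entrySet()) {
            d += Math.abs(e.getValue() - gy.getOrDefault(e.getKey(), 0));
        }
        for (Map.Entry<String, Integer> e : gy.entrySet()) {
            // q-grams that appear only in y.
            if (!gx.containsKey(e.getKey())) d += e.getValue();
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(distance("nelson", "neilsen", 2));
    }
}
```

The distance is 0 for identical strings and grows with the number of unshared q-grams, which makes it cheap to use as a pre-filter before more expensive edit-distance comparisons.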
Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.
Lucene N-gram tokenization