Different analyzers produce different keywords and take different amounts of time.
When Chinese word segmentation is needed, IKAnalyzer is a good choice, but it is comparatively slow: on my machine segmentation took 800+ ms.
Analyzer workflow:
Input text (What's your name?)
→ Tokenization (What's ; your ; name), different analyzers split differently
→ Stop-word removal ()
→ Morphological reduction (What's -> What)
→ Lowercasing (What -> what)
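The four stages above can be sketched in plain Java without Lucene, just to make the flow concrete. The stop-word list and the "'s"-stripping rule below are simplified stand-ins for what a real analyzer does, not any actual analyzer's behavior:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    // toy stop-word list; real analyzers ship much larger ones
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("your", "the", "a", "an", "is"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. tokenization: split on whitespace and punctuation
        for (String token : text.split("[\\s?!.,;]+")) {
            if (token.isEmpty()) continue;
            // 2. stop-word removal
            if (STOP_WORDS.contains(token.toLowerCase())) continue;
            // 3. crude morphological reduction: strip a trailing "'s"
            if (token.endsWith("'s")) {
                token = token.substring(0, token.length() - 2);
            }
            // 4. lowercasing
            terms.add(token.toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        // "What's your name?" -> [what, name]
        System.out.println(analyze("What's your name?"));
    }
}
```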
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

private long stime;
private long etime;
private Analyzer analyzer;

@Before
public void s() {
    stime = System.currentTimeMillis();
}

@After
public void e() {
    etime = System.currentTimeMillis();
    System.out.println("Tokenizing with " + analyzer.getClass().getName()
            + " took " + (etime - stime) + "ms");
}

@Test
public void test() throws Exception {
    // analyzer = new SimpleAnalyzer(Version.LUCENE_35);
    // analyzer = new StandardAnalyzer(Version.LUCENE_35);
    analyzer = new IKAnalyzer();
    analyze(analyzer, "hTTp://www.baidu.com/s?wd=Lucene中文分詞");
}

private void analyze(Analyzer analyzer, String text) throws Exception {
    TokenStream tokens = analyzer.reusableTokenStream("content", new StringReader(text));
    OffsetAttribute offsetAttr = tokens.getAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttr = tokens.getAttribute(CharTermAttribute.class);
    while (tokens.incrementToken()) {
        // use the term's own length, not the offset span: after stemming or
        // other rewriting the term text can differ from the original span
        String term = new String(charTermAttr.buffer(), 0, charTermAttr.length());
        System.out.println(term + ", " + offsetAttr.startOffset() + ", " + offsetAttr.endOffset());
    }
    tokens.close();
    // Pre-3.x style, now deprecated:
    // while (ts.incrementToken()) {
    //     TermAttribute ta = ts.getAttribute(TermAttribute.class);
    //     System.out.println(ta.term());
    // }
}