Lucene 5.x: Chinese Synonyms


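I went through a lot of material on this. English synonyms worked fine, but Chinese ones did not. Thanks to help from other users online, I pieced together quite a bit of code and then modified parts of it. Corrections are welcome.

Let's go straight to the code, since reading it is the best way to see how the tokenization works.

1. Create the synonym engine interface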

    public interface SamewordContext {
        // Returns the synonyms of the given word, or null if it has none.
        String[] getSamewords(String name);
    }

2. Synonyms

    import java.util.HashMap;
    import java.util.Map;

    public class SimpleSamewordContext implements SamewordContext {
        // Maps a word to its synonyms.
        Map<String, String[]> maps = new HashMap<String, String[]>();

        public SimpleSamewordContext() {
            maps.put("中國", new String[]{"天朝", "大陸"});
            maps.put("我家", new String[]{"family", "伐木累"});
        }

        @Override
        public String[] getSamewords(String name) {
            // Returns null when the word has no synonyms.
            return maps.get(name);
        }
    }

3. TokenFilter
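To make the lookup contract concrete, here is a minimal sketch that mirrors the map above without depending on the other classes. The class name `SamewordLookupDemo` is made up for illustration. It shows two things the later filter relies on: a word without an entry yields `null`, and the mapping is one-way, so looking up a synonym such as "天朝" does not lead back to "中國".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; it only mirrors the lookup behavior of SimpleSamewordContext.
class SamewordLookupDemo {
    private static final Map<String, String[]> MAPS = new HashMap<>();

    static {
        MAPS.put("中國", new String[] {"天朝", "大陸"});
        MAPS.put("我家", new String[] {"family", "伐木累"});
    }

    // Same contract as getSamewords: null when the word has no entry.
    static String[] getSamewords(String name) {
        return MAPS.get(name);
    }

    public static void main(String[] args) {
        // A key of the map returns its synonyms.
        System.out.println(Arrays.toString(getSamewords("中國")));
        // A synonym is not itself a key, so the lookup returns null.
        System.out.println(Arrays.toString(getSamewords("天朝")));
    }
}
```

If lookups in both directions are wanted, each synonym would need its own entry in the map.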

    import java.io.IOException;
    import java.util.Stack;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    public class MySameTokenFilter extends TokenFilter {
        private CharTermAttribute cta = null;
        private PositionIncrementAttribute pia = null;
        // State of the last token that has synonyms.
        private AttributeSource.State current;
        // Synonyms that still have to be emitted.
        private Stack<String> sames = null;
        private SamewordContext samewordContext;

        protected MySameTokenFilter(TokenStream input, SamewordContext samewordContext) {
            super(input);
            cta = this.addAttribute(CharTermAttribute.class);
            pia = this.addAttribute(PositionIncrementAttribute.class);
            sames = new Stack<String>();
            this.samewordContext = samewordContext;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (sames.size() > 0) {
                // Pop the next pending synonym off the stack.
                String str = sames.pop();
                // Restore the state of the original token.
                restoreState(current);
                cta.setEmpty();
                cta.append(str);
                // Position increment 0 puts the synonym at the same position as the original token.
                pia.setPositionIncrement(0);
                return true;
            }

            if (!this.input.incrementToken()) return false;

            if (addSames(cta.toString())) {
                // The token has synonyms, so save the current state first.
                current = captureState();
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            // Clear leftover state so the filter can be reused for the next stream.
            super.reset();
            sames.clear();
            current = null;
        }

        // Pushes the synonyms of name onto the stack; returns true if there are any.
        private boolean addSames(String name) {
            String[] sws = samewordContext.getSamewords(name);
            if (sws != null) {
                for (String str : sws) {
                    sames.push(str);
                }
                return true;
            }
            return false;
        }
    }

4. Analyzer
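The stack logic in the filter can be followed without Lucene on the classpath. The sketch below is a plain-Java simulation, not the filter itself: the class `SynonymStackDemo` and its `expand` method are made up for illustration, and it assumes the tokenizer splits "我家在中國" into 我家 / 在 / 中國 and gives every original token a position increment of 1. Each synonym is emitted right after its original token with an increment of 0, and because a stack is last-in, first-out, the synonyms come out in reverse order of how they were listed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the stack-based synonym injection; no Lucene involved.
class SynonymStackDemo {

    // Returns each emitted term followed by its position increment, e.g. "中國(+1)".
    static List<String> expand(List<String> tokens, Map<String, String[]> synonyms) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        for (String token : tokens) {
            // The original token advances the position (assumed increment of 1).
            out.add(token + "(+1)");
            String[] sws = synonyms.get(token);
            if (sws != null) {
                for (String s : sws) {
                    pending.push(s);
                }
            }
            // Pending synonyms share the position of the original token.
            while (!pending.isEmpty()) {
                out.add(pending.pop() + "(+0)");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> synonyms = new HashMap<>();
        synonyms.put("中國", new String[] {"天朝", "大陸"});
        synonyms.put("我家", new String[] {"family", "伐木累"});
        // Assumed segmentation of "我家在中國".
        System.out.println(expand(Arrays.asList("我家", "在", "中國"), synonyms));
    }
}
```

Because the synonyms sit at the same position as the original token, a phrase query that matches "中國" at that position can also match "天朝" or "大陸" there.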

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.wltea.analyzer.lucene.IKTokenizer;

    import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

    // createComponents is overridden below, so IKTokenizer does the segmentation
    // rather than the tokenizer of the MMSegAnalyzer superclass.
    public class MySameworkAnalyzer extends MMSegAnalyzer {
        private SamewordContext samewordContext;

        public MySameworkAnalyzer(SamewordContext samewordContext) {
            this.samewordContext = samewordContext;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // In Lucene 5.x this argument is the field name, not the text to analyze;
            // the Analyzer passes the Reader to the tokenizer itself.
            // Assumption: an IK Analyzer build ported to Lucene 5.x, whose IKTokenizer
            // constructor takes only the useSmart flag.
            IKTokenizer tokenizer = new IKTokenizer(true);
            // Inject the synonyms first, then lowercase and drop English stop words.
            TokenStream tokenStream = new MySameTokenFilter(tokenizer, samewordContext);
            tokenStream = new LowerCaseFilter(tokenStream);
            tokenStream = new StopFilter(tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(tokenizer, tokenStream);
        }
    }

5. Test

    @Test
    public void test01() {
        String text = "我家在中國";
        Analyzer analyzer = new MySameworkAnalyzer(new SimpleSamewordContext());
        // AnalyzerUtils is the author's own helper class; its source is not shown in this post.
        AnalyzerUtils.displayAllToken(text, analyzer);
    }

Output: [screenshot not preserved]

 
