At its core this is Lucene's analyzer chain; Solr just makes it convenient: you can wire a tokenizer and filters together simply by editing an XML file. Sometimes we need to use such a chain directly in our own code, and this article records how to do it.
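For comparison, the same kind of chain is usually declared in Solr's schema.xml along these lines (a sketch only; the exact factory class names, package for the mmseg4j tokenizer, and attribute spellings depend on your Solr and mmseg4j versions):

```xml
<fieldType name="text_cn" class="solr.TextField">
  <analyzer>
    <!-- mmseg4j tokenizer; package name varies by mmseg4j release -->
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
               mode="complex" dicPath="dict"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="dict/synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="dict/stop_words_cn.txt"/>
  </analyzer>
</fieldType>
```

The code below builds this same tokenizer–filter pipeline programmatically, without Solr running at all.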
Here is the complete code first (Groovy):
```groovy
// Package names below are for Lucene 5.x/6.x; adjust to your versions.
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.TokenFilter
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.core.StopFilterFactory
import org.apache.lucene.analysis.synonym.SynonymFilterFactory
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.util.ClasspathResourceLoader
// MMSegTokenizerFactory comes from the mmseg4j jars; its package varies by release.

class MyAnalyzer {
    def analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            def loader = new ClasspathResourceLoader()

            // create the tokenizer
            def factory = new MMSegTokenizerFactory(["mode": "complex", "dicPath": "dict"])
            factory.inform(loader)
            Tokenizer tokenizer = factory.create()

            // create the token filters, each wrapping the previous stream
            factory = new SynonymFilterFactory(["synonyms": "dict/synonyms.txt",
                                                "expand": "true", "ignoreCase": "true"])
            factory.inform(loader)
            TokenFilter filter = factory.create(tokenizer)

            factory = new StopFilterFactory(["ignoreCase": "true",
                                             "words": "dict/stop_words_cn.txt"])
            factory.inform(loader)
            filter = factory.create(filter)

            return new TokenStreamComponents(tokenizer, filter)
        }
    }

    def tokenize(String text) {
        def tokens = []
        def ts = analyzer.tokenStream("text", text)
        def termAttr = ts.addAttribute(CharTermAttribute.class)
        ts.reset()
        while (ts.incrementToken()) {
            tokens.add(termAttr.toString())
        }
        ts.end()
        ts.close()
        return tokens
    }

    public static void main(String[] args) {
        MyAnalyzer analyzer = new MyAnalyzer()
        println(analyzer.tokenize("I am a Painter"))
    }
}
```
There are several key points:
The custom analyzer extends Analyzer and implements the createComponents method
A ResourceLoader is needed to load the data files (dictionary, synonyms, stop words)
The chain above links three components, but more can be added: MMSegTokenizerFactory, SynonymFilterFactory, StopFilterFactory, and so on
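For instance, to append one more filter to the chain, the same factory pattern applies. A minimal sketch, assuming Lucene's standard LowerCaseFilterFactory is on the classpath (it needs no resource files, so no inform call is required):

```groovy
// inside createComponents, after the StopFilterFactory step:
factory = new LowerCaseFilterFactory([:])   // empty args map
filter = factory.create(filter)             // wrap the previous filter, extending the chain
```

Each create(filter) call wraps the previous stream, so filters run in the order they are added.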
Extracting the tokens is a little unusual: you read them through an attribute class, CharTermAttribute, obtained from the TokenStream
The TokenStream must be consumed in exactly this order (Lucene's contract): reset, incrementToken, end, close
This article comes from the author's blog; please keep this source link: http://waynecui.blog.51cto.com/8231128/1761156
Using Solr's analyzer chain without starting Solr (with mmseg4j word segmentation)