IKAnalyzer Dictionary Extensions
I have recently been working with Lucene and needed Chinese word segmentation. The segmenter I use is IKAnalyzer, whose built-in dictionary contains about 270,000 terms. That is enough for general-purpose segmentation, but for a specialized domain the dictionary has to be extended with domain vocabulary to get better results. This post covers:
- IKAnalyzer Dictionary Extensions
- IKAnalyzer Segmentation API
- Intelligent segmentation
- Fine-grained segmentation
- Dictionary Extension
- Extending the dictionary through the configuration file
- Extending the dictionary through the API
IKAnalyzer Segmentation API
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

// true selects intelligent segmentation
IKAnalyzer analyzer = new IKAnalyzer(true);
TokenStream ts = null;
try {
    ts = analyzer.tokenStream("title", "Interstellar cock wire, brand, production, commercial sales, industrial inventories, cumulative, year-on-year, YoY, total, China, Lee Group, Furong king, Yellow Crane Tower, su Smoke, Jiao zi, yuxi, your smoke, clouds, white Sands, Huangshan, Nanjing, Golden Leaf, real dragon, double happiness • Double Happiness, seven wolves, Golden Saint, Red River");
    // the attribute object is updated in place on every incrementToken() call
    CharTermAttribute cta = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    int count = 0;
    while (ts.incrementToken()) {
        count++;
        System.out.println("Term:" + cta.toString());
    }
    System.out.println("Total:" + count);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (ts != null) {
        try {
            ts.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    if (analyzer != null) {
        try {
            analyzer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The first statement, IKAnalyzer analyzer = new IKAnalyzer(true), constructs an IKAnalyzer object. Passing true to the constructor selects intelligent segmentation; the default is the finest-grained segmentation. The results of the two modes are shown below:
1. Intelligent segmentation
Term: Interstellar
Term: dick
Term: Silk
Term: Brand
Term: Production
Term: Commercial
Term: Sales
Term: Business
Term: Inventory
Term: Cumulative
Term: Previous year
Term: Same period
Term: YoY
Term: Total
Term: China
Term: Lee
Term: Group
Term: Lotus King
Term: Yellow Crane Tower
Term: Sue
Term: Tobacco
Term: Jiao
Term: Child
Term: Yuxi
Term: Expensive
Term: Tobacco
Term: Mist
Term: White Sands
Term: Huangshan
Term: Nanjing
Term: Yellow
Term: Golden leaf
Term: true
Term: Dragon
Term: Double Happiness
Term: red double Happiness
Term: seven horses
Term: Wolf
Term: Gold
Term: St.
Term: Red River
Total:41
2. Fine-grained segmentation
Term: Interstellar
Term: dick
Term: Silk
Term: Brand
Term: Production
Term: Commercial
Term: Sales
Term: Business
Term: Inventory
Term: Cumulative
Term: Previous year
Term: Same period
Term: YoY
Term: Total
Term: China
Term: Lee
Term: Group
Term: Lotus King
Term: Hibiscus
Term: Wang
Term: Yellow Crane Tower
Term: Yellow crane
Term: Building
Term: Sue
Term: Tobacco
Term: Jiao
Term: Child
Term: Yuxi
Term: Expensive
Term: Tobacco
Term: Mist
Term: White Sands
Term: Huangshan
Term: Nanjing
Term: Gold
Term: Golden leaf
Term: true
Term: Dragon
Term: Double Happiness
Term: red double Happiness
Term: Double Happiness
Term: Seven
Term: Horse
Term: Wolf
Term: Gold
Term: St.
Term: Red River
Total:47
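The difference between the two listings can be illustrated with a toy dictionary matcher. This is only a sketch under stated assumptions: the class ToySegmenter, its method names and the three-word dictionary are hypothetical, and the greedy longest-match rule stands in for IKAnalyzer's real disambiguation, which is more involved. Fine-grained mode is modelled as "emit every dictionary word, overlaps included"; intelligent mode as "emit one non-overlapping reading".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ToySegmenter {

    /** Fine-grained (toy): every dictionary word at every start position, longest first. */
    static List<String> fineGrained(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = text.length(); end > start; end--) {
                String candidate = text.substring(start, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    /** Intelligent (toy): greedy longest match without overlaps; unknown characters are emitted singly. */
    static List<String> smart(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int bestEnd = start + 1; // fall back to a single character
            for (int end = start + 1; end <= text.length(); end++) {
                if (dict.contains(text.substring(start, end))) {
                    bestEnd = end;
                }
            }
            out.add(text.substring(start, bestEnd));
            start = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: "Yellow Crane Tower", "Yellow crane", "Building"
        Set<String> dict = Set.of("黄鹤楼", "黄鹤", "楼");
        System.out.println(fineGrained("黄鹤楼", dict));
        System.out.println(smart("黄鹤楼", dict));
        // Without the full brand name in the dictionary, the toy smart mode splits it
        System.out.println(smart("黄鹤楼", Set.of("黄鹤", "楼")));
    }
}
```

The last call shows, in miniature, why a missing dictionary entry leads to a brand name being split into pieces.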
IKAnalyzer.tokenStream(fieldName, text) returns a TokenStream object. A token is a single term produced by segmentation, and the TokenStream holds the terms segmented from the text. fieldName is the name of the document field; a document can contain several fields, such as title, abstract and content, and any name can be given here.
TokenStream.incrementToken() works like an iterator: each call advances to the next token, and it returns false when no tokens remain.
CharTermAttribute holds the text of the current token; each call to incrementToken() updates this same object with the next term.
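The "one attribute object, updated on every step" pattern can be shown without Lucene on the classpath. The sketch below is a toy model only: ToyTokenStream and its methods are hypothetical names that mimic the shape of the pattern, not Lucene's actual classes.

```java
import java.util.Iterator;
import java.util.List;

public class ToyTokenStream {
    private final Iterator<String> source;
    // Plays the role of the term attribute: one object, reused for every token
    private final StringBuilder termAttribute = new StringBuilder();

    ToyTokenStream(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the shared attribute object; callers fetch it once, before iterating. */
    StringBuilder termAttribute() {
        return termAttribute;
    }

    /** Advances to the next token, overwriting the attribute; false when exhausted. */
    boolean incrementToken() {
        if (!source.hasNext()) {
            return false;
        }
        termAttribute.setLength(0);
        termAttribute.append(source.next());
        return true;
    }

    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream(List.of("China", "Yuxi", "Nanjing"));
        StringBuilder term = ts.termAttribute();
        int count = 0;
        while (ts.incrementToken()) {
            count++;
            System.out.println("Term:" + term);
        }
        System.out.println("Total:" + count);
    }
}
```

Because the attribute is overwritten on each step, a caller that needs to keep a term must copy it (for example with toString()) before advancing.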
Dictionary Extension
As the output above shows, with only IKAnalyzer's built-in dictionary neither mode gives ideal results, so the dictionary has to be extended. There are two ways to do this: add dictionary files through the configuration file, or extend the dictionary dynamically through IKAnalyzer's own API.
Extending the dictionary through the configuration file
The following is taken from IKAnalyzer's own documentation, which explains it clearly:
The IK segmenter also supports extending your own dictionary and stop-word dictionary (filter dictionary) by configuring the IKAnalyzer.cfg.xml file.
1. Deploying IKAnalyzer.cfg.xml
IKAnalyzer.cfg.xml is deployed the same way as the configuration files of Hibernate, log4j and similar libraries: in the root directory of the code (for web projects, usually WEB-INF/classes).
2. Editing and deployment of dictionary files
A dictionary file is a Chinese text file encoded in UTF-8 without a BOM; any file extension may be used. Each word occupies its own line, and lines end with \r\n (DOS style). (Note: if you do not know what UTF-8 without BOM means, just make sure the dictionary is saved as UTF-8 and add a blank line at the top of the file.) For reference, see the .dic files in the org.wltea.analyzer.dic package of the segmenter's source.
Dictionary files should be deployed on the Java resource path, that is, somewhere the ClassLoader can load them from. (Keeping them together with IKAnalyzer.cfg.xml is recommended.)
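The file format described above (UTF-8, no BOM, one word per line, \r\n line endings) can be produced with the Java standard library alone. The following is a minimal sketch; the class name DicFileSketch, the temporary file and the two sample entries are illustrative assumptions, not part of IKAnalyzer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class DicFileSketch {

    /** Encodes one word per line with \r\n endings; Java's UTF-8 encoder writes no BOM. */
    static byte[] encode(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            sb.append(word.trim()).append("\r\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Checks for the three-byte UTF-8 BOM (EF BB BF) at the start of the data. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical entries; a real extension dictionary would list your domain terms
        Path file = Files.createTempFile("mydict", ".dic");
        Files.write(file, encode(List.of("黄鹤楼", "玉溪")));
        System.out.println(file + " has BOM: " + hasUtf8Bom(Files.readAllBytes(file)));
    }
}
```

The hasUtf8Bom check is also handy for dictionary files produced by editors that add a BOM silently.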
3. Configuration of the IKAnalyzer.cfg.xml file
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- users can configure their own extension dictionaries here -->
    <entry key="ext_dict">/mydict.dic;/com/mycompany/dic/mydict2.dic;</entry>
    <!-- users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>
Several dictionary files can be configured at once in the configuration file; file names are separated by ";". File paths are relative to the root of the Java package tree.
Note: place both the configuration file and the dictionary files in the root directory of the Java packages, and remove the leading "/" from the dictionary paths in the configuration above; otherwise the extension does not take effect.
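Since the configuration file uses the standard Java properties XML format (note the properties DOCTYPE), its entries can be inspected with java.util.Properties. The sketch below parses such a file, splits a ";"-separated entry into paths, and strips the leading "/" as the note above advises. The class IkConfigSketch and the method dictPaths are hypothetical helpers, and whether IKAnalyzer itself reads the file this way is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkConfigSketch {

    // Sample configuration in the same shape as IKAnalyzer.cfg.xml
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <comment>IK Analyzer extended configuration</comment>\n"
            + "  <entry key=\"ext_dict\">/mydict.dic; /com/mycompany/dic/mydict2.dic;</entry>\n"
            + "  <entry key=\"ext_stopwords\">/ext_stopword.dic</entry>\n"
            + "</properties>\n";

    /** Returns the paths listed under the given key, trimmed and without a leading "/". */
    static List<String> dictPaths(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> paths = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            String path = part.trim();
            if (path.isEmpty()) {
                continue; // skip empty segments
            }
            if (path.startsWith("/")) {
                path = path.substring(1); // classpath-relative form
            }
            paths.add(path);
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(dictPaths(SAMPLE_XML, "ext_dict"));
        System.out.println(dictPaths(SAMPLE_XML, "ext_stopwords"));
    }
}
```

Running a check like this against your own IKAnalyzer.cfg.xml is a quick way to confirm that the keys and paths are what you expect before debugging the segmenter itself.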
After adding the cigarette brands to the extension dictionary file, segmentation gives the following result:
Term: Interstellar
Term: Dick Wire
Term: Brand
Term: Production
Term: Commercial
Term: Sales
Term: Business
Term: Inventory
Term: Cumulative
Term: Previous year
Term: Same period
Term: YoY
Term: Total
Term: China
Term: Lee Group
Term: Lotus King
Term: Yellow Crane Tower
Term: Sue Smoke
Term: Jiao Zi
Term: Yuxi
Term: Your cigarettes
Term: Mist
Term: White Sands
Term: Huangshan
Term: Nanjing
Term: Golden leaf
Term: True Dragon
Term: Double Happiness
Term: red double Happiness
Term: seven Wolves
Term: Golden saint
Term: Red River
Total:32
You can see that the cigarette brands are now recognized correctly.
Extending the dictionary through the API
The IK segmenter also supports extending your dictionary and stop-word dictionary programmatically through its API. If your custom vocabulary is stored in a database, this approach should suit you. The API is as follows:
Class org.wltea.analyzer.dic.Dictionary
Description: the dictionary object of the IK segmenter. It is responsible for loading the Chinese vocabulary, managing it in memory, and matching lookups.
- public static Dictionary initial(Configuration cfg)
  Description: initializes the dictionary instance. The dictionary uses the singleton pattern; once initialized, the instance is fixed. Note that this method can only be called once.
  Parameter 1: Configuration cfg, the dictionary path configuration
  Return value: Dictionary, the IK dictionary singleton
- public static Dictionary getSingleton()
  Description: gets the initialized dictionary singleton.
  Return value: Dictionary, the IK dictionary singleton
- public void addWords(Collection<String> words)
  Description: loads a list of user-defined words into IK's main dictionary, adding to the words the segmenter can recognize.
  Parameter 1: Collection<String> words, the list of words to add
  Return value: none
- public void disableWords(Collection<String> words)
  Description: masks (disables) terms in the dictionary.
  Parameter 1: Collection<String> words, the list of words to remove
  Return value: none
The above is also taken from the official documentation. One point the documentation does not state explicitly: this API extends the dictionary only at run time. It changes the in-memory dictionary object and does not affect the dictionary files on disk.
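The semantics described above (initialize once, share one instance, change only what is in memory) can be modelled in a few lines. ToyDictionary below is a hypothetical stand-in written for illustration; it borrows the method names from the documentation but is not IKAnalyzer's implementation, and it takes a plain collection instead of a Configuration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyDictionary {
    private static ToyDictionary singleton;
    // The whole dictionary lives in memory; nothing here touches a file
    private final Set<String> words = new HashSet<>();

    private ToyDictionary(Collection<String> mainWords) {
        words.addAll(mainWords);
    }

    /** Creates the singleton on the first call; later calls return the same instance. */
    static synchronized ToyDictionary initial(Collection<String> mainWords) {
        if (singleton == null) {
            singleton = new ToyDictionary(mainWords);
        }
        return singleton;
    }

    static synchronized ToyDictionary getSingleton() {
        if (singleton == null) {
            throw new IllegalStateException("call initial() first");
        }
        return singleton;
    }

    void addWords(Collection<String> extra) {
        words.addAll(extra);
    }

    void disableWords(Collection<String> removed) {
        words.removeAll(removed);
    }

    boolean contains(String word) {
        return words.contains(word);
    }

    public static void main(String[] args) {
        ToyDictionary.initial(List.of("中国"));
        ToyDictionary.getSingleton().addWords(List.of("星际屌丝"));
        // Any other code that asks for the singleton sees the added word
        System.out.println(ToyDictionary.getSingleton().contains("星际屌丝"));
    }
}
```

In this model, words added at run time disappear when the JVM exits, which mirrors the behaviour noted above; to keep them, they would have to be reloaded from their source (for example a database) on every start.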
The following class implements the dictionary extension, adding "星际屌丝" (the "Interstellar cock wire" at the start of the sample text) to the dictionary:
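import java.util.HashSet;
import java.util.Set;
import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.cfg.DefaultConfig;
import org.wltea.analyzer.dic.Dictionary;

public class DicUtil {
    public static void extendDic() {
        Configuration cfg = DefaultConfig.getInstance();
        System.out.println(cfg.getMainDictionary());
        // initialize the dictionary singleton, then fetch it
        Dictionary.initial(cfg);
        Dictionary dic = Dictionary.getSingleton();
        Set<String> set = new HashSet<>();
        set.add("星际屌丝");
        // adds the words to the in-memory main dictionary only
        dic.addWords(set);
    }
}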
The following calls extendDic() first and then runs the segmentation again:
DicUtil.extendDic(); // extend the dictionary
IKAnalyzer analyzer = new IKAnalyzer(true);
TokenStream ts = null;
try {
    ts = analyzer.tokenStream("title", "Interstellar cock wire, brand, production, commercial sales, industrial inventories, cumulative, year-on-year, YoY, total, China, Lee Group, Furong king, Yellow Crane Tower, su Smoke, Jiao zi, yuxi, your smoke, clouds, white Sands, Huangshan, Nanjing, Golden Leaf, real dragon, double happiness • Double Happiness, seven wolves, Golden Saint, Red River");
    CharTermAttribute cta = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    int count = 0;
    while (ts.incrementToken()) {
        count++;
        System.out.println("Term:" + cta.toString());
    }
    System.out.println("Total:" + count);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (ts != null) {
        try {
            ts.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    if (analyzer != null) {
        try {
            analyzer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The following is the output result:
Term: Interstellar cock wire
Term: Brand
Term: Production
Term: Commercial
Term: Sales
Term: Business
Term: Inventory
Term: Cumulative
Term: Previous year
Term: Same period
Term: YoY
Term: Total
Term: China
Term: Lee Group
Term: Lotus King
Term: Yellow Crane Tower
Term: Sue Smoke
Term: Jiao Zi
Term: Yuxi
Term: Your cigarettes
Term: Mist
Term: White Sands
Term: Huangshan
Term: Nanjing
Term: Golden leaf
Term: True Dragon
Term: Double Happiness
Term: red double Happiness
Term: seven Wolves
Term: Golden saint
Term: Red River
Total:31
As you can see, "Interstellar cock wire" is now segmented as a single term. Because Dictionary is a singleton held in a static variable, the extension takes effect globally.
However, if you open the main dictionary file and search for "Interstellar cock wire", you will not find it, because this API only extends the in-memory dictionary and does not modify the files on disk.