Using the IKAnalyzer tokenizer and IKAnalyzer extension dictionaries with Lucene


Article reprinted from: http://www.cnblogs.com/dennisit/archive/2013/04/07/3005847.html

Scenario One: Configuration-based dictionary augmentation

(The project structure diagram from the original article is not reproduced here.)
The IK tokenizer also supports an IKAnalyzer.cfg.xml configuration file for loading your own dictionaries. Google Pinyin dictionary download: http://ishare.iask.sina.com.cn/f/14446921.html?from=like
Create the IKAnalyzer.cfg.xml file under the src directory of the web project, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer Extended Configuration</comment>
    <!-- users can configure their own extension dictionaries here -->
    <entry key="ext_dict">/dicdata/use.dic.dic;/dicdata/googlepy.dic</entry>
    <!-- users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords">/dicdata/ext_stopword.dic</entry>
</properties>

Editing and deploying dictionary files
A dictionary file is a plain-text file in BOM-free UTF-8 encoding; the file extension does not matter. Each entry occupies its own line, terminated with DOS-style \r\n line breaks. (Note: if you are not sure what BOM-free UTF-8 is, just make sure your dictionary is saved as UTF-8 and add a blank line at the top of the file.) You can refer to the .dic files in the org.wltea.analyzer.dic package of the tokenizer's source. Dictionary files must be deployed on the Java resource path, i.e. a path the ClassLoader can load. (It is recommended to put them alongside IKAnalyzer.cfg.xml.)
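As a small illustration of that encoding requirement, the sketch below (class and method names are my own, not part of IK) writes a dictionary file in BOM-free UTF-8 and checks that no BOM was emitted:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class DicFileWriter {

    /** Writes one entry per line in BOM-free UTF-8, the format IK dictionary files require. */
    public static boolean writeDictionary(Path target, List<String> entries) {
        try {
            // Files.write with an explicit UTF-8 charset never emits a BOM,
            // so the resulting file is safe to use as an IK extension dictionary.
            Files.write(target, entries, StandardCharsets.UTF_8);
            return true;
        } catch (java.io.IOException e) {
            return false;
        }
    }

    /** Returns true if the file starts with the UTF-8 BOM bytes 0xEF 0xBB 0xBF. */
    public static boolean hasUtf8Bom(Path target) {
        try {
            byte[] raw = Files.readAllBytes(target);
            return raw.length >= 3
                && (raw[0] & 0xFF) == 0xEF && (raw[1] & 0xFF) == 0xBB && (raw[2] & 0xFF) == 0xBF;
        } catch (java.io.IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path dic = Paths.get("ext_demo.dic");
        writeDictionary(dic, Arrays.asList("银花感冒颗粒", "感冒灵"));
        System.out.println(hasUtf8Bom(dic)); // prints false: no BOM was written
    }
}
```

Many Windows editors add a BOM when saving "UTF-8", which silently breaks the first dictionary entry; writing files programmatically like this sidesteps the problem.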

Scenario Two: API-based dictionary extensions

Entry-related operations in IKAnalyzer live in two packages:
1. org.wltea.analyzer.cfg
2. org.wltea.analyzer.dic

The Configuration interface defined in org.wltea.analyzer.cfg declares:

getExtDictionarys()          returns the extension dictionary configuration paths
getExtStopWordDictionarys()  returns the extension stop-word dictionary configuration paths
getMainDictionary()          returns the main dictionary path
getQuantifierDicionary()     returns the quantifier dictionary path

The org.wltea.analyzer.cfg.DefaultConfig class is the implementation of the Configuration interface.

Related methods in the Dictionary class under org.wltea.analyzer.dic:

public void addWords(java.util.Collection<java.lang.String> words)
    Bulk-loads new entries. Parameter: words, the collection of entries to add.

public void disableWords(java.util.Collection<java.lang.String> words)
    Bulk-removes (masks) entries.
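The add/mask semantics of those two methods can be modeled with a toy class (this illustrates the behavior only; it is not IK's actual org.wltea.analyzer.dic.Dictionary implementation):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the load/mask semantics described above. */
public class ToyDictionary {
    private final Set<String> active = new HashSet<>();
    private final Set<String> disabled = new HashSet<>();

    /** Bulk-load new entries, mirroring addWords(Collection<String>). */
    public void addWords(Collection<String> words) {
        for (String w : words) {
            disabled.remove(w);  // re-adding a masked word reactivates it
            active.add(w);
        }
    }

    /** Bulk-remove (mask) entries, mirroring disableWords(Collection<String>). */
    public void disableWords(Collection<String> words) {
        for (String w : words) {
            active.remove(w);
            disabled.add(w);     // masked words are no longer recognized
        }
    }

    /** True if the word is currently an active dictionary entry. */
    public boolean isWord(String w) {
        return active.contains(w) && !disabled.contains(w);
    }
}
```

The real Dictionary is a singleton backing a trie, but the observable effect of the two bulk operations is the same: added words become matchable, disabled words stop matching.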

A demo of using the IKAnalyzer tokenizer in Lucene
The business entity:

package com.icrate.service.study.demo;

/**
 *  @version  1.0
 *  @author   sujonin <a href="mailto:[email protected]">send mail</a>
 *  @since    1.0  Created: 2013-4-7 01:52:49
 *  @function TODO
 */
public class Medicine {

    private Integer id;
    private String name;
    private String function;

    public Medicine() {
    }

    public Medicine(Integer id, String name, String function) {
        super();
        this.id = id;
        this.name = name;
        this.function = function;
    }

    // getters and setters omitted

    public String toString() {
        return this.id + "," + this.name + "," + this.function;
    }
}

Building the simulated data:

package com.icrate.service.study.demo;

import java.util.ArrayList;
import java.util.List;

/**
 *  @version  1.0
 *  @author   sujonin <a href="mailto:[email protected]">send mail</a>
 *  @since    1.0  Created: 2013-4-7 01:54:34
 *  @function TODO
 */
public class DataFactory {

    private static DataFactory dataFactory = new DataFactory();

    private DataFactory() {
    }

    public List<Medicine> getData() {
        List<Medicine> list = new ArrayList<Medicine>();
        list.add(new Medicine(1, "Silver flower cold granules", "Indications: silver flower cold granules, headache, clears heat, releases the exterior, soothes the throat."));
        list.add(new Medicine(2, "Cold cough syrup", "Indications: cold cough syrup, releases the exterior and clears heat, relieves cough and reduces phlegm."));
        list.add(new Medicine(3, "Cold spirit granules", "Indications: antipyretic and analgesic, headache, clears heat."));
        list.add(new Medicine(4, "Cold spirit capsules", "Indications: silver flower cold granules, headache, clears heat, releases the exterior, soothes the throat."));
        list.add(new Medicine(5, "Renhe cold granules", "Indications: disperses wind-heat, ventilates the lung and relieves cough, releases the exterior and clears heat, relieves cough and reduces phlegm."));
        return list;
    }

    public static DataFactory getInstance() {
        return dataFactory;
    }
}

Using Lucene to retrieve the simulated data:

package com.icrate.service.study.demo;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * LuceneIKUtil.java
 *
 *  @version  1.1
 *  @author   sujonin <a href="mailto:[email protected]">send mail</a>
 *  @since    1.0  Created: Apr 3, 11:48:11 AM
 *  @function Lucene using the IK tokenizer
 */
public class LuceneIKUtil {

    private Directory directory;
    private Analyzer analyzer;

    /**
     * Constructor taking the index file directory.
     * @param indexFilePath
     */
    public LuceneIKUtil(String indexFilePath) {
        try {
            directory = FSDirectory.open(new File(indexFilePath));
            analyzer = new IKAnalyzer();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Default constructor, using the system default path for the index.
     */
    public LuceneIKUtil() {
        this("/luence/index");
    }

    /**
     * Create the index.
     * @throws Exception
     */
    public void createIndex() throws Exception {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        indexWriter.deleteAll();
        List<Medicine> list = DataFactory.getInstance().getData();
        for (int i = 0; i < list.size(); i++) {
            Medicine medicine = list.get(i);
            Document document = addDocument(medicine.getId(), medicine.getName(), medicine.getFunction());
            indexWriter.addDocument(document);
        }
        indexWriter.close();
    }

    /**
     * Build a document from a record.
     * @param id
     * @param name
     * @param function
     * @return
     */
    public Document addDocument(Integer id, String name, String function) {
        Document doc = new Document();
        // Field.Index.NO            means the field is not indexed
        // Field.Index.ANALYZED      means the field is tokenized and indexed
        // Field.Index.NOT_ANALYZED  means the field is indexed but not tokenized
        doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("function", function, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    /**
     * Update the index by id.
     * @param id
     * @param title
     * @param content
     */
    public void update(Integer id, String title, String content) {
        try {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            Document document = addDocument(id, title, content);
            Term term = new Term("id", String.valueOf(id));
            indexWriter.updateDocument(term, document);
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Delete from the index by id.
     * @param id
     */
    public void delete(Integer id) {
        try {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            Term term = new Term("id", String.valueOf(id));
            indexWriter.deleteDocuments(term);
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Query.
     * @param fields   the fields to query
     * @param keyword  the query keyword
     */
    public List<Medicine> search(String[] fields, String keyword) {
        IndexSearcher indexSearcher = null;
        List<Medicine> result = new ArrayList<Medicine>();
        try {
            // Create the index searcher in read-only mode
            IndexReader indexReader = IndexReader.open(directory, true);
            indexSearcher = new IndexSearcher(indexReader);
            MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35, fields, analyzer);
            Query query = queryParser.parse(keyword);
            // Return the top 10 records
            TopDocs topDocs = indexSearcher.search(query, 10);
            int totalCount = topDocs.totalHits;
            System.out.println("Retrieved " + totalCount + " records in total");

            /*
             * Create a highlighter to highlight the search keyword in the results.
             * SimpleHTMLFormatter controls how the keyword is highlighted. It has two constructors:
             *   1. SimpleHTMLFormatter()                              default; highlights as <B>keyword</B>
             *   2. SimpleHTMLFormatter(String preTag, String postTag) highlights as preTag keyword postTag
             */
            Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
            /*
             * QueryScorer is the built-in scorer; its first job is to sort the fragments.
             * The terms it uses come from the user's query: it extracts terms from the
             * original words, phrases and boolean queries and weights them by the
             * corresponding boost factor. The query must first be rewritten into its
             * primitive form for QueryScorer's use: for example, wildcard, fuzzy,
             * prefix and range queries are all rewritten as terms in a BooleanQuery.
             * You can call Query.rewrite(IndexReader) to rewrite the Query object
             * before passing it to QueryScorer.
             */
            Scorer fragmentScorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, fragmentScorer);
            /*
             * Highlighter uses a Fragmenter to split the original text into fragments.
             * The built-in SimpleFragmenter splits the text into fragments of equal
             * size; the default fragment size is 100 characters, and the size is configurable.
             */
            Fragmenter fragmenter = new SimpleFragmenter(100);
            highlighter.setTextFragmenter(fragmenter);

            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scDoc : scoreDocs) {
                Document document = indexSearcher.doc(scDoc.doc);
                Integer id = Integer.parseInt(document.get("id"));
                String name = document.get("name");
                String function = document.get("function");
                // float score = scDoc.score;  // similarity score
                String lighterName = highlighter.getBestFragment(analyzer, "name", name);
                if (null == lighterName) {
                    lighterName = name;
                }
                String lighterFunction = highlighter.getBestFragment(analyzer, "function", function);
                if (null == lighterFunction) {
                    lighterFunction = function;
                }
                Medicine medicine = new Medicine();
                medicine.setId(id);
                medicine.setName(lighterName);
                medicine.setFunction(lighterFunction);
                result.add(medicine);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                indexSearcher.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LuceneIKUtil luceneProcess = new LuceneIKUtil("F:/index");
        try {
            luceneProcess.createIndex();
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Update test
        luceneProcess.update(2, "Test contents", "Modify test...");
        // Query test
        String[] fields = {"name", "function"};
        List<Medicine> list = luceneProcess.search(fields, "cold");
        for (int i = 0; i < list.size(); i++) {
            Medicine medicine = list.get(i);
            System.out.println("(" + medicine.getId() + ")" + medicine.getName() + "\t" + medicine.getFunction());
        }
        // Delete test
        // luceneProcess.delete(1);
    }
}

Program Run Results

Load extension dictionary: /dicdata/use.dic.dic
Load extension dictionary: /dicdata/googlepy.dic
Load extension stop-word dictionary: /dicdata/ext_stopword.dic
Retrieved 4 records in total
(1) Silver flower <font color='red'>cold</font> granules    Indications: silver flower <font color='red'>cold</font> granules, headache, clears heat, releases the exterior, soothes the throat.
(4) <font color='red'>Cold</font> spirit capsules    Indications: silver flower <font color='red'>cold</font> granules, headache, clears heat, releases the exterior, soothes the throat.
(3) <font color='red'>Cold</font> spirit granules    Indications: antipyretic and analgesic, headache, clears heat.
(5) Renhe <font color='red'>cold</font> granules    Indications: disperses wind-heat, ventilates the lung and relieves cough, releases the exterior and clears heat, relieves cough and reduces phlegm.

How to tell if an index exists

    /**
     * Determine whether an index already exists at the given path.
     * @param indexPath
     * @return
     */
    private boolean isExistIndexFile(String indexPath) throws Exception {
        File file = new File(indexPath);
        if (!file.exists()) {
            file.mkdirs();
        }
        String indexSufix = "/segments.gen";
        // The presence of the segments.gen file tells us whether the index has been created before
        File indexFile = new File(indexPath + indexSufix);
        return indexFile.exists();
    }

Appendix: how the IK tokenizer works

First, an overview of IK's entire tokenization process:

1. Lucene's tokenization base class is Analyzer, so IK provides IKAnalyzer as an Analyzer implementation. First, instantiate an IKAnalyzer; its constructor takes a parameter, isMaxWordLength, which selects between IK's two segmentation algorithms: maximum-word-length segmentation and finest-grained segmentation. In the actual implementation, maximum-word-length segmentation is just a post-processing step applied to the finest-grained result: the fine-grained output is filtered, keeping the longest segmentation at each position.
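The relationship between the two modes can be sketched as a filter over the fine-grained output (a simplification of my own, not IK's actual code): each lexeme is encoded as a {start, length} pair, and at every position the longest candidate wins.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: max-word-length output as a longest-first filtering
 *  of the fine-grained output. Lexemes are {start, length} pairs. */
public class LongestMatchFilter {

    public static List<int[]> keepLongest(List<int[]> fineGrained) {
        List<int[]> sorted = new ArrayList<>(fineGrained);
        // Prefer earlier start; among equal starts, prefer the longer lexeme
        sorted.sort(Comparator.<int[]>comparingInt(a -> a[0]).thenComparingInt(a -> -a[1]));
        List<int[]> result = new ArrayList<>();
        int nextFree = 0; // first text position not yet covered by a chosen lexeme
        for (int[] lex : sorted) {
            if (lex[0] >= nextFree) { // drop lexemes overlapped by an already chosen longer one
                result.add(lex);
                nextFree = lex[0] + lex[1];
            }
        }
        return result;
    }
}
```

For example, if fine-grained segmentation of a four-character string yields candidates at {0,2}, {0,1}, {1,1}, {2,2}, {2,1}, {3,1}, the filter keeps only {0,2} and {2,2}, i.e. the two longest non-overlapping words.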

2. The IKAnalyzer class overrides Analyzer's tokenStream method, which takes two parameters: the field name and the input stream Reader. The field name is a Lucene attribute column, a name for the indexed content, similar to a database column name; the text content is tokenized and indexed under it. Since IK is only concerned with tokenization, it does no processing of the field name, so it is not discussed further here.

3. Lucene calls the tokenStream method when it processes the text input stream Reader. IKAnalyzer's tokenStream method simply instantiates an IKTokenizer, which extends Lucene's Tokenizer class and overrides the incrementToken method. That method's job is to process the text input stream and generate tokens, Lucene's smallest unit of text (a term), which IK calls a Lexeme.
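The incrementToken contract described in step 3 — each call either advances to the next token and returns true, or returns false at end of input — can be sketched in plain Java (whitespace splitting stands in for IK's real segmentation here; this is not Lucene's Tokenizer class):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Minimal sketch of the incrementToken-style streaming contract. */
public class MiniTokenizer {
    private final Reader input;
    private String currentToken;

    public MiniTokenizer(Reader input) { this.input = input; }

    /** Advances to the next token; returns false when the input is exhausted.
     *  (An I/O error is treated as end of input to keep the sketch simple.) */
    public boolean incrementToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) break; // current token is complete
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            return false;
        }
        if (sb.length() == 0) return false;     // end of stream
        currentToken = sb.toString();
        return true;
    }

    public String token() { return currentToken; }

    public static void main(String[] args) {
        MiniTokenizer t = new MiniTokenizer(new StringReader("hello ik world"));
        while (t.incrementToken()) {
            System.out.println(t.token());
        }
    }
}
```

The real IKTokenizer fills Lucene term attributes instead of returning a String, but the pull-based loop is the same shape.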

4. In its constructor, IKTokenizer instantiates IKSegmentation, the core segmenter (the main tokenizer). Its constructor takes two parameters: the Reader and isMaxWordLength.

5. IKSegmentation's constructor does three main things: create the Context object, load the dictionaries, and create the sub-tokenizers.

6. The Context mainly stores the segmentation result set and records the cursor position of the text being processed.

7. The dictionaries are created as a singleton, chiefly the quantifier dictionary, the main dictionary, and the stop-word dictionary. Dictionaries are stored in DictSegment, the dictionary-fragment class at the core of the dictionary. DictSegment has a static storage structure, charMap, a public character table holding all Chinese characters; both key and value are a single Chinese character, and the current IK charMap holds roughly 7,100 key-value pairs. In addition, DictSegment has two important data structures used to store the dictionary tree: one is the DictSegment array childrenArray; the other is childrenMap, a HashMap whose key is a single character (the first character of each entry) and whose value is a DictSegment.
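The dictionary tree in point 7 can be sketched with a simplified node type (names and structure are my own simplification; the real DictSegment also keeps the childrenArray optimization and the shared charMap table):

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified dictionary-tree node in the spirit of DictSegment:
 *  each node maps one character to a child node (like childrenMap),
 *  and a flag marks the end of a complete entry. */
public class ToyDictSegment {
    private final Map<Character, ToyDictSegment> childrenMap = new HashMap<>();
    private boolean isWordEnd;

    /** Inserts a word into the tree, one character per level. */
    public void fillSegment(String word) {
        ToyDictSegment node = this;
        for (char ch : word.toCharArray()) {
            node = node.childrenMap.computeIfAbsent(ch, k -> new ToyDictSegment());
        }
        node.isWordEnd = true;
    }

    /** True if the whole string is a complete dictionary entry. */
    public boolean match(String word) {
        ToyDictSegment node = this;
        for (char ch : word.toCharArray()) {
            node = node.childrenMap.get(ch);
            if (node == null) return false;
        }
        return node.isWordEnd;
    }
}
```

Because entries share prefixes in the tree, lookups during segmentation can advance one character at a time and know immediately whether the current prefix can still grow into a longer word.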

8. The sub-tokenizers do the real segmentation. IK has three of them: the quantifier tokenizer, the CJK tokenizer (which handles Chinese), and the stop-word tokenizer. The main segmenter, IKSegmentation, iterates over these three sub-tokenizers to process the text input stream.

9. IKTokenizer's incrementToken method calls IKSegmentation's next method, which returns the next segmentation result. On its first call, next loads the input stream and reads it into a buffer, then iterates over the sub-tokenizers to segment the buffered text and adds the results to the Context's lexemeSet.

