Introduction to Lucene Learning

Reference blogs: http://blog.csdn.net/ayi_5788/article/category/6348409 and http://blog.csdn.net/hu948162999/article/details/41209699

1. What is Chinese word segmentation?

As anyone who has studied English knows, English takes the word as its unit: words are separated by spaces or punctuation. Chinese takes the character as its unit; characters join into words, and words join into sentences. For English, a program can simply use whitespace to decide where one word ends and the next begins: in "I love China", "love" and "China" are trivial to tell apart. The Chinese sentence 我爱中国 ("I love China") is different, because the computer cannot tell whether 中国 ("China") is one word or whether 爱中 is. Splitting a Chinese sentence into meaningful words is called Chinese word segmentation, also known as word cutting. For 我爱中国, the segmentation result is: 我 / 爱 / 中国 (I / love / China).

At present, Chinese word segmentation is still a difficult problem: correct segmentation requires context, and new words (person names, place names, and so on) are hard to handle perfectly. Chinese is not alone in this; Korean and Japanese, grouped together with Chinese as CJK (Chinese Japanese Korean), pose the same difficulty.

2. Chinese word segmentation in Lucene

Lucene's handling of Chinese is based either on automatic single-character segmentation or on bigram (two-character) segmentation. Other schemes exist as well: maximum-match segmentation (forward, backward, or both combined), minimum-match segmentation, full segmentation, and so on.

Lucene ships with several analyzers: WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, StandardAnalyzer, ChineseAnalyzer, CJKAnalyzer, and others. The first three are only suitable for English. StandardAnalyzer implements Chinese segmentation in the simplest possible way, the unigram method: every character becomes its own token, so 北京天安门 ("Beijing Tian'anmen") becomes 北 / 京 / 天 / 安 / 门. This is comprehensive, but it has drawbacks: the index files get large and retrieval is slow. ChineseAnalyzer also splits character by character and differs little from StandardAnalyzer on Chinese. CJKAnalyzer splits into overlapping two-character tokens; this is rather indiscriminate and generates junk tokens that inflate the index. These built-in analyzers are too simple for real-world needs, so in practice we implement our own segmentation algorithm.

With bigram segmentation, whether the query is 北京 ("Beijing") or 天安门 ("Tian'anmen"), the query phrase is cut by the same rule into two-character tokens, the tokens are combined with AND, and the query maps correctly onto the index. The same approach applies to the other Asian languages, Korean and Japanese. The biggest advantage of automatic segmentation is that there is no dictionary to maintain and the implementation is simple; the disadvantage is lower indexing efficiency. For small and medium applications, though, bigram segmentation is sufficient. An index built on bigrams is generally about the same size as the source files, whereas for English the index files are typically only 30%-40% of the original size.

The language-analysis algorithms of today's large search engines are generally based on a combination of the two mechanisms above. For Chinese analysis algorithms, searching Google for the keyword "wordsegment search" turns up plenty of related material.
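To make the unigram/bigram difference concrete, here is a minimal sketch, assuming Lucene 4.5 with the lucene-analyzers-common module on the classpath (the same version as the example later in this article); the AnalyzerDemo class and printTokens helper are invented for illustration:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    // Print every token the given analyzer produces for the given text.
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "北京天安门"; // "Beijing Tian'anmen"
        // Unigram: one token per character, i.e. [北] [京] [天] [安] [门]
        printTokens(new StandardAnalyzer(Version.LUCENE_45), text);
        // Bigram: overlapping two-character tokens, i.e. [北京] [京天] [天安] [安门]
        printTokens(new CJKAnalyzer(Version.LUCENE_45), text);
    }
}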
============================= Split line ===============================

Because different Lucene versions differ considerably, this tutorial series gives an example for each of versions 3.5, 4.5, and 5.0, which makes the material easier to learn and also convenient for my own review. Note: since Lucene 5.0 was developed against JDK 1.7, please configure JDK 1.7 or above if you want to follow along. I have tested that the Lucene 5.0 code also works on Lucene 6.1.0; Lucene 6.1.0 likewise requires at least JDK 1.7.

Creating an index can be divided into the following main steps. I have experimented with this myself: different versions differ slightly, but if you follow these steps you will not run into much trouble.

1. Create a Directory
2. Create an IndexWriter
3. Create a Document object
4. Add Fields to the Document
5. Add the Document to the index via the IndexWriter

Searching can be divided into the following steps:

1. Create a Directory
2. Create an IndexReader
3. Create an IndexSearcher from the IndexReader
4. Create the Query for the search
5. Search via the searcher and get back TopDocs
6. Get the ScoreDoc objects from the TopDocs
7. Get the concrete Document object from the searcher and each ScoreDoc
8. Read the required values from the Document object

When we add a Field to a Document there are several settings to make. What do they mean?

Name: the field name, easy to understand.
Value: the field value, also easy to understand.

Store and Index need more explanation. The optional values for these two options (as defined in Lucene 3.5) are:

Field.Store (storage options)
  Field.Store.YES: the content of this field is stored in full in the index files, so the original text can easily be restored.
  Field.Store.NO: the content of this field is not stored in the index files; it can still be indexed, but the content cannot be fully restored.

Field.Index (index options)
  Index.ANALYZED: tokenize and index; suitable for title, content, and the like.
  Index.NOT_ANALYZED: index but do not tokenize; suitable for exact search on values such as a social security number, name, or ID.
  Index.ANALYZED_NO_NORMS: tokenize but do not store norms information (norms hold index-time scoring data such as field boosts and length normalization).
  Index.NOT_ANALYZED_NO_NORMS: neither tokenize nor store norms information.
  Index.NO: do not index at all.

In Lucene 3.5 these options are passed straight to the Field constructor, as the sketch below shows.
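A minimal sketch of how these options were passed in Lucene 3.5 (assuming a 3.5 classpath; the field names and values here are invented for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionsDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // Stored and tokenized: suitable for titles and body text.
        doc.add(new Field("title", "Hello Lucene",
                Field.Store.YES, Field.Index.ANALYZED));
        // Stored but not tokenized, without norms: exact-match values such as an id.
        doc.add(new Field("id", "1",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        // Tokenized and indexed but not stored: searchable, yet doc.get("content")
        // would return null after a search.
        doc.add(new Field("content", "Today's my first day to study Lucene",
                Field.Store.NO, Field.Index.ANALYZED));
    }
}

Lucene 4.5 example: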
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexUtil {

    private static final String[] ids = { "1", "2", "3" };
    private static final String[] authors = { "Darren", "Tony", "Grylls" };
    private static final String[] titles = { "Hello world", "Hello Lucene", "Hello Java" };
    private static final String[] contents = { "Hello world, I am on my",
            "Today's my first day to study Lucene", "I like Java" };

    /** Build the index. */
    public static void index() {
        IndexWriter indexWriter = null;
        try {
            // 1. Create a Directory
            Directory directory = FSDirectory.open(new File("F:/test/lucene/index"));
            // 2. Create an IndexWriter
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, analyzer);
            indexWriter = new IndexWriter(directory, config);
            for (int i = 0; i < ids.length; i++) {
                // 3. Create a Document object
                Document document = new Document();
                // 4. Add fields to the document.
                // Note: this differs from 3.5; the old Field constructor that took
                // Store and Index options is deprecated. Version 4.5 uses FieldType
                // instead, and classes such as TextField and StringField predefine
                // common types. The predefined types are frozen, so copy one into a
                // new FieldType before changing any options.

                // id: stored, indexed as a single token (not analyzed), no norms
                FieldType idType = new FieldType(TextField.TYPE_STORED);
                idType.setTokenized(false);
                idType.setOmitNorms(true);
                document.add(new Field("id", ids[i], idType));

                // author: stored and indexed, but not analyzed
                FieldType authorType = new FieldType(TextField.TYPE_STORED);
                authorType.setTokenized(false);
                document.add(new Field("author", authors[i], authorType));

                // title: stored; StringField.TYPE_STORED indexes the whole value
                // as a single token (not analyzed)
                document.add(new Field("title", titles[i], StringField.TYPE_STORED));

                // content: analyzed and indexed, but not stored
                document.add(new Field("content", contents[i], TextField.TYPE_NOT_STORED));

                // 5. Add the document to the index via the IndexWriter
                indexWriter.addDocument(document);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexWriter != null) {
                    indexWriter.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /** Search the index. */
    public static void search() {
        DirectoryReader indexReader = null;
        try {
            // 1. Create a Directory
            Directory directory = FSDirectory.open(new File("F:/test/lucene/index"));
            // 2. Create an IndexReader.
            // Note: this also differs from 3.5; IndexReader.open(Directory) is
            // deprecated, so use DirectoryReader.open(Directory) instead.
            indexReader = DirectoryReader.open(directory);
            // 3. Create an IndexSearcher from the IndexReader
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // 4. Create the Query. Use the default standard analyzer; the second
            // QueryParser argument is the field to search, so this query matches
            // documents whose content field contains "lucene".
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
            QueryParser queryParser = new QueryParser(Version.LUCENE_45, "content", analyzer);
            Query query = queryParser.parse("Lucene");
            // 5. Search via the searcher and get back the TopDocs
            TopDocs topDocs = indexSearcher.search(query, 10);
            // 6. Get the ScoreDoc objects from the TopDocs
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                // 7. Get the concrete Document object from the searcher and ScoreDoc
                Document document = indexSearcher.doc(scoreDoc.doc);
                // 8. Read the required values from the Document object
                System.out.println("id: " + document.get("id"));
                System.out.println("author: " + document.get("author"));
                System.out.println("title: " + document.get("title"));
                // See whether the content can be printed out, and if not, why?
                System.out.println("content: " + document.get("content"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexReader != null) {
                    indexReader.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
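A quick usage sketch for the class above (the IndexUtilDemo wrapper is invented for illustration; it assumes F:/test/lucene/index exists and is writable): run index() once, then search().

public class IndexUtilDemo {
    public static void main(String[] args) {
        IndexUtil.index();
        IndexUtil.search();
        // Expected output: the id, author, and title of the one matching
        // document ("Hello Lucene"), but "content: null". That answers the
        // question in the search code: the content field was added with
        // TextField.TYPE_NOT_STORED, so it is searchable but its original
        // text is not kept in the index and cannot be restored.
    }
}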
