[Lucene] Lucene Basic Overview and Simple Examples

Source: Internet
Author: User
Tags: isearch

First, Lucene Basic Introduction:
    • Basic information: Lucene is an open-source full-text search engine toolkit from the Apache Software Foundation. It is a full-text search engine library that provides a complete query engine, an indexing engine, and several text-analysis engines. Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
    • File structure: a tree that expands top-down, with a one-to-many relationship at each level.
      • Index: equivalent to a database or a table.
      • Segment: equivalent to a sub-database or a sub-table.
      • Document: equivalent to one record, for example a single novel.
      • Field: a document is divided into multiple fields, such as a novel's author, title, and content.
      • Term: a field is split into multiple terms; the term is the smallest unit of search. With the standard analyzer, a term is a single English word or a single Chinese character.
    • Forward information:
      • Index → segment → document → field → term.
    • Reverse (inverted) information:
      • Term → document.
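As a small illustration of the term definition above, here is a plain-Java sketch (the class name is invented, and this only imitates, not reproduces, Lucene's StandardAnalyzer) that splits English text into lowercase words and Chinese text into single characters:

```java
import java.util.ArrayList;
import java.util.List;

public class StandardTokenSketch {
    // Rough imitation of the standard-analyzer behavior described above:
    // English text becomes lowercase words, CJK text becomes single characters.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c) && !isCjk(c)) {
                word.append(Character.toLowerCase(c));
            } else {
                if (word.length() > 0) { terms.add(word.toString()); word.setLength(0); }
                if (isCjk(c)) terms.add(String.valueOf(c)); // one term per Chinese character
            }
        }
        if (word.length() > 0) terms.add(word.toString());
        return terms;
    }

    private static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello 世界 World")); // [hello, 世, 界, world]
    }
}
```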
Second, Lucene Full-Text Search:

1. Data classification:

    • Structured data: fixed-length, well-formatted data, such as database records.
    • Semi-structured data: such as XML and HTML.
    • Unstructured data: data with no fixed length or format, such as plain text.

2. The retrieval process: Lucene's retrieval process can be divided into two parts: indexing, which turns structured, semi-structured, and unstructured data into an index, and querying, which searches that index.

    • Indexing process:
      • Start with a set of files to be indexed.
      • The files are parsed and linguistically processed into a series of terms.
      • An index is created from the terms, forming a dictionary and an inverted index table.
      • The index is written to disk or memory through the index store.
    • Search process:
      • The user enters query keywords.
      • The query statement is parsed and linguistically processed into a series of terms.
      • Syntax analysis produces a query tree.
      • The index is read into memory through the index store.
      • The query tree is used to search the index: the posting list (document linked list) of each term is fetched, the lists are intersected, and the result documents are obtained.
      • The result documents are sorted by relevance to the query.
      • The query results are returned to the user.
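The posting-list intersection step above can be sketched in plain Java. This is only an illustration (the class name and doc IDs are invented); Lucene's real posting lists use compressed on-disk structures:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingIntersection {
    // Intersect two sorted posting lists (document IDs) with a two-pointer walk.
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; } // doc contains both terms
            else if (x < y) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Postings for two query terms; documents 2 and 7 contain both.
        List<Integer> termA = List.of(1, 2, 5, 7);
        List<Integer> termB = List.of(2, 3, 7, 9);
        System.out.println(intersect(termA, termB)); // [2, 7]
    }
}
```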

3. Inverted Index:

When Lucene searches for a keyword, the index locates the matching documents directly. This keyword-to-document mapping is the reverse of the natural document-to-content direction, which is why it is called an inverted index.
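A minimal inverted index, the term-to-document mapping just described, can be sketched with an ordinary map. This is an illustration only (the class name is invented, and whitespace splitting stands in for real analysis); it is not Lucene's on-disk format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // Build a map from term -> sorted list of document IDs containing the term.
    public static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Avoid duplicate entries when a term repeats within one document.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx =
                build(new String[]{"Hello World", "Hello main man test"});
        System.out.println(idx.get("hello")); // [0, 1]
        System.out.println(idx.get("main"));  // [1]
    }
}
```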

4. Creating an Index:

    • Documents: the files to be indexed.
    • Word segmentation: many segmentation techniques exist; the standard analyzer splits Chinese into single characters and English into single words.
    • Index creation: produces the index table.

5. Index Search:

    • Four steps: keyword (Keyword), analysis (Analyzer), index search (SearchIndex), and results (Result).
    • In detail: the user enters a keyword; the analyzer splits it into terms; the terms are looked up in the index table; the documents containing all of the terms are found and returned.
Third, the Mathematical Model of Lucene:

1. Key noun:

    • Document: an article is a document.
    • Fields: a document can be divided into multiple fields, such as document name, author, time, and content.
    • Terms: a field can be split into multiple terms. For example, a document named "Introduction to Chinese Ancient Poetry" (中国古诗词简介) is split by the standard analyzer into the single characters 中, 国, 古, 诗, 词, 简, 介. The term is the smallest unit of search.

2. Weight calculation:

    • TF (term frequency): the number of times the term appears in this document; the larger the TF, the more important the term.
    • DF (document frequency): the number of documents that contain the term; the larger the DF, the less important the term.
    • Weight formula: w(t,d) = tf(t,d) × log(N / df(t)), where tf(t,d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing term t.
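A worked example of the weight formula, as a small Java sketch. The class name and the numbers are hypothetical, and a natural log is assumed here; Lucene's actual similarity implementations add smoothing and normalization on top of this basic form:

```java
public class TfIdfWeight {
    // w(t,d) = tf(t,d) * log(N / df(t)), using the natural log.
    public static double weight(int tf, int n, int df) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the term appears 3 times in the document,
        // the collection has 1000 documents, and 10 of them contain the term.
        System.out.printf("%.4f%n", weight(3, 1000, 10)); // 3 * ln(100) ≈ 13.8155
    }
}
```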

3. Space Vector Model:

    • A document is represented as a vector of its term weights (each term's weight is one dimension of the vector), so every document becomes an n-dimensional vector, where n is the total number of distinct terms obtained after tokenizing all documents.
    • Retrieval process: the m documents give m n-dimensional vectors. The search phrase is tokenized into x terms, and computing the weights of those x terms yields an n-dimensional query vector xv. The similarity between xv and each of the m document vectors, measured as the cosine of the angle between them, represents the relevance. Lucene uses this relevance scoring mechanism to rank the returned documents.
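The cosine-of-the-angle similarity can be sketched as follows. The class name and weight values are illustrative; Lucene's actual scoring adds length normalization and boosts on top of this idea:

```java
public class CosineSimilarity {
    // cos(theta) = (a . b) / (|a| * |b|), the similarity of two weight vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Term-weight vectors for a document and a query (illustrative values):
        double[] doc   = {0.0, 2.3, 0.0, 1.2};
        double[] query = {0.0, 1.0, 0.0, 1.0};
        System.out.printf("%.4f%n", cosine(doc, query));
    }
}
```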
Fourth, Official Sample Demo:
  • Download the source code and jars: http://lucene.apache.org/core/
  • The core jar packages are as follows:
  • Run the demo files in the source code and the examples of the Lucene core API to see how Lucene creates an index and searches it.

  • Build a Java project with IDEA: add the six core packages and the sample files.
  • IndexFiles and SearchFiles walk a file directory to create an index and then take a keyword from the console to find the matching files. CIndexSearch is a complete example that creates an in-memory index, adds a document to it, and finally searches it.
    package test;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    /**
     * Created by Rzx on 2017/6/1.
     */
    public class CIndexSearch {

        public static void createIndexAndSearchIndex() throws Exception {
            Analyzer analyzer = new StandardAnalyzer();   // standard analyzer
            // RAMDirectory stores the index in memory
            Directory directory = new RAMDirectory();
            // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex")); // on-disk index
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            IndexWriter writer = new IndexWriter(directory, config);

            Document document = new Document();
            String text = "Hello World main test";
            document.add(new Field("fileTest", text, TextField.TYPE_STORED)); // add a field to the document
            writer.addDocument(document);
            writer.close();

            DirectoryReader directoryReader = DirectoryReader.open(directory);
            IndexSearcher iSearch = new IndexSearcher(directoryReader);
            QueryParser parser = new QueryParser("fileTest", new StandardAnalyzer());
            Query query = parser.parse("main");           // query the keyword "main"
            ScoreDoc[] hits = iSearch.search(query, 1000).scoreDocs;
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = iSearch.doc(hits[i].doc);
                System.out.print("Hit file content: " + hitDoc.get("fileTest"));
            }
            directoryReader.close();
            directory.close();
        }

        public static void main(String[] args) {
            try {
                createIndexAndSearchIndex();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    Operation Result:

  • The input directory contains two files, Test1.txt and Test2.txt, whose contents are "Hello World" and "Hello main man test". Running IndexFiles reads the input directory and automatically creates a testindex index directory, with the following result:

  • The testindex directory is created and stores the index-related information. Run SearchFiles and enter the search keywords in the console: hello, main

Fifth, Core Classes:
  • Index building: Analyzer, Directory (RAMDirectory, FSDirectory), IndexWriterConfig, IndexWriter, Document

    Analyzer analyzer = new StandardAnalyzer();                     // instantiate the analyzer
    Directory directory = new RAMDirectory();                       // initialize an in-memory index directory
    Directory directory = FSDirectory.open(Paths.get("indexPath")); // initialize an on-disk index directory
    IndexWriterConfig config = new IndexWriterConfig(analyzer);     // indexer configuration
    IndexWriter writer = new IndexWriter(directory, config);        // the indexer
    Document document = new Document();                             // initialize a document, which holds the data
  • Index querying: DirectoryReader, IndexSearcher, QueryParser, MultiFieldQueryParser

    DirectoryReader directoryReader = DirectoryReader.open(directory); // index directory reader
    IndexSearcher iSearch = new IndexSearcher(directoryReader);        // index searcher

    // Several ways to build a query:
    // QueryParser: bound to a single field
    QueryParser qParser = new QueryParser("field", new StandardAnalyzer()); // query parser: field name, analyzer
    Query query = qParser.parse("main");                                    // query keyword

    // MultiFieldQueryParser: bound to multiple fields
    QueryParser qParser2 = new MultiFieldQueryParser(new String[]{"field1", "field2"}, new StandardAnalyzer()); // multi-field query parser
    Query query2 = qParser2.parse("main");                                  // query keyword

    // Term: a query bound to one field: new Term(field, keyword)
    Term term = new Term("content", "main");
    Query query3 = new TermQuery(term);

    // More methods: refer to http://blog.csdn.net/chenghui0317/article/details/10824789
    ScoreDoc[] hits = iSearch.search(query, 1000).scoreDocs; // the hit documents with their scores
  • Highlighting: SimpleHTMLFormatter, Highlighter, SimpleFragmenter

    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
    Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(400));
    String content = highlighter.getBestFragment(new StandardAnalyzer(), "contents", "Hello Main man test");
  • Built-in analyzers: Lucene ships many analyzers, targeting different scenarios and languages.

    QueryParser qParser = new QueryParser("content", new SimpleAnalyzer());
    QueryParser qParser = new QueryParser("content", new ClassicAnalyzer());
    QueryParser qParser = new QueryParser("content", new KeywordAnalyzer());
    QueryParser qParser = new QueryParser("content", new StopAnalyzer());
    QueryParser qParser = new QueryParser("content", new UAX29URLEmailAnalyzer());
    QueryParser qParser = new QueryParser("content", new UnicodeWhitespaceAnalyzer());
    QueryParser qParser = new QueryParser("content", new WhitespaceAnalyzer());
    QueryParser qParser = new QueryParser("content", new ArabicAnalyzer());
    QueryParser qParser = new QueryParser("content", new ArmenianAnalyzer());
    QueryParser qParser = new QueryParser("content", new BasqueAnalyzer());
    QueryParser qParser = new QueryParser("content", new BrazilianAnalyzer());
    QueryParser qParser = new QueryParser("content", new BulgarianAnalyzer());
    QueryParser qParser = new QueryParser("content", new CatalanAnalyzer());
    QueryParser qParser = new QueryParser("content", new CJKAnalyzer());
    QueryParser qParser = new QueryParser("content", new CollationKeyAnalyzer());
    QueryParser qParser = new QueryParser("content", new CustomAnalyzer(Version defaultMatchVersion, CharFilterFactory[] charFilters, TokenizerFactory tokenizer, TokenFilterFactory[] tokenFilters, Integer posIncGap, Integer offsetGap));
    QueryParser qParser = new QueryParser("content", new SmartChineseAnalyzer()); // Chinese word-level analyzer
Sixth, Highlighting Example:
  • Read the index testindex created by IndexFiles, query the keyword main, and highlight it. An exception occurred: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/index/memory/MemoryIndex. It requires importing lucene-memory-6.5.1.jar; this package handles the stored position offsets, letting us locate the keyword terms in the text.
    public static void searchByIndex(String indexFilePath, String keyword)
            throws ParseException, InvalidTokenOffsetsException {
        try {
            String indexDataPath = "testindex";
            String keyWord = "main";
            Directory dir = FSDirectory.open(new File(indexDataPath).toPath());
            IndexReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
            Query query = queryParser.parse("main");
            TopDocs topDocs = searcher.search(query, 10);
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            System.out.println("Maximum score: " + topDocs.getMaxScore());
            for (int i = 0; i < scoreDocs.length; i++) {
                int doc = scoreDocs[i].doc;
                Document document = searcher.doc(doc);
                System.out.println("=====================================");
                System.out.println("Keyword: " + keyWord);
                System.out.println("File path: " + document.get("path"));
                System.out.println("File ID: " + scoreDocs[i].doc);
                // start highlighting
                SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
                Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
                highlighter.setTextFragmenter(new SimpleFragmenter(400));
                String content = highlighter.getBestFragment(new StandardAnalyzer(), "contents", "Hello Main man test");
                // String content = highlighter.getBestFragment(new StandardAnalyzer(), "contents", document.get("content"));
                System.out.println("File content: " + content);
                System.out.println("Relevance: " + scoreDocs[i].score);
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    Output Result:

Source code: https://codeload.github.com/NextNight/luncene6.5.1Test/zip/master

