Lucene: An Introduction to the Java-based Full-text Search Engine


Lucene is a Java-based full-text indexing toolkit. This article covers:

    1. Introduction to Lucene, the Java-based full-text indexing engine: its author and history
    2. The implementation of full-text retrieval: comparing Lucene's full-text index with database indexes
    3. A brief introduction to Chinese word-segmentation mechanisms: lexicon-based versus automatic segmentation algorithms
    4. Concrete installation and use: introduction to the system structure, with a demonstration
    5. Hacking Lucene: a simplified query analyzer, implementing deletion, custom sorting, extending the application interfaces
    6. What else can we learn from Lucene?
In addition, if you are currently selecting a full-text engine, now may also be the time to try Sphinx: compared with Lucene it is faster, comes with Chinese word-segmentation support, and has simple distributed retrieval support built in.

Lucene: a Java-based full-text indexing/retrieval engine

Lucene is not a complete full-text indexing application, but a full-text indexing engine toolkit written in Java that can easily be embedded into all kinds of applications to provide them with full-text indexing/retrieval capability.

Lucene's author Doug Cutting is a veteran full-text indexing/retrieval expert. He was once a principal developer of the V-Twin search engine (part of Apple's Copland operating system effort) and later a senior system architect at Excite, and he is currently engaged in research on underlying Internet architectures. His goal in contributing Lucene was to bring full-text search functionality to all kinds of small and medium-sized applications.

Lucene's history: it was first published on the author's own www.lucene.com, later released on SourceForge, and at the end of 2001 it became a subproject of the Apache Foundation's Jakarta project: http://jakarta.apache.org/lucene/

Many Java projects already use Lucene as their back-end full-text indexing engine; the better known ones include:

    • Jive: web forum system;
    • Eyebrows: mailing-list HTML archiving/browsing/query system. The main reference for this article, "The Lucene Search Engine: Powerful, Flexible, and Free", was written by one of the main developers of the Eyebrows system, and Eyebrows has since become the main mailing-list archiving system of the Apache project;
    • Cocoon: XML-based web publishing framework, whose full-text retrieval part uses Lucene;
    • Eclipse: open Java-based development platform, whose help system's full-text indexing uses Lucene.

For Chinese users, the most important question is whether it supports full-text retrieval of Chinese. As the introduction to Lucene's structure below will show, thanks to Lucene's good architectural design, support for Chinese retrieval can be achieved by extending its linguistic lexical-analysis interface.

The implementation mechanism of full-text retrieval

Lucene's API design is quite generic; its input and output structures resemble a database's table ==> record ==> field, so many traditional sources such as files and databases can easily be mapped onto Lucene's storage structure/interface. Overall, Lucene can be thought of as a database system that supports full-text indexing.

Comparing Lucene with a database:

    Lucene:   data source: Doc(field1, field2, ...) --indexer--> Lucene index --searcher--> output: Hits(Doc(field1, field2, ...), Doc(field1, ...), ...)
    Database: data source: Record(field1, field2, ...) --SQL: INSERT--> DB index --SQL: SELECT--> output: Results(Record(field1, field2, ...), Record(field1, ...), ...)

    Lucene                                                      Database
    Document: the unit to be indexed; a Document                Record: a record consisting of multiple fields
      consists of multiple Fields
    Field: field                                                Field: field
    Hits: result set consisting of the matching Documents       RecordSet: result set consisting of multiple Records

Full-text search ≠ LIKE "%keyword%"

Thicker books usually include a keyword index at the back (for example: Beijing: pages 12, 34; Shanghai: pages 3, 77 ...), which helps readers locate the relevant pages much faster. The principle by which a database index greatly speeds up queries is the same: imagine how many times faster it is to look something up in the index at the back of a book than to flip through the content page by page. An index is efficient for another reason as well: it is sorted. For a retrieval system, the core problem is a sorting problem.

Because database indexes are not designed for full-text indexing, the database index does not help at all when you run LIKE "%keyword%" queries: the search degenerates into a record-by-record traversal. So for a database service, LIKE-based fuzzy queries are extremely harmful to performance. If you need fuzzy matching on multiple keywords, such as LIKE "%keyword1%" AND LIKE "%keyword2%" ..., the efficiency can be imagined.

Therefore, the key to building an efficient retrieval system is an inverted-index mechanism like the keyword index at the back of a book: alongside the data sources (e.g., multiple articles) stored in sequential order, keep a sorted list of keywords that maps each keyword to the articles it appears in, with an index of the form [keyword ==> numbers of the articles in which the keyword appears, occurrence count (and even positions: start offset, end offset), frequency]. Retrieval then becomes the process of turning a fuzzy query into a logical combination of multiple exact queries, each of which can take advantage of the index. This greatly improves the efficiency of multi-keyword queries, so the full-text retrieval problem ultimately comes down to a sorting (ranking) problem.
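To make the mechanism concrete, here is a minimal, illustrative sketch of an inverted index in Java. This is a toy structure assumed purely for exposition, not Lucene's actual index format:

    import java.util.*;

    // Toy inverted index: term ==> (document ID ==> positions where the term occurs).
    class InvertedIndex {
        private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

        // Index one document: for every term, record the document and the positions.
        void add(int docId, String[] terms) {
            for (int pos = 0; pos < terms.length; pos++) {
                postings.computeIfAbsent(terms[pos], t -> new HashMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }

        // A multi-keyword fuzzy query becomes a logical AND of exact index lookups.
        Set<Integer> search(String... terms) {
            Set<Integer> result = null;
            for (String term : terms) {
                Set<Integer> docs = postings.getOrDefault(term, Collections.emptyMap()).keySet();
                if (result == null) result = new HashSet<>(docs);
                else result.retainAll(docs);  // intersection of exact lookups
            }
            return result == null ? Collections.emptySet() : result;
        }
    }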

It can be seen that fuzzy queries, relative to a database's exact queries, are an inherently indeterminate problem, which is why most databases offer only limited support for full-text retrieval. Lucene's most central feature is that, through a special index structure, it implements the full-text indexing mechanism that traditional databases are not good at, and it provides extension interfaces to make customization for different applications easy.

You can compare Lucene against a database's fuzzy query with the following table:

Lucene full-text indexing engine vs. database:

Index
    Lucene: all the data from the data source goes through full-text indexing to build an inverted index.
    Database: a LIKE query cannot use the traditional index at all. The data must be traversed record by record for a grep-style fuzzy match, which is several orders of magnitude slower than an indexed search.
Match granularity
    Lucene: matches on lexical units (terms), and can support non-English languages such as Chinese by implementing the language-analysis interface.
    Database: LIKE "%net%" also matches "netherlands"; a multi-keyword fuzzy match such as LIKE "%com%net%" cannot match "xxx.net xxx.com" when the keywords appear in reverse order.
Relevance ranking
    Lucene: has a ranking algorithm that puts results with a high degree of match (similarity) first.
    Database: has no control over the degree of match: a record in which "net" appears 5 times and one in which it appears once are returned just the same.
Result output
    Lucene: outputs the top 100 best-matching results through a special algorithm; the result set is buffered and read in small batches.
    Database: returns the entire result set, which requires a large amount of memory to hold the temporary result set when there are very many matching entries (such as tens of thousands).
Customizability
    Lucene: through different implementations of the language-analysis interface, indexing rules can easily be customized to the application's needs (including Chinese support).
    Database: no interface, or a complex interface; cannot be customized.
Conclusion
    Lucene: for high-load fuzzy-query applications that need flexible fuzzy-query rules and index a relatively large volume of data.
    Database: for low usage, simple fuzzy-matching rules, or a small volume of fuzzily queried data.

The biggest difference between full-text retrieval and database applications: returning the 100 most relevant results first satisfies the needs of more than 98% of users.

The innovation of Lucene:

Most search (and database) engines use a B-tree structure to maintain their index, so index updates lead to a lot of I/O operations. Lucene's implementation improves slightly on this: instead of maintaining a single index file, it keeps creating new index files as the index is extended, and then periodically merges these small new index files into the existing large index (the batch size can be adjusted for different update strategies). This improves indexing efficiency without hurting retrieval efficiency.
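A minimal sketch of that incremental pattern with the classic Lucene 1.x API (assuming the old IndexWriter constructor whose third argument, false, means "append to an existing index", and the optimize() call that merges segments):

    import java.util.List;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class IncrementalIndexer {
        // Append a small batch of documents to an existing index (create = false),
        // letting Lucene accumulate small segment files, then merge them.
        static void appendBatch(String indexPath, List<Document> newDocs) throws Exception {
            IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            for (int i = 0; i < newDocs.size(); i++) {
                writer.addDocument(newDocs.get(i));
            }
            writer.optimize();  // merge the small segments into one large index
            writer.close();
        }
    }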

Comparison of Lucene with some other full-text retrieval systems/applications:

Incremental vs. batch indexing
    Lucene: supports incremental indexing (append) as well as batch indexing of large volumes of data, and the interface is designed to optimize both batch indexing and small incremental updates.
    Others: many systems support only batch indexing; once the data source grows a little, the whole index must be rebuilt.
Data source
    Lucene: does not define a specific data source, only the structure of a document, so it adapts very flexibly to all kinds of applications (as long as the front end has a suitable converter that turns the data source into the corresponding structure).
    Others: many systems target only web pages and lack flexibility for documents in other formats.
Index content control
    Lucene: a document consists of multiple fields, and you can control which fields are indexed and which are not; indexed fields are further divided into those that need tokenizing and those that do not:
      - fields indexed with tokenizing, e.g. title and article body fields;
      - fields indexed without tokenizing, e.g. author/date fields.
    Others: lack such versatility; often the entire document is indexed wholesale.
Language analysis
    Lucene: implemented through different extensions of the language analyzer:
      - can filter out unwanted words: a, an, of, etc.;
      - Western stemming: jumps, jumped, jumper all reduce to jump for indexing/retrieval;
      - non-English support: the interface admits indexing support for Asian languages and Arabic.
    Others: lack a common interface implementation.
Query analysis
    Lucene: through the query-analysis interface you can customize your own query syntax rules, for example the +, -, AND, OR relationships among multiple keywords.
Concurrent access
    Lucene: able to support multi-user use.

The word-segmentation problem in Asian languages (word segment)

For Chinese, a full-text index must first solve a language-analysis problem. In English, the words in a sentence are naturally separated by spaces, but in the CJK (Chinese/Japanese/Korean) languages a sentence is just an unbroken run of characters. So before the "words" in a sentence can be indexed, there is the big problem of how to segment them out.

First of all, we certainly cannot use single characters as the index unit; otherwise a search for "上海" (Shanghai) would also match text containing "海上" (at sea).

But take a phrase such as "北京天安门" (Beijing Tiananmen): how should the computer segment it according to the habits of the Chinese language — "北京 天安门" or "北京 天 安门"? For the computer to segment according to language habits, it usually needs a fairly rich lexicon to identify the words in a sentence with reasonable accuracy.

The other solution is an automatic segmentation algorithm: split the sentence into overlapping bigrams (2-grams), for example:
"北京天安门" ==> "北京 京天 天安 安门".

This way, at query time, whether the user searches for "北京" or for "天安门", the query string is segmented by the same rule ("北京"; "天安", "安门"), the resulting keywords are combined in an AND relationship, and both map correctly onto the index. This approach applies equally to the other Asian languages, Korean and Japanese.
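A minimal sketch of such bigram segmentation in Java (an illustrative helper, not Lucene's actual CJK analyzer):

    import java.util.ArrayList;
    import java.util.List;

    class Bigram {
        // Split a CJK string into overlapping 2-character terms.
        static List<String> segment(String text) {
            List<String> terms = new ArrayList<String>();
            for (int i = 0; i + 2 <= text.length(); i++) {
                terms.add(text.substring(i, i + 2));
            }
            return terms;
        }
    }
    // segment("北京天安门") ==> [北京, 京天, 天安, 安门]
    // segment("天安门")    ==> [天安, 安门] — the same terms a query for 天安门 produces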

The biggest advantage of automatic segmentation is that there is no lexicon to maintain and implementation is simple; the disadvantage is lower index efficiency. For small and medium-sized applications, though, bigram-based segmentation is quite sufficient. After bigram segmentation the index is generally about the same size as the source text, whereas for English an index file is generally only 30%-40% of the size of the original text.


Automatic segmentation vs. lexicon-based segmentation:

Implementation
    Automatic segmentation: very simple to implement.
    Lexicon-based segmentation: complex to implement.
Queries
    Automatic segmentation: increases the complexity of query analysis.
    Lexicon-based segmentation: well suited to implementing more complex query-syntax rules.
Storage efficiency
    Automatic segmentation: large index redundancy; the index is almost as large as the original text.
    Lexicon-based segmentation: efficient index, about 30% of the original size.
Maintenance cost
    Automatic segmentation: no lexicon maintenance cost.
    Lexicon-based segmentation: very high lexicon maintenance cost: Chinese, Japanese, Korean and other languages each need their own lexicon, which must also include word-frequency statistics and other data.
Applicable domains
    Automatic segmentation: embedded systems (limited runtime resources); distributed systems (no lexicon synchronization problem); multilingual environments (no lexicon maintenance cost).
    Lexicon-based segmentation: professional search engines with high query and storage efficiency requirements.

At present, the language-analysis algorithms of the large search engines are generally based on a combination of these two mechanisms. For more material on Chinese language-analysis algorithms, search Google for the keywords "wordsegment search".

Installation and use

Download: http://jakarta.apache.org/lucene/

Note: some of the more complex lexical analysis in Lucene is generated with JavaCC (Java Compiler Compiler, a pure-Java lexical-analyzer generator), so if you build from source, or need to modify the QueryParser or write your own lexical analyzer, you also need to download JavaCC from https://javacc.dev.java.net/.

The composition of Lucene: the indexing module (index) and the retrieval module (search) are the main entry points for external applications:

org.apache.lucene.search/       search entry point
org.apache.lucene.index/        indexing entry point
org.apache.lucene.analysis/     language analyzers
org.apache.lucene.queryParser/  query analyzer
org.apache.lucene.document/     storage structure
org.apache.lucene.store/        underlying IO/storage structure
org.apache.lucene.util/         some common data structures

A simple example demonstrates how Lucene is used:

Indexing process: read file names from the command line (multiple allowed), store 2 fields per file — the path (path field) and the content (body field) — and build a full-text index on the content. The unit of indexing is the Document object; each Document object contains multiple Field objects, and for different field properties and data-output requirements you can choose different indexing/storage rules per field, as listed below:
Method                                        Tokenized   Indexed   Stored   Purpose
Field.Text(String name, String value)         Yes         Yes       Yes      tokenized, indexed, and stored, e.g. title and body fields
Field.Text(String name, Reader value)         Yes         Yes       No       tokenized and indexed but not stored, e.g. meta data that
                                                                             must be searchable but is never returned for display
Field.Keyword(String name, String value)      No          Yes       Yes      indexed without tokenizing, and stored, e.g. date fields
Field.UnIndexed(String name, String value)    No          No        Yes      not indexed, only stored, e.g. file paths
Field.UnStored(String name, String value)     Yes         Yes       No       full-text indexed only, not stored
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexFiles {
  // Usage: IndexFiles [index output directory] [files to index ...]
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    // Construct a new index writer with the specified language analyzer
    // (the 3rd parameter indicates whether to append to an existing index)
    writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
    for (int i = 1; i < args.length; i++) {
      System.out.println("Indexing file " + args[i]);
      InputStream is = new FileInputStream(args[i]);
      // Construct a Document object containing 2 fields:
      // - the path field: not indexed, only stored
      // - the body field: full-text indexed (the Reader variant is not stored)
      Document doc = new Document();
      doc.add(Field.UnIndexed("path", args[i]));
      doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
      // Write the document into the index
      writer.addDocument(doc);
      is.close();
    }
    // Close the index writer
    writer.close();
  }
}

As you can see in the index process:

    • The language analyzer is provided as an abstract interface, so language analysis (Analyzer) can be customized. Lucene provides 2 common analyzers by default, SimpleAnalyzer and StandardAnalyzer, and neither supports Chinese by default, so to add Chinese segmentation rules these 2 analyzers need to be modified.
    • Lucene does not specify the format of the data source; it only provides a generic structure (the Document object) to accept input for indexing, so the input data source can be a database, a Word document, a PDF document, an HTML document ... As long as you can design a suitable parser/converter that constructs a corresponding Document object from the data source, it can be indexed (see the sketch after this list).
    • For indexing large batches of data, you can also improve bulk-indexing efficiency by adjusting the IndexWriter's file-merge frequency attribute (mergeFactor).
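For instance, a hypothetical sketch of mapping one database row to a Document; the column and field names here are illustrative, not from the article:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class RowConverter {
        // Hypothetical mapping of one database row to a Lucene Document.
        static Document rowToDoc(ResultSet rs) throws SQLException {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));      // untokenized key, usable for later deletes
            doc.add(Field.Text("title", rs.getString("title")));   // tokenized, indexed, stored
            doc.add(Field.UnStored("body", rs.getString("body"))); // tokenized, indexed, not stored
            return doc;
        }
    }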

The search process and results are displayed:

The search returns a Hits object, through which you can then access the contents of Document ==> Field.

Assuming a full-text search on the body field, we can print the path field of each result together with how well it matched the query (score).

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class Search {
  public static void main(String[] args) throws Exception {
    String indexPath = args[0], queryString = args[1];
    // Point the searcher at the index directory
    Searcher searcher = new IndexSearcher(indexPath);
    // Query parser: use the same language analyzer as at index time
    Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
    // Search results are stored in a Hits object
    Hits hits = searcher.search(query);
    // Through hits we can access the stored fields of each matching document
    // and how well it matched the query (score)
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
    }
    searcher.close();
  }
}
Throughout the retrieval process, the language analyzer, the query analyzer, and even the search engine (Searcher) provide abstract interfaces that can be customized as needed.

Hacking Lucene

Simplified Query Analyzer

My personal feeling is that since Lucene became a Jakarta project, too much time has been drawn into debugging the increasingly complex QueryParser, most of whose features are not familiar to most users. The syntax Lucene currently supports:

Query  ::= ( Clause )*
Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

The intermediate logic includes symbols such as AND, OR, +, -, &&, ||, as well as "phrase queries" and prefix/fuzzy queries for Latin scripts. My personal feeling is that for ordinary applications these features are somewhat flashy; in fact, query-analysis capabilities similar to what Google offers today are already enough for most users. So the QueryParser of Lucene's early versions remains a good choice.
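For illustration, a few query strings that this grammar accepts, fed through the classic static QueryParser.parse (same imports as the Search example above):

    // Boolean operators:      java AND lucene, java OR lucene
    // Mandatory / prohibited: +required -forbidden
    // Phrase query:           "full text search"
    // Fielded term:           title:lucene
    Query q = QueryParser.parse("+java -sphinx \"full text\"", "body", new SimpleAnalyzer());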

Adding, modifying, and deleting a specified record (Document)

Lucene supports extending an existing index, so dynamically growing an index is no problem; modifying a specified record, however, appears possible only by deleting the old record and then re-adding it. How do you delete a specified record? The method is simple too: at indexing time, index a record ID from the data source as a dedicated field, and later use the IndexReader.delete(Term term) method to remove the corresponding Document via that record ID.
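A minimal sketch under those assumptions (an untokenized "id" keyword field written at index time; the field name is illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    class DeleteById {
        // Delete every document whose "id" keyword field equals the given record ID.
        static void delete(String indexPath, String recordId) throws Exception {
            IndexReader reader = IndexReader.open(indexPath);
            reader.delete(new Term("id", recordId));
            reader.close();
        }
    }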

Sorting on a field value

By default Lucene sorts results by its own relevance algorithm (score), but sorting results on some other field is a frequently raised issue on Lucene's development mailing list: many applications originally built on databases need sort orders beyond relevance-based (score) ranking. From the principles of full-text retrieval we can see that any search that does not go through the index will be very inefficient; if sorting on another field requires accessing stored fields during the search, the speed drops drastically, which is highly undesirable.

But there is a compromise: during the search, the only parameters that can affect the ordering and already live in the index are the docId and the score. So, to sort on something other than score, you can pre-sort the data source on that field, and then obtain the sort order by sorting results by docId. This avoids sorting outside Lucene's search results and accessing a field value that is not in the index during the search.

What needs to be modified is the HitCollector process inside IndexSearcher:

... scorer.score(new HitCollector() {
      private float minScore = 0.0f;
      public final void collect(int doc, float score) {
        if (score > 0.0f &&                     // ignore zeroed buckets
            (bits == null || bits.get(doc))) {  // skip docs not in bits
          totalHits[0]++;
          if (score >= minScore) {
            /* Originally, Lucene put the docId and its score into the result hit list:
             *   hq.put(new ScoreDoc(doc, score));   // update hit queue
             * If you replace score with doc, or with 1/doc, and the data source was
             * already sorted (ascending or descending) on the field before indexing,
             * the results come out ordered by docId, i.e. sorted on that field.
             * Even more complex combinations of score and docId could be used to
             * implement sorting on a field.
             */
            hq.put(new ScoreDoc(doc, (float) 1 / doc));
            if (hq.size() > nDocs) {            // if hit queue overfull
              hq.pop();                         // remove lowest in hit queue
              minScore = ((ScoreDoc) hq.top()).score;  // reset minScore
            }
          }
        }
      }
    }, reader.maxDoc());

A more general input/output interface

Although Lucene does not define a fixed input document format, more and more people have thought of using a standard intermediate format as Lucene's data-import interface; other data, such as PDFs, then only need a converter to this standard intermediate format to be indexed. These intermediate formats are mainly XML-based, and there are already four or five similar implementations:

Data source: WORD   PDF   HTML   DB   other
                \     |     |     |    /
                   XML intermediate format
                           |
                     Lucene INDEX

There is currently no parser for MS Word documents, because Word documents, unlike ASCII-based RTF documents, need to be parsed via COM object mechanisms. This is the information I found on Google: http://www.intrinsyc.com/products/enterprise_applications.asp
Another approach is to convert Word documents into text: http://www.winfield.demon.nl/index.html


Indexing process Optimization

Indexing generally falls into 2 cases: small-batch incremental extension of an index, and large-scale rebuilding of an index. During indexing, Lucene does not perform index-file write operations for every new doc added to the index (file I/O is a very resource-intensive thing).

Lucene first performs indexing operations in memory and writes the files out in batches. The larger the batch interval, the fewer the file writes, but the more memory is consumed; conversely, less memory means frequent file I/O and slower indexing. There is a mergeFactor parameter on IndexWriter that, after you construct the indexer, helps you trade memory for fewer file operations according to your application environment. In my experience: the default indexer writes once per 20 records indexed, and multiplying mergeFactor by 50 roughly doubles indexing speed.
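A minimal sketch of that tuning, assuming the Lucene 1.x API in which mergeFactor is a public field on IndexWriter:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    class BulkIndexing {
        static IndexWriter openForBulk(String indexPath) throws Exception {
            IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), true);
            // Larger value: fewer file writes, more memory, faster bulk indexing.
            // Smaller value: less memory, more frequent I/O, slower indexing.
            writer.mergeFactor = 1000;  // e.g. 50x the default batch mentioned above
            return writer;
        }
    }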

Search Process Optimization

Lucene supports in-memory indexing: such searches are an order of magnitude faster than those based on file I/O.
http://www.onjava.com/lpt/a/3273
It also pays to minimize the creation of IndexSearcher objects and to cache search results at the front end.
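A minimal sketch of an in-memory index with a reused searcher, assuming the classic RAMDirectory in org.apache.lucene.store:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    class MemoryIndex {
        // Build the index entirely in memory ...
        static IndexSearcher build() throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
            // ... add documents here ...
            writer.close();
            // ... and reuse one IndexSearcher across queries instead of
            // constructing a new one per request.
            return new IndexSearcher(dir);
        }
    }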

Lucene's optimization for full-text retrieval is that after the first index lookup it does not read out all the records; it puts only the IDs of the top 100 best-matching results (TopDocs) into the result-set cache and returns. Compare this with database retrieval: for a query matching 10,000 records, the database would have to fetch all of the record content back into the application's result set. So even when the number of matches is very large, Lucene's result set does not occupy much memory. For typical fuzzy-retrieval applications not that many results are needed anyway; the first 100 already satisfy more than 90% of retrieval needs.

If the first batch of cached results is exhausted and later results are needed, the Searcher re-retrieves and builds a cache twice the size of the previous one, then fetches again. So if you construct one Searcher and ask for results 1-120, the Searcher actually performs 2 search passes: after the first 100 are used up, it re-retrieves and builds a 200-entry result cache, and so on for 400 and 800. Since these caches become inaccessible once their Searcher object is gone, you may want to cache result records at the application level and keep the count needed from Lucene below 100, taking full advantage of the first-level result cache so that Lucene does not waste multiple retrievals; result caching can also be layered on top.

Another feature of Lucene is that it automatically filters out low-relevance results while collecting them. This too differs from database applications, which have to return everything the query matched.

Some of my attempts:

    • Support for a Chinese tokenizer: there are 2 versions. One is generated with JavaCC and indexes CJK text one character per token; the other is rewritten from SimpleTokenizer, tokenizing numbers and letters in English and indexing Chinese character by character.
    • An indexer based on an XML data source: XMLIndexer. Any data source that can be converted, per the DTD, into the specified XML can then be indexed with XMLIndexer.
    • Sorting on a field: a searcher that returns results in record-index order, IndexOrderSearcher. If you need results sorted by some field, have the data source pre-sorted on that field (for example: a price field), so that after indexing, searching with this record-ID-ordered searcher yields results equivalent to sorting on that field.

Learn more from Lucene

Lucene is indeed a model of object-oriented design.

    • All problems are made easy to extend and reuse through one extra layer of abstraction: wherever you find something lacking, you can achieve your goal by re-implementing that piece, without touching other modules;
    • The application entry points, Searcher and Indexer, are simple, and call on a series of underlying components to carry out the search/indexing task cooperatively;
    • The task of every object is very specific: for example, in the search process, the QueryParser analyzes and transforms the query string into a combination of a series of exact queries (Query); the index is read through the underlying index-reading structure IndexReader; and the corresponding scorer scores and sorts the search results; and so on. All functional modules are highly atomized, so each can be re-implemented without modifying other modules.
    • Besides the flexible application interface design, Lucene also provides a number of language-analyzer implementations suited to most applications (SimpleAnalyzer, StandardAnalyzer), which is one of the important reasons new users can get started quickly.

These advantages are well worth learning from in future development. As a generic toolkit, Lucene really does give great convenience to developers who need to embed full-text search functionality into their applications.

In addition, through studying and using Lucene, I have also come to understand more deeply why many database optimization guidelines exist, for example:

    • Index fields as much as possible to improve query speed, but too many indexes slow down update operations on database tables; and piling on sort conditions over results is, in practice, often one of the performance killers.
    • Many commercial databases provide optimization parameters for large bulk inserts, whose role is similar to the indexer's mergeFactor.
    • The 20%/80% principle: more results does not equal better quality, especially when the result set can be very large; optimizing the quality of the first few dozen results is often what matters most.
    • Let the application fetch as small a result set from the database as possible, because random access over a large result set is an extremely resource-intensive operation even for a big database.

Resources:

Apache: Lucene project
http://jakarta.apache.org/lucene/
Lucene developer/user mailing list archives
[email protected]
[email protected]

The Lucene Search Engine: Powerful, Flexible, and Free
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-Lucene_p.html

Lucene Tutorial
http://www.darksleep.com/puff/lucene/lucene.html

Notes on distributed searching with Lucene
http://home.clara.net/markharwood/lucene/

On Chinese word segmentation
http://www.google.com/search?sourceid=navclient&hl=zh-CN&q=chinese+word+segment

Introduction to search engine tools
http://searchtools.com/

Several papers and patents by Lucene author Doug Cutting
http://lucene.sourceforge.net/publications.html

Lucene's .NET implementation: dotLucene
http://sourceforge.net/projects/dotlucene/

Another project by Lucene author Doug Cutting: Nutch, a Java-based search engine
http://www.nutch.org/
http://sourceforge.net/projects/nutch/

On the comparison of lexicon-based word segmentation and N-gram segmentation
http://china.nikkeibp.co.jp/cgi-bin/china/news/int/int200302100112.html

2005-01-08: Doug Cutting's lecture on Lucene at the University of Pisa: a very detailed commentary on the Lucene architecture
