Introduction to Lucene, the Java-based full-text indexing engine: about the author and the history of Lucene
The implementation of full-text search: a comparison of Lucene full-text indexes and database indexes
A brief introduction to Chinese word segmentation: lexicon-based segmentation versus automatic segmentation algorithms
Installation and use: system architecture introduction and demo
Hacking Lucene: a simplified query analyzer, implementing deletion, custom ordering, and extending the application interface
What else can we learn from Lucene?
Lucene: a Java-based full-text indexing/retrieval engine
Lucene is not a complete full-text search application, but rather a full-text indexing engine toolkit written in Java. It can easily be embedded in all kinds of applications to add full-text indexing/retrieval capability to them.
Lucene's author: Doug Cutting is a veteran full-text indexing/retrieval expert. He was a principal developer of the V-Twin search engine (part of Apple's Copland operating system effort) and later a senior system architect at Excite; he is currently engaged in research on some of the Internet's underlying architectures. His goal for Lucene is to add full-text search capability to all kinds of small and medium-sized applications.
Lucene's history: it was first published on the author's own site, www.lucene.com, and later released on SourceForge. At the end of 2001 it became a subproject of the Apache Foundation's Jakarta project: http://jakarta.apache.org/lucene/
Many Java projects already use Lucene as their back-end full-text indexing engine. Notable examples include:
Jive: a web forum system;
Eyebrows: a mailing-list HTML archiving/browsing/querying system. The main reference for this article, "The Lucene Search Engine: Powerful, Flexible, and Free", was written by one of Eyebrows' main developers, and Eyebrows has become the main mailing-list archiving system for the Apache project;
Cocoon: an XML-based web publishing framework whose full-text retrieval component uses Lucene;
Eclipse: a Java-based open development platform whose help system uses Lucene for full-text indexing.
For Chinese-speaking users, the question of most concern is whether Lucene supports full-text search in Chinese. As the later discussion of Lucene's structure will show, thanks to Lucene's good architectural design, Chinese support can be added simply by extending the language analysis (lexical analysis) interface.
The implementation mechanism of full-text search
Lucene's API design is quite general; its input and output structures closely resemble a database's table ==> record ==> field hierarchy, so many traditional applications built on files, databases, and so on can be mapped fairly easily onto Lucene's storage structures and interfaces. Overall, you can think of Lucene as a database system that supports full-text indexing.
Compare Lucene and a database:

Lucene:

  Index data source: doc(field1, field2...) doc(field1, field2...)
                 \  indexer  /
                ______________
               | Lucene Index |
                --------------
                 / searcher \
  Result output: Hits(doc(field1, field2) doc(field1...))

Database:

  Index data source: record(field1, field2...) record(field1...)
                 \  SQL: INSERT  /
                _________________
               |    DB Index    |
                -----------------
                 / SQL: SELECT \
  Result output: Results(record(field1, field2...) record(field1...))

Document: the "unit" to be indexed; a Document consists of multiple Fields  <==>  Record: a record, containing multiple fields
Field: a field                                                              <==>  Field: a column
Hits: the query result set, made up of matching Documents                   <==>  RecordSet: the query result set, made up of multiple Records
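To make this mapping concrete, here is a minimal sketch of turning one database-style record into a Lucene Document, using the Field factory methods described later in this article (the field names and values are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RecordToDocument {
    // one "record" becomes one Document; each column becomes a Field
    public static Document convert(String id, String title, String path) {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));        // like a primary-key column: indexed as-is, stored
        doc.add(Field.Text("title", title));     // like a varchar column: tokenized, indexed, stored
        doc.add(Field.UnIndexed("path", path));  // stored only, never searched
        return doc;
    }
}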
Full-text search ≠ like "%keyword%"
Thick books often come with a keyword index at the back (for example: Beijing: pages 12, 34; Shanghai: pages 3, 77 ...) to help readers find pages on a topic more quickly. The principle behind database indexes, which can greatly speed up queries, is the same: imagine how many times faster it is to look something up through the index at the back of a book than to scan the contents page by page. Another reason indexes are efficient is that they are kept sorted; for a retrieval system, the core problem is a sorting problem.

Because database indexes are not designed for full-text search, they do not help at all when you query with like "%keyword%": such a query degenerates into a page-by-page table scan. So for a database service that has to handle fuzzy queries, LIKE is a serious threat to performance. And if you need to fuzzy-match several keywords at once, like "%keyword1%" and like "%keyword2%" ..., you can imagine the efficiency.

So the key to building an efficient retrieval system is to build an inverted (reverse) index, much like a book's keyword index. While the data source (say, a collection of articles) is stored sequentially, a separate, sorted keyword list stores the keyword ==> article mapping. Using this mapping, [keyword ==> the IDs of the articles in which the keyword appears, the number of occurrences (possibly even positions: start offset, end offset), frequency], the retrieval process turns a fuzzy query into a logical combination of several exact index lookups. This greatly improves the efficiency of multi-keyword queries, so the full-text search problem again boils down to a sorting problem.

From this it is clear that a fuzzy query, compared to a database's exact query, is a rather open-ended problem, which is why most databases offer only limited support for full-text retrieval. Lucene's core contribution is that it implements, through a special index structure, the full-text indexing mechanism that traditional databases are not good at, and provides extension interfaces that make it easy to customize for different applications.
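As a minimal illustration of the inverted-index idea described above (plain Java; this is a toy model, not Lucene's actual index structure, and it omits the per-posting frequencies and positions mentioned above):

import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// A toy inverted index: keyword ==> sorted set of document IDs.
public class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
        }
    }

    // a multi-keyword AND query becomes an intersection of posting lists:
    // several exact lookups combined logically, exactly as described above
    public SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(), new TreeSet<Integer>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}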
The table below compares Lucene with database fuzzy queries:
Lucene full-text indexing engine vs. database:

Index:
  Lucene: builds an inverted index over the data in the data source through full-text indexing.
  Database: for like queries, the traditional index simply does not apply; the data has to be scanned record by record for grep-style fuzzy matching, which is orders of magnitude slower than an indexed search.

Matching:
  Lucene: matches by term, and can support Chinese and other non-English languages through implementations of the language-analysis interface.
  Database: like "%net%" will also match "Netherlands"; a multi-keyword fuzzy match such as like "%com%net%" cannot match text with the word order reversed, e.g. xxx.net...xxx.com.

Relevance:
  Lucene: has a relevance algorithm and ranks better-matching (more similar) results first.
  Database: no control over the degree of match: a record in which a keyword appears 5 times and one in which it appears once are returned the same way.

Result output:
  Lucene: outputs the 100 best-matching results first via a special algorithm; the result set is read in small buffered batches.
  Database: returns the entire result set; when there are very many matches (say tens of thousands of entries), a large amount of memory is needed to hold these temporary result sets.

Customizability:
  Lucene: through the various language-analysis interfaces, indexing rules can easily be customized to the application's needs (including Chinese support).
  Database: no interface, or a complex one; not customizable.

Conclusion:
  Lucene: high-load fuzzy-query applications, applications that need custom fuzzy-query rules, large amounts of data to index.
  Database: low usage, simple fuzzy-matching rules, or small amounts of data to fuzzy-query.
The biggest difference between full-text search and ordinary database applications is this: the goal is for the top 100 most relevant results to satisfy the needs of more than 98% of users.
Lucene's innovation:

Most search (and database) engines use a B-tree structure to maintain their index, and updating the index causes a great deal of I/O. Lucene improves on this slightly: instead of maintaining a single index file, it keeps creating new small index files as the index grows, and then periodically merges these small index files into the original large index (the batch size can be tuned for different update strategies). This improves indexing efficiency without hurting retrieval efficiency.
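A minimal sketch of triggering such a merge explicitly, assuming the early Lucene 1.x API used elsewhere in this article (optimize() merges all accumulated segments into one):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MergeIndex {
    public static void main(String[] args) throws Exception {
        // open an existing index (false = append rather than create)
        IndexWriter writer = new IndexWriter(args[0], new SimpleAnalyzer(), false);
        writer.optimize();   // merge all small segment files into one large segment
        writer.close();
    }
}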
A comparison of Lucene and some other full-text retrieval systems/applications:

Incremental and batch indexing:
  Lucene: supports incremental indexing (append) as well as batch indexing of large data volumes, and the interfaces are designed to optimize both batch indexing and small incremental updates.
  Others: many systems support only batch indexing; sometimes adding a little data to the source forces a complete rebuild.

Data source:
  Lucene: does not define a specific data source, only a document structure, so it adapts very flexibly to all kinds of applications (as long as a suitable front-end converter turns the data source into that structure).
  Others: many systems handle only web pages and lack the flexibility to index documents in other formats.

Indexed content:
  Lucene: a document is composed of several fields; you can even control which fields are indexed and which are not, and among the indexed fields distinguish the ones that need tokenization from the ones that do not:
    - fields indexed with tokenization, e.g. the title and article-body fields
    - fields indexed without tokenization, e.g. author/date fields
  Others: lack this generality and tend to index whole documents.

Language analysis:
  Lucene: implemented through different subclasses of the language Analyzer:
    - stop words such as "an" can be filtered out
    - Western-language stemming: jumps / jumped / jumper are all reduced to jump for indexing/retrieval
    - non-English support: an interface is available for indexing Asian languages and Arabic
  Others: lack a common language-analysis interface.

Query analysis:
  Lucene: through the query-analysis interface you can customize your own query syntax rules, for example the + / - and AND / OR relations among multiple keywords.

Concurrent access:
  Lucene: supports multi-user use.
On the problem of word segmentation in Asian languages (word segment)
For Chinese, full-text indexing must first solve a language-analysis problem. In English, the words in a sentence are naturally separated by spaces, but in Asian languages such as Chinese, Japanese, and Korean, the characters of a sentence run together with no delimiters. So if the text is to be indexed by "word", how to cut the sentence into words is first of all a big problem.

First of all, single characters (unigrams) certainly cannot be used as the index unit; otherwise a query for "上海" (Shanghai) would also match text that merely contains the character "海" (sea).

But take the phrase "北京天安门" (Beijing Tiananmen): how should the computer split it according to the habits of the Chinese language? "北京 | 天安门" or "北京 | 天安 | 门"? For a computer to split according to language habits, it usually needs a fairly rich lexicon to identify the words in a sentence accurately.
The other solution is an automatic segmentation algorithm: cut the sentence into overlapping two-character pairs, the bigram way, for example:

"北京天安门" ==> "北京 京天 天安 安门".

Then at query time, whether the query is "北京" (Beijing) or "天安门" (Tiananmen), the query phrase is segmented by the same rules, into "北京" and "天安 安门" respectively, and the multiple keywords are combined with an AND relation, so the query still maps correctly onto the corresponding index. This approach applies equally to other Asian languages such as Korean and Japanese.

The biggest advantage of automatic segmentation is that there is no lexicon maintenance cost and it is simple to implement; the disadvantage is lower index efficiency. But for small and medium applications, bigram segmentation is sufficient. A bigram index is generally about the same size as the source text, whereas for English an index file is generally only 30%-40% of the original size.
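A minimal sketch of this bigram cutting in plain Java (a standalone illustration, not an implementation of Lucene's Tokenizer interface; it also ignores characters outside the Basic Multilingual Plane):

import java.util.ArrayList;
import java.util.List;

public class BigramSegmenter {
    // cut a CJK string into overlapping two-character terms:
    // "北京天安门" ==> ["北京", "京天", "天安", "安门"]
    public static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("北京天安门")); // [北京, 京天, 天安, 安门]
    }
}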
Automatic segmentation vs. lexicon-based segmentation:

Implementation:
  Automatic segmentation: very simple to implement.
  Lexicon-based: complex to implement.

Queries:
  Automatic segmentation: adds some complexity to query analysis.
  Lexicon-based: well suited to implementing fairly complex query syntax rules.

Storage efficiency:
  Automatic segmentation: high index redundancy; the index is almost as large as the original text.
  Lexicon-based: efficient; the index is about 30% of the original size.

Maintenance cost:
  Automatic segmentation: no lexicon maintenance cost.
  Lexicon-based: very high lexicon maintenance cost; Chinese, Japanese, Korean and other languages each need their own lexicon, which also has to include word-frequency statistics and so on.

Applicable domains:
  Automatic segmentation: embedded systems (limited runtime resources); distributed systems (no lexicon-synchronization problem); multilingual environments (no lexicon maintenance cost).
  Lexicon-based: professional search engines that care about query and storage efficiency.
At present, the language-analysis algorithms of the larger search engines are generally based on a combination of the above 2 mechanisms. For more on Chinese language-analysis algorithms, search Google for the keywords "word segment search" to find more related material.
Installation and use
Download: http://jakarta.apache.org/lucene/
Note: some of the more complex lexical analysis in Lucene is generated with JavaCC (Java Compiler Compiler, a pure-Java lexical analyzer generator), so if you build from source, or want to modify the QueryParser or customize your own lexical analyzer, you will also need to download JavaCC from http://www.experimentalstuff.com/technologies/javacc/.
The composition of Lucene: for external applications, the indexing module (index) and the retrieval module (search) are the main entry points.
org.apache.lucene.search       search entry point
org.apache.lucene.index        indexing entry point
org.apache.lucene.analysis     language analyzers
org.apache.lucene.queryParser  query analyzer
org.apache.lucene.document     storage structure
org.apache.lucene.store        underlying IO/storage structure
org.apache.lucene.util         some common data structures
A simple example demonstrating how Lucene is used:

Indexing process: read file names (one or more) from the command line, store two fields, the file path (path field) and the contents (body field), and full-text index the contents. The unit of indexing is the Document object, and each Document object contains multiple Field objects. Depending on a field's properties and output requirements, you can choose different indexing/storage rules for it, as follows:
Method                                       Tokenized  Indexed  Stored  Purpose
Field.Text(String name, String value)        Yes        Yes      Yes     tokenize, index, and store; e.g. the title and body fields
Field.Text(String name, Reader value)        Yes        Yes      No      tokenize and index but do not store; e.g. meta information that is not returned for display but still needs to be searchable
Field.Keyword(String name, String value)     No         Yes      Yes     index without tokenizing, and store; e.g. date fields
Field.UnIndexed(String name, String value)   No         No       Yes     do not index, only store; e.g. the file path
Field.UnStored(String name, String value)    Yes        Yes      No      full-text index but do not store
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexFiles {
    // usage: IndexFiles [index output directory] [files to index] ...
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        // construct a new index writer with the given language analyzer
        // (the 3rd argument: false = append to an existing index, true = create a new one)
        IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
        for (int i = 1; i < args.length; i++) {
            System.out.println("Indexing file " + args[i]);
            InputStream is = new FileInputStream(args[i]);
            // construct a Document object containing 2 fields:
            // a path field, not indexed, only stored;
            // a body field, full-text indexed (not stored, since it comes from a Reader)
            Document doc = new Document();
            doc.add(Field.UnIndexed("path", args[i]));
            doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
            // write the document into the index
            writer.addDocument(doc);
            is.close();
        }
        // close the index writer
        writer.close();
    }
}
From the indexing process you can see:

The language analyzer has an abstract interface, so language analysis (Analyzer) is customizable. Lucene ships with 2 fairly general analyzers, SimpleAnalyzer and StandardAnalyzer, which do not support Chinese by default, so to add segmentation rules for Chinese you need to modify these 2 analyzers.

Lucene does not dictate the format of the data source; it only provides a common structure (the Document object) to receive input for indexing. So the input data source can be a database, a Word document, a PDF document, an HTML document ... as long as you can design a parser/converter that turns the data source into Document objects, it can be indexed.

For indexing large batches of data, you can also improve batch-indexing efficiency by tuning the IndexWriter's file-merge frequency property (mergeFactor).
The retrieval process and displaying results:

A search returns a Hits object through which you can access the Document ==> Field contents.

Assuming the full-text search is on the body field, you can print the path field of each result together with its match score for the query (score).
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class Search {
    public static void main(String[] args) throws Exception {
        String indexPath = args[0], queryString = args[1];
        // a searcher pointing at the index directory
        Searcher searcher = new IndexSearcher(indexPath);
        // query parser: use the same language analyzer as at index time
        Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
        // search results are held in a Hits object
        Hits hits = searcher.search(query);
        // print the path field and the match score of each hit
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
        }
    }
}

Throughout the search process, the language analyzer, the query analyzer, and even the searcher (Searcher) all provide abstract interfaces that can be tailored as needed.
Hacking Lucene
Simplified Query Analyzer
Personally I feel that since Lucene became a Jakarta project, too much effort has been drawn into debugging the increasingly complex QueryParser, most of whose features are unfamiliar to most users. Currently Lucene supports the following syntax:

The connecting logic includes the operators AND, OR, +, -, &&, ||, as well as "phrase queries" and prefix/fuzzy queries for Western languages. Personally I feel that for ordinary applications some of these features are a bit flashy; in practice, query analysis on the level of what Google currently provides is enough for most users. Therefore the QueryParser of earlier Lucene versions remains a better choice.
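If you want to bypass QueryParser entirely, here is a minimal sketch of building a Google-style implicit-AND query by hand, assuming the early 1.x BooleanQuery API in which add() takes required/prohibited flags (the field name and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SimpleAndQuery {
    // build "lucene AND index" by hand: each clause is required (true), not prohibited (false)
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("body", "lucene")), true, false);
        query.add(new TermQuery(new Term("body", "index")), true, false);
        return query;
    }
}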
Adding, modifying, and deleting a specified record (Document)
Lucene provides a mechanism for extending the index, so dynamically growing the index is no problem, while modifying a specified record appears to be possible only by deleting the record and then re-adding it. How do you delete a specified record? The method is simple: at index time, store the record's ID from the data source in a dedicated field, and later use the IndexReader.delete(Term term) method to delete the corresponding Document via that record ID.
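A minimal sketch of such a deletion, assuming the record ID was indexed untokenized (e.g. with Field.Keyword) in a field named "id":

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteRecord {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);   // args[0]: index directory
        // delete every document whose "id" field equals the given record ID
        reader.delete(new Term("id", args[1]));
        reader.close();
    }
}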
Sort by the value of a field
Lucene sorts by its own relevance algorithm (score) by default, but sorting results by the value of some other field is a frequently raised issue on the Lucene development mailing list: many applications coming from databases need sort orders beyond match relevance (score). From the principles of full-text search we know that any search not directly backed by the index ends up very inefficient; if sorting by another field requires accessing the stored fields during the search, the speed drops dramatically, so this is highly undesirable.

But there is a compromise: the only values that can affect result ordering during the search are the docID and score already present in the index. So, instead of sorting on score, you can pre-sort the data source by the desired field before indexing, and then sort results by docID. This avoids sorting outside of Lucene's own search, and avoids accessing field values absent from the index during the search.

What needs to be modified is the HitCollector logic inside IndexSearcher:
...
scorer.score(new HitCollector() {
    private float minScore = 0.0f;
    public final void collect(int doc, float score) {
        if (score > 0.0f &&                      // ignore zeroed buckets
            (bits == null || bits.get(doc))) {   // skip docs not in bits
            totalHits[0]++;
            if (score >= minScore) {
                /* Originally, Lucene puts the docID and the corresponding
                 * match score into the hit list:
                 *   hq.put(new ScoreDoc(doc, score));   // update hit queue
                 * If you use doc or 1/doc instead of score, and the data
                 * source was pre-sorted by the desired field before indexing,
                 * the results come back in docID order, ascending or
                 * descending; you can even blend score and docID in more
                 * complex ways. */
                hq.put(new ScoreDoc(doc, (float) 1 / doc));
                if (hq.size() > nDocs) {         // if hit queue overfull
                    hq.pop();                    // remove lowest in hit queue
                    minScore = ((ScoreDoc) hq.top()).score;  // reset minScore
                }
            }
        }
    }
}, reader.maxDoc());
More general input and output interfaces
Although Lucene does not define a fixed input document format, more and more people have hit on the idea of using a standard intermediate format as Lucene's data-import interface; other data formats, such as PDF, then only need a converter into this intermediate format to be indexed. The intermediate format is mostly XML, and there are no fewer than 4 or 5 similar implementations:
Data source: WORD   PDF   HTML   DB   other
                \    |     |     |   /
                XML intermediate format
                         |
                   Lucene INDEX
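As a hedged sketch, here is one way such a converter could look, assuming a hypothetical intermediate format of the form <doc><field name="title">...</field></doc> (the element and attribute names are made up for illustration, not taken from any of those implementations):

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToDocument {
    // turn one <doc><field name="...">value</field>...</doc> file into a Lucene Document
    public static Document convert(java.io.File xmlFile) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile).getDocumentElement();
        Document doc = new Document();
        NodeList fields = root.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element f = (Element) fields.item(i);
            // tokenize, index, and store each field value
            doc.add(Field.Text(f.getAttribute("name"), f.getTextContent()));
        }
        return doc;
    }
}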
There is currently no good parser for MS Word documents, because unlike the ASCII-based RTF format, Word documents have to be parsed through COM object mechanisms. This is what I found on Google: http://www.intrinsyc.com/products/enterprise_applications.asp

Another option is to convert Word documents into text: http://www.winfield.demon.nl/index.html
Indexing process optimization
Indexing generally falls into 2 cases: small incremental extensions of an existing index, and large-scale index rebuilds. During indexing, Lucene does not rewrite the index file every time a new DOC is added (file I/O is a very resource-consuming thing).

Lucene indexes in memory first and writes to file in batches. The larger this batch interval, the fewer the file writes, but the more memory is used; conversely, less memory is used, but file I/O becomes frequent and indexing slows down noticeably. IndexWriter has a mergeFactor parameter that, once the indexer is constructed, helps you trade memory for fewer file operations according to your application environment. In my experience, the default indexer writes the index once every 20 records; raising MERGE_FACTOR to 50 times that roughly doubles indexing speed.
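A minimal sketch of such tuning, assuming the early 1.x API where mergeFactor is a public field on IndexWriter (the value 1000 is just an illustration):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BatchIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(args[0], new SimpleAnalyzer(), true);
        // buffer more documents in memory between merges:
        // fewer file writes, faster batch indexing, more memory used
        writer.mergeFactor = 1000;
        // ... writer.addDocument(doc) for each document ...
        writer.optimize();  // merge the accumulated segments once at the end
        writer.close();
    }
}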
Search Process Optimization
Lucene supports in-memory indexes: such searches are an order of magnitude faster than file-I/O-based searches.
http://www.onjava.com/lpt/a/3273
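A minimal sketch of an in-memory index, assuming the early RAMDirectory API:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class MemoryIndex {
    public static void main(String[] args) throws Exception {
        Directory ramDir = new RAMDirectory();   // the index lives entirely in memory
        IndexWriter writer = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("body", "full text search with lucene"));
        writer.addDocument(doc);
        writer.close();
        // searches against ramDir involve no file I/O at all
        Searcher searcher = new IndexSearcher(ramDir);
    }
}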
It is also worth minimizing the creation of IndexSearcher objects and caching search results at the front end.

Lucene's full-text-search-oriented optimization is that after the first index lookup it does not read out the content of all matching records (Documents); instead it puts only the IDs of the 100 best-matching results (TopDocs) into the result-set cache and returns. Compare this with database retrieval: for a query matching 10,000 database rows, the database is sure to fetch all the record contents and return them to the application as the result set. So even when the total number of matches is very large, Lucene's result set occupies little memory. Ordinary fuzzy-retrieval applications do not need that many results anyway; the first 100 satisfy more than 90% of retrieval demands.

If the first batch of cached results is used up and later results are needed, the Searcher re-retrieves, generating a cache twice the size of the previous one, and collects the results again. So if you construct one Searcher and read results 1-120, the Searcher actually performs 2 searches: after the first 100 are read the cache is exhausted, so it re-retrieves and builds a 200-entry result cache, and so on to a 400-entry cache, an 800-entry cache. Since these caches can no longer be reached once the Searcher object goes away, you may want to cache the fetched result records yourself, and keep fetches below 100 wherever possible so that the first result cache is fully used and Lucene does not waste multiple retrievals; this also allows the result cache to be tiered.
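A minimal sketch of keeping reads within that first cache:

import org.apache.lucene.search.Hits;

public class FirstPage {
    // read at most the first 100 hits so the Searcher's initial
    // result cache is used and no re-retrieval is triggered
    static void printFirstPage(Hits hits) throws Exception {
        int n = Math.min(hits.length(), 100);
        for (int i = 0; i < n; i++) {
            System.out.println(hits.doc(i).get("path") + " score=" + hits.score(i));
        }
    }
}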
Another feature of Lucene is that it automatically filters out very low-scoring results while collecting them. This is another point where it differs from database applications, which have to return everything the query matched.
Some of my own experiments:

A tokenizer that supports Chinese: there are 2 versions. One is generated with JavaCC and indexes the CJK portion one character per token; the other is rewritten from SimpleTokenizer, producing tokens for runs of letters and digits in English and indexing Chinese character by character by iteration.

An indexer based on an XML data source: XMLIndexer. Any data source that can be converted into the specified XML according to a DTD can then be indexed with XMLIndexer.

Sorting by a field: a searcher that orders results by record index, IndexOrderSearcher. If you need search results sorted by some field, first sort the data source by that field (for example, a PriceField), then index it; using this searcher that orders by record ID, the results are then equivalent to being sorted by that field.
What else can we learn from Lucene?
Lucene is indeed an exemplary instance of object-oriented design.

Every problem is solved behind an extra layer of abstraction, leaving room for later extension and reuse: you can reach new goals by re-implementing one piece, without touching other modules;

The application entry points, Searcher and Indexer, are simple, and call on a series of underlying components that cooperate to complete the indexing and retrieval tasks;

The responsibilities of all the objects are very focused: in the search process, for example, QueryParser analyzes and transforms a query string into a combination of exact queries (Query); the index is read through the low-level index-reading structure IndexReader; and the corresponding scorer scores and sorts the search results; and so on. All the functional modules are highly atomized, so each can be re-implemented without modifying the others.

Beyond the flexible application interface design, Lucene also provides several language analyzer implementations suitable for most applications (SimpleAnalyzer, StandardAnalyzer), which is one of the main reasons new users get started quickly.

These strengths are well worth learning from in future development. As a general-purpose toolkit, Lucene does offer a great deal of convenience to developers who need to embed full-text search into their applications.
In addition, through studying and using Lucene, I also came to understand more deeply why many database optimization guidelines exist, for example:

Index fields as much as possible to improve query speed, but too many indexes slow down updates to database tables; and too many sort conditions are in fact one of the killers of performance.

Many commercial databases provide optimization parameters for large batch inserts; the effect is similar to the indexer's merge_factor.

The 20%/80% principle: returning a large number of results does not equal good quality, especially when the result set is very large; how to optimize the quality of the first few dozen results is often what matters most.

Have the application fetch as small a result set from the database as possible, because random access to a large result set is a very resource-intensive operation even for a big database.