A Simplified Query Parser
Personally, I feel that since Lucene became a Jakarta project, too much effort has gone into debugging an increasingly complex QueryParser, much of whose syntax is unfamiliar to most users. Currently, Lucene supports the following syntax:
Query  ::= (Clause)*
Clause ::= ["+", "-"] [<TERM> ":"] (<TERM> | "(" Query ")")
On top of this there is logic for AND, OR, +, -, &&, ||, plus phrase queries and prefix/fuzzy queries for Western text. These features strike me as somewhat flashy; for most users, a Google-style query syntax is quite enough. Therefore, the QueryParser from earlier versions of Lucene is still a good choice.
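For illustration, here is a minimal sketch of parsing such a query with the QueryParser API of the Lucene 1.x era this article describes; the field name "body" and the query string are just examples:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        // "+java -jakarta title:(lucene search)": required/prohibited terms and a field clause
        Query q = QueryParser.parse("+java -jakarta title:(lucene search)",
                                    "body", new StandardAnalyzer());
        System.out.println(q.toString("body"));   // print the parsed query structure
    }
}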
Add, modify, and delete a specified record (document)
Lucene provides an index extension mechanism, so dynamically growing an index is not a problem. Modifying a specified record, however, seems possible only by deleting the old record and then re-adding the new version. How do you delete a specified record? The method is simple: when indexing, store the record ID from the data source in a dedicated field, and later use the IndexReader.delete(Term term) method to remove the corresponding Document by that record ID.
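A minimal sketch of this delete-then-re-add update, assuming Lucene 1.x-style APIs; the field name "recordId", the index path, and the analyzer choice are assumptions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateRecord {
    public static void update(String indexPath, String id, Document newDoc) throws Exception {
        // Step 1: delete the old document whose "recordId" field matches the given ID
        IndexReader reader = IndexReader.open(indexPath);
        reader.delete(new Term("recordId", id));
        reader.close();

        // Step 2: re-add the new version of the document (create=false: append to the existing index)
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.addDocument(newDoc);
        writer.close();
    }
}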
Sort by a Field Value
By default, Lucene sorts results by its own relevance algorithm (score). Sorting results by some other field, however, is a question raised frequently on the Lucene development mailing list: many database-backed applications need to sort by something other than the match score. From the principles of full-text retrieval, we know that any part of the search process not covered by the index will be very inefficient; if sorting by another field meant accessing stored fields during the search, speed would drop dramatically, so that approach is highly undesirable.
There is, however, a compromise: during the search, only the doc ID and score, which are already available in the index, can influence the sort order. So to sort by something other than score, you can sort the data source by that field in advance and then sort the search results by doc ID. This avoids sorting the results outside of Lucene's search and avoids accessing field values that are not in the index during the search.
To do this, we need to modify the HitCollector used inside IndexSearcher:
...
scorer.score(new HitCollector() {
    private float minScore = 0.0f;
    public final void collect(int doc, float score) {
        if (score > 0.0f &&                          // ignore zeroed buckets
            (bits == null || bits.get(doc))) {       // skip docs not in bits
            totalHits[0]++;
            if (score >= minScore) {
                /* Originally, Lucene put the doc ID and match score into the result hit queue:
                 *   hq.put(new ScoreDoc(doc, score));   // update hit queue
                 * If doc or 1/doc is used in place of score, the results are sorted by doc ID
                 * in ascending or descending order.
                 * Assuming the data source was already sorted by some field before indexing,
                 * sorting the results by doc ID is equivalent to sorting by that field.
                 * Score and doc ID can even be combined for more complex orderings.
                 */
                hq.put(new ScoreDoc(doc, (float) 1 / doc));
                if (hq.size() > nDocs) {             // if hit queue overfull
                    hq.pop();                        // remove lowest in hit queue
                    minScore = ((ScoreDoc) hq.top()).score;   // reset minScore
                }
            }
        }
    }
}, reader.maxDoc());
A More Universal Input and Output Interface
Although Lucene does not define a required input document format, more and more people are converging on a standard intermediate format as Lucene's data-import interface; other data, such as PDF files, then only need a parser that converts them into this standard intermediate format before being indexed. The intermediate format is mainly XML, and four or five similar implementations already exist:
Data source:  WORD    PDF    HTML    DB    other
                 \      \      |     /     /
                    XML intermediate format
                              |
                         Lucene INDEX
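As an illustration of the XML intermediate format idea, here is a minimal sketch, not any of the existing implementations; the XML layout (<records>/<record>/<title>/<body>), file name, field names, and index path are all assumptions:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class XmlToIndex {
    public static void main(String[] args) throws Exception {
        // Assumed input: <records><record><title>..</title><body>..</body></record>...</records>
        org.w3c.dom.Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("records.xml"));
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        NodeList records = xml.getElementsByTagName("record");
        for (int i = 0; i < records.getLength(); i++) {
            Element rec = (Element) records.item(i);
            Document doc = new Document();   // one Lucene Document per <record> element
            doc.add(Field.Text("title", rec.getElementsByTagName("title").item(0).getTextContent()));
            doc.add(Field.Text("body", rec.getElementsByTagName("body").item(0).getTextContent()));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}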
At present, there is no parser for MS Word documents. Unlike ASCII-based RTF documents, Word documents have to be parsed through COM object mechanisms. This is what I found on Google: http://www.intrinsyc.com/products/enterprise_applications.asp
Another approach is to convert Word documents into plain text first: http://www.winfield.demon.nl/index.html
Index Process Optimization
Indexing generally falls into two cases: incremental expansion of the index in small batches, and rebuilding the index in bulk. During indexing, Lucene does not rewrite the index files every time a new document is added (file I/O is a very resource-consuming operation).
Lucene builds the index in memory first and writes to disk in batches. The larger the batch interval, the fewer file writes, but the more memory is used; conversely, a smaller interval uses less memory but causes frequent file I/O and slows indexing down. IndexWriter has a MERGE_FACTOR parameter that lets you trade memory for fewer file operations according to your application environment. In my experience, the default writes to disk once every 20 records indexed; each time MERGE_FACTOR is increased 50-fold, indexing speed roughly doubles.
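A minimal sketch of bulk indexing with a raised merge factor, assuming Lucene 1.x where mergeFactor is a public field on IndexWriter (later versions use setMergeFactor()); the index path, the value 1000, and the field name are assumptions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.mergeFactor = 1000;   // flush/merge far less often than the default, at the cost of memory
        for (int i = 0; i < 100000; i++) {
            Document doc = new Document();
            doc.add(Field.Text("body", "content of record " + i));
            writer.addDocument(doc);
        }
        writer.optimize();   // merge segments for faster searching once bulk indexing is done
        writer.close();
    }
}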
Search Process Optimization
Lucene supports in-memory indexes: such searches are much faster than searches that have to go through file I/O.
http://www.onjava.com/lpt/a/3273
It is also important to minimize how often new IndexSearcher instances are created and to cache search results at the front end.
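A minimal sketch combining both ideas, assuming Lucene 1.x APIs (a RAMDirectory that copies an on-disk index into memory, and a single shared IndexSearcher); the index path is an assumption:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class SearcherHolder {
    private static IndexSearcher searcher;

    // Lazily create one shared searcher over an in-memory copy of the index,
    // instead of constructing a new IndexSearcher for every query.
    public static synchronized IndexSearcher getSearcher() throws Exception {
        if (searcher == null) {
            RAMDirectory ramDir = new RAMDirectory("/path/to/index");   // copies the index into RAM
            searcher = new IndexSearcher(ramDir);
        }
        return searcher;
    }
}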
Lucene's optimization for full-text search is that after the first index search it does not read the contents of all matching records (Documents); instead it puts only the IDs of the top 100 best-matching results (TopDocs) into a result cache and returns. Compare this with database retrieval: for a result set of 10,000 records, the database has to fetch all of them before returning the result set to the application. So even when the total number of matches is huge, Lucene's result set does not take up much memory. For typical fuzzy-search applications, that many results are never needed; the first 100 entries already satisfy more than 90% of search requirements.
If the first batch of cached results is used up and later results are needed, the Searcher searches again and builds a cache twice the size of the previous one, then fetches again. So if a Searcher has to return results beyond the first 100, it actually performs two search passes: once the first 100 cached results are exhausted, it searches again and builds a 200-entry cache, and so on with 400-entry and 800-entry caches. Since these caches are lost whenever the Searcher object is discarded, you may want to cache result records yourself; keep such a cache below 100 entries so that Lucene's first-level result cache is fully used, Lucene is not forced into repeated searches, and results can be cached in layers.
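A minimal sketch of staying within that first 100-entry cache, using the Hits API of the Lucene 1.x era; the index path, field names, and query string are assumptions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TopHitsOnly {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Query query = QueryParser.parse("lucene", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        int n = Math.min(hits.length(), 100);   // stay within the first cached batch of results
        for (int i = 0; i < n; i++) {
            System.out.println(hits.doc(i).get("title") + " score=" + hits.score(i));
        }
        searcher.close();
    }
}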
Another feature of Lucene is that results with very low match scores are automatically filtered out as results are collected. This, too, differs from database applications, which have to return every record matching the query.
Some of my attempts:
- A Tokenizer that supports Chinese: there are two versions. One is generated with JavaCC and indexes the CJK part one character per token; the other is rewritten from SimpleTokenizer, treating runs of English letters and digits as tokens and indexing Chinese characters one by one (see the rough sketch after this list).
- An XML-based data source indexer: XMLIndexer. Any data source that can be converted into the specified XML (according to the DTD) can be indexed with XMLIndexer.
- Sorting by a field value: a Searcher that returns results in record-index order, IndexOrderSearcher. If you want search results sorted by a field, first sort the data source by that field (for example, a price field), then build the index, and then search with this Searcher that returns results in record ID order; the result is equivalent to sorting by that field.
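A rough sketch in the spirit of that second Tokenizer, not the author's actual code, assuming the Lucene 1.x analysis API (Tokenizer with a protected input Reader and a Token next() method): runs of ASCII letters and digits form one token, while each CJK character becomes its own token. The class name and all details are assumptions.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class CjkCharTokenizer extends Tokenizer {
    private int offset = 0;       // current character position in the stream
    private int pushback = -2;    // -2 means "no pushed-back character"

    public CjkCharTokenizer(Reader reader) {
        this.input = reader;      // "input" is the protected Reader field on Tokenizer
    }

    private int read() throws IOException {
        if (pushback != -2) { int c = pushback; pushback = -2; return c; }
        return input.read();
    }

    public Token next() throws IOException {
        StringBuffer sb = new StringBuffer();
        int start = offset;
        int c;
        while ((c = read()) != -1) {
            char ch = (char) c;
            if (isCjk(ch)) {
                if (sb.length() > 0) {         // finish the pending ASCII run first
                    pushback = c;
                    break;
                }
                offset++;
                return new Token(String.valueOf(ch), offset - 1, offset);   // one CJK char per token
            } else if (ch < 128 && Character.isLetterOrDigit(ch)) {
                sb.append(Character.toLowerCase(ch));   // accumulate an ASCII letter/digit run
                offset++;
            } else {                            // separator: emit any pending token
                offset++;
                if (sb.length() > 0) break;
                start = offset;                 // skip leading separators
            }
        }
        if (sb.length() == 0) return null;      // end of stream, nothing pending
        return new Token(sb.toString(), start, start + sb.length());
    }

    private static boolean isCjk(char ch) {
        return Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }
}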