[Reprinted] Lucene.Net getting started tutorial and examples
I came across this very good basic Lucene.Net getting-started tutorial. I am reposting it here to share it with you, and I hope you can put it to use in your own work.
I. Simple Example
// Indexing
private void Index()
{
    IndexWriter writer = new IndexWriter(@"E:\Index", new StandardAnalyzer());
    Document doc = new Document();
    doc.Add(new Field("Text", "Oh yeah, pretty girl.", Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(doc);
    writer.Close();
}
// Searching
private void Search(string words)
{
    IndexSearcher searcher = new IndexSearcher(@"E:\Index");
    Query query = new QueryParser("Text", new StandardAnalyzer()).Parse(words);
    Hits hits = searcher.Search(query);
    for (int i = 0; i < hits.Length(); i++)
        System.Console.WriteLine(hits.Doc(i).GetField("Text").StringValue());
    searcher.Close();
}
II. Getting to Know Lucene
1. What is Lucene?
Lucene is a high-performance, scalable information retrieval toolkit. It is only a class library (the original Lucene is written in Java; Lucene.Net is its .NET port), not a ready-made application. It provides simple, easy-to-use, yet very powerful APIs, on top of which you can quickly build powerful search programs (even search engines). The latest version at the time of writing is 2.9.2.1.
2. What is an index?
To achieve fast search, Lucene first stores the data to be processed in a data structure called an inverted index. How should we understand an inverted index? Simply put, an inverted index is optimized not to answer "which words does this document contain?" but to quickly answer "which documents contain the word XX?". For example, given Doc1 = "quick brown fox" and Doc2 = "quick red fox", the inverted index records quick → {Doc1, Doc2}, brown → {Doc1}, red → {Doc2}, fox → {Doc1, Doc2}, so the second question becomes a single lookup. Just as a book is given an index so that it can be consulted quickly, Lucene builds an optimized index file for the data to be searched; this process is called "indexing".
3. Core Lucene classes
Indexing: IndexWriter, Directory, Analyzer, Document, Field
Searching: IndexSearcher, Term, Query, TermQuery, Hits
III. Indexing
1. Flow of the indexing process
Note: the Lucene indexing process has three main stages: converting the data into text, analyzing the text, and saving the analyzed text into the index library.
2. Basic index operations
2.1 Adding to the index
Document
Field (understand the Field parameters; see the sketch after this list)
Heterogeneous documents (documents in the same index do not have to share the same set of fields)
Appending fields (a document may hold several fields with the same name)
Incremental indexing
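A minimal sketch of how the Field.Store and Field.Index parameters combine when adding a document (the field names and values here are hypothetical, and an open IndexWriter named writer is assumed):
Document doc = new Document();
// Tokenized and stored: short text that must be both searched and displayed in results.
doc.Add(new Field("Title", "Lucene.Net getting started", Field.Store.YES, Field.Index.TOKENIZED));
// Tokenized but not stored: searchable body text whose original is kept elsewhere (e.g. a database).
doc.Add(new Field("Body", "full text of the article", Field.Store.NO, Field.Index.TOKENIZED));
// Stored but not tokenized: exact-match values such as a primary key.
doc.Add(new Field("Id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.AddDocument(doc);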
2.2 Deleting from the index
Deletion is "soft": documents are only marked as deleted, and are physically removed when IndexWriter.Optimize() is called.
IndexReader reader = IndexReader.Open(directory);
// Delete the document with the specified internal number (DocId).
reader.Delete(123);
// Delete all documents containing the specified Term.
reader.Delete(new Term(FieldValue, "Hello"));
// Undo the (soft) deletions.
reader.UndeleteAll();
reader.Close();
2.3 Updating the index
In fact, Lucene has no dedicated update operation:
Update = delete + add
Tip: it is best to delete multiple Document objects in one batch and then add their replacements in another batch; this is always faster than alternating single delete and add operations.
// To add new data to an existing index, simply pass false for the create parameter.
Directory directory = FSDirectory.GetDirectory("index", false);
IndexWriter writer = new IndexWriter(directory, analyzer, false);
writer.AddDocument(doc1);
writer.AddDocument(doc2);
writer.Optimize();
writer.Close();
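A hedged sketch of the delete-then-add update pattern, reusing the calls shown above (the "Id" field and the newDoc variable are hypothetical):
// 1. Batch-delete the outdated documents by a key term.
IndexReader reader = IndexReader.Open(directory);
reader.Delete(new Term("Id", "42"));
reader.Close();
// 2. Batch-add the replacement documents, opening the writer with create = false.
IndexWriter writer = new IndexWriter(directory, analyzer, false);
writer.AddDocument(newDoc);   // newDoc is the updated version of the document
writer.Close();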
3. Weighting (boosting)
You can assign weights (boosts) to documents and fields so that they rank higher in the search results. By default, results are sorted by document score: the larger the score, the higher the ranking. The default boost value is 1.
score = score × boost
Using this formula, we can influence the ranking by assigning different weights.
In the following example, different weights are set based on the VIP level.
Document document = new Document();
switch (vip)
{
    case VIP.Gold:      document.SetBoost(2f);   break;
    case VIP.Argentine: document.SetBoost(1.5f); break;
}
As long as the boost is large enough, a matching result can always be pushed to the top; this is essentially the "paid ranking" service offered by Baidu and similar sites.
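Boosts can also be applied to an individual field rather than the whole document. A small sketch, assuming the SetBoost method on Field is available in this Lucene.Net version (the "Title" field and titleText variable are hypothetical):
Field title = new Field("Title", titleText, Field.Store.YES, Field.Index.TOKENIZED);
title.SetBoost(2f);   // terms in the title count twice as much as terms in other fields
document.Add(title);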
4. Directory
Open an existing index library from the specified directory.
private Directory directory = FSDirectory.GetDirectory(@"c:\index", false);
Load the index library into the memory to increase the search speed.
private Directory directory = new RAMDirectory(FSDirectory.GetDirectory(@"c:\index", false));
// or
private Directory directory = new RAMDirectory(@"c:\index");
Note that when the create parameter of FSDirectory.GetDirectory is true, any existing index files in that directory are deleted. You can use IndexReader.IndexExists() to check whether an index already exists before deciding how to open it.
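A small sketch of choosing the create flag safely with IndexReader.IndexExists() (the path is illustrative):
// Only create a brand-new index when none exists yet; otherwise open the existing one,
// so that the existing index files are not wiped out.
bool create = !IndexReader.IndexExists(@"c:\index");
Directory directory = FSDirectory.GetDirectory(@"c:\index", create);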
5. Merging index libraries
Merge directory1 into directory2.
Directory directory1 = FSDirectory.GetDirectory("index1", false);
Directory directory2 = FSDirectory.GetDirectory("index2", false);
IndexWriter writer = new IndexWriter(directory2, analyzer, false);
writer.AddIndexes(new Directory[] { directory1 });
Console.WriteLine(writer.DocCount());
writer.Close();
6. Optimizing the index
6.1 Optimizing is very simple: a single call to writer.Optimize() does the job. The optimization itself is expensive and slows indexing down while it runs, but it improves subsequent search performance. Do not call Optimize() all the time; once, after a batch of changes, is enough.
6.2 When adding documents to an FSDirectory-based index in batches, increasing the merge factor (mergeFactor) and minMergeDocs helps improve performance and reduces indexing time.
IndexWriter writer = new IndexWriter(directory, analyzer, true);
writer.maxFieldLength = 1000;   // maximum number of terms indexed per field
writer.mergeFactor = 1000;
writer.minMergeDocs = 1000;
for (int i = 0; i < 10000; i++)
{
    // add documents...
}
writer.Optimize();
writer.Close();
With Lucene you can make full use of the machine's hardware resources during index creation to improve indexing efficiency. When you need to index a large number of files, you will notice that the bottleneck of the indexing process is writing the index files to disk. To mitigate this, Lucene keeps a buffer in memory. But how do we control that buffer? Fortunately, Lucene's IndexWriter class provides three parameters to adjust the buffer size and the frequency with which index data is written to disk.
(1) Merge factor (mergeFactor)
This parameter determines how many documents an in-memory index segment may hold and how often on-disk segments are merged into a larger segment. For example, with a merge factor of 10, once 10 documents have accumulated in memory they are written out as a new segment on disk, and once 10 segments exist on disk they are merged into a new, larger segment. The default value is 10, which is quite inappropriate when the number of documents to index is very large; for batch indexing, a larger value gives better indexing performance.
(2) Minimum merge documents (minMergeDocs)
This parameter also affects indexing performance. It determines how many documents must accumulate in memory before they are written back to disk. The default value is 10. If you have enough memory, setting this value as large as possible will significantly improve indexing performance.
(3) Maximum merge documents (maxMergeDocs)
This parameter determines the maximum number of documents an index segment may contain. The default is Integer.MAX_VALUE. Setting it to a relatively large value can improve indexing efficiency and search speed; since the default is already the maximum integer value, there is generally no need to change it.
7. Indexing large volumes of data (concurrency, multithreading, and locking)
7.1 Multithreaded indexing
Shared objects (note: an IndexWriter or IndexReader object can be shared by multiple threads)
Clever use of RAMDirectory (see the sketch below)
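A hedged sketch of the usual RAMDirectory trick: each batch (or thread) builds a small index in memory first, and it is then merged into the on-disk index in one AddIndexes call, the same call used in section 5 above (docs is an illustrative collection of documents):
// Index into memory first – no per-document disk I/O.
Directory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
foreach (Document doc in docs)
    ramWriter.AddDocument(doc);
ramWriter.Close();
// Then merge the in-memory index into the on-disk index in a single operation.
IndexWriter fsWriter = new IndexWriter(FSDirectory.GetDirectory("index", false), analyzer, false);
fsWriter.AddIndexes(new Directory[] { ramDir });
fsWriter.Optimize();
fsWriter.Close();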
7.2 Locking
Lucene uses file-based locks:
the write.lock file
Locking can be disabled (disableLuceneLocks = true), but only when you are sure no concurrent modification can occur.
7.3 Concurrent access rules
Any number of read-only operations may run at the same time.
Read-only operations may also run, in any number, while the index is being modified.
At any given time, only one index-modifying operation is allowed.
IV. Searching
1. IndexSearcher
Searches are performed through an IndexSearcher.
There are two ways to construct an IndexSearcher: from a Directory object or from a file path; the former is recommended (see the sketch below).
The Search() method runs the query.
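A short sketch of the two ways to construct an IndexSearcher (the path is illustrative, and query is assumed to have been built already):
// From a Directory object (recommended).
Directory directory = FSDirectory.GetDirectory(@"c:\index", false);
IndexSearcher searcher = new IndexSearcher(directory);
// Or directly from a file path.
IndexSearcher searcher2 = new IndexSearcher(@"c:\index");
// Either way, Search() runs the query.
Hits hits = searcher.Search(query);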
2. Query
2.1 create a Query object
Use QueryParser to construct a Query object. (Note: QueryParser converts a query expression into Lucene's built-in Query types.)
Several common built-in types: TermQuery, RangeQuery, PrefixQuery, and BooleanQuery.
2.2 The powerful QueryParser
The ToString() method of the Query class
Boolean queries (AND, OR, NOT). Examples: a AND b (+a +b), a OR b (a b), a AND NOT b (+a -b)
Grouping with parentheses (). Example: (a OR b) AND c
Field selection. Example: tag:Beauty
Range queries with [TO] and {TO}. Examples: price:[100 TO 200], price:{100 TO 200}
......
(Note: powerful, but not generally recommended when queries are built in code; the sketch below contrasts it with constructing Query objects directly.)
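A short sketch contrasting the two approaches (field names and terms are illustrative): letting QueryParser interpret a syntax string, versus building the equivalent Query objects in code, which is usually preferable when the conditions come from the program rather than from user input.
// 1. Parse a user-typed expression: (beauty OR girl) AND tag:pretty
Query q1 = new QueryParser("Text", new StandardAnalyzer()).Parse("(beauty OR girl) AND tag:pretty");
// 2. Build the equivalent query programmatically.
BooleanQuery inner = new BooleanQuery();
inner.Add(new TermQuery(new Term("Text", "beauty")), BooleanClause.Occur.SHOULD);
inner.Add(new TermQuery(new Term("Text", "girl")), BooleanClause.Occur.SHOULD);
BooleanQuery q2 = new BooleanQuery();
q2.Add(inner, BooleanClause.Occur.MUST);
q2.Add(new TermQuery(new Term("tag", "pretty")), BooleanClause.Occur.MUST);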
3. Hits
3.1 Access search results using Hits objects
3.2 Methods of the Hits class (see the sketch after this list)
Length() – the number of documents in the Hits result set
Doc(n) – the document ranked n-th
Id(n) – the internal document id of the n-th ranked document
Score(n) – the normalized score of the n-th ranked document
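A minimal sketch that walks the Hits result set using these methods (the "Text" field comes from the simple example at the top):
Hits hits = searcher.Search(query);
for (int i = 0; i < hits.Length(); i++)
{
    Document doc = hits.Doc(i);                    // the i-th ranked document
    Console.WriteLine("{0}\t{1}\t{2}",
        hits.Id(i),                                // its internal document id
        hits.Score(i),                             // its normalized score
        doc.GetField("Text").StringValue());       // a stored field value
}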
4. Sorting
4.1 Sort by Sort object
Through the SortField constructor parameters we can set the sort field, the sort type, and whether the order is reversed (descending).
Sort sort = new Sort(new SortField(FieldName, SortField.DOC, false));
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query, sort);
4.2 To sort by index order (the document id assigned at indexing time), pass Sort.INDEXORDER as the parameter.
4.3 Multi-field sorting
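A hedged sketch of sorting on more than one criterion (the "Date" field and the SortField constant names are assumptions made for illustration):
// Sort by relevance first, then by a date field in descending order.
Sort sort = new Sort(new SortField[] {
    SortField.FIELD_SCORE,
    new SortField("Date", SortField.STRING, true)   // true = reverse (descending)
});
Hits hits = searcher.Search(query, sort);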
4.4 Effect of sorting on Performance
Sorting has a significant impact on search speed; try not to use multiple sort criteria.
Suggestion: use the default score-based sorting together with a well-designed weighting (boost) mechanism.
5. Filter
Filtering is a mechanism used in Lucene to narrow down the search space.
DateFilter restricts results to those whose specified date field falls within a given time range.
QueryFilter uses the result of one query as the searchable document space for a new query.
Suggestion: a filter post-processes the search results, which can noticeably reduce performance. It is generally better to fold the extra conditions into the query itself with a BooleanQuery.
Example:
For example, search for products dated between 2005-10-1 and 2005-10-30.
Dates and times must be converted before they are added to the index, and the field must be indexed.
// Indexing
document.Add(new Field(FieldDate, DateField.DateToString(date), Field.Store.YES, Field.Index.UN_TOKENIZED));
// ...
// Searching
Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-1"), DateTime.Parse("2005-10-30"));
Hits hits = searcher.Search(query, filter);
Besides dates and times, you can also filter on numbers, for example a price between 100 and 200.
Lucene.Net's NumberTools pads numbers so that they sort correctly as strings; if you need floating-point numbers, refer to its source code.
// Indexing
document.Add(new Field(FieldNumber, NumberTools.LongToString((long)price), Field.Store.YES, Field.Index.UN_TOKENIZED));
// ...
// Searching
Filter filter = new RangeFilter(FieldNumber, NumberTools.LongToString(100L), NumberTools.LongToString(200L), true, true);
Hits hits = searcher.Search(query, filter);
Use Query as the filter condition.
QueryFilter filter = new QueryFilter(QueryParser.Parse("name2", FieldValue, analyzer));
We can also use FilteredQuery for multi-condition filtering.
Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-10"), DateTime.Parse("2005-10-15"));
Filter filter2 = new RangeFilter(FieldNumber, NumberTools.LongToString(11L), NumberTools.LongToString(13L), true, true);
Query query = QueryParser.Parse("name*", FieldName, analyzer);
query = new FilteredQuery(query, filter);
query = new FilteredQuery(query, filter2);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);
6. Multi-field search
MultiFieldQueryParser searches across several fields at once.
A field's weight (boost) affects its priority in scoring, not the order in which the fields are used.
Query query = MultiFieldQueryParser.Parse("name*", new string[] { FieldName, FieldValue }, analyzer);
IndexReader reader = IndexReader.Open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);
7. Combined search
In addition to letting QueryParser.Parse interpret a complex search expression, you can also combine multiple Query objects yourself.
Query query1 = new TermQuery(new Term(FieldValue, "name1"));       // exact term
Query query2 = new WildcardQuery(new Term(FieldName, "name*"));    // wildcard
Query query3 = new PrefixQuery(new Term(FieldName, "name1"));      // prefix, i.e. FieldName:name1*
Query query4 = new RangeQuery(new Term(FieldNumber, NumberTools.LongToString(11L)),
                              new Term(FieldNumber, NumberTools.LongToString(13L)), true);   // range query
Query query5 = new FilteredQuery(query, filter);                   // query with a filter attached (query and filter come from the FilteredQuery example above)
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.Add(query1, BooleanClause.Occur.MUST);
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(booleanQuery);
8. Distributed search
We can use MultiReader or MultiSearcher to search for multiple index libraries.
MultiReader reader = new MultiReader(new IndexReader[] {
    IndexReader.Open(@"c:\index"),
    IndexReader.Open(@"\\server\index")
});
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);
Or
IndexSearcher searcher1 = new IndexSearcher(reader1);
IndexSearcher searcher2 = new IndexSearcher(reader2);
MultiSearcher searcher = new MultiSearcher(new Searchable[] { searcher1, searcher2 });
Hits hits = searcher.Search(query);
You can also use ParallelMultiSearcher to query the sub-indexes in parallel on multiple threads, as sketched below.
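A minimal sketch, reusing the two sub-searchers from the previous example:
// Same construction as MultiSearcher, but the sub-searchers are queried on separate threads.
ParallelMultiSearcher searcher = new ParallelMultiSearcher(new Searchable[] { searcher1, searcher2 });
Hits hits = searcher.Search(query);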
9. Displaying the query as a syntax string
After combining many search conditions, we may want to see what the equivalent query syntax string looks like.
BooleanQuery query = new BooleanQuery();
query.Add(query1, true, false);
query.Add(query2, true, false);
// ...
Console.WriteLine("Syntax: {0}", query.ToString());
Output:
Syntax: +(name:name* value:name*) +number:[0000000000000b TO 0000000000000d]
V. Word Segmentation (Analysis)
1. What is an Analyzer?
Analysis, in Lucene, is the process of converting field text into the most basic unit of indexing, the Term.
2. Built-in analyzers
KeywordAnalyzer
SimpleAnalyzer
StopAnalyzer
WhitespaceAnalyzer
StandardAnalyzer (most powerful)
3. Chinese Word Segmentation
There is no official built-in Chinese word segmentation. You can choose a third-party open-source Chinese segmenter, such as the pangu segmenter.
Download the example source code.
PS: The example program uses Lucene.Net 2.9.2.1. The code in this article may not be compatible with the latest version; when in doubt, follow the example program.
The Chinese segmenter used in the example is pangu; its official site is http://pangusegment.codeplex.com/