"Reprint" Lucene.Net Introductory tutorial and examples

Last Update:2014-08-19 Source: Internet

Author: User


I came across this very good introductory Lucene.Net tutorial and am reposting it here for everyone to learn from.
I hope you can put it to use in your own work.

One. Simple example

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Index: write one document with a single stored, tokenized field.
private void Index()
{
    IndexWriter writer = new IndexWriter(@"E:\Index", new StandardAnalyzer());
    Document doc = new Document();
    doc.Add(new Field("Text", "Oh yes, beautiful girl.", Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(doc);
    writer.Close();
}

// Search: parse the user's words against the "Text" field and print each hit.
private void Search(string words)
{
    IndexSearcher searcher = new IndexSearcher(@"E:\Index");
    Query query = new QueryParser("Text", new StandardAnalyzer()).Parse(words);
    Hits hits = searcher.Search(query);
    for (int i = 0; i < hits.Length(); i++)
        System.Console.WriteLine(hits.Doc(i).GetField("Text").StringValue());
    searcher.Close();
}

Two. Getting to know Lucene
1. What is Lucene?
Lucene is a high-performance, extensible information retrieval toolkit. It is a Java class library (Lucene.Net is its .NET port), not an off-the-shelf application. It provides an easy-to-use yet very powerful API on which you can quickly build powerful search programs (search engines?). The latest version at the time of writing is 2.9.2.1.

2. What is an index?
To make searches fast, Lucene first stores the data to be processed in a data structure called an inverted index. How should an inverted index be understood? Simply put, instead of answering the question "which words does this document contain?", an inverted index is optimized to quickly answer the question "which documents contain the word xx?". Just as a catalog has to be compiled before it can be searched quickly, Lucene has to build an optimized index file for the data to be searched; this process is called indexing.
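
To make the idea concrete, here is a minimal sketch of an inverted index in plain C#. It does not use Lucene.Net and is not how Lucene stores its index on disk; the class name InvertedIndexSketch, the separator list, and the lower-casing rule are assumptions made only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndexSketch
{
    // Maps each lower-cased word to the sorted set of ids of the documents containing it.
    public static Dictionary<string, SortedSet<int>> Build(IList<string> documents)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        char[] separators = { ' ', ',', '.', '!', '?' };

        for (int docId = 0; docId < documents.Count; docId++)
        {
            string[] words = documents[docId].ToLowerInvariant()
                .Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (string word in words)
            {
                SortedSet<int> postings;
                if (!index.TryGetValue(word, out postings))
                {
                    postings = new SortedSet<int>();
                    index[word] = postings;
                }
                postings.Add(docId); // a set, so repeated words are recorded once
            }
        }
        return index;
    }
}
```

With the two documents "Oh yes, beautiful girl." and "Beautiful day", looking up "beautiful" yields document ids 0 and 1, while "girl" yields only 0. The question "which documents contain this word?" becomes a single dictionary lookup.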

3. Lucene's core classes
Indexing process:
IndexWriter, Directory, Analyzer, Document, Field
Search process:
IndexSearcher, Term, Query, TermQuery, Hits

Three. Indexing
1. Flowchart of the indexing process:

Note: The Lucene indexing process has three main stages: converting the data to text, analyzing the text, and saving the analyzed text to the index.

2. Basic indexing operations
2.1 Adding to the index
Document
Field (understand the Field parameters)
Heterogeneous documents
Appending fields
Incremental indexing
2.2 Deleting from the index
Deletion is a soft delete: the document is only marked as deleted. It is physically removed only after IndexWriter.Optimize() is called.

IndexReader reader = IndexReader.Open(directory);

// Delete the Document with the specified number (DocId).
reader.Delete(123);

// Delete the Documents containing the specified Term.
reader.Delete(new Term(FieldValue, "Hello"));

// Restore soft-deleted documents.
reader.UndeleteAll();

reader.Close();

2.3 Updating the index
Lucene actually has no method for updating a document in the index:
update = delete + add
Tip: When deleting and adding multiple Document objects, it is best to batch the operations. This is always faster than alternating delete and add operations.

To add new data to an existing index, simply set the create parameter to false.
Directory directory = FSDirectory.GetDirectory("index", false);
IndexWriter writer = new IndexWriter(directory, analyzer, false);
writer.AddDocument(doc1);
writer.AddDocument(doc2);
writer.Optimize();
writer.Close();

3. Boosting
You can set a weight (boost) on a Document or a Field to move it up in the search result ranking. By default, search results are sorted by document score: the larger the score, the higher the rank. The default boost is 1.
score = score * boost
With this formula, we can set different boosts to influence the ranking.
The following example sets different boosts according to the VIP level.

Document document = new Document();
switch (vip)
{
    case VIP.Gold: document.SetBoost(2F); break;
    case VIP.Argentine: document.SetBoost(1.5F); break;
}

As long as the boost is large enough, a given hit can always be ranked first; this is the "paid ranking" business of Baidu and other sites.
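
The effect of the formula above can be sketched without Lucene.Net. The following is only an illustration of "score = score * boost" as stated in the text; Lucene's real scoring is more involved, and the names BoostSketch, Hit, and Rank are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BoostSketch
{
    public sealed class Hit
    {
        public string Name;
        public float RawScore;
        public float Boost = 1f; // default boost, as in the text

        // The simplified formula from the text: score = score * boost.
        public float FinalScore { get { return RawScore * Boost; } }
    }

    // Orders hits by boosted score, highest first.
    public static List<Hit> Rank(IEnumerable<Hit> hits)
    {
        return hits.OrderByDescending(h => h.FinalScore).ToList();
    }
}
```

For example, a hit with raw score 0.8 and the default boost ends up below a hit with raw score 0.5 and boost 2, because 0.5 * 2 = 1.0 is greater than 0.8.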

4. Directory
Open an existing index from the specified directory.

private Directory directory = FSDirectory.GetDirectory(@"C:\index", false);

Load the index into memory to increase search speed.

private Directory directory = new RAMDirectory(FSDirectory.GetDirectory(@"C:\index", false));
// or
// private Directory directory = new RAMDirectory(@"C:\index");

Pay attention to the create parameter of FSDirectory.GetDirectory: when it is true, the existing index files are deleted. You can check whether an index already exists with the IndexReader.IndexExists() method.

5. Merging indexes
Merge directory1 into directory2.

Directory directory1 = FSDirectory.GetDirectory("index1", false);
Directory directory2 = FSDirectory.GetDirectory("index2", false);

IndexWriter writer = new IndexWriter(directory2, analyzer, false);
writer.AddIndexes(new Directory[] { directory1 });
Console.WriteLine(writer.DocCount());
writer.Close();

6. Optimizing the index
6.1 This is very simple: just call writer.Optimize(). Optimizing slows down indexing, but the optimized index improves search performance. Do not call Optimize() all the time; optimizing once is enough.
6.2 When adding documents to an FSDirectory in batches, increasing the merge factor (mergeFactor) and the minimum number of documents to merge (minMergeDocs) helps improve performance and reduce indexing time.

IndexWriter writer = new IndexWriter(directory, analyzer, true);

writer.maxFieldLength = 1000; // maximum field length
writer.mergeFactor = 1000;
writer.minMergeDocs = 1000;

for (int i = 0; i < 10000; i++)
{
    // Add documents ...
}

writer.Optimize();
writer.Close();

With Lucene, you can make full use of the machine's hardware resources to improve indexing efficiency while building an index. When you need to index a large number of files, you will notice that the bottleneck of the indexing process is writing the index files to disk. To address this, Lucene keeps a buffer in memory. But how do we control this buffer? Fortunately, Lucene's IndexWriter class provides three parameters for adjusting the size of the buffer and how often index files are written to disk.
(1) Merge factor (mergeFactor)
This parameter determines how many documents can be stored in one index block (segment) and how often the index blocks on disk are merged into a larger index block. For example, if the merge factor is 10, the documents in memory must be written to a new index block on disk as soon as their number reaches 10. Likewise, when the number of index blocks on disk reaches 10, those 10 blocks are merged into a new index block. The default value is 10, which is far too small when a very large number of documents has to be indexed. For batch indexing, a larger value gives better indexing performance.
(2) Minimum number of documents to merge (minMergeDocs)
This parameter also affects indexing performance. It determines the minimum number of documents that must accumulate in memory before they are written to disk. The default value is 10; if you have enough memory, setting this value as large as possible will significantly improve indexing performance.
(3) Maximum number of documents to merge (maxMergeDocs)
This parameter determines the maximum number of documents in one index block. Its default value is Integer.MAX_VALUE. Setting it to a large value improves indexing efficiency and search speed; since the default is already the maximum integer value, we generally do not need to change this parameter.
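
The merge factor rule described in (1) can be simulated with a few lines of plain C#. This is a simplified model built only from the description above (flush every mergeFactor documents, merge whenever mergeFactor blocks of the same level exist); it is not Lucene's actual merge code, and the names MergeFactorModel and CountBlocks are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MergeFactorModel
{
    // Returns the number of index blocks left on disk after adding docCount documents;
    // 'merges' receives how many merge operations were needed along the way.
    public static int CountBlocks(int docCount, int mergeFactor, out int merges)
    {
        if (mergeFactor < 2) throw new ArgumentOutOfRangeException("mergeFactor");

        var blocksPerLevel = new List<int>(); // blocksPerLevel[i] = blocks at merge level i
        merges = 0;

        int flushes = docCount / mergeFactor; // one flush per mergeFactor buffered documents
        for (int f = 0; f < flushes; f++)
        {
            int level = 0;
            while (true)
            {
                if (blocksPerLevel.Count == level) blocksPerLevel.Add(0);
                blocksPerLevel[level]++;
                if (blocksPerLevel[level] < mergeFactor) break;

                // mergeFactor blocks at this level: merge them into one block of the next level.
                blocksPerLevel[level] = 0;
                merges++;
                level++;
            }
        }

        int blocks = 0;
        foreach (int n in blocksPerLevel) blocks += n;
        if (docCount % mergeFactor != 0) blocks++; // leftover documents flushed on close
        return blocks;
    }
}
```

In this model, indexing 1000 documents with a merge factor of 10 performs 11 merges and ends with 1 block, while a merge factor of 100 performs no merges at all and ends with 10 blocks. That is the trade-off behind the advice above: a larger merge factor means less merge work during batch indexing, at the cost of more blocks until Optimize() is called.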

7. Indexing large volumes of data (concurrency, multithreading, and locking)
7.1 Multi-threaded indexing
Shared objects (note: a single IndexWriter or IndexReader object can be shared by multiple threads)
Making good use of RAMDirectory
7.2 Locks
Lucene uses file-based locks
write.lock
Disabling index locks (disableLuceneLocks=true)
7.3 Rules for concurrent access
Any number of read-only operations can run concurrently.
While the index is being modified, any number of read-only operations can still run at the same time.
At any given moment, only one operation that modifies the index is allowed.
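
The idea of a file-based write lock from 7.2 can be sketched as follows. This is an illustration in plain C#, not Lucene.Net's lock implementation; the class name SimpleWriteLock and the method TryObtain are hypothetical, and only the file name write.lock comes from the text. Readers never touch the lock file, which matches the rule that reads may proceed while a single writer modifies the index.

```csharp
using System;
using System.IO;

public sealed class SimpleWriteLock : IDisposable
{
    private readonly string path;
    private readonly FileStream stream;

    private SimpleWriteLock(string path, FileStream stream)
    {
        this.path = path;
        this.stream = stream;
    }

    // Returns null when the lock file already exists, i.e. another writer holds the lock.
    public static SimpleWriteLock TryObtain(string indexDir)
    {
        Directory.CreateDirectory(indexDir);
        string path = Path.Combine(indexDir, "write.lock");
        try
        {
            // CreateNew fails with an IOException if the file is already there.
            var stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write, FileShare.None);
            return new SimpleWriteLock(path, stream);
        }
        catch (IOException)
        {
            return null;
        }
    }

    // Releases the lock by closing and deleting the lock file.
    public void Dispose()
    {
        stream.Dispose();
        File.Delete(path);
    }
}
```

A writer would call TryObtain before modifying the index and Dispose afterwards. One known weakness of this approach is that a crashed process leaves the lock file behind, so it has to be removed by hand.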

Four. Search
1. IndexSearcher
Searches are performed through IndexSearcher.
An IndexSearcher can be constructed in two ways: from a Directory object or from a file path (the former is recommended).
The Search() method

2. Query
2.1 Creating a Query object
Use QueryParser to build the Query object. (Note: QueryParser translates the query expression into Lucene's built-in query types.)
Several commonly used built-in types: TermQuery, RangeQuery, PrefixQuery, BooleanQuery.

2.2 The powerful QueryParser
The ToString() method of the Query class
Boolean queries (AND, OR, NOT). Examples: a AND b (+a +b), a OR b (a b), a AND NOT b (+a -b)
Grouping with parentheses "()". Example: (a OR b) AND c
Field selection. Example: tag:beauty
Range queries with [ TO ] and { TO }. Examples: price:[100 TO 200], price:{100 TO 200}
......
(Note: powerful, but not recommended.)
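
The meaning of the +, - and unprefixed forms can be sketched over a tiny in-memory inverted index. This is a plain C# illustration of the matching rules only (no scoring, no Lucene.Net); the names BooleanQuerySketch and Search are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class BooleanQuerySketch
{
    // postings: term -> ids of the documents containing it.
    // must = +term, should = term without a prefix, mustNot = -term.
    public static SortedSet<int> Search(
        IDictionary<string, int[]> postings,
        string[] must, string[] should, string[] mustNot)
    {
        SortedSet<int> result = null;

        // Every required term narrows the result (intersection).
        foreach (string term in must)
        {
            var docs = new SortedSet<int>(Lookup(postings, term));
            if (result == null) result = docs; else result.IntersectWith(docs);
        }

        // Without required terms, a document matches if any optional term matches (union).
        if (result == null)
        {
            result = new SortedSet<int>();
            foreach (string term in should) result.UnionWith(Lookup(postings, term));
        }

        // Prohibited terms remove documents from the result.
        foreach (string term in mustNot) result.ExceptWith(Lookup(postings, term));
        return result;
    }

    private static int[] Lookup(IDictionary<string, int[]> postings, string term)
    {
        int[] docs;
        return postings.TryGetValue(term, out docs) ? docs : new int[0];
    }
}
```

With term a in documents 1, 2, 3 and term b in documents 2, 3, 4, the query +a +b matches 2 and 3, the query a b matches 1 through 4, and +a -b matches only 1.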

3. Hits
3.1 Using the Hits object to access search results
3.2 Several methods of the Hits class
Length(): the number of documents in the Hits collection
Doc(n): the Document instance ranked nth
Id(n): the document ID of the hit ranked nth
Score(n): the normalized score of the hit ranked nth

4. Sorting
4.1 Sorting with a Sort object
Through the constructor parameters of SortField, we can set the sort field, the sort type, and whether to sort in reverse order.

Sort sort = new Sort(new SortField(FieldName, SortField.DOC, false));

IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query, sort);

4.2 Sorting by index order (the document ID assigned at indexing time), using Sort.INDEXORDER as the parameter
4.3 Sorting on multiple fields
4.4 Effect of sorting on performance
Sorting has a considerable impact on search speed, so try not to use more than one sort condition.
Recommendation: use a well-designed boosting scheme together with the default relevance ranking.

5. Filtering
Filtering is a mechanism in Lucene for narrowing the search space.
DateFilter restricts the value of a date field to a given time range.
QueryFilter uses the results of one query as the searchable document space for another query.
Recommendation: A filter post-processes the search results, which can noticeably reduce performance; it is generally better to combine more search conditions with a BooleanQuery to achieve the same result.

Example:
We search for goods whose listing date is between 2005-10-1 and 2005-10-30.
A DateTime value must be converted before it is added to the index, and it must be stored in an indexed field.

// index
document.Add(new Field(FieldDate, DateField.DateToString(date), Field.Store.YES, Field.Index.UN_TOKENIZED));

//...

// search
Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-1"), DateTime.Parse("2005-10-30"));
Hits hits = searcher.Search(query, filter);

Besides DateTime, you can also use integers, for example to search for items priced between 100 and 200.
Lucene.Net's NumberTools pads numbers so that they compare correctly as strings; if you need floating-point numbers, you can refer to its source code.

// index
document.Add(new Field(FieldNumber, NumberTools.LongToString((long)price), Field.Store.YES, Field.Index.UN_TOKENIZED));

//...

// search
Filter filter = new RangeFilter(FieldNumber, NumberTools.LongToString(100L), NumberTools.LongToString(200L), true, true);
Hits hits = searcher.Search(query, filter);
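
Why numbers need padding before a range comparison can be shown with a small sketch. Index terms are compared as strings, so "100" sorts before "20" unless both are padded to the same width. The code below uses plain zero-padding for illustration; it is not the encoding that NumberTools produces, and the name PaddedNumber is hypothetical.

```csharp
using System;
using System.Globalization;

public static class PaddedNumber
{
    // Pads a non-negative long to 19 digits so that ordinal string order equals numeric order.
    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException("value");
        return value.ToString("D19", CultureInfo.InvariantCulture);
    }
}
```

Compared as raw strings, "100" is smaller than "20" because '1' precedes '2'. After encoding, 100 becomes "0000000000000000100" and 20 becomes "0000000000000000020", and the ordinal comparison agrees with the numeric one.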

Use a Query as the filter condition.

QueryFilter filter = new QueryFilter(QueryParser.Parse("name2", FieldValue, analyzer));

We can also use FilteredQuery to filter on multiple conditions.

Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-10"), DateTime.Parse("2005-10-15"));
Filter filter2 = new RangeFilter(FieldNumber, NumberTools.LongToString(11L), NumberTools.LongToString(13L), true, true);

Query query = QueryParser.Parse("name*", FieldName, analyzer);
query = new FilteredQuery(query, filter);
query = new FilteredQuery(query, filter2);

IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);

6. Multi-field search
Use MultiFieldQueryParser to search across multiple fields.
Priority is determined by boost, not by the order in which the fields are listed.

Query query = MultiFieldQueryParser.Parse("name*", new string[] { FieldName, FieldValue }, analyzer);

IndexReader reader = IndexReader.Open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);

7. Combined search
Besides using QueryParser.Parse to parse complex search syntax, you can also combine several Query objects to achieve the same goal.

Query query1 = new TermQuery(new Term(FieldValue, "name1")); // term search
Query query2 = new WildcardQuery(new Term(FieldName, "name*")); // wildcard search
Query query3 = new PrefixQuery(new Term(FieldName, "name1")); // prefix search (field:keyword, with * appended automatically)
Query query4 = new RangeQuery(new Term(FieldNumber, NumberTools.LongToString(11L)), new Term(FieldNumber, NumberTools.LongToString(13L)), true); // range search
Query query5 = new FilteredQuery(query, filter); // search with a filter condition

BooleanQuery query = new BooleanQuery();
query.Add(query1, BooleanClause.Occur.MUST);
query.Add(query2, BooleanClause.Occur.MUST);

IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);

8. Distributed search
We can search multiple indexes using MultiReader or MultiSearcher.

MultiReader reader = new MultiReader(new IndexReader[] { IndexReader.Open(@"C:\index"), IndexReader.Open(@"\\server\index") });
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.Search(query);

IndexSearcher searcher1 = new IndexSearcher(reader1);
IndexSearcher searcher2 = new IndexSearcher(reader2);
MultiSearcher searcher = new MultiSearcher(new Searchable[] { searcher1, searcher2 });
Hits hits = searcher.Search(query);

You can also use ParallelMultiSearcher for multi-threaded parallel searches.

9. Displaying the search syntax string
After combining several search conditions, we may want to see what the resulting search syntax string looks like.

BooleanQuery query = new BooleanQuery();
query.Add(query1, true, false);
query.Add(query2, true, false);
//...

Console.WriteLine("Syntax: {0}", query.ToString());

Output:
Syntax: +(name:name* value:name*) +number:[0000000000000000b TO 0000000000000000d]

Five. Analysis (word segmentation)
1. What is analysis?
Analysis, in Lucene, is the process of converting field text into the most basic indexing unit, the term.
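
As a rough picture of what an analyzer does, the following plain C# sketch turns text into terms by lower-casing, splitting on anything that is not a letter or digit, and dropping stop words. It is not one of Lucene.Net's analyzers; the name AnalyzerSketch and the short stop-word list are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AnalyzerSketch
{
    // A tiny, hypothetical stop-word list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "the", "is", "of" };

    // Converts text into a list of terms, in the order they appear.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var current = new StringBuilder();

        // The trailing space makes sure the last token is flushed.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                string term = current.ToString();
                if (!StopWords.Contains(term)) terms.Add(term);
                current.Clear();
            }
        }
        return terms;
    }
}
```

For the input "The index is a Lucene index!" this produces the terms index, lucene, index. The same analysis has to be applied to the query text, otherwise the query terms will not match the indexed terms.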

2. Built-in analyzers
KeywordAnalyzer
SimpleAnalyzer
StopAnalyzer
WhitespaceAnalyzer
StandardAnalyzer (the most powerful)

3. Chinese word segmentation
There is no official Chinese word segmenter; you can choose a third-party open-source one, such as the Pangu segmenter.

Sample source code: Download SourceCode
PS: The sample program uses Lucene.Net 2.9.2.1 and may not be compatible with the latest version; the version used by the sample program takes precedence.
The sample program uses the Pangu segmenter for Chinese word segmentation. Its official website is http://pangusegment.codeplex.com/

Reposted from: http://www.cnblogs.com/JoinZhang/archive/2010/08/25/1808131.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
