DotLucene (Lucene.Net) Research

Last Update:2018-12-06 Source: Internet

Author: User


1. Create a project

A) Reference Lucene.Net.dll and import the namespaces used in this article:

using System;
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Documents;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;

2. Create an index
Before you start searching, you need to create an index and add the data you want to search.

A) Create an index with the IndexWriter class. Passing true as the last argument creates a new index.
IndexWriter writer = new IndexWriter(@"C:\your\index\directory", new StandardAnalyzer(), true);
This creates a new index in a directory on disk (you need write permission for that directory). You can also store the index in memory or in a database.

B) Once the index exists, open it whenever you need to add documents to it by passing false as the last argument.
IndexWriter writer = new IndexWriter(@"C:\your\index\directory", new StandardAnalyzer(), false);

3. Add documents to the index

Once the index is open, you can add documents to it. Every entity you insert into the index is a document, and each document is made up of fields that hold the related information. For each field you must specify how it is handled, that is, whether the field is:

Stored in the index (any stored value can be retrieved from the index; this is useful for short fields such as author or title)

Indexed (required for any field you want to search on)

Tokenized (split into words before indexing)

Fields can be created with the constructor:

public Field(string fieldName, string fieldValue, bool stored, bool indexed, bool tokenized);

For convenience, you can also use static methods to create a new field instance:

Field.Keyword() - the field value is stored and indexed, but not tokenized.

Field.Text(string name, StreamReader val) - the field is not stored, but it is indexed and tokenized.
Field.Text(string name, string val) - the field is stored, indexed, and tokenized.

Field.UnIndexed() - the field is stored only.

Tokenizing and indexing work on plain text only. To index formatted text, you need a parser that extracts the plain text first; an HTML document, for example, has to be parsed before it can be indexed.

In our example, we index a text file using the following fields:

"filename" (stored, not indexed, not tokenized)

"text" (not stored, indexed, tokenized)

With the fields set up this way, only the content (the "text" field) can be searched.

void AddDocument(IndexWriter writer, string path)
{
    Document doc = new Document();
    StreamReader sr = new StreamReader(path, System.Text.Encoding.Default);

    // "text": indexed and tokenized, but not stored
    doc.Add(Field.Text("text", sr));

    // "filename": stored only, so it can be shown in results but not searched
    doc.Add(Field.UnIndexed("filename", path));

    writer.AddDocument(doc);
    sr.Close();
}

4. Save the index
Do not forget to close the index when you are done; closing the writer saves it.

writer.Close();

You can optimize the index before closing the IndexWriter:

writer.Optimize();
writer.Close();

Optimization may take some time, but it improves search performance. Run it after the last write, just before you close the writer.

5. Start searching
Before you search, open the index with IndexSearcher. You can open and search the index even while another process is adding documents to it or optimizing it.

IndexSearcher searcher = new IndexSearcher(@"C:\your\index\directory");

Then you can create a query:

string q = "dotlucene";
Query query = QueryParser.Parse(q, "text", new StandardAnalyzer());

Now we can get the search results and print them. We print the file name because "filename" is the stored field; the "text" field is not stored, so it cannot be retrieved from the index.

Hits hits = searcher.Search(query);
Console.WriteLine("Found " + hits.Length() + " document(s) that matched query '" + q + "':\r\n");

for (int i = 0; i < hits.Length(); i++)
{
    Document doc = hits.Doc(i);
    Console.WriteLine(doc.Get("filename") + "\r\n");
}

Finally, do not forget to close the searcher:

searcher.Close();

6. Query syntax

Query	Example	Notes
Single term	document	Searches for documents that contain the term "document" in the default field.
Phrase	"important document"	Searches for documents that contain the phrase "important document" in the default field.
Searching fields	title:document	Searches for documents that contain the term "document" in the "title" field.
Wildcard search	doc?ment	Single-character wildcard search. It matches "document" and "dociment" but not "docooment".
	document*	Multi-character wildcard search. It matches "document" and "documentation".
Fuzzy search	document~	Search based on similar spelling.
	document~0.9	Search based on similar spelling. 0.9 is the required similarity (default: 0.5).
Proximity search	"important document"~5	Finds the words of a phrase even when they are not next to each other. The maximum distance in this example is 5 words.
Range search	author:{Einstein TO Newton}	Searches for documents whose "author" field value lies between the specified values.
	date:{20050101 TO 20050201}	Searches for documents whose "date" field (DateTime type) value lies between the specified dates.
Relevance	important^4 document	Sets the boost factor of the term "important" to 4. The default boost factor is 1.
	"important document"^4 "search engine"	You can set a boost factor for phrases too.
OR operator	important document	"OR" is the default operator.
	important OR document	The default field must contain either "important" or "document".
AND operator	important AND document	The default field must contain both words.
+ operator	important +document	The default field must contain "document" and may contain "important".
NOT/- operator	-important document	The default field must contain "document" but not "important".
Grouping	(important OR office) AND document	Use parentheses to group expressions.
	author:(Einstein OR Newton)	Parentheses work with fields as well.

7. Prohibited queries

Query	Example	Notes
Wildcard at the beginning of a term	?ocument, *ocument	Throws Lucene.Net.QueryParsers.ParseException.
Stop words	a, the, and	Stop words are not indexed.
Special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \	\+, \:	Use a backslash to escape the special characters.
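
The escaping rule in the last row can be sketched as a small helper. This is only an illustration: QueryEscaper and its Escape method are hypothetical names and are not part of Lucene.Net. The sketch assumes that escaping each listed character individually is acceptable, which also covers the two-character operators && and ||.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.Net: prefixes each query-syntax
// special character with a backslash so it is treated as literal text.
public static class QueryEscaper
{
    // Single characters from the special-character list; escaping '&' and
    // '|' one at a time also covers the "&&" and "||" operators.
    private const string SpecialChars = "+-&|!(){}[]^\"~*?:\\";

    public static string Escape(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length * 2);
        foreach (char c in input)
        {
            if (SpecialChars.IndexOf(c) >= 0)
            {
                sb.Append('\\'); // insert the escape character first
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For example, QueryEscaper.Escape("C++ (2nd ed.): notes") returns C\+\+ \(2nd ed.\)\: notes. Escaping user input this way keeps the query parser from interpreting it as operators, at the cost of disabling wildcards and field prefixes in that input.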

Additional notes:

1. TermQuery: queries a specific term.

Term t = new Term("ISBN", "1930110995");
Query query = new TermQuery(t);

2. RangeQuery: queries a range of values; it is typically used for dates and times.

For example:
Query query = new RangeQuery(new Term("time", "20070516"), new Term("time", "20070517"), false);

The third parameter of RangeQuery indicates whether the start and end values themselves are included in the range.
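
The effect of that inclusive flag can be sketched without Lucene. Range bounds are compared as strings, which is why dates are written as yyyyMMdd so that string order matches date order. TermRange and Contains below are hypothetical names used only for this illustration, not Lucene.Net types.

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net: mimics how a term range
// compares values as strings, with inclusive or exclusive bounds.
public static class TermRange
{
    public static bool Contains(string value, string lower, string upper, bool inclusive)
    {
        // Ordinal comparison: character by character, no culture rules.
        int lo = string.CompareOrdinal(value, lower);
        int hi = string.CompareOrdinal(value, upper);

        return inclusive
            ? (lo >= 0 && hi <= 0)  // bounds themselves match
            : (lo > 0 && hi < 0);   // bounds themselves are excluded
    }
}
```

With the bounds "20070516" and "20070517", TermRange.Contains("20070516", "20070516", "20070517", false) is false, while passing true makes it match. Under this model an exclusive range over two consecutive yyyyMMdd values matches no eight-digit date at all, so an inclusive range is usually what a date query needs.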

3. PrefixQuery: searches for terms that start with a given prefix. It is often used for category retrieval.
PrefixQuery query = new PrefixQuery(new Term("category", "/computers"));

4. BooleanQuery: combines several conditions into one query.

TermQuery searchingBooks =
    new TermQuery(new Term("subject", "JUnit"));

RangeQuery currentBooks =
    new RangeQuery(new Term("pubmonth", "200301"),
                   new Term("pubmonth", "200312"),
                   true);

// Both clauses are required, so the result is "searchingBooks AND currentBooks".
BooleanQuery currentSearchingBooks = new BooleanQuery();
currentSearchingBooks.Add(searchingBooks, true, false);
currentSearchingBooks.Add(currentBooks, true, false);

// directory is the index location opened elsewhere
IndexSearcher searcher = new IndexSearcher(directory);
Hits hits = searcher.Search(currentSearchingBooks);

When are the clauses combined with AND, and when with OR? That depends on the parameters passed to the Add method of the BooleanQuery object.

Parameter 1 is the query condition to add.

Parameter 2, required, indicates whether the condition must be met. True means the condition must be met; false means it is optional.

Parameter 3, prohibited, indicates whether results matching the condition must be rejected. True means results that meet the condition are excluded; false means they are allowed.

Note that AND and OR must be written in upper case in a query string. To express "A and not B", write A AND -B, or +A -B.
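
The relationship between the two flags and the query-string operators can be summarized in a short sketch. ClauseFlags and Prefix are hypothetical names introduced for illustration only; the sketch assumes that a clause marked both required and prohibited is an error.

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net: shows which query-syntax
// prefix corresponds to each (required, prohibited) flag combination.
public static class ClauseFlags
{
    public static string Prefix(bool required, bool prohibited)
    {
        if (required && prohibited)
        {
            // Assumed to be invalid: a clause cannot be both at once.
            throw new ArgumentException("A clause cannot be both required and prohibited.");
        }
        if (required) return "+";   // must match (AND-like)
        if (prohibited) return "-"; // must not match (NOT)
        return "";                  // optional (OR-like)
    }
}
```

Read this way, Add(a, true, false) followed by Add(b, false, true) corresponds to the query string +A -B, and two clauses added with (false, false) correspond to A OR B.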

By default, QueryParser treats a space between terms as OR. You can change this default operator through the QueryParser object, for example to AND.

5. PhraseQuery
Queries a phrase. The main concept is slop, the positional distance allowed between the words of the phrase.

This value affects the score of the result: a slop of 0 requires an exact phrase match. The example below makes this easier to follow. You do not need to understand how slop is computed, but a large slop hurts query efficiency, so keep the value small in practice.
With a slop, PhraseQuery does not enforce the order of the words in the phrase. This raises the hit rate but also has a significant performance cost.
SpanNearQuery can be used to control the order of the words and improve performance.

private bool Matched(string[] phrase, int slop)
{
    PhraseQuery query = new PhraseQuery();
    query.SetSlop(slop);

    // Add the words of the phrase in order
    for (int i = 0; i < phrase.Length; i++)
    {
        query.Add(new Term("field", phrase[i]));
    }

    // searcher is an IndexSearcher opened elsewhere
    Hits hits = searcher.Search(query);
    return hits.Length() > 0;
}

When you use QueryParser to query phrases, there are two ways to set the slop value:

Query q2 = QueryParser.Parse("\"quick fox\"~1",  // Method 1: slop in the query string
    "field", new SimpleAnalyzer());

QueryParser qp = new QueryParser("field", new SimpleAnalyzer());
qp.SetPhraseSlop(1);  // Method 2: default slop on the parser

Original article: http://www.shenjk.com/detail/277
