Lucene.Net Series II


Lucene index creation process:

1. Extract text.

For example, extract the content of PDF and Word files as plain text. For convenience, Lucene accepts not only the String type but also Date and Reader; if either of the latter two is used, Lucene performs the type conversion automatically.

2. Text Analysis.

Lucene performs some basic analysis on the given text and removes unnecessary information, such as the common words "a", "an", "and"; and if you do not care about letter case when searching, case differences can be removed as well. In short, you can think of this step as a text filter: all of the text passes through the analysis, and only the most useful information is left.
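As a rough illustration of what such a filter does (a standalone C# sketch, not Lucene's actual analyzer code; the stop-word list here is only an example):

using System;
using System.Collections.Generic;

string[] stopWords = { "a", "an", "and", "the" };
string text = "A Quick Fox and a Lazy Dog";
List<string> terms = new List<string>();
foreach (string raw in text.Split(' '))
{
    string term = raw.ToLower();            // normalize case
    if (Array.IndexOf(stopWords, term) < 0) // drop common words
        terms.Add(term);
}
Console.WriteLine(string.Join(" ", terms.ToArray())); // quick fox lazy dog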

3. Write the index.

Like well-known search engines such as Google, Lucene uses the inverted index technique when writing the index. In short, it uses a structure (somewhat like a hash table) to turn the question "which words does this document contain" into "which documents contain this word". The indexing mechanisms of different search engines differ mainly in how they attach more accurate descriptions to this inverted table, for example Google's famous PageRank factor; Lucene also has techniques of its own, which will be introduced in a later article.
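A minimal sketch of the idea in plain C# (only an illustration, not Lucene's internal implementation): map every word to the list of document numbers that contain it.

using System.Collections.Generic;

// documents[i] is the text of document number i
string[] documents = { "fast quick fox", "quick brown dog" };
Dictionary<string, List<int>> invertedIndex = new Dictionary<string, List<int>>();
for (int docId = 0; docId < documents.Length; docId++)
{
    foreach (string word in documents[docId].Split(' '))
    {
        if (!invertedIndex.ContainsKey(word))
            invertedIndex[word] = new List<int>();
        invertedIndex[word].Add(docId); // record that this document contains the word
    }
}
// invertedIndex["quick"] now holds { 0, 1 }: both documents contain "quick"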

In the previous article, we used the most basic index creation method. Here we will discuss some issues in detail.

1. Adding documents to the index
In the previous article, every document added to the index carried the same fields: the file name and the contents of the file.

doc.Add(Field.Keyword("filename", file.FullName));
doc.Add(Field.Text("contents", new StreamReader(file.FullName)));
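For context, a rough sketch of the kind of indexing loop used in the previous article (reconstructed here; the paths, the analyzer, and the *.txt filter are only placeholders, assuming the classic Lucene.Net API):

// requires the Lucene.Net.* and System.IO namespaces
IndexWriter writer = new IndexWriter("c:\\index", new StandardAnalyzer(), true); // true: create a new index
foreach (FileInfo file in new DirectoryInfo("c:\\docs").GetFiles("*.txt"))
{
    Document doc = new Document();
    doc.Add(Field.Keyword("filename", file.FullName));                // stored as-is, not analyzed
    doc.Add(Field.Text("contents", new StreamReader(file.FullName))); // analyzed from a Reader
    writer.AddDocument(doc);
}
writer.Optimize();
writer.Close();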

In Lucene, the fields that describe each document may differ from document to document. For example, two documents may both describe a person, but one has only a name field while the other has id and sex fields; this kind of irregular document description is allowed in Lucene.
Another point is that Lucene allows you to append several values to the same field, as shown below:

string baseWord = "fast";
string[] synonyms = new string[] { "quick", "rapid", "speedy" };
Document doc = new Document();
doc.Add(Field.Text("word", baseWord));
for (int i = 0; i < synonyms.Length; i++)
    doc.Add(Field.Text("word", synonyms[i]));

This is purely for the user's convenience. Internally, Lucene performs the conversion automatically, and the effect is the same as concatenating the values and then storing them as a single field.
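For example, once the document above has been indexed, searching the word field for one of the appended synonyms should find it (a small sketch; dir stands for the directory the index was written to):

IndexSearcher searcher = new IndexSearcher(dir);
Hits hits = searcher.Search(new TermQuery(new Term("word", "quick")));
int hitCount = hits.Length(); // 1: the document is found through the appended value
searcher.Close();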

2. Deleting documents from the index

Lucene takes a somewhat unusual approach here: an IndexReader is used to mark the documents to be deleted, and the deletions actually take place when the reader is closed.
Several related methods are briefly introduced below.

[TestFixture]
public class DocumentDeleteTest : BaseIndexingTestCase // the SetUp method in BaseIndexingTestCase creates an index with two documents added
{
    [Test]
    public void TestDeleteBeforeIndexMerge()
    {
        IndexReader reader = IndexReader.Open(dir); // the current index contains two documents

        Assert.AreEqual(2, reader.MaxDoc());   // documents are numbered from 0; MaxDoc() is the number the next document would get
        Assert.AreEqual(2, reader.NumDocs());  // NumDocs() is the number of documents currently in the index
        reader.Delete(1);                      // mark document number 1 as deleted (a logical delete)
        Assert.IsTrue(reader.IsDeleted(1));    // check whether the document with this number is marked as deleted
        Assert.IsTrue(reader.HasDeletions());  // check whether any document in the index is marked as deleted
        Assert.AreEqual(2, reader.MaxDoc());   // the number of the next document is still 2
        Assert.AreEqual(1, reader.NumDocs());  // the number of documents in the index drops to 1
        reader.Close();                        // the documents marked above are physically deleted here
        reader = IndexReader.Open(dir);
        Assert.AreEqual(2, reader.MaxDoc());
        Assert.AreEqual(1, reader.NumDocs());
        reader.Close();
    }

    [Test]
    public void DeleteAfterIndexMerge() // delete, then optimize (merge) the index
    {
        IndexReader reader = IndexReader.Open(dir);
        Assert.AreEqual(2, reader.MaxDoc());
        Assert.AreEqual(2, reader.NumDocs());
        reader.Delete(1);
        reader.Close();
        IndexWriter writer = new IndexWriter(dir, GetAnalyzer(), false);
        writer.Optimize();                     // merge the index segments
        writer.Close();
        reader = IndexReader.Open(dir);
        Assert.IsFalse(reader.IsDeleted(1));
        Assert.IsFalse(reader.HasDeletions());
        Assert.AreEqual(1, reader.MaxDoc());   // after the merge, the number of the next document becomes 1
        Assert.AreEqual(1, reader.NumDocs());
        reader.Close();
    }
}

You can also delete documents without referring to document numbers. The following code removes from the index every document that contains a given term.

IndexReader reader = IndexReader.Open(dir);
reader.Delete(new Term("city", "Amsterdam"));
reader.Close();

You can also call reader.UndeleteAll() to clear the deletion marks, that is, to undo all deletions made through this reader before reader.Close() is called.
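For example (a small sketch following the same pattern as the tests above):

IndexReader reader = IndexReader.Open(dir);
reader.Delete(1);                      // mark document 1 as deleted
reader.UndeleteAll();                  // clear every deletion mark made through this reader
Assert.IsFalse(reader.HasDeletions()); // nothing is marked as deleted any more
reader.Close();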

3. Updating documents in the index

Lucene does not support updating a document directly; an update has to be implemented as a delete followed by an add, as the code below shows. It is easy to follow.

[TestFixture]
public class DocumentUpdateTest : BaseIndexingTestCase
{
    [Test]
    public void Update()
    {
        Assert.AreEqual(1, GetHitCount("city", "Amsterdam"));
        IndexReader reader = IndexReader.Open(dir);
        reader.Delete(new Term("city", "Amsterdam"));
        reader.Close();
        Assert.AreEqual(0, GetHitCount("city", "Amsterdam"));
        IndexWriter writer = new IndexWriter(dir, GetAnalyzer(), false);
        Document doc = new Document();
        doc.Add(Field.Keyword("id", "1"));
        doc.Add(Field.UnIndexed("country", "Netherlands"));
        doc.Add(Field.UnStored("contents", "Amsterdam has lots of bridges"));
        doc.Add(Field.Text("city", "Haag"));
        writer.AddDocument(doc);
        writer.Optimize();
        writer.Close();
        Assert.AreEqual(1, GetHitCount("city", "Haag"));
    }

    protected override Analyzer GetAnalyzer()
    {
        // note: if SimpleAnalyzer were used for the search instead, the test would fail,
        // because SimpleAnalyzer lowercases all letters while analyzing
        return new WhitespaceAnalyzer();
    }

    private int GetHitCount(string fieldName, string searchString)
    {
        IndexSearcher searcher = new IndexSearcher(dir);
        Term t = new Term(fieldName, searchString);
        Query query = new TermQuery(t);
        Hits hits = searcher.Search(query);
        int hitCount = hits.Length();
        searcher.Close();
        return hitCount;
    }
}

Note that when performing all of the index operations above, you should avoid frequently opening and closing the writer and the reader. Additions and deletions go through different objects (IndexWriter and IndexReader), so try to group the add operations together and execute them as a batch, then group the delete operations together and execute them as a batch, and avoid alternating between adding and deleting.
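A rough sketch of this batching pattern (docsToAdd and termsToDelete are placeholder collections, not part of the Lucene API):

// batch all deletions through one reader
IndexReader reader = IndexReader.Open(dir);
foreach (Term term in termsToDelete)
    reader.Delete(term);
reader.Close();

// then batch all additions through one writer
IndexWriter writer = new IndexWriter(dir, GetAnalyzer(), false); // false: open the existing index
foreach (Document doc in docsToAdd)
    writer.AddDocument(doc);
writer.Optimize();
writer.Close();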
