Lucene.Net 2.3.1 Development Introduction - III. Index (1)


Before we get into indexing, let's answer three questions: what is an index, why do we index, and how do we index?

 

First, let's look at how to search within a piece of text. Say we have a string s = "abcdefghijklmnopqrstuvwxyz", the 26 letters, and we want to know whether it contains "a". That is trivial with IndexOf. Now make the data volume much larger and put the records in a database; the operations the database provides can still search them easily. But set the database aside and dump those records into N text files, and suppose we need to find every record containing the word "Lucene". What do we do? Scanning file by file, the way Windows searches for a file name or a piece of text, costs a lot of time for every single search. So why are Google and Baidu so fast? Since we only just set the database aside, could we bring it back? In my experience a database certainly won't do (meaning a relational database; databases built specifically for search are another story). So what kind of structure is fast for lookups? In C# there are types like Hashtable and Dictionary. Can that idea be used for a search engine? Obviously yes! Many things at the micro level (a small algorithm or a small data structure) and at the macro level (a framework or a whole system) carry different names but share a great deal and are well worth comparing. Lucene.Net is exactly such a framework: a macro-level incarnation of the hashtable idea, even though there are also plenty of differences.
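To make the hashtable analogy concrete, here is a minimal C# sketch (my own toy example, not part of Lucene.Net): it contrasts an IndexOf scan over every document with a lookup in a word-to-documents dictionary, which is essentially a miniature inverted index.

    using System;
    using System.Collections.Generic;

    class ToyIndexDemo
    {
        static void Main()
        {
            // A handful of "documents" standing in for the N text files.
            string[] docs =
            {
                "Lucene is a full-text search library",
                "the quick brown fox",
                "Lucene.Net is the C# port of Lucene"
            };

            // Linear scan: every query has to walk every document (the IndexOf way).
            for (int i = 0; i < docs.Length; i++)
            {
                if (docs[i].IndexOf("Lucene") >= 0)
                    Console.WriteLine("scan hit: doc " + i);
            }

            // The hashtable idea: build word -> document ids once, then look up directly.
            Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();
            for (int i = 0; i < docs.Length; i++)
            {
                foreach (string word in docs[i].Split(' ', '.'))
                {
                    if (word.Length == 0) continue;
                    List<int> postings;
                    if (!index.TryGetValue(word, out postings))
                        index[word] = postings = new List<int>();
                    if (!postings.Contains(i)) postings.Add(i);
                }
            }

            List<int> hits;
            if (index.TryGetValue("Lucene", out hits))
                Console.WriteLine("index hit: " + hits.Count + " document(s)");
        }
    }

Building the dictionary costs something up front, but after that every query is a direct lookup instead of a scan over all the text; that trade-off is exactly what an index is.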

 

Lucene.Net uses an inverted index as its data structure. I once read an article arguing that in the object-oriented era the role of data structures has been weakened; whatever you think of that idea, it has to be abandoned at least where Lucene.Net is concerned. Back to the original question: what is an index? Looking at the history of search engines, the early ones were based on keywords and directories, and they have since evolved into full-text search. For an everyday picture of an index: you have a book, and on page eight there is a story; a slip of paper on which you note the story's name and its page number is an index. A book's table of contents, its page numbers, and a library's shelf numbers can all be regarded as indexes.

Lucene.Net uses an inverted index. What is an inverted index? It means analyzing a piece of text and indexing the keywords produced by that analysis. For example, take the sentence "I am using Lucene.net." After it goes through the StandardAnalyzer tokenizer, the individual words produced by the analysis are what get stored; in other words, the index is stored word by word. At the same time, the index records which documents each word appears in, along with its positions and its frequency. Doesn't that resemble redundancy in a database? The data that would otherwise have to be computed at query time has already been recorded and can simply be read back. In theory, for a given language the number of distinct words that can appear is finite. You can also see from this why word segmentation matters so much: if segmentation glues N words together into one unit, none of the individual words inside that unit can be found any more, unless you pay for it in speed. People used to be dissatisfied with Google because its Chinese word segmentation lagged behind Baidu's, though the gap has been narrowing.
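As a rough illustration of that tokenization step, the sketch below runs the example sentence through StandardAnalyzer and prints each term it produces, together with the offsets the Token carries. The method names follow the 2.x port of the Java analysis API, and the field name "contents" is simply made up here:

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    class AnalyzerDemo
    {
        static void Main()
        {
            // StandardAnalyzer is the tokenizer mentioned above.
            Analyzer analyzer = new StandardAnalyzer();

            // The field name is only context for the analyzer; "contents" is arbitrary.
            TokenStream stream = analyzer.TokenStream(
                "contents", new StringReader("I am using Lucene.net."));

            // In the 2.x API, Next() returns one Token at a time, or null at the end.
            Token token;
            while ((token = stream.Next()) != null)
            {
                Console.WriteLine("{0} [{1}-{2}]",
                    token.TermText(), token.StartOffset(), token.EndOffset());
            }
            stream.Close();
        }
    }

Whatever terms come out of this loop are exactly the units written into the inverted index, each with its document, position, and frequency information.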

 

The second question, why we need an index, should not be hard to answer now. As for how to index, that is a longer story.

 

1. Logical Storage Structure

In an inverted index the word is the smallest unit, and in Lucene.Net that unit is the Term. N terms make up a Field, N fields make up a Document, N documents make up a Segment, and N segments are written into the Lucene.Net file system. I say "file system" because Lucene.Net implements its own: its smallest unit consists of three files, and it can live in a directory on disk or entirely in memory (see the sketch below). The Lucene.Net file system can be thought of as a single file; on Windows it is a directory containing those three files, but logically, from Lucene.Net's point of view, it is one file. The text inside that file is then divided into N chapters, which are the segments; each chapter has N paragraphs, which are the documents; each sentence within a paragraph is a field; and each word is a term. That is quite close to how we naturally break up text, isn't it? The most important piece is the Term, and everything else is laid out around it.
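The on-disk versus in-memory choice corresponds to the two Directory implementations in the 2.x store API. A minimal sketch, with a made-up index path:

    using Lucene.Net.Store;

    class DirectoryDemo
    {
        static void Main()
        {
            // On disk: an ordinary OS directory holds the segment files;
            // "true" creates a fresh index directory (wiping an existing index).
            Directory onDisk = FSDirectory.GetDirectory("C:\\my-index", true);

            // In memory: the same logical file system, kept entirely in RAM.
            Directory inMemory = new RAMDirectory();

            onDisk.Close();
            inMemory.Close();
        }
    }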

 

Speaking of word segmentation, there is another class that has to be mentioned: Token. Does it look familiar? A Term and a Token hold the same text; what differs is which attributes of that text they record.
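A hedged way to see the difference (class names from the 2.x index API; the field name and text are mine): a Term is just a field name plus a piece of text, while the Token produced during analysis carries that same text together with extra per-occurrence attributes.

    using System;
    using Lucene.Net.Index;

    class TermVsToken
    {
        static void Main()
        {
            // A Term is what the inverted index stores: (field name, text), nothing more.
            Term term = new Term("contents", "lucene");
            Console.WriteLine(term.Field() + ":" + term.Text());

            // A Token (see the analyzer sketch earlier) holds the same text while the
            // document is being analyzed, plus attributes a Term does not keep:
            // start/end offsets in the source text and the position increment.
        }
    }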

 

The previous two installments already wrote to the index, and the code is similar each time: first create a tokenizer (analyzer) and hand it to an IndexWriter; then create N Documents, fill Fields into each Document, and hand the Documents to the IndexWriter, and the indexing is done. Segment handling happens inside the black box; the handling of Terms is visible only through the tokenizer.
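Putting those steps together, a minimal indexing pass with the 2.x API looks roughly like this (the index path, field names, and text are illustrative, not taken from the article):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    class IndexingDemo
    {
        static void Main()
        {
            // 1. Create the tokenizer and hand it to the IndexWriter;
            //    "true" means build a brand-new index in that directory.
            IndexWriter writer = new IndexWriter("C:\\my-index",
                new StandardAnalyzer(), true);

            // 2. Create a document and fill in its fields.
            Document doc = new Document();
            doc.Add(new Field("title", "Lucene.Net introduction",
                Field.Store.YES, Field.Index.TOKENIZED));
            doc.Add(new Field("contents", "I am using Lucene.net.",
                Field.Store.NO, Field.Index.TOKENIZED));

            // 3. Hand the document to the writer; segments are managed internally.
            writer.AddDocument(doc);

            // Merge segments and flush everything out to the index files.
            writer.Optimize();
            writer.Close();
        }
    }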

 

(PS: That's it for now, off to bed. Zzzzzz~~)
