Lucene.Net 2.3.1 Development Introduction - III. Index (1)


Before we get into indexing, let's answer three questions: what is an index, why do we index, and how do we index?

 

First, let's look at how to search within a piece of text. Say we have a string s = "abcdefghijklmnopqrstuvwxyz", the 26 letters, and we want to know whether it contains "a". That is trivial with IndexOf. Now make the data volume much larger and put the records in a database; the operations the database provides can still search them easily. But set the database aside and dump those records into N text files, and suppose we need to find every record containing the word "Lucene". What do we do? Scanning file by file, the way Windows searches for a file name or a piece of text, costs a lot of time for every single search. So why are Google and Baidu so fast? Since we only just set the database aside, could we bring it back? In my experience a database certainly won't do (meaning a relational database; databases built specifically for search are another story). So what kind of structure is fast for lookups? In C# there are types like Hashtable and Dictionary. Can that idea be used for a search engine? Obviously yes! Many things at the micro level (a small algorithm or a small data structure) and at the macro level (a framework or a whole system) carry different names but share a great deal and are well worth comparing. Lucene.Net is exactly such a framework: a macro-level incarnation of the hashtable idea, even though there are also plenty of differences.
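To make the hashtable analogy concrete, here is a minimal C# sketch (my own toy example, not part of Lucene.Net): it contrasts an IndexOf scan over every document with a lookup in a word-to-documents dictionary, which is essentially a miniature inverted index.

    using System;
    using System.Collections.Generic;

    class ToyIndexDemo
    {
        static void Main()
        {
            // A handful of "documents" standing in for the N text files.
            string[] docs =
            {
                "Lucene is a full-text search library",
                "the quick brown fox",
                "Lucene.Net is the C# port of Lucene"
            };

            // Linear scan: every query has to walk every document (the IndexOf way).
            for (int i = 0; i < docs.Length; i++)
            {
                if (docs[i].IndexOf("Lucene") >= 0)
                    Console.WriteLine("scan hit: doc " + i);
            }

            // The hashtable idea: build word -> document ids once, then look up directly.
            Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();
            for (int i = 0; i < docs.Length; i++)
            {
                foreach (string word in docs[i].Split(' ', '.'))
                {
                    if (word.Length == 0) continue;
                    List<int> postings;
                    if (!index.TryGetValue(word, out postings))
                        index[word] = postings = new List<int>();
                    if (!postings.Contains(i)) postings.Add(i);
                }
            }

            List<int> hits;
            if (index.TryGetValue("Lucene", out hits))
                Console.WriteLine("index hit: " + hits.Count + " document(s)");
        }
    }

Building the dictionary costs something up front, but after that every query is a direct lookup instead of a scan over all the text; that trade-off is exactly what an index is.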

 

Lucene.Net uses an inverted index as its data structure. I once read an article arguing that in the object-oriented era the role of data structures has been weakened; whatever you think of that idea, it has to be abandoned at least where Lucene.Net is concerned. Back to the original question: what is an index? Looking at the history of search engines, the early ones were based on keywords and directories, and they have since evolved into full-text search. For an everyday picture of an index: you have a book, and on page eight there is a story; a slip of paper on which you note the story's name and its page number is an index. A book's table of contents, its page numbers, and a library's shelf numbers can all be regarded as indexes.

Lucene.Net uses an inverted index. What is an inverted index? It means analyzing a piece of text and indexing the keywords produced by that analysis. For example, take the sentence "I am using Lucene.net." After it goes through the StandardAnalyzer tokenizer, the individual words produced by the analysis are what get stored; in other words, the index is stored word by word. At the same time, the index records which documents each word appears in, along with its positions and its frequency. Doesn't that resemble redundancy in a database? The data that would otherwise have to be computed at query time has already been recorded and can simply be read back. In theory, for a given language the number of distinct words that can appear is finite. You can also see from this why word segmentation matters so much: if segmentation glues N words together into one unit, none of the individual words inside that unit can be found any more, unless you pay for it in speed. People used to be dissatisfied with Google because its Chinese word segmentation lagged behind Baidu's, though the gap has been narrowing.
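As a rough illustration of that tokenization step, the sketch below runs the example sentence through StandardAnalyzer and prints each term it produces, together with the offsets the Token carries. The method names follow the 2.x port of the Java analysis API, and the field name "contents" is simply made up here:

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    class AnalyzerDemo
    {
        static void Main()
        {
            // StandardAnalyzer is the tokenizer mentioned above.
            Analyzer analyzer = new StandardAnalyzer();

            // The field name is only context for the analyzer; "contents" is arbitrary.
            TokenStream stream = analyzer.TokenStream(
                "contents", new StringReader("I am using Lucene.net."));

            // In the 2.x API, Next() returns one Token at a time, or null at the end.
            Token token;
            while ((token = stream.Next()) != null)
            {
                Console.WriteLine("{0} [{1}-{2}]",
                    token.TermText(), token.StartOffset(), token.EndOffset());
            }
            stream.Close();
        }
    }

Whatever terms come out of this loop are exactly the units written into the inverted index, each with its document, position, and frequency information.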

 

The second question, why we need an index, should not be hard to answer now. As for how to index, that is a longer story.

 

1. Logical Storage Structure

In an inverted index the word is the smallest unit, and in Lucene.Net that unit is the Term. N terms make up a Field, N fields make up a Document, N documents make up a Segment, and N segments are written into the Lucene.Net file system. I say "file system" because Lucene.Net implements its own: its smallest unit consists of three files, and it can live in a directory on disk or entirely in memory (see the sketch below). The Lucene.Net file system can be thought of as a single file; on Windows it is a directory containing those three files, but logically, from Lucene.Net's point of view, it is one file. The text inside that file is then divided into N chapters, which are the segments; each chapter has N paragraphs, which are the documents; each sentence within a paragraph is a field; and each word is a term. That is quite close to how we naturally break up text, isn't it? The most important piece is the Term, and everything else is laid out around it.
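The on-disk versus in-memory choice corresponds to the two Directory implementations in the 2.x store API. A minimal sketch, with a made-up index path:

    using Lucene.Net.Store;

    class DirectoryDemo
    {
        static void Main()
        {
            // On disk: an ordinary OS directory holds the segment files;
            // "true" creates a fresh index directory (wiping an existing index).
            Directory onDisk = FSDirectory.GetDirectory("C:\\my-index", true);

            // In memory: the same logical file system, kept entirely in RAM.
            Directory inMemory = new RAMDirectory();

            onDisk.Close();
            inMemory.Close();
        }
    }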

 

Speaking of word segmentation, there is another class that has to be mentioned: Token. Does it look familiar? A Term and a Token hold the same text; what differs is which attributes of that text they record.
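A hedged way to see the difference (class names from the 2.x index API; the field name and text are mine): a Term is just a field name plus a piece of text, while the Token produced during analysis carries that same text together with extra per-occurrence attributes.

    using System;
    using Lucene.Net.Index;

    class TermVsToken
    {
        static void Main()
        {
            // A Term is what the inverted index stores: (field name, text), nothing more.
            Term term = new Term("contents", "lucene");
            Console.WriteLine(term.Field() + ":" + term.Text());

            // A Token (see the analyzer sketch earlier) holds the same text while the
            // document is being analyzed, plus attributes a Term does not keep:
            // start/end offsets in the source text and the position increment.
        }
    }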

 

The previous two installments already wrote to the index, and the code is similar each time: first create a tokenizer (analyzer) and hand it to an IndexWriter; then create N Documents, fill Fields into each Document, and hand the Documents to the IndexWriter, and the indexing is done. Segment handling happens inside the black box; the handling of Terms is visible only through the tokenizer.
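Putting those steps together, a minimal indexing pass with the 2.x API looks roughly like this (the index path, field names, and text are illustrative, not taken from the article):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    class IndexingDemo
    {
        static void Main()
        {
            // 1. Create the tokenizer and hand it to the IndexWriter;
            //    "true" means build a brand-new index in that directory.
            IndexWriter writer = new IndexWriter("C:\\my-index",
                new StandardAnalyzer(), true);

            // 2. Create a document and fill in its fields.
            Document doc = new Document();
            doc.Add(new Field("title", "Lucene.Net introduction",
                Field.Store.YES, Field.Index.TOKENIZED));
            doc.Add(new Field("contents", "I am using Lucene.net.",
                Field.Store.NO, Field.Index.TOKENIZED));

            // 3. Hand the document to the writer; segments are managed internally.
            writer.AddDocument(doc);

            // Merge segments and flush everything out to the index files.
            writer.Optimize();
            writer.Close();
        }
    }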

 

(PS: That's it for now, off to bed. Zzzzzz~~)
