Chapter 4: Full-Text Indexing

Source: Internet
Author: User
Before studying search engine technology, you should have some background knowledge. Modern Information Retrieval is the classic textbook in the IR field. This article assumes that readers already have that foundation.
Different kinds of data need to be handled differently. In general, the text content of a web page can be divided into four field types:
Keyword: not analyzed, but indexed and stored as a single term. Suitable for values such as URLs, file system paths, dates, personal names, social insurance numbers, and phone numbers. The local search example uses the system path name as a keyword.
Unindexed: neither analyzed nor indexed, but the value is stored in the index. This type is generally used for content that is displayed with the search results but does not need to be searched itself, such as a URL or a database primary key. Because the original value is stored in the index, it is not suitable for large amounts of data.
Unstored: the opposite of Unindexed. The field is analyzed and indexed, but not stored in the index. It suits text that does not need to be retrieved in its original form, such as the body of a web page or other less important text.
Text: analyzed, indexed, and stored in the index. This content can be both searched and retrieved.
Of course, the choice is not limited to these types; practical applications can define more. The design principle can be summarized as follows: content that is searched frequently, or that makes up the actual content of the web page, should be indexed accordingly (much like the index of a dictionary), so that information is organized effectively and redundancy is reduced.
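The four field types can be read as combinations of three switches: analyzed, indexed, and stored. The following is a minimal sketch of that idea, not the API of any particular search library. The names FieldType, TinyIndex, and analyze, as well as the sample document, are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldType:
    analyzed: bool  # split the value into terms before indexing
    indexed: bool   # make the value searchable
    stored: bool    # keep the original value for display


# The four types described above, expressed as flag combinations.
KEYWORD = FieldType(analyzed=False, indexed=True, stored=True)
UNINDEXED = FieldType(analyzed=False, indexed=False, stored=True)
UNSTORED = FieldType(analyzed=True, indexed=True, stored=False)
TEXT = FieldType(analyzed=True, indexed=True, stored=True)


def analyze(value):
    """Toy analyzer (an assumption): lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index, for illustration only."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = []                  # doc id -> {field: original value}

    def add(self, doc):
        """Add a document given as {field name: (value, FieldType)}."""
        doc_id = len(self.stored)
        kept = {}
        for name, (value, ftype) in doc.items():
            if ftype.indexed:
                # Unanalyzed fields are indexed as one single term.
                terms = analyze(value) if ftype.analyzed else [value]
                for term in terms:
                    self.postings[(name, term)].add(doc_id)
            if ftype.stored:
                kept[name] = value
        self.stored.append(kept)
        return doc_id

    def search(self, field, term):
        """Return the sorted ids of documents containing term in field."""
        return sorted(self.postings.get((field, term), ()))


index = TinyIndex()
doc_id = index.add({
    "path": ("/docs/ch4/index.html", KEYWORD),
    "summary": ("Shown in the result list only", UNINDEXED),
    "body": ("Full text indexing builds an inverted index", UNSTORED),
    "title": ("Full Text Indexing", TEXT),
})
```

In this sketch the path can only be found by its exact value, the summary cannot be searched at all, the body can be searched but its original text is not kept, and the title is both searchable and retrievable.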
The underlying implementation of indexing and querying is the core of a search engine:
1) Index fields wherever possible to improve query speed. However, too many indexes slow down updates to the index table, and too many sort conditions on the results are often a performance killer in practice.
2) The indexer's merge_factor parameter governs how several indexes are merged. Setting it sensibly directly affects indexer performance.
3) The 20%/80% principle: the quantity of query results does not equal their quality. Especially when many results are returned, optimizing the quality of the first few results always matters most.
4) Keep the result set as small as possible. Even on a large distributed file system, not just in a single application, random access to the result set is a resource-consuming operation.
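Points 3) and 4) can be illustrated together: rather than sorting and holding every hit, a query can keep only the few best-scoring results. The following is a small sketch using Python's standard heapq module. The top_k helper, the document ids, and the scores are made up for illustration and do not come from any real engine.

```python
import heapq


def top_k(scored_hits, k):
    """Return the k best (doc_id, score) pairs, highest score first.

    scored_hits may be any iterable, so hits can be streamed in
    instead of being collected and fully sorted first.
    """
    return heapq.nlargest(k, scored_hits, key=lambda hit: hit[1])


# Hypothetical scored hits for one query.
hits = [
    ("doc-a", 0.12),
    ("doc-b", 0.91),
    ("doc-c", 0.47),
    ("doc-d", 0.88),
    ("doc-e", 0.05),
]

# Pass a generator to show that the hits need not be a list.
best = top_k((hit for hit in hits), 2)
print(best)  # [('doc-b', 0.91), ('doc-d', 0.88)]
```

Only the requested number of results is returned to the caller, which keeps the result set small no matter how many documents matched.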
