Chapter 4: Index (Full-Text Indexing)

Last Update:2018-12-06 Source: Internet

Author: User


Before studying search engine technology, you should have some background knowledge. Modern Information Retrieval is the classic textbook on IR, and this article assumes that readers already have that foundation.
Data needs to be processed differently depending on its type. In general, the text content of a web page can be divided into four field types:
Keyword: not analyzed, but indexed and stored as a single term. Typical examples are URLs, file system paths, dates, personal names, social insurance numbers, and phone numbers. The local search example uses the system path name as a keyword.
Unindexed: neither analyzed nor indexed, but the value is stored in the index. This type is generally used for content that is displayed with search results but does not itself need to be searched, such as a URL or a database primary key. Because the original value is stored in the index, it is not suitable for large amounts of data.
Unstored: the opposite of Unindexed. The field is analyzed and indexed, but its value is not stored in the index. It is used for content that must be searchable but does not need to be retrieved in its original form, such as the body of a web page or less important text.
Text: analyzed, indexed, and stored. Content of this type is searchable, and its original value can be retrieved from the index.
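The four field types differ only in three yes/no choices: whether the value is analyzed (split into terms), indexed (searchable), and stored (retrievable). The following minimal sketch makes that explicit. The names FieldType, FIELD_TYPES, and TinyIndex are hypothetical and chosen only for illustration, and the "analysis" here is just lowercasing and splitting on whitespace; a real indexer does considerably more.

```python
from collections import defaultdict
from typing import NamedTuple


class FieldType(NamedTuple):
    analyzed: bool  # split the value into terms before indexing
    indexed: bool   # make the value searchable
    stored: bool    # keep the original value for display


# The four types described above, expressed as flag combinations.
FIELD_TYPES = {
    "keyword":   FieldType(analyzed=False, indexed=True,  stored=True),
    "unindexed": FieldType(analyzed=False, indexed=False, stored=True),
    "unstored":  FieldType(analyzed=True,  indexed=True,  stored=False),
    "text":      FieldType(analyzed=True,  indexed=True,  stored=True),
}


class TinyIndex:
    """Toy in-memory inverted index (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> {field: original value}

    def add(self, doc_id, fields):
        """fields maps a field name to a (type name, value) pair."""
        self.stored[doc_id] = {}
        for name, (type_name, value) in fields.items():
            ftype = FIELD_TYPES[type_name]
            if ftype.indexed:
                # Assumed analysis step: lowercase and split on whitespace.
                terms = value.lower().split() if ftype.analyzed else [value]
                for term in terms:
                    self.postings[(name, term)].add(doc_id)
            if ftype.stored:
                self.stored[doc_id][name] = value

    def search(self, field, term):
        """Return the sorted ids of documents containing term in field."""
        return sorted(self.postings.get((field, term), set()))


index = TinyIndex()
index.add(1, {
    "path":    ("keyword",   "/home/user/notes.txt"),
    "title":   ("text",      "Full Text Indexing"),
    "body":    ("unstored",  "An inverted index maps terms to documents"),
    "summary": ("unindexed", "Short note on indexing"),
})

print(index.search("path", "/home/user/notes.txt"))  # [1]  (exact match only)
print(index.search("body", "inverted"))              # [1]  (searchable)
print(index.search("summary", "Short"))              # []   (stored, not searchable)
print(sorted(index.stored[1]))                       # ['path', 'summary', 'title']
```

Note that the body field can be found by a search but does not appear among the stored values, while the summary field is the reverse.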
Of course, fields are not limited to these types; more can be designed for practical applications. The design principle can be summarized as follows: content that is searched frequently, or that forms the actual content of the web page, should be indexed accordingly (much as a dictionary is organized for lookup), so that information is organized effectively and redundancy is reduced.
The underlying implementation of indexing and querying is the core of a search engine:
1) Index fields wherever doing so improves query speed. However, too many indexes slow down updates to the index table, and too many sort conditions on the results are, in practice, a common performance killer.
2) The indexer's merge_factor parameter controls how several indexes are merged. Setting it sensibly directly affects the performance of the indexer.
3) The 20%/80% principle: the quantity of query results does not equal their quality. Especially when a large number of results is returned, optimizing the quality of the first few results is always the most important task.
4) Reduce the result set as much as possible. Whether in a single-machine application or on a large distributed file system, random access to the result set is a resource-intensive operation.
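To give a feel for why merge_factor in point 2 matters, here is a toy simulation. It is a deliberately simplified model and not the merge policy of any particular indexer: it assumes every added document starts as its own one-document segment and that whenever merge_factor segments of equal size accumulate, they are merged into one. The function name simulate_merges is hypothetical.

```python
def simulate_merges(num_docs, merge_factor):
    """Return (segment sizes, documents copied by merges) in a toy model.

    Assumption: each new document becomes a one-document segment, and the
    newest merge_factor segments are merged as soon as they are equal in size.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")

    segments = []     # sizes of live segments, oldest first
    docs_copied = 0   # total documents rewritten by merges (indexing cost)

    for _ in range(num_docs):
        segments.append(1)
        # Cascade: one merge may create a segment that triggers the next one.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            docs_copied += merged

    return segments, docs_copied


for factor in (2, 10, 100):
    segments, copied = simulate_merges(1000, factor)
    print(f"merge_factor={factor}: {len(segments)} segments, "
          f"{copied} documents copied")
```

In this model a larger merge_factor rewrites fewer documents while indexing, at the cost of letting more small segments accumulate between merges, which a query would then have to visit. Real indexers weigh the same trade-off with more elaborate policies.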

This article is an English translation of an article originally published in Chinese on aliyun.com.
