Introduction and algorithm analysis of full-text information retrieval

Last Update:2017-02-27 Source: Internet

Author: User


I. Summary

This article introduces the concept, application areas, algorithm classification, technical difficulties, and algorithm comparison of full-text information retrieval, along with a data structure and algorithm for full-text search.

II. What are full-text databases and full-text information retrieval?

The data stored in a database falls into two categories. The first is structured data, such as characters, dates, numbers, and currency amounts, which has a limited length or a fixed format. The second is unstructured data, also called full-text data, such as resumes, profiles, and papers, which is character data of variable length with no fixed format.

Existing database systems mainly target structured data for retrieval, because that is relatively simple to implement. For a numeric search, for example, you can build a sorted index table and then use binary search, which is fast. Searching unstructured, full-text data is much harder.
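As a rough illustration of the sorted index table mentioned above, the following Python sketch builds an index over a numeric field and looks values up by binary search. The record layout and the name `find_equal` are hypothetical and chosen only for this example.

```python
from bisect import bisect_left

# Hypothetical records: (record_id, price)
records = [(1, 350), (2, 120), (3, 980), (4, 120), (5, 560)]

# Sorted index table: (value, record_id) pairs ordered by value.
index = sorted((price, rid) for rid, price in records)

def find_equal(index, value):
    """Return the ids of records whose indexed value equals `value`."""
    # (value,) sorts before every (value, record_id) pair, so bisect_left
    # lands on the first entry with this value, if there is one.
    pos = bisect_left(index, (value,))
    ids = []
    while pos < len(index) and index[pos][0] == value:
        ids.append(index[pos][1])
        pos += 1
    return ids

matching_ids = find_equal(index, 120)
```

The lookup needs only a logarithmic number of comparisons to find the first match, which is why this kind of search stays fast as the table grows.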

Of course, you might say: "How hard can this be? Just read the full text into memory and compare it against the query." Yes, that is a very simple idea. The serious problem is that if the database holds 10,000, 100,000, or 1 million records, the time spent searching becomes enormous. If a full-text database system takes more than half a minute to respond to a retrieval command, no user will tolerate it.

Therefore, the main purpose of full-text search is to search large volumes of unstructured data quickly.
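The naive approach described above, reading every record and comparing it with the query, can be sketched in a few lines of Python. The sample documents and the name `naive_search` are made up for illustration.

```python
def naive_search(documents, query):
    """Scan every document in full and return the ids of those containing the query."""
    return [doc_id for doc_id, text in documents.items() if query in text]

# Hypothetical sample data.
documents = {
    1: "full-text retrieval of patent records",
    2: "a medical database",
    3: "patent law and electronic publishing",
}

patent_ids = naive_search(documents, "patent")
```

Every query touches all of the stored text, so the running time grows with the total size of the database. That is the cost an index is meant to avoid.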

III. Areas of Application

As computers become more widely used and more data accumulates, the demand for full-text search grows more urgent. At present, the main application areas are library databases, intelligence databases, patent databases, medical databases, office automation, historical databases, and electronic publishing systems.

IV. Algorithm classification and comparison

At present there are two basic schemes for full-text information retrieval: the word index and the character index.

The word index is a retrieval algorithm that uses words as the index unit. It is the original full-text search technology, with products already available in the 1960s. The reason is simple: computers grew up in an English-language environment, and the basic element of English is the word, with a space between each pair of words. Building the index of a full-text database by splitting the text into words is therefore simple and natural. When full-text search technology was first introduced in China, it arrived as Chinese-localized versions of English database systems, so word indexing was the natural choice.

However, because the basic language elements of Chinese and English differ, Chinese text must first be segmented into words. For example, the sentence "I am Chinese" (我是中国人) must be cut into the words "I / am / Chinese" (我 / 是 / 中国人). For a human this is trivial; a second-grade primary school level of Chinese is enough. For a computer it is very difficult. The rough procedure for computer segmentation is: cut the article into paragraphs, cut the paragraphs into sentences, cut the sentences into phrases, then look the phrases up in a lexicon and divide the words according to parts of speech such as verbs, conjunctions, and adjectives. In some cases the computer simply cannot segment the text correctly. The following are two well-known jokes about automatic segmentation:

(1) 我们要积极地主动地搞好计划生育工作 ("We must actively and proactively do a good job in family planning work")

Naive segmentation result: 我们 / 要 / 积极 / 地主 / 动 / 地 / 搞好 / 计划生育 / 工作, in which the word 地主 ("landlord") appears

Comment: "I, Hu Hansan, am back again." (Hu Hansan is a landlord villain in a well-known Chinese film.)

Consequence: a search for "landlord" (地主) returns this sentence as a false hit

(2) 吉林省长春市的人民 ("the people of Changchun City, Jilin Province")

Naive segmentation result: 吉林 / 省长 / 春市 / 的 / 人民, that is, "Jilin / governor / Spring City / of / people"

Comment: I see, Jilin has a governor called "Spring City".

Consequence: a search for "Jilin Province" (吉林省) misses this sentence
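To make this kind of mis-segmentation concrete, here is a minimal Python sketch of forward maximum matching, a simple dictionary-based segmentation technique. It is a stand-in for illustration, not the procedure used by any product mentioned in this article. The sentence is 吉林省长春市的人民 ("the people of Changchun City, Jilin Province"), and the toy lexicon is hypothetical and deliberately incomplete.

```python
def forward_max_match(text, lexicon, max_len=3):
    """Greedy segmentation: at each position take the longest word found in the lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

sentence = "吉林省长春市的人民"

# Hypothetical toy lexicon that lacks the entries "吉林省" and "长春市".
small_lexicon = {"吉林", "省长", "长春", "人民"}
bad_tokens = forward_max_match(sentence, small_lexicon)

# With the two place names added, the same procedure segments correctly.
better_lexicon = {"吉林省", "长春市", "人民"}
good_tokens = forward_max_match(sentence, better_lexicon)
```

With the small lexicon the segmenter produces 吉林 / 省长 / 春 / 市 / 的 / 人民, so the word 省长 ("governor") appears and 长春 ("Changchun") is never indexed. With the larger lexicon it produces 吉林省 / 长春市 / 的 / 人民. The result depends entirely on what the lexicon happens to contain, which is the weakness the jokes point at.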

The technical difficulty of word indexing is therefore the segmentation algorithm. Chinese-language versions of database systems such as Oracle and Notes also provide some full-text search capability, but they all have problems of one kind or another. Segmentation algorithms still have room to improve, for example by adding artificial intelligence analysis and context-based judgment. They also have one fatal weakness: recognizing proper names, such as the names of people and places.

The character index is a retrieval algorithm that uses the single Chinese character as the index unit. This is the algorithm I recommend, because it suits the Chinese environment better than the word index. It is also the main reason why Chinese-localized versions of English full-text search software have not captured the Chinese market. (At present, locally developed software such as handwriting tablets, Chinese character recognition from scans, and Chinese full-text information retrieval is well ahead of comparable foreign products.) But the character index is not without drawbacks. The main problems are:

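(1) A search for "Chinese" (华人, meaning people of Chinese descent) also returns "People's Republic of China" (中华人民共和国) as a false hit, because the two characters occur next to each other inside the longer name.

(2) A search for the Chinese medicine "rhubarb" also returns "rhubarb sealed", "rhubarb hemp", and other drugs that are completely different concepts (the quoted names are literal translations of longer drug names that contain the characters for "rhubarb"). The equivalent words do not collide in English, because their spellings differ.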
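The behaviour in item (1) can be reproduced with a small positional character index. The Python sketch below is only an illustration: the function names and sample documents are hypothetical, and the strings 华人 and 中华人民共和国 are an assumed rendering of the example in the text.

```python
from collections import defaultdict

def build_char_index(documents):
    """Map each character to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, ch in enumerate(text):
            index[ch].append((doc_id, pos))
    return index

def search(index, query):
    """Return ids of documents in which the query's characters occur consecutively."""
    if not query:
        return []
    # Candidate start positions come from the postings of the first character.
    candidates = set(index.get(query[0], []))
    for offset, ch in enumerate(query[1:], start=1):
        postings = set(index.get(ch, []))
        candidates = {(d, p) for d, p in candidates if (d, p + offset) in postings}
    return sorted({d for d, _ in candidates})

# Hypothetical sample documents.
documents = {1: "海外华人社团", 2: "中华人民共和国", 3: "全文检索"}
index = build_char_index(documents)
hits = search(index, "华人")
```

Here `hits` contains documents 1 and 2. Document 1 is a genuine match, while document 2 matches only because 华 and 人 happen to be adjacent inside a longer name, which is the false hit described above. Because every occurrence of the character sequence is found, this scheme does not miss documents.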

The false hits of the character index can be corrected. For example, when searching for "rhubarb", the system can also search for "rhubarb sealed" and then exclude the hits that belong to "rhubarb sealed", but this costs additional retrieval time. The following table compares the performance of the word index and character index schemes:

Index mode	Index ratio	Indexing speed	Retrieval speed	False hits	Missed hits
Word index	0.8 ~ 2.0	Slow	Fast	Yes	Yes
Character index	0.3 ~ 2.0	Slightly faster	Slightly slower	Yes	No
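The exclusion-based correction described before the table can be sketched as follows. This is a minimal Python illustration under stated assumptions: it works directly on a string rather than on an index, the function names are hypothetical, and the example uses 华人 inside 中华人民共和国 as an assumed rendering of the "Chinese" example above.

```python
def find_all(text, term):
    """Return every start position of `term` in `text`."""
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        start = text.find(term, start + 1)
    return positions

def search_with_exclusions(text, query, exclusions):
    """Keep hits of `query` that do not lie inside an occurrence of an exclusion term."""
    covered = []
    for term in exclusions:
        covered.extend((p, p + len(term)) for p in find_all(text, term))
    hits = []
    for p in find_all(text, query):
        end = p + len(query)
        if not any(lo <= p and end <= hi for lo, hi in covered):
            hits.append(p)
    return hits

text = "中华人民共和国的海外华人"
raw_hits = find_all(text, "华人")
filtered_hits = search_with_exclusions(text, "华人", ["中华人民共和国"])
```

The unfiltered search finds two positions, and the filtered search keeps only the second one. Each exclusion term requires its own extra search pass, which is the additional retrieval time mentioned in the text.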

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
