Understanding full-text search in 10 minutes

Last Update:2015-09-07 Source: Internet

Author: User

Some notes taken while learning about full-text search.

1: Problems that full-text search solves

The data we encounter generally falls into two types: structured data and unstructured data.

Structured data: data with a fixed format or a finite length, such as database records and metadata.
Unstructured data: data with no fixed length or fixed format, such as emails and Word documents.

Structured data can be searched with a database and similar methods, although the efficiency is poor. Unstructured data can be searched as well: Windows search can look inside file contents, and so can commands such as grep on Linux. However, these tools scan the data sequentially, which is very time-consuming, and that is why full-text retrieval systems exist.
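
To make the cost of sequential scanning concrete, here is a minimal sketch of a grep-style search. The function name and the sample documents are hypothetical and only serve as illustration; the point is that every query has to read every document from start to finish.

```python
def sequential_scan(documents, keyword):
    """Return the names of all documents whose text contains the keyword."""
    keyword = keyword.lower()
    hits = []
    for name, text in documents.items():
        # No index is available, so each document is scanned in full.
        if keyword in text.lower():
            hits.append(name)
    return hits


# Hypothetical sample data, not taken from the article.
documents = {
    "mail-1": "Quarterly price list for crushers and excavators",
    "mail-2": "Meeting notes from Monday",
    "report": "Crusher maintenance schedule",
}

print(sequential_scan(documents, "crusher"))  # ['mail-1', 'report']
```

The work grows with the total amount of text for every single query, which is what an index avoids.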

2: The principle of full-text retrieval

Extract structured information from the unstructured data (this process is called indexing), then answer searches by looking them up in the index.

3: What is in the index and why is it retrieved so quickly?

Suppose the database holds 300 million product titles, and a full-text search system has to find the relevant ones for a given keyword.

For example, the first 5 product titles are:

1: Crusher, excavator price, crusher picture, crusher industry introduction

2: Crusher sales, price, grinding machine price, grinder price

3: TV Price

4: Refrigerator Price

5: Fan Price

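Indexing process: iterate over every product title, split it into words (word segmentation), and build the inverted list.

Word segmentation of the first title: crusher / excavator / price / crusher / picture / crusher / industry / introduction. The counts below follow the original Chinese text, in which a machine name such as "crusher" is segmented into two words ("crushing" and "machine"), so "machine" is counted 4 times.

Count the occurrences of each word: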

Crushing 3

Machine 4

Mining 1

Price 1

Picture 1

Industry 1

Introduction 1

Word segmentation of the second title: crusher / sales / price / grinding machine / price / grinder / price

Count the occurrences of each word:

Crushing 1

Machine 3

Sales 1

Price 3

Grinding Stone 1

Sharpening 1
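
The two word counts above can be reproduced with a short sketch. A real system would use a proper word segmenter; the `COMPOUNDS` lookup table below is a hypothetical stand-in that splits each machine name into the two words the counts imply.

```python
from collections import Counter

# Hypothetical stand-in for a word segmenter: each machine name is split
# into a stem plus "machine", matching the counts given in the text.
COMPOUNDS = {
    "crusher": ["crushing", "machine"],
    "excavator": ["mining", "machine"],
    "grinding machine": ["grinding stone", "machine"],
    "grinder": ["sharpening", "machine"],
}


def tokenize(words):
    """Expand compound words into their parts; keep other words as-is."""
    tokens = []
    for word in words:
        tokens.extend(COMPOUNDS.get(word, [word]))
    return tokens


title_1 = ["crusher", "excavator", "price", "crusher", "picture",
           "crusher", "industry", "introduction"]
title_2 = ["crusher", "sales", "price", "grinding machine", "price",
           "grinder", "price"]

counts_1 = Counter(tokenize(title_1))
counts_2 = Counter(tokenize(title_2))

print(counts_1["crushing"], counts_1["machine"])  # 3 4
print(counts_2["machine"], counts_2["price"])     # 3 3
```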

Iterating over all the data produces the following inverted lists (the numbers are document numbers, and N stands for a later document):

Crushing: 1 => 2 => 3 => N

Machine: 1 => 2 => 4

Price: 1 => 2 => ...

Mining: 1 => 4 => 5 => N

Picture: 1 => N

Industry: 1 => N

Introduction: 1 => 8 => 10 => N-1

Sales: 2 => N

Grinding Stone: 2 => N

Sharpening: 2 => N

The inverted list is now built. With a large number of documents the inverted list becomes very large, but that is not a problem: it is written to the hard drive in a dedicated format.
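
The indexing step can be sketched as follows. This is a minimal in-memory illustration, not the on-disk format mentioned above. The token lists for titles 1 and 2 follow the word counts in the text; the token lists for titles 3 to 5 are assumptions, so the result does not try to reproduce the document numbers of the full 300-million-title example.

```python
from collections import defaultdict

# Token lists keyed by document number. Titles 3 to 5 are assumed to
# segment into two words each; the article does not show their tokens.
docs = {
    1: ["crushing", "machine", "mining", "machine", "price",
        "crushing", "machine", "picture",
        "crushing", "machine", "industry", "introduction"],
    2: ["crushing", "machine", "sales", "price",
        "grinding stone", "machine", "price",
        "sharpening", "machine", "price"],
    3: ["tv", "price"],
    4: ["refrigerator", "price"],
    5: ["fan", "price"],
}


def build_inverted_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures a document is listed once per word.
        for token in set(docs[doc_id]):
            index[token].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(index["crushing"])  # [1, 2]
print(index["price"])     # [1, 2, 3, 4, 5]
```

Visiting documents in ascending order keeps every list sorted, which makes the later intersection step cheap.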

Retrieval process:

When the user enters the keyword "crusher price", the full-text retrieval system first segments the query string into 3 words (crushing, machine, price) and then fetches the inverted list for each of them. From this point the work is done by the CPU, which calculates which documents appear in all 3 inverted lists. The result is the set of document numbers (1 and 2) that contain "crusher price"; the corresponding details are then fetched from the database by these numbers and presented to the user.
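
A minimal sketch of this retrieval step, assuming each inverted list is kept sorted by document number. The `index` contents are illustrative values chosen for the five sample titles, and the two-pointer merge is one common way to intersect sorted lists, not necessarily what any particular engine does.

```python
def intersect(a, b):
    """Intersect two sorted lists of document numbers in one pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def search(index, query_tokens):
    """Return the documents that contain every query word."""
    lists = [index.get(token, []) for token in query_tokens]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Illustrative inverted lists (assumed values for the five sample titles).
index = {
    "crushing": [1, 2],
    "machine": [1, 2, 4],
    "price": [1, 2, 3, 4, 5],
}

print(search(index, ["crushing", "machine", "price"]))  # [1, 2]
```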

4: Sorting search results

In the example above, a search for "crusher price" returns both document 1 and document 2. Deciding which of them is listed first requires a scoring step.

1: Calculate a weight for each matching document (based on the search keyword "crusher price"), using a formula built from the following two quantities:

Term Frequency (TF): how many times the word appears in this document.

Document Frequency (DF): how many documents contain this word.

The simplest way to read the formula: the larger the DF, the less important the word; the larger the TF, the more important the word is to the document.
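
As an illustration only, since the text describes the formula's inputs and behaviour but not its exact form: one common weighting that behaves this way is TF-IDF. Here $N$, the total number of documents, and the subscripts $t$ (word) and $d$ (document) are assumed symbols.

```latex
w_{t,d} = \mathrm{TF}_{t,d} \times \log\frac{N}{\mathrm{DF}_t}
```

The weight grows with TF and shrinks as DF grows; a word that appears in every document gets a weight of 0.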

Using this formula, calculate the weight of each query word in every document obtained in the previous step, then sum the weights to get the score of that document (the real algorithm is more complex; this is a simplified version). Finally, sort the documents by score.
In the example above, "crushing" appears in 2 documents, so its document frequency (DF) is 2.
"Machine" appears in 3 documents, so its DF is 3.
"Price" appears in 5 documents, so its DF is 5.
"Crushing" has the smallest DF, so it is the most important of the three words. Next, look at TF: "crushing" appears 3 times in document 1 and once in document 2, so document 1 gets the higher score for this word.
Continue in the same way for "machine" and "price", and sum the results for each document. Document 1 ends up with the highest score, so the full-text search system ranks document 1 first and document 2 second.
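
The ranking described above can be checked with a small sketch. It assumes the common weighting TF × log(N / DF), which is an assumption and not necessarily the exact formula the article had in mind. The DF values come from the text, the TF values from the word counts of the first two titles, and `TOTAL_DOCS = 5` assumes only the five sample titles exist.

```python
import math

TOTAL_DOCS = 5  # assumption: only the five sample titles are indexed

# Document frequencies as stated in the text.
DF = {"crushing": 2, "machine": 3, "price": 5}

# Term frequencies from the word counts of titles 1 and 2.
TF = {
    1: {"crushing": 3, "machine": 4, "price": 1},
    2: {"crushing": 1, "machine": 3, "price": 3},
}


def weight(tf, df, total_docs):
    """Assumed weighting: grows with TF, shrinks as DF grows."""
    return tf * math.log(total_docs / df)


def score(doc_id, query_tokens):
    """Sum the weights of all query words for one document."""
    return sum(
        weight(TF[doc_id].get(token, 0), DF[token], TOTAL_DOCS)
        for token in query_tokens
    )


query = ["crushing", "machine", "price"]
ranking = sorted(TF, key=lambda doc_id: score(doc_id, query), reverse=True)
print(ranking)  # [1, 2]
```

Under this weighting "price" contributes nothing, because it appears in all five documents, so the order is decided by "crushing" and "machine".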

5: Implementation of phrase matching (exact match)
To be continued.

6: Implementation of attribute filtering
To be continued.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
