Understand full-text search in 10 minutes

Source: Internet
Author: User

Some notes taken while learning about full-text search.

1: Problems that full-text search solves

The data we encounter generally falls into two types: structured data and unstructured data.

    • Structured data: data with a fixed format or finite length, such as database records and metadata.
    • Unstructured data: data with no fixed length or format, such as email and Word documents.

Structured data can be searched with a database or similar tools (with poor efficiency). Unstructured data can be searched by scanning file contents, as Windows search does or as commands like grep do on Linux. This sequential scanning is very time-consuming, however, which is why full-text retrieval systems exist.
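To make the cost of sequential scanning concrete, here is a minimal sketch of a grep-style search in Python. The function name `sequential_scan` and the sample documents are illustrative assumptions, not part of any real tool. The point is only that every query has to read every document in full.

```python
def sequential_scan(documents, keyword):
    """Return the numbers of the documents that contain keyword.

    Hypothetical helper: every call reads every document from start to end,
    so the cost of one query grows with the total amount of text.
    """
    keyword = keyword.lower()
    matches = []
    for doc_id, text in documents.items():
        if keyword in text.lower():
            matches.append(doc_id)
    return matches


# Illustrative sample data (an assumption for this sketch).
documents = {
    1: "Crusher, excavator price, crusher picture",
    2: "Crusher sales, price",
    3: "TV price",
}

print(sequential_scan(documents, "crusher"))  # [1, 2]
```

With a handful of documents this is fine; with hundreds of millions, repeating the full scan for every query is what an index is meant to avoid.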

2: The principle of full-text retrieval

Turn unstructured data into structured data (this process is called indexing), then look results up in the index.

3: What is in the index, and why is retrieval so fast?

Suppose the database holds 300 million product titles, and the full-text search system has to find the relevant ones by keyword.

For example, the first 5 product titles are:

1: Crusher, excavator price, crusher picture, crusher industry introduction

2: Crusher sales, price, grinding machine price, grinder price

3: TV Price

4: Refrigerator Price

5: Fan Price

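Indexing process: iterate through each product title, tokenize it, and build the inverted list.

Tokens of title 1: crusher, excavator, price, crusher, picture, crusher, industry, introduction. The tokenizer further splits "crusher" into "crushing" + "machine" and "excavator" into "digging" + "machine", which is why "machine" is counted 4 times below.

Number of occurrences of each word: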

Crushing 3

Machine 4

Digging 1

Price 1

Picture 1

Industry 1

Introduction 1

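Tokens of title 2: crusher, sales, price, grinding machine, price, grinder, price. Here "grinding machine" splits into "grinding stone" + "machine" and "grinder" into "sharpening" + "machine".

Number of occurrences of each word: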

Crushing 1

Machine 3

Sales 1

Price 3

Grinding Stone 1

Sharpening 1
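The counting step for title 1 can be sketched in Python as follows. The token list is hand-segmented to mirror the article's example; a real system would obtain it from a word segmenter, so treat the list itself as an assumption.

```python
from collections import Counter

# Hand-segmented tokens for product title 1 (an assumption for this sketch).
doc1_tokens = [
    "crushing", "machine",        # crusher
    "digging", "machine",         # excavator
    "price",
    "crushing", "machine",        # crusher
    "picture",
    "crushing", "machine",        # crusher
    "industry", "introduction",
]

# Counter maps each distinct token to the number of times it occurs.
term_counts = Counter(doc1_tokens)

print(term_counts["crushing"])  # 3
print(term_counts["machine"])   # 4
print(term_counts["price"])     # 1
```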

Iterate through all the data to generate the inverted lists, as follows (the numbers are document numbers):

Crushing: 1 => 2 => 3 => N

Machine: 1 => 2 => 4

Price: 1 => 2

Digging: 1 => 4 => 5 => N

Picture: 1 => N

Industry: 1 => N

Introduction: 1 => 8 => 10 => N-1

Sales: 2 => N

Grinding Stone: 2 => N

Sharpening: 2 => N

The inverted lists are now built. With many documents the inverted lists become very large, but that is not a concern: they are stored on the hard drive in a space-saving format.
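As a rough illustration of the indexing pass, the sketch below builds an in-memory inverted index for the five example titles. The token lists are hand-segmented English stand-ins and `build_inverted_index` is a hypothetical helper, so the resulting lists will not match the illustrative document numbers above exactly. On-disk storage is left out entirely.

```python
from collections import defaultdict

# Hand-segmented stand-ins for the five product titles (an assumption).
tokenized_titles = {
    1: ["crushing", "machine", "digging", "machine", "price",
        "crushing", "machine", "picture",
        "crushing", "machine", "industry", "introduction"],
    2: ["crushing", "machine", "sales", "price",
        "grinding stone", "machine", "price",
        "sharpening", "machine", "price"],
    3: ["tv", "price"],
    4: ["refrigerator", "price"],
    5: ["fan", "price"],
}


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id in sorted(tokenized_docs):
        # set() ensures a document is listed once per term, however
        # many times the term occurs inside it.
        for term in set(tokenized_docs[doc_id]):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(tokenized_titles)
print(index["crushing"])  # [1, 2]
print(index["price"])     # [1, 2, 3, 4, 5]
```

Visiting documents in ascending order keeps every list sorted, which makes the lists cheap to intersect at query time.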

Retrieval process:

Given the query "crusher price", the full-text retrieval system first tokenizes the incoming string into "crushing", "machine", and "price", then fetches the inverted lists for these 3 words. From here the work falls to the CPU, which computes which documents appear in all 3 inverted lists. The result is the set of document numbers (1 and 2) that contain "crusher price". The corresponding details are then fetched from the database by these numbers and presented to the user.
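The "which documents appear in all the lists" step is a list intersection. A minimal sketch, assuming each inverted list is a sorted Python list of document numbers; the sample lists and the `intersect` and `search` helpers are illustrative assumptions, not the API of any particular search engine.

```python
from functools import reduce


def intersect(a, b):
    """Intersect two sorted lists of document numbers with a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query_terms):
    """Return the documents that contain every query term."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest list to keep work small
    return reduce(intersect, lists)


# Assumed toy inverted lists for the three query terms.
index = {
    "crushing": [1, 2],
    "machine": [1, 2, 4],
    "price": [1, 2, 3, 4, 5],
}

print(search(index, ["crushing", "machine", "price"]))  # [1, 2]
```

Because the lists are sorted, each intersection takes a single pass over both lists and never needs to look at the document text.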

4: Sorting search results

As in the example above, a search for "crusher price" returns document 1 and document 2. Deciding which one ranks first requires a scoring step.

1: Calculate a weight for each matching document (based on the search keywords "crusher price") from the following two quantities:

  • Term Frequency (TF): how many times the word appears in this document.
  • Document Frequency (DF): how many documents contain this word.

Put simply: the larger the DF, the less important the word; the larger the TF, the more important the word is to the document.

Use these quantities to calculate a weight for each query word in every document obtained in the previous step, then sum the weights to get the document's score (the real algorithm is more complex; this is a simplified version). Finally, sort by score.
In the example above, "crushing" appears in 2 documents, so its document frequency (DF) is 2.
"Machine" appears in 3 documents, so its DF is 3.
"Price" appears in 5 documents, so its DF is 5.
"Crushing" has the smallest DF, so it is the most important word. Next, "crushing" appears 3 times (TF) in document 1 and 1 time in document 2, so document 1 gets the higher score for this word. Every other matching document is scored the same way.
Scoring "machine" and "price" across the documents in the same way gives document 1 the highest total. After scoring, the full-text search system ranks document 1 first and document 2 second.
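The article does not spell out its scoring formula, so the sketch below assumes one common TF-IDF-style weighting, tf * log(1 + N / df), purely for illustration. The DF values are the ones quoted above, the TF values come from the per-title counts earlier in the article, and the corpus size `N_DOCS = 5` is an assumption.

```python
import math

N_DOCS = 5  # assumed corpus size: the five example titles

# Document frequencies quoted in the example.
df = {"crushing": 2, "machine": 3, "price": 5}

# Term frequencies of the query words in documents 1 and 2.
tf = {
    1: {"crushing": 3, "machine": 4, "price": 1},
    2: {"crushing": 1, "machine": 3, "price": 3},
}


def score(doc_id, query_terms):
    """Sum tf * idf over the query terms (assumed simplified weighting)."""
    total = 0.0
    for term in query_terms:
        idf = math.log(1 + N_DOCS / df[term])  # rarer word, larger weight
        total += tf[doc_id].get(term, 0) * idf
    return total


query = ["crushing", "machine", "price"]
ranked = sorted(tf, key=lambda doc_id: score(doc_id, query), reverse=True)
print(ranked)  # [1, 2]
```

Under this assumed weighting document 1 scores about 8.4 and document 2 about 6.3, which agrees with the ranking described above.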

5: Implementation of phrase matching (exact match)

To be continued.

6: Implementation of attribute filtering

To be continued.
