Some notes taken while learning about full-text search.
1: Problems that full-text search solves
The data we encounter generally falls into two types: structured and unstructured.
- Structured data: data with a fixed format or finite length, such as database records and metadata.
- Unstructured data: data with no fixed length or format, such as emails and Word documents.
Structured data can be searched with a database and similar methods (though with poor efficiency). Unstructured data can be searched by scanning file contents, as Windows search does or as commands like grep do on Linux. This kind of sequential scan is very time-consuming, which is why full-text retrieval systems exist.
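To make the cost of sequential scanning concrete, here is a minimal Python sketch of a grep-style search. The function name and the sample documents are made up for illustration; the point is only that every query has to read every document in full.

```python
def sequential_scan(documents, keyword):
    """Return the ids of documents whose text contains keyword.

    Every document is read in full on every query, so the cost grows
    with the total amount of text stored.
    """
    keyword = keyword.lower()
    matches = []
    for doc_id, text in documents.items():
        if keyword in text.lower():
            matches.append(doc_id)
    return matches


# Hypothetical sample data, not taken from the article.
documents = {
    1: "Meeting notes: crusher price list attached",
    2: "Holiday schedule",
    3: "Quote for excavator and crusher",
}

hits = sequential_scan(documents, "crusher")
print(hits)
```

With a handful of documents this is instant, but the loop does the same amount of work for every query, which is what an index avoids.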
2: The principle of full-text retrieval
Convert unstructured data into structured data (this is the indexing process), then answer queries by looking them up in the index.
3: What is in the index, and why is retrieval so fast?
Suppose the database holds 300 million product titles and you need a keyword-based full-text search system to find the relevant ones.
For example, the first 5 product titles are:
1: Crusher, excavator price, crusher picture, crusher industry introduction
2: Crusher sales, price, grinding machine price, grinder price
3: TV Price
4: Refrigerator Price
5: Fan Price
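Indexing process: iterate through each product title, segment it into terms, and build the inverted list.
Segmenting the first title: crusher, excavator, price, crusher, picture, crusher, industry, introduction. The segmenter further splits each machine name into two terms, so "crusher" yields "crushing" + "machine" and "excavator" yields "mining" + "machine".
Count the occurrences of each term: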
Crushing 3
Machine 4
Mining 1
Price 1
Picture 1
Industry 1
Introduction 1
Segmenting the second title: crusher, sales, price, grinding machine, price, grinder, price. Each machine name again contributes one "machine" term.
Count the occurrences of each term:
Crushing 1
Machine 3
Sales 1
Price 3
Grinding Stone 1
Sharpening 1
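The two counts above can be reproduced with a few lines of Python. This is only a sketch: the term lists are segmented by hand to match the walkthrough, and no real tokenizer is involved.

```python
from collections import Counter

# Titles 1 and 2, already segmented into the terms used in the walkthrough.
# The segmentation is copied by hand (an assumption, not tokenizer output).
title_1 = [
    "crushing", "machine",             # crusher
    "mining", "machine", "price",      # excavator price
    "crushing", "machine", "picture",  # crusher picture
    "crushing", "machine", "industry", "introduction",
]
title_2 = [
    "crushing", "machine", "sales",
    "price",
    "grinding stone", "machine", "price",
    "sharpening", "machine", "price",
]

counts_1 = Counter(title_1)
counts_2 = Counter(title_2)

for term, count in counts_1.most_common():
    print(term, count)
```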
Iterate through all the data to generate the inverted lists below (each number is a document number, and N stands for further documents):
Crushing: 1 => 2 => 3 => N
Machine: 1 => 2 => 4
Price: 1 => 2
Mining: 1 => 4 => 5 => N
Picture: 1 => N
Industry: 1 => N
Introduction: 1 => 8 => 10 => N-1
Sales: 2 => N
Grinding Stone: 2 => N
Sharpening: 2 => N
The inverted lists are now built. With many documents they become very large, but that is not a problem: they are stored on disk in a structured format.
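As a rough illustration of the indexing step, the sketch below builds inverted lists for the five sample titles only, so its output is shorter than the illustrative lists above. The titles are segmented by hand, and placing "machine" in title 4 is an assumption taken from the "Machine" list rather than from a real tokenizer.

```python
from collections import defaultdict

# Document number -> hand-segmented terms (hypothetical segmentation).
DOCS = {
    1: ["crushing", "machine", "mining", "machine", "price",
        "crushing", "machine", "picture",
        "crushing", "machine", "industry", "introduction"],
    2: ["crushing", "machine", "sales", "price",
        "grinding stone", "machine", "price",
        "sharpening", "machine", "price"],
    3: ["tv", "price"],
    4: ["refrigerator", "machine", "price"],  # "machine" here is assumed
    5: ["fan", "price"],
}


def build_inverted_index(docs):
    """Map each term to the ascending list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a term repeated within one title is recorded only once
        for term in set(docs[doc_id]):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(DOCS)
print(index["crushing"])
print(index["machine"])
print(index["price"])
```

Because documents are visited in ascending order, each list comes out sorted, which is what makes the later intersection step cheap.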
Retrieval process:
When the user enters the keyword "crusher price", the full-text retrieval system first segments the string into 3 terms (crushing, machine, price) and fetches the inverted list of each. The CPU then does the work: it computes which documents appear in all 3 inverted lists. The result is the set of document numbers containing "crusher price" (1 and 2). The corresponding details are then fetched from the database by these numbers and presented to the user.
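The retrieval step boils down to intersecting the inverted lists of the query terms. Here is a minimal sketch, assuming the query has already been segmented and using small hypothetical lists for the three terms (a real system would hold far longer lists).

```python
# Hypothetical inverted lists for the three query terms.
INDEX = {
    "crushing": [1, 2],
    "machine": [1, 2, 4],
    "price": [1, 2, 3, 4, 5],
}


def search(index, query_terms):
    """Return the sorted document numbers that appear in every term's list."""
    if not query_terms:
        return []
    # A term missing from the index has an empty list, so nothing matches.
    postings = [set(index.get(term, ())) for term in query_terms]
    return sorted(set.intersection(*postings))


# "crusher price" segmented by hand into its three terms.
result = search(INDEX, ["crushing", "machine", "price"])
print(result)
```

Sets keep the sketch short. Production engines usually walk the sorted lists in step instead, which avoids building a set for each list.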
4: Sorting search results
In the example above, a search for "crusher price" returns document 1 and document 2. Deciding which one is listed first requires a scoring step.
1: Calculate a weight for each matching document (based on the search keywords "crusher price") from two quantities:
- Term Frequency (TF): how many times the term appears in this document.
- Document Frequency (DF): how many documents contain the term.
Put simply: the larger the DF, the less important the term; the larger the TF, the more important the term is to that document.
Compute the weight of every query term for each document obtained in the previous step, then sum the weights to get that document's score (the real algorithm is more complex; this is a simplified version). Finally, sort by score.
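One common weighting that matches this description is classic TF-IDF. It is given here as an assumption for illustration, not as the exact formula the notes refer to. The symbols are: tf is the count of term t in document d, df is the number of documents containing t, and N is the total number of documents.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\mathrm{score}(q, d) = \sum_{t \in q} w_{t,d}
```

A larger df shrinks the logarithm and so the weight, while a larger tf grows it, which is the behaviour described above.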
In the example above, "crushing" appears in 2 documents, so its document frequency (DF) is 2.
"Machine" appears in 3 documents, so its DF is 3.
"Price" appears in 5 documents, so its DF is 5.
"Crushing" has the smallest DF, so it is the most important term. Next, "crushing" appears 3 times (TF) in document 1 and once in document 2, so document 1 gets the larger weight for this term.
Repeat the calculation for "machine" and "price" and sum the weights per document; document 1 ends up with the highest score. After scoring, the full-text search system ranks document 1 first and document 2 second.
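The ranking can be checked with a short sketch. It uses the TF and DF numbers from the example, assumes a collection of just the 5 sample titles, and uses a simple tf * log(N / df) weight, which is one common simplification and not necessarily the formula a real engine applies.

```python
import math

N_DOCS = 5  # assumption: only the five sample titles are indexed

# Document frequencies and term frequencies taken from the example.
DF = {"crushing": 2, "machine": 3, "price": 5}
TF = {
    1: {"crushing": 3, "machine": 4, "price": 1},
    2: {"crushing": 1, "machine": 3, "price": 3},
}


def score(doc_tf, query_terms):
    """Sum tf * log(N / df) over the query terms (simplified TF-IDF)."""
    return sum(
        doc_tf.get(term, 0) * math.log(N_DOCS / DF[term])
        for term in query_terms
    )


query = ["crushing", "machine", "price"]
scores = {doc_id: score(doc_tf, query) for doc_id, doc_tf in TF.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that "price" contributes nothing here: it appears in every document, so log(5 / 5) is 0, which is the extreme case of "the larger the DF, the less important".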
5: Implementation of phrase matching (exact match)
To be continued.
6: Implementation of attribute filtering
To be continued.