What is full-text search?
Data can be divided into two types: structured data and unstructured data.
* Structured data: data with a fixed format or a limited length, such as database records and metadata.
* Unstructured data: data with no fixed length or format, such as emails and Word documents.
Some classifications add a third type, semi-structured data, such as XML and HTML. It can be processed as structured data when needed, or its plain text can be extracted and processed as unstructured data.
Unstructured data is also called full-text data.
Matching these two types of data, there are two types of search:
* Search over structured data: for example, querying a database with SQL statements, or searching metadata, such as searching by file name, type, or modification time with Windows Search.
* Search over unstructured data: for example, searching file contents with Windows Search or the Linux grep command, or using Google or Baidu to search a large amount of content.
Sequential scan
A sequential scan works as follows. To find the files that contain a given string, you go through the documents one by one and read each from beginning to end. If a document contains the string, it is one of the files you are looking for; you then move on to the next file, until all files have been scanned. Searching file contents with Windows Search works this way and is quite slow: if you want to find the files containing a string on an 80 GB hard disk, expect it to take several hours. The Linux grep command works the same way. The method may seem primitive, but for a small amount of data it is the most direct and convenient one. For a large number of files, however, it is very slow.
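A sequential scan can be sketched in a few lines of Python. This is only an illustration: the function name `sequential_scan` and the in-memory `documents` dictionary are assumptions made for the example, standing in for real files on disk.

```python
def sequential_scan(documents, needle):
    """Return the names of all documents whose text contains `needle`.

    Every document is examined on every search, so the cost grows
    with the total size of the collection.
    """
    matches = []
    for name, text in documents.items():
        if needle in text:  # substring search over this document's text
            matches.append(name)
    return matches


# Hypothetical in-memory "files", used only for illustration.
documents = {
    "a.txt": "Lucene is a full-text search library.",
    "b.txt": "grep scans every file line by line.",
    "c.txt": "Solr is built on top of Lucene.",
}

print(sequential_scan(documents, "Lucene"))  # ['a.txt', 'c.txt']
```

Nothing is remembered between searches here, so a second query repeats all of the work of the first.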
Full-text index
Basic idea of full-text search:
Extract part of the information from the unstructured data and reorganize it so that it has some structure, then search this structured data to make searching relatively fast. The information extracted from the unstructured data and reorganized in this way is called an index. The process of first creating an index and then searching the index is called full-text search.
Dictionary example
Take a Chinese dictionary as an example. Its pinyin index and radical index play the role of the dictionary's index. The explanation of each character is unstructured; if the dictionary had neither a pinyin index nor a radical index, you could only scan through it character by character. However, some information about a character can be extracted and given structure. Pronunciation, for example, is fairly structured: there are only a limited number of initials and finals, and they can be listed one by one. So the pronunciations are arranged in a fixed order, and each one points to the page number of the character's detailed explanation. To look up a character, you find its pronunciation in the structured pinyin index and then turn to the page it points to, where you find the unstructured data, that is, the explanation of the character.
General process of full-text retrieval
Figure from Lucene in action
Full-text search consists of two processes: index creation (indexing) and index search (search).
* Index creation: extracting information from all the structured and unstructured data in the real world and creating an index.
* Index search: receiving the user's query, searching the created index, and returning the results.
As a result, full-text search has three important issues:
1. What exactly does the index store? (Index)
2. How is the index created? (Indexing)
3. How is the index searched? (Search)
What does the index store?
Why is a sequential scan slow? Because the information we want to look up does not match the way information is stored in unstructured data.
What unstructured data stores is the strings that each file contains: given a file, it is relatively easy to obtain its strings. This is a mapping from files to strings.
What we want to find out is which files contain a given string: given a string, we want the files. This is a mapping from strings to files.
Reverse index
The two mappings run in opposite directions. So if the index always stores the mapping from strings to files, searching becomes much faster.
Because the mapping from strings to files is the reverse of the mapping from files to strings, an index that stores this information is called a reverse index (more commonly known in English as an inverted index).
Information stored in a reverse index (dictionary and posting lists)
Assume the document collection contains 100 documents. For convenience, we number the documents from 1 to 100, which gives the following structure.
The left side stores a series of strings, called the dictionary.
Each string points to a linked list of the documents that contain it. This linked list is called a posting list.
With such an index, the stored information matches the information being searched for, which greatly speeds up the search.
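The dictionary and its posting lists can be sketched as a Python mapping from each string to a sorted list of document numbers. This is a minimal sketch under stated assumptions: `build_reverse_index` is a hypothetical name, a "string" is simply a whitespace-separated word, and a plain Python list stands in for the linked list described above.

```python
from collections import defaultdict


def build_reverse_index(documents):
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    # Store each posting list as a sorted list of document numbers.
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# Hypothetical three-document collection, numbered from 1.
documents = {
    1: "lucene is a search library",
    2: "solr is built on lucene",
    3: "grep is a command",
}

index = build_reverse_index(documents)
print(index["lucene"])  # [1, 2]
print(index["is"])      # [1, 2, 3]
```

Looking up a string is now a single dictionary access instead of a pass over every document.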
Reverse index query example
For example, to find the documents that contain both the string "Lucene" and the string "Solr", we only need the following steps:
- Retrieve the linked list of documents containing the string "Lucene".
- Retrieve the linked list of documents containing the string "Solr".
- Merge the two linked lists to find the documents that contain both "Lucene" and "Solr".
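The merge step can be sketched as an intersection of two posting lists. The sketch assumes both lists are sorted by document number, so they can be walked in a single pass; the function name `intersect` and the document numbers are made up for illustration.

```python
def intersect(postings_a, postings_b):
    """Intersect two posting lists that are sorted in ascending order."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # The document appears in both lists.
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


# Hypothetical posting lists; the document numbers are made up.
lucene_postings = [1, 4, 7, 23, 56]
solr_postings = [4, 9, 23, 80]

print(intersect(lucene_postings, solr_postings))  # [4, 23]
```

Because each step advances at least one list, the work is proportional to the combined length of the two lists, not to the size of the document collection.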
Advantages and disadvantages of reverse indexing
- Disadvantage: once the cost of creating the index is counted, full-text search is not necessarily faster than a sequential scan, especially when the data volume is small. Creating an index over a large amount of data is also a slow process.
- Advantage: a sequential scan has to scan the data again on every search, whereas a full-text index is built once and used many times, so each search is fast.
How to create an index?
The full-text search index creation process generally involves the following steps:
1. Start with the original documents to be indexed.
2. Pass the original documents to the tokenizer.
The tokenizer performs the following operations (tokenizing):
- Divide documents into separate words;
- Remove punctuation marks;
- Remove stop words. A stop word is one of the most common words in a language. Because it carries no special meaning, it cannot serve as a search keyword in most cases, so it is removed when the index is created to reduce the index size. English stop words include "the", "a", and "this". The tokenizer for each language has its own stop word set.
The result produced by the tokenizer is called a token.
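The three tokenizer operations can be sketched as follows. This is a toy example, not how any particular library tokenizes: the `tokenize` function and the three-word `STOP_WORDS` set are assumptions, and the regular expression splits on anything that is not a letter, digit, or underscore.

```python
import re

# A tiny, illustrative stop word set; real tokenizers ship larger ones.
STOP_WORDS = {"the", "a", "this"}


def tokenize(text):
    """Split text into words, drop punctuation, and remove stop words."""
    # \w+ matches runs of letters, digits, and underscores,
    # so punctuation never ends up in a word.
    words = re.findall(r"\w+", text)
    # Compare case-insensitively, but keep the token's original case;
    # lowercasing is left to the linguistic processor.
    return [w for w in words if w.lower() not in STOP_WORDS]


print(tokenize("This is a book about the Lucene library."))
# ['is', 'book', 'about', 'Lucene', 'library']
```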
3. Pass the tokens to the linguistic processor.
The linguistic processor mainly applies language-specific processing to the tokens it receives.
For English, the linguistic processor generally includes the following:
- Convert to lowercase (lowercase).
- Reduce words to their root form, such as "cars" to "car". This operation is called stemming.
- Convert words to their root form, such as "drove" to "drive". This operation is called lemmatization.
The result produced by the linguistic processor is called a term.
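A toy linguistic processor might look like the sketch below. It is deliberately crude and rests on assumptions: `naive_stem` only strips a trailing plural "s" and is not a real stemming algorithm, and the `LEMMAS` table is a hypothetical two-entry lookup standing in for the full dictionary a real lemmatizer would use.

```python
# Hypothetical lookup table for irregular forms;
# a real lemmatizer relies on a full dictionary.
LEMMAS = {"drove": "drive", "driven": "drive"}


def naive_stem(word):
    """Strip a trailing plural 's' as a crude stand-in for real stemming."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def to_terms(tokens):
    """Lowercase each token, then lemmatize by lookup or fall back to stemming."""
    terms = []
    for token in tokens:
        word = token.lower()
        terms.append(LEMMAS.get(word, naive_stem(word)))
    return terms


print(to_terms(["Cars", "Drove", "Lucene", "class"]))
# ['car', 'drive', 'lucene', 'class']
```

The split mirrors the difference described above: stemming applies a fixed rule to the spelling of the word, while lemmatization looks the word up.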
4. Pass the term to the index component (Indexer)
The index component (Indexer) mainly does the following:
1. Create a dictionary from the terms obtained.
2. Sort the dictionary alphabetically.
3. Merge identical terms into posting lists (the inverted lists of documents).
The posting lists use two definitions:
- Document frequency: the total number of documents that contain the term.
- Frequency: the term frequency, that is, how many times the term occurs in a given document.
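The three indexer steps can be sketched as follows. This is one possible in-memory layout, not the format any real index uses: the function name `build_index`, the `"df"` and `"postings"` keys, and the sample terms are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(docs_terms):
    """Build {term: {"df": ..., "postings": [(doc_id, frequency), ...]}}.

    `docs_terms` maps a document number to its list of terms.
    """
    # Step 1: create the dictionary as (term, doc_id) pairs.
    pairs = [(term, doc_id)
             for doc_id, terms in docs_terms.items()
             for term in terms]
    # Step 2: sort alphabetically by term (then by document number).
    pairs.sort()
    # Step 3: merge identical terms, counting occurrences per document.
    counts = defaultdict(Counter)
    for term, doc_id in pairs:
        counts[term][doc_id] += 1
    return {
        term: {"df": len(per_doc), "postings": sorted(per_doc.items())}
        for term, per_doc in counts.items()
    }


# Hypothetical terms for two documents.
docs_terms = {
    1: ["student", "allow", "go", "friend", "allow"],
    2: ["friend", "drive", "student"],
}

index = build_index(docs_terms)
print(list(index))       # ['allow', 'drive', 'friend', 'go', 'student']
print(index["allow"])    # {'df': 1, 'postings': [(1, 2)]}
print(index["student"])  # {'df': 2, 'postings': [(1, 1), (2, 1)]}
```

Here `df` is the document frequency of the term, and the second element of each posting is the frequency of the term within that document.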
From: https://www.cnblogs.com/wwwggg/p/5588698.html
Basic principles of full-text search