A repost of an article that gives a quite good summary of the principles of full-text retrieval.
There are two types of data in our lives: structured data and unstructured data.
Structured data: data with a fixed format or limited length, such as database records and metadata.
Unstructured data: data of indefinite length and with no fixed format, such as emails or Word documents.
Of course, there is a third type, semi-structured data such as XML and HTML, which can be processed as structured data when needed, or have its plain text extracted and handled as unstructured data.
Unstructured data is also called full-text data.
Following this classification of data, there are two types of search:
Search of structured data: for example, searching a database with SQL statements, or searching metadata, such as finding files by name, type, or modification time with Windows Search.
Search of unstructured data: for example, searching file contents with Windows, using the grep command in Linux, or searching huge amounts of content data with Google or Baidu.
There are two main methods of searching unstructured data, that is, full-text data:
One is sequential scanning (serial scanning). To find files containing a certain string, you go through the documents one by one: for each document, you read it from beginning to end, and if it contains the string, it is one of the files you are looking for; then you move on to the next file, until every file has been scanned. Searching file contents with Windows works this way, and it is quite slow: if you have an 80 GB hard disk and want to find every file on it containing a certain string, it can easily take hours. The grep command in Linux works the same way. This method may seem primitive, but for small amounts of data it is the most direct and convenient. For large numbers of files, however, it is very slow.
Some may say that while sequential scanning of unstructured data is slow, searching structured data is relatively fast (because structured data has a definite structure, and search algorithms can exploit that structure to speed things up). So can we find a way to give unstructured data a certain structure?
This idea naturally leads to the basic idea of full-text retrieval: extract part of the information from the unstructured data, reorganize it so that it has a definite structure, and then search that structured data, thereby achieving relatively fast search.
This part of the information, extracted from the unstructured data and then reorganized, is what we call the index (Index).
This sounds abstract, but a few examples make it easy to understand. Take a dictionary: the pinyin table and the radical index are in effect the dictionary's index, while the explanation of each word is the unstructured data. If the dictionary had no pinyin table and no radical index, you could only scan through it word by word. But some information about a word can be extracted and structured. The pronunciation, for instance, is relatively structured: there are only a limited number of initials and finals, and they can all be enumerated. So the pronunciations are arranged in a certain order, and each one points to the page number of the word's detailed explanation. We search the structured pinyin table for the pronunciation, then turn to the page it points to and find the unstructured data, namely the explanation of the word.
This process of first creating an index and then searching the index is called full-text search (Full-text Search).
The following figure describes the general process of full-text retrieval:
Full-text retrieval consists of two processes: index creation (Indexing) and index search (Search).
Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index from it.
Index search: the process of receiving a user's query request, searching the created index, and returning the results.
Full-text search thus raises three important questions:
1. What exactly does the index store? (Index)
2. How to create an index? (Indexing)
3. How to search the index? (Search)
Next we will study each problem in sequence.
1. What is stored in the index?
What exactly needs to be stored in the index?
First, let's see why sequential scanning is slow:
The reason is that the information we want to search for is inconsistent with the information stored in unstructured data.
The information stored in unstructured data is the strings contained in each file; that is, given a file, it is relatively easy to work out which strings it contains: a mapping from files to strings. The information we want to search for, however, is which files contain a given string; that is, given a string, find the files: a mapping from strings to files. The two are exactly opposite. Therefore, if the index can store the mapping from strings to files, search speed will be greatly improved.
Since the mapping from strings to files is the reverse of the mapping from files to strings, an index that stores this information is called an inverted index.
The information stored in an inverted index is generally as follows:
Assume there are 100 documents in our document set. For convenience, we number the documents from 1 to 100 and obtain a structure like the following:
The left side holds a series of strings, called the dictionary (Dictionary).
Each string points to a linked list of the documents that contain it; this document list is called the posting list (inverted list).
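Sketched in text, with hypothetical strings and document numbers standing in for the original figure, the structure looks like this:

    Dictionary        Posting lists
    "lucene"   ->  [3] -> [14] -> [29] -> [68]
    "solr"     ->  [14] -> [52]
    "hadoop"   ->  [3] -> [9] -> [29]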
With such an index, the stored information matches the information being searched for, which greatly speeds up the search.
For example, to find the documents that contain both the string "Lucene" and the string "SOLR", we only need the following steps (a code sketch follows the list):
1. Retrieve the linked list of documents containing the string "Lucene".
2. Retrieve the linked list of documents containing the string "SOLR".
3. Intersect the two linked lists to find the files that contain both "Lucene" and "SOLR".
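A minimal sketch of this idea in Python, with made-up documents (real systems keep sorted posting lists on disk rather than in-memory sets):

    # A minimal inverted index: map each term to the set of document ids
    # that contain it, then answer an AND query by set intersection.
    docs = {
        1: "lucene is a full-text search library",
        2: "solr is built on lucene",
        3: "mysql is a relational database",
    }

    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    # Documents containing both "lucene" and "solr":
    result = index.get("lucene", set()) & index.get("solr", set())
    print(sorted(result))  # [2]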
Some may say that full-text retrieval does speed up the search, but once the indexing process is added in, the total is not necessarily faster than sequential scanning. Indeed, counting index creation, full-text search is not necessarily faster than sequential scanning, especially when the data volume is small, and creating an index over a large amount of data is itself a slow process.
However, there is a difference between the two: sequential scanning must scan everything on every search, whereas index creation only needs to happen once, and from then on the work is done once and for all. Each search does not need to go through the creation process again; it only needs to search the index that has already been created.
This is one of the advantages of full-text search over sequential scanning: create the index once, use it many times.
2. How to create an index
The full-text search index creation process generally involves the following steps:
Step 1: Start with some original documents to be indexed (Document).
To facilitate the index creation process, two files are used here as an example:
File 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
File 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 2: Pass the original documents to the word segmentation component (Tokenizer).
The tokenizer performs the following operations (this process is called Tokenize):
1. Divide the documents into separate words.
2. Remove punctuation marks.
3. Remove stop words (Stop Words).
Stop words (Stop Words) are the most common words in a language. Because they carry no special meaning, in most cases they cannot be search keywords, so they are removed when the index is created, which also reduces the index size.
English stop words include "the", "a", and "this".
Each language's word segmentation component (Tokenizer) has its own stop word set.
The result obtained after tokenization is called a token (Token).
In our example, we get the following tokens (Token):
"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".
Step 3: Pass the obtained tokens to the language processing component (Linguistic Processor).
The linguistic processor mainly performs language-specific processing on the obtained tokens.
For English, linguistic processing generally includes the following:
1. Convert to lowercase (Lowercase).
2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.
3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.
Similarities and differences between stemming and lemmatization:
Similarity: both stemming and lemmatization bring words into a root form.
Their methods are different:
Stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive".
Lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive".
Their algorithms are different:
Stemming mainly applies fixed rules for the reduction, such as removing "s", removing "ing" and adding "e", changing "ational" to "ate", and changing "tional" to "tion".
Lemmatization mainly relies on a stored dictionary for the transformation. For example, the dictionary contains the mappings "driving" to "drive", "drove" to "drive", and "am, is, are" to "be"; to transform a word, you only need to look it up in the dictionary.
Stemming and lemmatization are not mutually exclusive; they overlap, and some words can be brought to the same form by either method.
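A toy sketch of the two approaches just described; real stemmers such as the Porter stemmer have many more rules and conditions, and real lemmatizers use full dictionaries:

    # Stemming: fixed suffix-stripping rules, as described above (crude;
    # real rules check conditions before stripping).
    def stem(word):
        w = word.lower()
        if w.endswith("ational"):
            return w[:-7] + "ate"
        if w.endswith("tional"):
            return w[:-6] + "tion"
        if w.endswith("ing"):
            return w[:-3] + "e"    # "driving" -> "drive"
        if w.endswith("s"):
            return w[:-1]          # "cars" -> "car"
        return w

    # Lemmatization: look the word up in a stored dictionary of mappings.
    LEMMAS = {"drove": "drive", "driving": "drive",
              "am": "be", "is": "be", "are": "be"}

    def lemmatize(word):
        w = word.lower()
        return LEMMAS.get(w, w)

    print(stem("cars"), stem("driving"))        # car drive
    print(lemmatize("drove"), lemmatize("am"))  # drive be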
The result of the linguistic processor is called a term (Term).
In our example, the terms obtained after language processing are as follows:
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".
It is precisely because of the language processing step that a search for "drove" can also turn up "drive".
Step 4: Pass the obtained terms to the index component (Indexer).
The index component (Indexer) mainly does the following:
1. Use the obtained terms to create a dictionary.
In our example, the dictionary is as follows:
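Reconstructed from the term list above, the (unsorted) dictionary pairs each term with the document it came from:

    Term     Document ID
    student  1
    allow    1
    go       1
    their    1
    friend   1
    allow    1
    drink    1
    beer     1
    my       2
    friend   2
    jerry    2
    go       2
    school   2
    see      2
    his      2
    student  2
    find     2
    them     2
    drink    2
    allow    2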
2. Sort the dictionary entries alphabetically.
3. Merge identical terms into document posting lists (Posting List).
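After sorting and merging, the inverted table for our two files looks like this (DF is the document frequency; each posting is written as document id : term frequency):

    Term     DF   Postings (doc : tf)
    allow    2    (1:2) -> (2:1)
    beer     1    (1:1)
    drink    2    (1:1) -> (2:1)
    find     1    (2:1)
    friend   2    (1:1) -> (2:1)
    go       2    (1:1) -> (2:1)
    his      1    (2:1)
    jerry    1    (2:1)
    my       1    (2:1)
    school   1    (2:1)
    see      1    (2:1)
    student  2    (1:1) -> (2:1)
    their    1    (1:1)
    them     1    (2:1)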
In this table, there are several definitions:
Document Frequency (DF): the total number of documents that contain the term.
Frequency (TF): the number of times the term appears in a given document.
So for the term "allow", a total of two documents contain it, and the posting list after the term therefore has two items. The first item indicates the first document containing "allow", document 1, in which "allow" appears twice; the second item indicates the second document containing "allow", document 2, in which "allow" appears once.
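In code, the entry for "allow" might be represented like this (a sketch only; real indexes use compressed on-disk structures):

    # Posting list entry for "allow": document frequency 2, plus
    # (document id, term frequency) pairs for each matching document.
    entry = {
        "term": "allow",
        "df": 2,
        "postings": [(1, 2), (2, 1)],  # doc 1: twice; doc 2: once
    }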
So far, the index has been created, and we can quickly find the document we want.
In addition, we are pleasantly surprised to find that searches for "drive", "driving", "drove", and "driven" can all be matched, because in our index "driving", "drove", and "driven" were all converted to "drive" by language processing. When someone searches for "driving", the query string likewise goes through steps one to three here and becomes a query for "drive", so the desired documents can be found.
3. How to search the index?
It seems we can now announce, "We have found the documents we want."
However, the process is not over; this is only one aspect of full-text search. Isn't it? If only one document, or ten, contains the string we query, we have indeed found it. But what if there are a thousand results, or even tens of thousands? Which file is the one we want most?
Open Google. Say you want to find a job at Microsoft, so you enter "Microsoft job", only to find that 22,600,000 results are returned. Finding nothing is a problem, but finding too many is also a problem. How do we put the most relevant results at the top?
Of course Google does this well: you spot jobs at Microsoft in a moment. But imagine if the first few results were all "Microsoft does a good job at software industry...". What a terrible thing that would be.
How can we, like Google, find the results most relevant to the query statement among thousands of them?
How can we determine the relevance between the searched documents and query statements?
This brings us back to our third question: how to search the index?
You can search by following these steps:
Step 1: Enter the query statement.
Query statements, like our ordinary language, follow a certain syntax. Different query statements have different syntaxes, just as SQL statements do, and the syntax varies from one full-text retrieval system to another. The most basic constructs are AND, OR, and NOT.
For example, a user enters the statement "lucene AND learned NOT hadoop", indicating that the user wants to find documents that contain "lucene" and "learned" but not "hadoop".
Step 2: Perform lexical analysis, syntax analysis, and language processing on the query statement.
Because query statements have syntax, lexical analysis, syntax analysis, and language processing are also required.
1. Lexical analysis mainly identifies words and keywords.
In the example above, lexical analysis yields the words "lucene", "learned", and "hadoop", and the keywords AND and NOT. If an illegal keyword appears during lexical analysis, an error occurs; for example, in "lucene AMD learned", the misspelling of AND means that "amd" participates in the query as an ordinary word.
2. Syntax analysis mainly builds a syntax tree according to the grammar rules of the query statement.
If the query statement does not conform to the grammar rules, an error is reported; for example, "lucene NOT AND learned" raises an error. The statement from the example above, "lucene AND learned NOT hadoop", forms the following syntax tree:
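Sketched in text, the tree looks like this:

    AND
     |- lucene
     |- learned
     '- NOT
         '- hadoop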
3. Language processing is almost identical to the language processing in the indexing phase.
For example, "learned" becomes "learn". After the second step, we obtain a language-processed syntax tree:
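That is:

    AND
     |- lucene
     |- learn
     '- NOT
         '- hadoop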
Step 3: Search the index to obtain the documents that conform to the syntax tree.
This step involves several small steps (sketched in code after the list):
1. First, in the inverted index table, find the posting lists of documents containing "lucene", "learn", and "hadoop" respectively.
2. Next, merge (intersect) the lists for "lucene" and "learn" to obtain the list of documents containing both "lucene" and "learn".
3. Then, take the difference between this list and the "hadoop" list, removing the documents that contain "hadoop". This yields the list of documents that contain both "lucene" and "learn" but not "hadoop".
4. This list of documents is exactly what we are looking for.
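A minimal sketch of these merge steps in Python, using sorted lists of document ids in place of linked lists (posting lists are kept sorted precisely so they can be merged in one pass); the posting lists themselves are hypothetical:

    def intersect(a, b):
        # One-pass merge of two sorted doc-id lists: keep ids in both.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def difference(a, b):
        # Keep ids in a that are not in b.
        out, j = [], 0
        for x in a:
            while j < len(b) and b[j] < x:
                j += 1
            if j == len(b) or b[j] != x:
                out.append(x)
        return out

    lucene, learn, hadoop = [1, 3, 5, 8], [2, 3, 5, 9], [5, 7]
    print(difference(intersect(lucene, learn), hadoop))  # [3]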
Step 4: Sort the results according to the relevance of the obtained documents to the query statement.
Although the previous step gave us the documents we want, the results should still be sorted by their relevance to the query statement.
How do we compute the relevance between a document and the query statement?
We might as well regard the query statement as a short document too, and score the relevance (relevance) between documents; documents with high scores are highly relevant and should be placed first.
So how do we score the relevance between documents?
This is not an easy task. First, let's take a look at the relationship between people.
First of all, a person has many facets, such as character, beliefs, hobbies, clothing, height, and build. Secondly, for the relationship between two people, different facets carry different weight: character, beliefs, and hobbies may matter more, while clothing, height, and build matter less. So two people with the same or similar character, beliefs, and hobbies can easily become good friends, whereas people with different clothing, height, or build can still become good friends. To judge the relationship between people, therefore, first find the facets that matter most to such relationships, such as character, beliefs, and hobbies; then compare the two people on those facets. For example, one person is cheerful and the other outgoing; one believes in Buddhism and the other in God; one loves playing basketball and the other football. We find that both are positive in character, devout in belief, and fond of sports, so the relationship between the two should be good.
Let's take a look at the relationship between companies.
First of all, a company is made up of many people, such as the general manager, managers, the chief technology officer, ordinary employees, security guards, and doormen. Secondly, for the relationship between two companies, different people carry different weight: the general manager, managers, and chief technology officer may matter more, while ordinary employees, security guards, and doormen matter less. So if the general managers, managers, and chief technology officers of the two companies get along well, the two companies tend to have a good relationship, whereas even if an ordinary employee of one company has a deep grudge against an ordinary employee of the other, it hardly affects the relationship between the two companies. To judge the relationship between companies, therefore, first find the people who matter most to such relationships, such as the general manager, managers, and chief technology officer; then judge the relationships between those people. Say the two general managers were once classmates, the managers are from the same hometown, and the chief technology officers were once startup partners. We find that at the level of general manager, manager, and chief technology officer alike the two companies get along well, so the relationship between the two companies should be very good.
After analyzing the two relationships, let's take a look at how to determine the relationships between documents.
First, a document is made up of many terms (Term), such as search, lucene, full-text, this, a, and what.
Second, for the relationship between documents, different terms have different importance. In this document, for example, search, lucene, and full-text are relatively important, while this, a, and what may be less important. So if two documents both contain search, lucene, and full-text, the two documents are highly relevant to each other, whereas if one document contains this, a, and what and another does not, their relevance is hardly affected.
Therefore, to judge the relationship between documents, first find out which terms matter most to that relationship, such as search, lucene, and full-text; then judge the relationships between these terms.
The process of finding out how important a term is to a document is called the process of computing the term's weight (Term Weight).
Computing a term weight involves two parameters: the term (Term) and the document (Document).
A term's weight (Term Weight) represents the importance of that term in that document: the more important the term, the greater its weight, and the larger its role in computing the relevance between documents.
Judging the relationships between terms (Term) to obtain the relevance of documents uses an algorithm called the Vector Space Model (VSM).
Let us analyze these two processes carefully:
1. The process of computing weights (Term Weight).
There are two main factors that affect the importance of a term in a document:
Term Frequency (TF): the number of times the term appears in this document. The larger the TF, the more important the term.
Document Frequency (DF): the number of documents that contain the term. The larger the DF, the less important the term.
Is this easy to understand? The more often a term appears in a document, the more important it is to that document. For example, if the word "search" appears many times in this document, the document is probably focused on search. But in an English document, "this" also appears many times; does that make it more important? No, and this is adjusted by the second factor: the more documents contain a term, the more common the term is, the less able it is to distinguish one document from another, and so the less important it is.
This is also like the technologies we programmers learn. For a programmer personally, the deeper the grasp of a technology the better (the deeper the understanding, the more time spent on it, the larger the TF), so the more competitive the programmer is when job hunting. For the technology itself, however, the fewer people who know it the better (the smaller the DF), and again the more competitive those who do know it become. As the saying goes, a person's value lies in being irreplaceable.
The idea is clear; now let's look at the formula:
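A simple, typical form of the term weight formula (the classic TF-IDF weight, where n is the total number of documents in the collection; some variants add 1 inside the logarithm or to the DF to avoid division by zero) is:

    w(t, d) = tf(t, d) × log( n / df(t) )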
This is only a simple, typical implementation of the term weight formula. Builders of full-text retrieval systems each have their own implementations, and Lucene's differs slightly from this.
2. The process of judging the relationships between terms to obtain the relevance of documents, namely the Vector Space Model algorithm (VSM).
We regard a document as a series of terms (Term), each with a weight (Term Weight); different terms, according to their weights in the document, influence the document's relevance score.
So we regard all the term weights of a document as a vector:
Document = {term1, term2, ..., term N}
Document Vector = {weight1, weight2, ..., weight N}
Likewise, we regard a query statement as a simple document, also represented as a vector:
Query = {term1, term2, ..., term N}
Query Vector = {weight1, weight2, ..., weight N}
We place all the retrieved document vectors and the query vector into an N-dimensional space; each term is one dimension.
The smaller the angle between two vectors, the greater the correlation.
Therefore, we take the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.
Someone may ask: a query statement is generally very short and contains very few terms, so the query vector should have very few dimensions, while a document is long and contains many terms, so the document vector has many dimensions. How can both have N dimensions in your figure?
Here, since we want to put them into the same vector space, the dimensionality is naturally the same. When it is not, we take the union of the terms of the two; if a vector does not contain one of those terms, its weight (Term Weight) for that term is 0.
The correlation score formula is as follows:
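This is the standard cosine of the angle between the query vector q and the document vector d, with the sums running over all n terms i of the shared space:

    score(q, d) = cos(θ)
                = ( Σ w(q,i) · w(d,i) ) / ( sqrt( Σ w(q,i)² ) × sqrt( Σ w(d,i)² ) )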
For example, suppose the query statement contains 11 terms and a total of three documents are retrieved, with their respective weights as follows.
Therefore, the correlation between the three documents and the query statement is calculated as follows:
Thus document 2 has the highest relevance and is returned first, followed by document 1 and finally document 3. At this point, we have found the documents we want most.
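A sketch of the whole scoring step in Python, with small hypothetical weight vectors standing in for the example's weight table:

    import math

    def cosine(q, d):
        # Relevance score: cosine of the angle between two weight vectors.
        dot = sum(x * y for x, y in zip(q, d))
        norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
        return dot / norm if norm else 0.0

    # Hypothetical term-weight vectors in the same 5-term space:
    query = [0.0, 1.2, 0.0, 0.8, 0.0]
    doc1  = [0.5, 0.9, 0.0, 0.3, 0.1]
    doc2  = [0.0, 1.1, 0.2, 0.9, 0.0]
    doc3  = [0.7, 0.0, 0.9, 0.0, 0.4]

    for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
        print(name, round(cosine(query, d), 3))
    # doc2 scores highest here, so it would be returned first.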
To summarize the index creation and search processes above:
1. The indexing process:
1) There are a series of files to be indexed.
2) The files to be indexed go through word segmentation and language processing to form a series of terms (Term).
3) Index creation builds a dictionary and an inverted index table from the terms.
4) The index is written to the hard disk through index storage.
2. The search process:
A) The user enters a query statement.
B) The query statement goes through lexical analysis and language processing to yield a series of terms (Term).
C) Syntax analysis turns the query statement into a query tree.
D) The index is read into memory through index storage.
E) The query tree is applied to the index: the posting list of each term (Term) is fetched, and the lists are intersected and differenced to obtain the result documents.
F) The result documents are sorted by their relevance to the query.
G) The query results are returned to the user.
(Sina Weibo: @ quanliang _ machine learning)