This article introduces how to implement relevance scoring for full-text search in JavaScript, using an algorithm called Okapi BM25, which is also described here. Full-text search, unlike most other problems in the machine learning field, is one that web programmers run into regularly in their daily work. A client asks you to put a search box somewhere, so you write an SQL statement along the lines of WHERE title LIKE %:query% to implement it. At first this is fine. Then one day the client comes to you and says, "the search is broken!"
Of course, the search isn't actually "broken"; the results just aren't what the client wants. Ordinary users don't know how to craft exact matches, so the quality of their search results is poor. To fix this, you decide to use full-text search. After a stretch of tedious reading, you enable MySQL's FULLTEXT index and switch to the more advanced query syntax, "MATCH() ... AGAINST()".
Great, problem solved, time to celebrate! And while the database is small, it works fine.
But as your data grows, you notice the database getting slower and slower. MySQL is simply not a very good full-text search tool. So you decide to use ElasticSearch, refactor your code, and deploy a Lucene-backed full-text search cluster. It works very well: fast and accurate.
At which point you may wonder: why is Lucene so good at this?
This article (mainly about TF-IDF, Okapi BM25 and relevance scoring in general) and the next one (mainly about indexing) will walk you through the basic concepts behind full-text search.
Relevance
For each search query, we can easily define a "relevance score" for every document. When searching, we can then sort by this score instead of by document creation time. That way the most relevant document comes first, no matter how long ago it was created (although sometimes document age matters, too).
There are many ways to compute the relevance between words, but we will start with the simplest, purely statistical approach. This method requires no understanding of the language itself; it determines a "relevance score" by looking at word usage, matching, and weights based on how common specific words are across the documents.
The algorithm doesn't care whether a word is a noun or a verb, or what it means. All it cares about is which words are common and which are rare. If a search query contains both common and rare words, documents containing the rare words should score higher, and the common words should carry less weight.
This algorithm is called Okapi BM25. It is built on two basic concepts: term frequency ("TF") and inverse document frequency ("IDF"). Put together as "TF-IDF", they form a statistical measure of how important a word (a term) is within a document.
TF-IDF
Term frequency, "TF" for short, is a simple metric: the number of times a particular word appears in a document. You can divide that count by the total number of words in the document to get a proportion. For example, if a document has 100 words and the word "the" appears 8 times, the TF of "the" is 8, or 8/100, i.e. 8% (depending on how you want to express it).
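As a rough sketch (the function name here is purely illustrative), term frequency can be computed like this:

```js
// Minimal sketch of term frequency: occurrences of a term divided by the
// total number of terms in the document (name and signature are illustrative).
function termFrequency(term, documentTokens) {
    var count = 0;
    for (var i = 0; i < documentTokens.length; i++) {
        if (documentTokens[i] === term) { count++; }
    }
    return count / documentTokens.length;
}
```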
Inverse document frequency, "IDF" for short, is a bit more involved: the rarer a word, the higher its value. It is computed by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient. The rarer the word, the higher its "IDF".
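Again as an illustrative sketch of the classic formula just described:

```js
// Minimal sketch of the classic IDF: log of (total documents / documents
// containing the term). Rarer terms get larger values.
function inverseDocumentFrequency(totalDocuments, documentsContainingTerm) {
    return Math.log(totalDocuments / documentsContainingTerm);
}
```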
If you multiply the two numbers together (TF * IDF), you get the weight of a word within a document. "Weight" here means: how rare is this word, and how often does it appear in this document?
You can use this idea to run search queries against documents. For each keyword in the query, compute its TF-IDF score and add the scores up. The document with the highest total is the one that best matches the query.
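Tying the two sketches above together, a naive TF-IDF scorer could look something like the following (the names and the `documents` shape are assumptions made for illustration, not the code used later in this article):

```js
// Sketch of scoring documents against a query by summing TF-IDF per query term.
// `documents` is assumed to be an array of { id, tokens } objects;
// termFrequency() and inverseDocumentFrequency() are the sketches above.
function scoreDocuments(queryTerms, documents) {
    return documents.map(function(doc) {
        var score = 0;
        queryTerms.forEach(function(term) {
            // How many documents in the corpus contain this term?
            var docsWithTerm = documents.filter(function(d) {
                return d.tokens.indexOf(term) !== -1;
            }).length;
            if (docsWithTerm === 0) { return; } // term unseen, contributes nothing
            score += termFrequency(term, doc.tokens) *
                     inverseDocumentFrequency(documents.length, docsWithTerm);
        });
        return { id: doc.id, score: score };
    }).sort(function(a, b) { return b.score - a.score; });
}
```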
Cool!
Okapi BM25
The algorithm above is usable, but far from perfect. It gives us a statistically grounded relevance score, and we can improve on it.
Okapi BM25 is one of the most advanced ranking algorithms to date (which is why ElasticSearch uses it). On top of TF-IDF, Okapi BM25 adds two tunable parameters, k1 and b, which control "term frequency saturation" and "field-length normalization" respectively. What does that mean?
To get an intuition for "term frequency saturation", imagine two articles of roughly equal length that both discuss baseball. Also assume that the rest of the corpus has very little baseball-related content, so the term "baseball" has a high IDF: it is rare and important. Both articles are about baseball and spend plenty of time on it, but one of them uses the word "baseball" far more often than the other. Is that article really that much more relevant? Since both documents are devoted to baseball, it makes little difference whether "baseball" appears 40 times or 80 times. In fact, 30 occurrences should already be enough to cap the effect!
This is term frequency saturation. The naive TF-IDF algorithm has no concept of saturation, so a document with 80 occurrences of "baseball" scores twice as high as one with 40. Sometimes that is what we want; sometimes it isn't.
For this, Okapi BM25 has the k1 parameter, which controls how quickly the term frequency saturates. Its value is usually between 1.2 and 2.0; the lower the value, the faster the saturation. (Meaning the two documents above end up with nearly the same score, because both contain a large number of occurrences of "baseball".)
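To see the saturation numerically, here is a small sketch of just BM25's term-frequency component, ignoring length normalization for the moment (the k1 value of 1.3 is only an example):

```js
// BM25's saturated TF component: tf * (k1 + 1) / (tf + k1).
// As the raw count grows, the value approaches k1 + 1 instead of growing linearly.
function saturatedTf(tf, k1) {
    return (tf * (k1 + 1)) / (tf + k1);
}

console.log(saturatedTf(40, 1.3).toFixed(3)); // "2.228"
console.log(saturatedTf(80, 1.3).toFixed(3)); // "2.263" -- barely higher, despite twice the occurrences
```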
Field-length normalization scales a document's length relative to the average length of all documents. It is useful for single-field collections (like ours), where it puts documents of different lengths on an equal footing for comparison. It makes even more sense for collections with two fields (such as "title" and "body"), since it also puts the title and body fields on an equal footing. Field-length normalization is controlled by b, whose value lies between 0 and 1: 1 means full normalization, 0 means none.
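Putting the two parameters together, the per-term weight is the expression that the search() method computes later in this article; as a standalone sketch (function name illustrative, formula taken from the comment in that code):

```js
// Full BM25 weight for one term in one document, matching the expression
// used in search() below. `tf` is the raw count of the term in the document.
function bm25TermWeight(idf, tf, docLength, avgDocLength, k1, b) {
    var numerator = tf * (k1 + 1);
    var denominator = tf + k1 * (1 - b + b * (docLength / avgDocLength));
    return idf * (numerator / denominator);
}
```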
Algorithm
You can find the formula for the Okapi BM25 algorithm on its Wikipedia page. Now that we know what each term in the formula means, it should be fairly easy to understand, so we'll skip the formula itself and go straight to the code.
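One note first: this excerpt doesn't show the BM25 constructor, but the methods below rely on a few bookkeeping fields. A minimal constructor consistent with them might look like this (the default values for k1 and b are assumptions, chosen within the ranges discussed above):

```js
// Assumed constructor (not shown in this excerpt): initializes the fields
// used by Tokenize(), addDocument(), updateIdf() and search() below.
var BM25 = function() {
    this.documents = {};               // id -> { id, body, tokens, termCount, terms }
    this.terms = {};                   // term -> { n, idf }
    this.totalDocuments = 0;
    this.totalDocumentTermLength = 0;
    this.averageDocumentLength = 0;
    this.k1 = 1.3;                     // assumed default, within the usual 1.2-2.0 range
    this.b = 0.75;                     // assumed default
};
```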
```js
BM25.Tokenize = function(text) {
    text = text
        .toLowerCase()
        .replace(/\W/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ')
        .map(function(a) { return stemmer(a); });

    // Filter out stopStems
    var out = [];
    for (var i = 0, len = text.length; i < len; i++) {
        if (stopStems.indexOf(text[i]) === -1) {
            out.push(text[i]);
        }
    }

    return out;
};
```
We define a simple static Tokenize() method that parses a string into an array of tokens. Along the way we lowercase all tokens (to reduce entropy). We run the Porter stemmer algorithm to reduce entropy further and to improve matching ("walking" and "walk" become the same token). We also filter out stop words (very common words) to reduce entropy one step more. Please bear with me if I lean on these concepts here before explaining them in depth.
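For example, assuming the Porter stemmer and the stop-word list are loaded, a call might behave roughly like this (the exact output depends on the stemmer and the stop-word list used):

```js
// Illustrative only -- the exact tokens depend on the stemmer and stop-word list.
BM25.Tokenize('The quick foxes were running');
// -> something like ['quick', 'fox', 'run']
```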
```js
BM25.prototype.addDocument = function(doc) {
    if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); };
    if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); };

    // Raw tokenized list of words
    var tokens = BM25.Tokenize(doc.body);

    // Will hold unique terms and their counts and frequencies
    var _terms = {};

    // docObj will eventually be added to the documents database
    var docObj = {id: doc.id, tokens: tokens, body: doc.body};

    // Count number of terms
    docObj.termCount = tokens.length;

    // Increment totalDocuments
    this.totalDocuments++;

    // Readjust averageDocumentLength
    this.totalDocumentTermLength += docObj.termCount;
    this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

    // Calculate term frequency
    // First get terms count
    for (var i = 0, len = tokens.length; i < len; i++) {
        var term = tokens[i];
        if (!_terms[term]) {
            _terms[term] = {
                count: 0,
                freq: 0
            };
        };
        _terms[term].count++;
    }

    // Then re-loop to calculate term frequency.
    // We'll also update inverse document frequencies here.
    var keys = Object.keys(_terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        // Term Frequency for this document.
        _terms[term].freq = _terms[term].count / docObj.termCount;

        // Inverse Document Frequency initialization
        if (!this.terms[term]) {
            this.terms[term] = {
                n: 0, // Number of docs this term appears in, uniquely
                idf: 0
            };
        }

        this.terms[term].n++;
    };

    // Calculate inverse document frequencies
    // This is SLOWish so if you want to index a big batch of documents,
    // comment this out and run it once at the end of your addDocuments run
    // If you're only indexing a document or two at a time you can leave this in.
    // this.updateIdf();

    // Add docObj to docs db
    docObj.terms = _terms;
    this.documents[docObj.id] = docObj;
};
```
This is where the addDocument() method works its magic. We essentially build and maintain two similar data structures: this.documents and this.terms.
this.documents is a database holding all documents. It stores each document's original text, its length, and a term list with the count and frequency of every word in the document. With this data structure we can answer the following kind of question easily and quickly (yes, very quickly: a hash-table lookup with O(1) time complexity): how many times does the word 'walk' appear in document #3?
We also use a second data structure, this.terms, which covers every word in the corpus. Through it we can answer, again in O(1): how many documents does the word 'walk' appear in, and what are their ids?
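To make the two structures concrete, their shape is roughly the following (the values here are made up purely for illustration):

```js
// Illustrative shape of the two structures (values are made up):
// this.documents = {
//   3: { id: 3, body: '...', tokens: [/* ... */], termCount: 120,
//        terms: { walk: { count: 2, freq: 2 / 120 } /* , ... */ } }
// };
// this.terms = {
//   walk: { n: 14, idf: 2.1 }   // 'walk' appears in 14 documents
// };
```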
Finally, we record the length of each document and keep track of the average document length across the whole corpus.
Note that in the code above, idf is initialized to 0 and the updateIdf() call is commented out. That is because the method runs quite slowly and only needs to run once, after indexing is done. Since one run is enough, there is no need to run it 5,000 times. Leave it commented out during a large batch of indexing operations and run it once at the end; that saves a lot of time. Here is the function:
```js
BM25.prototype.updateIdf = function() {
    var keys = Object.keys(this.terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        var num = (this.totalDocuments - this.terms[term].n + 0.5);
        var denom = (this.terms[term].n + 0.5);
        this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
    }
};
```
It is a very simple function, but since it has to walk every term in the whole corpus and update each one's value, it is a little slow. It implements the standard inverse document frequency formula (the one you can find on Wikipedia): divide the total number of documents by the number of documents containing the term, then take the logarithm of the quotient. I tweaked it slightly so the return value is always greater than 0.
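In practice, the batching pattern described above might look like this (engine and corpus are hypothetical names for a BM25 instance and an array of documents):

```js
// Hypothetical batch-indexing usage: add everything first, then run the
// slow IDF pass exactly once at the end.
var engine = new BM25();
corpus.forEach(function(doc) {
    engine.addDocument(doc);   // with the updateIdf() call inside left commented out
});
engine.updateIdf();
```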
```js
BM25.prototype.search = function(query) {

    var queryTerms = BM25.Tokenize(query);
    var results = [];

    // Look at each document in turn. There are better ways to do this with inverted indices.
    var keys = Object.keys(this.documents);
    for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
        var id = keys[j];
        // The relevance score for a document is the sum of a tf-idf-like
        // calculation for each query term.
        this.documents[id]._score = 0;

        // Calculate the score for each query term
        for (var i = 0, len = queryTerms.length; i < len; i++) {
            var queryTerm = queryTerms[i];

            // We've never seen this term before so IDF will be 0.
            // Means we can skip the whole term, it adds nothing to the score
            // and isn't in any document.
            if (typeof this.terms[queryTerm] === 'undefined') {
                continue;
            }

            // This term isn't in the document, so the TF portion is 0 and this
            // term contributes nothing to the search score.
            if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
                continue;
            }

            // The term is in the document, let's go.
            // The whole term is:
            // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))

            // IDF is pre-calculated for the whole docset.
            var idf = this.terms[queryTerm].idf;
            // Numerator of the TF portion.
            var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
            // Denominator of the TF portion.
            var denom = this.documents[id].terms[queryTerm].count
                + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));

            // Add this query term to the score
            this.documents[id]._score += idf * num / denom;
        }

        if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
            results.push(this.documents[id]);
        }
    }

    results.sort(function(a, b) { return b._score - a._score; });
    return results.slice(0, 10);
};
```
Finally, the search() method iterates over all documents, computes a BM25 score for each, and sorts them in descending order. Of course, iterating over every document in the corpus during a search is unwise; that problem is addressed in Part Two (inverted indexes and performance).
The code above is well commented, but the key points are as follows: compute the BM25 score for each document and each query term; the idf of each term is precomputed, so it only needs to be looked up; the term frequencies are also precomputed as part of each document, so only simple arithmetic remains. We attach a temporary variable _score to each document, sort by score in descending order, and return the top 10 results.
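Usage is then a single call (again with hypothetical names, assuming documents have been added and updateIdf() has run):

```js
// Hypothetical usage once documents are indexed and updateIdf() has been called.
var results = engine.search('baseball world series');
results.forEach(function(doc) {
    console.log(doc.id, doc._score.toFixed(3));
});
```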
Demo, source code, caveats, and what comes next
There are many ways to improve the example above; we'll cover them in the second part of this "full-text search" series, so stay tuned. I hope to finish it within a few weeks. Here is what the next article will cover:
- Inverted indexes and fast search
- Fast indexing
- Better search results
For this demo I wrote a small Wikipedia crawler and grabbed the first paragraph of a fairly large number of Wikipedia articles (85,000). Since indexing all 85K documents takes about 90 seconds on my computer, I cut that set in half. I don't want you to burn your laptop's battery just for a simple full-text search demo.
Because indexing is a heavy, synchronous CPU operation, I implemented it as a web worker. The indexing runs on a background thread; you can find the complete source code here. You will also find references to the source of the stemming algorithm and my stop-word list in the code. As for licensing, the code remains free for educational purposes, but not for any commercial use.
Finally, the demo itself. Once indexing is complete, try searching for random words and phrases that Wikipedia would know about. Note that only 40,000 paragraphs are indexed, so you may have to try a few different topics.