Full-text search, unlike most other problems in machine learning, is one that web programmers run into regularly in their day-to-day work. A client asks you to put a search box somewhere, and you write a SQL statement along the lines of WHERE title LIKE '%:query%' to implement it. That works fine at first, until one day the client comes to you and says, "the search is broken!"
Of course, the search isn't actually "broken"; it just doesn't return what the client wants. Ordinary users don't know how to craft an exact match, so the results they get are poor. To fix this, you decide to use full-text search. After some tedious reading, you enable MySQL's FULLTEXT index and switch to the more advanced query syntax, "MATCH() ... AGAINST()".
All right, problem solved, everyone's happy! At least while the database is small.
But as your data grows, the database slows down noticeably. MySQL just isn't a great full-text search tool. So you decide to use Elasticsearch, refactor your code, and deploy a Lucene-powered full-text search cluster. You find it works very well: fast and accurate.
Then you wonder: Why is Lucene so cool?
This article (mainly about TF-IDF, Okapi BM25, and relevance scoring in general) and the next one (mainly about indexing) cover the basic concepts behind full-text search.
Relevance
For every search query, it is easy enough to define a "relevance score" for each document. When a user searches, we can then sort by relevance score instead of by document date. That way the most relevant document comes first, no matter how old it is (although, of course, sometimes its creation date matters too).
There are many ways to compute relevance, but we will start with the simplest, purely statistical approach. It needs no understanding of the language itself; it determines a "relevance score" from word counts, matching, and how common or rare specific words are across the documents.
This algorithm doesn't care whether a word is a noun or a verb, nor does it care about a word's meaning. All it cares about is which words are common and which are rare. If a search query contains both common and rare words, documents containing the rare words should score higher, while the common words carry less weight.
This algorithm is called Okapi BM25. It builds on two basic concepts: term frequency, abbreviated "TF", and inverse document frequency, abbreviated "IDF". Put together as "TF-IDF", they form a statistical measure of how important a word (term) is within a document.
TF-IDF
Term frequency ("TF") is a very simple metric: the number of times a particular word appears in a document. You can also divide that count by the total number of words in the document to get a ratio. For example, if a document has 100 words and the word 'the' appears 8 times, the TF of 'the' is 8, or 8/100, or 8% (depending on how you want to express it).
Inverse document frequency ("IDF") is a bit more involved: the rarer a word, the higher its value. It is computed by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient. The rarer the word, the higher the IDF it produces.
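For example (numbers invented purely for illustration), in a corpus of 1,000 documents:

// Plain IDF as described above: log of (total documents / documents containing the term).
Math.log10(1000 / 20);  // ≈ 1.70 — a term in only 20 documents gets a high IDF
Math.log10(1000 / 900); // ≈ 0.05 — a near-ubiquitous term gets a very low IDF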
Multiply the two numbers together (TF * IDF) and you get the weight of a word in the document, where "weight" means: how rare is this word, and how often does it appear in this document?
You can apply this idea to searching documents: for each keyword in the query, compute its TF-IDF score against a document and add the scores together. The document with the highest total is the one that best matches the query.
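As a minimal sketch of that idea (this is not the code from this article; the function name, the pre-tokenized inputs, and the natural-log IDF are all invented for illustration):

// Naive TF-IDF scoring of one document against a query, assuming everything
// is already tokenized into arrays of lowercase words.
function tfIdfScore(queryTerms, docTokens, allDocs) {
    var score = 0;
    for (var i = 0; i < queryTerms.length; i++) {
        var term = queryTerms[i];
        // TF: occurrences of the term in this document, divided by document length.
        var count = docTokens.filter(function (t) { return t === term; }).length;
        // Number of documents in the corpus containing the term at least once.
        var docsWithTerm = allDocs.filter(function (d) { return d.indexOf(term) !== -1; }).length;
        if (count === 0 || docsWithTerm === 0) { continue; } // contributes nothing
        var tf = count / docTokens.length;
        // IDF: log of (total documents / documents containing the term).
        var idf = Math.log(allDocs.length / docsWithTerm);
        score += tf * idf;
    }
    return score;
}

Score every document this way, sort, and the highest total wins.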
It's cool!
Okapi BM25
The algorithm above is serviceable, but far from perfect. It gives us a statistical relevance score, and we can improve it further.
Okapi BM25 is considered one of the most advanced ranking algorithms to date (or so says Elasticsearch). On top of TF-IDF, it adds two tunable parameters, k1 and b, which control "term frequency saturation" and "field-length normalization" respectively. What on earth does that mean?
To get an intuition for term frequency saturation, imagine two articles about baseball that are roughly the same length. Assume also that the rest of the corpus has little to do with baseball, so the word "baseball" has a high IDF: it is rare and important. Both articles devote plenty of space to discussing baseball, but one uses the word "baseball" more often than the other. Should that really make one article score much higher than the other? Since both documents are devoted to baseball at large, it hardly matters whether "baseball" appears 40 times or 80 times. In fact, 30 occurrences should be enough to cap it!
That's "the frequency saturation of words." The native TF-IDF algorithm has no saturation concept, so a 80-time "baseball" document is one-fold higher than the 40-time score. There are times when we want to, but sometimes we don't want to.
Okapi BM25 therefore has a k1 parameter that controls how quickly term frequency saturates. Its value usually ranges from 1.2 to 2.0; the lower it is, the faster saturation sets in (meaning the two documents above end up with nearly the same score, because they both contain the word "baseball" many times).
Field-length normalization normalizes a document's length against the average length of all documents. It is useful for single-field collections (like ours), putting documents of different lengths on a common footing. It makes even more sense for two-field collections, such as "title" and "body", where it brings the title and body fields onto the same footing as well. Field-length normalization is controlled by b, whose value lies between 0 and 1: 1 means full normalization, 0 means none.
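Putting k1 and b together, the per-term score BM25 produces is the same expression that appears in the code comments further down. A small sketch, with illustrative numbers rather than anything taken from the article's corpus:

// Per-term BM25 contribution:
// idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLength / avgDocLength))
function bm25TermScore(idf, tf, docLength, avgDocLength, k1, b) {
    var lengthNorm = 1 - b + b * (docLength / avgDocLength);
    return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Saturation in action: with k1 = 1.3, b = 0.75 and two equal-length documents,
// doubling the raw count from 40 to 80 barely moves the score.
bm25TermScore(2.0, 40, 1000, 1000, 1.3, 0.75); // ≈ 4.46
bm25TermScore(2.0, 80, 1000, 1000, 1.3, 0.75); // ≈ 4.53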
Algorithm
You can find the formula for the algorithm on the Okapi BM25 Wikipedia page. Now that you know what each term in it means, it should be fairly easy to understand, so we'll skip the formula and go straight to the code:
BM25.Tokenize = function (text) {
    text = text
        .toLowerCase()
        .replace(/\W/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ')
        .map(function (a) { return stemmer(a); });

    // Filter out stopStems
    var out = [];
    for (var i = 0, len = text.length; i < len; i++) {
        if (stopStems.indexOf(text[i]) === -1) {
            out.push(text[i]);
        }
    }

    return out;
};
We define a simple static method, Tokenize(), that parses a string into an array of tokens. Along the way we lowercase every token (to reduce entropy), run the Porter stemmer algorithm to reduce entropy further and improve matching ("walking" and "walk" become the same token), and filter out stop words (very common words) to cut the entropy one step more. I have written about these concepts in depth before, so forgive me for glossing over them here.
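For example (the exact output is hypothetical, since it depends on the particular stemmer and stop-word list used):

// Hypothetical result: stop words dropped, remaining words stemmed.
BM25.Tokenize('The cats were walking and walked');
// -> ['cat', 'walk', 'walk']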
BM25.prototype.addDocument = function (doc) {
    if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); }
    if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); }

    // Raw tokenized list of words
    var tokens = BM25.Tokenize(doc.body);
    // Will hold unique terms and their counts and frequencies
    var _terms = {};
    // docObj will eventually be added to the documents database
    var docObj = { id: doc.id, tokens: tokens, body: doc.body };
    // Count number of terms
    docObj.termCount = tokens.length;
    // Increment totalDocuments
    this.totalDocuments++;
    // Readjust averageDocumentLength
    this.totalDocumentTermLength += docObj.termCount;
    this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

    // Calculate term counts first
    for (var i = 0, len = tokens.length; i < len; i++) {
        var term = tokens[i];
        if (!_terms[term]) {
            _terms[term] = { count: 0, freq: 0 };
        }
        _terms[term].count++;
    }

    // Then re-loop to calculate term frequency.
    // We'll also update inverse document frequencies here.
    var keys = Object.keys(_terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        // Term frequency for this document.
        _terms[term].freq = _terms[term].count / docObj.termCount;
        // Inverse document frequency initialization.
        if (!this.terms[term]) {
            this.terms[term] = { n: 0, idf: 0 }; // n: number of docs this term appears in, uniquely
        }
        this.terms[term].n++;
    }

    // Calculate inverse document frequencies. This is slowish, so if you want
    // to index a big batch of documents, comment this out and run it once at
    // the end of your addDocument run. If you're only indexing a document or
    // two at a time, you can leave it in.
    this.updateIdf();

    // Add docObj to docs db
    docObj.terms = _terms;
    this.documents[docObj.id] = docObj;
};
This is where addDocument() starts to work its magic. Essentially we build and maintain two similar data structures: this.documents and this.terms.
this.documents is a database of all documents. It holds each document's original text, its length, and a list of its terms along with each term's count and frequency. Using this data structure, we can easily and quickly (yes, very fast, O(1) hash-lookup time) answer questions like: how many times does the word 'walk' appear in document #3?
We also maintain another data structure, this.terms, which covers every word in the corpus. With it we can answer, in O(1) time: how many documents does the word 'walk' appear in, and what are their IDs?
Finally, we record the length of each document and the average document length across the whole corpus.
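To make those shapes concrete, here is roughly what the two structures look like after indexing; the field names follow the code above, but the example values are invented:

// this.documents: one entry per document id.
// {
//   '3': {
//     id: '3',
//     body: '...original text...',
//     tokens: ['walk', ...],
//     termCount: 120,
//     terms: { walk: { count: 4, freq: 4 / 120 }, ... }
//   }
// }
//
// this.terms: one entry per unique term in the corpus.
// {
//   walk: { n: 57, idf: 1.84 }   // the term appears in 57 documents
// }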
Note that in the code above, idf is initialized to 0, and that updateIdf() is best commented out while bulk indexing. The method runs quite slowly, and it only needs to run once after the index has been built; since running it once is enough, there is no need to run it 5,000 times. Comment it out during a high-volume indexing run and call it once at the end, and you'll save a lot of time. Here is the code for that function:
BM25.prototype.updateIdf = function () {
    var keys = Object.keys(this.terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        var num = (this.totalDocuments - this.terms[term].n + 0.5);
        var denom = (this.terms[term].n + 0.5);
        // Clamp the IDF so it is always slightly positive.
        this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
    }
};
The function itself is very simple, but because it has to walk over every term in the corpus and update each one's value, it works a bit slowly. It implements the standard formula for inverse document frequency (which you can find on Wikipedia): the total number of documents divided by the number of documents containing the term, then the logarithm of that quotient. I have modified it slightly so that the return value is always greater than 0.
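As a quick sanity check with invented numbers, for a corpus of 1,000 documents:

// Rare term, in 10 of 1,000 documents: high IDF.
Math.max(Math.log10((1000 - 10 + 0.5) / (10 + 0.5)), 0.01);   // ≈ 1.97
// Very common term, in 800 of 1,000 documents: raw value is negative, clamped to 0.01.
Math.max(Math.log10((1000 - 800 + 0.5) / (800 + 0.5)), 0.01); // 0.01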
BM25.prototype.search = function (query) {
    var queryTerms = BM25.Tokenize(query);
    var results = [];

    // Look at each document in turn. There are better ways to do this with inverted indices.
    var keys = Object.keys(this.documents);
    for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
        var id = keys[j];
        // The relevance score for a document is the sum of a tf-idf-like
        // calculation for each query term.
        this.documents[id]._score = 0;

        // Calculate the score for each query term.
        for (var i = 0, len = queryTerms.length; i < len; i++) {
            var queryTerm = queryTerms[i];
            // We've never seen this term before, so IDF will be 0.
            // That means we can skip the whole term: it adds nothing to the
            // score and isn't in any document.
            if (typeof this.terms[queryTerm] === 'undefined') {
                continue;
            }
            // This term isn't in the document, so the TF portion is 0 and
            // this term contributes nothing to the search score.
            if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
                continue;
            }
            // The term is in the document; the whole term is:
            // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))
            // IDF is pre-calculated for the whole docset.
            var idf = this.terms[queryTerm].idf;
            // Numerator of the TF portion.
            var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
            // Denominator of the TF portion.
            var denom = this.documents[id].terms[queryTerm].count
                + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));
            // Add this query term to the score.
            this.documents[id]._score += idf * num / denom;
        }

        if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
            results.push(this.documents[id]);
        }
    }

    results.sort(function (a, b) { return b._score - a._score; });
    return results.slice(0, 10);
};
Finally, the search() method iterates over every document, computes a BM25 score for each, and sorts them in descending order. Of course, scanning every document in the corpus on every search is unwise; that problem is addressed in Part Two (inverted indexes and performance).
The code above is well commented, but the gist is this: for each document and each query term, compute the BM25 score. The terms' IDF scores have been calculated in advance and only need to be looked up; the term frequencies were already calculated and stored as part of each document; after that, it's just a bit of simple arithmetic. Finally, a temporary _score variable is attached to each document, then the results are sorted by score and the top 10 are returned.
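Pulling it all together: the constructor isn't shown in this excerpt, so the initialization below is only a sketch of the fields the methods above rely on, and k1 = 1.3 and b = 0.75 are plausible defaults rather than values taken from the article.

// Hypothetical constructor and usage; field names match the methods above.
function BM25() {
    this.documents = {};
    this.terms = {};
    this.totalDocuments = 0;
    this.totalDocumentTermLength = 0;
    this.averageDocumentLength = 0;
    this.k1 = 1.3;  // term frequency saturation (assumed default)
    this.b = 0.75;  // field-length normalization (assumed default)
}

var engine = new BM25();
engine.addDocument({ id: 1, body: 'Baseball is played with a bat and a ball.' });
engine.addDocument({ id: 2, body: 'Basketball needs a ball and a hoop.' });
engine.updateIdf(); // only needed if the call inside addDocument() is commented out
var hits = engine.search('baseball bat');
// hits: up to 10 matching documents, highest _score first.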
Demo, source code, caveats, and what's next
There are many ways to optimize this example, and we'll cover them in Part Two of "Full-Text Search", so stay tuned. I hope to have it finished within a few weeks. Here is what the next part will cover:
- Inverted indexes and fast search
- Fast indexing
- Better search results
For the demo, I wrote a small Wikipedia crawler and grabbed the first paragraph of a fairly large number (85,000) of Wikipedia articles. Since indexing all 85K paragraphs takes about 90 seconds on my machine, I've cut that set in half; I don't want you to burn your laptop's battery just for a simple full-text search demo.
Because indexing is a heavy, blocking CPU operation, I run it in a web worker: the index is built on a background thread, which is also where you can find the complete source code. In the source you'll also find references to the stemming algorithm and to my stop-word list. As for the license: as always, it is free for educational purposes, but not for any commercial use.
Finally, the demo. Once the index is built, try searching for random things and phrases that Wikipedia might know about. Note that only 40,000 paragraphs are in the index, so you may need to try a few different topics.