Relevance Scoring for JavaScript Full-Text Search

Full-text search, unlike most other topics in machine learning, is a problem web programmers run into regularly in their everyday work. A customer asks you to put a search box somewhere, and you write a SQL statement along the lines of WHERE title LIKE '%:query%' to implement it. It works fine at first, until one day the customer comes to you and says, "The search is broken!"

Of course, the search isn't actually "broken"; it just doesn't return what the customer wants. Ordinary users don't know how to craft exact matches, so the results they get are of poor quality. To fix this, you decide to use full-text search. After some tedious reading you enable a MySQL FULLTEXT index and switch to the more advanced query syntax, such as "MATCH() ... AGAINST()".

Great, problem solved, time to celebrate! And while the database stays small, it really is no problem.

But as your data grows, you notice your database getting slower and slower. MySQL is not a particularly good full-text search tool. So you decide to use ElasticSearch, refactor your code, and deploy a Lucene-powered full-text search cluster. It works very well: fast and accurate.

At which point you wonder: why is Lucene so good at this?

This article (which focuses on TF-IDF, Okapi BM25, and relevance scoring in general) and the next one (which focuses on indexing) explain the basic concepts behind full-text search.

Relevance

For each search query, it is easy to define a "relevance score" for every document. When a user searches, we can sort by relevance score instead of by document creation time. That way, the most relevant document ends up first, no matter how long ago it was created (although, of course, relevance is sometimes related to a document's creation time as well).

There are many ways to compute relevance, but we'll start with the simplest, purely statistical approach. It requires no understanding of the language itself; it determines a "relevance score" by looking at how terms are used, how they match, and how prevalent particular words are across the documents.

The algorithm doesn't care whether a word is a noun or a verb, and it doesn't care what the word means. All it cares about is which words are common and which are rare. If a search query contains both common words and rare words, you'd do well to give documents containing the rare words a higher score and lower the weight of the common ones.

This algorithm is called Okapi BM25. It is built on two basic concepts: term frequency, abbreviated "TF", and inverse document frequency, abbreviated "IDF". Taken together as "TF-IDF", this is a statistical measure of how important a term is within a document.

TF-IDF

Term frequency, abbreviated "TF", is a very simple metric: the number of times a particular word appears in a document. You can divide this value by the total number of words in the document to get a score. For example, if a document has 100 words and the word 'the' appears 8 times, then the TF of 'the' is 8, or 8/100, or 8% (depending on how you want to express it).
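
As a quick illustration (separate from the library code later in this article), here is a minimal sketch of that calculation, assuming the document has already been split into an array of tokens:

    // Minimal TF sketch: occurrences of a term divided by the total number of tokens.
    function termFrequency(term, tokens) {
        var count = 0;
        for (var i = 0; i < tokens.length; i++) {
            if (tokens[i] === term) { count++; }
        }
        return count / tokens.length;
    }

    // A 100-word document in which 'the' appears 8 times yields 8 / 100 = 0.08.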

Inverse document frequency, abbreviated "IDF", is slightly more involved: the rarer a word is, the higher its value. It is obtained by dividing the total number of documents by the number of documents containing the word, and taking the logarithm of the quotient. The rarer the word, the higher the resulting "IDF".
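
Here is an equally minimal sketch of the formula as just described (the BM25 code later in this article uses a slightly different variant of IDF):

    // Plain IDF sketch: log of (total documents / documents containing the term).
    function inverseDocumentFrequency(totalDocuments, documentsContainingTerm) {
        return Math.log10(totalDocuments / documentsContainingTerm);
    }

    // With 1,000 documents: a term found in 10 of them scores log10(100) = 2,
    // while a term found in 500 of them scores only log10(2), roughly 0.3.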

If you multiply these two numbers together (TF * IDF), you get the weight of a word within a document. "Weight" here is defined as: how rare is the word, and how often does it appear in this document?

You can apply this idea to search queries over your documents. For each keyword in the query, compute its TF-IDF score and add the scores together. The document with the highest total is the one that best matches the query.
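
Putting the two sketches above together, scoring one document against a query could look roughly like this; the corpusStats lookup table and the parameter names are illustrative assumptions, not part of the library described below:

    // Illustrative only: sum TF * IDF over the query terms for one document.
    // 'tokens' is the document's token array; 'corpusStats' maps a term to the
    // number of documents containing it; 'totalDocuments' is the corpus size.
    function scoreDocument(queryTerms, tokens, corpusStats, totalDocuments) {
        var score = 0;
        for (var i = 0; i < queryTerms.length; i++) {
            var term = queryTerms[i];
            var docsWithTerm = corpusStats[term] || 0;
            if (docsWithTerm === 0) { continue; } // an unseen term contributes nothing
            score += termFrequency(term, tokens)
                   * inverseDocumentFrequency(totalDocuments, docsWithTerm);
        }
        return score;
    }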

Pretty cool, right?

Okapi BM25

The algorithm above is workable, but far from perfect. It gives us a statistics-based relevance score, and we can improve on it.

Okapi BM25 is one of the most advanced ranking algorithms to date (it's what ElasticSearch uses). On top of TF-IDF, Okapi BM25 adds two tunable parameters, k1 and b, which represent "term frequency saturation" and "field-length normalization" respectively. What on earth are those?

To get an intuition for "term frequency saturation", imagine two articles of roughly the same length that both discuss baseball. Assume also that the rest of the corpus doesn't contain much baseball-related content, so the word "baseball" has a very high IDF: it is rare and important. Both articles discuss baseball, and both spend a lot of time on it, but one of them uses the word "baseball" more often than the other. Does that really make one article much more relevant than the other? Since both documents are devoted to baseball, it shouldn't matter much whether "baseball" appears 40 times or 80 times. In fact, 30 occurrences should be enough to cap it!

This is term frequency saturation. The naive TF-IDF algorithm has no notion of saturation, so the document with 80 occurrences of "baseball" scores twice as high as the one with 40. Sometimes that's what we want; sometimes it isn't.

For this, Okapi BM25 has the k1 parameter, which controls how quickly the saturation kicks in. The value of k1 is typically between 1.2 and 2.0; the lower the value, the faster the saturation (meaning the two documents above end up with almost the same score, because they both contain a lot of "baseball").

Field-length normalization normalizes a document's length against the average length of all documents. It's useful for single-field collections (like ours), to bring documents of different lengths onto the same comparison scale. It makes even more sense for two-field collections such as "title" and "body", where it puts the title and body fields onto the same scale as well. Field-length normalization is controlled by b, whose value is between 0 and 1: 1 means full normalization, 0 means no normalization.
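
To make the effect of k1 and b concrete, here is a hedged sketch of the per-term BM25 contribution (the same formula that appears in the search() code further down), followed by the baseball example in numbers; the IDF value, lengths, and parameter choices are illustrative only:

    // Score contribution of a single term, per the BM25 formula used later:
    // idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLength / avgDocLength))
    function bm25TermScore(idf, tf, docLength, avgDocLength, k1, b) {
        var num = tf * (k1 + 1);
        var denom = tf + k1 * (1 - b + b * (docLength / avgDocLength));
        return idf * num / denom;
    }

    // Saturation: with k1 = 1.3 and b = 0.75, going from 40 to 80 occurrences
    // barely changes the score, whereas naive TF-IDF would double it.
    console.log(bm25TermScore(2.0, 40, 1000, 1000, 1.3, 0.75)); // ≈ 4.46
    console.log(bm25TermScore(2.0, 80, 1000, 1000, 1.3, 0.75)); // ≈ 4.53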

Algorithm

On the Okapi BM25 Wikipedia page you can see the formula for the Okapi algorithm. Now that you know what each piece stands for, it should be easy to understand, so let's skip the formula and go straight to the code:

    BM25.Tokenize = function (text) {
        text = text
            .toLowerCase()
            .replace(/\W/g, ' ')
            .replace(/\s+/g, ' ')
            .trim()
            .split(' ')
            .map(function (a) { return stemmer(a); });

        // Filter out stop stems
        var out = [];
        for (var i = 0, len = text.length; i < len; i++) {
            if (stopStems.indexOf(text[i]) === -1) {
                out.push(text[i]);
            }
        }

        return out;
    };

We define a simple static method, Tokenize(), that parses a string into an array of tokens. Along the way we lowercase all the tokens (to reduce entropy). We run the Porter stemmer algorithm to reduce entropy further and also to improve matching ("walking" and "walk" should match as the same token). We also filter out stop words (very common words) to reduce entropy one step more. I apologize in advance if I'm digging too deeply into concepts before explaining them.
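
For example, assuming the stemmer and the stop-word list referenced above are loaded, the tokenizer would behave roughly like this (the exact output depends on the particular stemmer and stop list):

    // Lowercased, punctuation stripped, stemmed, stop words removed.
    BM25.Tokenize('The cats are walking quickly!');
    // => ['cat', 'walk', 'quickli'] or similar, depending on the stemmer and stop list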

    BM25.prototype.addDocument = function (doc) {
        if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); };
        if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); };

        // Raw tokenized list of words
        var tokens = BM25.Tokenize(doc.body);

        // Will hold unique terms and their counts and frequencies
        var _terms = {};

        // docObj will eventually be added to the documents database
        var docObj = { id: doc.id, tokens: tokens, body: doc.body };

        // Count number of terms
        docObj.termCount = tokens.length;

        // Increment totalDocuments
        this.totalDocuments++;

        // Readjust averageDocumentLength
        this.totalDocumentTermLength += docObj.termCount;
        this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

        // Calculate term frequency
        // First get terms count
        for (var i = 0, len = tokens.length; i < len; i++) {
            var term = tokens[i];
            if (!_terms[term]) {
                _terms[term] = {
                    count: 0,
                    freq: 0
                };
            };
            _terms[term].count++;
        }

        // Then re-loop to calculate term frequency.
        // We'll also update inverse document frequencies here.
        var keys = Object.keys(_terms);
        for (var i = 0, len = keys.length; i < len; i++) {
            var term = keys[i];
            // Term frequency for this document.
            _terms[term].freq = _terms[term].count / docObj.termCount;

            // Inverse document frequency initialization
            if (!this.terms[term]) {
                this.terms[term] = {
                    n: 0, // Number of docs this term appears in, uniquely
                    idf: 0
                };
            }

            this.terms[term].n++;
        };

        // Calculate inverse document frequencies.
        // This is slow-ish, so if you want to index a big batch of documents,
        // comment this out and run it once at the end of your addDocument run.
        // If you're only indexing a document or two at a time you can leave it in.
        // this.updateIdf();

        // Add docObj to docs db
        docObj.terms = _terms;
        this.documents[docObj.id] = docObj;
    };

This is where addDocument() works its magic. We essentially build and maintain two similar data structures: this.documents and this.terms.

this.documents is a database of all the documents: it holds each document's original text, its length, and a list of all the terms in it together with their counts and frequencies. Using this data structure we can easily and quickly (yes, very quickly; a table lookup with O(1) time complexity) answer the question: in document #3, how many times does the word 'walk' appear?

We also use a second data structure, this.terms. It covers every word in the corpus. With it we can answer, in O(1) time: how many documents does the word 'walk' appear in, and what are their IDs?

Finally, we record the length of each document, as well as the average document length across the whole corpus.
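
The constructor itself isn't shown in this article, but based on the properties that addDocument() and search() rely on, it presumably looks something like the following sketch (the property defaults, in particular k1 and b, are assumptions rather than values taken from the original source):

    // Inferred constructor sketch: initializes the fields used by the other methods.
    function BM25() {
        this.documents = {};                // id -> { id, body, tokens, termCount, terms, _score }
        this.terms = {};                    // term -> { n: number of docs containing it, idf }
        this.totalDocuments = 0;
        this.totalDocumentTermLength = 0;
        this.averageDocumentLength = 0;
        this.k1 = 1.3;                      // term frequency saturation (assumed default)
        this.b = 0.75;                      // field-length normalization (assumed default)
    }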

Note that in the code above, idf is initialized to 0 and the updateIdf() call is commented out. That's because this method runs very slowly, and it only needs to run once, after indexing is complete. Running it once is enough; there's no need to run it 5,000 times. Commenting it out during a large batch of index operations and running it once at the end saves a lot of time. Here is the code for that function:

    BM25.prototype.updateIdf = function () {
        var keys = Object.keys(this.terms);
        for (var i = 0, len = keys.length; i < len; i++) {
            var term = keys[i];
            var num = (this.totalDocuments - this.terms[term].n + 0.5);
            var denom = (this.terms[term].n + 0.5);
            this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
        }
    };

This is a very simple function, but because it has to walk over every term in the whole corpus and update each value, it is a bit slow to run. It implements the standard inverse document frequency formula used by BM25 (you can find it on Wikipedia): the number of documents that do not contain the term (plus 0.5) divided by the number that do (plus 0.5), with the logarithm taken of the quotient. I've modified it slightly so the return value is always greater than 0.
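
As a quick sanity check on what updateIdf() produces, here are a couple of illustrative values (the counts are made up):

    // With 1,000 documents indexed:
    var rare   = Math.max(Math.log10((1000 - 10 + 0.5) / (10 + 0.5)), 0.01);   // ≈ 1.97
    var common = Math.max(Math.log10((1000 - 600 + 0.5) / (600 + 0.5)), 0.01); // clamped to 0.01
    // Without the Math.max clamp, a term found in most documents would go negative.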

    BM25.prototype.search = function (query) {

        var queryTerms = BM25.Tokenize(query);
        var results = [];

        // Look at each document in turn. There are better ways to do this with inverted indices.
        var keys = Object.keys(this.documents);
        for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
            var id = keys[j];
            // The relevance score for a document is the sum of a tf-idf-like
            // calculation for each query term.
            this.documents[id]._score = 0;

            // Calculate the score for each query term
            for (var i = 0, len = queryTerms.length; i < len; i++) {
                var queryTerm = queryTerms[i];

                // We've never seen this term before so IDF will be 0.
                // Means we can skip the whole term, it adds nothing to the score
                // and isn't in any document.
                if (typeof this.terms[queryTerm] === 'undefined') {
                    continue;
                }

                // This term isn't in the document, so the TF portion is 0 and this
                // term contributes nothing to the search score.
                if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
                    continue;
                }

                // The term is in the document, let's go.
                // The whole term is:
                // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))

                // IDF is pre-calculated for the whole docset.
                var idf = this.terms[queryTerm].idf;
                // Numerator of the TF portion.
                var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
                // Denominator of the TF portion.
                var denom = this.documents[id].terms[queryTerm].count
                    + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));

                // Add this query term to the score
                this.documents[id]._score += idf * num / denom;
            }

            if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
                results.push(this.documents[id]);
            }
        }

        results.sort(function (a, b) { return b._score - a._score; });
        return results.slice(0, 10);
    };

Finally, the search() method iterates over all the documents, computes a BM25 score for each one, and sorts them in descending order. Of course, it is unwise to walk through every document in the corpus at search time; that problem is tackled in the next article (inverted indexes and performance).

The code above is well commented, but the gist is: compute the BM25 score for each document and each query term. The IDF score for a term is pre-computed, so it only needs to be looked up when used. Term frequencies are also pre-computed as part of each document's properties. After that, all that's left is some simple arithmetic. Finally, we attach a temporary _score variable to each document, sort the results by score in descending order, and return the top 10.
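
Putting it all together, a hypothetical usage of the class could look like this (the constructor shown earlier is an inferred sketch, and the document IDs and texts here are made up):

    var engine = new BM25();

    engine.addDocument({ id: 1, body: 'Baseball is a bat-and-ball sport played between two teams.' });
    engine.addDocument({ id: 2, body: 'JavaScript is a programming language that runs in the browser.' });

    // Because the updateIdf() call inside addDocument() is commented out for
    // batch indexing, run it once here before searching.
    engine.updateIdf();

    var results = engine.search('baseball teams');
    // => up to 10 documents sorted by descending _score; document 1 should rank first.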

Demo, source code, caveats, and what's next

The example above can be optimized in a number of ways, and we'll cover them in part two of this series on full-text search, so stay tuned. I hope to finish it within a few weeks. Here's what we'll cover next:

    • Inverted indexes and fast lookups
    • Fast indexing
    • Better search results

For this demo, I wrote a small Wikipedia crawler that grabbed the first paragraph of quite a few (85,000) Wikipedia articles. Since indexing all 85K of them takes about 90 seconds on my computer, I've cut that set in half; I don't want you to waste your laptop's battery just for a simple full-text demo.

Because indexing is a heavy, CPU-blocking operation, I implemented it as a web worker. The indexing runs on a background thread; here you can find the full source code. You'll also find references to the source code of the stemmer and of my stop-word list. As for licensing, the code is free for educational purposes as always, but not for any commercial use.

And finally, the demo. Once the index is built, try searching for random things and phrases that Wikipedia would know about. Note that only 40,000 paragraphs are indexed, so you may have to try a few different topics.
