Solr in Action, Note (2): Scoring mechanism (similarity calculation)
1. A brief introduction to similarity calculation in search engines
How content similarity is calculated is determined by the search engine's retrieval model (search model). This model is the theoretical foundation of the search engine: it provides a mathematical model that quantifies relevance, and without it relevance could not be computed at all. Theoretical work on retrieval models rests on an idealized implicit assumption, namely that the user's need has already been clearly expressed by the query, so the model does not have to model the need itself. In practice this is far from reality. Even when the same query terms are used, the needs and goals of different users may vary greatly, and the retrieval model can do nothing about this. There are several common retrieval models:
- Boolean model: The mathematical basis is set theory. Documents and user queries are each represented as the set of words they contain, and whether a document matches a query is determined by Boolean algebra. The disadvantages are that the output is binary (relevant or not relevant), so results cannot be ranked by degree of relevance, and that expressing a search need as a Boolean expression demands too much of the user;
- Vector space model: A document is treated as a vector of T features. The features are generally words, and each feature is assigned a weight according to some scheme; together the T weighted features represent the topic content of the document. The query is represented as a vector in the same way. Similarity can be defined by the cosine, which measures the angle between the query vector and the document vector in the T-dimensional space: the smaller the angle, the more similar the two are. Feature weights can be computed with the TF * IDF framework. TF is the term frequency, that is, how often the word occurs in the document. IDF is the inverse document frequency, which reflects how many documents in the collection contain the word. IDF is a global factor: it does not consider the characteristics of an individual document, only the relative importance of feature words. The more documents a feature word appears in, the lower its IDF value and the weaker its ability to distinguish between documents. In this framework, Weight = TF * IDF is the weight formula. The disadvantage of the vector space model is that it is an empirical model that is improved through intuition and experience, and it lacks a clear theory to guide the direction of improvement. For example, the adjustments that penalize long documents when computing TF and IDF rely on empirically chosen values;
- Probabilistic model: One of the most effective models at present. Okapi BM25, the classic scoring formula of this family, is widely used in commercial search engines for ranking web pages. The probabilistic retrieval model is derived from the probability ranking principle. The basic idea is that, given a user query, if the search system returns results in descending order of the relevance between each document and the user's need, the accuracy of the system is optimal. The core task is therefore to estimate that relevance as accurately as possible from the document collection;
- Language model: First proposed in 1998. The other retrieval models reason from the query to the document, that is, how to find relevant documents for a given user query. This model works in exactly the opposite direction, from the document to the query: it builds a separate language model for each document, estimates the probability that the document's model would generate the user query, and returns the documents sorted by this generation probability from high to low. A language model represents the distribution of words or word sequences in a document;
- Machine-learned ranking (learning to rank): As search engines have developed, more and more factors have to be taken into account when ranking a web page, and fitting them by hand from human experience is no longer feasible. Machine learning is well suited to this situation; for example, Google's web page ranking formula reportedly considers more than 200 factors. Search engines also readily supply the data that machine learning needs, such as users' search click records. The approach consists of four steps: manually labeling training data, extracting document features, learning the classification function, and applying the learned model in the actual search system. User click records can be used in place of manual labeling to approximate relevance scores for documents.
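To make the probabilistic model above more concrete, here is a minimal Python sketch of Okapi BM25 scoring. It is an illustration under stated assumptions rather than the formula of any particular engine: the function name `bm25_scores`, the toy documents, the parameter values `k1=1.2` and `b=0.75` (commonly used defaults), and the non-negative IDF variant are all choices made for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 controls term-frequency saturation and b controls length
    normalization; both values are assumptions for this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents get a larger denominator.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "solr is a search server built on lucene".split(),
    "lucene scoring uses term statistics".split(),
    "cooking pasta takes ten minutes".split(),
]
scores = bm25_scores("lucene scoring".split(), docs)
# Sort document indices by score, highest first.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The second document matches both query terms and is shorter than average, so it ranks first; the third document shares no term with the query and scores zero.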
2. Vector space model
Several similarity calculation methods were briefly introduced above. Solr's classic TF-IDF scoring builds on the most basic of them, the vector space model, so this section focuses on that model. The other models will be studied when time permits.
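As a starting point, the sketch below shows the vector space model in a few lines of Python: each document becomes a sparse vector of TF * IDF weights, and the query is compared to each document by cosine similarity. This is a simplified illustration, not Solr's actual scoring formula. The helper names `tfidf_vectors` and `cosine`, the toy documents, and the `log(N / df) + 1` form of IDF are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (sparse TF*IDF vectors, idf table) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    # Assumed IDF form; the +1 keeps terms found in every document non-zero.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "solr is a search server built on lucene".split(),
    "lucene scoring uses term statistics".split(),
    "cooking pasta takes ten minutes".split(),
]
doc_vectors, idf = tfidf_vectors(docs)

# Weight the query with the same IDF table; unknown terms are dropped.
query_tf = Counter("lucene scoring".split())
query_vector = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}

sims = [cosine(query_vector, dv) for dv in doc_vectors]
print(sims)
```

A larger cosine means a smaller angle between the query vector and the document vector, so documents would be returned in descending order of `sims`.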