Basic Principles of Search Engine Ranking Algorithms: Seeking Medical Treatment as an Example

Category: Search Technology
When we submit a query to a search engine, it returns a long list of results in a particular order. What criteria determine that order? This seemingly simple question is one of the core problems that information retrieval researchers work on.
To illustrate the problem, let us look at a topic far older than search engines: seeking medical treatment. For example, which doctor should I see if I have a toothache? Suppose I have only three options:
- Dr. A treats eye diseases and stomach ailments;
- Dr. B treats dental diseases and stomach ailments, and also treats eye diseases;
- Dr. C specializes in dental diseases.
Dr. A is clearly out. Between Dr. B and Dr. C, I should choose Dr. C, because he is more focused and therefore a better fit for my complaint. Now add one more condition: Dr. B has twenty years of practice and excellent skills, while Dr. C has only five years of experience. The choice is no longer easy: whether to go with the more specialized Dr. C or the more skilled Dr. B requires careful weighing.
At the very least we can conclude that choosing a doctor involves two factors: how well the doctor's specialty fits the illness, and how good the doctor is. This conclusion seems obvious, and you can arrive at it naturally. Isn't ranking in a search engine the same? We must consider both how well a webpage's content matches the user's query and the quality of the page itself. But how do we combine these two factors into a single ranking criterion rather than two or more? If we express each factor as a number, should the final ranking be based on their sum, their product, or some decision tree over them? If a sum, should it be a simple sum or a weighted one?
We could combine the two factors by trial and error, guided by intuition and experience. A better way, however, is to find a clear theoretical basis, preferably by tying the problem to a solid discipline such as mathematics. Relying on simple experience, people could build houses in ancient times; but without solid disciplines such as structural mechanics and the mechanics of materials, putting up a skyscraper hundreds of meters tall would be extremely difficult. Likewise, a ranking algorithm built on simple experience can handle a collection of tens of thousands of web pages, but retrieving from hundreds of millions of pages requires a more solid theoretical foundation.
When seeking medical treatment, patients give priority to doctors who diagnose accurately and treat effectively. A search engine, likewise, should rank webpages by the probability that they meet the user's need, from largest to smallest. If we use q to denote the event that the user issues a specific query and d to denote the event that a specific webpage satisfies that need, then the ranking criterion can be expressed as a conditional probability:
P(d | q)
This simple conditional probability ties the search engine ranking algorithm to the solid discipline of probability theory, like fitting a ship with a compass before it sails. By Bayes' rule, the conditional probability can be expanded as:

P(d | q) = P(q | d) · P(d) / P(q)
As you can see, the ranking criterion consists of three parts: a property of the query, P(q); a property of the webpage, P(d); and the matching relationship between the two, P(q | d). For a given query, P(q) is the same for every webpage, so it can be dropped from the ranking, that is:

P(d | q) ∝ P(q | d) · P(d)
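The step of dropping P(q) can be seen in a tiny numerical sketch (the scores below are invented, not from the post): dividing every page's score by the same constant P(q) changes the values but never the order.

```python
# For a fixed query, P(q) is the same constant for every candidate page,
# so dividing by it rescales all scores equally and cannot change their order.
# All numbers below are made up for illustration.

joint_scores = {"d1": 0.030, "d2": 0.045, "d3": 0.006}  # hypothetical P(q|d) * P(d)
p_q = sum(joint_scores.values())                        # P(q), if d1..d3 were the whole collection

posterior = {d: s / p_q for d, s in joint_scores.items()}  # P(d|q) via Bayes' rule

order_without_pq = sorted(joint_scores, key=joint_scores.get, reverse=True)
order_with_pq = sorted(posterior, key=posterior.get, reverse=True)
assert order_with_pq == order_without_pq
print(order_without_pq)  # ['d2', 'd1', 'd3'] either way
```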
The left-hand side of the formula is the probability that a webpage meets the need expressed by a known user query. To respond to queries quickly, a search engine must preprocess all candidate webpages in advance. During preprocessing only the pages are available and the user's query is not yet known, so for each webpage we estimate the first term on the right, P(q | d): the probability that the page would satisfy a given query. This corresponds to the doctor's degree of specialization described above. For example, if one webpage is devoted entirely to toothache while another covers both toothache and stomach ailments, the first page has the higher P(q | d) for a toothache query.
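The post does not say how P(q | d) is estimated in practice; one common textbook approach is a smoothed unigram language model ("query likelihood"). The sketch below uses that approach with made-up page texts to show why the page devoted to toothache scores higher:

```python
from collections import Counter

def query_likelihood(query_terms, page_text, alpha=0.9, background=None):
    """Rough estimate of P(q|d) with a Jelinek-Mercer smoothed unigram model.

    This is just one common way to estimate the term; the post does not
    specify how a production engine computes it.
    """
    words = page_text.lower().split()
    counts, total = Counter(words), len(words)
    background = background or {}
    prob = 1.0
    for term in query_terms:
        p_in_page = counts[term] / total if total else 0.0
        p_in_background = background.get(term, 1e-6)  # tiny floor for unseen terms
        prob *= alpha * p_in_page + (1 - alpha) * p_in_background
    return prob

# A page devoted to toothache vs. a page that mixes toothache and stomach topics.
focused = "toothache toothache causes toothache treatment"
mixed = "toothache treatment stomach pain stomach diet"
print(query_likelihood(["toothache"], focused))  # ~0.54, the more specialized page
print(query_likelihood(["toothache"], mixed))    # ~0.15
```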
The second term on the right, P(d), is the prior probability that a webpage meets user needs. It reflects the quality of the page and has nothing to do with any particular query: if we had to recommend a webpage to a stranger whose needs we do not know, P(d) is the probability that this particular page is the one to recommend. In traditional information retrieval models this quantity was not considered important; the classic vector space model and BM25, for example, try to derive the ranking weight solely from the match between query and document. In fact, this query-independent quantity matters a great deal. If we use a page's access frequency to estimate its probability of meeting user needs, we see enormous differences between pages: some are visited only once or twice a day, while others are visited thousands of times or more. The information carried by this difference was ignored by traditional search engines for a long time, until Google invented PageRank and incorporated it into ranking. PageRank is a good estimate of P(d), and adding this factor immediately lifted search quality to a new level.
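The paragraph above suggests estimating P(d) from how often a page is visited. The counts below are invented and serve only to show the normalization; PageRank is another, link-based estimate of the same quantity:

```python
# Estimate the query-independent prior P(d) from daily visit counts (made-up numbers).
daily_visits = {
    "popular_portal": 5000,
    "niche_dental_page": 40,
    "rarely_read_page": 2,
}

total = sum(daily_visits.values())
p_d = {page: visits / total for page, visits in daily_visits.items()}

for page, p in sorted(p_d.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: P(d) ~= {p:.4f}")
```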
The formula also answers the question raised earlier: how should the match between page and query and the quality of the page be combined in ranking? It tells us, on firm grounds, that if the match between a webpage and the query is expressed as P(q | d) and the quality of the page as P(d), then the pages should be ranked by their product. Modern commercial search engines take far more fine-grained ranking factors into account, possibly hundreds of them, and combining those is considerably more complex and difficult.
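To tie the pieces together, here is a toy end-to-end ranking under the same assumptions as the two sketches above (invented page texts and visit counts; a real engine folds in hundreds of additional factors):

```python
from collections import Counter

def p_q_given_d(query_terms, text, alpha=0.9):
    """Smoothed unigram estimate of P(q|d), as sketched earlier."""
    words = text.lower().split()
    counts, total = Counter(words), len(words)
    prob = 1.0
    for term in query_terms:
        prob *= alpha * (counts[term] / total if total else 0.0) + (1 - alpha) * 1e-6
    return prob

# (page text, daily visits) -- both invented for illustration
corpus = {
    "focused_toothache_page": ("toothache toothache causes toothache treatment", 40),
    "mixed_health_page":      ("toothache treatment stomach pain stomach diet", 300),
    "popular_portal":         ("news weather sports finance entertainment", 5000),
}

total_visits = sum(v for _, v in corpus.values())
query = ["toothache"]

# Rank by the product P(q|d) * P(d), exactly as the formula prescribes.
scores = {
    page: p_q_given_d(query, text) * (visits / total_visits)
    for page, (text, visits) in corpus.items()
}
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.6f}")
```

With these made-up numbers the well-visited but less focused health page edges out the highly specialized one, which mirrors the Dr. B versus Dr. C trade-off: the product lets a strong query-independent prior compensate for a weaker match, and vice versa.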
By Jiangling, Relevance Group
http://stblog.baidu-tech.com/?p=121