Expected Reciprocal Rank for Graded Relevance


Abstract
Many of the metrics used to evaluate information retrieval results apply only to binary relevance, and essentially only one metric in common use supports graded relevance, namely Discounted Cumulative Gain (DCG). One drawback of DCG is its additive nature and the underlying independence assumption: a document at a given position always contributes the same gain and receives the same discount, independently of the documents shown above it. Inspired by the "cascade" user model, we present a new metric for graded relevance that overcomes this difficulty and implicitly discounts documents that appear below highly relevant ones. More precisely, the new metric is defined as the expected reciprocal of the time a user needs to find a relevant document. It can be seen as an extension of the classical reciprocal rank metric to graded relevance, and we therefore call it Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with click-based metrics than other editorial metrics do.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software – performance evaluation (efficiency and effectiveness)
General Terms
Experimentation, Measurement
Keywords
Evaluation, non-binary relevance, web search, user model

1. Introduction
Evaluation in information retrieval has recently received renewed attention, largely because retrieval systems themselves are changing rapidly. The validity of the assumptions behind the Cranfield evaluation paradigm and TREC-style evaluation methodology is being challenged, in particular by ever-growing test collections, new types of information needs, and the availability of new sources of relevance data such as click logs and crowdsourcing.
How to evaluate web search engines correctly remains a challenging and open research problem. Most evaluation metrics for web search are based on cumulative gain, such as Discounted Cumulative Gain (DCG). These metrics are popular because they support the graded relevance judgments commonly used to assess web documents. While supporting graded relevance is important, there are other important factors that an evaluation metric should take into account.
One important factor that DCG cannot account for is how users actually interact with a ranked list of documents. The metric assumes that a user browses down to a position in the ranked list with some position-dependent probability. In reality, however, how far a user browses into a ranked list depends on many factors other than position alone. A serious problem with DCG is its assumption that the usefulness of the document at position i is independent of the documents ranked above it. Recent work on modeling user click behavior has shown that the position-based browsing assumption behind DCG does not hold. Instead, these studies show that the likelihood of a user examining the document at position i depends on how satisfied the user was with the documents seen earlier in the ranked list. This type of user model is called a cascade model.
More concretely, consider a simple example. Suppose we are evaluating two ranked lists, judged on a five-point scale (perfect, excellent, good, fair, bad). The first list contains 20 good documents, while the second has one perfect document followed by 19 bad documents. Which ranked list is better? Under most settings, DCG rates the first list as better. However, the second list contains a perfect document that fully satisfies the user's information need, so the user will examine the first document, be satisfied, stop browsing, and never see any of the poorly relevant documents. With the first list, on the other hand, the user has to expend more effort to satisfy his information need, because each good document only partially satisfies it. The user therefore actually prefers the second ranked list to the first: it not only satisfies the information need, it also minimizes the effort spent on the query.
The rank-biased precision (RBP) metric recently proposed by Moffat and Zobel incorporates a simple user model, but does not directly address the problem described here. The metric models the user's persistence in looking for relevant documents: impatient users examine only a small portion of the results, while patient users look deeper into the ranked list. The main problem with this simple user model is that real browsing behavior does not depend only on the user's persistence, but also on the quality of the ranked results, as the example above illustrates. Indeed, Moffat and Zobel acknowledge this and note that a more sophisticated user model could be used, but they leave that as a possible extension.
The main goal of this paper is to design a metric that, unlike RBP, is built on a more accurate user model, thereby providing a better alternative to DCG and RBP. To achieve this goal we present a new metric called Expected Reciprocal Rank (ERR). The metric supports graded relevance judgments and assumes a cascade browsing model; in this way it quantifies the usefulness of the document at position i conditioned on the relevance of the documents ranked above it. We argue, and demonstrate empirically, that our metric models user satisfaction better than DCG and is therefore a more appropriate criterion for evaluating retrieval systems, especially those that use graded relevance judgments, such as web search engines.
This work makes two main contributions. First, we propose a new metric based on the cascade model of user browsing. We show that the vast majority of existing metrics assume position-based browsing, which has proven to be a poor assumption, whereas cascade models better capture real user browsing behavior. Second, in experiments on a very large test set from a commercial search engine, we show that the ERR metric correlates with a range of click-based metrics much better than other editorial metrics such as DCG.
The remainder of this article is organized as follows. Section 2 provides background on various aspects of information retrieval evaluation. Section 3 reviews recent research on user browsing models. Our proposed ERR metric is described in Section 4. Section 5 discusses the relationship between ERR and several existing metrics. A rigorous and comprehensive experimental evaluation is given in Section 6. Finally, Section 7 discusses possible extensions of the model and Section 8 concludes the article.

2. Information Retrieval Evaluation
Evaluation plays an important role in information retrieval. Retrieval systems are typically evaluated for their effectiveness and their efficiency. Effectiveness evaluation quantifies how well a retrieval system satisfies users' search needs, while efficiency evaluation measures how fast the system is. Since our focus is on effectiveness, we give a brief overview of the various effectiveness metrics that have been proposed in information retrieval.
Most effectiveness measures in information retrieval rely on the somewhat ambiguous notion of relevance. This is largely due to the prevalence and popularity of the Cranfield evaluation paradigm and TREC-style evaluation methodology. These evaluations are based on a fixed set of queries, a fixed set of documents, and a fixed set of relevance judgments. Relevance judgments are collected by asking editorial judges to assess the relevance of a given document to a given query; relevance thus acts as a proxy for the notion of user preference. A retrieval metric is then computed by comparing the output of the retrieval system, usually a ranked list produced for the queries in the evaluation set, against the relevance judgments.
Although relevance judgments and metrics can be decoupled, they are closely related. For this reason, whenever a new way of collecting relevance judgments is proposed, it is usually accompanied by a new retrieval metric. The converse is less common: a newly proposed metric does not always require a new kind of relevance judgment.
There are many different ways of judging relevance. The Cranfield experiments, and most subsequent evaluations, make a series of assumptions about the judgments. For example, relevance is assumed to be topical (a relevant document is on the same topic as the query), binary (relevant / non-relevant), independent (the relevance of document A is independent of the relevance of document B), stable (judgments do not change over time), consistent (judges agree with one another), and complete (no judgments are missing). These assumptions form the basis of most classical information retrieval metrics, such as precision, average precision, and recall.
Although these assumptions simplify the editorial process to some extent, many of them are unrealistic and do not align well with user preferences. Researchers have therefore investigated new retrieval metrics obtained by relaxing these assumptions. We give a few examples. First, Jarvelin and Kekalainen focused on graded relevance judgments and proposed the DCG metric, which can take advantage of such judgments. Second, the TREC novelty track concentrated on relevance assessments that account for novelty, and the TREC interactive track assigned sub-topics to each query, asking assessors to judge the relevance of each document with respect to those sub-topics; this gave rise to a variety of sub-topic retrieval and diversity metrics. Finally, several researchers showed that the completeness assumption can be relaxed by inferring the relevance of unjudged documents in various ways, by ignoring unjudged documents, or by intelligently selecting which unjudged documents to send for judgment.
Evaluation metrics themselves tend to be much simpler and less problematic than relevance judgments. A metric is computed from the output of a retrieval system and the relevance judgments. Most metrics encode assumptions about what makes a "good" ranked list by manipulating the editorial judgments with some simple mathematical formula. For example, the DCG value at position k for a given query is computed as:
$\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{g_i} - 1}{\log_2(i + 1)}$
Here g_i is the relevance grade of the document ranked at position i. The numerator rewards documents with high relevance grades, while the denominator increasingly discounts the gain of documents ranked lower in the list. This simple metric captures the notion that a system placing highly relevant documents near the top of the ranking is better than one placing them further down. The same general idea underlies most precision-based metrics, including average precision, and the RBP metric mentioned in the introduction, which is computed as:
$\mathrm{RBP} = (1 - p) \sum_{i=1}^{n} g_i \, p^{\,i-1}$
Here g_i is the relevance grade of the document at position i, and p is a parameter that models the user's persistence when browsing a ranked list. This metric makes assumptions similar to those of DCG, except that the persistence parameter p models some notion of user browsing behavior, which DCG lacks. We will return to this important aspect, and how it relates to our proposed metric, later in the article.
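To make the two position-based formulas above concrete, the following Python sketch computes DCG@k and RBP for a list of graded judgments. The exponential gain 2^g - 1, the five-grade scale, and the example grades are assumptions chosen to match the formulas above, not output of any particular library.

```python
import math

def dcg_at_k(grades, k):
    """DCG@k with exponential gain (2^g - 1) and a log2(i + 1) positional discount."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(grades[:k], start=1))

def rbp(grades, p=0.7):
    """Rank-biased precision: a purely geometric, persistence-based discount p^(i - 1)."""
    return (1 - p) * sum(g * p ** (i - 1)
                         for i, g in enumerate(grades, start=1))

# Illustrative five-grade judgments (0 = bad ... 4 = perfect) for one ranked list.
grades = [4, 2, 0, 1, 3]
print(dcg_at_k(grades, k=5))
print(rbp(grades, p=0.7))
```

Note that in both functions the discount applied at position i depends on i alone, which is exactly the property discussed below.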
We now explain how the proposed metric fits into this broad landscape of evaluation research. Our metric is similar in spirit to DCG in that we assume relevance judgments are graded, independent, and complete. It is worth noting that our metric can easily be extended to handle incompleteness, but for simplicity of comparison with DCG we assume completeness. Our metric differs from DCG, however, in that it incorporates a user model as a proxy for dependent relevance judgments. DCG discounts the value of a document based only on its rank, regardless of which documents the user has already seen, whereas our metric implicitly discounts a document's score based on the relevance of the documents seen before it. The discounting scheme we use corresponds to a model of user browsing.

3. User Model
An accurate user model, one that closely reflects how users interact with the retrieval system, is necessary to produce a good relevance metric. Before diving into the details of our metric, it is therefore useful to review existing user models of retrieval systems. Broadly speaking there are two main classes of user models: position models and cascade models. Both attempt to capture the position bias in the presentation of search results. Position models assume that documents at different positions are examined independently, with an examination probability that is a function of position only, whereas cascade models capture dependencies between documents and model examination probabilities over the result set as a whole.
3.1 Position models
Position-based models are a popular way of dealing with the presentation bias inherent in ranked retrieval systems. Among them, the examination model explicitly estimates the examination probability at each position. It relies on the assumption that the user clicks on a link if and only if two conditions are met: the user examines the URL and finds it relevant; and the examination probability depends only on the position. The probability of a click on URL u at position p is therefore modeled as:
$P(\text{click} \mid u, p) = a_u \, b_p \qquad (1)$
where a_u is the attractiveness of URL u, that is, the propensity of users to click on it independently of its position, and b_p is the examination probability at position p, which depends on the position only.
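A minimal sketch of the examination hypothesis in equation (1): the click probability factorizes into a URL-attractiveness term and a position-only examination term. The numeric values below are illustrative placeholders, not estimates from real click logs.

```python
def click_probability(attractiveness, examination):
    """Position model of equation (1): P(click | u, p) = a_u * b_p."""
    return attractiveness * examination

# Hypothetical values: the same URL shown at positions 1 and 3.
a_u = 0.6                              # attractiveness of the URL (position-independent)
b = {1: 0.9, 2: 0.6, 3: 0.4}           # examination probability, a function of position only
print(click_probability(a_u, b[1]))    # 0.54
print(click_probability(a_u, b[3]))    # 0.24 -- lower solely because of the position
```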


Both the DCG and RBP metrics use a position model as their underlying user browsing model and apply a position-based discount function that progressively reduces the contribution of documents as their rank increases. Figure 1 compares the discount functions of DCG and RBP with the examination probabilities b_p estimated, using the position model described above, from the click logs of a commercial search engine. RBP with p = 0.7 approximates the estimated examination probabilities more closely. However, as shown below, the assumption that the examination probability depends only on the position has some serious fundamental flaws.
Consider, for example, a relevant document at position 3. If the documents at positions 1 and 2 are highly relevant, the document at position 3 is likely to be examined, and clicked, only rarely. If, on the other hand, the two top documents are non-relevant, it is much more likely to be examined and clicked. A position-based click model cannot capture these two scenarios: the same document at the same position has a very different clickthrough rate (CTR) depending on the documents above it. Figure 2 shows a real example of this phenomenon taken from the click logs of a commercial search engine.

In this example we observe that the CTR of the URL www.myspace.com at position 1 is 9 times its CTR at position 2, whereas on average the CTR at position 1 is only about twice that at position 2. The reason for this discrepancy is that the URL shown at position 1 is a perfect match, so users rarely browse down to position 2. A position model, which assumes the URLs are independent, cannot explain such a sharp difference in CTR.

3.2 Cascade models
The example above shows that, to model CTRs and examination probabilities accurately, position alone is not sufficient: the relevance of the documents ranked above must be taken into account. The cascade model differs from position models in that it considers the relevance of the URLs on the search results page. In its general form, the cascade model assumes that the user scans the search results from top to bottom and, at each position, has a certain probability of being satisfied. Let R_i denote the probability that the user is satisfied at position i. Once satisfied with a document, the user terminates the search and does not examine any of the documents below, regardless of what they are. It is natural to expect R_i to be an increasing function of the relevance grade, and we will later relate R_i to the (loosely defined) notion of relevance. Algorithm 1 outlines this generic version of the cascade model.
Algorithm 1: Generic cascade model.
1. i := 1
2. The user examines the document at position i.
3. With probability R_i the user is satisfied and stops.
4. Otherwise (with probability 1 - R_i), set i := i + 1 and go to step 2.
In this model the probability that the document at position i satisfies the user is R_i. The R_i values can be estimated by maximum likelihood from click logs; alternatively, R_i can be set as a function of the editorial grade of the URL, as in the following sections. For a given set of R_i values, the probability that the user is satisfied and stops at position r in a session is:
$P(\text{user stops at position } r) = R_r \prod_{i=1}^{r-1} (1 - R_i) \qquad (2)$
This is simply the probability that the user is not satisfied with the first r - 1 results and is satisfied with the r-th.
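The following sketch simulates the generic cascade model of Algorithm 1 and compares the empirical stopping distribution with the closed form of equation (2). The satisfaction probabilities R_i used here are made-up illustrative values.

```python
import random

def stop_probabilities(R):
    """Closed form of equation (2): P(stop at r) = R_r * prod_{i<r} (1 - R_i)."""
    probs, not_satisfied = [], 1.0
    for r_i in R:
        probs.append(not_satisfied * r_i)
        not_satisfied *= (1 - r_i)
    return probs

def simulate_cascade(R, trials=100_000, seed=0):
    """Monte Carlo version of Algorithm 1: scan top-down, stop once satisfied."""
    rng = random.Random(seed)
    stops = [0] * len(R)
    for _ in range(trials):
        for i, r_i in enumerate(R):
            if rng.random() < r_i:     # user satisfied at position i + 1
                stops[i] += 1
                break                  # documents below are never examined
    return [count / trials for count in stops]

R = [0.8, 0.3, 0.5, 0.1]               # illustrative satisfaction probabilities R_i
print(stop_probabilities(R))           # may sum to < 1: the user can reach the end unsatisfied
print(simulate_cascade(R))             # empirical frequencies, close to the closed form
```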

4. The ERR Metric
Let us now introduce our metric, which is based on the cascade model described in the previous section. A key step is to define the probability that a user finds a document relevant as a function of its grade. Let g_i be the grade of the document at position i; we then set:
$R_i := \mathcal{R}(g_i) \qquad (3)$
where $\mathcal{R}$ is a mapping from relevance grades to relevance probabilities. There are different possible choices for $\mathcal{R}$; to mirror the gain function used in DCG, we define it as:
$\mathcal{R}(g) = \frac{2^{g} - 1}{2^{g_{\max}}}, \qquad g \in \{0, \ldots, g_{\max}\} \qquad (4)$
When a document is irrelevant (g = 0), the probability that the user finds it relevant is 0, while for a highly relevant document (g = g_max, e.g. g = 4 on the five-grade scale) the relevance probability is close to 1.
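A short sketch of the grade-to-probability mapping in equation (4), assuming the five-grade scale (0 = bad ... 4 = perfect) used in the example of Section 1.

```python
def relevance_probability(grade, max_grade=4):
    """Equation (4): R(g) = (2^g - 1) / 2^g_max."""
    return (2 ** grade - 1) / (2 ** max_grade)

print([relevance_probability(g) for g in range(5)])
# [0.0, 0.0625, 0.1875, 0.4375, 0.9375]
```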
We first define the metric in a more general form using a utility function φ of the position. This function satisfies φ(1) = 1 and φ(r) → 0 as r → +∞. A cascade-based metric is then defined as the expected value of φ at the position where the user stops:
$\mathbb{E}\big[\varphi(r)\big], \quad \text{where } r \text{ is the position at which the user stops.}$
In the following we consider the special case φ(r) = 1/r, but there is nothing special about this choice; we could, for instance, also choose φ(r) = 1/log2(r + 1), like the discount function of DCG. With φ(r) = 1/r, the resulting metric is the Expected Reciprocal Rank:
$\mathrm{ERR} := \mathbb{E}\!\left[\frac{1}{r}\right]$
It may seem difficult to compute ERR from the definition above because of the expectation, but it can in fact be computed easily as follows:
$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, P(\text{user stops at position } r)$
where n is the number of documents in the ranked list. The probability that the user stops at position r follows from the definition of the cascade model. Substituting equation (2) into the expression above, we finally obtain:
$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R_r \prod_{i=1}^{r-1} (1 - R_i) \qquad (5)$
A naive computation of this sum requires O(n^2) time, but as shown in Algorithm 2, ERR can easily be computed in O(n) time.
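Below is a minimal Python sketch of the single-pass O(n) computation described above (what the paper refers to as Algorithm 2), assuming a five-grade scale and the mapping of equation (4).

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """ERR of equation (5), computed in a single O(n) pass over the ranked grades."""
    err, not_satisfied = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        R = (2 ** g - 1) / (2 ** max_grade)   # equation (4)
        err += not_satisfied * R / r          # P(user stops here), discounted by 1/r
        not_satisfied *= (1 - R)              # the user continues only if not yet satisfied
    return err

# The two lists from the example in Section 1:
print(expected_reciprocal_rank([2] * 20))         # twenty "good" documents: about 0.39
print(expected_reciprocal_rank([4] + [0] * 19))   # one "perfect" then nineteen "bad": 0.9375
```

As argued in the introduction, the cascade-based metric scores the second list higher, whereas an additive, position-discounted metric like DCG typically prefers the first.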

Compared with purely position-based metrics such as DCG and RBP, whose discounts depend on the position only, the discount in ERR depends on the relevance of the documents ranked above. The "effective" discount that ERR applies to the document at position r is:
$\frac{1}{r} \prod_{i=1}^{r-1} (1 - R_i)$
Thus, the more relevant the preceding documents are, the larger the corresponding discount. This discounting is satisfying because it reflects real user behavior.
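A small illustration of this "effective" discount: the same relevant document at position 3 contributes far less when it sits below two highly relevant documents than when it sits below two bad ones. The grades are illustrative and the mapping is that of equation (4).

```python
def effective_discount(preceding_grades, r, max_grade=4):
    """ERR discount at position r: (1/r) times prod over the documents above of (1 - R(g))."""
    not_satisfied = 1.0
    for g in preceding_grades:
        not_satisfied *= 1 - (2 ** g - 1) / (2 ** max_grade)
    return not_satisfied / r

print(effective_discount([4, 4], r=3))   # below two "perfect" documents: about 0.0013
print(effective_discount([0, 0], r=3))   # below two "bad" documents: the plain 1/3 discount
```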
Figure 3 summarizes the discussion so far by showing the links between user models and metrics. Most traditional metrics, such as DCG and RBP, assume a position-based user browsing model. As discussed above, such models have been shown to reflect actual user behavior poorly. The cascade-based user model, a more accurate user model, forms the basis of the ERR metric we propose.

5. Relation to Other Metrics
As discussed in Section 2, our metric bears some similarity to DCG. It also has points in common with several other metrics, which we describe in this section.
First, our metric is related to the Expected Search Length (ESL) metric introduced by Cooper in 1968. ESL, defined over weakly ordered documents, quantifies the effort a user must expend to find k relevant documents: it is the expected number of non-relevant documents the user sees while scanning the result list in order, before reaching the k-th relevant document. In the simplest case the weak ordering is just the ranked output of the system, and ESL can be computed by simply counting the number of non-relevant documents that appear before the k-th relevant one. Our metric likewise measures the effort needed to satisfy the user, but without assuming that the user wants to find exactly k relevant documents.
Second, ERR is closely related to the RBP metric proposed by Moffat and Zobel. Our metric can be seen as a generalization of RBP obtained by adopting the cascade model as the user browsing model. Combining the cascade model with RBP is natural and brings several benefits, including removing the need to fix the persistence probability a priori and allowing human judgments and clickthrough rates to be combined seamlessly in a unified framework, as discussed in Section 7.3.
Third, in a binary relevance setting, where all R_i values are either 0 or 1, equation (5) simplifies to:
$\mathrm{ERR} = \frac{1}{r_{\min}}, \qquad r_{\min} = \min\{\, r : R_r = 1 \,\}$
which is exactly the reciprocal rank (RR) metric: the product $\prod_{i<r}(1 - R_i)$ vanishes as soon as a relevant document appears above position r, so only the first relevant document contributes. In the binary relevance case, ERR therefore reduces to RR.
Fourth, ERR can be regarded as a special case of the Normalized Cumulative Utility (NCU) family of metrics, which are computed as:
$\mathrm{NCU} = \sum_{r} P(r)\, NU(r)$
where P(r) is the probability that the user stops at position r, and NU(r) is a utility, defined as a combination of the benefit gained and the effort spent by the user over positions 1 to r. In the case of ERR, P(r) is given by the cascade model (2) and NU(r) = 1/r, but other utility functions NU could be considered, such as precision or normalized cumulative gain. The important point shared by ERR and NCU is that the probability of the user stopping is decoupled from the utility.
Finally, when the R_i are very small, ERR behaves like additive metrics such as DCG: the products $\prod_{i<r}(1 - R_i)$ are then all close to 1, and equation (5) is approximately equal to:
$\mathrm{ERR} \approx \sum_{r=1}^{n} \frac{R_r}{r}$
From the user model perspective there is no particular reason to consider infinitesimal R_i; the point is simply that when the R_i are much smaller than 1, ERR becomes more similar to DCG. This is especially likely to happen for difficult queries that have only marginally relevant documents. This behavior is also confirmed in practice; see the end of Section 6.2.

6. Evaluation
Evaluating a new metric is challenging because there is no gold standard to compare it against. For this reason, most papers that propose new metrics do not evaluate them directly. For instance, [16, 17] show that their newly introduced metrics correlate with other standard metrics, but this does not establish that those metrics are "better" in terms of user satisfaction.
In this paper we try to narrow this gap by considering click-based metrics. Even though clickthrough rates constitute indirect, noisy user feedback, they still contain valuable information about user preferences. Indeed, analyses of CTR data have been shown to provide good estimates of retrieval system quality. In our evaluation we compute the correlation between various click metrics and editorial metrics.

6.1 Data sets
6.2 Correlation with click metrics
6.3 Comparison of two ranking functions

7. Extensions
This section discusses several possible extensions of ERR that may prove useful in various evaluation settings.

7.1 The parameters R(g)
If L is the number of relevance grades, our metric has L adjustable parameters, the values R(g). Instead of fixing these values as in equation (4), we can optimize them, for example by maximizing the correlation with clickthrough rates or the agreement with side-by-side tests, as has been done for DCG. Beyond the mapping function R, one could also learn the utility function φ (see the definition in Section 4): instead of taking φ(r) = 1/r, learn the values φ(1), ..., φ(k), where k is the number of positions.
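One way this optimization could be sketched is shown below: treat R(0), ..., R(L - 1) as free parameters and keep the values whose ERR scores correlate best with a per-query click metric. The input data, the choice of click metric, and the use of a simple random search are all illustrative assumptions; the text above does not prescribe a particular optimizer.

```python
import random
from statistics import correlation   # Pearson correlation, Python 3.10+

def err_with_params(grades, R_of_g):
    """ERR computed with free per-grade probabilities R_of_g[g] instead of equation (4)."""
    err, not_satisfied = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        err += not_satisfied * R_of_g[g] / r
        not_satisfied *= 1 - R_of_g[g]
    return err

def fit_relevance_probabilities(judged_rankings, click_scores, levels=5, trials=2000, seed=0):
    """Random search for R(0), ..., R(levels - 1) maximizing the per-query correlation
    between ERR and a click-based score. Both inputs are assumed to be given:
    judged_rankings is a list of per-query grade lists, click_scores a matching list
    of per-query click metrics (e.g. clickthrough rates)."""
    rng = random.Random(seed)
    best_R, best_corr = None, float("-inf")
    for _ in range(trials):
        R = sorted(rng.random() for _ in range(levels))   # keep R non-decreasing in the grade
        scores = [err_with_params(g, R) for g in judged_rankings]
        c = correlation(scores, click_scores)
        if c > best_corr:
            best_R, best_corr = R, c
    return best_R, best_corr
```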

7.2 Extended cascade models
The original cascade model can be extended to include an abandonment probability: if the user is not satisfied at a given position, he examines the next URL with probability γ, and abandons the search with probability 1 - γ. In this model, the probability that the user stops, satisfied, at position r is:
$P(\text{user stops at position } r) = \gamma^{\,r-1}\, R_r \prod_{i=1}^{r-1} (1 - R_i)$
This is the same as equation (2) except for the extra factor γ^(r-1). Because the user can now abandon the search, it becomes reasonable to define a simpler utility function: 0 if the user abandons and 1 otherwise, i.e. φ(r) = 1 in the general definition of Section 4. The resulting metric is:
$\sum_{r=1}^{n} \gamma^{\,r-1}\, R_r \prod_{i=1}^{r-1} (1 - R_i)$
This is very similar to the ERR formula (5); the only difference is that the decay factor 1/r is replaced by the geometric decay factor γ^(r-1).
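A sketch of this extended metric: identical to the ERR computation except that the 1/r discount is replaced by the geometric abandonment factor γ^(r-1). The value of γ and the example grade lists are illustrative.

```python
def satisfaction_probability(grades, gamma=0.9, max_grade=4):
    """sum_r gamma^(r-1) * R_r * prod_{i<r} (1 - R_i): the probability that the user is
    eventually satisfied rather than abandoning (utility 1 if satisfied, 0 otherwise)."""
    total, not_satisfied, decay = 0.0, 1.0, 1.0
    for g in grades:
        R = (2 ** g - 1) / (2 ** max_grade)
        total += decay * not_satisfied * R
        not_satisfied *= (1 - R)
        decay *= gamma
    return total

print(satisfaction_probability([4] + [0] * 19, gamma=0.9))   # 0.9375: satisfied immediately
print(satisfaction_probability([2] * 20, gamma=0.9))         # lower: satisfaction accrues slowly
```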

7.3 Connection with clickthrough rates
The mapping function from relevance grades to relevance probabilities is currently chosen to match the DCG gain function. An alternative is to learn this mapping directly from click logs: for example, R(g) can be estimated from the click logs of all URLs with grade g.
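One way to realize this suggestion is sketched below: pool the click logs and, for each editorial grade g, estimate R(g) as the fraction of examined impressions of grade-g URLs that ended in a satisfied click. The log format and the notion of a "satisfied click" (for example, the last click of a session) are assumptions made for illustration.

```python
from collections import defaultdict

def estimate_R_from_clicks(log_records):
    """log_records: iterable of (grade, satisfied) pairs, one per examined impression,
    where `satisfied` is a boolean proxy such as "last satisfied click of the session".
    Returns an empirical R(g) for every grade observed in the log."""
    examined = defaultdict(int)
    satisfied = defaultdict(int)
    for grade, was_satisfied in log_records:
        examined[grade] += 1
        satisfied[grade] += int(was_satisfied)
    return {g: satisfied[g] / examined[g] for g in examined}

# Hypothetical toy log: grade-4 impressions usually satisfy, grade-0 ones rarely do.
toy_log = [(4, True), (4, True), (4, False), (0, False), (0, False), (2, True), (2, False)]
print(estimate_R_from_clicks(toy_log))
```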
Our metric can also be extended easily in another direction. Most existing metrics are based entirely on one type of relevance judgment and cannot easily make use of more than one. Because our metric is based on a click model, it can seamlessly combine clicks with relevance judgments: when a document lacks an editorial judgment, we can use the relevance probability predicted by combining the cascade model with the click logs. This helps with the problem of missing judgments. Conversely, we could rely mostly on CTR for evaluation and actively collect editorial judgments only when the probabilities cannot be predicted confidently from the click logs alone. The latter would be a more cost-effective way of evaluating search engines.

7.4 Diversity
Metrics that incorporate the notion of diversity have recently received attention. The metric presented in this paper, and the cascade model behind it, can easily be extended to handle this notion.
Let P(t | q) be the probability distribution over the topics t of a given query q. Each document is now judged with respect to the possible topics, and g_it denotes the grade of the document at position i with respect to topic t. The associated relevance probabilities are R_it := R(g_it). As in the standard cascade model, the probability that a user interested in topic t stops at position r is:
$P_t(\text{user stops at position } r) = R_{rt} \prod_{i=1}^{r-1} (1 - R_{it})$
Marginalizing over topics, the probability that the user stops at position r is:
$P(\text{user stops at position } r) = \sum_{t} P(t \mid q)\, R_{rt} \prod_{i=1}^{r-1} (1 - R_{it})$
The diversity version of ERR can therefore be written as:
$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \sum_{t} P(t \mid q)\, R_{rt} \prod_{i=1}^{r-1} (1 - R_{it}) \qquad (6)$
Interestingly, an equation similar to (6) was derived in [1], but with the goal of finding a diverse result set rather than for evaluation (their evaluation is instead based on an expected DCG, which in our view does not place enough emphasis on the underlying intent). More precisely, the objective function of [1] is the probability that the user finds a relevant result, not the expected reciprocal rank, although the two are related. Moreover, [1] notes in its conclusion that future work should optimize the expected rank at which the user finds useful information rather than the probability of finding some: that is exactly what formula (6) achieves.
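A sketch of equation (6): the per-topic cascade probabilities are computed exactly as in the standard model and then averaged under the topic distribution P(t | q). The topic distribution and per-topic grades below are purely illustrative.

```python
def diversity_err(topic_probs, grades_by_topic, max_grade=4):
    """ERR marginalized over topics: sum_r (1/r) sum_t P(t|q) R_rt prod_{i<r} (1 - R_it)."""
    total = 0.0
    for t, p_t in topic_probs.items():
        not_satisfied = 1.0
        for r, g in enumerate(grades_by_topic[t], start=1):
            R = (2 ** g - 1) / (2 ** max_grade)
            total += p_t * not_satisfied * R / r
            not_satisfied *= (1 - R)
    return total

# Hypothetical two-topic query: the ranking covers topic "a" well and topic "b" poorly.
topic_probs = {"a": 0.7, "b": 0.3}
grades_by_topic = {"a": [4, 1, 0], "b": [0, 0, 2]}
print(diversity_err(topic_probs, grades_by_topic))
```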

8. Conclusion
In this paper we presented a novel information retrieval evaluation metric called Expected Reciprocal Rank, or ERR. The metric, motivated by the cascade model of user browsing, measures the (reciprocal of the) expected effort a user needs to satisfy his information need. It differs from average precision, rank-biased precision, and DCG in that it heavily discounts the contribution of documents that appear after highly relevant ones. This intuition, which follows from the cascade model, reflects the fact that a user who has seen one or more highly relevant documents is more likely to stop browsing. ERR supports graded relevance judgments and reduces to reciprocal rank in the case of binary judgments.
A rigorous empirical evaluation was conducted on data from a commercial search engine. The results show that ERR correlates better with click-based metrics than DCG and other editorial metrics do. The difference in correlation is especially pronounced for navigational, short, and head queries, for which ERR correlates considerably more strongly than DCG. Our experimental results indicate that ERR reflects real user browsing behavior better than DCG and quantifies user satisfaction more accurately.
Finally, we presented several possible extensions of ERR that make the metric more robust and attractive. These include a method for automatically estimating the metric's parameters, the ability to combine human editorial judgments with CTR data within the metric, and a simple way to incorporate the notion of diversity.
We therefore argue that ERR should replace DCG as the de facto evaluation metric for web search engines: it is inspired by an accurate user browsing model, agrees better with a wide range of click-based metrics, and admits many practical and useful extensions.
