[Rank] Learning to rank


From: http://jiangfeng1124.diandian.com/post/2011-04-02/5532416

Last October, I started to dig into learning to rank. The initial motivation came from ranking tasks encountered in my experiments: although traditional ranking formulas are simple and easy to tune, they can only use a small number of features and cannot exploit the information that could support the ranking, so ranking performance was unsatisfactory. Supervised learning is therefore used to guide the ranking. The main references are as follows:

1. Adapting Ranking SVM to Document Retrieval. (Tie-Yan Liu et al., MSRA) [PDF]

2. Learning to Rank for Information Retrieval (Tutorial). (Tie-Yan Liu et al., MSRA) [PDF]

3. Learning to Rank: From Pairwise Approach to Listwise Approach. (Tie-Yan Liu et al., MSRA) [PDF]

Recently, engineers and researchers have become more and more interested in learning to rank. In this year's national data mining invitational competition for college students, we also integrated learning-to-rank methods into our model, so I think it is worth summarizing the methods and principles of learning to rank, as well as how to use the related tools.

Learning to rank, as its Chinese name "ranking learning" suggests, is a ranking method based on supervised learning. Traditional ranking is generally implemented by constructing a ranking function that reflects different criteria; in the IR field, this means ranking by relevance. Typically, given a query, a search engine retrieves a list of related documents, ranks the documents in the list by the relevance between (query, document), and returns them to the user. Many factors affect relevance, such as TF, IDF, and document length (DL), and many classic models address this task, such as the Vector Space Model (VSM) and the Boolean Model. We will not discuss these traditional methods here; every book on information retrieval analyzes them in detail.

For a traditional ranking model with many parameters, tuning them empirically becomes very difficult. People therefore naturally thought of using machine learning to solve this problem, and thus learning to rank appeared.

The idea is actually very simple. The two tasks we most often solve with machine learning are classification and regression. So people asked: can ranking be converted into classification or regression? If it can, then regardless of how good the resulting model is, it at least gives us a way to handle the ranking task. Based on this idea, three families of methods were proposed: 1. Pointwise; 2. Pairwise; 3. Listwise.

These three approaches are discussed in detail below, from principles and methods to related tools.

I. Pointwise Approach

The main idea of the pointwise method is to convert the ranking problem into a multiclass classification problem or a regression problem. Taking multiclass classification as an example: suppose the set of documents related to a query is {d1, d2, ..., dn}. First, features are extracted from each of the n pairs (query, di) and represented as feature vectors. The relevance between the query and di serves as the label; a common labeling scheme is {perfect, excellent, good, fair, bad}, five classes in total. A query and its document set thus yield n training instances. With these training instances, any multiclass classifier can be learned, such as maximum entropy or SVM.

Since pointwise is relatively simple, it is not formally expanded here; I will just briefly analyze its characteristics.

The pointwise method carries the implicit assumption that absolute relevance is query-independent: as long as (query, document) pairs have the same relevance grade, say "perfect", they are placed in the same class, regardless of the query. However, relevance is not independent of the query. For a very common query and an irrelevant document, the TF between them may still be higher than the TF between a very rare query and one of its relevant documents. The training data thus becomes inconsistent, and good results are hard to achieve. Moreover, documents predicted into the same class cannot be ordered relative to one another.
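To make the reduction concrete, here is a minimal sketch of the pointwise approach (my own toy formulation, using scikit-learn's LinearSVC as the multiclass classifier; the feature vectors and data here are random placeholders):

```python
# Pointwise learning to rank as multiclass classification: a toy sketch.
# Each (query, document) pair is assumed to be already represented as a
# feature vector; the label is one of five relevance grades.
import numpy as np
from sklearn.svm import LinearSVC

GRADES = {"bad": 0, "fair": 1, "good": 2, "excellent": 3, "perfect": 4}

def train_pointwise(X, grades):
    """X: (n_pairs, n_features); grades: one grade name per pair."""
    y = np.array([GRADES[g] for g in grades])
    clf = LinearSVC()  # any multiclass classifier (e.g. MaxEnt) would do
    return clf.fit(X, y)

def rank_pointwise(clf, X_query):
    """Rank one query's documents by predicted grade, best first.
    Documents predicted into the same class stay tied -- exactly the
    pointwise weakness noted above."""
    return np.argsort(-clf.predict(X_query))

# Toy usage with random placeholder data:
rng = np.random.default_rng(0)
clf = train_pointwise(rng.normal(size=(100, 5)),
                      rng.choice(list(GRADES), size=100))
print(rank_pointwise(clf, rng.normal(size=(10, 5))))
```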

McRank (NIPS 2007) is a tool that implements the pointwise method.

II. Pairwise Approach

The pairwise method is popular and works very well. Its main idea is to recast ranking as binary classification.

The idea of the pairwise method, and the way its training instances are constructed, is as follows.

For the same query, any two documents with different labels in its set of related documents form one training instance (a pair). For example, if the pair $(d_1^i, d_2^i)$ has labels 5 and 3 respectively, the instance is assigned class $+1$ (since $5 > 3$), and vice versa. In this way we obtain the samples needed to train a binary classifier. At prediction time, classifying all pairs yields a partial order on the document set, which produces the ranking.
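As a minimal sketch (my own notation), the pair construction combined with the feature-difference trick used by Ranking SVM looks like this:

```python
# Pairwise instance construction for ONE query: every pair of documents
# with different labels yields one binary training instance, represented
# here by the feature difference x_i - x_j (the Ranking SVM trick).
import numpy as np
from itertools import combinations

def make_pairs(X, labels):
    """X: (n_docs, n_features); labels: relevance grade per document."""
    P, y = [], []
    for i, j in combinations(range(len(X)), 2):
        if labels[i] == labels[j]:
            continue                       # equal labels yield no pair
        P.append(X[i] - X[j])
        y.append(+1 if labels[i] > labels[j] else -1)
    return np.array(P), np.array(y)

# A linear scorer w trained on these instances ranks documents by w . x,
# because sign(w . (x_i - x_j)) predicts which of the two ranks higher.
```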

Pairwise has many implementations, such as SVMrank (open source), which I used in my experiments, RankNet (C. Burges et al., ICML 2005), FRank (M. Tsai, T. Liu, et al., SIGIR 2007), and RankBoost (Y. Freund et al., JMLR 2003).

Compared with the pointwise method, the pairwise method no longer assumes relevance to be query-independent, because it only generates training samples from documents under the same query. However, the pairwise model also has some disadvantages:

  1. It treats the distinctions between different relevance grades uniformly. In information retrieval, and especially in search engines, people tend to look only at the first few result pages, or even just the first few results, so highly relevant documents should be distinguished with more care.
  2. Model bias from the sizes of the relevant document sets. Suppose query1 has 5 relevant documents while query2 has 1000. The latter produces far more training pairs than the former, so the classifier underweights, or even ignores, the instances generated by queries with small relevant document sets.

Another important factor also affects the ranking performance of the pairwise method. Taking Ranking SVM as an example, its optimization objective is to maximize the margin between positive and negative samples, not the ranking performance itself. Just as BP neural networks that optimize training error are prone to overfitting, this mismatch between the optimization objective and the real goal biases the model. Based on this observation, the listwise method was proposed.

III. Listwise Approach

Compared with pointwise and pairwise, the listwise method no longer formalizes ranking as a classification or regression problem, but directly optimizes over the ranked list of documents. There are currently two main optimization strategies:

  1. Directly optimize a ranking evaluation metric, such as the commonly used MAP or NDCG. This idea is very natural but often hard to implement, because metrics like NDCG are non-smooth (discontinuous), whereas general objective-function optimization methods target continuous functions; see the sketch after this list.
  2. Optimize a surrogate loss function.
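To see why such metrics are awkward to optimize directly, here is a standard NDCG computation (one common convention: gain 2^rel - 1 with a log2 position discount; conventions vary across papers). The value changes only when the induced ordering changes, so it is piecewise constant, hence non-smooth, in the scores:

```python
# DCG/NDCG under one common convention (gain 2^rel - 1, log2 discount).
import numpy as np

def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(rank + 1)
    return np.sum((2.0 ** rels - 1.0) / discounts)

def ndcg(rels_in_predicted_order):
    ideal = dcg(sorted(rels_in_predicted_order, reverse=True))
    return dcg(rels_in_predicted_order) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # relevance grades, in predicted order
```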

There are many ways to construct the loss function. RankCosine (T. Qin, T. Liu, et al., IP&M 2007) uses the cosine similarity (angle) between the score vectors of the correct ranking and the predicted ranking as the loss function. ListNet (Z. Cao, T. Qin, T. Liu, et al., ICML 2007) uses the KL divergence (cross entropy) between the probability distributions of the correct ranking and the predicted ranking as the loss function.

Taking ListNet as an example, for one query the loss is the cross entropy between the distributions induced by the correct and the predicted scores:

$$L(y, z) = -\sum_{j=1}^{n} P_y(j) \log P_z(j)$$

where $y$ and $z$ denote the correct and the predicted scores, respectively. The probability distribution (the top-one probability) is defined by the following formula:

$$P_s(j) = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}$$

where $s_j$ is the score of the j-th feature vector. Of course, this distribution satisfies some natural properties; for example, a document that should be ranked higher receives a higher probability. As the formula shows, ListNet predicts the score by a simple linear weighting of the feature vector, $z_j = w \cdot x_j$, so the task becomes learning the weight vector $w$. This is obviously a standard problem, and gradient descent is the most commonly used method; I will not go into details here.
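A minimal gradient-descent sketch of this setup (top-one probabilities, linear scorer; using the raw labels as the "correct" scores is one simple choice, and is exactly the issue raised below):

```python
# ListNet with top-one probabilities and a linear scorer: a toy sketch.
import numpy as np

def top_one_prob(s):
    e = np.exp(s - s.max())            # shift for numerical stability
    return e / e.sum()

def listnet_step(w, X, y, lr=0.1):
    """One gradient step on L(y, z) = -sum_j P_y(j) log P_z(j)."""
    p_true = top_one_prob(y.astype(float))
    p_pred = top_one_prob(X @ w)       # z_j = w . x_j
    grad = X.T @ (p_pred - p_true)     # softmax cross-entropy gradient
    return w - lr * grad

# Toy usage: one query with 4 documents, 3 features, graded labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
y = np.array([3, 1, 0, 2])
w = np.zeros(3)
for _ in range(200):
    w = listnet_step(w, X, y)
print(np.argsort(-(X @ w)))            # predicted ranking of the documents
```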

I find the listwise method the most elegant, because it focuses directly on its own goal and task; by comparison, pairwise feels a bit roundabout :) Of course, the method is not perfect and still has shortcomings. For example, how should the scores $s_j$ be constructed? Can the labels be used directly? This is in fact a major factor limiting performance. In addition, computing the KL divergence over all permutations has complexity on the order of $O(n \cdot n!)$. Solutions to these problems do exist.

For ListNet, as far as I know, there are two open-source Java implementations. One is MinorThird, a project led by Professor William W. Cohen of CMU together with his students; similar to Weka, it is an open-source toolkit implementing a large number of machine learning and data mining algorithms, and its homepage on SourceForge is here. The other was written recently by Luo Lei, who uses a single-layer neural network model to adjust the weights; it is already open source on Google Code, at the address here. You are welcome to use it and offer comments. I am also planning to implement a C++ version, and may join Luo Lei's project.

[Postscript]

"To get the method, you have to get it. To get the method, you have to get it. To get the method, you have to get it ". I forgot the original source of this sentence, like the Analects of Confucius, Sun Tzu's Art of War, and LV's spring and autumn. There are only three references in this article. It seems that the height of the reference document is not even high, let alone the public. Therefore, please correct me if there are any errors or profound issues :)
