Learning to Rank using Gradient Descent

Chris Burges [email protected]
Tal Shaked* [email protected]
Erin Renshaw [email protected]
Microsoft Research, One Microsoft Way, Redmond, WA 98052-6399

Ari Lazier [email protected]
Matt Deeds [email protected]
Nicole Hamilton [email protected]
Greg Hullender [email protected]
Microsoft, One Microsoft Way, Redmond, WA 98052-6399

Keywords: ranking, gradient descent, neural networks, probabilistic cost functions, internet search

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

Abstract

We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.

1. Introduction

Any system that presents results to a user, ordered by a utility function that the user cares about, is performing a ranking function. A common example is the ranking of search results, for example from the Web or from an intranet; this is the task we will consider in this paper. For this problem, the data consists of a set of queries, and for each query, a set of returned documents. In the training phase, some query/document pairs are labeled for relevance ("excellent match", "good match", etc.). Only those documents returned for a given query are to be ranked against each other. Thus, rather than consisting of a single set of objects to be ranked amongst each other, the data is instead partitioned by query. In this paper we propose a new approach to this problem. Our approach follows (Herbrich et al., 2000) in that we train on pairs of examples to learn a ranking function that maps to the reals (having the model evaluate on pairs would be prohibitively slow for many applications).
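This pair-based setup can be illustrated with a minimal sketch: a scorer is trained on ordered pairs with a logistic pairwise cost of the general kind developed later in the paper, and at test time it scores documents one at a time, so ranking reduces to sorting by score. For brevity the sketch uses a linear scorer rather than a neural net, and all names and the toy data are illustrative, not from the paper.

```python
import numpy as np

def pairwise_logistic_cost_grad(w, x_i, x_j):
    """Cross-entropy cost on one ordered pair (x_i should rank above x_j),
    modeling P(i above j) = sigmoid(f(x_i) - f(x_j)) with a linear scorer
    f(x) = w . x.  Returns (cost, gradient of cost w.r.t. w)."""
    o = w @ x_i - w @ x_j             # score difference o_ij
    p = 1.0 / (1.0 + np.exp(-o))      # modeled probability that i ranks above j
    cost = -np.log(p + 1e-12)         # target probability is 1 for this pair
    grad = -(1.0 - p) * (x_i - x_j)   # d cost / d w
    return cost, grad

# Toy training loop: pairs where the "relevant" document has a larger
# first feature on average; gradient descent on the pairwise cost.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x_i = rng.normal(size=2) + np.array([1.0, 0.0])  # relevant document
    x_j = rng.normal(size=2)                         # irrelevant document
    _, g = pairwise_logistic_cost_grad(w, x_i, x_j)
    w -= 0.1 * g

# At test time the model never sees pairs: it scores each document once
# and the final ranking is obtained by sorting the scores.
```

Note that although training touches pairs, the learned function itself maps a single feature vector to a real score, which is what makes evaluation fast.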
However, (Herbrich et al., 2000) cast the ranking problem as an ordinal regression problem; rank boundaries play a critical role during training, as they do for several other algorithms (Crammer & Singer, 2002; Harrington, 2003). For our application, given that item A appears higher than item B in the output list, the user concludes that the system ranks A higher than, or equal to, B; no mapping to particular rank values, and no rank boundaries, are needed; to cast this as an ordinal regression problem is to solve an unnecessarily hard problem, and our approach avoids this extra step. We also propose a natural probabilistic cost function on pairs of examples. Such an approach is not specific to the underlying learning algorithm; we chose to explore these ideas using neural networks, since they are flexible (e.g. two layer neural nets can approximate any bounded continuous function (Mitchell, 1997)), and since they are often faster in test phase than competing kernel methods (and test speed is critical for this application); however our cost function could equally well be applied to a variety of machine learning algorithms. For the neural net case, we show that backpropagation (LeCun et al., 1998) is easily extended to handle ordered pairs; we call the resulting algorithm, together with the probabilistic cost function we describe below, RankNet. We present results on toy data and on data gathered from a commercial internet search engine. For the latter, the data takes the form of 17,004 queries, and for each query, up to 1000 returned documents, namely the top documents returned by another, simple ranker. Thus each query generates up to 1000 feature vectors.

* Current affiliation: Google, Inc.

Notation: we denote the number of relevance levels (or ranks) by N, the training sample size by m, and the dimension of the data by d.

2. Previous Work

RankProp (Caruana et al., 1996) is also a neural net ranking model.
RankProp alternates between two phases: an MSE regression on the current target values, and an adjustment of the target values themselves to reflect the current ranking given by the net. The end result is a mapping of the data to a large number of targets which reflect the desired ranking, which performs better than just regressing to the original, scaled rank values. RankProp has the advantage that it is trained on individual patterns rather than pairs; however it is not known under what conditions it converges, and it does not give a probabilistic model.

(Herbrich et al., 2000) cast the problem of learning to rank as ordinal regression, that is, learning the mapping of an input vector to a member of an ordered set of numerical ranks. They model ranks as intervals on the real line, and consider loss functions that depend on pairs of examples and their target ranks. The positions of the rank boundaries play a critical role in the final ranking function. (Crammer & Singer, 2002) cast the problem in similar form and propose a ranker based on the perceptron ('PRank'), which maps a feature vector x ∈ R^d to the reals with a learned w ∈ R^d such that the output of the mapping function is just w · x. PRank also learns the values of N increasing thresholds[1] b_r, r = 1, ..., N, and declares the rank of x to be min_r {w · x − b_r < 0}. PRank learns using one example at a time, which is held as an advantage over pair-based methods (e.g. (Freund et al., 2003)), since the latter must learn using O(m^2) pairs rather than m examples. However this is not the case in our application; the number of pairs is much smaller than m^2, since documents are only compared to other documents retrieved for the same query, and since many feature vectors have the same assigned rank. We find that for our task the memory usage is strongly dominated by the feature vectors themselves.
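The PRank prediction rule quoted above, together with a perceptron-style update in the spirit of (Crammer & Singer, 2002), can be sketched as follows. This is an illustrative, slightly simplified sequential variant of the update (the original adjusts w and all violated thresholds simultaneously); the names and toy data are ours, not the papers'.

```python
import numpy as np

def prank_predict(w, b, x):
    """Rank of x under PRank: the smallest r with w . x - b_r < 0.
    b holds the N increasing thresholds; since the last threshold is
    pegged at infinity, some rank is always found."""
    scores = w @ x - b                      # last entry is -inf, always < 0
    return int(np.argmax(scores < 0)) + 1   # index of first violation, 1-based

def prank_update(w, b, x, y):
    """One perceptron step toward the true rank y (sequential variant):
    every threshold that disagrees with y is nudged, and w with it."""
    N = len(b)
    for r in range(N - 1):                  # b[N-1] = +inf is never moved
        y_r = 1 if y > r + 1 else -1        # is the true rank above threshold r?
        if y_r * (w @ x - b[r]) <= 0:       # threshold r disagrees: correct it
            w = w + y_r * x
            b[r] = b[r] - y_r
    return w, b

# Toy example with N = 3 ranks: one update on a rank-3 example.
w = np.zeros(3)
b = np.array([0.0, 1.0, np.inf])
w, b = prank_update(w, b, np.array([1.0, 0.0, 0.0]), y=3)
```

After this single step the model already assigns the example rank 3: both lower thresholds were violated, so w moved toward x and both thresholds moved down.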
Although the linear version is an online algorithm[2], PRank has been compared to batch ranking algorithms, and a quadratic kernel version was found to outperform all such algorithms described in (Herbrich et al., 2000). (Harrington, 2003) has proposed a simple but very effective extension of PRank, which approximates finding the Bayes point by averaging over PRank models. Therefore in this paper we will compare RankNet with PRank, kernel PRank, large margin PRank, and RankProp.

[1] Actually the last threshold is pegged at infinity.
[2] The general kernel version is not, since the support vectors must be saved.

(Dekel et al., 2004) provide a very general framework for ranking using