Learning to Rank using Gradient Descent

Keywords: ranking, gradient descent, neural networks, probabilistic cost functions, internet search

Chris Burges [email protected]
Tal Shaked* [email protected]
Erin Renshaw [email protected]
Microsoft Research, One Microsoft Way, Redmond, WA 98052-6399

Ari Lazier [email protected]
Matt Deeds [email protected]
Nicole Hamilton [email protected]
Greg Hullender [email protected]
Microsoft, One Microsoft Way, Redmond, WA 98052-6399

* Current affiliation: Google, Inc.

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

Abstract

We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.

1. Introduction

Any system that presents results to a user, ordered by a utility function that the user cares about, is performing a ranking function. A common example is the ranking of search results, for example from the Web or from an intranet; this is the task we will consider in this paper. For this problem, the data consists of a set of queries, and for each query, a set of returned documents. In the training phase, some query/document pairs are labeled for relevance ("excellent match", "good match", etc.). Only those documents returned for a given query are to be ranked against each other. Thus, rather than consisting of a single set of objects to be ranked amongst each other, the data is instead partitioned by query. In this paper we propose a new approach to this problem. Our approach follows (Herbrich et al., 2000) in that we train on pairs of examples to learn a ranking function that maps to the reals (having the model evaluate on pairs would be prohibitively slow for many applications). However, (Herbrich et al., 2000) cast the ranking problem as an ordinal regression problem; rank boundaries play a critical role during training, as they do for several other algorithms (Crammer & Singer, 2002; Harrington, 2003). For our application, given that item A appears higher than item B in the output list, the user concludes that the system ranks A higher than, or equal to, B; no mapping to particular rank values, and no rank boundaries, are needed; to cast this as an ordinal regression problem is to solve an unnecessarily hard problem, and our approach avoids this extra step. We also propose a natural probabilistic cost function on pairs of examples. Such an approach is not specific to the underlying learning algorithm; we chose to explore these ideas using neural networks, since they are flexible (e.g. two layer neural nets can approximate any bounded continuous function (Mitchell, 1997)), and since they are often faster in test phase than competing kernel methods (and test speed is critical for this application); however our cost function could equally well be applied to a variety of machine learning algorithms. For the neural net case, we show that backpropagation (LeCun et al., 1998) is easily extended to handle ordered pairs; we call the resulting algorithm, together with the probabilistic cost function we describe below, RankNet.
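The cost function itself is introduced later in the paper, beyond this preview. As a minimal sketch of what a pairwise probabilistic cost and a pair-based gradient step can look like, the code below assumes the formulation commonly associated with RankNet: the score difference of a document pair is passed through a logistic function to model the probability that the first document should be ranked above the second, and the cross entropy against a target probability is minimized. The two-layer network, its sizes, and the learning rate are illustrative choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_hidden = 10, 8                     # input dimension and hidden units (illustrative)
W1 = rng.normal(scale=0.1, size=(n_hidden, d))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=n_hidden)
b2 = 0.0

def forward(x):
    """Two-layer net: return the score f(x) and the hidden activations."""
    h = np.tanh(W1 @ x + b1)
    return w2 @ h + b2, h

def pair_gradients(x_i, x_j, p_target):
    """Gradients of a pairwise cross-entropy cost for one ordered pair.

    o = f(x_i) - f(x_j);  P = 1 / (1 + exp(-o)) models P(i ranked above j);
    C = -p_target*log(P) - (1 - p_target)*log(1 - P), so dC/do = P - p_target,
    and the chain rule pushes the *difference* of the two score gradients
    back through the shared parameters (backprop on ordered pairs).
    """
    s_i, h_i = forward(x_i)
    s_j, h_j = forward(x_j)
    P = 1.0 / (1.0 + np.exp(-(s_i - s_j)))
    dC_do = P - p_target

    def score_grads(x, h):
        dh = w2 * (1.0 - h ** 2)             # backprop through tanh
        return np.outer(dh, x), dh, h, 1.0   # d f / d (W1, b1, w2, b2)

    gW1_i, gb1_i, gw2_i, gb2_i = score_grads(x_i, h_i)
    gW1_j, gb1_j, gw2_j, gb2_j = score_grads(x_j, h_j)
    # b2 cancels in the score difference, so its gradient is exactly zero.
    return (dC_do * (gW1_i - gW1_j), dC_do * (gb1_i - gb1_j),
            dC_do * (gw2_i - gw2_j), dC_do * (gb2_i - gb2_j))

# One gradient-descent step on a single pair where x_i should outrank x_j.
x_i, x_j, lr = rng.normal(size=d), rng.normal(size=d), 0.01
gW1, gb1, gw2, gb2 = pair_gradients(x_i, x_j, p_target=1.0)
W1 -= lr * gW1
b1 -= lr * gb1
w2 -= lr * gw2
b2 -= lr * gb2
```

Because the cost depends on the two documents only through the difference of their scores, each parameter receives the difference of the two backpropagated score gradients; this is the sense in which standard backpropagation extends naturally to ordered pairs.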
We present results on toy data and on data gathered from a commercial internet search engine. For the latter, the data takes the form of 17,004 queries, and for each query, up to 1000 returned documents, namely the top documents returned by another, simple ranker. Thus each query generates up to 1000 feature vectors.

Notation: we denote the number of relevance levels (or ranks) by N, the training sample size by m, and the dimension of the data by d.

2. Previous Work

RankProp (Caruana et al., 1996) is also a neural net ranking model. RankProp alternates between two phases: an MSE regression on the current target values, and an adjustment of the target values themselves to reflect the current ranking given by the net. The end result is a mapping of the data to a large number of targets which reflect the desired ranking, which performs better than just regressing to the original, scaled rank values. RankProp has the advantage that it is trained on individual patterns rather than pairs; however it is not known under what conditions it converges, and it does not give a probabilistic model.

(Herbrich et al., 2000) cast the problem of learning to rank as ordinal regression, that is, learning the mapping of an input vector to a member of an ordered set of numerical ranks. They model ranks as intervals on the real line, and consider loss functions that depend on pairs of examples and their target ranks. The positions of the rank boundaries play a critical role in the final ranking function. (Crammer & Singer, 2002) cast the problem in similar form and propose a ranker based on the perceptron ('PRank'), which maps a feature vector x ∈ R^d to the reals with a learned w ∈ R^d such that the output of the mapping function is just w · x. PRank also learns the values of N increasing thresholds¹ b_r, r = 1, ..., N, and declares the rank of x to be min_r{w · x − b_r < 0} (this rule is sketched in code below). PRank learns using one example at a time, which is held as an advantage over pair-based methods (e.g. (Freund et al., 2003)), since the latter must learn using O(m²) pairs rather than m examples. However this is not the case in our application; the number of pairs is much smaller than m², since documents are only compared to other documents retrieved for the same query, and since many feature vectors have the same assigned rank. We find that for our task the memory usage is strongly dominated by the feature vectors themselves. Although the linear version is an online algorithm², PRank has been compared to batch ranking algorithms, and a quadratic kernel version was found to outperform all such algorithms described in (Herbrich et al., 2000). (Harrington, 2003) has proposed a simple but very effective extension of PRank, which approximates finding the Bayes point by averaging over PRank models. Therefore in this paper we will compare RankNet with PRank, kernel PRank, large margin PRank, and RankProp.

¹ Actually the last threshold is pegged at infinity.
² The general kernel version is not, since the support vectors must be saved.

(Dekel et al., 2004) provide a very general framework for ranking using …
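For concreteness, here is a minimal sketch of the PRank rule described above: a single weight vector scores x as w · x, and N increasing thresholds (the last pegged at infinity, per footnote 1) cut the real line into rank intervals. The online update shown is the perceptron-style rule of Crammer & Singer (2002) as it is commonly stated, included for illustration rather than reproduced from this preview; the toy data at the end is synthetic.

```python
import numpy as np

class PRank:
    """Perceptron-based ordinal ranker: score w·x cut by increasing thresholds."""

    def __init__(self, d, n_ranks):
        self.w = np.zeros(d)
        # Thresholds b_1 ... b_{N-1} are learned; b_N is pegged at +infinity.
        self.b = np.concatenate([np.zeros(n_ranks - 1), [np.inf]])

    def predict(self, x):
        """Rank of x = min_r { r : w·x - b_r < 0 }, with ranks numbered 1..N."""
        score = self.w @ x
        return int(np.argmax(score - self.b < 0)) + 1

    def update(self, x, y):
        """Online update on one labelled example (x, true rank y)."""
        score = self.w @ x
        n = len(self.b)
        # y_r = +1 if the true rank lies above threshold r, else -1.
        y_r = np.where(np.arange(1, n) < y, 1.0, -1.0)
        # tau_r is nonzero only for thresholds that disagree with the score.
        tau = np.where(y_r * (score - self.b[:-1]) <= 0.0, y_r, 0.0)
        self.w += tau.sum() * x
        self.b[:-1] -= tau

# Toy usage: learn to split 1-D points into 3 rank levels.
rng = np.random.default_rng(0)
model = PRank(d=1, n_ranks=3)
for _ in range(200):
    x = rng.uniform(-1.0, 1.0, size=1)
    y = 1 if x[0] < -0.3 else (2 if x[0] < 0.3 else 3)   # synthetic rank labels
    model.update(x, y)
print(model.predict(np.array([-0.8])), model.predict(np.array([0.8])))
```

Note that the rank boundaries b_r are learned explicitly here; this is precisely the ordinal-regression machinery (rank values and rank boundaries) that the introduction argues is unnecessary for the search-ranking task, and that a purely pairwise cost such as RankNet's avoids.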

