CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 21, 11/8/2011

Today
- Finish learning to rank
- Information extraction

Machine Learning for ad hoc IR (Sec. 15.4)
- We've looked at methods for ranking documents in IR using factors like cosine similarity, inverse document frequency, pivoted document length normalization, PageRank, etc.
- We've looked at methods for classifying documents using supervised machine learning classifiers: Naïve Bayes, kNN, SVMs
- Surely we can also use such machine learning to rank the documents displayed in search results?

Why is There a Need for ML?
- Traditional ranking functions in IR used a very small number of features: term frequency, inverse document frequency, document length
- It was easy to tune the weighting coefficients by hand, and people did
- But you saw how "easy" it was on HW1

Why is There a Need for ML? (continued)
- Modern systems, especially on the Web, use a large number of features:
  - Log frequency of query word in anchor text
  - Query term proximity
  - Query word in color on page?
  - # of images on page
  - # of (out) links on page
  - PageRank of page?
  - URL length?
  - URL contains "~"?
  - Page edit recency?
  - Page length?
- The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

Using ML for ad hoc IR (Approach 1)
- Classification seems like a good place to start: take an object and put it in a class, with some confidence
- What do we have to work with in terms of training data? Documents, queries, and relevance judgments

Training Data
[slide figure not preserved]

Using Classification for ad hoc IR (Sec. 15.4.1)
- A linear scoring function on these two features (the cosine score α and the term proximity ω) is then Score(d, q) = Score(α, ω) = aα + bω + c
- The linear classifier: decide relevant if Score(d, q) > θ
- ... just like when we were doing text classification

Using Classification for ad hoc IR (Sec. 15.4.1)
[Figure: relevant (R) and nonrelevant (N) training points plotted by cosine score against term proximity, separated by a linear decision surface]

More Complex Cases
- We can generalize this to classifier functions over more features
- We can use any method we have for learning the linear classifier weights

Problem
- The ranking in this approach is based on the classifier's confidence in its judgment
- It's not clear that confidence should directly determine a ranking between two documents
- That is, it gives a ranking by confidence, not a ranking by relevance; maybe they correlate, maybe not

Learning to Rank
- Maybe classification isn't the right way to think about approaching ad hoc IR via ML
- Background ML:
  - Classification problems: map to a discrete, unordered set of classes
  - Regression problems: map to a real value
  - Ordinal regression problems: map to an ordered set of classes

Learning to Rank (continued)
- Assume documents can be totally ordered by relevance given a query: d1 < d2 < ... < dJ
- This is the ordinal regression setup
- Assume training data is available consisting of document-query pairs represented as feature vectors ψi, with a relevance ranking between them
- Such an ordering can be cast as a set of pairwise judgments, where the input is a pair of results for a single query and the class is the relevance ordering relationship between them

Learning to Rank (continued)
- But assuming a total ordering across all docs is a lot to expect (think of how much training data that implies)
- So instead assume a smaller number of relevance categories C, totally ordered: c1 < c2 < ... < cJ
  - Definitely relevant, relevant, partially relevant, not relevant, really really not relevant, etc.
- We are indifferent to differences within a category
- Assume training data is available consisting of document-query pairs represented as feature vectors ψi, with relevance rankings based on the categories C

The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]
- The aim is to classify instance pairs as correctly ranked or incorrectly ranked
- This turns an ordinal regression problem back into a binary classification problem
- We want a ranking function f such that ci > ck iff f(ψi) > f(ψk)
- Suppose that f is a linear function: f(ψi) = w · ψi

The Ranking SVM (continued)
- Then ci > ck iff w · (ψi − ψk) > 0
- So let's directly create a new instance space from such pairs:
  - Φu = Φ(di, dk, q) = ψi − ψk
  - zu = +1, 0, −1 as ci >, =, < ck
- From the training data S = {Φu}, we train an SVM

Limitations of Machine Learning
- Everything that we have looked at (and most work in this area) produces linear models over features, weighting different base features
- This contrasts with most of the clever ideas of traditional IR, which are nonlinear scalings and combinations of basic measurements: log term frequency, idf, pivoted length normalization
- At present, ML is good at weighting features, but not at coming up with such nonlinear scalings
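The two-feature linear classifier described above (Score(d, q) = aα + bω + c, decide relevant if the score exceeds θ) can be written out in a few lines. This is a minimal sketch: the weights a, b, c, the threshold theta, and the feature values are made-up illustrative numbers, not hand-tuned or learned ones.

```python
# Sketch of the two-feature linear relevance classifier from the slides.
# Features: alpha = cosine score, omega = term-proximity score.
# The weights (a, b, c) and threshold theta are illustrative values only.

def score(alpha, omega, a=2.0, b=0.5, c=-0.1):
    """Linear scoring function: Score(d, q) = a*alpha + b*omega + c."""
    return a * alpha + b * omega + c

def is_relevant(alpha, omega, theta=0.0):
    """Linear classifier: decide relevant iff Score(d, q) > theta."""
    return score(alpha, omega) > theta

# A document with higher cosine similarity and tighter term proximity
# scores above one with low values on both features.
print(score(0.05, 0.4))   # higher-scoring document
print(score(0.01, 0.1))   # lower-scoring document
```

Learning the weights by any linear-classifier method (as the "More Complex Cases" slide notes) just replaces the made-up a, b, c with fitted values; the scoring and thresholding stay the same.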
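The Ranking SVM's pairwise construction can likewise be sketched. The toy feature vectors and relevance categories below are invented for illustration, ties (zu = 0) are simply skipped, and a simple perceptron on the difference vectors stands in for the hinge-loss/QP solver a real Ranking SVM implementation would use.

```python
# Sketch of the Ranking SVM's pairwise construction. For each pair of
# training documents (for the same query) in different relevance
# categories, build a difference vector Phi_u = psi_i - psi_k with label
# z_u = +1 if c_i > c_k, -1 if c_i < c_k; within-category pairs are skipped.
# A real Ranking SVM fits w by minimizing a regularized hinge loss; the
# perceptron below is only a stand-in for that optimizer.

from itertools import combinations

def pairwise_instances(feature_vectors, relevance):
    """Turn (psi_i, c_i) training data into difference-vector instances."""
    instances = []
    for i, k in combinations(range(len(feature_vectors)), 2):
        if relevance[i] == relevance[k]:
            continue  # indifferent to differences within a category
        phi = [a - b for a, b in zip(feature_vectors[i], feature_vectors[k])]
        z = 1 if relevance[i] > relevance[k] else -1
        instances.append((phi, z))
    return instances

def train_w(instances, epochs=50):
    """Perceptron stand-in for the SVM: seek w with z * (w . phi) > 0."""
    w = [0.0] * len(instances[0][0])
    for _ in range(epochs):
        for phi, z in instances:
            if z * sum(wj * pj for wj, pj in zip(w, phi)) <= 0:
                w = [wj + z * pj for wj, pj in zip(w, phi)]
    return w

def rank(feature_vectors, w):
    """Rank documents by the learned scoring function f(psi) = w . psi."""
    scores = [sum(wj * pj for wj, pj in zip(w, psi)) for psi in feature_vectors]
    return sorted(range(len(feature_vectors)), key=lambda i: -scores[i])

# Toy data: two features per document, relevance categories 2 > 1 > 0.
psis = [[0.9, 0.8], [0.6, 0.4], [0.1, 0.2]]
cs = [2, 1, 0]
w = train_w(pairwise_instances(psis, cs))
print(rank(psis, w))  # documents ordered most- to least-relevant
```

Note how the construction turns the ordinal problem into binary classification exactly as the slide says: once w is learned on the difference vectors, ranking any new result set is just sorting by w · ψ.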