4/15/09'1'Test'Collec0ons'and'IR'Experimenta0on'CISC489/689‐010,'Lecture'#11'Wednesday,'March'18th'Ben'CartereIe'Last'Time'• Search'engine'evalua0on,'esp.'system‐based'evalua0on'• Measures:'– Precision'– Recall'– Average'precision'– R‐precision'– DCG'• All'measures'require'relevance(judgments'• Average'over'a'set'of'queries'4/15/09'2'IR'Experimenta0on'• Comparing'different'engines'requires'a'controlled'experimental'seWng'• Many'different'factors'influence'engine'effec0veness:'– Corpus,'queries,'relevance'judgments'– Parsing'decisions,'stop'list,'stemming'algorithm'– Indexing'method,'compression'algorithm,'query'processing'method'– Retrieval'model,'model'features,'model'parameters'• These'should'be'controlled'to'the'greatest'extent'possible'to'answer'the'relevant'ques0ons'Evalua0on'Corpus'• A'test(collec1on(consists'of:'– a'corpus'of'documents'or'other'things'to'search'– a'set'of'queries'with'underlying'informa0on'needs'– relevance'judgments'on'documents'• The'use'of'test'collec0ons'and'system‐based'effec0veness'evalua0on'is'called'the'Cranfield(methodology.'– Named'a\er'Cranfield'Aeronau0cs,'where'it'was'invented'in'the'60s.'4/15/09'3'Construc0ng'a'Test'Collec0on'• Where'do'the'documents'come'from?'– Depends'on'task,'domain,'availability,'…'• Where'do'the'queries'come'from?'– Query'logs,'users'who'can'describe'their'informa0on'need,'…'• What'do'the'queries'and'informa0on'needs'look'like?'– Depends'on'task,'domain,'…'• Where'do'the'relevance'judgments'come'from?'– Assessors'judge'documents'w.r.t.'informa0on'needs'Corpus'Examples'• News'corpora'– AP:''Associated'Press'ar0cles'from'1988‐1992.'– TDT:''News'ar0cles'from'AP,'Wall'Street'Journal,'NY'Times,'LA'Times,'plus'audio'transcripts'• Web'corpora'– GOV2:''Web'pages'downloaded'from'.gov'domain'in'2004'– Wikipedia:''Top'10%'of'Wikipedia'pages'• Genomics'corpora'– Medical'journals,'clinical'reports'4/15/09'4'Relevance'Judgments'• Assessors'look'at'documents'and'decide'whether'they’re'relevant'to'the'informa0on'need'• Recall'recall:'– #'relevant'&'retrieved'/'#'relevant'– Propor0on'of'relevant'documents'that'were'retrieved'• How'do'we'know'the'#'relevant?'• It'is'not'sufficient'to'limit'search'to'documents'that'contain'query'terms'• Assessors'need'to'read'every'document'Relevance'Judgments'• Obtaining'relevance'judgments'is'an'expensive,'0me‐consuming'process'– who'does'it?'– what'are'the'instruc0ons?'– what'is'the'level'of'agreement?'• It’s'not'possible'to'get'a'judgment'on'every'document'– and'therefore'not'possible'to'really'know'any'recall‐based'measure'• How'can'we'best'target'judging'for'efficient'and'accurate'evalua0on?'4/15/09'5'Pooling'• System'pooling'is'o\en'used'to'select'documents'– top'k(results(from'the'rankings'obtained'by'different'search'engines'(or'retrieval'algorithms)'are'merged'into'a'pool'– duplicates'are'removed'– documents'are'presented'in'some'random'order'to'the'relevance'judges'• Ensures'that'judging'is'targeted'to'the'documents'that'are'least'likely'to'be'nonrelevant'• Produces'a'large'number'of'relevance'judgments'for'each'query,'although's0ll'incomplete'Interac0ve'Searching'and'Judging'• An'assessor'uses'a'search'engine'to'find'relevant'informa0on'– Submits'a'query'and'judges'retrieved'documents'• As'they'learn'about'the'topic,'they'reformulate'queries'and'resubmit'• Over'a'series'of'reformula0ons,'the'assessor'builds'a'collec0on'of'relevance'judgments'4/15/09'6'Algorithmic'Approaches'• Move‐to‐Front'Pooling'– Target'judging'to'the'systems'that'are'returning'a'lot'of'relevant'documents'– Goal'is'to'find'as'many'relevant'documents'as'possible'• Minimal'Test'Collec0ons'– Target'judging'to'the'documents'that'are'most'informa0ve'for'comparing'systems'– No'preference'for'relevant'or'non'relevant'documents'as'long'as'the'document'carries'a'lot'of'informa0on'Sta0s0cal'Approaches'• Sta0s0cal'sampling'– Use'a'sampling'strategy'to'form'a'pool'– Op0mal'sampling'ensures'unbiased'es0mates'of'evalua0on'measures'4/15/09'7'Experimenta0on'• Suppose'we'have'a'test'collec0on'– A'corpus,'some'queries,'and'relevance'judgments'• We'have'a'ques0on'we'want'to'answer'– e.g.'which'of'n'engines'is'best'• What'do'we'do'now?'– Index'the'corpus'with'each'engine'– Run'the'queries'on'each'engine'– Evaluate'the'retrieved'documents'using'some'measure'Experimental'Results'Query&Engine&1&Engine&2&Engine&3&Q1&0.192'0.525'0.778'Q2&0.269'0.595'0.874'Q3&0.187'0.663'0.880'Q4&0.243'0.512'0.808'Q5&0.131'0.564'0.799'Q6&0.208'0.511'0.769'Q7&0.094'0.569'0.787'Q8&0.222'0.518'0.853'Q9&0.104'0.537'0.881'Q10&0.177'0.450'0.781'average&0.183&0.544&0.821&4/15/09'8'Experimental'Results'Query&Engine&1&Engine&2&Engine&3&Q1&0.415'0.311'0.350'Q2&0.286'0.392'0.434'Q3&0.326'0.519'0.568'Q4&0.357'0.436'0.245'Q5&0.409'0.470'0.509'Q6&0.407'0.258'0.449'Q7&0.374'0.391'0.292'Q8&0.341'0.429'0.420'Q9&0.327'0.500'0.517'Q10&0.387'0.464'0.505'average&0.363&0.417&0.429&Significance'Tests'• Given'the'results'from'a'number'of'queries,'how'can'we'conclude'that'ranking'algorithm'A'is'beIer'than'algorithm'B?'• A'significance'test'enables'us'to'reject'the'null(hypothesis((no'difference)'in'favor'of'the'alterna1ve(hypothesis((B'is'beIer'than'A)'– the'power'of'a'test'is'the'probability'that'the'test'will'reject'the'null'hypothesis'correctly'– increasing'the'number'of'queries'in'the'experiment'also'increases'power'of'test'4/15/09'9'Significance'Tests'• Calculate'effec0veness'measure'for'each'query'for'each'engine'• Use'those'values'to'compute'a'test(sta1s1c'that'has'a'certain'distribu0on'when'the'null'hypothesis'is'true'(when'there'is'no'difference'in'system'performance)'• Obtain'a'p‐value'from'that'distribu0on'• If'the'p‐value'is'less'than'a'given'cri1cal(value,'conclude'that'the'null'hypothesis'is'false'– i.e.,'there'is'a'difference'between'the'systems'One‐Sided'Test'• Distribu0on'for'the'possible'values'of'a'test'sta0s0c'assuming'the'null'hypothesis'•
View Full Document