Unformatted text preview:

5/23/10'1'Informa/on'Retrieval'CISC437/637,'Lecture'#23'Ben'CartereAe'1'Copyright'©'Ben'CartereAe'Text'Search'• Consider'a'database'consis/ng'of'long'textual'informa/on'fields'– News'ar/cles,'patents,'web'pages,'books,'…'• What'is'the'best'way'to'search'within'this'data?'– grep'for'keywords'in'it? '– Store'it'in'a'DBMS'and'use'SQL'queries'with''''''''''''LIKE'‘%keyword1%’'AND'LIKE'‘%keyword2%’…?'• If'there’s'enough'data,'these'approaches'are'ver y'slow'and'very'inefficient'– Hash'and'B+`tree'indexes'don’t'help'2'Copyright'©'Ben'CartereAe'5/23/10'2'Informa/on'Retrieval'• Informa(on)retr ieval'(IR)'studies'systems'for'indexing'and'querying'large'full`text'corpora'– Google'is'the'most'widely`known'modern'example'• Work'in'IR'and'DB'has'mostly'been'separate'– IR'has'roots'in'library'science'and'informa/on'science'going'back'to'the'1950s,'today'allied'with'AI'– DB'is'more'firmly'rooted'in'algorithms'and'systems'• In'recent'years'they'have'begun'to 'intersect'via'XML,'text'mining,'data'mining'Copyright'©'Ben'CartereAe' 3'IR'systems'vs'DBMS'• Both'involve'queries'that'are'matched'to'records'(possibly'using'an'index)'to'retrieve'results'• Afer'that,'many'differences:'Copyright'©'Ben'CartereAe' 4'IR)DBMS)Relevance'seman/cs'Rela/onal'seman/cs'Keyword'search'Full'SQL'query'language'Unstructured'data'Structured'data'Read`only'(mostly)'Read/write'Rank'best`matching'results'Return'full'result'set'5/23/10'3'Common'Topics,'Different'Focus'• IR'and'DB'have'many'things'in'common,'but'fo cus'di ffers'between'the'two:'Copyright'©'Ben'CartereAe' 5'DBMS)IR)Users'Users'Query'language'Query'language'Query/record'matching'Query/record'matching'Record'ranking'Record'ranking'Building'indexes'Building'indexes'Indexing'strategies'Indexing'strategies'Query'op/miza/on'Query'op/miza/on'Concurrency'control'Concurrency'control'Relevance'and'Ranking'• In'a'DBMS,'records'either'match'a'SQL'query'or'they'don’t—there'is'no'middle'ground'• In'IR'systems,'some'documents'can'be'beAer'matches'than'o thers'– Matching'documents'may'not'be'relevant'– Documents'that'don’t'match'may'be'relevant'• Relevance'describes'the'usefulness'of'a'document'to'a'par/cular'user)Copyright'©'Ben'CartereAe' 6'5/23/10'4'Copyright'©'Ben'CartereAe' 7'query'top`ranked'results'relevant'results'(for'one'user)'relevant'results'(for'a'different'user)'What'Determines'Relevance?'• Many'factors'can'contribute'to'a'document'being'relevant'to'a'user:'– Frequency'of'search'terms'within'the'document'– Frequency'of'search'terms'in'the'corpus'– Proximity'of'search'terms'in'document'– Popularity'of'a'document'among'other'users'– Popularity'of'a'document'among'content'developers'– User'prior'knowledge'– User'task'• The'system'can'only'make'guesses'about'relevance;'it'can’t'read'the'user’s'mind'– Guesses'are'based'on'computa/ons'of'above'factors'Copyright'©'Ben'CartereAe' 8'5/23/10'5'The'“Bag'of'Words”'Model'• “Bag'of'word s”'refers'to'a'simple'representa/on'scheme'for'text:'– Documents'and'queries'are'simply'unordered'sets'of'words'– Syntac/c'informa/on'(grammar)'stripped'out'• Two'details:'– Very'common'words'(“stopwords”)'are'removed'• E.g.'the,'a,'of,'in,'that,'…'– Words'are'converted'to'their'stems'• E.g.'surfs,'surfing,'surfed''surf'Copyright'©'Ben'CartereAe' 9'Text'Indexes'• The'bag'of'words'model'al lows'a'/me`'and'space`efficient'indexing'model:''the'inver ted)index'• Instead'of'storing'a'rela/on'with'documents'as'records'and'terms'as'aAributes,'store'each'term'with'the'list'o f'do cuments'it'appears'in'– An'inverted)list'or'pos(ng)list'• To'answer'a'one`term'quer y,'simply'retrieve'pos/ng'list'for'that'term'Copyright'©'Ben'CartereAe' 10'5/23/10'6'Longer'Queries'• Boolean'queries'use'AND,'OR,'NOT'syntax'– term1'AND'(term2'OR'term3)'• To'answer'an'AND'query:'– Get'pos/ng'lists'for'all'terms'and'take'the'intersec/on'• To'answer'an'OR'query:'– Take'union'of'all'pos/ng'lists'• To'answer'an'AND'NOT'query:'– Set'subtrac/on'• To'answer'an'OR'NOT'query:'– Union'of'“term1”'and'“NOT'term2”—which'will'be'a'very'large'set'– OR'NOT'usually'not'allowed'Copyright'©'Ben'CartereAe' 11'Ranking'• Querying'the'inverted'index'only'provides'the'set'of'documents'that'match'the'query'• The'next'step'is'to'rank'them'in'order'of'likelihood'of'relevance'to'the'user'• Ranking'algorithms'are'the'subject'of'much'study'in'the'field'– Most'are'based'on 'the'probability)ranking)principle,'which'says'that'the'op/mal'ordering'is'in'decreasing'order'of'probability'of'relevance'Copyright'©'Ben'CartereAe' 12'5/23/10'7'Text'Sta/s/cs'and'Ranking'• Another'advantage'of'inverted'files:'– They'can'store'a'lot'of'informa/on'about'terms'within'documents'and'in'the'corpus'• For'each'term,'also'store'the'following:'– Document'frequency'df,)the'total'number'of'documents'the'term'appears'in'– Collec/on'term'frequency'c=,)the'total'number'of'/mes'the'term'appears'in'all'documents'• For'each'term/document'pair,'store:'– Term'frequency'=,'the'number'of'/mes'the'term'appears'in'the'document'• For'each'document,'store:'– Document'length'N,'the'total'number'of'terms'in'the'document'Copyright'©'Ben'CartereAe' 13'Ranking'Func/ons'• Use'v,'N,'df'to'calculate'a'score'for'each'document'– The'score'indicates'the'likelihood'of'relevance'• One'common'approach'treats'a'document'D'as'a'vector'in'V`dimensional'space'– The'magnitude'along'each'dimension'is'set'to'the'weight'of'the'corresponding'term'– Weights'are'func/ons'of'v,'N,'and'df'• The'query'Q'is'represented'as'a'vector'in'the'same'space'• The'score'S(Q,'D)'is'defined'to'be'the'cosine'of'the'angle'between'the'two'vectors'Copyright'©'Ben'CartereAe' 14'5/23/10'8'Evalua/ng'Ranking'Func/ons'• Different'f unc/o ns'p roduce'different'rankings'of'documents'• How'do'we'choose'among'ranking'func/ons?'• Performance)evalua(on'– Judge'the'relevance'of'each'document'in'the'ranking'to'a'user’s'informa/on'need'– Calculate'a'summary'measure'of'performance'over'the'relevance'judgments'• Common'measures:'– Precision,'recall,'average'precision,'DCG'Copyright'©'Ben'CartereAe'


View Full Document

UD CISC 637 - Information Retrieval

Download Information Retrieval
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Information Retrieval and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Information Retrieval 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?