Unformatted text preview:

5/17/11%1%Informa-on%Retrieval%CISC437/637,%Lecture%#23%Ben%Cartere@e%1%Copyright%©%Ben%Cartere@e%A%Database%for%the%WWW%• Imagine%the%web%stored%in%a%rela-onal%DBMS%• Three%challenges:%– How%would%we%search%it?%– How%would%we%know%whether%we%found%what%we%wanted?%– How%would%we%avoid %geSng%bad%results?%Copyright%©%Ben%Cartere@e% 2%5/17/11%2%Text%Search%• MostUaccessed%field%would%be%long%textual%informa-on%– News%ar-cles,%patents,%web%pages,%books,%…%• What%is%the%best%way%to%search%within%this%data?%– grep%for%keywords%in%it?%– Store%it%in%a%DBMS%and%use%SQL%queries%with%%…%LIKE%‘%keyword1%’%AND%LIKE%‘%keyword2%’…?%• If%there’s%enough%data,%these%approaches%are%very%slow%and%very%inefficient%– Hash%and%B+Utree%indexes%don’t%help%3%Copyright%©%Ben%Cartere@e%Relevance%• In%a%DBMS,%records%either%match%a%SQL%query%or%they%don’t—there%is%no%middle%ground%• In%freeUtext%search,%some%documents%can%be%be@er%match es%than%o thers%– Matching%documents%may%not%be%relevant%– Documents%that%don’t%match%may%be%relevant%• Relevance%describes%the%usefulness%of%a%document%to%a%par-cular%user(Copyright%©%Ben%Cartere@e% 4%5/17/11%3%Copyright%©%Ben%Cartere@e% 5%query%topUranked%results%relevant%results%(for%one%user)%relevant%results%(for%a%different%user)%Spam%• It%is%very%easy%for%anyone%to%get%new%records%into%our%WWW%database%– Just%put%a%page%on%the%web%and%get%it%crawled%• Therefore%it’s%very%easy%to%spam%results%– Just%put%the%query%terms%you%want%to%spam%somewhere%in%your%page’s%content%• Over%-me%this%effec-vely%makes%a%DBMS%useless%Copyright%©%Ben%Cartere@e% 6%5/17/11%4%Informa-on%Retrieval%• Informa.on(retrieval%(IR)%studies%systems%for%indexing%and%querying%large%fullUtext%corpora%– Google%is%the%most%widelyUknown%modern%example%• Work%in%IR%and%DB%has%mostly%been%separate%– IR%has%roots%in%library%science%and%informa-on%science%going%back%to%the%1950s,%today%allied%with%AI%– DB%is%more%firmly%rooted%in%algorithms%and%systems%• In%recent%years%they%have%begun%to%intersect%via%XML,%text%mining,%data%mini ng%Copyright%©%Ben%Cartere@e% 7%IR%systems%vs%DBMS%• Both%involve%queries%that%are%matched%to%records%(possibly%using%an%index)%to%retrieve%results%• Aler%that,%many%differences:%Copyright%©%Ben%Cartere@e% 8%IR(DBMS(Relevance%seman-cs%Rela-onal%seman-cs%Keyword%search%Full%SQL%query%language%Unstructured%data%Structured%data%ReadUonly%(mostly)%Read/write%Rank%bestUmatching%results%Return%full%result%set%5/17/11%5%Common%Topics,%Different%Focus%• IR%and%DB%have%many%things%in%common,%but%fo cus%di ffers%between%the%two:%Copyright%©%Ben%Cartere@e% 9%DBMS(IR(Users%Users%Query%language%Query%language%Query/record%matching%Query/record%matching%Record%ranking%Record%ranking%Building%indexes%Building%indexes%Indexing%strategies%Indexing%strategies%Query%op-miza-on%Query%op-miza-on%Concurrency%control%Concurrency%control%What%Determines%Relevance?%• Many%factors%can%contribute%to%a%document%being%relevant%to%a%user:%– Frequency%of%search%terms%within%the%document%– Frequency%of%search%terms%in%the%corpus%– Proximity%of%search%terms%in%document%– Popularity%of%a%document%among%other%users%– Popularity%of%a%document%among%content%developers%– User%prior%knowledge%– User%task%• The%system%can%only%make%guesses%about%relevance;%it%can’t%read%the%user’s%mind%– Guesses%are%based%on%computa-ons%of%above%factors%stored%in%some%kind%of%database%Copyright%©%Ben%Cartere@e% 10%5/17/11%6%Copyright%©%Ben%Cartere@e% 11%Copyright%©%Ben%Cartere@e% 12%5/17/11%7%Copyright%©%Ben%Cartere@e% 13%The%“Bag%of%Words”%Model%• “Bag%of%word s”%refers%to%a%simple%representa-on%scheme%for%text:%– Documents%and%queries%are%simply%unordered%sets%of%words%– Syntac-c%informa-on%(grammar)%stripped%out%• Two%details:%– Very%common%words%(“stopwords”)%are%removed%• E.g.%the,%a,%of,%in,%that,%…%– Words%are%converted%to%their%stems%• E.g.%surfs,%surfing,%surfed%%surf%Copyright%©%Ben%Cartere@e% 14%5/17/11%8%Copyright%©%Ben%Cartere@e% 15%Text%Indexes%• The%bag%of%words%model%allows%a%-meU%and%spaceUefficient%indexing%model:%%the%inverted(index(• Instead%of%storing%a%rela-on%with%documents%as%records%and%terms%as%a@ributes,%store%each%term%with%the%list%of%documents%it%appears%in%– An%inverted(list%or%pos.ng(list(• To%answer%a%oneUterm%query,%simply%retrieve%inverted%list%for%that%term%Copyright%©%Ben%Cartere@e% 16%5/17/11%9%Text%Indexes%• Query%evalua-on:%– Look%up%query%terms%in%vocabulary%structure%• equality%select%– Find%associated%inverted%lists%on%disk%– Process%inverted%lists%to%get%matching%document%IDs%– Look%up%document%informa-on%in%collec-on%structure%• equality%select%• get%URL,%-tle,%snippet,%etc%Copyright%©%Ben%Cartere@e% 17%Boolean%Query%Processing%• Boolean%queries%use%AND,%OR,%NOT%syntax%– term1%AND%(term2%OR%term3)%• To%answer%an%AND%query:%– Get%inverted%lists%for%all%terms%and%take%the%intersec-on%• To%answer%an%OR%query:%– Take%union%of%all%inverted%lists%• To%answer%an%AND%NOT%query:%– Set%subtrac-on%• To%answer%an%OR%NOT%query:%– Union%of%“term1”%and%“NOT%term2”—which%will%be%a%very%large%set%– OR%NOT%usually%not%allowed%Copyright%©%Ben%Cartere@e% 18%5/17/11%10%Ranking%• Querying%the%inverted%index%only%provides%the%set%of%documents%that%match%the%query%%• The%next%step%is%to%rank%them%in%order%of%likelihood%of%relevance%to%the%user%• Ranking%algorithms%are%the%subject%of%much%study%in%the%field%– Most%are%based%on%the%probability(ranking(principle,%which%says%that%the%op-mal%ordering%is%in%decreasing%order%of%probability%of%relevance%Copyright%©%Ben%Cartere@e% 19%Text%Sta-s-cs%and%Ranking%• Another%advantage%of%inverted%files:%– They%can%store%a%lot%of%informa-on%abou t%terms%within%documents%and%in%the%corpus%• For%each%term,%also%store%the%following:%– Document%frequency%df,(the%total%number%of%documents%the%term%appears%in%– Collec-on%term%frequency%c=,(the%total%number%of%-mes%the%term%appears%in%all%documents%• For%each%term/document%p air,%store:%– Ter m%frequen


View Full Document

UD CISC 637 - Information Retrieval

Download Information Retrieval
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Information Retrieval and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Information Retrieval 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?