3/17/09 1 Text%Processing%CISC489/689‐010,%Lecture%#3%Monday,%Feb.%16%Ben%CartereFe%Indexing%• An%index%is%a%list%of%things%(keys)%with%pointers%to%other%things%(items).%– Keywords%%catalog%numbers%(%shelves).%– Concepts%%page%numbers.%– Terms%%documents.%• Need%for%indexes:%%%– Ease%of%use.%– Speed.%– Scalability.%3/17/09 2 Manual%vs.%AutomaVc%Indexing%• Manual:%– An%“expert”%assigns%keys%to%each%item.%– Example:%%card%catalog.%• AutomaVc:%– Keys%automaVcally%idenVfied%and%assigned.%– Example:%%Google.%• AutomaVc%as%good%as%manual%for%most%purposes.%Text%Processing%• First%step%in%automaVc%indexing.%• ConverVng%documents%into%index&terms.&• Terms%are%not%just%words.%– Not%all%words%are%of%equal%value%in%a%search.%– SomeVmes%not%clear%where%words%begin%and%end.%• Especially%when%not%space‐separated,%e.g.%Chinese,%Korean.%– Matching%the%exact%words%typed%by%the%user%doesn’t%work%very%well%in%terms%of%effecVveness.%3/17/09 3 Text%Processing%Steps%• For%each%document:%– Parse%it%to%locate%the%parts%that%are%important.%– Segment%and%tokenize%the%text%in%the%important%parts%to%get%words.%– Remove%stop&words.%– Stem%words%to%common%roots.%• Advanced%processing%may%included%phrases,%enVty%tagging,%link‐graph%features,%and%more.%Parsing%• Some%parts%of%a%document%are%more%important%than%others.%• Document%parser%recognizes%structure%using%markup&such%as%HTML%tags.%– Headers,%anchor%text,%bolded%text%are%likely%to%be%important.%– JavaScript,%style%informaVon,%navigaVon%links%less%likely%to%be%important.%– Metadata%can%also%be%important. %%3/17/09 4 Example%Wikipedia%Page%Wikipedia%Markup%<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} ’’’Tropical fish’’’ include [[fish]] found in [[Tropics|topical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ’’tropical fish’’ to refer only those requiring fresh water, with saltwater tropical fish referred to as ’’[[list of marine aquarium fish species|marine fish]]’’. …3/17/09 5 Wikipedia%HTML%Document%Parsing%• HTML%pages%organize%into%trees.%<HTML>%<HEAD>%<TITLE>% Tropical%fish%<META>%<BODY>%<H1>% Tropical%fish%<P>%<B>% Tropical%fish%<A>% fish%<A>% tropical%include%found%in%environments%around%the%world%Nodes contain blocks of text.3/17/09 6 End%Result%of%Parsing%• Blocks%of%text%from%important%parts%of%page.%– Tropical%fish%include%fish%found%in%tropical%environments%around%the%world,%including%both%freshwater%and%salt%water%species.%%Fishkeepers%oien%use%the%term%“tropical%fish”%to%refer%only%those%requiring%fresh%water,%with%saltwater%tropical%fish%referred%to%as%“marine%fish”.%• Next%step:%%segmenVng%and%tokenizing.%Tokenizing%• Forming%words%from%sequence%of%characters%in%blocks%of%text.%• Surprisingly%complex%in%English,%can%be%harder%in%other%languages.%• Early%IR%systems:%– Any%sequence%of%alphanumeric%characters%of%length%3%or%more.%– Terminated%by%a%space%or%other%special%character.%– Upper‐case%changed%to%lower‐case.%3/17/09 7 Tokenizing%• Example:%– “Bigcorp's%2007%bi‐annual%report%showed%profits%rose%10%.”%becomes%– “bigcorp%2007%annual%report%showed%profits%rose”%• Too%simple%for%search%applicaVons%or%even%large‐scale%experiments%• Why?%Too%much%informaVon%lost%– Small%decisions%in%tokenizing%can%have%major%impact%on%effecVveness%of%some%queries%Tokenizing%Problems%• Small%words%can%be%important%in%some%queries,%usually%in%combinaVons%• %xp,%ma,%pm,%ben%e%king,%el%paso,%master%p,%gm,%j%lo,%world%war%II%• Both%hyphenated%and%non‐hyphenated%forms%of%many%words%are%common%%– SomeVmes%hyphen%is%not%needed%%• e‐bay,%wal‐mart,%acVve‐x,%cd‐rom,%t‐shirts%%– At%other%Vmes,%hyphens%should%be%considered%either%as%part%of%the%word%or%a%word%separator%• winston‐salem,%mazda%rx‐7,%e‐cards,%pre‐diabetes,%t‐mobile,%spanish‐speaking%3/17/09 8 Tokenizing%Problems%• Special%characters%are%an%important%part%of%tags,%URLs,%code%in%documents%• Capitalized%words%can%have%different%meaning%from%lower%case%words%– Bush,%%Apple%• Apostrophes%can%be%a%part%of%a%word,%a%part%of%a%possessive,%or%just%a%mistake%– rosie%o'donnell,%can't,%don't,%80's,%1890's,%men's%straw%hats,%master's%degree,%england's%ten%largest%ciVes,%shriner's%Tokenizing%Problems%• Numbers%can%be%important,%including%decimals%%– nokia%3250,%top%10%courses,%united%93,%quickVme%6.5%pro,%92.3%the%beat,%288358%%• Periods%can%occur%in%numbers,%abbreviaVons,%URLs,%ends%of%sentences,%and%other%situaVons%– I.B.M.,%Ph.D.,%cis.udel.edu%• Note:%tokenizing%steps%for%queries%must%be%idenVcal%to%steps%for%documents%3/17/09 9 Tokenizing%Process%• Assume%we%have%used%the%parser%to%find%blocks%of%important%text.%• A%word%may%be%any%sequence%of%alphanumeric%characters%terminated%by%a%space%or%special%character.%– everything%converted%to%lower%case.%– everything%indexed.%• Defer%complex%decisions%to%other%components%– example:%92.3%→%92%3%but%search%finds%documents%with%92%and%3%adjacent%– incorporate%some%rules%to%reduce%dependence%on%query%transformaVon%components%End%Result%of%TokenizaVon%• List%of%words%in%blocks%of%text.%– tropical%fish%include%fish%found%in%tropical%environments%around%the%world%including%both%freshwater%and%salt%water%species%fishkeepers%oien%use%the%term%tropical%fish%to%refer%only%those%requiring%fresh%water%with%saltwater%tropical%fish%referred%to%as%marine%fish%• Next%step:%%stopping.%• But%first:%%text%staVsVcs.%3/17/09 10 Text%StaVsVcs%• Huge%variety%of%words%used%in%text%but%• Many%staVsVcal%characterisVcs%of%word%occurrences%are%predictable%– e.g.,%distribuVon%of%word%counts%• Retrieval%models%and%ranking%algorithms%depend%heavily%on%staVsVcal%properVes%of%words%– e.g.,%important%words%occur%oien%in%documents%but%are%not%high%frequency%in%collecVon%Zipf’s%Law%• DistribuVon%of%word%frequencies%is%very%skewed&– a%few%words%occur%very%oien,%many%words%hardly%ever%occur%–
View Full Document