DOC PREVIEW
UD CISC 689 - Text 
Processing


This preview shows page 1-2-3-24-25-26 out of 26 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

3/17/09 1 Text%Processing%CISC489/689‐010,%Lecture%#3%Monday,%Feb.%16%Ben%CartereFe%Indexing%• An%index%is%a%list%of%things%(keys)%with%pointers%to%other%things%(items).%– Keywords%%catalog%numbers%(%shelves).%– Concepts%%page%numbers.%– Terms%%documents.%• Need%for%indexes:%%%– Ease%of%use.%– Speed.%– Scalability.%3/17/09 2 Manual%vs.%AutomaVc%Indexing%• Manual:%– An%“expert”%assigns%keys%to%each%item.%– Example:%%card%catalog.%• AutomaVc:%– Keys%automaVcally%idenVfied%and%assigned.%– Example:%%Google.%• AutomaVc%as%good%as%manual%for%most%purposes.%Text%Processing%• First%step%in%automaVc%indexing.%• ConverVng%documents%into%index&terms.&• Terms%are%not%just%words.%– Not%all%words%are%of%equal%value%in%a%search.%– SomeVmes%not%clear%where%words%begin%and%end.%• Especially%when%not%space‐separated,%e.g.%Chinese,%Korean.%– Matching%the%exact%words%typed%by%the%user%doesn’t%work%very%well%in%terms%of%effecVveness.%3/17/09 3 Text%Processing%Steps%• For%each%document:%– Parse%it%to%locate%the%parts%that%are%important.%– Segment%and%tokenize%the%text%in%the%important%parts%to%get%words.%– Remove%stop&words.%– Stem%words%to%common%roots.%• Advanced%processing%may%included%phrases,%enVty%tagging,%link‐graph%features,%and%more.%Parsing%• Some%parts%of%a%document%are%more%important%than%others.%• Document%parser%recognizes%structure%using%markup&such%as%HTML%tags.%– Headers,%anchor%text,%bolded%text%are%likely%to%be%important.%– JavaScript,%style%informaVon,%navigaVon%links%less%likely%to%be%important.%– Metadata%can%also%be%important. %%3/17/09 4 Example%Wikipedia%Page%Wikipedia%Markup%<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} ’’’Tropical fish’’’ include [[fish]] found in [[Tropics|topical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ’’tropical fish’’ to refer only those requiring fresh water, with saltwater tropical fish referred to as ’’[[list of marine aquarium fish species|marine fish]]’’. …3/17/09 5 Wikipedia%HTML%Document%Parsing%• HTML%pages%organize%into%trees.%<HTML>%<HEAD>%<TITLE>% Tropical%fish%<META>%<BODY>%<H1>% Tropical%fish%<P>%<B>% Tropical%fish%<A>% fish%<A>% tropical%include%found%in%environments%around%the%world%Nodes contain blocks of text.3/17/09 6 End%Result%of%Parsing%• Blocks%of%text%from%important%parts%of%page.%– Tropical%fish%include%fish%found%in%tropical%environments%around%the%world,%including%both%freshwater%and%salt%water%species.%%Fishkeepers%oien%use%the%term%“tropical%fish”%to%refer%only%those%requiring%fresh%water,%with%saltwater%tropical%fish%referred%to%as%“marine%fish”.%• Next%step:%%segmenVng%and%tokenizing.%Tokenizing%• Forming%words%from%sequence%of%characters%in%blocks%of%text.%• Surprisingly%complex%in%English,%can%be%harder%in%other%languages.%• Early%IR%systems:%– Any%sequence%of%alphanumeric%characters%of%length%3%or%more.%– Terminated%by%a%space%or%other%special%character.%– Upper‐case%changed%to%lower‐case.%3/17/09 7 Tokenizing%• Example:%– “Bigcorp's%2007%bi‐annual%report%showed%profits%rose%10%.”%becomes%– “bigcorp%2007%annual%report%showed%profits%rose”%• Too%simple%for%search%applicaVons%or%even%large‐scale%experiments%• Why?%Too%much%informaVon%lost%– Small%decisions%in%tokenizing%can%have%major%impact%on%effecVveness%of%some%queries%Tokenizing%Problems%• Small%words%can%be%important%in%some%queries,%usually%in%combinaVons%• %xp,%ma,%pm,%ben%e%king,%el%paso,%master%p,%gm,%j%lo,%world%war%II%• Both%hyphenated%and%non‐hyphenated%forms%of%many%words%are%common%%– SomeVmes%hyphen%is%not%needed%%• e‐bay,%wal‐mart,%acVve‐x,%cd‐rom,%t‐shirts%%– At%other%Vmes,%hyphens%should%be%considered%either%as%part%of%the%word%or%a%word%separator%• winston‐salem,%mazda%rx‐7,%e‐cards,%pre‐diabetes,%t‐mobile,%spanish‐speaking%3/17/09 8 Tokenizing%Problems%• Special%characters%are%an%important%part%of%tags,%URLs,%code%in%documents%• Capitalized%words%can%have%different%meaning%from%lower%case%words%– Bush,%%Apple%• Apostrophes%can%be%a%part%of%a%word,%a%part%of%a%possessive,%or%just%a%mistake%– rosie%o'donnell,%can't,%don't,%80's,%1890's,%men's%straw%hats,%master's%degree,%england's%ten%largest%ciVes,%shriner's%Tokenizing%Problems%• Numbers%can%be%important,%including%decimals%%– nokia%3250,%top%10%courses,%united%93,%quickVme%6.5%pro,%92.3%the%beat,%288358%%• Periods%can%occur%in%numbers,%abbreviaVons,%URLs,%ends%of%sentences,%and%other%situaVons%– I.B.M.,%Ph.D.,%cis.udel.edu%• Note:%tokenizing%steps%for%queries%must%be%idenVcal%to%steps%for%documents%3/17/09 9 Tokenizing%Process%• Assume%we%have%used%the%parser%to%find%blocks%of%important%text.%• A%word%may%be%any%sequence%of%alphanumeric%characters%terminated%by%a%space%or%special%character.%– everything%converted%to%lower%case.%– everything%indexed.%• Defer%complex%decisions%to%other%components%– example:%92.3%→%92%3%but%search%finds%documents%with%92%and%3%adjacent%– incorporate%some%rules%to%reduce%dependence%on%query%transformaVon%components%End%Result%of%TokenizaVon%• List%of%words%in%blocks%of%text.%– tropical%fish%include%fish%found%in%tropical%environments%around%the%world%including%both%freshwater%and%salt%water%species%fishkeepers%oien%use%the%term%tropical%fish%to%refer%only%those%requiring%fresh%water%with%saltwater%tropical%fish%referred%to%as%marine%fish%• Next%step:%%stopping.%• But%first:%%text%staVsVcs.%3/17/09 10 Text%StaVsVcs%• Huge%variety%of%words%used%in%text%but%• Many%staVsVcal%characterisVcs%of%word%occurrences%are%predictable%– e.g.,%distribuVon%of%word%counts%• Retrieval%models%and%ranking%algorithms%depend%heavily%on%staVsVcal%properVes%of%words%– e.g.,%important%words%occur%oien%in%documents%but%are%not%high%frequency%in%collecVon%Zipf’s%Law%• DistribuVon%of%word%frequencies%is%very%skewed&– a%few%words%occur%very%oien,%many%words%hardly%ever%occur%–


View Full Document

UD CISC 689 - Text 
Processing


Download Text 
Processing

Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Text 
Processing
 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Text 
Processing
 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?