Text Retrieval Algorithms

Home> Academic Documents> Text Retrieval Algorithms

DOC PREVIEW

This preview shows page 1-2-3-4-27-28-29-30-55-56-57-58 out of 58 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Text Retrieval AlgorithmsData-Intensive Information Processing Applications ― Session #4Jimmy LinJimmy LinUniversity of MarylandTuesday, February 23, 2010This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for detailsSource: Wikipedia (Japanese rock garden)Today’s Agenda| Introduction to information retrieval| Basics of indexing and retrievalas cs o de g a d et e a| Inverted indexing in MapReduce|Retrieval at scale|Retrieval at scaleFirst, nomenclature…| Information retrieval (IR)z Focus on textual information (= text/document retrieval)z Other possibilities include image, video, music, …| What do we search?z Generically, “collections”z Less-frequently used, “corpora”|What do we find?|What do we find?z Generically, “documents”z Even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.Information Retrieval CycleSourceSelectionResourceQueryQueryFormulationSearchResultsSelectionExaminationDocumentsSystem discoveryVocabulary discoveryConcept discoveryDocument discoveryExaminationDeliveryInformationsource reselectionDocument discoveryThe Central Problem in SearchSearcherAuthorConceptsConceptsConceptsConceptsQuery TermsDocument Terms“tragic love story” “fateful star-crossed romance”Do these represent the same concepts?Abstract IR ArchitectureDocumentsQueryRepresentationFunctionRepresentationFunctionofflineonlineFunctionFunctionQuery Representation Document RepresentationComparisonFunctionIndexHitsHow do we represent text?| Remember: computers don’t “understand” anything!| “Bag of words”ag o o dsz Treat all the words in a document as index termsz Assign a “weight” to each term based on “importance” (or in simplest case presence/absence of word)(or, in simplest case, presence/absence of word)z Disregard order, structure, meaning, etc. of the wordsz Simple, yet effective!| Assumptionsz Term occurrence is independentz Document relevance is independentz “Words” are well-definedWhat’s a word?天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。 ﻒﻴﺠﻳر كرﺎﻣ لﺎﻗو- ﻢﺳﺎﺑ ﻖﻃﺎﻨﻟاﺔﻴﻠﻴﺋاﺮﺳﻹا ﺔﻴﺟرﺎﺨﻟا-ﻞﺒﻗ نورﺎﺷ نإةرﺎﻳﺰﺑ ﻰﻟوﻷا ةﺮﻤﻠﻟ مﻮﻘﻴﺳو ةﻮﻋﺪﻟاﺮﻘﻤﻟا ﺔﻠﻳﻮﻃ ةﺮﺘﻔﻟ ﺖﻧﺎآ ﻲﺘﻟا ،ﺲﻧﻮﺗ مﺎﻋ نﺎﻨﺒﻟ ﻦﻣ ﺎﻬﺟوﺮﺧ ﺪﻌﺑ ﺔﻴﻨﻴﻄﺴﻠﻔﻟا ﺮﻳﺮﺤﺘﻟا ﺔﻤﻈﻨﻤﻟ ﻲﻤﺳﺮﻟا1982 .Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.        2005-06                 日米連合で台頭中国に対処…アーミテージ前副長官提言조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안에대해`군대라도동원해막고싶은심정''이라고말했다는일부언론의에대해군대라도동원해막고싶은심정이라고말했다는일부언론의보도를 부인했다.Sample DocumentMcDonald's slims down spudsFast-food chain to reduce certain types of fat in its french fries with new cooking oil.14MD ld“Bag of Words”NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.But does that mean the popular shoestring fries14 ×McDonalds12 × fat11×friesBut does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.11 ×fries8 × new7×frenchBut others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.Shares of Oak Brook, Ill.-based McDonald's (MCD d $0 54 t $23 22 R h7 french6 × company, said, nutrition5 × food, oil, percent, reduce, (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately ,,p , ,taste, Tuesday…be reached for comment.…Counting Words…DocumentsDocumentsBag of Wordscase folding, tokenization, stopword removal, stemmingsyntax, semantics, word knowledge, etc.InvertedIndexBoolean Retrieval| Users express queries as a Boolean expressionz AND, OR, NOTz Can be arbitrarily nested| Retrieval is based on the notion of setsz Any given query divides the collection into two sets: retrieved, not-retrievedz Pure Boolean systems do not define an ordering of the resultsInverted Index: Boolean Retrievalone fish, two fishDoc 1red fish, blue fishDoc 2cat in the hatDoc 3green eggs and hamDoc 411234blue 2blue1111categgfish341categgfish21111fishgreenham144fishgreenham211hatone31hatone1red2red1red1two2red1twoBoolean Retrieval| To execute a Boolean query:z Build query syntax treeORzFor each clause, look up postings( blue AND fish ) OR hambluefishANDhamORFor each clause, look up postingsbluefish12bluefish 2z Traverse postings and apply Boolean operator| Efficiency analysisyyz Postings traversal is linear (assuming sorted postings)z Start with shortest posting firstStrengths and Weaknesses| Strengthsz Precise, if you know the right strategiesz Precise, if you have an idea of what you’re looking forz Implementations are fast and efficient|Weaknesses|Weaknessesz Users must learn Boolean logicz Boolean logic insufficient to capture the richness of languagez No control over size of result set: either too many hits or nonez When do you stop reading? All documents in the result set are considered “equally good”co s de ed equa y goodz What


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-27-28-29-30-55-56-57-58 out of 58 pages.

Please select your school