
INFM 700: Session 6
Unstructured Information (Part I)

Jimmy Lin
The iSchool
University of Maryland
Monday, March 3, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Outline
- Today’s Topics
- Levels of Structure
- What is search?
- The Information Retrieval Cycle
- The Central Problem in IR
- Architecture of IR Systems
- How do we represent text?
- What’s a word?
- Sample Document
- What’s the point?
- Why does “bag of words” work?
- Boolean Retrieval
- AND/OR/NOT
- Logic Tables
- Representing Documents
- Boolean View of a Collection
- Sample Queries
- Inverted Index
- Proximity Operators
- Why Boolean Retrieval Works
- The Perfect Query Paradox
- Why Boolean Retrieval Fails
- Strengths and Weaknesses
- Ranked Retrieval
- Vector Representation
- Vector Space Model
- Similarity Metric
- Components of Similarity
- Term Weighting
- TF.IDF Term Weighting
- TF.IDF Example
- Document Scoring Algorithm
- Indexing: Performance Analysis
- Vocabulary Size: Heaps’ Law
- Postings Size: Zipf’s Law
- Word Frequency in English
- Does it fit Zipf’s Law?
- Summary thus far…
- Tokenization Problem
- Indexing N-Grams
- Morphological Variation
- Stemming
- Stemmers
- Does Stemming Work?
- Stemming in Other Languages
- Beyond Words…

Today’s Topics
- Introduction to Information Retrieval
- Boolean retrieval
- Ranked retrieval
- Tokenization issues

IR Intro | Boolean | Vector Space | Tokenization

Levels of Structure
- Different types of data
  - Structured data
  - Semi-structured data
  - Unstructured data
- How do you provide access to unstructured data?
  - Manually develop an organization system
  - Provide search capabilities

What is search?
- Search is query-based access
  - How is this different from browsing?
- Things one can search on:
  - Content
  - Metadata
  - Organization systems
  - Labels
  - …

The Information Retrieval Cycle
[Diagram]
- Flow: Source Selection -> Resource -> Query Formulation -> Query -> Search -> Results -> Selection -> Documents -> Examination -> Information -> Delivery
- Feedback loops: source reselection; system discovery; vocabulary discovery; concept discovery; document discovery
- Label on the diagram: Today

The Central Problem in IR
[Diagram]
- Searcher: Concepts -> Query
- Authors: Concepts -> Documents
- Do these represent the same concepts?

Architecture of IR Systems
[Diagram]
- Online: Query -> Representation Function -> Query Representation -> Comparison Function -> Hits
- Offline: Documents -> Representation Function -> Document Representation -> Index -> Comparison Function

How do we represent text?
- Remember: computers don’t “understand” documents or queries
- Simple, yet effective approach: “bag of words”
  - Treat all the words in a document as index terms
  - Assign a “weight” to each term based on “importance”
  - Disregard order, structure, meaning, etc. of the words
- Assumptions
  - Term occurrence is independent
  - Document relevance is independent
  - “Words” are well-defined

What’s a word?
[Sample text in several languages and scripts. Only the Arabic and Russian samples survive the extraction; the remaining samples are not recoverable.]

وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.

Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.

Sample Document

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

…

“Bag of Words”
14 × McDonald’s
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

What’s the point?
- Retrieving relevant information is hard!
  - Evolving, ambiguous user needs, context, etc.
  - Complexities of language
- To operationalize information retrieval, we must vastly simplify the picture
- Bag-of-words approach:
  - Information retrieval is all (and only) about matching words in documents with words in queries
  - Obviously, not true…
  - But it works pretty well!

Why does “bag of words” work?
- Words alone tell us a lot about content
- It is relatively easy to come up with words that describe an information need

Random: beating takes points falling another Dow 355
Alphabetical: 355 another
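
The bag-of-words representation described in the slides can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the function name bag_of_words, the regular-expression tokenizer, and the shortened sample text are all assumptions made for the example, and raw counts stand in for the term "weights".

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercased token in `text` to the number of times it occurs.

    Assumption: a token is a run of ASCII letters, digits, or straight
    apostrophes. Real tokenizers must make many more decisions.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Shortened, hypothetical sample based on the headline of the sample document.
doc = "McDonald's slims down spuds. McDonald's is cutting fat in its french fries."
bag = bag_of_words(doc)

# Terms with their counts, most frequent first.
print(bag.most_common(3))

# Word order is discarded: two texts with the same words yield the same bag.
print(bag_of_words("Dow falling another 355 points") ==
      bag_of_words("355 another Dow falling points"))
```

Because the bag keeps only terms and counts, the random and alphabetical orderings of the same words are indistinguishable to it, which is the simplification the slides describe.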

