INFM 700: Session 6 Unstructured Information (Part I)
  Today’s Topics
  Levels of Structure
  What is search?
  The Information Retrieval Cycle
  The Central Problem in IR
  Architecture of IR Systems
  How do we represent text?
  What’s a word?
  Sample Document
  What’s the point?
  Why does “bag of words” work?
  Boolean Retrieval
  AND/OR/NOT
  Logic Tables
  Representing Documents
  Boolean View of a Collection
  Sample Queries
  Inverted Index
  Slide 20
  Proximity Operators
  Why Boolean Retrieval Works
  The Perfect Query Paradox
  Why Boolean Retrieval Fails
  Strengths and Weaknesses
  Ranked Retrieval
  Vector Representation
  Vector Space Model
  Similarity Metric
  Components of Similarity
  Term Weighting
  TF.IDF Term Weighting
  TF.IDF Example
  Document Scoring Algorithm
  Indexing: Performance Analysis
  Vocabulary Size: Heaps’ Law
  Postings Size: Zipf’s Law
  Word Frequency in English
  Does it fit Zipf’s Law?
  Summary thus far…
  Slide 41
  Tokenization Problem
  Indexing N-Grams
  Morphological Variation
  Stemming
  Stemmers
  Does Stemming Work?
  Stemming in Other Languages
  Beyond Words…
  Slide 50

INFM 700: Session 6
Unstructured Information (Part I)

Jimmy Lin
The iSchool
University of Maryland

Monday, March 3, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today’s Topics
  - Introduction to Information Retrieval
  - Boolean retrieval
  - Ranked retrieval
  - Tokenization issues

  IR Intro | Boolean | Vector Space | Tokenization

Levels of Structure
  - Different types of data
      - Structured data
      - Semi-structured data
      - Unstructured data
  - How do you provide access to unstructured data?
      - Manually develop an organization system
      - Provide search capabilities

What is search?
  - Search is query-based access
      - How is this different from browsing?
  - Things one can search on:
      - Content
      - Metadata
      - Organization systems
      - Labels
      - …

The Information Retrieval Cycle
  [Diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Labels between stages: Resource, Query, Results, Documents, Information. Feedback loops: source reselection; system discovery, vocabulary discovery, concept discovery, document discovery. One region is marked "Today".]

The Central Problem in IR
  [Diagram. Searcher: Concepts, Query. Authors: Concepts, Documents.]
  Do these represent the same concepts?

Architecture of IR Systems
  [Diagram. Query -> Representation Function -> Query Representation. Documents -> Representation Function -> Document Representation -> Index. Query Representation and Index -> Comparison Function -> Hits. Regions labeled "offline" and "online".]

How do we represent text?
  - Remember: computers don’t “understand” documents or queries
  - Simple, yet effective approach: “bag of words”
      - Treat all the words in a document as index terms
      - Assign a “weight” to each term based on “importance”
      - Disregard order, structure, meaning, etc. of the words
  - Assumptions
      - Term occurrence is independent
      - Document relevance is independent
      - “Words” are well-defined

What’s a word?
  [Text sample in a non-Latin script; characters lost in extraction]

  وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.

  Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.

  [Further text samples in non-Latin scripts; characters lost in extraction]

Sample Document
  McDonald's slims down spuds
  Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

  NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

  But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

  But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

  Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

  …

  “Bag of Words”
  14 × McDonald’s
  12 × fat
  11 × fries
   8 × new
   6 × company, french, nutrition
   5 × food, oil, percent, reduce, taste, Tuesday
  …

What’s the point?
  - Retrieving relevant information is hard!
      - Evolving, ambiguous user needs, context, etc.
      - Complexities of language
  - To operationalize information retrieval, we must vastly simplify the picture
  - Bag-of-words approach:
      - Information retrieval is all (and only) about matching words in documents with words in queries
      - Obviously, not true…
      - But it works pretty well!

Why does “bag of words” work?
  - Words alone tell us a lot about content
  - It is relatively easy to come up with words that describe an information need

  Random: beating takes points falling another Dow 355
  Alphabetical: 355 another
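
The "bag of words" idea from the Sample Document slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the counts on the slide: the function names `tokenize` and `bag_of_words` are hypothetical, the tokenizer (lowercase, then runs of letters, digits, and apostrophes) is an assumption, and no stop words are removed. The sketch counts terms in one sentence of the sample article and checks that reordering the words, randomly or alphabetically, leaves the bag unchanged.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes.

    This tokenization rule is an assumption made for illustration only.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Return a term -> count mapping; word order and structure are discarded."""
    return Counter(tokenize(text))


# One sentence from the sample document on the slide.
excerpt = (
    "McDonald's Corp. is cutting the amount of \"bad\" fat in its french fries "
    "nearly in half, the fast-food chain said Tuesday as it moves to make all "
    "its fried menu items healthier."
)

tokens = tokenize(excerpt)
bag = bag_of_words(excerpt)

# Reordering the words does not change the bag: only the counts matter.
shuffled = tokens[:]
random.Random(0).shuffle(shuffled)
same_after_shuffle = Counter(shuffled) == bag
same_after_sort = Counter(sorted(tokens)) == bag

# Print the most frequent terms in the style of the slide.
for term, count in bag.most_common(5):
    print(f"{count} x {term}")
```

Because no stop-word list is applied here, frequent function words such as "the" and "in" rank alongside content words; a real indexing pipeline would typically weight or filter terms before ranking.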