Introduction to Information Retrieval

Outline
- Recall the basic indexing pipeline
- Parsing a document
- Complications: Format/language
- Complications: What is a document?
- Tokenization
- Numbers
- Tokenization: language issues
- Stop words
- Normalization to terms
- Normalization: other languages
- Case folding
- Thesauri and soundex
- Lemmatization
- Stemming
- Porter’s algorithm
- Typical rules in Porter
- Other stemmers
- Language-specificity
- Does stemming help?
- Recall basic merge
- Augment postings with skip pointers (at indexing time)
- Query processing with skip pointers
- Where do we place skips?
- Placing skips

Document ingestion

Recall the basic indexing pipeline
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer → token stream: Friends, Romans, Countrymen
- Linguistic modules → modified tokens: friend, roman, countryman
- Indexer → inverted index:
    - friend → 2, 4
    - roman → 1, 2
    - countryman → 13, 16

Parsing a document
- What format is it in? pdf/word/excel/html?
- What language is it in?
- What character set is in use? (CP1252, UTF-8, …)
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …
Sec. 2.1

Complications: Format/language
- Documents being indexed can include docs from many different languages
    - A single index may contain terms from many languages.
- Sometimes a document or its components can contain multiple languages/formats
    - French email with a German pdf attachment.
    - French email quoting clauses from an English-language contract
- There are commercial and open source libraries that can handle a lot of this stuff
Sec. 2.1
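
A minimal sketch of the heuristic approach to the character-set question, in Python. It assumes only two candidate encodings, UTF-8 and CP1252 (the two named on the slide). The function name `guess_and_decode` is hypothetical, and a production system would more likely use a dedicated detection library or a trained classifier.

```python
def guess_and_decode(raw):
    """Return (text, encoding_name) for a byte string.

    Heuristic (an assumption for illustration): bytes that decode as
    strict UTF-8 are taken to be UTF-8; anything else is treated as CP1252.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # CP1252 leaves a few byte values undefined, so substitute a
        # replacement character for those instead of failing.
        return raw.decode("cp1252", errors="replace"), "cp1252"


if __name__ == "__main__":
    print(guess_and_decode("café".encode("utf-8")))
    print(guess_and_decode("café".encode("cp1252")))
```

The rule is deliberately crude: pure ASCII input is valid in both encodings and is reported as UTF-8, and CP1252 bytes that happen to form valid UTF-8 would be misread. That is the sense in which such tasks are done heuristically rather than solved exactly.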

Complications: What is a document?
We return from our query “documents” but there are often interesting questions of grain size:
What is a unit document?
- A file?
- An email? (Perhaps one of many in a single mbox file)
    - What about an email with 5 attachments?
- A group of files (e.g., PPT or LaTeX split over HTML pages)
Sec. 2.1

Tokens

Tokenization
- Input: “Friends, Romans and Countrymen”
- Output: Tokens
    - Friends
    - Romans
    - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further processing
    - Described below
- But what are valid tokens to emit?
Sec. 2.2.1

Tokenization
Issues in tokenization:
- Finland’s capital → Finland AND s? Finlands? Finland’s?
- Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case ?
    - It can be effective to get the user to put in possible hyphens
- San Francisco: one token or two?
    - How do you decide it is one token?
Sec. 2.2.1

Numbers
- 3/20/91, Mar. 12, 1991, 20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
    - Often have embedded spaces
    - Older IR systems may not index numbers
        - But often very useful: think about things like looking up error codes/stacktraces on the web
        - (One answer is using n-grams: IIR ch. 3)
    - Will often index “meta-data” separately
        - Creation date, format, etc.
Sec. 2.2.1

Tokenization: language issues
- French
    - L'ensemble: one token or two?
        - L ? L’ ? Le ?
        - Want l’ensemble to match with un ensemble
            - Until at least 2003, it didn’t on Google
                - Internationalization!
- German noun compounds are not segmented
    - Lebensversicherungsgesellschaftsangestellter
    - ‘life insurance company employee’
    - German retrieval systems benefit greatly from a compound splitter module
        - Can give a 15% performance boost for German
Sec. 2.2.1

Tokenization: language issues
- Chinese and Japanese have no spaces between words:
    - 莎拉波娃现在居住在美国东南部的佛罗里达。
    - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
    - Dates/amounts in multiple formats
    - フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    - Scripts in the example: Katakana, Hiragana, Kanji, Romaji
- End-user can express query entirely in hiragana!
Sec. 2.2.1

Tokenization: language issues
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
- [Arabic example sentence not preserved in extraction; reading direction alternates: ← → ← → ← start]
- ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
- With Unicode, the surface presentation is complex, but the stored form is straightforward
Sec. 2.2.1

Terms
The things indexed in an IR system

Stop words
With a stop list, you exclude from the dictionary entirely the commonest words.
Intuition:
- They have little semantic content: the, a, and, to, be
- There are a lot of them: ~30% of postings for top 30 words
But the trend is away from doing this:
- Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
- Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words.
- You need them for:
    - Phrase queries: “King of Denmark”
    - Various song titles, etc.: “Let it be”, “To be or not to be”
    - “Relational” queries: “flights to London”
Sec. 2.2.2
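
The cost of a stop list for the query types above can be seen in a small Python sketch. The stop list `STOP_WORDS`, the regex tokenizer, and the function name `index_terms` are all hypothetical choices for illustration, not taken from IIR or any particular system.

```python
import re

# Hypothetical stop list; real lists are longer and language-specific.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it"}


def index_terms(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    for query in ["King of Denmark", "Let it be",
                  "To be or not to be", "flights to London"]:
        print(query, "->", index_terms(query))
```

With this stop list, “To be or not to be” is reduced to no terms at all, “Let it be” keeps only `let`, and “flights to London” loses the word that states the direction of travel. This is why phrase, title, and relational queries need the stop words kept in the index.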