CS276 Information Retrieval
Lecture 2

Recap of the previous lecture
  Basic inverted indexes:
    Structure: Dictionary and Postings
    Key step in construction: Sorting
  Boolean query processing
    Simple optimization
    Linear time merging
  Overview of course topics

Plan for this lecture
  Finish basic indexing
    Tokenization
    What terms do we put in the index?
  Query processing: speedups
  Proximity/phrase queries

Recall basic indexing pipeline
  Documents to be indexed: Friends, Romans, countrymen.
  Tokenizer → Token stream: Friends Romans Countrymen
  Linguistic modules → Modified tokens: friend roman countryman
  Indexer → Inverted index:
    friend → 2, 4
    roman → 1, 2
    countryman → 13, 16

Parsing a document
  What format is it in?
    pdf/word/excel/html?
  What language is it in?
  What character set is in use?
  Each of these is a classification problem, which we will study later in the course.
  But there are complications …

Format/language stripping
  Documents being indexed can include docs from many different languages
    A single index may have to contain terms of several languages.
  Sometimes a document or its
  components can contain multiple languages/formats
    French email with a Portuguese pdf attachment.
  What is a unit document?
    An email?
    With attachments?
    An email with a zip containing documents?

Tokenization
  Input: “Friends, Romans and Countrymen”
  Output: Tokens
    Friends
    Romans
    Countrymen
  Each such token is now a candidate for an index entry, after further processing
    Described below
  But what are valid tokens to emit?

Tokenization
  Issues in tokenization:
    Finland’s capital → Finland? Finlands? Finland’s?
    Hewlett-Packard → Hewlett and Packard as two tokens?
      State-of-the-art: break up hyphenated sequence.
      co-education ?
      the hold-him-back-and-drag-him-away-maneuver ?
    San Francisco: one token or two? How do you decide it is one token?

Numbers
  3/12/91
  Mar. 12, 1991
  55 B.C.
  B-52
  My PGP key is 324a3df234cb23e
  100.2.86.144
  Generally, don’t index as text.
    Will often index “meta-data” separately
      Creation date, format, etc.

Tokenization: Language issues
  L'ensemble → one token or two?
    L ? L’ ? Le ?
    Want ensemble to match with un ensemble
  German noun compounds are not segmented
    Lebensversicherungsgesellschaftsangestellter
    ‘life insurance company employee’

Tokenization: language issues
  Chinese and Japanese have no spaces between words:
    Not always guaranteed a unique tokenization
  Further complicated in Japanese, with multiple alphabets intermingled
    Dates/amounts in multiple formats
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    Katakana Hiragana Kanji “Romaji”
  End-user can express query entirely in hiragana!

Tokenization: language issues
  Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
  Words are separated, but letter forms within a word form complex ligatures
    استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
    ← → ← → ← start
    ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
  With Unicode, the surface presentation is complex, but the stored form is straightforward

Normalization
  Need to “normalize” terms in indexed text as well as query terms into the same form
    We want to match U.S.A. and USA
  We most commonly implicitly define equivalence classes of terms
    e.g., by deleting periods in a term
  Alternative is to do limited expansion:
    Enter: window     Search: window, windows
    Enter: windows    Search: Windows, windows
    Enter: Windows    Search: Windows
  Potentially more powerful, but less efficient

Case folding
  Reduce all letters to lower case
    exception: upper case (in mid-sentence?)
      e.g., General Motors
      Fed vs. fed
      SAIL vs. sail
  Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization

Normalizing Punctuation
  Ne’er vs. never: use language-specific, handcrafted “locale” to normalize.
    Which language?
    Most common: detect/apply language at a pre-determined granularity: doc/paragraph.
  U.S.A. vs.
  USA: remove all periods or use locale.
    a.out

Thesauri and soundex
  Handle synonyms and homonyms
    Hand-constructed equivalence classes
      e.g., car = automobile
      color = colour
  Rewrite to form equivalence classes
  Index such equivalences
    When the document contains automobile, index it under car as well (usually, also vice-versa)
  Or expand query?
    When the query contains automobile, look under car as well

Soundex
  Traditional class of heuristics to expand a query into phonetic equivalents
    Language specific, mainly for names
    E.g., chebyshev → tchebychef
  More on this later ...

Lemmatization
  Reduce inflectional/variant forms to base form
  E.g.,
    am, are, is → be
    car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color
  Lemmatization implies doing “proper” reduction to dictionary headword form

Stemming
  Reduce terms to their “roots” before indexing
  “Stemming” suggests crude affix chopping
    language dependent
    e.g., automate(s), automatic, automation all reduced to automat.
  Before stemming:
    for example compressed and compression are both accepted as equivalent to compress.
  After stemming:
    for exampl compress and compress ar both accept as equival to compress

Porter’s algorithm
  Commonest algorithm for
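
The normalization steps from the slides above (deleting periods, case folding, hand-constructed equivalence classes) can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the punctuation set, and the two-entry equivalence table are assumptions made for the example, not part of the lecture, and the tokenizer is deliberately naive.

```python
# Hypothetical hand-built equivalence classes, using the slide's examples.
EQUIVALENCE_CLASSES = {"automobile": "car", "colour": "color"}

# Assumed set of punctuation to strip from token edges (not from the lecture).
STRIP_CHARS = ".,;:!?\"'()"


def tokenize(text):
    """Naive tokenizer: split on whitespace, strip punctuation at token edges."""
    tokens = (raw.strip(STRIP_CHARS) for raw in text.split())
    return [t for t in tokens if t]


def normalize(token):
    """Map a token to its equivalence-class representative."""
    # Delete periods (U.S.A. -> USA), then fold case (USA -> usa).
    term = token.replace(".", "").lower()
    # Apply the hand-constructed classes (automobile -> car).
    return EQUIVALENCE_CLASSES.get(term, term)


def index_terms(text):
    """Tokenize and normalize; the same function must be applied to queries."""
    return [normalize(t) for t in tokenize(text)]


print(index_terms("Friends, Romans, countrymen."))
print(normalize("U.S.A.") == normalize("USA"))
```

Deleting every period also turns a.out into aout, which is the kind of side effect the Normalizing Punctuation slide warns about. No stemming or lemmatization is applied here, so Friends becomes friends rather than friend.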
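
The limited-expansion alternative from the Normalization slide (window, windows, Windows) can be sketched as a lookup table consulted at query time. The expansion table is copied from the slide; the toy index, its document IDs, and the function names are hypothetical and exist only to make the example runnable.

```python
# Asymmetric expansion table, taken from the slide's Enter/Search example.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],
}

# Hypothetical toy index: un-normalized term -> sorted list of document IDs.
TOY_INDEX = {"window": [1], "windows": [2], "Windows": [3]}


def expand_query_term(term):
    """Return the list of index terms to search for a query term."""
    return EXPANSIONS.get(term, [term])


def search(term, index):
    """Union the postings of every expansion of the query term."""
    docs = set()
    for variant in expand_query_term(term):
        docs.update(index.get(variant, ()))
    return sorted(docs)


for query in ("window", "windows", "Windows"):
    print(query, search(query, TOY_INDEX))
```

Unlike equivalence classing, the mapping is not symmetric: a query for Windows searches only Windows, while a query for windows also searches Windows. Each query may touch several postings lists, which is one way to read the slide's remark that this approach is potentially more powerful but less efficient.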