Dictionary and Postings; Query ProcessingThis lecture agendaRecall basic indexing pipelineParsing a documentComplications: Format/languageTokenizationSlide 7Slide 8NumbersTokenization: Language issuesSlide 11Slide 12NormalizationNormalization: Other languagesSlide 15Case foldingStop wordsThesauri and soundexSoundexLemmatizationStemmingPorter’s algorithmTypical rules in PorterOther stemmersLanguage-specificityDictionary entries – first cutFaster postings merges: Skip pointersRecall basic mergeAugment postings with skip pointers (at indexing time)Query processing with skip pointersSection 2.3 : Page 35(print)/37 (online)Where do we place skips?Placing skipsPhrase queriesSlide 35Solution 1: Biword indexesLonger phrase queriesExtended biwordsIssues for biword indexesSolution 2: Positional indexesPositional index exampleProcessing a phrase queryProximity queriesPositional index sizeSlide 45Rules of thumbPrasad L4DictonaryAndQP 1Dictionary and Postings;Query ProcessingAdapted from Lectures byPrabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)This lecture agendaDifficulties with tokenizationImproving efficiency using enhanced data structures and algorithms : Skip pointersPhrasal queries Generalizing indexing structures2Recall basic indexing pipelineTokenizerToken stream.FriendsRomans CountrymenLinguistic modulesModified tokens.friendroman countrymanIndexerInverted index.friendromancountryman2 4213161Documents tobe indexed.Friends, Romans, countrymen.4Parsing a documentWhat format is it in?pdf/word/excel/html?What language is it in?What character set is in use?Each of these is a classification problem.But these tasks are often done heuristically …5Complications: Format/languageDocuments being indexed can include docs from many different languagesA single index may have to contain terms of several languages.Sometimes a document or its components can contain multiple languages/formatsFrench email with a German pdf attachment.What is a unit document?A file?An email? (Perhaps one of many in an mbox.)An email with 5 attachments?A group of files (PPT or LaTeX in HTML)6Tokenization7TokenizationInput: “Friends, Romans and Countrymen”Output: TokensFriendsRomansCountrymenEach such token is now a candidate for an index entry, after further processingBut what are valid tokens to emit?8TokenizationIssues in tokenization:Finland’s capital Finland? Finlands? Finland’s?Hewlett-Packard Hewlett and Packard as two tokens?State-of-the-art: break up hyphenated sequence. co-education ?It’s effective to get the user to put in possible hyphensSan Francisco: one token or two? How do you decide it is one token?9Numbers3/12/91 Mar. 12, 199155 B.C.B-52My PGP key is 324a3df234cb23e100.2.86.144Often, don’t index as text.But often very useful: think about things like looking up error codes/stacktraces on the web(One answer is using n-grams.)Will often index “meta-data” separatelyCreation date, format, etc.10Tokenization: Language issuesL'ensemble one token or two?L ? L’ ? Le ?Want l’ensemble to match with un ensembleGerman noun compounds are not segmentedLebensversicherungsgesellschaftsangestellter‘life insurance company employee’11Tokenization: Language issuesChinese and Japanese have no spaces between words:Cannot always guarantee a unique tokenization Further complicated in Japanese, with multiple alphabets intermingledDates/amounts in multiple formatsフフフフフフ 500 フフフフフフフフフフフフフ $500K( フ 6,000 フフ )Katakana Hiragana Kanji RomajiEnd-user can express query entirely in hiragana!12Tokenization: Language issuesArabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to rightWords are separated, but letter forms within a word form complex ligatures ةنس يف رئازجلا تلقتسا1962 دعب132 للتحلا نم اماعيسنرفلا. ← → ← → ← start‘Algeria achieved its independence in 1962 after 132 years of French occupation.’With Unicode, the surface presentation is complex, but the stored form is straightforward13NormalizationNeed to “normalize” terms in indexed text as well as query terms into the same formWe want to match U.S.A. and USAWe most commonly implicitly define equivalence classes of termse.g., by deleting periods in a termAlternative is to do asymmetric expansion:Enter: window Search: window, windowsEnter: windows Search: Windows, windowsEnter: Windows Search: WindowsPotentially more powerful, but less efficient14Normalization: Other languagesAccents: résumé vs. resume.Most important criterion:How are your users likely to write their queries for these words?Even in languages that have accents, users often may not type themGerman: Tuebingen vs. TübingenShould be equivalent15Normalization: Other languagesNeed to “normalize” indexed text as well as query terms into the same formCharacter-level alphabet detection and conversionTokenization not separable from this.Sometimes ambiguous:7 フ 30 フ vs. 7/30Morgen will ich in MIT … Is thisGerman “mit”?16Case foldingReduce all letters to lower caseexception: upper case (in mid-sentence?)e.g., General MotorsFed vs. fedSAIL vs. sailOften best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…17Stop wordsWith a stop list, you exclude from dictionary entirely, the commonest words. Intuition:They have little semantic content: the, a, and, to, beThey take a lot of space: ~30% of postings for top 30But the trend is away from doing this indiscriminately:Good compression techniques mean the space for including stopwords in a system is very smallGood query optimization techniques mean you pay little at query time for including stop words.You need them for:Phrase queries: “King of Denmark”Various song titles, etc.: “Let it be”, “To be or not to be”“Relational” queries: “flights to London”18Thesauri and soundexHandle synonyms and homonymsHand-constructed equivalence classese.g., car = automobilecolor = colourRewrite to form equivalence classesIndex such equivalencesWhen the
View Full Document