Wright CS 707 - Dictionary and Postings- Query Processing

Unformatted text preview:

Dictionary and Postings; Query ProcessingThis lecture agendaRecall basic indexing pipelineParsing a documentComplications: Format/languageTokenizationSlide 7Slide 8NumbersTokenization: Language issuesSlide 11Slide 12NormalizationNormalization: Other languagesSlide 15Case foldingStop wordsThesauri and soundexSoundexLemmatizationStemmingPorter’s algorithmTypical rules in PorterOther stemmersLanguage-specificityDictionary entries – first cutFaster postings merges: Skip pointersRecall basic mergeAugment postings with skip pointers (at indexing time)Query processing with skip pointersSection 2.3 : Page 35(print)/37 (online)Where do we place skips?Placing skipsPhrase queriesSlide 35Solution 1: Biword indexesLonger phrase queriesExtended biwordsIssues for biword indexesSolution 2: Positional indexesPositional index exampleProcessing a phrase queryProximity queriesPositional index sizeSlide 45Rules of thumbPrasad L4DictonaryAndQP 1Dictionary and Postings;Query ProcessingAdapted from Lectures byPrabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)This lecture agendaDifficulties with tokenizationImproving efficiency using enhanced data structures and algorithms : Skip pointersPhrasal queries Generalizing indexing structures2Recall basic indexing pipelineTokenizerToken stream.FriendsRomans CountrymenLinguistic modulesModified tokens.friendroman countrymanIndexerInverted index.friendromancountryman2 4213161Documents tobe indexed.Friends, Romans, countrymen.4Parsing a documentWhat format is it in?pdf/word/excel/html?What language is it in?What character set is in use?Each of these is a classification problem.But these tasks are often done heuristically …5Complications: Format/languageDocuments being indexed can include docs from many different languagesA single index may have to contain terms of several languages.Sometimes a document or its components can contain multiple languages/formatsFrench email with a German pdf attachment.What is a unit document?A file?An email? (Perhaps one of many in an mbox.)An email with 5 attachments?A group of files (PPT or LaTeX in HTML)6Tokenization7TokenizationInput: “Friends, Romans and Countrymen”Output: TokensFriendsRomansCountrymenEach such token is now a candidate for an index entry, after further processingBut what are valid tokens to emit?8TokenizationIssues in tokenization:Finland’s capital  Finland? Finlands? Finland’s?Hewlett-Packard  Hewlett and Packard as two tokens?State-of-the-art: break up hyphenated sequence. co-education ?It’s effective to get the user to put in possible hyphensSan Francisco: one token or two? How do you decide it is one token?9Numbers3/12/91 Mar. 12, 199155 B.C.B-52My PGP key is 324a3df234cb23e100.2.86.144Often, don’t index as text.But often very useful: think about things like looking up error codes/stacktraces on the web(One answer is using n-grams.)Will often index “meta-data” separatelyCreation date, format, etc.10Tokenization: Language issuesL'ensemble  one token or two?L ? L’ ? Le ?Want l’ensemble to match with un ensembleGerman noun compounds are not segmentedLebensversicherungsgesellschaftsangestellter‘life insurance company employee’11Tokenization: Language issuesChinese and Japanese have no spaces between words:Cannot always guarantee a unique tokenization Further complicated in Japanese, with multiple alphabets intermingledDates/amounts in multiple formatsフフフフフフ 500 フフフフフフフフフフフフフ $500K( フ 6,000 フフ )Katakana Hiragana Kanji RomajiEnd-user can express query entirely in hiragana!12Tokenization: Language issuesArabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to rightWords are separated, but letter forms within a word form complex ligatures ةنس يف رئازجلا تلقتسا1962 دعب132 للتحلا نم اماعيسنرفلا.  ← → ← → ← start‘Algeria achieved its independence in 1962 after 132 years of French occupation.’With Unicode, the surface presentation is complex, but the stored form is straightforward13NormalizationNeed to “normalize” terms in indexed text as well as query terms into the same formWe want to match U.S.A. and USAWe most commonly implicitly define equivalence classes of termse.g., by deleting periods in a termAlternative is to do asymmetric expansion:Enter: window Search: window, windowsEnter: windows Search: Windows, windowsEnter: Windows Search: WindowsPotentially more powerful, but less efficient14Normalization: Other languagesAccents: résumé vs. resume.Most important criterion:How are your users likely to write their queries for these words?Even in languages that have accents, users often may not type themGerman: Tuebingen vs. TübingenShould be equivalent15Normalization: Other languagesNeed to “normalize” indexed text as well as query terms into the same formCharacter-level alphabet detection and conversionTokenization not separable from this.Sometimes ambiguous:7 フ 30 フ vs. 7/30Morgen will ich in MIT … Is thisGerman “mit”?16Case foldingReduce all letters to lower caseexception: upper case (in mid-sentence?)e.g., General MotorsFed vs. fedSAIL vs. sailOften best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…17Stop wordsWith a stop list, you exclude from dictionary entirely, the commonest words. Intuition:They have little semantic content: the, a, and, to, beThey take a lot of space: ~30% of postings for top 30But the trend is away from doing this indiscriminately:Good compression techniques mean the space for including stopwords in a system is very smallGood query optimization techniques mean you pay little at query time for including stop words.You need them for:Phrase queries: “King of Denmark”Various song titles, etc.: “Let it be”, “To be or not to be”“Relational” queries: “flights to London”18Thesauri and soundexHandle synonyms and homonymsHand-constructed equivalence classese.g., car = automobilecolor = colourRewrite to form equivalence classesIndex such equivalencesWhen the


View Full Document

Wright CS 707 - Dictionary and Postings- Query Processing

Download Dictionary and Postings- Query Processing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Dictionary and Postings- Query Processing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Dictionary and Postings- Query Processing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?