Villanova CSC 9010 - Lecture 2


CS276 Information Retrieval
Lecture 2

Outline
- Recap of the previous lecture
- Plan for this lecture
- Recall basic indexing pipeline
- Parsing a document
- Format/language stripping
- Tokenization
- Numbers
- Tokenization: language issues
- Normalization
- Case folding
- Normalizing punctuation
- Thesauri and soundex
- Soundex
- Lemmatization
- Stemming
- Porter’s algorithm
- Typical rules in Porter
- Other stemmers
- Language-specificity
- Normalization: other languages
- Dictionary entries – first cut
- Faster postings merges: Skip pointers
- Recall basic merge
- Augment postings with skip pointers (at indexing time)
- Query processing with skip pointers
- Where do we place skips?
- Placing skips
- Phrase queries
- A first attempt: Biword indexes
- Longer phrase queries
- Extended biwords
- Issues for biword indexes
- Solution 2: Positional indexes
- Positional index example
- Processing a phrase query
- Proximity queries
- Positional index size
- Rules of thumb
- Combination schemes
- Resources for today’s lecture

Recap of the previous lecture
- Basic inverted indexes:
  - Structure: Dictionary and Postings
  - Key step in construction: Sorting
- Boolean query processing
  - Simple optimization
  - Linear time merging
- Overview of course topics

Plan for this lecture
- Finish basic indexing
  - Tokenization
  - What terms do we put in the index?
- Query processing – speedups
- Proximity/phrase queries

Recall basic indexing pipeline
- Documents to be indexed: Friends, Romans, countrymen.
- Tokenizer → Token stream: Friends  Romans  Countrymen
- Linguistic modules → Modified tokens: friend  roman  countryman
- Indexer → Inverted index:
  - friend → 2, 4
  - roman → 1, 2
  - countryman → 13, 16

Parsing a document
- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?
- Each of these is a classification problem, which we will study later in the course.
- But there are complications …

Format/language stripping
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a Portuguese pdf attachment.
- What is a unit document?
  - An email?
  - With attachments?
  - An email with a zip containing documents?

Tokenization
- Input: “Friends, Romans and Countrymen”
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?

Tokenization
- Issues in tokenization:
  - Finland’s capital → Finland? Finlands? Finland’s?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - State-of-the-art: break up hyphenated sequence.
    - co-education ?
    - the hold-him-back-and-drag-him-away-maneuver ?
  - San Francisco: one token or two? How do you decide it is one token?

Numbers
- 3/12/91
- Mar. 12, 1991
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- 100.2.86.144
- Generally, don’t index as text.
- Will often index “meta-data” separately
  - Creation date, format, etc.

Tokenization: language issues
- L'ensemble → one token or two?
  - L ? L’ ? Le ?
  - Want ensemble to match with un ensemble
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - ‘life insurance company employee’

Tokenization: language issues
- Chinese and Japanese have no spaces between words:
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats
  - [Example sentence mixing Katakana, Hiragana, Kanji and “Romaji”, with amounts written as 500, $500K and 6,000; the Japanese characters were lost in extraction]
- End-user can express query entirely in hiragana!

Tokenization: language issues
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
  - استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
  - Reading direction: ← → ← → ← start
  - ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
- With Unicode, the surface presentation is complex, but the stored form is straightforward

Normalization
- Need to “normalize” terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
  - e.g., by deleting periods in a term
- Alternative is to do limited expansion:
  - Enter: window     Search: window, windows
  - Enter: windows    Search: Windows, windows
  - Enter: Windows    Search: Windows
- Potentially more powerful, but less efficient

Case folding
- Reduce all letters to lower case
  - exception: upper case (in mid-sentence?)
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization

Normalizing Punctuation
- Ne’er vs. never: use language-specific, handcrafted “locale” to normalize.
  - Which language?
  - Most common: detect/apply language at a pre-determined granularity: doc/paragraph.
- U.S.A. vs. USA – remove all periods or use locale.
  - a.out

Thesauri and soundex
- Handle synonyms and homonyms
  - Hand-constructed equivalence classes
    - e.g., car = automobile
    - color = colour
- Rewrite to form equivalence classes
- Index such equivalences
  - When the document contains automobile, index it under car as well (usually, also vice-versa)
- Or expand query?
  - When the query contains automobile, look under car as well

Soundex
- Traditional class of heuristics to expand a query into phonetic equivalents
  - Language specific – mainly for names
  - E.g., chebyshev → tchebychef
- More on this later ...

Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing “proper” reduction to dictionary headword form

Stemming
- Reduce terms to their “roots” before indexing
- “Stemming” suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.
- Example:
  - for example compressed and compression are both accepted as equivalent to compress.
  - → for exampl compress and compress ar both accept as equival to compress

Porter’s algorithm
- Commonest algorithm for
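
As a rough illustration of the tokenization and normalization steps described above, the following Python sketch splits text on whitespace, trims surrounding punctuation, and then maps each token to an equivalence class by deleting periods and lower-casing. The function names `tokenize` and `normalize` and the punctuation set are assumptions made for this example; they do not come from the lecture.

```python
def tokenize(text: str) -> list[str]:
    """Split on whitespace and trim punctuation around each chunk.

    Periods are deliberately left in place so that U.S.A. survives as one
    token; the normalizer below decides what to do with them.
    """
    tokens = []
    for chunk in text.split():
        token = chunk.strip(",;:!?\"()")  # assumed punctuation set
        if token:
            tokens.append(token)
    return tokens


def normalize(token: str) -> str:
    """Map a token to its equivalence class: delete periods, then case-fold."""
    return token.replace(".", "").lower()


if __name__ == "__main__":
    for token in tokenize("Friends, Romans and Countrymen"):
        print(token, "->", normalize(token))
    print(normalize("U.S.A.") == normalize("USA"))
```

Deleting every period is the simplest rule, and it shows the cost the slides hint at: under this sketch a.out would become aout, and Fed and fed fall into the same class.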

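The index-time option for hand-constructed equivalence classes (a document containing automobile is also indexed under car, and vice versa) could be sketched as follows. This is a minimal sketch under stated assumptions: the `EQUIVALENTS` table, the `build_index` function, and the toy documents are all hypothetical, and a real thesaurus would be far larger.

```python
from collections import defaultdict

# Hypothetical hand-built equivalence table (symmetric, as in "also vice-versa").
EQUIVALENTS = {
    "automobile": "car",
    "car": "automobile",
    "colour": "color",
    "color": "colour",
}


def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Build term -> set of doc IDs, indexing each term under its equivalent too."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
            if term in EQUIVALENTS:
                index[EQUIVALENTS[term]].add(doc_id)
    return index


if __name__ == "__main__":
    docs = {1: ["automobile", "repair"], 2: ["car", "wash"]}
    index = build_index(docs)
    print(sorted(index["car"]))
```

The alternative from the slide, expanding the query instead, would leave the index untouched and look up both car and automobile at query time; that keeps the postings smaller but makes each query do more work.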

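To make "crude affix chopping" concrete, here is a toy suffix stripper in Python. It is not Porter's algorithm: the suffix list, the minimum stem length, and the name `crude_stem` are assumptions chosen only so that the examples from the stemming slide come out as shown there.

```python
# Hypothetical toy suffix list, tried in this order; not Porter's rule set.
SUFFIXES = ("ion", "ic", "es", "ed", "e", "s")


def crude_stem(term: str, min_stem: int = 3) -> str:
    """Chop the first matching suffix, keeping at least min_stem characters."""
    if term.endswith("ss"):
        # Leave words like "compress" alone rather than stripping one "s".
        return term
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= min_stem:
            return term[: -len(suffix)]
    return term


if __name__ == "__main__":
    for word in ["automate", "automates", "automatic", "automation",
                 "compressed", "compression", "compress"]:
        print(word, "->", crude_stem(word))
```

Because the rules ignore meaning and morphology, a stripper like this will also conflate unrelated words that happen to share a prefix, which is why stemming is described as crude and language dependent.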