Stanford CS 276 - Information Retrieval

This preview shows page 1-2-16-17-18-33-34 out of 34 pages.


Slide outline:
  PowerPoint Presentation
  Recall the basic indexing pipeline
  Parsing a document
  Complications: Format/language
  Complications: What is a document?
  Slide 6
  Tokenization
  Slide 8
  Numbers
  Tokenization: language issues
  Slide 11
  Slide 12
  Slide 13
  Stop words
  Normalization to terms
  Normalization: other languages
  Slide 17
  Case folding
  Slide 19
  Thesauri and soundex
  Slide 21
  Lemmatization
  Stemming
  Porter’s algorithm
  Typical rules in Porter
  Other stemmers
  Language-specificity
  Does stemming help?
  Slide 29
  Recall basic merge
  Augment postings with skip pointers (at indexing time)
  Query processing with skip pointers
  Where do we place skips?
  Placing skips

Introduction to Information Retrieval

Document ingestion

Recall the basic indexing pipeline
  Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer
  Token stream: Friends  Romans  Countrymen
  → Linguistic modules
  Modified tokens: friend  roman  countryman
  → Indexer
  Inverted index:
    friend → 2, 4
    roman → 1, 2
    countryman → 13, 16

Parsing a document [Sec. 2.1]
  What format is it in?
    pdf/word/excel/html?
  What language is it in?
  What character set is in use?
    (CP1252, UTF-8, …)
  Each of these is a classification problem, which we will study later in the course.
  But these tasks are often done heuristically …

Complications: Format/language [Sec. 2.1]
  Documents being indexed can include docs from many different languages
    A single index may contain terms from many languages.
  Sometimes a document or its components can contain multiple languages/formats
    French email with a German pdf attachment.
    French email quoting clauses from an English-language contract
  There are commercial and open source libraries that can handle a lot of this stuff

Complications: What is a document? [Sec. 2.1]
  We return from our query “documents” but there are often interesting questions of grain size:
  What is a unit document?
    A file?
    An email? (Perhaps one of many in a single mbox file)
      What about an email with 5 attachments?
    A group of files (e.g., PPT or LaTeX split over HTML pages)

Tokens

Tokenization [Sec. 2.2.1]
  Input: “Friends, Romans and Countrymen”
  Output: Tokens
    Friends
    Romans
    Countrymen
  A token is an instance of a sequence of characters
  Each such token is now a candidate for an index entry, after further processing
    Described below
  But what are valid tokens to emit?

Tokenization [Sec. 2.2.1]
  Issues in tokenization:
    Finland’s capital → Finland AND s? Finlands? Finland’s?
    Hewlett-Packard → Hewlett and Packard as two tokens?
      state-of-the-art: break up hyphenated sequence.
      co-education
      lowercase, lower-case, lower case ?
      It can be effective to get the user to put in possible hyphens
    San Francisco: one token or two?
      How do you decide it is one token?

Numbers [Sec. 2.2.1]
  3/20/91    Mar. 12, 1991    20/3/91
  55 B.C.
  B-52
  My PGP key is 324a3df234cb23e
  (800) 234-2333
    Often have embedded spaces
    Older IR systems may not index numbers
      But often very useful: think about things like looking up error codes/stacktraces on the web
      (One answer is using n-grams: IIR ch. 3)
    Will often index “meta-data” separately
      Creation date, format, etc.

Tokenization: language issues [Sec. 2.2.1]
  French
    L'ensemble → one token or two?
      L ? L’ ? Le ?
      Want l’ensemble to match with un ensemble
        Until at least 2003, it didn’t on Google
          Internationalization!
  German noun compounds are not segmented
    Lebensversicherungsgesellschaftsangestellter
    ‘life insurance company employee’
    German retrieval systems benefit greatly from a compound splitter module
      Can give a 15% performance boost for German

Tokenization: language issues [Sec. 2.2.1]
  Chinese and Japanese have no spaces between words:
    莎拉波娃现在居住在美国东南部的佛罗里达。
    Not always guaranteed a unique tokenization
  Further complicated in Japanese, with multiple alphabets intermingled
    Dates/amounts in multiple formats
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    Katakana  Hiragana  Kanji  Romaji
  End-user can express query entirely in hiragana!

Tokenization: language issues [Sec. 2.2.1]
  Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
  Words are separated, but letter forms within a word form complex ligatures
    ← → ← → ← start
    ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
  With Unicode, the surface presentation is complex, but the stored form is straightforward

Terms
  The things indexed in an IR system

Stop words [Sec. 2.2.2]
  With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
    They have little semantic content: the, a, and, to, be
    There are a lot of them: ~30% of postings for top 30 words
  But the trend is away from doing this:
    Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
    Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words.
    You need them for:
      Phrase queries: “King of Denmark”
      Various song titles, etc.: “Let it be”, “To be or not to be”
      “Relational” queries: “flights to London”

Normalization to

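The basic indexing pipeline recalled at the start of the slides (tokenizer, linguistic modules, indexer) can be illustrated with a minimal Python sketch. This is not the course's implementation: the function names, the two sample documents, and the tiny TOY_LEMMAS lookup table are all assumptions made for illustration, and a real system would replace the lookup with a stemmer or lemmatizer.

```python
import re
from collections import defaultdict

# Hypothetical lookup standing in for the "linguistic modules" stage;
# a real system would use a stemmer or lemmatizer here.
TOY_LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Emit runs of ASCII letters as tokens; punctuation is dropped."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Case-fold the token, then map it through the toy lemma table."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def build_index(docs):
    """Map each normalized term to a sorted list of docIDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Two made-up documents, keyed by docID.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "A friend in Rome",
}
index = build_index(docs)
print(index["friend"])
```

Every choice the tokenization slides raise (apostrophes, hyphens, numbers, languages without spaces) would change tokenize; the regular expression here simply sidesteps them by keeping only ASCII letters.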

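To see why the stop-word slide says these words are still needed, the following sketch applies a stop list to the slide's example queries. The stop list is an assumption: it contains only the five example words from the slide (the, a, and, to, be), and index_terms is a hypothetical helper that splits on whitespace rather than doing real tokenization.

```python
# Assumed stop list: only the five example words given on the slide.
STOP_WORDS = {"the", "a", "and", "to", "be"}


def index_terms(text, remove_stop_words):
    """Lowercase and whitespace-split the text, optionally dropping stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        return [t for t in tokens if t not in STOP_WORDS]
    return tokens


queries = ["King of Denmark", "Let it be", "To be or not to be", "flights to London"]
for query in queries:
    kept = index_terms(query, remove_stop_words=True)
    print(f"{query!r} -> {kept}")
```

Under this stop list, "To be or not to be" keeps only "or" and "not", and "flights to London" loses the relational word "to", so neither the phrase nor the relation can be matched against an index built this way.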