Stanford CS 276 - Document Ingestion - D3623808


Course: CS 276 - Information Retrieval and Web Search

Pages 34


Introduction to Information Retrieval

Document ingestion

Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
  -> Tokenizer -> Token stream: Friends Romans Countrymen
  -> Linguistic modules -> Modified tokens: friend roman countryman
  -> Indexer -> Inverted index:
       friend -> 2, 4
       roman -> 1, 2
       countryman -> 13, 16

Sec. 2.1 Parsing a document
- What format is it in? pdf/word/excel/html?
- What language is it in?
- What character set is in use? (CP1252, UTF-8, ...)
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically.

Sec. 2.1 Complications: Format/language
- Documents being indexed can include docs from many different languages
  - A single index may contain terms from many languages
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment
  - French email quoting clauses from an English-language contract
- There are commercial and open source libraries that can handle a lot of this stuff

Sec. 2.1 Complications: What is a document?
We return "documents" from our query, but there are often interesting questions of grain size.
What is a unit document?
- A file?
- An email? (Perhaps one of many in a single mbox file)
  - What about an email with 5 attachments?
- A group of files (e.g., PPT or LaTeX split over HTML pages)

Tokens

Sec. 2.2.1 Tokenization
- Input: "Friends, Romans and Countrymen"
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further processing (described below)
- But what are valid tokens to emit?

Sec. 2.2.1 Tokenization
Issues in tokenization:
- Finland's capital -> Finland AND s? Finlands? Finland's?
- Hewlett-Packard -> Hewlett and Packard as two tokens?
  - state-of-the-art: break up hyphenated sequence
  - co-education
  - lowercase, lower-case, lower case?
  - It can be effective to get the user to put in possible hyphens
- San Francisco: one token or two?
  - How do you decide it is one token?

Sec. 2.2.1 Numbers
- 3/20/91, Mar. 12, 1991, 20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
  - Older IR systems may not index numbers
    - But often very useful: think about things like looking up error codes/stacktraces on the web
    - (One answer is using n-grams: IIR ch. 3)
  - Will often index "meta-data" separately
    - Creation date, format, etc.

Sec. 2.2.1 Tokenization: language issues
- French
  - L'ensemble -> one token or two?
    - L? L'? Le?
    - Want l'ensemble to match with un ensemble
      - Until at least 2003, it didn't on Google
      - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - 'life insurance company employee'
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German

Sec. 2.2.1 Tokenization: language issues
- Chinese and Japanese have no spaces between words
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats (e.g., 500, 500K, 6,000)
  - Katakana, Hiragana, Kanji, Romaji
- End user can express query entirely in hiragana!

Sec. 2.2.1 Tokenization: language issues
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
- Example (translated): 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
- With Unicode, the surface presentation is complex, but the stored form is straightforward

Terms
The things indexed in an IR system

Sec. 2.2.2 Stop words
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"

Sec. 2.2.3 Normalization to terms
- We may need to "normalize" words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA -> USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory -> antidiscriminatory

Sec. 2.2.3 Normalization: other languages
- Accents: e.g., French résumé vs. resume
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen -> Tubingen

Sec. 2.2.3 Normalization: other languages
- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is intertwined with language
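
The normalization rules above (deleting periods, deleting hyphens, and de-accenting) can be illustrated with a minimal Python sketch. The function names `strip_accents`, `normalize`, and `equivalence_classes` are hypothetical and do not come from the slides. The sketch assumes that accents are removed by Unicode decomposition, which is only one possible approach, and it leaves case untouched so that the output matches the slide examples.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks: é -> e, ü -> u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Map a token to a term by de-accenting and deleting periods and hyphens.

    Illustrative only: a real system would pick rules per language.
    """
    term = strip_accents(token)
    return term.replace(".", "").replace("-", "")


def equivalence_classes(tokens):
    """Group tokens that normalize to the same term (term -> list of tokens)."""
    classes = {}
    for token in tokens:
        classes.setdefault(normalize(token), []).append(token)
    return classes


if __name__ == "__main__":
    sample = ["U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory",
              "Tübingen", "Tubingen", "résumé", "resume"]
    for term, members in equivalence_classes(sample).items():
        print(term, "<-", members)
```

This sketch does not map the spelling Tuebingen to Tubingen. That would need a language-specific rule (for example, rewriting "ue" as "u" for German text only), which is one reason normalization is tied to knowing the language of the text.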
