Mt Holyoke CS 336 - Inverted Files, Signature Files and Bitmaps

Unformatted text preview:

CS336 Lecture 4:Generating Document RepresentationsIndex LanguagesSlide 4How do we retrieve information?Indexing techniquesInverted List: most common indexing techniqueInverted FilesSlide 9The inverted listsInverted Files: CoarseInverted Files: MediumInverted Files: FineIndex GranularityIndexes: BitmapsIndexes: Signature FilesSlide 17Signature FilesSlide 19Slide 20Inverted FileThe DictionaryThe dictionarySlide 24The inverted fileSlide 26Building an Inverted FileWhat are the challenges?Slide 29Inverted Files/Signature Files/BitmapsSlide 31Inverted Files, Signature Files, BitmapsCS336 Lecture 4:2Generating Document Representations•Use significant terms to build representations of documents–referred to as indexing•Manual indexing: professional indexers–Assign terms from a controlled vocabulary–Typically phrases•Automatic indexing: machine selects–Terms can be single words, phrases, or other features from the text of documents3Index Languages•Language used to describe docs and queries•Exhaustivity # of different topics indexed, completeness or breadth–increased exhaustivity => higher recall/ lower precision•Specificity - accuracy of indexing, detail–increased specificity => higher precision/lower recall•retrieved output size increases because documents are indexed by any remotely connected content information• When doc represented by fewer terms, content may be lost. A query that refers to the lost content,will fail to retrieve the document4Index Languages•Pre-coordinate indexing – combinations of terms (e.g. phrases) used as an indexing term•Post-coordinate indexing - combinations generated at search time•Faceted classification - group terms into facets that describe basic structure of a domain, less rigid than predefined hierarchy•Enumerative classification - an alphabetic listing, underlying order less clear–e.g. Library of Congress class for “socialism, communism and anarchism” at end of schedule for social sciences, after social pathology and criminology5How do we retrieve information?1. Search the whole text sequentially (i.e., on-line search)–A good strategy if •the text is small•the only choice•unaffordable index space overhead2. Build data structures over the text (indices) to speed up the search–A good strategy if•the text collection is large•the text is semi-static6Indexing techniques•Inverted files•best choice for most applications•Signature files & bitmaps•word-oriented index structures based on hashing•Arrays•faster for phrase searches & less common queries•harder to build & maintain•Design issues:•Search cost & space overhead•Cost of building & updating7Inverted List: most common indexing technique•Source file: collection, organized by document•Inverted file: collection organized by term–one record per term, listing locations where term occurs•Searching: traverse lists for each query term–OR: the union of component lists–AND: an intersection of component lists–Proximity: an intersection of component lists–SUM: the union of component lists; each entry has a score8Inverted Files•Contains inverted lists–one for each word in the vocabulary–identifies locations of all occurrences of a word in the original text•which ‘documents’ contain the word•Perhaps locations of occurrence within documents•Requires a lexicon or vocabulary list–provides mapping between word and its inverted list•Single term query could be answered by 1. scan the term’s inverted list 2. return every doc on the list9Inverted Files•Index granularity refers to the accuracy with which term locations are identified–coarse grained may identify only a block of text•each block may contain several documents–moderate grained will store locations in terms of document numbers–finely grained indices will return a sentence, word number, or byte number (location in original text)10The inverted lists•Data stored in inverted list:–The term, document frequency (df), list of DocIds•government, 3, <5, 18, 26,>–List of pairs of DocId and term frequency (tf)•government, 3 <(5, 2), (18, 1)(26, 2)>–List of DocId and positions•government, 3 <5, 25, 56><18, 4><26, 12, 43>11Inverted Files: CoarseBlock Document Text 1 1 Pease porridge hot, pease porridge cold 1 2 Pease porridge in the pot 1 3 Nine days old 2 4 Some like it hot, some like it cold 2 5 Some like it in the pot 2 6 Nine days old Term Number Term Block 1 cold <1,2> 2 days <1,2> 3 hot <1,2> 4 in <1,2> 5 it <1,2> 6 like <2> 7 nine <1,2> 8 old <1,2> 9 pease <1> 10 porridge <1> 11 pot <1,2> 12 some <2> 13 the <1,2>12Inverted Files: MediumDocument Text 1 Pease porridge hot, pease porridge cold 2 Pease porridge in the pot 3 Nine days old 4 Some like it hot, some like it cold 5 Some like it in the pot 6 Nine days old Number Term Documents 1 cold <2; 1,4> 2 days <2; 3,6> 3 hot <2; 1,4> 4 in <2; 2,5> 5 it <2; 4,5> 6 like <2; 4,5> 7 nine <2; 3,6> 8 old <2; 3,6> 9 pease <2; 1,2> 10 porridge <2; 1,2> 11 pot <2; 2,5> 12 some <2; 4,5> 13 the <2; 2,5>13Inverted Files: FineDocument Text 1 Pease porridge hot, pease porridge cold 2 Pease porridge in the pot 3 Nine days old 4 Some like it hot, some like it cold 5 Some like it in the pot 6 Nine days old Number Term Documents 1 cold <2; (1;6),(4;8)> 2 days <2; (3;2),(6;2)> 3 hot <2; (1;3),(4;4)> 4 in <2; (2;3),(5;4)> 5 it <2; (4;3,7),(5;3)> 6 like <2; (4;2,6),(5;2)> 7 nine <2; (3;1),(6;1)> 8 old <2; (3;3),(6;3)> 9 pease <2; (1;1,4),(2;1)> 10 porridge <2; (1;2,5),(2;2)> 11 pot <2; (2;5),(5;6)> 12 some <2; (4;1,5),(5;1)> 13 the <2; (2;4),(5;5)>14Index Granularity•Can you think of any differences between these in terms of storage needs or search effectiveness?–coarse: identify a block of text (potentially many docs)–fine : store sentence, word or byte number• less storage space, but more searching of plain text to find exact locations of search terms• more false matches when multiple words. Why?• Enables queries to contain proximity information• e.g.) “green house” versus green AND house • Proximity info increases index size 2-3x•only include doc info if proximity will not be used15Indexes: Bitmaps•Bag-of-words index only: term x document array•For each term, allocate vector with 1 bit per document•If term present in document n, set n’th bit to


View Full Document

Mt Holyoke CS 336 - Inverted Files, Signature Files and Bitmaps

Documents in this Course
Load more
Download Inverted Files, Signature Files and Bitmaps
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Inverted Files, Signature Files and Bitmaps and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Inverted Files, Signature Files and Bitmaps 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?