Unformatted text preview:

Inverted IndicesInverted FilesExampleInverted Files with TF-IDFSpace RequirementsBlock AddressingSlide 7Block Effect on Inverted File SizeSearchingSlide 10Vocabulary ConstructionIndex File ConstructionFaster Large Index ConstructionSlide 14Large Index Construction TimeConclusionIR System: What Do You Need?Inverted IndicesInverted FilesDefinition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.Structure of inverted file:–Vocabulary: is the set of all distinct words in the text–Occurrences: lists containing all information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)ExampleText: Inverted file1 6 12 16 18 25 29 36 40 45 54 58 66 70That house has a garden. The garden has many flowers. The flowers are beautifulbeautifulflowersgardenhouse7045, 5818, 296Vocabulary OccurrencesInverted Files with TF-IDFPrior example allows for boolean queries.Need the document frequency and term frequency.Vocabulary entryPosting file entryk dkdoc1f1kdoc2f2k…dk : document frequency of term kdoci : i-th document that contains term kfik : term frequency of term k in document iSpace RequirementsThe space required for the vocabulary is rather small. According to Heaps’ law the vocabulary grows as O(n), where  is a constant between 0.4 and 0.6 in practice–TREC-2: 1 GB text, 5 MB lexiconOn the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)To reduce space requirements, a technique called block addressing is usedBlock AddressingThe text is divided in blocksThe occurrences point to the blocks where the word appearsAdvantages:–the number of pointers is smaller than positions–all the occurrences of a word inside a single block are collapsed to one referenceDisadvantages:–online search over the qualifying blocks if exact positions are requiredExampleText: Inverted filebeautifulflowersgardenhouse4321Vocabulary OccurrencesBlock 1 Block 2 Block 3 Block 4That house has a garden. The garden has many flowers. The flowers are beautifulBlock Effect on Inverted File SizeHow big are inverted files?–In relation to original collection size•right column indexes stopwords while left removes stopwordsBlocks require text to be available for location of terms within blocks. 45%27%18%73%41%25%36%18%1.7%64%32%2.4%35%5%0.5%63%9%0.7%Addressing wordsAddressing 256 blocksAddressing 64K blocksIndexSmall collection(1Mb)Medium collection(200Mb)Large collection(2Gb)SearchingThe search algorithm on an inverted index follows three steps:1. Vocabulary search: the words present in the query are located in the vocabulary2. Retrieval occurrences: the lists of the occurrences of all query words found are retrieved3. Manipulation of occurrences: the occurrences are processed to solve the querySearchingSearching inverted files starts with vocabulary –store the vocabulary in a separate fileStructures used to store the vocabulary include–Hashing : O (1) lookup, does not support range queries–Tries : O (c) lookup, c = length (word)–B-trees : O (log v) lookupAn alternative is simply storing the words in lexicographical order –cheaper in space and very competitive with O(log v) costVocabulary ConstructionAll the vocabulary is kept in a suitable data structure storing for each word and a list of its occurrencesEach word of each text in the corpus is read and searched for in the vocabularyIf it is not found, it is added to the vocabulary with a empty list of occurrences The new position is added to the end of its list of occurrences for the wordIndex File ConstructionOnce the text is exhausted the vocabulary is written to disk with the list of occurrences. Two files are created:–in the first file, each list of word occurrences is stored contiguously–in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search timeThe overall process is O(n) worst-case timeFaster Large Index ConstructionAn option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased the main memory before continuing with the rest of the textOnce the text is exhausted, a number of partial indices Ii exist on diskThe partial indices are merged to obtain the final indexExampleI 1...8I 1...4I 5...8I 1...2I 3...4I 5...6I 7...8I 1I 2I 3I 4I 5I 6I 7I 81 2 4 53 67final indexinitial dumpslevel 1level 2level 3Large Index Construction TimeThe total time to generate partial indices is O(n)The number of partial indices is O(n/M)To merge the O(n/M) partial indices are necessary log2(n/M) merging levelsThe total cost of this algorithm is O(n log(n/M))ConclusionInverted files are used to index textThe indices are appropriate when the text collection is large and semi-staticIf the text collection is volatile online searching is the only optionSome techniques combine online and indexed searchingIR System: What Do You Need?Vocabulary List–Text preprocessing modules•lexical analysis, stemming, stopwordsOccurrences of Vocabulary Terms–Inverted index creation•term frequency in documents, document frequencyRetrieval and Ranking AlgorithmQuery and Ranking InterfacesBrowsing/Visualization


View Full Document

TAMU CSCE 315 - inverted-index

Download inverted-index
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view inverted-index and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view inverted-index 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?