Today’s TopicsBoolean IRBoolean OperatorsBoolean IR(implementation)Problems with Boolean IRSlide 6Signature FilesFalse Drop ProblemEfficiency ProblemVertical PartitioningHorizontal PartitioningInverted FilesSlide 13AND’s using Inverted FilesPowerPoint PresentationProximity SearchSlide 17Slide 18Interpolation SearchSlide 20Slide 21Cost of Computing Inverted IndexK-pass IndexingVector Models for IRSlide 25Slide 26QUERIES and Documents share same vector representaionSimilarity FunctionsToday’s Topics•Boolean IR•Signature files•Inverted files•PAT trees•Suffix arraysBoolean IR•Documents composed of TERMS(words, stems)•Express result in set-theoretic termsDoc’s containingterm Aterm Bterm CDoc’s containingterm Aterm Bterm CA AND B (A AND B) OR C- Pre 1970’s- Dominant industrial model through 1994 (Lexis-Nexis, DIALOG)Boolean OperatorsA AND BA OR B(A AND B) OR CA AND ( NOT B )Doc’s containing term AAdjacent AND “ A B ” e.g. “Johns Hopkins”“The Who”Proximity window A w/10 B A and B within +/- 10 words A w/sent B A + B in same sentenceProximityOperators(ExtendedANDs) (in +/- K words)Boolean IR(implementation)• Bit vectors• Inverted files(a.k.a. Index)• PAT tree(more powerful index)0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 00 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0Impractical very sparse(wastefully big) costly to compare V1V2TermiProblems with Boolean IR•Does not effectively support relevance ranking of returned documents•Base model : expression satisfaction is BooleanA document matches expression or it doesn’t•Extension to permit ordering : (A AND B) OR C–Supermatches(5 terms/doc > 3 terms/doc)–Partial matches (expression incompletely satisfied – give partial credit)–Importance weighting(10A OR 5B) Weight/importanceBoolean IR•Advantages : Can directly control searchGood for precise queries in structured data(e.g. database search or legal index)•Disadvantages : Must directly control search–Users should be familiar with domain and term space(know what to ask for and exclude)–Poor at relevance ranking–Poor at weighted query expansion, user modelling etc.Signature Files0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 00 0 1 0 0 1 0 0 0 0 0Problem : several different document bit vectors(i.e. different words) get mapped to same signature.(use stoplist to help avoid common words from overwhelming signatures)DocumentBitvectorSignatureMappingfunction f( )SuperimposedCodingUsing some mapping/Hash functionfewer bitsFalse Drop Problem•On retrieval, all documents/bit vectors mapped to the same signature are retrieved(returned)•Only a portion are relevant•Need to do secondary validation step to make sure target words actually matchProb(False Drop) = Prob(Signature qualifies & Text does not)Efficiency ProblemTesting for signature match may require linear scan through all document signaturesVertical Partitioning•Improves sig1, sig2 comparison speed, but still requires O(N) linear search of all signatures•Options : sig- Bit sliced onto different devices for parallel comparision- And together matches on each segmentsig1sig2compANDAND resultHorizontal Partitioning•Goal : avoid sequential scanning of the signature fileSignatureDatabaseInput signatureHash functionor indexyielding specific candidates to tryInverted Files•Like an index to a book1415161737383940143915639451562904186156217TermsBaumBayesViterbiindexDocumentsInverted Files•Very efficient for single word queriesJust enumerate documents pointed to by index O( |A| ) = O(SA) •Efficient for OR’sJust enumerate both lists and remove duplicates O(SA + SB)AND’s using Inverted Files143915622731939455896156208Method 1:• Begin with two pointers(i, j) on list # is in index(A,B)• if A[ i ] = B[ i ], write A[ i ] to output• if A[ i ] < B[ i ], i++ else j++AiBjIndex forBayesIndex forViterbiijO(SA + SB )same as OR,but smaller output(meet search)AND’s using Inverted Files3922715252839455896156Method 2: Useful if one index is smaller than the other(SA << SB )AiBj(Johns)(Hopkins)For all members of Absearch (A[ i ], B)(do binary searchinto larger index)for all members ofsmaller indexA AND B AND COrder by smaller list pairwiseCost : SA * log2 (SB )can achieve SA * log log (SB )Proximity SearchAHJHHAAnthonyJohnsHopkinsDocument level indexes not adequateOption 1 :Size of corpus = size of indexDoc 1Doc 2Doc 3Doc iIndex to corpus Position offsetBefore :Match if ptrA = ptrB Now :“A B” = match if ptrA = ptrB -1A w/10 B = match if | ptrA - ptrB | 10Variations 1Don’t index function wordsXTheJohnsHopkinsindexwordlist*JohnsTheDo linear match search in corpus savings on 50% index size potential speed improvement given data access costsVariations 2 : Multilevel IndexesAnthonyJohnsHopkinsJohnsHopkinsJohnsHopkinsAnthonyHopkinsAnthonyDoc levelPosition level Supports parallel search May have paging cost advantage Cost – large index N + dVAvg. Doc/vocab sizeInterpolation Search174195* 211 *2262302312464834965215269951718192021222348495051100Bi cellvalueUseful when data are numeric and uniformly distributed# of cells in index : 100Values range from 0 … 1000Goal : looking for the value 211Binary search : begin looking at cell 50Interpolation search : better guess for 1st cell to examine? bins#sizemax K Binary SearchBsearch(low, high, key)mid = (high + low) / 2If (key = A[mid]) return midElse if (key < A[mid]) Bsearch (low, mid-1, key)Else Bsearch(mid+1, high, key)Interpolation SearchIsearch(low, high, key)mid = best estimate of posmid = low + (high – low) * (expected % of way through range) low] A[ - ]high A[] low A[ -keyBinary Search50251218222119.Interpolation Search2119. go directly to expected regionTypical sequence of cell’s tested :log log (N)ComparisonCost of Computing Inverted Index1. Simple word position pairs and sort2. If N >> memory size1) Tokenize(words integers)2) Create histogram3) Allocate space in index4) Do multipass(K-pass) through corpus only adding tokens in K binsCorpus size N log NK-pass IndexingindexW1W2W3W4Block1(passK = 1)K = 2Time = KN + 1But big win overN log N on pagingVector Models for IR•Gerald Salton, Cornell(Salton + Lesk, 68)(Salton, 71)(Salton + McGill, 83)•SMART SystemChris Buckely, CornellCurrent keeper of the flameSalton’s magical
View Full Document