New version page

Johns Hopkins CS 600.466 - LECTURE CS 466

Upgrade to remove ads
Upgrade to remove ads
Unformatted text preview:

Today’s TopicsBoolean IRBoolean OperatorsBoolean IR(implementation)Problems with Boolean IRSlide 6Signature FilesFalse Drop ProblemEfficiency ProblemVertical PartitioningHorizontal PartitioningInverted FilesSlide 13AND’s using Inverted FilesPowerPoint PresentationProximity SearchSlide 17Slide 18Interpolation SearchSlide 20Slide 21Cost of Computing Inverted IndexK-pass IndexingVector Models for IRSlide 25Slide 26QUERIES and Documents share same vector representaionSimilarity FunctionsToday’s Topics•Boolean IR•Signature files•Inverted files•PAT trees•Suffix arraysBoolean IR•Documents composed of TERMS(words, stems)•Express result in set-theoretic termsDoc’s containingterm Aterm Bterm CDoc’s containingterm Aterm Bterm CA AND B (A AND B) OR C- Pre 1970’s- Dominant industrial model through 1994 (Lexis-Nexis, DIALOG)Boolean OperatorsA AND BA OR B(A AND B) OR CA AND ( NOT B )Doc’s containing term AAdjacent AND “ A B ” e.g. “Johns Hopkins”“The Who”Proximity window A w/10 B A and B within +/- 10 words  A w/sent B A + B in same sentenceProximityOperators(ExtendedANDs) (in +/- K words)Boolean IR(implementation)• Bit vectors• Inverted files(a.k.a. Index)• PAT tree(more powerful index)0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 00 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0Impractical  very sparse(wastefully big) costly to compare V1V2TermiProblems with Boolean IR•Does not effectively support relevance ranking of returned documents•Base model : expression satisfaction is BooleanA document matches expression or it doesn’t•Extension to permit ordering : (A AND B) OR C–Supermatches(5 terms/doc > 3 terms/doc)–Partial matches (expression incompletely satisfied – give partial credit)–Importance weighting(10A OR 5B) Weight/importanceBoolean IR•Advantages : Can directly control searchGood for precise queries in structured data(e.g. database search or legal index)•Disadvantages : Must directly control search–Users should be familiar with domain and term space(know what to ask for and exclude)–Poor at relevance ranking–Poor at weighted query expansion, user modelling etc.Signature Files0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 00 0 1 0 0 1 0 0 0 0 0Problem : several different document bit vectors(i.e. different words) get mapped to same signature.(use stoplist to help avoid common words from overwhelming signatures)DocumentBitvectorSignatureMappingfunction f( )SuperimposedCodingUsing some mapping/Hash functionfewer bitsFalse Drop Problem•On retrieval, all documents/bit vectors mapped to the same signature are retrieved(returned)•Only a portion are relevant•Need to do secondary validation step to make sure target words actually matchProb(False Drop) = Prob(Signature qualifies & Text does not)Efficiency ProblemTesting for signature match may require linear scan through all document signaturesVertical Partitioning•Improves sig1, sig2 comparison speed, but still requires O(N) linear search of all signatures•Options : sig- Bit sliced onto different devices for parallel comparision- And together matches on each segmentsig1sig2compANDAND  resultHorizontal Partitioning•Goal : avoid sequential scanning of the signature fileSignatureDatabaseInput signatureHash functionor indexyielding specific candidates to tryInverted Files•Like an index to a book1415161737383940143915639451562904186156217TermsBaumBayesViterbiindexDocumentsInverted Files•Very efficient for single word queriesJust enumerate documents pointed to by index O( |A| ) = O(SA) •Efficient for OR’sJust enumerate both lists and remove duplicates O(SA + SB)AND’s using Inverted Files143915622731939455896156208Method 1:• Begin with two pointers(i, j) on list # is in index(A,B)• if A[ i ] = B[ i ], write A[ i ] to output• if A[ i ] < B[ i ], i++ else j++AiBjIndex forBayesIndex forViterbiijO(SA + SB )same as OR,but smaller output(meet search)AND’s using Inverted Files3922715252839455896156Method 2: Useful if one index is smaller than the other(SA << SB )AiBj(Johns)(Hopkins)For all members of Absearch (A[ i ], B)(do binary searchinto larger index)for all members ofsmaller indexA AND B AND COrder by smaller list pairwiseCost : SA * log2 (SB )can achieve SA * log log (SB )Proximity SearchAHJHHAAnthonyJohnsHopkinsDocument level indexes not adequateOption 1 :Size of corpus = size of indexDoc 1Doc 2Doc 3Doc iIndex to corpus Position offsetBefore :Match if ptrA = ptrB Now :“A B” = match if ptrA = ptrB -1A w/10 B = match if | ptrA - ptrB | 10Variations 1Don’t index function wordsXTheJohnsHopkinsindexwordlist*JohnsTheDo linear match search in corpus savings on 50% index size potential speed improvement given data access costsVariations 2 : Multilevel IndexesAnthonyJohnsHopkinsJohnsHopkinsJohnsHopkinsAnthonyHopkinsAnthonyDoc levelPosition level Supports parallel search May have paging cost advantage Cost – large index N + dVAvg. Doc/vocab sizeInterpolation Search174195* 211 *2262302312464834965215269951718192021222348495051100Bi cellvalueUseful when data are numeric and uniformly distributed# of cells in index : 100Values range from 0 … 1000Goal : looking for the value 211Binary search : begin looking at cell 50Interpolation search : better guess for 1st cell to examine? bins#sizemax K Binary SearchBsearch(low, high, key)mid = (high + low) / 2If (key = A[mid]) return midElse if (key < A[mid]) Bsearch (low, mid-1, key)Else Bsearch(mid+1, high, key)Interpolation SearchIsearch(low, high, key)mid = best estimate of posmid = low + (high – low) * (expected % of way through range) low] A[ - ]high A[] low A[ -keyBinary Search50251218222119.Interpolation Search2119. go directly to expected regionTypical sequence of cell’s tested :log log (N)ComparisonCost of Computing Inverted Index1. Simple word position pairs and sort2. If N >> memory size1) Tokenize(words  integers)2) Create histogram3) Allocate space in index4) Do multipass(K-pass) through corpus only adding tokens in K binsCorpus size N log NK-pass IndexingindexW1W2W3W4Block1(passK = 1)K = 2Time = KN + 1But big win overN log N on pagingVector Models for IR•Gerald Salton, Cornell(Salton + Lesk, 68)(Salton, 71)(Salton + McGill, 83)•SMART SystemChris Buckely, CornellCurrent keeper of the flameSalton’s magical


View Full Document
Download LECTURE CS 466
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE CS 466 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE CS 466 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?