Unformatted text preview:

3/17/09'1'Queries'and'Indexes'CISC489/689‐010,'Lecture'#7'Wednesday,'March'4'Ben'CartereCe'Project'Notes'• Next'worksheet:'– Inverted'lists'for'terms'in'the'wiki000'documents.'– For'each'term,'store:'• The'list'of'document'numbers'it'occurs'in.'• The'term'frequencies'in'those'documents.'• The'document'frequency'(total'number'of'documents'it'occurs'in).'– If'inclined,'you'may'store'other'informaVon:'• Term'posiVons,'field'informaVon,'etc.'3/17/09'2'Project'Notes'• Inverted'list'compression:'– You'should'compress'the'inverted'lists.'– Use'd‐gaps'for'document'numbers.'– Compress'integers'using'one'of'the'methods'discussed'in'class.'• Store'everything'in'memory.'– WriVng'to'disk'will'be'the'next'part'of'the'project.'– I'strongly'recommend'using'ir.cis'to'run'your'code.'• It'has'a'total'of'128Gb'of'RAM'(8'nodes,'16Gb'per'node).'Indexing'Process'Documents'(E‐mails,'web'pages,''Word'docs,'news'arVcles,'…)'Text'acquisiVon'(Crawler,'feeds,''filter,'…)'Corpus'Accessible'data'store'Text'transformaVon'(Parsing,'stopping,''stemming,'extracVon,'…)'Index'creaVon'(Document/term'stats,''weighVng,'inversion,'…)'Server(s)'3/17/09'3'Query'Process'Corpus'Accessible'data'store'Server(s)'Ranking'f(Q,D) EvaluaVon'(Precision,'recall,''clicks,'…)'Query'Process'• We'have'an'index'stored'on'disk.'– Inverted'file,'vocabulary,'collecVon.'– Contains'features'of'terms'and'documents:'• Term'frequencies'in'documents,'document'frequencies,'term'posiVons,'link‐graph'features,'…'• User'inputs'a'query.'• Engine'computes'features'of'the'query.'• Engine'accesses'index'to'respond'to'query.'– Matches'query'features'to'document/term'features'in'index'to'score'each'document.'• Returns'a'ranked'list'of'documents.'3/17/09'4'Abstract'Model'of'Ranking'More'Concrete'Model'3/17/09'5'Example'“CollecVon”'Query:'tropical'fish'pigmented'fish'saltwater'species'bright'coloraVon'3/17/09'6'gi(Q)'='#'of'occurrences'of'i'in'Q'fi(D)'='#'of'occurrences'of'i'in'D''Query:'tropical'fish'pigmented'fish'saltwater'species'bright'coloraVon'Query'Processing'• Term‐at‐a‐Vme'– Accumulates'scores'for'documents'by'processing'term'lists'one'at'a'Vme'• Document‐at‐a‐Vme'– Calculates'complete'scores'for'documents'by'processing'all'term'lists,'one'document'at'a'Vme'• Both'approaches'have'opVmizaVon'techniques'that'significantly'reduce'Vme'required'to'generate'scores'3/17/09'7'Term‐At‐A‐Time'Term‐At‐A‐Time'3/17/09'8'Document‐At‐A‐Time'Document‐At‐A‐Time'3/17/09'9'OpVmizaVon'Techniques'• Inverted'lists'can'be'very'long'– Decompression'Vme'+'processing'Vme'can'add'up'fast'• OpVmizaVons'are'used'to'speed'up'processing'Vme'• Two'classes'of'opVmizaVon'– Read'less'data'from'inverted'lists'• e.g.,'skip'lists'• beCer'for'simple'feature'funcVons'– Calculate'scores'for'fewer'documents'• e.g.,'conjuncVve'processing'• beCer'for'complex'feature'funcVons'ConjuncVve''Term‐at‐a‐Time'3/17/09'10'ConjuncVve''Document‐at‐a‐Time'Threshold'Methods'• Threshold'methods'use'number'of'top‐ranked'documents'needed'(k)'to'opVmize'query'processing'– for'most'applicaVons,'k'is'small'• For'any'query,'there'is'a'minimum&score&that'each'document'needs'to'reach'before'it'can'be'shown'to'the'user'– score'of'the'kth‐highest'scoring'document'– gives'threshold&score'τ&– opVmizaVon'methods'esVmate'τ’&to'ignore'documents'3/17/09'11'Threshold'Methods'• For'document‐at‐a‐Vme'processing,'use'score'of'lowest‐ranked'document'so'far'for'τ′&&– for'term‐at‐a‐Vme,'have'to'use'kth‐largest'score'in'the'accumulator'table'• MaxScore'method'compares'the'maximum'score'that'remaining'documents'could'have'to'τ′&– safe&opVmizaVon'in'that'ranking'will'be'the'same'without'opVmizaVon'MaxScore'Example'• Indexer'computes'μtree&&– maximum'score'for'any'document'containing'just'“tree”'• Assume'k'=3,'τ’&is'lowest'score'ajer'first'three'docs'• Likely'that'τ’&&>&μtree&– τ'&&is'the'score'of'a'document'that'contains'both'query'terms'• Can'safely'skip'over'all'gray'posVngs'3/17/09'12'Other'Approaches'• Early'terminaVon'of'query'processing'– ignore'high‐frequency'word'lists'in'term‐at‐a‐Vme'– ignore'documents'at'end'of'lists'in'doc‐at‐a‐Vme'– unsafe'opVmizaVon'• List'ordering'– order'inverted'lists'by'quality'metric'(e.g.,'PageRank)'or'by'parVal'score'– makes'unsafe'(and'fast)'opVmizaVons'more'likely'to'produce'good'documents'Review'• Query'processing:'– Document‐at‐a‐Vme'– Term‐at‐a‐Vme'– OpVmizaVons:'• ConjuncVve'processing'• Thresholding'• How'do'you'define'fi'and'gi'in'the'scoring'funcVon?'– What'is'the'actual'goal?'3/17/09'13'InformaVon'Needs'• An'informa;on&need'is'the'underlying'cause'of'the'query'that'a'person'submits'to'a'search'engine'– someVmes'called'informa;on&problem&to'emphasize'that'informaVon'need'is'generally'related'to'a'task'• Categorized'using'variety'of'dimensions''– e.g.,'number'of'relevant'documents'being'sought'– type'of'informaVon'that'is'needed'– type'of'task'that'led'to'the'requirement'for'informaVon'Queries'and'InformaVon'Needs'• A'query'can'represent'very'different'informaVon'needs''– May'require'different'search'techniques'and'ranking'algorithms'to'produce'the'best'rankings'• A'query'can'be'a'poor'representaVon'of'the'informaVon'need'– User'may'find'it'difficult'to'express'the'informaVon'need'– User'is'encouraged'to'enter'short'queries'both'by'the'search'engine'interface,'and'by'the'fact'that'long'queries'don’t'work'3/17/09'14'Retrieval'Models'• Provide'a'mathemaVcal'framework'for'defining'the'search'process'– includes'explanaVon'of'assumpVons'– basis'of'many'ranking'algorithms'– can'be'implicit'• Theories'about'relevance'Relevance'• Complex'concept'that'has'been'studied'for'some'Vme'– Many'factors'to'consider''– People'ojen'disagree'when'making'relevance'judgments'• Retrieval'models'make'various'assumpVons'about'relevance'to'simplify'problem'– e.g.,&topical'vs.'user'relevance'– e.g.,'binary'vs.'mul;‐valued'relevance'3/17/09'15'Retrieval'Model'Overview'• Older'models'–


View Full Document

UD CISC 689 - Queries 
and
 Indexes


Download Queries 
and
 Indexes

Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Queries 
and
 Indexes
 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Queries 
and
 Indexes
 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?