DOC PREVIEW
Duke CPS 214 - Algorithms in the Real World

This preview shows page 1-2-3-4-29-30-31-32-33-60-61-62-63 out of 63 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

15-853:Algorithms in the Real WorldIndexing and Searching OutlineSlide 3Basic ModelHow big is an Index?Sizes over timePrecision and RecallSlide 8Main ApproachesQueriesTechnique used Across MethodsOther MethodsSlide 13Documents as Bipartite GraphSlide 151. Space for Posting ListsSome CodesGlobal vs. Local ProbabilitiesPerformance2. Accessing the LexiconFront CodingPrefix and Wildcard Queriesn-gramRotated Lexicon3. Merging Posting ListsUnion, Intersection, and MergingMerging: Upper boundsSplit and JoinTime for Split and JoinWill also useUnion with Split and JoinRuntime of UnionIntersection with Split and JoinCase Study: AltaVistaAltaVista: the indexAltaVista: query mergingAltaVista: query merging (cont.)Slide 38Link AnalysisThe Link GraphThe Link MatrixSlide 42Google: Page Rank (1)Google: Page Rank (2)Side Topic: Markov ChainsSlide 46Slide 47Slide 48Google: Page Rank (3)Hubs and Authorities (1)Hubs and Authorities (2)Hubs and Authorities (3)Hubs and Authorities (4)Hubs and Authorities (5)Slide 55Syntacting Clustering of the WebSyntactic Clusteringw-ShinglingResemblance and ContainmentCalculating Resemblance/ContainmentUsing Min Valued ShinglesHashBack to similarity15-853 Page 115-853:Algorithms in the Real WorldIndexing and Searching (how google and the likes work)15-853 Page 2Indexing and Searching OutlineIntroduction: model, query typesInverted File Indices: Compression, Lexicon, MergingLink Analysis: PageRank (Google), Hubs & AuthoritiesDuplicate Removal15-853 Page 3Indexing and Searching OutlineIntroduction: –model–query types–common techniques (stop words, stemming, …)Inverted File Indices: Compression, Lexicon, MergingLink Analysis: PageRank (Google), Hubs and AuthoritiesDuplicate Removal:15-853 Page 4Basic ModelApplications:–Web, mail, and dictionary searches–Law and patent searches–Information filtering (e.g., NYT articles)Goal: Speed, Space, Accuracy, Dynamic Updates“Document Collection”IndexQueryDocument List15-853 Page 5How big is an Index?Sep 2003, self proclaimed sizes (gg = google, atw = alltheweb, ink = inktomi, tma = teoma)Source: Search Engine WatchBillionPages15-853 Page 6Sizes over time15-853 Page 7Precision and RecallTypically a tradeoff between the two.number retrieved that are relevanttotal number retrievednumber relevant that are retrievedtotal number relevant Precision:Recall:15-853 Page 8Precision and RecallDoes the black or the blue circle have higher precision?15-853 Page 9Main ApproachesFull Text Searching–e.g. grep, agrep (used by many mailers)Inverted File Indices–good for short queries–used by most search enginesSignature Files–good for longer queries with many termsVector Space Models–good for better accuracy–used in clustering, SVD, …15-853 Page 10QueriesTypes of Queries on Multiple “terms”–boolean (and, or, not, andnot)–proximity (adj, within <n>)–keyword sets–in relation to other documentsAnd within each term–prefix matches–wildcards–edit distance bounds15-853 Page 11Technique used Across MethodsCase foldingLondon -> londonStemmingcompress = compression = compressed(several off-the-shelf English Language stemmers are freely available)Stop wordsto, the, it, be, or, … how about “to be or not to be”Thesaurus fast -> rapid15-853 Page 12Other MethodsDocument Ranking:Returning an ordered ranking of the results–A priori ranking of documents (e.g. Google)–Ranking based on “closeness” to query–Ranking based on “relevance feedback”Clustering and “Dimensionality Reduction”–Return results grouped into clusters –Return results even if query terms does not appear but are clustered with documents that doDocument Preprocessing–Removing near duplicates–Detecting spam15-853 Page 13Indexing and Searching OutlineIntroduction: model, query typesInverted File Indices:–Index compression–The lexicon–Merging terms (unions and intersections)Link Analysis: PageRank (Google), HITS Duplicate Removal15-853 Page 14Documents as Bipartite GraphCalled an “Inverted File” indexCan be stored using adjacency lists, also called–posting lists (or files)–inverted file entryExample size of TREC database(Text REtrieval Conference)–538K terms–742K documents–333,856K edgesFor the web, multiply by 10K……termsDocumentsAardvarkDoc 115-853 Page 15Documents as Bipartite GraphImplementation Issues:1. Space for posting liststhese take almost all the space2. Access to lexicon–btrees, tries, hashing–prefix and wildcard queries3. Merging posting list–multiple term queries……termsDocumentsAardvarkDoc 1Lexicon15-853 Page 161. Space for Posting ListsPosting lists can be as large as the document data–saving space and the time to access the space is critical for performanceWe can compress the lists,but, we need to uncompress on the fly.Difference encoding:Lets say the term elephant appears in documents: [3, 5, 20, 21, 23, 76, 77, 78]then the difference code is [3, 2, 15, 1, 2, 53, 1, 1]15-853 Page 17Some CodesGamma code: if most significant bit of n is in location k, then gamma(n) = 0k n[k..0]2 log(n) – 1 bitsDelta code:gamma(k)n[k..0]2 log(log(n)) + log(n) - 1 bitsFrequency coded:base on actual probabilities of each distance15-853 Page 18Global vs. Local ProbabilitiesGlobal:–Count # of occurrence of each distance–Use Huffman or arithmetic codeLocal:generate counts for each list elephant: [3, 2, 1, 2, 53, 1, 1]Problem: counts take too much spaceSolution: batching group into buckets by blog(length) c15-853 Page 19PerformanceBits per edge based on the TREC document collectionTotal size = 333M * .66 bytes = 222MbytesGlobal bits/edge Binary 20.00 Gamma 6.43 Delta 6.19 Huffman 5.83Local Skewed Bernoulli5.28 Batched Huffman5.2715-853 Page 202. Accessing the LexiconWe all know how to store a dictionary, BUT…–it is best if lexicon fits in memory---can we avoid storing all characters of all words–what about prefix or wildcard queries?Some possible data structures–Front Coding–Tries–Perfect Hashing–B-trees15-853 Page 21Front CodingFor large lexicons can save 75% of spaceBut what about random access?Word front coding7,jezebel 0,7,jezebel5,jezer 4,1,r7,jezerit 5,2,it6,jeziah 3,3,iah6,jeziel 4,2,el7,jezliah 3,4,liah15-853 Page 22Prefix and Wildcard QueriesPrefix queries–Handled by all access methods except hashingWildcard queries–n-gram–rotated lexicon15-853 Page 23n-gramConsider every block of n characters in a term:e.g. 2-gram of jezebel ->


View Full Document

Duke CPS 214 - Algorithms in the Real World

Download Algorithms in the Real World
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Algorithms in the Real World and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Algorithms in the Real World 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?