DOC PREVIEW
CMU CS 15826 - Approximate Counting

This preview shows page 1-2-3-4-5 out of 15 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 15 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

C. Faloutsos1CMU SCS15-826: Multimedia Databasesand Data MiningApproximate CountingC. Faloutsos15-826 (c) 2005, C. Faloutsos and C. Palmer 2CMU SCSOutlineGoal: ‘Find similar / interesting things’• Intro to DB• Indexing - similarity search• Data Mining15-826 (c) 2005, C. Faloutsos and C. Palmer 3CMU SCSData Mining - Detailed outline• Statistics• AI - decision trees• DB– data warehouses; data cubes; OLAP– classifiers– association rules– misc. topics:• ...• approximate counting15-826 (c) 2005, C. Faloutsos and C. Palmer 4CMU SCSOutline• Flajolet-Martin (and Cohen) –vocabulary size (Problem #1)• Application: Approximate Neighborhoodfunction (ANF)• other, powerful approximate counting tools(Problem #2, #3)15-826 (c) 2005, C. Faloutsos and C. Palmer 5CMU SCSProblem #1• Given a multiset (eg., words in a document)• find the vocabulary size (#, after dup.elimination)15-826 (c) 2005, C. Faloutsos and C. Palmer 6CMU SCSThanks to• Chris Palmer (Vivisimo)C. Faloutsos215-826 (c) 2005, C. Faloutsos and C. Palmer 7CMU SCSProblem #2• Given a multiset• compute approximate high-end histogram =hot-list query = (k most common words, andtheir counts)15-826 (c) 2005, C. Faloutsos and C. Palmer 8CMU SCSProblem #3• Given two documents• compute quickly their similarity (#commonwords/ #total-words) == Jaccard coefficient15-826 (c) 2005, C. Faloutsos and C. Palmer 9CMU SCSProblem #1• Given a multiset (eg., words in a document)• find the vocabulary size V (#, after dup.elimination)• using space O(V), or O(log(V))(Q1: Applications?)(Q2: How would you solve it?)15-826 (c) 2005, C. Faloutsos and C. Palmer 10CMU SCSBasic idea (Cohen)large bit string, initially all zerosAAC15-826 (c) 2005, C. Faloutsos and C. Palmer 11CMU SCSBasic idea (Cohen)large bit string, initially all zerosAAChash!15-826 (c) 2005, C. Faloutsos and C. Palmer 12CMU SCSBasic idea (Cohen)large bit stringAACC. Faloutsos315-826 (c) 2005, C. Faloutsos and C. Palmer 13CMU SCSBasic idea (Cohen)large bit stringAAC15-826 (c) 2005, C. Faloutsos and C. Palmer 14CMU SCSBasic idea (Cohen)large bit stringAACthe rightmost position depends on the vocabulary size(and so does the left-most)Repeat, with several hashing functions, and merge the estimates15-826 (c) 2005, C. Faloutsos and C. Palmer 15CMU SCSBasic idea (Cohen)large bit stringAACthe rightmost position depends on the vocabulary size(and so does the left-most)Can we do it in less space?? 15-826 (c) 2005, C. Faloutsos and C. Palmer 16CMU SCSBasic idea (Cohen)large bit stringAACthe rightmost position depends on the vocabulary size(and so does the left-most)Can we do it in less space?? YES15-826 (c) 2005, C. Faloutsos and C. Palmer 17CMU SCSHow?15-826 (c) 2005, C. Faloutsos and C. Palmer 18CMU SCSBasic idea (Flajolet-Martin)O(log(V)) bit string (V: voc. size)AACfirst bit: with prob. ½second: with prob. ¼...i-th: with prob. ½**iC. Faloutsos415-826 (c) 2005, C. Faloutsos and C. Palmer 19CMU SCSBasic idea (Flajolet-Martin)O(log(V)) bit string (V: voc. size)AACagain, the rightmost bit‘reveals’ the vocabulary size15-826 (c) 2005, C. Faloutsos and C. Palmer 20CMU SCSBasic idea (Flajolet-Martin)O(log(V)) bit string (V: voc. size)AACagain, the rightmost bit‘reveals’ the vocabulary sizeEg.: V=4, will probably set the 2nd bit, etc15-826 (c) 2005, C. Faloutsos and C. Palmer 21CMU SCSFlajolet-Martin• Hash multiple values of X to same signature– Hash each x to a bit, using exponential distr.– ½ map to bit 0, ¼ map to bit 1, …• Do several different mappings and average– Gives better accuracy– Estimate is: 2b / .77351 / BIAS• b ~ rightmost ‘1’, and actually:15-826 (c) 2005, C. Faloutsos and C. Palmer 22CMU SCSFlajolet-Martin• Hash multiple values of X to same signature– Hash each x to a bit, using exponential distr.– ½ map to bit 0, ¼ map to bit 1, …• Do several different mappings and average– Gives better accuracy– Estimate is: 2b / .77351 / BIAS• b : average least zero bit in the bitmask• bias : 1+.31/k for k different mappings• Flajolet & Martin prove this works15-826 (c) 2005, C. Faloutsos and C. Palmer 23CMU SCSFM Approx. Counting Alg.• How many bits? log V + small constant• What hash functions?Assume X = { 0, 1, …, V-1 }FOR i = 1 to k DO bitmask[i] = 0000…00Create k random hash functions, hashiFOR each element x of M DO FOR i = 1 to k DO h = hashi(x) bitmask[i] = bitmask[i] LOR hEstimate: b = average least zero bit in bitmask[i] 2b/.77351/(1+.31/k)15-826 (c) 2005, C. Faloutsos and C. Palmer 24CMU SCSRandom Hash Functions• Can use linear hash functions. Pick random(ai,, bi) and then the hash function is:– lhashi(x) = ai * x + bi• Gives uniform distribution over the bits• To make this exponential, define– hashi(x) = least zero bit in lhashi(x)• Hash functions easy to create and fast to useC. Faloutsos515-826 (c) 2005, C. Faloutsos and C. Palmer 25CMU SCSConclusions• Want to measure # of distinct elements• Approach #1: (Flajolet-Martin)– Map elements to random bits– Keep bitmask of bits– Estimate is O(2b) for least zero-bit b• Approach #2: (Cohen)– Create random permutation of elements– Keep least element seen– Estimate is: O(1/le) for least rank le15-826 (c) 2005, C. Faloutsos and C. Palmer 26CMU SCSApproximate counting• Flajolet-Martin (and Cohen) – vocabularysize• Application: Approximate Neighborhoodfunction (ANF)• other, powerful approximate counting toolsCMU SCSChristopher R. PalmerPhillip B. GibbonsChristos FaloutsosKDD 2001Fast Approximation of the“neighborhood” Function for MassiveGraphs15-826 (c) 2005, C. Faloutsos and C. Palmer 28CMU SCSMotivation• What is the diameter of the Web?• What is the effective diameter of the Web?• Are the telephone caller-callee graphs forthe U.S. similar to the ones in Europe?• Is the citation graph for physics differentfrom the one for computer science?• Are users in India further away from thecore of the Internet than those in the U.S.?15-826 (c) 2005, C. Faloutsos and C. Palmer 29CMU SCSProposed Tool: neighborhoodGiven graph G=(V,E)N(h) = # pairs within h hops or less = neighborhood function15-826 (c) 2005, C. Faloutsos and C. Palmer 30CMU SCSProposed Tool: neighborhoodGiven graph G=(V,E)N(h) = # pairs within h hops or less = neighborhood function N(u,h) = # neighbors of node u, within h hops or lessC. Faloutsos615-826 (c) 2005, C.


View Full Document

CMU CS 15826 - Approximate Counting

Download Approximate Counting
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Approximate Counting and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Approximate Counting 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?