Unformatted text preview:

Bloom Filters and Summary Cache presented by Timur Chabuk 04 24 2007 Bloom Filters Bloom filters offer a succinct way to represent a set of items Representation reduces space but pays with it with false positives Bloom Filter Principle Wherever a list or a set is used and space is at a premium consider using a Bloom filter if the effect of false positives can be mitigated Standard Bloom Filter goal represent a set of S x1 x2 xn representation is an array of m bits uses k independent hash functions h1 h2 hk hash functions have output range 1 m Entering Data To enter a set S into a bloom filter for every x in S for every hash function hi i 1 to k set hi x th bit of m to 1 Example entering x1 and x2 into bloom filter h1 x1 2 h2 x1 5 h3 x1 9 h1 x2 5 h2 x2 7 h3 x2 11 Checking Membership To check membership of y check hi y th bit of m for every i 1 to k if any bit is 0 y is not in the set if all bits are 1 y is in set or false positive Example checking membership of y1 and y2 h1 y1 2 h2 y1 4 h3 y1 8 h1 y2 5 h2 y2 7 h3 y2 11 False Positives 1 m chance specific hash of specific element hit specific bit 1 1 m chance specific hash of specific element missed specific bit 1 1 m k chance all hashes of specific element missed specific bit 1 1 m kn chance all hashes of all elements missed specific bit After all elements of S have been hashed into the Bloom filter P a specific bit is 0 p 1 1 m kn e kn m p Let proportion of all bits that are 0 after all n elements entered E p False Positives False positives occur when both of the following hold y is not an element in the set bits corresponding to all hashes of y are set in the bit vector 1 chance corresponding bit of a specific hash of y 1 1 k chance all corresponding bits of all hashes of y 1 Let f and f represent the probability of a false positive f 1 p k 1 k 1 p k f f 1 1 1 m kn k f 1 e kn m k Alternate Forms Bloom filters are sometimes implemented such that each hash function has an output range of m k consecutive bits disjoint from all other hash function outputs P specific bit is 0 after all elements added 1 k m n e kn m Note that for all k greater than or equal to 1 1 k m n 1 1 m kn Probability of a false positive is always at least as large in this form Choosing of Hash Functions given m and n how many hash functions to use more hash functions yields more chances to find a 0 bit for elements not in S fewer hash functions increases the fraction of the bits that are 0 to find optimal number minimize probability of false positive f with repsect to number of hash functions k minimum occurs when p 1 2 equivalently when k ln2 m n number of 0 bits in Bloom filter may not exactly equal p with high probability it will be very close to p for large arrays Lower Bound How many bits m are necessary to represent all sets of n elements in a manner that allows false positives for at most a fraction of the universe and no false negatives u size of universe u choose n number of possible sets with n elements F X m bit string to which our representation maps set X Given an m bit string s and an element x s accepts x if there exists a set X st x is in X and F X s s rejects x otherwise Lower Bound Consider a specific set X of n elements Any string s that is used to represent X must accept every element in X may accept u n other elements Therefore n u n number of elements that a string s can accept a string s can represent any subset of these elements that is of size n n u n choose n number of such subsets Lower Bound We must be able to represent all possible sets using the distinct strings that we have distinct strings sets a string can represent sets in universe Bloom vs Lower Bound Probability of a false positive in Bloom filter f requires Bloom filter is within log2e of the lower bound space wise Keeping space constant using n j bits for table false positives are Lower Bound false positive rate 0 5 j Bloom Filter false positive rate 0 6185 j Hashing vs Bloom Filters Hashing each item in set is hashed into log n bits sorted list of hash values represents set very small error probabilities using 2log2n bits element probability that the hash of an element not in the set matches hash of an element in the set is at most 1 n Bloom filters generalization of hashing more tradeoffs between of bits used and false positives example constant false positive rate for constant number of bits element not interesting theoretically very important practically constant false positives are worth it to keep number of bits per element constant Bloom Filter Tricks Given two bloom filters representing sets S1 and S2 respectively S1 S2 OR of the bit vectors from the bloom filters S1 S2 can be approximated E magnitude of inner product Given S1 S2 k m and magnitude of inner product one can calculate an estimate of the S1 S2 Bloom filters can be easily halved in size OR the first and second halves when hashing to do a lookup mask the higher bit Counting Bloom Filters How do we handle sets that change over time insertion is easy compute hashes and set corresponding bits to 1 deletion is harder compute hashes and set corresponding bits to 0 problem what if we are unsetting a bit that is used by a different element in the set Counting Bloom filters each entry in the Bloom filter is a counter insertion compute hashes and increment corresponding counters deletion compute hashes and decrement corresponding counters Counting Bloom Filters Counter size should be large enough to prevent overflow How likely is overflow P a specific counter j P a specific counter j P any counter j m P a specific counter j Plugging in k ln 2 m n P any counter j Counting Bloom Filters Plugging in for 4 hash functions P max c i 16 1 37 10 15 m For a Bloom filter that holds t different sets of at most n elements probability of an overflow is 1 37 10 15 mt What happens when counter does overflow Just leave it at maximum value Only a problem if counter gets decremented all the way down to 0 Expected time for this to …


View Full Document

UMD CMSC 818S - Bloom Filters and Summary Cache

Loading Unlocking...
Login

Join to view Bloom Filters and Summary Cache and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Bloom Filters and Summary Cache and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?