Bloom Filter

A bloom filter is a probabilistic data structure that is based on hashing. It is extremely space efficient and is typically used to add elements to a set and test if an element is in a set. Though, the elements themselves are not added to a set. Instead a hash of the elements is added to the set.

When testing if an element is in the bloom filter, false positives are possible. It will either say that an element is definitely not in the set or that it is possible the element is in the set.

A bloom filter is very much like a hash table in that it will use a hash function to map a key to a bucket. However, it will not store that key in that bucket, it will simply mark it as filled. So, many keys might map to same filled bucket, creating false positives.

Properties

An empty bloom filter is a bit array of \(m\) bits, all of which are initially set to zero. A bit array is an extremely space-efficient data structure that has each position in the array set to either 0 or 1.

A bloom filter also includes a set of \(k\) hash functions with which we hash incoming values. These hash functions must all have a range of 0 to \(m - 1\). If these hash functions match an incoming value with an index in the bit array, the bloom filter will make sure the bit at that position in the array is 1. Take a look at this gif showing the insertion of the strings "Hello" and "Bloom" into a bloom filter of 3 bits and 2 hash functions.

Bloom Filter

In this example, "Hello" was hashed to 1 by the first hash function and 3 by the second hash function. So, the bloom filter made sure the bits at index 1 and 3 were flipped to 1. Then, "Bloom" was hashed to 1 and 2. The bloom filter made sure those were both a 1 as well (even though position 1 already had a 1).

When a query happens in a bloom filter, we once again hash the key with all \(k\) of our hash functions. We then check all of the output bits to make sure they are all 1. If any of them is a 0, we know for sure that the key we are searching for is not in our list. If they are all 1, we know that it might be.

False Positives

Let's think about false positives in the bloom filter in the gif. A false positive is when we query the bloom filter to see if a certain word is in it, and it tells us that it is even when it isn't. Imagine that "House" also is hashed to the two indexes 1 and 3? What would happen if we put "Hello" and "Bloom" into our filter, then asked if "House" was inside our filter?

The bloom filter would tell us that "House" might be in it. This is because the bits that would have been activated when "House" was inserted are already flipped to 1. The bloom filter might also tell us a percent certainty, but we'll get into that in the next section.

False Positives Analysis

We have two choices of parameters when building a bloom filter, \(m\) and \(k\). They should each be chosen to dampen the number of false positives as much as possible while still maintaining whatever space requirement the filter needs.

If we have a bloom filter with \(m\) bits and \(k\) hash functions, the probability that a certain bit will still be zero after one insertion is

\[(1 - 1/m)^{k} .\]

Then, after \(n\) insertions, the probability of it still being zero after \(n\) insertions is

\[(1 - 1/m)^{nk} .\]

So, that means the probability of a false positive is

\[(1 - (1 - 1/m)^{nk})^{k} \]

because we are looking for the probability that \(k\) different bits were flipped to 1. This is approximately equal to

\[(1 - e^{-kn/m})^{k} .\]

In each of these equations, raising the value of k (the number of hash functions) will make the probability of a false positive less likely. However, it is not computationally efficient to have an enormous value for k. To minimize this equation, we must choose the best \(k\). We do it this way because we assume that the programmer has already chosen an \(m\) based on their space constraints and that they have some idea what their potential \(n\) will be. So the \(k\) value that minimizes that equation is

\[k = \ln(2) \cdot m/n .\]

Time and Space Complexity

A bloom filter is extremely efficient in both time and space usage. These features are so important that it gives up accuracy to maintain them.

Time

If we are using a bloom filter with \(m\) bits and \(k\) hash function, insertion and search will both take \(O(k)\) time. In both cases, we just need to run the input through all of the hash functions. Then we just check the output bits.

Operation	Complexity
insertion	\(O(k)\)
search	\(O(k)\)

Note that this is one of the few data structures whose time complexity does not at all depend on the number of elements in it. This is because the elements never really enter the structure, only their hashes.

Space

To truly understand the space complexity of the bloom filter, you have to first choose your parameters. You could make a bloom filter with \(k = 1\) and it would just be a hash table that ignores collisions. However, you would have a very large \(m\) if you wanted to keep your false positive rate low. The space of the actual data structure (what holds the data) is simply \(O(m)\).

	Complexity
space	\(O(m)\)

Sample Python Implementation

Below is a python implementation. The bit array component is omitted for simplicity; here it is just an array initialized to all zeros. There are only two hash function that are written for the sake of brevity. This implementation is for academic purposes only.

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.data = [0]*m
        self.n = 0
    def insert(self, element):
        if self.k == 1:
            hash1 = h1(element) % self.m
            self.data[hash1] = 1
        elif self.k == 2:
            hash1 = h1(element) % self.m
            hash2 = h2(element) % self.m
            self.data[hash1] = 1
            self.data[hash2] = 1
        self.n += 1
    def search(self, element):
        if self.k == 1:
            hash1 = h1(element) % self.m
            if self.data[hash1] == 0:
                return "Not in Bloom Filter"
        elif self.k == 2:
            hash1 = h1(element) % self.m
            hash2 = h2(element) % self.m
            if self.data[hash1] == 0 or self.data[hash2] == 0:
                return "Not in Bloom Filter"
        prob = (1.0 - ((1.0 - 1.0/self.m)**(self.k*self.n))) ** self.k
        return "Might be in Bloom Filter with false positive probability "+str(prob)

def h1(w):
    h = hashlib.md5(w)
    return hash(h.digest().encode('base64')[:6])%10

def h2(w):
    h = hashlib.sha256(w)
    return hash(h.digest().encode('base64')[:6])%10

Testing Out the Bloom Filter

Let's use the above to see how the bloom filter would look once we start inserting elements.

>>> bf = BloomFilter(10, 1)
>>> bf.insert("Hello")
>>> bf.data
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

As you can see, the array has flipped just one of the 0's to a 1 because we are only using one hash function in this instance.

>>> bf.search("Hello")
'Might be in Bloom Filter with false positive probability 0.1'
>>> bf.search("No Way")
'Not in Bloom Filter'
>>> bf.search("ge")
'Might be in Bloom Filter with false positive probability 0.1'

Here, the string "No Way" does not hash to the same index as "Hello". However, the string "ge" does. Let's try this with more hash functions.

>>> bf = BloomFilter(10, 2)
>>> bf.insert("Hello")
>>> bf.data
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

Now we have two indexes flipped to one because we are using two hash functions.

>>> bf.search("Hello")
'Might be in Bloom Filter with false positive probability 0.0361'
>>> bf.search("ge")
'Not in Bloom Filter'

Now that we're using more hash functions, the probability that we have a false positive drops. Plus, we no longer recognize "ge" as a false positive.

Contents

False Positives