Home CPSC 340

Bloom Filters

 

Introduction

Today we will look at the Bloom Filter, which is an interesting data structure developed by Burton Bloom in 1970. The goal of the data structure is to keep track of whether or not an element is in the structure or not.

There are a few things that make bloom filters different from the other data structures we have studied.

First off, bloom filters do not actually store any data. They just store whether the data is part of the data set or not. So we could not use a bloom filter to store a set of strings, but we could use it to remember which strings are part of the set and which are not.

Secondly, bloom filters are probabilistic. They are actually not going to give us the right answers 100% of the time. A bloom filter will actually give one of two answers when we check if a piece of data is part of the set or not:

  1. The data is probably part of the set.
  2. The data definitely is not part of the set.

The bloom filter can give "false positives" and tell us something is part of the set when it's not, but will never give us "false negatives", telling us something isn't part of the set when it actually is.

The goal of bloom filters is to give us the right answer to the question "Is this item in the set?" most of the time, while being very efficient in terms of both time and memory space.

A common use of bloom filters is to sit in front of another data structure and "filter out" missing items quickly. For example, if we want to search a database stored on disk for an item, we can first check a bloom filter. If the bloom filter says the item is not found, we know it's not in the database without having to check it (which would take much longer). If the bloom filter says it's probably there, then we will check it.


 

Main Idea

The way that a bloom filter works is similar to a hash table. Except instead of storing an array of values, we store an array of bits (or boolean values). Like a hash table, we will make this array quite a bit bigger than the number of values we expect to have. For instance, if we want to keep track of about 100 values, then we can make an array of 367 bits. We begin with all of the bits set to 0.

There is an array of several cells, all of which are storing 0
An array of bits

Like a hash table, we are going to use a hash function. But unlike a hash table, we are actually going to use several different hash functions. Suppose we want to add a string like "rhino" into the filter. What we will do is run "rhino" through, let's say, three hash functions, to get 3 different integers between 0 and 366:

The word rhino is being put into three different boxes labeled hash1, hash2,
and hash3.  hash1 produces the value 312.  hash2 produces 109 and hash3 produces 41.
Three different hash functions being used

Then what we will do is set the bits at these three indices to 1:

The array of bits is shown.  All cells are 0 except for cells at index 41, 109,
and 312, which are shown being set to 1
These bits are set to 1

Now the word "rhino" has been inserted into our bloom filter. Note that we are not actually storing the word itself any place, we are just setting three bits in an array to 1.

When it comes time to check if the word "rhino" is in the filter or not, we will again hash it by the same three hash functions and get the same three indices. Then we check to make sure they are all 1. If so, we say that the word is in the filter. If any of the bits are 0, we say it's not.

Just like a hash table, it's possible that we have some collisions. It's possible that a combination of words will set the three bits that a word hashes to, and we will say a word is in the filter when it really is not. These are the "false positives" that a bloom filter can produce. Like hash tables, we should design a bloom filter in such a way that collisions are rare.


 

Hash Functions

We can use any hash function for our bloom filter, but we need to make sure all three give different values. We need three different indices for each data element. For example, we could use the following three different hash strategies for a string:


private int hash1(String input) {
    int value = 3;

    for (int i = 0; i < input.length(); i++) {
        value = value * input.charAt(i);
    }

    return Math.abs(value) % size;
}

private int hash2(String input) {
    int value = 0;
    
    for (int i = 0; i < input.length(); i++) {
        value = (value*28) + input.charAt(i);
    }
    return Math.abs(value) % size;
}

private int hash3(String input) {
    int value = 0;

    for (int i = 0; i < input.length(); i++) {
        if( i == 0){
            value = value + input.charAt(i);
        }
        else{
            value = value + (input.charAt(i) * input.charAt(i - 1));
    
        }
    }
    return Math.abs(value) % size;
}

If we pass the word rhino to these three functions with a table size of 367, we get:


hash1("rhino") = 312
hash2("rhino") = 109
hash3("rhino") = 41

Of course it's possible that a particular word will produce the same hash value with these three functions, but it should be extremely rare. Just like with hash tables, our hash functions should also produce a uniform distribution of hash values.


 

Inserting Elements

To insert an element, we simply use the three hash values as indices into our array of bits, and set each one to 1. Here we are using an array of booleans, and set all three to true:


// insert a String into the filter
public void insert(String value) {
    // set all three positions
    bits[hash1(value)] = true;
    bits[hash2(value)] = true;
    bits[hash3(value)] = true;
}

 

Checking Elements

Checking to see if an element is in the filter is similarly easy. We just check if all the bits at all three hash locations are true:


// lookup an item
public boolean lookup(String value) {
    return bits[hash1(value)] &&
        bits[hash2(value)] &&
        bits[hash3(value)];
}

 

Example

A bloom filter class is available in BloomFilter.java. The class creates an array of boolean values of a given size, and contains the methods discussed above.

An example using this to read in a text file and report if the items were found or not is available in BloomTest.java. The program creates a large bloom filter, with 500,000 entries.

If we use it with a dictionary file containing 100,000 words, we will see that it works pretty well:

hello
Yes
testing
Yes
tower
Yes
beans
Yes
frankfurter
Yes
fakeword
No
jsdfhjs
No
bnbnn
No

In this test, all of the real words were found, and no incorrect ones were. That is because the hash functions work reasonably well, and our filter is much bigger than the number of words being stored.

However, the bloom filter is still quite space efficient. Because we only need to store 1 bit for every slot in the array, the 500,000 slots only really need about 62 kilobytes of space. Just storing the actual words themselves takes 917 kilobytes.


 

Big-O Analysis

The goal of bloom filters is to be small, fast, and be very (but not perfectly) accurate. We will address each of these to see how well it does.

First, let's look at the insert method again:


// insert a String into the filter
public void insert(String value) {
    // set all three positions
    bits[hash1(value)] = true;
    bits[hash2(value)] = true;
    bits[hash3(value)] = true;
}

The hash functions have loops in them, but they only loop through the words given, they don't loop through the whole array. For instance, in the example above there are 500,000 words but each one has a length of at most 23 letters (the longest word is "electroencephalograph's"). Most words are shorted than that. So here our $N$ is the number of words, not the length of a word. So the hash functions themselves are $O(1)$, they don't increase with the size of the table. Likewise, the insert method itself is $O(1)$.

Now let's look at the lookup method:


// lookup an item
public boolean lookup(String value) {
    return bits[hash1(value)] &&
        bits[hash2(value)] &&
        bits[hash3(value)];
}

Like insert, this method also takes a constant amount of time. No matter how many items are in the table, the time taken to insert one does not grow. Therefore it is $O(1)$.


 

Accuracy Analysis

Now let's consider the question of how man false positives we can expect our bloom filter to give us. We will assume that our hash functions produce a uniform distribution of values.

Let's say we have $B$ bits in our array, and $H$ hash functions. Then we insert $N$ values into the bloom filter. We want to know the probability that a value not in the filter will return true.

The probability that any one bit is still 0 after inserting all of our elements is given by:

$(1 - \frac{1}{B}) ^ {HN}$

That's because the chance of randomly picking a given bit is $\frac{1}{B}$. So the chance of it not being picked is $1 - \frac{1}{B}$. But then we pick $H \times N$ bits because each element is hashed by each hash function.

We can use this formula to find the probability that a bit did get set to 1 by taking the inverse:

$1 - (1 - \frac{1}{B}) ^ {HN}$

So that's the probability that any one bit is set in our bit array. But for a false positive, we need $H$ many bits to be set by happenstance, because we are hashing by each hash function on a lookup too. That gives us the probability of a false positive as:

$[1 - (1 - \frac{1}{B}) ^ {HN}] ^ H$


 

False Positive Calculator

Here is a calculator which uses this formula so you can test the probability of false positives given different bloom filter configurations:


False Positive Percentage =

By playing with these parameters, we can design a bloom filter which will give as few false positives as needed.


 

Applications

Copyright © 2019 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.