Today we will look at the *Bloom Filter*, which is an interesting
data structure developed by Burton Bloom in 1970. The goal of the data
structure is to keep track of whether or not an element is in the structure or
not.

There are a few things that make bloom filters different from the other data structures we have studied.

First off, bloom filters do not actually store any data. They just store whether the data is part of the data set or not. So we could not use a bloom filter to store a set of strings, but we could use it to remember which strings are part of the set and which are not.

Secondly, bloom filters are *probabilistic*. They are actually not
going to give us the right answers 100% of the time. A bloom filter will
actually give one of two answers when we check if a piece of data is part of
the set or not:

- The data is
*probably*part of the set. - The data definitely is not part of the set.

The bloom filter can give "false positives" and tell us something is part of the set when it's not, but will never give us "false negatives", telling us something isn't part of the set when it actually is.

The goal of bloom filters is to give us the right answer to the question "Is this item in the set?" most of the time, while being very efficient in terms of both time and memory space.

A common use of bloom filters is to sit in front of another data structure and "filter out" missing items quickly. For example, if we want to search a database stored on disk for an item, we can first check a bloom filter. If the bloom filter says the item is not found, we know it's not in the database without having to check it (which would take much longer). If the bloom filter says it's probably there, then we will check it.

The way that a bloom filter works is similar to a hash table. Except instead of storing an array of values, we store an array of bits (or boolean values). Like a hash table, we will make this array quite a bit bigger than the number of values we expect to have. For instance, if we want to keep track of about 100 values, then we can make an array of 367 bits. We begin with all of the bits set to 0.

Like a hash table, we are going to use a hash function. But unlike a hash
table, we are actually going to use *several* different hash functions.
Suppose we want to add a string like "rhino" into the filter. What we will do
is run "rhino" through, let's say, three hash functions, to get 3 different
integers between 0 and 366:

Then what we will do is set the bits at these three indices to 1:

Now the word "rhino" has been inserted into our bloom filter. Note that we are not actually storing the word itself any place, we are just setting three bits in an array to 1.

When it comes time to check if the word "rhino" is in the filter or not, we
will again hash it by the same three hash functions and get the same three
indices. Then we check to make sure they are all 1. If so, we say that the
word *is* in the filter. If any of the bits are 0, we say it's
*not*.

Just like a hash table, it's possible that we have some collisions. It's possible that a combination of words will set the three bits that a word hashes to, and we will say a word is in the filter when it really is not. These are the "false positives" that a bloom filter can produce. Like hash tables, we should design a bloom filter in such a way that collisions are rare.

We can use any hash function for our bloom filter, but we need to make sure all three give different values. We need three different indices for each data element. For example, we could use the following three different hash strategies for a string:

```
private int hash1(String input) {
int value = 3;
for (int i = 0; i < input.length(); i++) {
value = value * input.charAt(i);
}
return Math.abs(value) % size;
}
private int hash2(String input) {
int value = 0;
for (int i = 0; i < input.length(); i++) {
value = (value*28) + input.charAt(i);
}
return Math.abs(value) % size;
}
private int hash3(String input) {
int value = 0;
for (int i = 0; i < input.length(); i++) {
if( i == 0){
value = value + input.charAt(i);
}
else{
value = value + (input.charAt(i) * input.charAt(i - 1));
}
}
return Math.abs(value) % size;
}
```

If we pass the word rhino to these three functions with a table size of 367, we get:

```
hash1("rhino") = 312
hash2("rhino") = 109
hash3("rhino") = 41
```

Of course it's *possible* that a particular word will produce the
same hash value with these three functions, but it should be extremely rare.
Just like with hash tables, our hash functions should also produce a uniform
distribution of hash values.

To insert an element, we simply use the three hash values as indices into our array of bits, and set each one to 1. Here we are using an array of booleans, and set all three to true:

```
// insert a String into the filter
public void insert(String value) {
// set all three positions
bits[hash1(value)] = true;
bits[hash2(value)] = true;
bits[hash3(value)] = true;
}
```

Checking to see if an element is in the filter is similarly easy. We just check if all the bits at all three hash locations are true:

```
// lookup an item
public boolean lookup(String value) {
return bits[hash1(value)] &&
bits[hash2(value)] &&
bits[hash3(value)];
}
```

A bloom filter class is available in BloomFilter.java. The class creates an array of boolean values of a given size, and contains the methods discussed above.

An example using this to read in a text file and report if the items were found or not is available in BloomTest.java. The program creates a large bloom filter, with 500,000 entries.

If we use it with a dictionary file containing 100,000 words, we will see that it works pretty well:

helloYestestingYestowerYesbeansYesfrankfurterYesfakewordNojsdfhjsNobnbnnNo

In this test, all of the real words were found, and no incorrect ones were. That is because the hash functions work reasonably well, and our filter is much bigger than the number of words being stored.

However, the bloom filter is still quite space efficient. Because we only need to store 1 bit for every slot in the array, the 500,000 slots only really need about 62 kilobytes of space. Just storing the actual words themselves takes 917 kilobytes.

The goal of bloom filters is to be small, fast, and be very (but not perfectly) accurate. We will address each of these to see how well it does.

First, let's look at the insert method again:

```
// insert a String into the filter
public void insert(String value) {
// set all three positions
bits[hash1(value)] = true;
bits[hash2(value)] = true;
bits[hash3(value)] = true;
}
```

The hash functions have loops in them, but they only loop through the words given, they don't loop through the whole array. For instance, in the example above there are 500,000 words but each one has a length of at most 23 letters (the longest word is "electroencephalograph's"). Most words are shorted than that. So here our $N$ is the number of words, not the length of a word. So the hash functions themselves are $O(1)$, they don't increase with the size of the table. Likewise, the insert method itself is $O(1)$.

Now let's look at the lookup method:

```
// lookup an item
public boolean lookup(String value) {
return bits[hash1(value)] &&
bits[hash2(value)] &&
bits[hash3(value)];
}
```

Like insert, this method also takes a constant amount of time. No matter how many items are in the table, the time taken to insert one does not grow. Therefore it is $O(1)$.

Now let's consider the question of how man false positives we can expect our bloom filter to give us. We will assume that our hash functions produce a uniform distribution of values.

Let's say we have $B$ bits in our array, and $H$ hash functions. Then we insert $N$ values into the bloom filter. We want to know the probability that a value not in the filter will return true.

The probability that any one bit is still 0 after inserting all of our elements is given by:

$(1 - \frac{1}{B}) ^ {HN}$

That's because the chance of randomly picking a given bit is $\frac{1}{B}$. So
the chance of it *not* being picked is $1 - \frac{1}{B}$. But then we
pick $H \times N$ bits because each element is hashed by each hash function.

We can use this formula to find the probability that a bit *did* get set
to 1 by taking the inverse:

$1 - (1 - \frac{1}{B}) ^ {HN}$

So that's the probability that any one bit is set in our bit array. But for a false positive, we need $H$ many bits to be set by happenstance, because we are hashing by each hash function on a lookup too. That gives us the probability of a false positive as:

$[1 - (1 - \frac{1}{B}) ^ {HN}] ^ H$

Here is a calculator which uses this formula so you can test the probability of false positives given different bloom filter configurations:

False Positive Percentage =

By playing with these parameters, we can design a bloom filter which will give as few false positives as needed.

- Database systems including PostreSQL use bloom filters to filter out queries that are not in the database. Because searching the bloom filter is faster, it avoids searching the database when the bloom filter guarantees the record won't be found.
- The Chrome web browser uses bloom filters to check if URLs the user goes to are marked as malicious. Because most URLs are safe, the use of a local check in a bloom filter means that the browser only needs to query an online database on the (rare) times the bloom filter reports that the URL might be malicious.
- The Medium blog platform uses bloom filters to quickly check if a user (probably) has already read an article before recommending it.

Copyright © 2019 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.