Efficiently searchable data is important for many applications. We have seen a few data structures capable of searching. The following table shows the efficiency of searching and inserting into three data structures:

| Data Structure | Searching | Insertion |
|---|---|---|
| Unsorted Array/Linked List | $O(N)$ | $O(1)$ |
| Sorted Array | $O(\log N)$ | $O(N)$ |
| Binary Search Tree | $O(\log N)$ | $O(\log N)$ |

As we discussed last time, binary search trees excel in this area. However,
there is actually a data structure that can do better (with some caveats we will
address later): the *hash table*.

The absolute best case scenario for this would be if searching and insertion were $O(1)$. This is the goal of hash tables.

The idea behind a hash table is that we make an array for our items. Then we put each item into a slot whose position we can compute again later, so we know exactly where to look when we need it.

For example, if we have the people "Alice", "Bob", and "Claire", then we can put "Alice" in slot 0, "Bob" in slot 1 and "Claire" in slot 2. Then if we need to look up one of their numbers, we can just look in the same place we stored them.

A hash table has *keys* and *values*. The keys are whatever we
will use to search. In a phonebook program, the keys will be names.

The values are the actual data we want to store inside of the hash table. In a phonebook program, this would be the phone numbers and addresses.

The problem, however, is deciding which slot each piece of data should go in.

The way we decide where to put each entry is by using a *hash function*.
A hash function takes a key and produces a number, and that number is used
as an index into the array.

The hash function must accept any possible key and return an integer index at which to store it.

The hash function must always return the same number for the same object. That way, when we store the object, we will put it into a given location. Then, when we are searching for it, all we will need to do is run the hash function which will tell us which slot to look in.

If the keys are integers, then the problem is fairly easy. We can just use the keys directly as indices into the table.

However, one issue we have to contend with is making sure that the indices we use are never out of range of the table. For example, if we are using social security numbers as our keys, then the keys will be 9-digit numbers. To use a 9-digit index, the table would need to be one billion entries big.

To avoid this, we can do two things. First, we take the absolute value of the number, which forces it to be positive (not a problem for social security numbers, but possible for other integers). Next, we take the result modulo the size of the table, which ensures the number is between 0 and the table size minus one:

```
public static int hash(int key) {
    return Math.abs(key) % TABLE_SIZE;
}
```

But what about using strings as keys? There are many strategies we can use. A simple one is to add up the ASCII values of each character in the string, and then take that total modulo the size of the table. Code to do this in Java is shown below:

```
private static int hash(String name) {
    int value = 0;
    // add up all the ASCII values
    for (int i = 0; i < name.length(); i++) {
        value = value + name.charAt(i);
    }
    // mod by the table size to keep the index in range
    return value % TABLE_SIZE;
}
```

We can use these ideas to build a simple Hash Table example. This example will create a hash table where the keys are names (a String) and the values are phone numbers (another String). We start by making a PhoneBook class, which stores an array of 117 Strings:

```
class PhoneBook {
    private static final int TABLE_SIZE = 117;
    private String[] table;

    public PhoneBook() {
        table = new String[TABLE_SIZE];
    }
```

Next, we have an insert method which uses the hash function for Strings given above to insert a phone number into the table, using the name as the key:

```
// insert a phone number under a name
public void insert(String name, String number) {
    int index = hash(name);
    table[index] = number;
}
```

We also have a method to look up a phone number given a name:

```
// look up a name
public String lookup(String name) {
    // get the index it should be at
    int index = hash(name);
    // return the data
    return table[index];
}
```

The full code for this example is given in Hashing1.java.
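A quick usage sketch of the PhoneBook idea, restated as one self-contained class (the names and phone numbers here are made-up example data):

```java
// Self-contained sketch of the PhoneBook above, with a small demo in main.
// The phone numbers are hypothetical example data.
public class PhoneBookDemo {
    private static final int TABLE_SIZE = 117;
    private String[] table = new String[TABLE_SIZE];

    // sum the ASCII values, then mod by the table size
    private static int hash(String name) {
        int value = 0;
        for (int i = 0; i < name.length(); i++) {
            value = value + name.charAt(i);
        }
        return value % TABLE_SIZE;
    }

    public void insert(String name, String number) {
        table[hash(name)] = number;
    }

    public String lookup(String name) {
        return table[hash(name)];
    }

    public static void main(String[] args) {
        PhoneBookDemo book = new PhoneBookDemo();
        book.insert("Alice", "555-1234");
        book.insert("Bob", "555-5678");
        System.out.println(book.lookup("Alice"));   // prints 555-1234
        System.out.println(book.lookup("Bob"));     // prints 555-5678
    }
}
```

Notice that looking up a name repeats the exact computation used to store it, which is why we find the number in one step.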

Looking at the insert and search methods above, you will notice that they have no loops in them. They both call the hash function, which does loop through the characters of the String it is given.

However, the keys of a hash table are usually of a fixed or relatively small size. For instance, nearly all names are less than, say, 100 characters, and a social security number is always nine digits. For a hash table, $N$ is the number of items stored, not the length of the keys. Since the hash function's running time depends only on the key length, it does not grow with $N$, and we can say both the insert and lookup methods are $O(1)$.

The example above has an issue, which is that it doesn't handle *collisions*.
A collision occurs when the hash function returns the same index for two different keys.
In our simple hash method, this happens (among other cases) whenever two names have
the same letters, just rearranged. For instance, "Elena" and "Elane" are given the same hash code.
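We can verify this with the string hash function from earlier, packaged as a small standalone sketch:

```java
// Demonstration that anagrams collide under the ASCII-sum hash.
public class CollisionDemo {
    private static final int TABLE_SIZE = 117;

    // the same ASCII-sum hash function from earlier
    static int hash(String name) {
        int value = 0;
        for (int i = 0; i < name.length(); i++) {
            value = value + name.charAt(i);
        }
        return value % TABLE_SIZE;
    }

    public static void main(String[] args) {
        // both names contain the same letters, so their ASCII sums are equal,
        // and they map to the same slot
        System.out.println(hash("Elena"));
        System.out.println(hash("Elane"));
    }
}
```

Both calls print the same index, so whichever name is inserted second would overwrite the first in the simple table above.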

Because the possible number of keys is typically much larger than the number of valid indices, collisions are always possible.

One way of handling collisions is to check whether the slot we want to store an element in is occupied.
If it is, we move on to the next slot, repeating until we find an empty one to insert our data.
This is called *linear probing* since we go in a line looking for an empty slot.

For example, we can rewrite our insert method like this:

```
// insert a person
public void insert(String name, String number) {
    // get the index by hashing the key
    int index = hash(name);
    // linear probe until we find an empty slot
    while (table[index] != null) {
        index = (index + 1) % TABLE_SIZE;
    }
    // now store it
    table[index] = number;
}
```

Likewise, we will need to handle this in the lookup method as well. However, right now we are only storing the values (phone numbers) in the table. As it is, we won't really have any way of knowing if the number in the table is right!

For example, if we are given "Elane" as the key, and get the hash index, and then find a number there, we won't know if that's Elane's number or Elena's.

For that reason, we will need to store the values *and* the keys in our
hash table. Then we can ensure that the keys match before returning the data.

This example extends the simpler example above in a few ways:

- It implements storing keys and values, so we can use linear probing as discussed above.
- It uses Java's generics, so we can put in any type of key and value.
- It uses Java's built-in .hashCode() method on the keys. This is a method all objects in Java have.

We start by declaring the generic class, and making arrays for storing both the keys and values:

```
public class HashTable<Key, Value> {
    private Value[] values;
    private Key[] keys;
    private int maxSize;

    public HashTable(int size) {
        maxSize = size;
        // make the arrays, with the workaround for Java not allowing generic arrays
        values = (Value[]) new Object[maxSize];
        keys = (Key[]) new Object[maxSize];
    }
```

Here we also allow the user to pass in the size of the table.

Next, we will write code to insert a new key/value pair.

```
// insert a key/value pair
public void insert(Key key, Value value) {
    // get the index by hashing the key
    int index = Math.abs(key.hashCode()) % maxSize;
    // linear probe until we find an empty slot
    while (values[index] != null) {
        index = (index + 1) % maxSize;
    }
    // insert the value and key here
    values[index] = value;
    keys[index] = key;
}
```
```

This code does linear probing to find an empty slot, then inserts both the key and value at that index. Notice we use the key's .hashCode() method, and ensure the result is a valid index into our array.

Finally, we will look up a value given its key:

```
// look up a value by its key
public Value lookup(Key key) {
    // the index it should be at
    int index = Math.abs(key.hashCode()) % maxSize;
    // linear probe until the key matches, or we are sure it's not here
    while (true) {
        // if this slot is null, the key must not be present
        if (keys[index] == null) {
            return null;
        }
        // if it's equal, we found it
        if (keys[index].equals(key)) {
            return values[index];
        }
        // move to the next slot
        index = (index + 1) % maxSize;
    }
}
```

Here we get the hashed index using the same scheme as when inserting. Then we begin looping at that index. As we go, we check if the key in this slot equals the key we are looking for. If so, we return the associated value. We also check if this slot is null. If so, we must not be storing this key, so we return null.

The code for this is available in HashTable.java.
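Here is the class again as one self-contained sketch, with a short demo in main (the names and numbers are made-up example data):

```java
// Self-contained version of the generic HashTable above, plus a short demo.
public class HashTable<Key, Value> {
    private Value[] values;
    private Key[] keys;
    private int maxSize;

    @SuppressWarnings("unchecked")
    public HashTable(int size) {
        maxSize = size;
        // workaround for Java not allowing generic arrays
        values = (Value[]) new Object[maxSize];
        keys = (Key[]) new Object[maxSize];
    }

    public void insert(Key key, Value value) {
        int index = Math.abs(key.hashCode()) % maxSize;
        // linear probe until we find an empty slot
        while (values[index] != null) {
            index = (index + 1) % maxSize;
        }
        values[index] = value;
        keys[index] = key;
    }

    public Value lookup(Key key) {
        int index = Math.abs(key.hashCode()) % maxSize;
        while (true) {
            if (keys[index] == null) {
                return null;       // not stored here
            }
            if (keys[index].equals(key)) {
                return values[index];
            }
            index = (index + 1) % maxSize;
        }
    }

    public static void main(String[] args) {
        HashTable<String, String> phones = new HashTable<>(117);
        phones.insert("Elena", "555-1111");
        phones.insert("Elane", "555-2222");
        // even if two keys collide, probing keeps them in separate slots
        System.out.println(phones.lookup("Elena"));   // prints 555-1111
        System.out.println(phones.lookup("Elane"));   // prints 555-2222
    }
}
```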

One problem with this code is that we don't handle the case when the array is full. As we'll discuss, we wouldn't really want to use a hash table if we think this is likely.
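One possible guard, sketched below, is to keep a count of occupied slots and refuse inserts once the table is full (the class and names here are illustrative, not part of the original example):

```java
// Sketch: an insert that tracks how many slots are used and
// throws once the table is full. Names here are hypothetical.
public class BoundedTable {
    private String[] keys;
    private String[] values;
    private int maxSize;
    private int count = 0;   // number of occupied slots

    public BoundedTable(int size) {
        maxSize = size;
        keys = new String[maxSize];
        values = new String[maxSize];
    }

    public void insert(String key, String value) {
        // without this check, probing a full table would loop forever
        if (count == maxSize) {
            throw new IllegalStateException("hash table is full");
        }
        int index = Math.abs(key.hashCode()) % maxSize;
        while (values[index] != null) {
            index = (index + 1) % maxSize;
        }
        keys[index] = key;
        values[index] = value;
        count++;
    }
}
```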

Another approach to resolving collisions is called *chaining*. Here we keep the hash table as an array of linked lists. Each time we want to insert an item, we hash the key, which tells us which slot in the array to use. Then the item is appended to that slot's linked list.

Chaining has the advantage that the hash table can never truly fill up; it just becomes less efficient as it grows (because we have to search through the lengthening lists).

However, searching through a linked list is slower than searching through an array. Also, if you expect the hash table to get close to full, you should not be using a hash table in the first place (since you will not achieve $O(1)$ performance then). For these reasons, linear probing is generally more widely used.
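The chaining idea can be sketched like this (the Entry class and names here are illustrative, not from the original code):

```java
import java.util.LinkedList;

// Sketch of chaining: each table slot holds a linked list of entries.
public class ChainedTable {
    private static class Entry {
        String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry>[] table;
    private int maxSize;

    @SuppressWarnings("unchecked")
    public ChainedTable(int size) {
        maxSize = size;
        table = new LinkedList[maxSize];
        for (int i = 0; i < maxSize; i++) {
            table[i] = new LinkedList<>();
        }
    }

    public void insert(String key, String value) {
        int index = Math.abs(key.hashCode()) % maxSize;
        // append to the list in this slot -- no probing needed
        table[index].add(new Entry(key, value));
    }

    public String lookup(String key) {
        int index = Math.abs(key.hashCode()) % maxSize;
        // scan the (hopefully short) list in this slot
        for (Entry e : table[index]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Note that inserts never fail here, but a lookup degrades toward a plain linked-list search as the lists grow.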

The performance of a hash table heavily depends on the size of the table, and how good the hash function is.

In the worst case, your hash function returns the same index every time, and you have to essentially search through the whole array each time (or linked list if using chaining). That would be $O(N)$ time.

In the best case, the hash function will return a different code for each key, so all insertions and lookups are $O(1)$. It really only makes sense to use hash tables in situations when you can guarantee there will be few collisions.

You do this by making sure the hash table is bigger than the maximum amount of data being stored. It also works out better if the size of the hash table is a prime number. For example, if we wanted to store 5000 values, we could pick 71411 as the table size. This is a prime number and gives us enough empty space that there will be few runs of collisions.

The hash function should also provide a uniform distribution of indices, so that no slot is much more likely to be chosen than any other.

With a large table, and a well-designed hash function, there will be minimal collisions, so searching a hash table is $O(1)$ in general.

Another method for resolving collisions is called *rehashing*.
With this technique, we have a second hash function that is used if the first one causes a collision.
If the second one also causes a collision, then we resort to linear probing or chaining.
Rehashing helps when the first hash function produces a "spike" of keys in one
particular area of the table: running those keys through a second hash function can help smooth this out.
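A sketch of this idea for integer keys follows; both hash functions here are made up for illustration:

```java
// Sketch of rehashing with two hash functions for int keys.
// If the first hash's slot is taken, try the second hash;
// if that is also taken, fall back to linear probing.
public class RehashTable {
    private static final int TABLE_SIZE = 13;
    private int[] keys = new int[TABLE_SIZE];
    private boolean[] used = new boolean[TABLE_SIZE];

    private static int hash1(int key) {
        return Math.abs(key) % TABLE_SIZE;
    }

    // a second, different hash function (chosen arbitrarily for this sketch)
    private static int hash2(int key) {
        return Math.abs(key * 7 + 3) % TABLE_SIZE;
    }

    public void insert(int key) {
        int index = hash1(key);
        if (used[index]) {
            index = hash2(key);       // rehash on collision
        }
        while (used[index]) {         // fall back to linear probing
            index = (index + 1) % TABLE_SIZE;
        }
        keys[index] = key;
        used[index] = true;
    }

    public boolean contains(int key) {
        // check the first hash position directly
        if (used[hash1(key)] && keys[hash1(key)] == key) {
            return true;
        }
        // then probe linearly starting from the second hash position
        int index = hash2(key);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (!used[index]) {
                return false;
            }
            if (keys[index] == key) {
                return true;
            }
            index = (index + 1) % TABLE_SIZE;
        }
        return false;
    }
}
```

Lookups must repeat the same two-step scheme as inserts, otherwise a rehashed key would never be found.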

Hash tables require more tweaking than most other data structures. The size of the table must be carefully chosen: if the table fills up too much, it slows down. Also, unlike most data structures, hash tables cannot easily be resized, since the hash function would then return different indices (because we take the index modulo the table size).

The hash function also must be carefully chosen. If it does a poor job, performance will suffer.

If all of the data is known ahead of time, one can write a *perfect*
hash function that assigns a different index to each piece of data. Generally
though, the hash function doesn't know what data will be used ahead of time,
and must try to use as uniform a distribution as possible.

If the hash table is well-tuned, it is the fastest way to store searchable data.

Copyright © 2019 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.