Hash Tables
In an ideal world, we would be able to access data directly using the data's unique identifier (key).
For this to work, we need to be able to generate a hash code from the key. From this hash code (a number) we then want to get an index into a hash table by using a hash function. For this approach to work, two conditions must be met. Firstly, we need to be able to check whether two objects are the same (the equals
function). Secondly, we need to be able to generate a hash code from the unique identifier, which can consist of a combination of attributes or just one.
Importantly, the following must be true:

$$x.\text{equals}(y) \Rightarrow x.\text{hashCode}() = y.\text{hashCode}()$$

So if two objects are the same, then their hash code must be the same as well. However, if two hash codes are the same, it does not necessarily mean that the objects are the same; this is a so-called collision.
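To illustrate this contract, here is a minimal sketch of a hypothetical class whose identity is defined by a single id attribute (the class and its field are made up for illustration); equal objects are guaranteed to produce equal hash codes:

import java.util.Objects;

public class Person {
    private final String id; // hypothetical unique identifier

    public Person(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        return id.equals(((Person) o).id);
    }

    @Override
    public int hashCode() {
        // Derived from the same attribute as equals, so the contract holds
        return Objects.hash(id);
    }
}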
Hashing Function
We want to be able to calculate the index as fast as possible. From the above requirements, we also want the same keys to produce the same indices. We also want the hash codes and therefore the indices to be evenly distributed to minimize collisions.
For starters, we could use the following hashing function, which clears the sign bit of the hash code and then takes it modulo the table length:

(x.hashCode() & 0x7FFFFFFF) % length
Hash Code
We want the generated hash code to be random and, if possible, evenly spread across the entire range of possible numbers.
If the unique identifier is a 32-bit data type, like boolean, byte, short, int, char or float, we can just take its value straight as an int.
If the unique identifier is a 64-bit data type, like long or double, we can use an exclusive or (XOR, only true if the bits are different) between the two 32-bit parts:
public int hashCode() {
    // XOR of the upper and lower 32-bit parts of the 64-bit value
    return (int) (value ^ (value >>> 32));
}
For strings, it gets a bit harder. You might think it would be a good idea to add the characters represented as integers together. However, this is a very bad idea because, for example, AUS and USA would then have the same hash code. Instead, we create a polynomial using the character values as coefficients:

$$h(s) = s[0] \cdot 31^{n-1} + s[1] \cdot 31^{n-2} + \dots + s[n-1]$$
public final class String {
    private final char value[];
    /** Cache the hash code for the string, to avoid recalculation */
    private int hash; // Default to 0
    ...
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;
            // Horner's method: evaluates the polynomial in a single pass
            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
    ...
}
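As a quick sketch of why summing characters is a bad idea (the sumHash helper is made up for illustration), the anagrams collide under the additive scheme but not under the polynomial one:

public class AnagramDemo {
    // Naive additive hash: anagrams always collide
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h += s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("AUS") == sumHash("USA"));      // true, collision
        System.out.println("AUS".hashCode() == "USA".hashCode()); // false
    }
}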
HashMap
In Java, a HashMap always has a size equal to a power of 2. In the worst case, this leads to the map reserving twice as much memory as it needs. However, the advantage of this implementation is that powers of 2 are very cheap to calculate with bit shifts. It also allows us to change the hash function (hashCode() & 0x7FFFFFFF) % length
to hashCode() & (length - 1)
. The bitmask with 0x7FFFFFFF
ensures that the hash code is positive; with a power-of-2 length, the bitmask length - 1 keeps only the lower bits and therefore performs the modulo operation in a single step.
public HashMap(int initialCapacity) {
    // Round the capacity up to the next power of 2
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    table = new Entry[capacity];
}

private int indexFor(int h) {
    // Equivalent to a modulo because the table length is a power of 2
    return h & (table.length - 1);
}
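As a small sanity check, a sketch verifying that the bitmask really computes the same index as the modulo variant whenever the length is a power of 2 (the sample hash codes are arbitrary):

public class IndexDemo {
    public static void main(String[] args) {
        int length = 16; // power of 2
        for (int h : new int[]{42, 12345, -7, Integer.MIN_VALUE}) {
            int modIndex  = (h & 0x7FFFFFFF) % length;
            int maskIndex = h & (length - 1);
            System.out.println(modIndex == maskIndex); // always true
        }
    }
}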
Collision Resolution
As mentioned before, collisions occur when different objects have the same hash code and therefore the same index. This can happen and can't be avoided, which is why collisions need to be handled.
Separate Chaining
With this strategy, when there is a collision, the colliding elements are chained together, just like in a linked list. The advantage of this strategy is that it is very simple and the table never becomes full. The problem, however, is that it needs additional memory and that this memory needs to be allocated dynamically.
The class for a HashMap would then look something like this:
public class HashMap<K, V> implements Map<K, V> {
    Node<K, V>[] table;
    ...
    static class Node<K, V> implements Map.Entry<K, V> {
        final K key;
        V value;
        Node<K, V> next; // next node in the chain
        ...
    }
}
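As a sketch of how the chains are walked, assuming the simplified Node class above gains a constructor Node(key, value, next) (my assumption; the real java.util.HashMap additionally caches the hash), lookup and insertion could look like this:

public V get(Object key) {
    int i = (key.hashCode() & 0x7FFFFFFF) % table.length;
    // Walk the chain at index i until the key is found
    for (Node<K, V> n = table[i]; n != null; n = n.next) {
        if (key.equals(n.key)) return n.value;
    }
    return null;
}

public void put(K key, V value) {
    int i = (key.hashCode() & 0x7FFFFFFF) % table.length;
    for (Node<K, V> n = table[i]; n != null; n = n.next) {
        if (key.equals(n.key)) { n.value = value; return; } // replace existing
    }
    // Not found: prepend a new node to the chain
    table[i] = new Node<>(key, value, table[i]);
}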
If the table has the size $m$ and we insert $n$ elements, we can calculate the probability of there being no collision using the following formula:

$$P(\text{no collision}) = \frac{m}{m} \cdot \frac{m-1}{m} \cdots \frac{m-n+1}{m} = \prod_{i=0}^{n-1} \frac{m-i}{m}$$

From this we can then also calculate the probability of there being at least 1 collision:

$$P(\text{at least 1 collision}) = 1 - \prod_{i=0}^{n-1} \frac{m-i}{m}$$
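As a sketch, evaluating this formula for the classic birthday-paradox numbers (a table of size m = 365 and n = 23 insertions, chosen for illustration) already gives a collision probability of over 50%:

public class CollisionProbability {
    // P(at least one collision) when inserting n elements into a table of size m
    static double collisionProbability(int m, int n) {
        double pNoCollision = 1.0;
        for (int i = 0; i < n; i++) {
            pNoCollision *= (double) (m - i) / m;
        }
        return 1.0 - pNoCollision;
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(365, 23)); // ~0.507
    }
}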
Open Addressing
With this strategy, when there is a collision, we look for a free space in the hash table. The advantage of this strategy is that it does not need any additional space; however, the table can become full. The process of finding a free space is called probing.
Linear Probing
When using linear probing, we try the next highest index until we find a free space. If we reach the end of the table, we restart the search at index 0; if we arrive back at the initial area of the collision, the table is full.
So if the hash code is $x$ and the table has the size $m$, the index after $j$ collisions is:

$$i_j = (x + j) \bmod m$$
public void add(T elem) {
    int i = (elem.hashCode() & 0x7FFFFFFF) % size;
    // Probe linearly until a free slot is found
    while (array[i] != null)
        i = (i + 1) % size;
    array[i] = elem;
}
The above code, however, checks neither whether the hash table should only hold unique values (set semantics) nor whether the table is already full; a more defensive version is sketched below. Another problem with this strategy is that clusters of values can form. When adding a value to a cluster, you make the cluster even bigger and therefore also increase the probability of hitting a cluster.
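A more defensive version of add, as a sketch in the same setting: it enforces set semantics and detects a full table (throwing an exception here is my choice of error handling, not prescribed by the notes):

public void add(T elem) {
    int i = (elem.hashCode() & 0x7FFFFFFF) % size;
    int probes = 0;
    while (array[i] != null) {
        if (elem.equals(array[i])) return; // set semantics: already present
        if (++probes == size)              // every slot checked: table is full
            throw new IllegalStateException("table is full");
        i = (i + 1) % size;
    }
    array[i] = elem;
}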
When inserting into a table of size $m$ with a cluster of size $c$, we can calculate the probability of hitting the cluster and therefore also increasing the size of the cluster. Since landing on any of the $c$ cells of the cluster or on one of the two free cells directly bordering it extends the cluster, this probability is:

$$P(\text{cluster grows}) = \frac{c+2}{m}$$

We can also calculate the probability of needing at least 3 probe steps when adding, which happens when we land on one of the cells that are at least 3 steps away from the free cell after the cluster:

$$P(\text{at least 3 probe steps}) = \frac{c-2}{m}$$
Double Hashing
The idea here is that we don't always look at the next highest free space, which is equivalent to a step size of 1, but that each element calculates a step size for itself. This is done to avoid creating clusters. This strategy is called double hashing because you have one hash function for the index and one for the step size.
So if the hash code is $x$, the second hash function gives the step size $s(x)$, and the table has the size $m$, then the index after $j$ collisions is:

$$i_j = (x + j \cdot s(x)) \bmod m$$
public void add(T elem) {
    int i = (elem.hashCode() & 0x7FFFFFFF) % size;
    int step = ...; // step size from a second hash function, see below
    while (array[i] != null) {
        i = (i + step) % size;
    }
    array[i] = elem;
}
However, we need to be very careful when choosing the step size, otherwise the problem of clusters becomes even worse. Some obvious bad examples would be a step size of 0 or the size of the table. To avoid this, we can restrict the step size $s$ with the following condition, which guarantees that the probe sequence visits every cell of the table:

$$1 \leq s < m \quad \text{and} \quad \gcd(s, m) = 1$$
Some common choices are:
- The size of the table $m$ is a power of 2 and the step is an odd number $s \in \{1, 3, 5, \dots, m-1\}$:
1 + 2 * ((elem.hashCode() & 0x7FFFFFFF) % (m / 2))
- The size of the table $m$ is a prime number and the step is a number $1 \leq s \leq m - 2$:
1 + (elem.hashCode() & 0x7FFFFFFF) % (m - 2)
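Putting these pieces together, a sketch of double hashing with the prime-table variant (the table size 13 and the stepSize helper are my choices for illustration; the full-table check from earlier is omitted for brevity):

public class DoubleHashingTable<T> {
    private final Object[] array = new Object[13]; // prime table size
    private final int size = array.length;

    private int stepSize(T elem) {
        // Step size in 1..size-2; coprime to size because size is prime
        return 1 + (elem.hashCode() & 0x7FFFFFFF) % (size - 2);
    }

    public void add(T elem) {
        int i = (elem.hashCode() & 0x7FFFFFFF) % size;
        int s = stepSize(elem);
        while (array[i] != null)
            i = (i + s) % size; // each element probes with its own stride
        array[i] = elem;
    }
}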
Removing Elements
When removing an element, it can't just be set to null
, because otherwise, when looking for an element after the deletion, we could hit the null reference and stop the search too early (or even crash, depending on language and implementation) before we find the element we are looking for. Instead of setting it to null
, it is common practice to set it to a sentinel object. If we are looking for an element and we hit a sentinel, we just carry on our search. This also means that when we add an element and we come across a sentinel, we can add the element in place of the sentinel.
public class HashTable<T> {
    private final Object[] arr;
    private static final Object sentinel = new Object();
    ...
    public void remove(Object o) {
        assert o != null;
        int i = (o.hashCode() & 0x7FFFFFFF) % arr.length;
        int cnt = 0;
        // Probe past sentinels and non-matching elements
        while (arr[i] != null && !o.equals(arr[i]) && cnt != arr.length) {
            i = (i + 1) % arr.length;
            cnt++;
        }
        // Replace the element with the sentinel instead of null
        if (o.equals(arr[i])) arr[i] = sentinel;
    }

    public boolean contains(Object o) {
        assert o != null;
        int i = (o.hashCode() & 0x7FFFFFFF) % arr.length;
        int cnt = 0;
        while (arr[i] != null && !o.equals(arr[i]) && cnt != arr.length) {
            i = (i + 1) % arr.length;
            cnt++;
        }
        // Found only if we neither wrapped around the whole table nor hit a null
        return cnt != arr.length && arr[i] != null;
    }
}
Performance Improvements
Using modulo in the probe loop is not optimal because of the many divisions that need to be calculated.
So instead of i = (i + step) % size;
we can use one of the following (a sketch that plugs one of these variants into the probe loop follows the list):
- If the table size is a power of 2, we can use a bitmask, which is very fast.
i = (i + step) & (size - 1);
- Instead of using modulo, we could also manually detect an overflow past the end of the table.
i = i + step; if (i >= size) i -= size;
- Because a comparison with 0 is faster than a comparison with an arbitrary number, we could also probe backward and check for an underflow.
i = i - step; if (i < 0) i += size;
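As a sketch, here is the linear-probing add from before with the backward-probing variant plugged in (step size 1, same array and size fields assumed):

public void add(T elem) {
    int i = (elem.hashCode() & 0x7FFFFFFF) % size;
    while (array[i] != null) {
        // Probe backward; the underflow check compares against 0, which is cheap
        i = i - 1;
        if (i < 0) i += size;
    }
    array[i] = elem;
}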
Load Factor
The number of collisions increases with the number of elements in the table. To be able to make statements about the state of the table, there is the so-called load factor, which for a table of size $m$ holding $n$ elements is defined as follows:

$$\lambda = \frac{n}{m}$$
If we know the number of elements $n$ to be added, we can then also calculate an optimal size for the table depending on the desired load factor: $m = \frac{n}{\lambda}$.
We can also create a new table and copy all the elements to the new table once a certain threshold load factor has been reached. However, it is important to recalculate the indices when doing this. This process is called rehashing (a sketch is shown below).
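A minimal sketch of rehashing for an open-addressing table like the one above, assuming the array and size fields and the sentinel from the removal section (the growth factor of 2 is my choice):

// Grow the table and re-insert every element so the indices are recalculated
private void rehash() {
    Object[] old = array;
    array = new Object[old.length * 2];
    size = array.length;
    for (Object elem : old) {
        // Skip empty cells and sentinels; live elements get fresh indices
        if (elem != null && elem != sentinel) {
            int i = (elem.hashCode() & 0x7FFFFFFF) % size;
            while (array[i] != null)
                i = (i + 1) % size;
            array[i] = elem;
        }
    }
}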
When searching for an element in a hash table that is using the separate chaining strategy, we expect to find the element after searching half of the chain on average, so after $\frac{\lambda}{2}$ comparisons. If a search is unsuccessful, the cost is $\lambda$ because the entire list was searched.
Separate Chaining
There is no upper limit for the load factor as the chains can be of any length. The average chain length is equivalent to the load factor. For the table to be efficient, the load factor should be $\lambda \leq 1$.
Open Addressing
The load factor is limited to $\lambda \leq 1$, meaning elements can only be added as long as there is still space in the table. For optimal performance, it is recommended to have a load factor of $\lambda \leq 0.5$ for linear probing and $\lambda \leq 0.9$ for double hashing.