Maps that are hash tables are normally implemented
using a hash function which maps a data item (the key) into
an indexing table. The hash function takes the key and uses some
algorithm to convert it to an index value into an array. For example,
the hash function from the Java SDK HashMap class:
int index = (key.hashCode() & 0x7FFFFFFF) % table.length;
This hash function extracts some bits from the hashcode of the key
object (to quickly convert the hashcode to some positive number), and
then uses the remainder, after division by the table length, to get an
index that fits into the HashMap internal array. A hash
function cannot generally guarantee every data item maps to a
different index, so the structure of a hash table is complicated by
needing to maintain a list at every index, allowing multiple objects
to be held at any index. Having multiple objects at the same index is
inefficient since the collection at that index must be searched every
time an element from that collection is required.
|
Related Reading
|
A perfect hash function is one that maps every key to a
different index. Map implementations with perfect hash
functions are much simpler than using generic hash functions; they
consist of two arrays, one for the keys and one to hold all
corresponding values. The array elements do not need to hold
collections since only one object is ever located at each index,
making processing easier and faster. (Note that the hash function does
not have to map a key into every index, i.e. some indexes can be
empty. A hash function which maps each key to a unique index
and leaves no empty entries in the table is called a
minimal perfect hash function). But how can you get a perfect
hash function?
Recently I came across an interesting performance problem. The
application used a number of objects, which were at core
Maps with known key objects (not
Strings). There were several kinds of Maps,
and each kind was allowed to contain only one specific set of key
objects. For example, suppose the key objects were defined simply
as
public class Key {
int id;
public Key(int id) {this.id = id;}
}
Then the first kind of Map might be allowed to have
keys with ids of 5, 10, 11, and 27, while another kind of
Map must contain only keys with ids 10, 11 and 15,
etc. There could be multiple instances of each kind of map, but every
instance could only contain its defined keys.
In this application these Map objects were accessed
and updated often. The key lookup needed to be as fast as
possible. Note the information we have about the Map
data: the key set was restricted and known before construction for
every Map, and the keys had associated numerical values
which were unique to each key. Because the key data was restricted and
known, my first thought was that the these Maps were
ideal for optimization using perfect hash functions.
Creating optimized classes for these Maps is
straightforward except for the hash function. To keep the code clear
and optimized for Key object keys, I won't implement the
Map interface precisely -- converting the following class
to a standard implementation of Map is easy enough.
public class PerfectKeyHashMap
{
Key[] keys;
Object[] values;
public void put(Key key, Object value)
{
//The following line is the currently unknown hash function
int index = someHashFunctionOn(key);
keys[index] = key;
values[index] = values;
}
public Object get(Key key)
{
//The following line is the currently unknown hash function
int index = someHashFunctionOn(key);
//keys assumed to be entered on map creation with null values.
//no validation done on key, but if needed, that could be
//if(keys[index].id != key.id) throw new Exception
//("invalid key");
return values[index];
}
...
}
Now how do we get our perfect hash function? We want it to be quick
so we should use as few simple operations as possible. The
HashMap function manipulates bits and uses the remainder
operator. Remainder looks like it could be quite a useful operation
for our hash function. It will help map arbitrary numbers into a small
table, and it can be used with a variable argument. So let's try
it.
We have a known maximum number of keys in the hashtable. We also know that we want each of those keys to map to a different index. So clearly our remainder operator argument must be a number greater than or equal to the number of keys we have. Let's try out some tests to get a feel for the data. Using some examples, including the ones we had earlier, what would work using the remainder operator? We'll start with the list of key values and apply the remainder operator to each value. Then we use the argument that starts with the size of the key set, which increments by one each time until we get a result set of unique keys (the repeated values in each column are marked with a *):
Table 1: Finding the hash function for the set {5,10,11,27} | ||||
keys 1 |
%4 |
%5 |
%6 |
%7 |
5 |
1 |
0* |
5* |
5 |
10 |
2 |
0* |
4 |
3 |
11 |
3* |
1 |
5* |
4 |
27 |
3* |
2 |
3 |
6 |
Table 2: Finding the hash function for the set {10,11,15} | |
keys 2 |
%3 |
10 |
1 |
11 |
2 |
15 |
0 |
Table 3: Finding the hash function for the set {1,55,300,1095,1111} | ||||||||||
keys 3 |
%5 |
%6 |
%7 |
%8 |
%9 |
%10 |
%11 |
%12 |
%13 |
%14 |
300 |
0 |
0 |
6* |
4 |
3 |
0 |
3 |
0 |
1 |
6 |
1095 |
0 |
3 |
3 |
7* |
6 |
5* |
6 |
3 |
3* |
3 |
55 |
0 |
1* |
6* |
5 |
1* |
5* |
0* |
7* |
3* |
13 |
1 |
1* |
1* |
1 |
1 |
1* |
1 |
1 |
1 |
1 |
1 |
1111 |
1* |
1* |
5 |
7* |
4 |
1 |
0* |
7* |
6 |
5 |
For each of these three examples, we eventually gained a set of indexes which were unique and would map to a fairly small table. This method of obtaining indexes by successively increasing the remainder argument will definitely lead to a set of unique value for any set of non-repeated keys. This is because the value of one plus the largest key in the set for the remainder operator argument is guaranteed to produce a unique set of integers. For the last example this argument would be 1112 as shown in this table:
Table 4: Showing that a hash function producing a unique set is always obtainable | |
keys 3 |
%1112 |
300 |
300 |
1095 |
1095 |
55 |
55 |
1 |
1 |
1111 |
1111 |
So our perfect hash map needs an extra instance variable to hold the remainder appropriate for its set of keys. The hash function is just
int index = key.id % remainder;
|
What is the speed benefit of this perfect hashtable compared to
HashMap? I'll test the two by comparing
PerfectHashMaps to HashMaps which have the
benefit of the same amount of memory, as well as measurements against
a HashMap sized to just contain the set. I've optimized
the Key.hashCode() and key.equals() methods
so that the HashMaps are not at a disadvantage from these
methods. And I've included a test against a KeyHashMap
implementation, a specialized implementation of HashMap,
which is optimized for Key object keys (see "Optimizing a
query on a Map" in the resource section below).
It is worth noting that the HashMap.put() method
performs at different speeds depending on whether or not the
put() is overwriting an existing entry. With no existing
entry for a key, the HashMap needs to create a node
element before inserting the value into the collection, whereas the
node already exists when the entry is being overwritten. So I test
both these scenarios too. All measured times have been scaled by the
same value for simpler comparison.
Table 5: Comparing the performance of various Maps using the put() method on an empty table | ||||
put(), no previous element at key |
1.2 JVM |
1.3 JVM |
HotSpot 2.0 |
|
1 |
HashMap, size of key set |
100% |
49.5% |
57.6% |
2 |
HashMap, same size as perfect hash map |
66.5% |
35.8% |
27.8% |
3 |
KeyHashMap, size of key set |
73.3% |
46.1% |
44.0% |
4 |
KeyHashMap, same size as perfect hash map |
42.4% |
34.2% |
26.6% |
5 |
PerfectKeyHashMap |
15.1% |
15.7% |
12.2% |
6 |
MinimalPerfectKeyHashMap (size of key set) |
32.2% |
35.0% |
27.7% |
Table 6: Comparing the performance of various Maps using the put() method on a full table | ||||
put() overwriting previous element at key |
1.2 JVM |
1.3 JVM |
HotSpot 2.0 |
|
1 |
HashMap, size of key set |
17.4% |
21.9% |
27.7% |
2 |
HashMap, same size as perfect hash map |
18.3% |
24.1% |
18.6% |
3 |
KeyHashMap, size of key set |
13.4% |
17.0% |
15.4% |
4 |
KeyHashMap, same size as perfect hash map |
14.6% |
16.1% |
15.5% |
5 |
PerfectKeyHashMap |
15.15 |
14.0% |
11.4% |
6 |
MinimalPerfectKeyHashMap (size of key set) |
32.4% |
34.3% |
27.2% |
Table 7: Comparing the performance of various Maps using the get() method | ||||
get() |
1.2 JVM |
1.3 JVM |
HotSpot 2.0 |
|
1 |
HashMap, size of key set |
16.6% |
25.1% |
20.6% |
2 |
HashMap, same size as perfect hash map |
17.1% |
20.4% |
16.8% |
3 |
KeyHashMap, size of key set |
13.4% |
11.7% |
12.6% |
4 |
KeyHashMap, same size as perfect hash map |
13.2% |
15.1% |
11.7% |
5 |
PerfectKeyHashMap |
7.3% |
8.7% |
6.6% |
6 |
MinimalPerfectKeyHashMap (size of key set) |
16.0% |
18.2% |
14.9% |
The comparisons show that PerfectKeyHashMap can be
significantly faster than any of the other implementations except in
the case where a put() is overwriting an existing
entry. The Maps based on the HashMap
function have a severely decreased worst case performance, which
occurs when many of the keys map to the same index (not tested here),
whereas the PerfectKeyHashMap has constant performance
independent of the key set.
Our hash function may require a large table in the worst case. A useful concept is the memory factor requirement of the perfect maps, which I define as the number of times larger the memory needs to be compared to the number of elements being stored in the maps. For the examples previously listed, the set {keys 1} using the hash function %7 produced a memory factor requirement of 1.75 (four elements requiring seven indexes); {keys 2} using %3 produced a memory factor requirement of 1; {keys 3} using %14 produced a memory factor requirement of 3.
I ran a test of thousands of randomly generated keysets of between 5 and 500 elements where each key value was retricted to the range 1 to 10,000. This test produced an average memory factor requirement of 12 and a maximum factor of 20. Whether these memory requirements are reasonable for your application depends on how many large perfect maps you need to create and on the importance of speed compared to memory. Can we gain the benefits of the perfect map without the large memory overhead?
For some datasets it may be possible to produce a perfect hash function that is fast and has a small memory factor requirement (for example the final map implementation in the article "Optimizing a query on a Map"). In general, however, reducing the memory overhead of hash functions usually requires the use of a table, thus reducing the speed of the hash function.
For the hash functions considered in this article, we have the
required tables available already; the perfect maps provide us with
tables that can produce minimal perfect maps. Since any minimal
perfect map requires a perfect map as well, you may think we have
increased the memory requirements rather than reduced
them. But multiple minimal perfect map instances for the same set of
keys can all point to the same perfect map for their memory map table;
therefore, depending on the application, this might be a feasible
solution. The optimized implementation of the
MinimalPerfectKeyHashMap looks like
class MinimalPerfectKeyHashMap
{
Key[] keys;
int[] map;
Object[] values;
int hfRemainder;
MinimalPerfectKeyHashMap(int size, Key[] keys, int[] map,
int remainder)
{
this.keys = keys;
values = new Object[size];
this.map = map;
this.hfRemainder = remainder;
}
public void put(Key key, Object value)
{
int index = key.id % hfRemainder;
keys[index] = key;
values[map[index]] = value;
}
public Object get(Key key)
{
// int index = key.id % hfRemainder;
return values[map[key.id % hfRemainder]];
}
}
Examples of minimal perfect map usage can be found in the
associated source code with this article. However, the speed tests
applied to the minimal perfect map (the rows numbered 6, in tables 5
through 7) show that there may be no advantage to using this
implementation of MinimalPerfectKeyHashMaps. The table
lookup produces enough of an overhead that an optimized
KeyHashMap using the HashMap function can
produce superior performance.
Jack Shirazi is the author of Java Performance Tuning. He was an early adopter of Java, and for the last few years has consulted mainly for the financial sector, focusing on Java performance.
Return to ONJava.com.
Copyright © 2007 O'Reilly Media, Inc.