Diving into HashMap!
HashMap is probably one of the most important collections in the Java ecosystem. Let's understand its internal working in this article.
In this article, we will see the internal working of the java.util.HashMap class and understand some of the terms associated with it. I hope you go through the entire article to get a thorough understanding of this topic. So let's begin!
What is a HashMap?
A HashMap is a key-value data structure: it maintains a mapping from each key to a value, and its hash table-based implementation makes retrieving a value by its key very fast. There are a few conditions that decide how fast these get operations are, which we will see later.
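For example, a few basic operations look like this:

import java.util.HashMap;
import java.util.Map;

public class HashMapBasics {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Alice", 30);   // insert a key-value pair
        ages.put("Bob", 25);
        ages.put("Alice", 31);   // same key: the old value is replaced

        System.out.println(ages.get("Alice")); // 31
        System.out.println(ages.get("Carol")); // null (no mapping)
    }
}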
Initializing a HashMap
A few things like load factor and initial capacity must be kept in mind while initializing a hashmap. Having an estimate of the number of entries that could go into a hashmap and setting the capacity accordingly helps minimize the number of rehash operations, i.e. rebuilds of the internal data structures. Merely setting the initial capacity very high is not the answer either: it wastes memory and slows down iteration over the map, since iteration time is proportional to the capacity. But if many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity allows the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Also, setting the load factor too low triggers resizes at lower occupancy and therefore results in more frequent rehash operations.
But how are load factor and capacity of hashmap related?
Let's understand this with an example.
Suppose you know that
The estimated total entries in hashmap are 75
The initial capacity of this hashmap is 101
The load factor is default i.e. 0.75
hence the capacity needed works out to 75/0.75 = 100
This means that if you know for sure there are going to be just 75 entries in your hashmap, you should keep the initial capacity at least 101. How did we reach this number of 101? It's simple: we take 0.75 as the load factor and divide the maximum number of entries by it, 75/0.75 = 100, and then pick a capacity greater than that.
The resize check itself works with a threshold equal to capacity multiplied by load factor: as soon as an insertion pushes the number of entries past that threshold, the hashmap is rehashed and expanded to twice its capacity, i.e. in our rounded example 101*2 = 202. Note: I am rounding the numbers here; the real capacity is always a power of two, so an initial capacity of 101 actually becomes 128, the threshold becomes 128 * 0.75 = 96 (still comfortably above our 75 entries), and a resize would double the table to 256.
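In code, applying the numbers from this example just means passing the desired capacity (and, optionally, the load factor) to the HashMap constructor:

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        // Expecting about 75 entries; 75 / 0.75 = 100, so a capacity of 100 or more
        // means the map never needs to rehash while we fill it.
        Map<Integer, String> map = new HashMap<>(100, 0.75f);
        for (int i = 0; i < 75; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size()); // 75, and no resize happened along the way
    }
}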
Now, let's understand some attributes of HashMap to understand the internal working of HashMap.
1. Node<K,V>: this is the class that stores your key-value pair. The following are the attributes associated with it:
final int hash; // the hash code of the key
final K key; // the key
V value; // the value
Node<K,V> next; // in case of a hash code collision, the next node in the bin is linked to this one through this attribute
2. Node<K,V>[] table: The table is an array of Node. Each index of this table behaves as a bin. If there is a hash collision and two keys get the same bin index, the second Node is linked to the first in a singly linked list manner.
3. TreeNode<K,V>: TreeNode is a child class of the Node class itself and is used when bins are treeified. It uses the above-mentioned attributes of the Node class as well as the following
TreeNode<K,V> parent; // red-black tree links
TreeNode<K,V> left;
TreeNode<K,V> right;
TreeNode<K,V> prev; // needed to unlink next upon deletion
boolean red;
4. int size = the total number of key-value mappings in the HashMap, i.e. the count of all the nodes across the bins.
5. int modCount = the number of times the hashmap has been structurally modified; it is used to make iteration over the hashmap fail-fast (an example follows this list).
6. int threshold = the next size value at which to resize, i.e. capacity multiplied by load factor.
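To illustrate the fail-fast behaviour that modCount enables: structurally modifying a map while iterating over it makes the iterator detect the change and fail. A minimal example:

import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        // Removing an entry while iterating bumps modCount, so the iterator
        // detects the mismatch and typically throws a ConcurrentModificationException
        // on its next step.
        for (String key : map.keySet()) {
            map.remove("b");
        }
    }
}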
In case of a very bad hashCode() implementation, when keys are poorly distributed, converting the bins to TreeNodes ensures a worst-case lookup time of O(log n).
Also, TreeNodes are roughly twice the size of regular nodes, so they are only used once a bin has grown past a threshold. And once a bin later shrinks below a smaller threshold (UNTREEIFY_THRESHOLD, 6 by default), for example during removal or resizing, it is converted back to regular nodes.
Ideally, with a good hashCode() implementation and good key distribution, the frequency of nodes in bins follows a Poisson distribution with a parameter of about 0.5 on average, given the default resizing threshold of 0.75.
The expected probabilities for bins holding lists of size k are:
k = 0: probability = 0.6
k = 1: probability = 0.30
k = 2: probability = 0.075
and so on
So as you can see, there is a probability of just 0.30 for a list of size 1 and a probability of 0.075 for a list of size 2. This means that if you have a good hashCode() implementation for your key, the hashmap will distribute keys into separate bins, since the chance of collision is low. And even if a list of size 2 does get created, the probability of running into such a list is just 0.075.
If you are not familiar with the term Poisson distribution, let me explain it briefly.
Suppose 8 customers visit a restaurant every hour on average, and someone asks you what the probability of exactly 10 customers visiting in an hour is. The Poisson distribution can give you that probability.
It uses the formula Pr(X = k) = (λ^k * e^(-λ)) / k!
Here, λ is the average rate parameter, k is the number of occurrences, and e is Euler's number (its value is 2.71828...).
In the above example, λ is 8, k is 10, and e is approximately 2.718. Using the above formula, the probability of X being 10 is
Pr(X = 10) = (8^10 * e^(-8)) / 10! ≈ 0.0993
Hence there is around a 9.9% chance of exactly 10 customers visiting the restaurant in any given hour.
Similarly, in our HashMap's case, there is around a 30% chance of finding a bin with a list of size 1 and around a 7.5% chance of finding a list of size 2, and these probabilities keep decreasing as the list size grows.
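As a quick sanity check, here is a small snippet that evaluates the Poisson formula for both the restaurant example and the HashMap bin sizes with a parameter of 0.5; the poisson method is just a helper written for this demo, not a library function:

public class PoissonDemo {
    // Pr(X = k) for a Poisson distribution with mean lambda
    static double poisson(double lambda, int k) {
        double factorial = 1;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.pow(lambda, k) * Math.exp(-lambda) / factorial;
    }

    public static void main(String[] args) {
        // Restaurant example: on average 8 customers per hour, exactly 10 in one hour
        System.out.println(poisson(8, 10));  // ~0.0993

        // HashMap bin sizes with parameter ~0.5
        System.out.println(poisson(0.5, 0)); // ~0.607
        System.out.println(poisson(0.5, 1)); // ~0.303
        System.out.println(poisson(0.5, 2)); // ~0.076
    }
}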
Understanding hashCode(), hashing, and bit masking in hashmap
A hashCode() returns the hash associated with an object in Java. It is supported for the benefit of hash table-based data structures, you guessed it right, just like HashMap. When our key is a class that has various fields, the generated hash code is a combination of the hash codes of those fields. The hash code helps in determining the index at which a key will be placed in the table.
But let's say the table is currently of size 8, which means that no matter what the hash code of a particular key is, it must be placed somewhere between index 0 and 7. To do this, a simple AND operation is done between the key's hash code and a mask equal to one less than the table's size, in our case 8 - 1 = 7; this gives us the index where the key needs to be placed. The better our hashCode() is, the better the keys will be distributed and the lower the chance of collision. Fewer collisions mean better lookup time, since no time is wasted iterating over the bin lists.
But with this simple masking operation alone, the collision risk increases, because the table usually starts small, just like the size of 8 chosen above. In that case, the AND operation we do to find the index is with 7, whose binary value is 111. Let's say the hash value of our key is 221, which is 11011101 in binary. ANDing 11011101 and 00000111 gives 101, which is 5 in decimal, so this key goes to index 5. Now assume there is another key whose hash code is 189 in decimal, which is 10111101 in binary. Using the same logic, ANDing 10111101 and 00000111 again gives 101, which means this key also needs to be placed at index 5 of the table. Such collisions are likely when some of the fields in your keys barely change, and the ones that do change only affect the higher bits of the hash code, so the lower bits stay the same even for different keys.
To deal with this, HashMap applies its own hash function to each key beforehand. The purpose of this function is to spread the higher bits of the hash into the lower bits by XORing them together; it does this with an unsigned right shift by 16 bits followed by an XOR. In our example, to keep things simple, we will shift by 4 instead. Taking the first key with hash code 221 (11011101 in binary), a 4-bit right shift gives 00001101; XORing the two gives 11010000, and ANDing that with 7 (0111) gives 0000, i.e. index 0. Doing the same for the second key, 189 (10111101 in binary): a 4-bit right shift gives 00001011, XORing gives 10110110, and ANDing with 7 (0111) gives 0110, which is index 6. Earlier we got 5 as the index for both 221 and 189, whereas after spreading the higher bits we got 0 and 6 respectively. Hence, spreading the higher bits into the lower bits prevented a collision by giving the two keys different indexes.
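To see the effect in code, here is a small sketch reproducing the simplified 4-bit example above; spread4 is a helper invented just for this illustration, while the real HashMap spreads bits with h ^ (h >>> 16), where h is the key's hashCode():

public class SpreadDemo {
    // Simplified 4-bit spread used in the example above.
    // The real HashMap uses: (h = key.hashCode()) ^ (h >>> 16)
    static int spread4(int h) {
        return h ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int mask = 8 - 1; // table of size 8 -> mask 0111
        int h1 = 221;     // 1101 1101
        int h2 = 189;     // 1011 1101

        // Without spreading: both keys collide in bin 5
        System.out.println((h1 & mask) + ", " + (h2 & mask));                   // 5, 5

        // With spreading: the keys land in bins 0 and 6
        System.out.println((spread4(h1) & mask) + ", " + (spread4(h2) & mask)); // 0, 6
    }
}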
Inserting the value into HashMap using the put(K key, V value) method
In this method, first of all, the index where the key should reside is found using the hash of the key and the table size.
index = (hash of key) & (table size - 1)
Once the index is determined, it is checked whether a key already exists at that location. If it doesn't, a new Node is created and the new key-value pair is added. If it does, equals() is used to check whether it is the same key: if it is, the value is updated; if it is not, subsequent nodes in the chain are checked until a matching key is found. If no matching key is found after going through all the nodes, a new Node is created and appended to the end of the chain.
One thing to notice here is that if a bin holds more nodes than TREEIFY_THRESHOLD (8 by default), the linked list in that bin is treeified: each Node is converted to a TreeNode, which is a child class of Node, and the entire list is converted to a red-black tree data structure. There is another condition that must be fulfilled before treeification can occur: the size of the table must be greater than or equal to MIN_TREEIFY_CAPACITY (64 by default); if it isn't, the table is resized instead. Each time the table is resized, the new bin index for every node is recalculated using the new table size. The table is always resized to a power of two, i.e. doubled.
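To make this flow concrete, here is a greatly simplified sketch of put, assuming a fixed 16-bucket table, plain linked-list bins, no resizing, no treeification, and no null keys. It is an illustration of the idea above, not the actual OpenJDK code; PutSketch and its inner Node are hypothetical stand-ins:

public class PutSketch<K, V> {
    // Minimal stand-in for HashMap.Node, just for this sketch.
    static class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;
        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    static int spread(int h) {
        return h ^ (h >>> 16); // same bit spreading as the real HashMap
    }

    void put(K key, V value) {
        int hash = spread(key.hashCode());
        int index = hash & (table.length - 1);       // bin index

        Node<K, V> node = table[index];
        if (node == null) {                          // empty bin: start a new chain
            table[index] = new Node<>(hash, key, value, null);
            return;
        }
        while (true) {
            if (node.hash == hash && key.equals(node.key)) {
                node.value = value;                  // same key: replace the value
                return;
            }
            if (node.next == null) {                 // end of chain: append
                node.next = new Node<>(hash, key, value, null);
                // the real HashMap would also check TREEIFY_THRESHOLD here
                // and resize the table once size exceeds the threshold
                return;
            }
            node = node.next;
        }
    }
}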
Retrieving values using get(Object key)
The get method returns the value associated with the key provided in the argument. Hence, this operation is the reverse of the previously described put method.
In this method, first the hash of the provided key is calculated, and then the index where it must reside is computed from the table size. It is then checked whether the bin holds regular Nodes or TreeNodes, using the instanceof operator. Then the list or tree is traversed and the key is matched using equals(); if found, the value is returned, otherwise null is returned.
Note: the get method can return null for a key either when the key is not present or when the value associated with that key is null. Hence it should not be used to check for the presence of a key; containsKey(Object key) should be used instead.
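A quick example of why containsKey() is the right check:

import java.util.HashMap;
import java.util.Map;

public class NullValueDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("present-but-null", null);

        // Both lookups return null, so get() alone cannot distinguish the two cases.
        System.out.println(map.get("present-but-null")); // null
        System.out.println(map.get("missing"));          // null

        // containsKey() tells them apart.
        System.out.println(map.containsKey("present-but-null")); // true
        System.out.println(map.containsKey("missing"));          // false
    }
}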
Conclusion
HashMap has a lot more to offer and discuss, but for the sake of keeping things simple, this article has explained the basic internal working of HashMap that anyone working with it should be aware of. You should also go through the OpenJDK source code of HashMap to get a thorough understanding of this data structure.
I hope this was helpful.
Happy Coding!