Have you ever used a phone book? Maybe not, but you probably have used a cell phone. You might even be reading this article on that phone right now. Inside each cell phone is a contact list where you can store and save contacts that you can reach out to whenever and almost wherever you want in the world, but how?
Every contact in your phone must be saved with a unique phone number, oftentimes along with a country code and a location-specific area code.
Now imagine you want to program something like a contact list, how would you go about doing that? You would need to use some sort of method to structure the data that you want to save. This is exactly when a data structure would come into play, but which one should you use?
You could make an array, but what if you needed to add more contacts? Delete a contact? Arrays are great for accessing info but aren’t so great when it comes to deleting data. You are going to run into a similar issue with a linked list, this data structure does a poor job of searching through data quickly. You definitely want to be able to access your contact list easily, right? So that won't do.
That’s where the hash table comes in to save the day. The mighty hash table is used in so many situations besides the contact list that it’s pretty much the holy grail of data structures. Think DNS, domain name system, basically the entire reason you’re even reading this article right now. DNS isn’t much more than another contact list, but instead of phone numbers, you use IP Addresses and attach them to individual hostnames/URLs/Domain names to contact any device over the internet.
So what exactly is a hash table? What are the parts you need to be aware of when creating one? Why are they so stinking fast? And why is my internet so slow even though it's apparently powered by the magic of the hash table?
Maybe I will write an article one day about that last question but as for the hash table, they are also referred to as a map, hash map, dictionary, or associative array.
Back to the issue of creating a contact list, how are you going to configure your program? First, I do have something to confess, I kind of lied to you before. Remember when I said you weren’t going to be using an array? Well, you actually are, but an array alone wouldn’t be enough, you need to be a little fancier than that.
Officially to make a hash table, you need three components:
- Hash Function
The array is where your data is stored, this is where you would be storing that phone number or IP Address, you can think of this like a safe. You can put whatever info you want inside, and then you lock it up so that nobody can get inside unless they have the key.
Every safe has a key, a way to unlock it, or else it would be a useless box that stole your stuff. The key in a hash table is a string, this could be a name of a person, product name, or a URL.
In fact, some safes are easier to open than others, but why is that? What makes some data more secure than others? What stops me from just breaking into your safe and stealing all your stuff?
The lock is the hash function, some locks are better than other locks just like some hash functions are much more reliable than others. Some hash functions are even considered perfect(or are they?). What determines if a function is perfect or not is if it is able to route keys to the data inside the correct array slot without overwriting or adding data incorrectly.
A collision is a situation where your hash function incorrectly routes data to the wrong slot. Just like real car crashes, collisions are bad, and they can be just as expensive to fix. Imagine your phone contacts again, do you have any friends with the same name? Maybe you have two friends named Chris.
The Chris’s have the same name but different numbers, how can you store them both and keep your list organized? Most contact lists are ordered alphabetically, you will need to be creative to avoid collisions.
Remember when I said you wouldn’t use a linked list? Well, I lied again, in this situation a linked list would actually work. It would be a little slower this way, but you could actually write your hash function to create a linked list if there are multiple contacts who have the first starting letter of their names. This would allow your contacts to remain sorted alphabetically, but also reduce the possibility of collision.
Keep in mind that this method could also lead to issues if you had a lot of names that start with the same letter or just a lot of friends named Chris. Linked lists are slow so it's important to structure your hash function in a way that is as efficient as possible. SHA (Secure Hash Algorithms) are some best practice hash functions that you can check out, these functions incorporate extra levels of security and complexity to limit collisions and delay in searching, inserting, and deleting data. You can also write your own custom hash function depending on your personal needs.
I hope that made some sense to you because we are now going to write our own hash table. Honestly, I think you’re going to be unimpressed by what it takes to make one of these in Python because it's much less complex to make one than Wikipedia makes it out to be.
In Python, hash tables are called dictionaries, and you can make a new one using the dict() function. You can name your new hash table contact_list.
If you want to fill your table with information, we can do so by writing in the data like this:
Maybe you and Chris Evans are going to go watch the new Avengers movie later tonight, and you need to call him up. You can get his phone number by typing:
A simple shortcut for making a hash table instead of writing dict() is to just put curly braces like this:
You can choose either way to create a hash table, and you will get the exact same result.
In the phone book example, you didn’t include a hash function. Technically a hash table can be made without a hash function, but you will have no way to avoid a collision. You need logic to keep track of all the data, two records isn’t a big deal, 2 million is a different story.
Say you and your family want to go jet skiing in the ocean, you buy your virtual tickets on Groupon, and you’re scheduled to go later that week. You and your family show up at your exact time that is referenced on your voucher email confirmation, and you have a great day on the water. You had so much fun, you decide to go again using the same voucher. This time you get stopped, the computer turns up an error that says that the voucher has already been used. Bummer, what happened?
Fortunately for the jet ski rental company, they are using a hash table and a decent hash function that knows when people are trying to re-use vouchers. Let's create a hash table for our vouchers:
When someone shows up with the voucher, an employee enters it into the system to first check if this person already has used their ticket:
The get() function will return the voucher code if it has been used before, if not then it will just write back None.
Here’s what it looks like all put together:
That easy, there are many more uses for a hash table, but as long as you have an array, key, and hash function, you have all you need to get started.
Make sure to follow me on Medium for more engineer-related articles! :)