This is a follow on from the previous blog post about using a cyclic redundancy check (CRC32 in this case) to hash domain names to determine which database they should be stored on. More on that here.
Although CRC seems to work fine in the sense that it generated a completely unique hash codes it was not designed for that purpose. Instead, the focus of CRC is on error detection in data on storage devices and when transmitted over networks. I won’t go into it any further here as it’s explained very well elsewhere. On the other hand md5 was specifically designed as a hash function.
The question I had was whether MD5 was better at evenly distributing hashed values across a set of nodes. In a nutshell MD5 was better but not by a great deal and this was measured using a standard deviation and also by simply eyeballing the data (smaller min/max range in md5). If you look at the two graphs below, they each show 1 million domain names were distributed across a set of 64 databases and what you can see is that the groupings on the md5 graph are slightly more clustered around 15600 – 15700 but it’s nothing major and the crc32 does a reasonably good job.
Both java and python implementations were made which are shown below and the larger repository can be found below.
I’ll be scrapping the CRC32 and using the md5 instead simply because that’s what MD5 was designed for and the distribution is slightly better. If you need any clarification or have any improvements to the code, please comment below.