When I was working on the jdupes file duplicate scanner, I noticed that the program from which it was derived was using the MD5 secure hash algorithm to compare files and file pieces for fast exclusion. Unfortunately, while MD5 does a reasonable job of minimizing collisions between differing data sets, it is a pretty terrible choice for most programming purposes where speed matters more than security. I wanted to use a faster hash algorithm. There is no shortage of algorithms, and no shortage of people testing them and writing about the results, but I wanted to be bold and try making my own before using someone else’s, especially since I was willing to accept a higher chance of collisions in exchange for a super fast algorithm. I also wanted something specifically designed around primitive operations such as XOR, rotation, and addition that would compile directly down to the corresponding CPU instructions. I had written a primitive hash function for my Windows registry filesystem project (“winregfs”), but that hash function was not very good.
I experimented with combinations of rotations, XORs, bit masking, and addition, using my imagepile project as a test bed, since piles of disk images double as a nice large data set. I eventually found a combination of primitive operations that had a much lower collision rate than the winregfs hash function, and in early 2015 I switched the hash algorithms in jdupes (technically still just a modified fdupes at the time), winregfs, and imagepile over to the first version of jodyhash. After three more improvements I arrived at today’s version of jodyhash, a low-collision, super-fast algorithm.
An example of usage can be seen in the utility’s main.c file. Simply put the jody_hash.[ch] files into your source code directory, add #include "jody_hash.h" to your code, and call the jody_block_hash() function on the data you want to hash. The only restriction on input is that it must be fed in multiples of 8 bytes until the final piece, which may be 1 to 8 bytes long, because the hash algorithm operates on one 64-bit word at a time.
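To make the 8-byte rule concrete, here is a minimal sketch of a word-at-a-time block hash with a short final piece. It is not the jodyhash algorithm and does not use the real jody_hash.h interface: the names toy_block_hash, toy_mix, TOY_SHIFT, and TOY_SALT, and the constants themselves, are made up for illustration. It only shows the general shape of a hash built from add, rotate, and XOR that can be fed in pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for illustration; NOT the values jodyhash uses. */
#define TOY_SHIFT 11
#define TOY_SALT  UINT64_C(0x9E3779B97F4A7C15)

/* Rotate a 64-bit word left by r bits (r must be 1..63).
 * Compilers typically turn this idiom into a single rotate instruction. */
static inline uint64_t rotl64(uint64_t x, unsigned r)
{
    return (x << r) | (x >> (64 - r));
}

/* Fold one 64-bit word into the running hash using add, rotate and XOR. */
static inline uint64_t toy_mix(uint64_t hash, uint64_t word)
{
    hash += word;
    hash += TOY_SALT;
    hash = rotl64(hash, TOY_SHIFT);
    hash ^= word;
    return hash;
}

/* Hash len bytes of data, continuing from start_hash.
 * Every call except the last one in a chain must pass a multiple of 8 bytes. */
uint64_t toy_block_hash(const void *data, uint64_t start_hash, size_t len)
{
    const unsigned char *p = data;
    uint64_t hash = start_hash;
    uint64_t word;

    assert(data != NULL || len == 0);

    /* Full 64-bit words; memcpy avoids unaligned-access problems. */
    while (len >= sizeof word) {
        memcpy(&word, p, sizeof word);
        hash = toy_mix(hash, word);
        p += sizeof word;
        len -= sizeof word;
    }

    /* Final partial word (1..7 bytes), zero-padded to 64 bits. */
    if (len > 0) {
        word = 0;
        memcpy(&word, p, len);
        hash = toy_mix(hash, word);
    }

    return hash;
}
```

Because each call picks up from the hash value passed in, hashing a 19-byte buffer in one call gives the same result as hashing its first 16 bytes and then passing that result in as start_hash for the last 3 bytes. Splitting at a point that is not a multiple of 8 would change the result, which is where the restriction comes from.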
How fast is it? On an AMD A8-7600 system, I cached a 3.6GB file in RAM and hashed it five times with each of the following algorithms to illustrate why secure hashes aren’t a great choice for speed. Best run times were as follows (fastest first):
- jodyhash: 1.601s
- md5sum: 7.681s
- cksum: 10.897s
- sha512sum: 13.235s
- sha1sum: 13.661s
- sha256sum: 19.397s
How low is the rate of collisions? I tested against a list of 216,553 English words in upper case and had a single pair collide. The same list in lower case had zero collisions. A list of ASCII decimal numbers from 0 to 216553 (generated with the command seq 0 216553) had zero collisions. I have also tested against registry key name strings, partial and full files (using jdupes with debug stats enabled), and full disk blocks. Hash collisions for differing input data are so rare that I practically never see any, even with fairly large data sets.
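A word-list test like the ones above can be sketched in a few lines: hash every key, sort by hash value, and count neighbors that share a hash but differ as strings. This is an illustrative harness, not the one used for the numbers above. The names count_collisions, hash_fn, and len_hash are hypothetical, and 64-bit FNV-1a stands in for the hash under test, so swap in whichever function you want to evaluate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Signature of the hash under test (hypothetical; adapt to your function). */
typedef uint64_t (*hash_fn)(const char *);

/* Stand-in hash: standard 64-bit FNV-1a over a NUL-terminated string. */
uint64_t fnv1a64(const char *s)
{
    uint64_t h = UINT64_C(14695981039346656037);   /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= UINT64_C(1099511628211);               /* FNV prime */
    }
    return h;
}

/* Deliberately weak hash (string length) to show what a collision looks like. */
uint64_t len_hash(const char *s)
{
    return (uint64_t)strlen(s);
}

struct entry { uint64_t hash; const char *key; };

/* Order by hash first, then by key, so identical keys end up adjacent. */
static int cmp_entry(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->hash != y->hash) return (x->hash < y->hash) ? -1 : 1;
    return strcmp(x->key, y->key);
}

/* Count adjacent entries (after sorting) that share a hash but have
 * different keys. A group of k distinct keys with one hash counts as k-1.
 * Repeated identical keys are not counted. Returns -1 if malloc fails. */
long count_collisions(const char *keys[], size_t n, hash_fn hash)
{
    struct entry *e;
    long collisions = 0;
    size_t i;

    assert(keys != NULL && hash != NULL);
    if (n < 2) return 0;

    e = malloc(n * sizeof *e);
    if (e == NULL) return -1;

    for (i = 0; i < n; i++) {
        e[i].hash = hash(keys[i]);
        e[i].key = keys[i];
    }
    qsort(e, n, sizeof *e, cmp_entry);

    for (i = 1; i < n; i++) {
        if (e[i].hash == e[i - 1].hash && strcmp(e[i].key, e[i - 1].key) != 0)
            collisions++;
    }

    free(e);
    return collisions;
}
```

To reproduce a test like the ones above, read the word list into an array of strings and pass it in along with the hash function being evaluated. Sorting keeps the check at O(n log n), which is comfortable for a list of a couple hundred thousand entries.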
If you are a programmer searching for a very fast hash function with a low collision rate, I’d like to invite you to try out jodyhash. I think you’ll be happy with it.