
The problem of finding and handling duplicate files has been with us for a long time. Since the end of 1999, the de facto answer to “how can I find and delete duplicate files?” for Linux and BSD users has been a program called ‘fdupes’ by Adrian Lopez. This venerable staple of system administrators is extremely handy when you’re trying to eliminate redundant data to reclaim some disk space, clean up a code base full of copy-pasted files, or delete photos you’ve accidentally copied from your digital camera to your computer more than once. I’ve been quite grateful to have it around, particularly when dealing with customer data recovery scenarios where every possible copy of a file is recovered and the final set ultimately contains thousands of unnecessary duplicates.

Unfortunately, development on Adrian’s fdupes had, for all practical purposes, ground to a halt. From June 2014 to July 2015, the only significant functional changes to the code were modifications to compile on Mac OS X. The code’s stagnant nature has definitely shown itself in real-world tests; in February 2015, Eliseo Papa published “What is the fastest way to find duplicate pictures?” which contains benchmarks of 15 duplicate file finders (including an early version of my fork which we’ll ignore for the moment). Those benchmarks place the original fdupes dead last in operational speed and show it to be heavily CPU-bound rather than I/O-bound. In fact, Eliseo’s tests say that fdupes takes a minimum of 11 times longer to run than 13 of the other duplicate file finders in the benchmark!

As a heavy user of the program on fairly large data sets, I had noticed the poor performance of the software and became curious as to why it was so slow for a tool that should simply be comparing pairs of files. After inspecting the code base, I found a number of huge performance killers:

  1. Tons of time was wasted waiting on progress to print to the terminal
  2. Many performance-boosting C features weren’t used (static, inline, etc)
  3. A couple of one-line functions were very “hot,” adding heavy call overhead
  4. Using MD5 for file hashes was slower than other hash functions
  5. Storing MD5 hashes as strings instead of binary data was inefficient
  6. A “secure” hash like MD5 isn’t needed; matches get checked byte-for-byte
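The last three points can be sketched in a few lines. This is a minimal Python illustration, not the fdupes code: it assumes CRC-32 from zlib as a stand-in for a fast non-cryptographic hash (it is not jody_hash), and the names quick_hash and find_duplicates are hypothetical. The hash only narrows down candidates; a byte-for-byte comparison makes the final call, which is why a “secure” hash buys nothing here.

```python
import filecmp
import os
import zlib
from collections import defaultdict


def quick_hash(path, chunk_size=65536):
    """Cheap, non-cryptographic hash (CRC-32 here, purely as a stand-in)."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc  # an integer, so comparing two hashes is a plain integer compare


def find_duplicates(root):
    """Return a list of sets of paths whose contents are byte-for-byte identical."""
    # Only files that agree on both size and quick hash can possibly match.
    candidates = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), quick_hash(path))
                candidates[key].append(path)

    duplicate_sets = []
    for paths in candidates.values():
        # The hash is only a hint; confirm every candidate byte-for-byte.
        groups = []
        for path in paths:
            for group in groups:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_sets.extend(set(group) for group in groups if len(group) > 1)
    return duplicate_sets
```

Calling find_duplicates on a directory returns one set per group of identical files; hash collisions cost a little extra comparison time but can never produce a false match.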

 

In December 2014, I submitted a pull request to the fdupes repository that solved these problems. Nothing from the pull request was discussed on GitHub and none of the fixes were incorporated into fdupes. I emailed Adrian to discuss my changes with him directly and there was some interest in certain changes, but in the end nothing was changed and my emails became one-way.

It seemed that fdupes development was doomed to stagnation.

In the venerable tradition of open source software, I forked it and gave my new development tree a new name to differentiate it from Adrian’s code: fdupes-jody. I solved the six big problems outlined above with these changes:

  1. Rather than printing progress indication for every file examined, I added a delay counter to drastically reduce terminal printing. This was a much bigger deal when using SSH.
  2. I switched the code and compilation process to use C99 and added relevant keywords to improve overall performance.
  3. The “hot” one-line functions were changed to #define macros to chop function call overhead for them in half.
  4. (Also covers 5 and 6) I wrote my own hash function (appropriately named ‘jody_hash’) and replaced all of the MD5 code with it, resulting in a benchmarked speed boost of approximately 17%. The resulting hashes are passed around as a 64-bit unsigned integer, not an ASCII string, which (on 64-bit machines) reduces hash comparisons to a single compare instruction.
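The first change is easy to picture in code. Here is a rough Python sketch of the delay-counter idea, not the actual fdupes-jody implementation; the class name ThrottledProgress and the interval of 256 files are assumptions made for illustration only.

```python
import sys


class ThrottledProgress:
    """Write a progress line only once every `interval` files (hypothetical helper)."""

    def __init__(self, interval=256, stream=sys.stderr):
        self.interval = interval
        self.stream = stream
        self.count = 0    # files examined so far
        self.updates = 0  # throttled progress lines written by tick()

    def tick(self):
        """Call once per file; touches the terminal only on every Nth call."""
        self.count += 1
        if self.count % self.interval == 0:
            self.stream.write(f"\rProgress: {self.count} files examined")
            self.stream.flush()
            self.updates += 1

    def finish(self):
        """Always write the final total so the last partial interval is not lost."""
        self.stream.write(f"\rProgress: {self.count} files examined\n")
        self.stream.flush()
```

With an interval of 256, scanning 1,000 files writes to the terminal three times during the scan instead of 1,000 times, which matters most when every write has to cross an SSH connection.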

 

After forking the code, making all of these changes, and enjoying the massive performance boost they brought about, I felt motivated to continue looking for potential improvements. I didn’t realize at the time that a simple need to eliminate duplicate files more quickly would morph into me spending the next half-year ruthlessly digging through the code for ways to make things better. Between the initial pull request that led to the fork and Eliseo Papa’s article, I managed to get a lot done.

 

At this point, Eliseo published his February 19 article on the fastest way to find duplicates. I did not discover the article until July 8 of the same year, by which time fdupes-jody was at least three versions ahead of the one tested. I was initially disappointed with where fdupes-jody stood in the benchmarks relative to some of the other tested programs, but even the early fdupes-jody code (version 1.51-jody2) was absolutely stomping the original fdupes.

1.5 months into development, fdupes-jody was 19 times faster than the fdupes code it was forked from.

Nothing will make your programming efforts feel more validated than seeing something like that from a total stranger.

Between the publication of the article and my discovery of it, I had continued to make heavy improvements:

 

When I found Eliseo’s article from February, I sent him an email inviting him to try out fdupes-jody again:

I have benchmarked fdupes-jody 1.51-jody4 from March 27 against fdupes-jody 1.51-jody6, the current code in the Git repo. The target is a post-compilation directory for linux-3.19.5 with 63,490 files and 664 duplicates in 152 sets. A “dry run” was performed first to ensure all files were cached in memory and to remove variance due to disk I/O. The benchmarking was as follows:

$ ./compare_fdupes.sh -nrq /usr/src/linux-3.19.5/
Installed fdupes:
real    0m1.532s
user    0m0.257s
sys     0m1.273s

Built fdupes:
real    0m0.581s
user    0m0.247s
sys     0m0.327s

Five sequential runs were consistently close (about ± 0.020s) to these times.
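For anyone who wants to repeat this kind of measurement, here is a rough Python sketch of the same procedure: one untimed dry run to warm the cache, then several timed runs. It is not the compare_fdupes.sh script used above; the function names are hypothetical, and the ratio at the end assumes “Installed” is the older build and “Built” is the newer one.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv once untimed to warm the cache, then return wall-clock seconds per timed run."""
    quiet = {"stdout": subprocess.DEVNULL, "stderr": subprocess.DEVNULL}
    subprocess.run(argv, check=False, **quiet)  # dry run: results discarded
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=False, **quiet)
        timings.append(time.perf_counter() - start)
    return timings


def speedup(old_seconds, new_seconds):
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


# The "real" times reported above: 1.532 s (installed) versus 0.581 s (built).
print(f"{speedup(1.532, 0.581):.2f}x")  # prints 2.64x
```

Comparing the spread of the values returned by time_command is a quick way to check that the runs are as consistent as the five sequential runs described above.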

In half a year of casual spare-time coding, I had made fdupes 32 times faster.

There’s probably not a lot more performance to be squeezed out of fdupes-jody today. Most of my work on the code has settled down into working on new features and improving Windows support. In particular, Windows has supported hard linked files for a long time, and I’ve taken full advantage of Windows hard link support. I’ve also made the progress indicator much more informative to the user. At this point in time, I consider the majority of my efforts complete. fdupes-jody has even gained inclusion as an available program in Arch Linux.

Out of the efforts undertaken in fdupes-jody, I have gained benefits for other projects as well. Improving jody_hash has been a fantastic help since I also use it in other programs such as winregfs and imagepile. I can see the potential for using the string_table allocator in other projects that don’t need to free() string memory until the program exits. Most importantly, working on fdupes-jody has improved my programming skills tremendously, and I have learned a lot more than I could have imagined would come from improving such a seemingly simple file management tool.

If you’d like to use fdupes-jody, feel free to download one of my binary releases for Linux, Windows, and Mac OS X. You can find them here.

Everyone is flipping out over a picture of a blue-and-black dress. Just when I thought cat videos were the only thing to freak out about. Some say it’s white and gold, others say it’s blue and black. It seems that the white/gold perception is almost squarely with women and the blue/black perception is largely with men. Why is this? Many people have played with color correction to “prove” their answer is the One True Answer(TM) but the reality is that the dress is actually blue with black accents. Want proof? Here’s the dress in a catalog to show that it is factually blue-and-black:

[Image: the dress as shown in a catalog]

Is this a dress or a vase?

I know, I know, you’re saying “but I want to know why women see it as white and gold!” It’s simple. Here’s the top portion of the dress image:

[Image: top portion of the dress photo only]

It’s gold and white. If you disagree, I’ll tweet bad things about your lawn.

It’s difficult to see ONLY this part and not think that it could be white and gold. The black part clearly has an incandescent spotlight above it somewhere which bounces off the semi-shiny black portion to give the appearance of a gold hue. The extreme backlight in the upper-right corner that is blowing the picture contrast out pretty severely gives the impression that the entire dress coloration is tainted by shadowing caused by the light source behind it; this combined with the gold “hint” from the incandescent light will cause anyone who looks at the top of the image first to mentally and subconsciously “auto-correct” their color perception to compensate. Thus, if you look at the top first, you’re seeing a white dress with gold accents. Let’s take a look around where the center body mass would be instead:

[Image: middle portion of the dress photo only]

It’s blue. It’s not white. Don’t be such a racist.

If your first glance is closer to the center of the body, you’ll see a lot less of the gold “hinting.” Because the black is generally darker and the overall brightness of this section of the image is lower, the blue looks more blue and less white.

Let’s be honest with ourselves about typical instinctual human behavior here: men look at the body first and move around to get the whole picture; women size up the person they look at from top to bottom. Men see the dark part first, women see the light part first, and that’s why they perceive it differently. If the same visual tricks and erroneous hints were somehow swapped, the perceptions would also be swapped. There is also the fact that men and women perceive color slightly differently anyway, with women being more capable of distinguishing slight changes in color and men being better at detecting motion, contrast, and bigger changes in general; it could be that the superior color perception of women works against them given this atrocious lighting and terrible quality camera.

For reference, this is the full dress photo everyone’s so worked up about. What color is it? What color did you see it as when you scrolled down? If you scroll very slowly down without looking directly at the photo, even if you’ve seen it as blue/black every time before, you’ll probably see it as white/gold and immediately wonder if you’ve been slipped a hallucinogen via the Internet. I know that’s how I felt, anyway.

You are looking directly at the end of human civilization as we know it.


By default, every version of Windows since XP creates thumbnail database files that store small versions of every picture in every folder you browse into with Windows Explorer. These files are used to speed up thumbnail views in folders, but they have some serious disadvantages:

  1. They are created automatically without ever asking you if you want to use them.
  2. Deleting an image file doesn’t necessarily delete it from the thumbnail database. The only way to delete the thumbnail is to delete the database (and hope you deleted the correct one…and that it’s not stored in more than one database!)
  3. These files consume disk space, albeit a relatively small amount.
  4. The XP-style (which is also Vista/7/8 style when browsing network shares) “Thumbs.db” and the Windows Media Center “ehthumbs_vista.db” files are marked as hidden, but if you make an archive (such as a ZIP file) or otherwise copy the folder into a container that doesn’t support hidden attributes, not only does the database increase the size of the container required, it also gets un-hidden!
  5. If you write software, they can interfere with software version control systems. They may also update the timestamp on the folder they’re in, causing some programs to think your data in the folder has changed when it really hasn’t.
  6. If you value your privacy (particularly if you handle any sort of sensitive information) these files leave information behind that can be used to compromise that privacy, especially when in the hands of anyone with even just a casual understanding of forensic analysis, be it the private investigator hired by your spouse or the authorities (police, FBI, NSA, CIA, take your pick).

To shut them off completely, you’ll need to change a few registry values that aren’t available through normal control panels (and unavailable in ANY control panels on any Windows version below a Pro, Enterprise, or Ultimate version). Fortunately, someone has already created the necessary .reg files to turn the local thumbnail caches on or off in one shot. The registry file data was posted by Brink to SevenForums. The files at that page will disable or enable this feature locally. They also control whether Windows Vista and higher create “Thumbs.db” files on your network drives and shares.

If you want to delete all of the “Thumbs.db” style files on a machine that has more than a couple of them, open a command prompt (Windows key + R, then type “cmd” and hit enter) and type the following commands (yes, the colon after the “a” is supposed to be followed by an empty space):

cd \

del /s /a: Thumbs.db

del /s /a: ehthumbs_vista.db

This will enter every directory on the system hard drive and delete all of the Thumbs.db files. You may see some errors while this runs, but such behavior is normal. If you have more drives that need to be cleaned, you can type the drive letter followed by a colon (such as “E:” if you have a drive with that letter assigned to it, for example) and hit enter, then repeat the above two commands to clean them.
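If you would rather script the cleanup, the same sweep can be sketched in Python. This is only an illustration under a few assumptions: the function name delete_thumbnail_dbs is made up, the two file names are the ones mentioned above, and dry_run defaults to True so nothing is deleted until you ask for it.

```python
import os

# File names taken from the post; matched case-insensitively.
THUMBNAIL_DB_NAMES = {"thumbs.db", "ehthumbs_vista.db"}


def delete_thumbnail_dbs(root, dry_run=True):
    """Walk root and delete (or, with dry_run, only list) thumbnail database files."""
    matched = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower() not in THUMBNAIL_DB_NAMES:
                continue
            path = os.path.join(folder, name)
            if not dry_run:
                try:
                    os.remove(path)
                except OSError as error:
                    # Locked or protected files are reported and skipped,
                    # much like the errors "del" may print.
                    print(f"Skipped {path}: {error}")
                    continue
            matched.append(path)
    return matched
```

Run it once with the default dry_run=True to review the list it returns, then again with dry_run=False (for example, delete_thumbnail_dbs("C:\\", dry_run=False)) to actually remove the files.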

The centralized thumbnail databases for Vista and up are harder to find. You can open the folder quickly by going to Start, copy-pasting this into the search box with CTRL+V, and hitting enter:

%LOCALAPPDATA%\Microsoft\Windows\Explorer

Close all other Explorer windows that you have open to unlock as many of the files as possible. Delete everything that you see with the word “thumb” at the beginning. Some files may not be deletable; if you really want to get rid of them, you can start a command prompt, start Task Manager, use it to kill all “explorer.exe” processes, then delete the files manually using the command prompt:

cd %LOCALAPPDATA%\Microsoft\Windows\Explorer

del thumb*

rd /s thumbcachetodelete

When you’re done, either type “explorer” in the command prompt, or in Task Manager go to File > New Task (Run)… and type “explorer”. This will restart your Explorer shell so you can continue using Windows normally.
