

The problem of finding and handling duplicate files has been with us for a long time. Since late 1999, the de facto answer to “how can I find and delete duplicate files?” for Linux and BSD users has been a program called ‘fdupes’ by Adrian Lopez. This venerable staple of system administrators is extremely handy when you’re trying to eliminate redundant data to reclaim disk space, clean up a code base full of copy-pasted files, or delete photos you’ve accidentally copied from your digital camera to your computer more than once. I’ve been quite grateful to have it around, particularly in customer data recovery scenarios where every possible copy of a file is recovered and the final set contains thousands of unnecessary duplicates.

Unfortunately, development on Adrian’s fdupes had, for all practical purposes, ground to a halt. From June 2014 to July 2015, the only significant functional change to the code was a modification to make it compile on Mac OS X. The code’s stagnation has shown itself in real-world tests: in February 2015, Eliseo Papa published “What is the fastest way to find duplicate pictures?”, a benchmark of 15 duplicate file finders (including an early version of my fork, which we’ll ignore for the moment) that placed the original fdupes dead last in speed and showed it to be heavily CPU-bound rather than I/O-bound. In fact, Eliseo’s tests show fdupes taking at least 11 times longer to run than 13 of the other duplicate file finders in the benchmark!

As a heavy user of the program on fairly large data sets, I had noticed the poor performance of the software and became curious as to why it was so slow for a tool that should simply be comparing pairs of files. After inspecting the code base, I found a number of huge performance killers:

  1. Tons of time was wasted waiting on progress to print to the terminal
  2. Many performance-boosting C features weren’t used (static, inline, etc.)
  3. A couple of one-line functions were very “hot,” adding heavy call overhead
  4. Using MD5 for file hashes was slower than other hash functions
  5. Storing MD5 hashes as strings instead of binary data was inefficient
  6. A “secure” hash like MD5 isn’t needed; matches get checked byte-for-byte (see the sketch after this list)
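
To illustrate points 4 through 6, here’s a minimal sketch (illustrative only, with hypothetical names; this is not actual fdupes or jdupes source). Because every candidate match is verified byte-for-byte anyway, the hash only needs to be fast and reasonably well-distributed, and a raw 64-bit integer is far cheaper to compare than a hex string:

#include <stdint.h>
#include <string.h>

/* Slow path: MD5 digests stored as 32-character hex strings */
static int hashes_match_hex(const char *a, const char *b)
{
    return strcmp(a, b) == 0;  /* up to 32 byte comparisons */
}

/* Fast path: a 64-bit hash stored as a plain integer */
static int hashes_match_u64(uint64_t a, uint64_t b)
{
    return a == b;  /* a single compare instruction on 64-bit machines */
}

int main(void)
{
    /* Either way, a hash match is only a candidate; the files still
     * get compared byte-for-byte before being declared duplicates. */
    return hashes_match_hex("d41d8cd98f00b204e9800998ecf8427e",
                            "d41d8cd98f00b204e9800998ecf8427e") &&
           hashes_match_u64(0x0123456789abcdefULL, 0x0123456789abcdefULL) ? 0 : 1;
}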


I submitted a pull request to the fdupes repository which solved these problems in December 2014. Nothing from the pull request was discussed on GitHub and none of the fixes were incorporated into fdupes. I emailed Adrian to discuss my changes with him directly; there was some interest in certain changes, but in the end nothing was merged and my emails went unanswered.

It seemed that fdupes development was doomed to stagnation.

In the venerable tradition of open source software, I forked it and gave my development tree a new name to differentiate it from Adrian’s code: jdupes. I solved the six big problems outlined above with these changes:

  1. Rather than printing a progress indication for every file examined, I added a delay counter to drastically reduce terminal printing (see the sketch after this list). This was a much bigger deal when using SSH.
  2. I switched the code and compilation process to use C99 and added relevant keywords to improve overall performance.
  3. The “hot” one-line functions were changed to #define macros to chop their function call overhead in half.
  4. (Also covers 5 and 6.) I wrote my own hash function (appropriately named ‘jody_hash’) and replaced all of the MD5 code with it, resulting in a benchmarked speed boost of approximately 17%. The resulting hashes are passed around as 64-bit unsigned integers rather than ASCII strings, which (on 64-bit machines) reduces each hash comparison to a single compare instruction.
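
Here’s a minimal sketch of changes 1 and 3 (hypothetical names and values; the real jdupes code differs): a delay counter that skips most terminal writes, plus a hot one-line helper rewritten as a macro so the inner loop pays no call overhead:

#include <stdio.h>

#define DELAY_COUNT 256  /* hypothetical: print at most once per 256 files */

/* Change 3: a hot one-line helper as a macro instead of a function call */
#define ISFLAG(flags, bit) (((flags) & (bit)) == (bit))

static unsigned int delay = 0;

static void update_progress(unsigned long done, unsigned long total)
{
    /* Change 1: skip the (slow) terminal write most of the time */
    if (++delay < DELAY_COUNT) return;
    delay = 0;
    fprintf(stderr, "\rProgress: %lu/%lu files", done, total);
}

int main(void)
{
    unsigned long total = 100000;
    for (unsigned long i = 1; i <= total; i++) update_progress(i, total);
    if (ISFLAG(1u, 1u)) fputc('\n', stderr);  /* trivial use of the macro */
    return 0;
}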


After making all of these changes in the fork and enjoying the massive performance boost they brought, I felt motivated to keep looking for potential improvements. I didn’t realize at the time that a simple need to eliminate duplicate files more quickly would morph into spending the next half-year ruthlessly digging through the code for ways to make things better. Between the initial pull request that led to the fork and Eliseo Papa’s article, I managed to get a lot done.

At this point, Eliseo published his February 19 article on the fastest way to find duplicates. I did not discover the article until July 8 of the same year (by which time jdupes was at least three versions ahead of the one tested), so I was initially disappointed with where jdupes stood in the benchmarks relative to some of the other programs, but even the early jdupes (version 1.51-jody2) code was much faster than the original fdupes code for the same job.

1.5 months into development, jdupes was 19 times faster in a third-party test than the code it was forked from.

Nothing will make your programming efforts feel more validated than seeing something like that from a total stranger.

Between the article being published and my finding it, I had continued to make heavy improvements.

When I found Eliseo’s article from February, I sent him an email inviting him to try out jdupes again:

I have benchmarked jdupes 1.51-jody4 from March 27 against jdupes 1.51-jody6, the current code in the Git repo. The target is a post-compilation directory for linux-3.19.5 with 63,490 files and 664 duplicates in 152 sets. A “dry run” was performed first to ensure all files were cached in memory, removing variance due to disk I/O. The benchmark was as follows:

$ ./compare_fdupes.sh -nrq /usr/src/linux-3.19.5/
Installed fdupes:
real 0m1.532s
user 0m0.257s
sys 0m1.273s

Built fdupes:
real 0m0.581s
user 0m0.247s
sys 0m0.327s

Five sequential runs were consistently close (about ± 0.020s) to these times.

In half a year of casual spare-time coding, I had made fdupes 32 times faster.

There’s probably not much more performance left to squeeze out of jdupes today. Most of my work on the code has settled into building new features and improving Windows support. In particular, Windows has supported hard-linked files for a long time, and jdupes now takes full advantage of that hard link support (sketched below). I’ve also made the progress indicator much more informative to the user. At this point, I consider the majority of my efforts complete. jdupes has even been included as an available package in Arch Linux.
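
For the curious, here’s roughly what replacing a duplicate with a hard link looks like on Windows, as a minimal Win32 sketch (not the actual jdupes code, which is far more careful about failure cases and cross-volume restrictions):

#include <windows.h>
#include <stdio.h>

/* Sketch only: real code should link to a temporary name first so the
 * duplicate is not lost if CreateHardLink() fails. */
static int link_duplicate(const char *original, const char *duplicate)
{
    if (!DeleteFileA(duplicate)) {
        fprintf(stderr, "could not remove %s\n", duplicate);
        return -1;
    }
    /* Recreate the duplicate's name as a hard link to the original;
     * both names must reside on the same NTFS volume. */
    if (!CreateHardLinkA(duplicate, original, NULL)) {
        fprintf(stderr, "could not link %s -> %s\n", duplicate, original);
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s original duplicate\n", argv[0]);
        return 1;
    }
    return link_duplicate(argv[1], argv[2]) ? 1 : 0;
}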

The work on jdupes has benefited my other projects as well. Improving jody_hash has been a fantastic help, since I also use it in other programs such as winregfs and imagepile. I can see the potential for using the string_table allocator in other projects that don’t need to free() string memory until the program exits (sketched below). Most importantly, working on jdupes has improved my programming skills tremendously, and I have learned far more than I imagined could come from improving such a seemingly simple file management tool.
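
As a sketch of the string_table idea described above (hypothetical names; not the actual string_table code): hand out string copies from large slabs and never free them individually, so there is no per-string malloc()/free() bookkeeping:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLAB_SIZE 65536  /* hypothetical slab size */

struct slab {
    struct slab *next;  /* kept so an optional cleanup pass could free all slabs */
    size_t used;
    char data[SLAB_SIZE];
};

static struct slab *slabs = NULL;

static char *string_table_add(const char *s)
{
    size_t len = strlen(s) + 1;

    if (len > SLAB_SIZE) return NULL;  /* a real allocator might special-case this */
    if (slabs == NULL || slabs->used + len > SLAB_SIZE) {
        struct slab *sb = malloc(sizeof(*sb));
        if (sb == NULL) return NULL;
        sb->next = slabs;
        sb->used = 0;
        slabs = sb;
    }
    char *copy = slabs->data + slabs->used;
    memcpy(copy, s, len);
    slabs->used += len;
    return copy;
}

int main(void)
{
    /* Strings live until exit; the OS reclaims the slabs then. */
    puts(string_table_add("no per-string free() required"));
    return 0;
}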

If you’d like to use jdupes, feel free to download one of my binary releases for Linux, Windows, and Mac OS X. You can find them here.

I often find myself in a position where I must locate software to perform a niche task of some sort, and that inevitably means running lots of searches to discover available programs and research the merits of each. Unfortunately, I find that about 70% of my total “software hunting” time is spent constructing elaborate searches to weed out deceptive, bait-and-switch, scammy-sounding marketing copy that tries to lure people seeking free software into installing a program that isn’t free at all, and therefore isn’t what the user is looking for.

Let’s say we need to extract email from an Outlook OST file (basically a PST-like file format used only for Exchange servers, and not readable as a PST file). The user wants to get email from the OST file, but Outlook only allows opening PST files, so naturally we look for something like “OST to PST free” online. Lo and behold, we have this program pop up from Softpedia:

Recover Data for OST to PST Free Download – Softpedia

Is this what we’re looking for? It says “free” in the title, and it says it recovers OST files to PST format. Sounds perfect! Well, perfect except for the line underneath it in the search results, which tips us off to the truth behind the “free download” scam:

Rating: 4 – ‎12 votes – ‎$99.00 – ‎Windows – ‎Utilities/Tools

Oh.

So it’s a “free” program that costs nearly $100 to purchase. Apparently we have different definitions of what constitutes “free.”

But wait! It’s not a “free program,” it’s a “FREE DOWNLOAD.” As in, you pay nothing for the ability to download it…because it’s so obvious that anyone looking for the word “free” is worried about whether or not they have to pay to download it, right?

Look, you scummy marketing douche rockets, we see what you’re doing there, and we really don’t like it. The real purpose of the phrase “FREE DOWNLOAD” is not to emphasize the fact that the download itself doesn’t cost anything. The goons that use this phrase are attempting to do two equally deceptive things by tacking it onto their not-free software download pages:

  1. Lure in people seeking free stuff (using the search term “FREE”), trick them into looking at the paid product, convince them to download it (see the next point), and then prey on the effort they’ve already invested to get them to shell out their credit card; and
  2. Play a psychological trick in the process, where the person downloading sees the word “FREE” and is convinced that they’re acquiring a solution that won’t cost any money.

Abuse of the term “free” will never end, so it pays to be vigilant and cautious when looking for anything which is truly free. I still say that the people who use this type of trickery are lousy people, and I for one will not ever download (and especially not pay for) any such software. A “free trial” is one thing, but they knew what they were doing with that “free download” garbage, and we shouldn’t allow it to work on us. Vote with your dollars: if you’re going to end up paying for something, make sure it’s not marketed deceptively first.

(Coincidentally, I was looking for WMV file editing software right after typing this, and Wondershare Video Editor came up with both “FREE DOWNLOAD” and “[checkmark-shield icon] SECURE DOWNLOAD” in a blog post of theirs with obviously planted comments at the bottom; visiting their normal site reveals that the software is a free trial that actually costs $40. For obvious reasons, Wondershare will never see a dime of my money.)

Ah, yes, the much-speculated Google Operating System.  Rumors about a possible OS from Google have been floating about for years now, and it seems that Google has finally delivered the cornucopia of computing goodness to your door.  Coming soon to a netbook near you:  Google’s new operating system.  The news is practically flooded with articles about why Google’s fancy new OS is so important and interesting.

I’m here to tell you why it sucks, and why it isn’t really that special at all.

First and foremost, Chrome OS is based on Linux, and Linux has already been out for a long time, with Ubuntu being the most well-known and possibly the most available distribution.  What makes Chrome OS different from any other Linux distro?  It’s Linux with yet another face, but under the hood it still shares far too much with Linux to be considered its own “operating system.”  (Watch for my next post to clarify the difference between a true operating system and what is merely labeled an OS but is in fact more of a “software environment.”)  Chrome OS = Linux with another pretty face.  End of story.  If you want Linux, download Ubuntu or Debian or Fedora or Arch Linux.  At least they offer up real applications and a fully featured environment…

Second, Chrome OS suffers from the most serious problem that plagues other “cloud-centric” distributions of Linux: the all-too-often wrong assumption that the computer will be connected to the Internet most of the time.  The OS is centered around the Chrome browser, the primary apps are online apps, and support for traditional offline apps is likely to be minimal.  Case in point: gOS, which came with my Sylvania G netbook.  The first thing I did was toss out gOS and install something else (anything else!) because it was such a nuisance.  gOS comes with icons for OpenOffice.org and Firefox, and that’s really about it.  Every other “application” seemed to be Internet-enabled.  Most of the “applications” were Google, Blogger, Facebook, MySpace, Google Docs, and other garbage that requires a (fast) Internet connection to work.  What good is having an ultraportable laptop if you need an Internet connection to use 90% of its functionality?  That’s one reason I documented some of the things you can do to get more out of the G netbook: it actually comes with the majority of the standard GNOME environment, which includes a significant number of games, control panels, applications, and other tools…none of which has an icon in the default installation at all!  Chrome OS is doomed to suffer the same fate, because it is nothing more than “gOS reloaded” for all intents and purposes.

FEW PEOPLE WANT TO BE TETHERED TO THE INTERNET WITH THEIR LAPTOPS AT ALL TIMES.  LEARN THIS, GENIUSES: INTERNET APPLICATIONS SUCK.

Which brings us to my third point:  INTERNET APPLICATIONS SUCK.  The ones that don’t suck aren’t Internet applications at all.  I don’t know anyone who uses Google Docs, and Google Docs is no replacement for an installation of OpenOffice.org or Microsoft Office.  One might be tempted to counter with a mention of the heavily used Google SketchUp or Google Earth, but the difference is that those are true applications which just happen to be Internet-enabled or come from a site on the Internet.  Google Earth uses data pulled from the Internet, and Google Earth totally rocks.  Google Docs, though, is sparse on features and not very compatible with other office applications.  It is not a viable replacement for a real office package for most people, and it feels like “Microsoft Works lite” in general.  Looking beyond Google, we see sites such as MySpace, Facebook, Twitter, and other “social networking” sites taking longer and longer to load, plagued by excessive use of widgets and other serious issues.  Contrast this with traditional instant messenger applications and even the ever-hated AOL, which may not be the smallest programs in existence, but which provide much better performance, a larger feature set, and better integration with other programs.  Internet applications are limited in their implementation and capabilities, as well as by the lack of proper support for industry standards that have been around for a long time now.

What’s very depressing is that I actually see many reputable sites hyping Chrome OS and discussing whether or not it threatens Windows, Linux, Mac OS, or even embedded operating systems.  Chrome OS is nothing more than a Linux distribution with a stupid idea behind it, and Google has spent considerable time and money on dumber things (can you say YouTube?).  This isn’t like Android, which opened up options in the mobile phone market considerably.  This is something targeted at machines that can already do more than Chrome OS can.

In short, Google Chrome OS is obsolete before it ever rolls out.  Apparently, I’m not exactly alone in my opinions, and this article sums it all up quite nicely.
