I have a cleanup program that I've written as a Bash shell script. Over the years, it has morphed from a thing that just deleted a few fixed directories if they existed (mostly temporary file directories found on Windows) into a very flexible cleanup tool that can take a set of rules and rewrite them to apply to multiple versions of Windows, with safeguards that check both the rules and the auto-rewritten rules to prevent the equivalent of an "rm -rf /*" from happening. It's incredibly useful for me; when I back up a customer's PC data, I run the cleaner script first to delete many gigabytes of unnecessary junk, which speeds up the backup and restore process significantly.

Unfortunately, the internal rewrite and safety check rules have the side effect of massively slowing down the process. I tolerated the slowness for a long time, but as the rule set has grown over the past few years, the script has taken longer and longer to complete, so I finally decided to find out what was really going on and fix the speed problem.

Profiling shell scripts isn’t quite as easy as profiling C programs; with C, you can just use a tool like Valgrind to find out where all the effort is going, but shell scripts depend on the speed of the shell, the kernel, and the plethora of programs executed by the script, so it’s harder to follow what goes on and find the time sinks. However, I observed that a lot of time was spent in the steps between deleting items; since each rewrite and safety check is done on-the-fly as deletion rules are presented for processing, those were likely candidates. The first thing I wanted to know was how many times the script called an external program to do work; you can easily kill a shell script’s performance with unnecessary external program executions. To gather this info, I used the strace tool:

strace -f -o strace.txt tt_cleaner

This produced a file called “strace.txt” which contains every single system call issued by both the cleaner script and any forked programs. I then looked for the execve() system call and gathered the counts of the programs executed, excluding “execve resumed” events which aren’t actual execve() calls:

grep execve strace.txt | sed 's/.*execve/execve/' | cut -d\" -f2 | grep -v resumed | sort | uniq -c | sort -g

The resulting output consisted of numbers below 100 until the last two lines, and that’s when I realized where the bottleneck might be:

4157 /bin/sed
11227 /usr/bin/grep

That's a LOT of calls to sed, but the number of calls to grep was almost three times higher, so that's where I started looking for improvements. As I've said, the rewrite code takes each deletion rule and rewrites it for other possible interpretations: "Username\Application Data" on Windows XP moved to "Username\AppData\Roaming" on Vista and up, "All Users\Application Data" moved to "C:\ProgramData" in those same versions, and there is a potential mirror of every single rule under "Username\AppData\Local\VirtualStore". The rewrite code expands the deletion rules to cover every one of these possible cases. The outer loop of the rewrite engine grabs each rewrite rule in order, while the inner loop does the actual rewriting against the current rule AND all prior rewrites to ensure no possibilities are missed (VirtualStore is largely to blame for this double-loop architecture). This means that anything done within the inner loop is executed a huge number of times, and the very first command in the inner loop looked like this:

if echo "${RWNAMES[$RWNCNT]}" | grep -qi "${REWRITE0[$RWCNT]}"

This checks to see if the rewrite rule applies to the cleaner rule before doing the rewriting work. It calls grep once for every single iteration of the inner loop. I replaced this line with the following:

if [[ "${RWNAMES[$RWNCNT]}" =~ .*${REWRITE0[$RWCNT]}.* ]]

I also had to tack a "shopt -s nocasematch" onto the top of the shell script to make the comparison case-insensitive. The result was a 6x speed increase. Testing on an existing data backup that had already been cleaned (no "work" to do) showed a consistent time reduction from 131 seconds to 22 seconds! The grep count dropped massively, too:

97 /usr/bin/grep

Bash can do wildcard and regular expression matching of strings (the =~ comparison operator is a regex match), so any place your shell script uses an "echo | grep" combination inside a loop stands to benefit greatly from these Bash features. Unfortunately, they are not POSIX shell features and using them makes the script non-portable, but if you will never run the script under other shells and the performance boost is significant, why not use them?
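
For illustration, here's a rough sketch of the shape of that double loop with the new test in place. The array names mirror the snippets above, but the loop scaffolding is my own reconstruction, not the actual cleaner code:

shopt -s nocasematch   # makes == and =~ inside [[ ]] case-insensitive, like grep -i

for RULE in "${REWRITE0[@]}"; do            # outer loop: each rewrite rule in order
    for NAME in "${RWNAMES[@]}"; do         # inner loop: the current rule plus all prior rewrites
        # Old test: spawned one grep process per iteration
        #   if echo "$NAME" | grep -qi "$RULE"; then ...
        # New test: stays inside Bash; =~ is unanchored, so the leading
        # and trailing .* shown above are optional
        if [[ "$NAME" =~ $RULE ]]; then
            : # rewriting work happens here
        fi
    done
done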

The bigger lesson here is that you should take some time to learn about the features offered by your shell if you’re writing advanced shell scripts.

Update: After writing this article, I set out to eliminate the thousands of calls to sed. I was able to change an "echo-sed" combination into a couple of Bash substring substitutions. Try it out:

FOO=${VARIABLE/string_to_replace/replacement}

It accepts $VARIABLES in both the search string and the replacement, so it's quite powerful. Best of all, the total runtime dropped to 10.8 seconds for a total speed boost of over 11x!
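
A quick illustration of the variable form (the names and strings here are made up for the example, not taken from the cleaner script):

SRC='All Users\Application Data'
FIND='Application Data'
REPL='ProgramData'

# Replace the first match; use ${SRC//$FIND/$REPL} to replace every match
NEWPATH=${SRC/$FIND/$REPL}
echo "$NEWPATH"    # prints: All Users\ProgramData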

I just found this thread on the /r/LegalAdvice subreddit about a concept called "filial responsibility," which basically means that parents and/or their adult children can be held legally responsible for paying medical bills incurred by each other. Apparently 29 states in the USA have filial responsibility laws on the books, but I (like many other people) had never heard a thing about them before today.

Filial responsibility is super draconian and scary shit.

Interest in filial responsibility laws has slowly resurfaced after the finalization of a Pennsylvania court case in which a son was held legally liable for his mother's $93,000 nursing home bill. Before this case came about, these laws had long since fallen out of any sort of actual enforcement, in a similar vein to anti-sodomy or "crime against nature" laws that technically make it a felony to have oral or anal sex with a human. I started digging a bit and found out that these laws could be a nasty time bomb in North Carolina, because NC criminal law says that failing to take care of your parents, if the State decides you should be able to do so, is grounds for giving you a criminal record.

That’s right! Don’t pay for your parents’ medical bills? Class 2 misdemeanor, have fun in prison.

I am a firm believer that no person in a free society should ever be held liable for debts (financial or moral) incurred by any other person and that debtors’ prisons should be completely abolished. If you believe the same thing, contact your state government representatives and make sure they know you want these laws stricken from the books.

When I was working on the jdupes file duplicate scanner, I noticed that the program from which it was derived was using the MD5 secure hash algorithm to compare files and file pieces for fast exclusion. Unfortunately, while MD5 does a reasonable job of minimizing collisions for a given pair of data sets, it is a pretty terrible choice for most programming purposes, and I wanted to use a faster hash algorithm. There is no shortage of algorithms, and no shortage of people testing them and writing about the results, but I wanted to be bold and try making my own before using someone else's, especially since I was willing to accept a higher chance of collisions in exchange for a super fast algorithm. I also wanted something specifically designed around primitive operations such as XOR, rotation, and addition that would compile directly down to the corresponding CPU instructions. I had written a primitive hash function for my Windows registry filesystem project ("winregfs"), but that hash function was not very good.

I experimented with combinations of rotations, XORs, bit masking, and addition, using my imagepile project as a test bed, since piles of disk images double as a nice large data set. I eventually found a combination of primitive operations with a much lower collision rate than the winregfs hash function, and in early 2015 I switched the hash algorithms in jdupes (technically still just a modified fdupes at the time), winregfs, and imagepile over to the first version of jodyhash. After three more rounds of improvement, I arrived at today's version of jodyhash: a low-collision, super-fast hash algorithm.

An example of usage can be seen in the utility's main.c file. Simply drop the jody_hash.[ch] files into your source code directory, #include "jody_hash.h", and call the jody_block_hash() function on the data you want to hash. The only restriction on the input is that it must be fed in 8-byte-wide pieces, except for the final piece of 1 to 8 bytes, because the hash algorithm operates on 64-bit words at a time.

How fast is it? On an AMD A8-7600 system, I cached a 3.6GB file in RAM and hashed it five times with these algorithms to illustrate why secure hashes aren’t a great choice for speed. Best run times were as follows (in order of performance):

  • jodyhash: 1.601s
  • md5sum: 7.681s
  • cksum: 10.897s
  • sha512sum: 13.235s
  • sha1sum: 13.661s
  • sha256sum: 19.397s
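
For anyone who wants to run a similar comparison, here's one way to do it (not necessarily the exact commands used for the timings above); bigfile.bin stands in for whatever large file you have handy, and the jodyhash utility is assumed to take a file name argument like the other tools do:

FILE=bigfile.bin

# Read the file once so it's sitting in the page cache
cat "$FILE" > /dev/null

# Time five runs of each tool and note the best result
# (GNU time's -f option prints elapsed seconds via %e on stderr)
for TOOL in jodyhash md5sum cksum sha512sum sha1sum sha256sum; do
    for RUN in 1 2 3 4 5; do
        /usr/bin/time -f "$TOOL run $RUN: %e sec" "$TOOL" "$FILE" > /dev/null
    done
done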

How low is the rate of collisions? I tested against a list of 216,553 English words in upper case and had a single pair collide. The same list in lower case had zero collisions. A list of the ASCII decimal numbers from 0 to 216553 (generated with the command seq 0 216553) had zero collisions. I have also tested against registry key name strings, partial and full files (using jdupes with debug stats enabled), and full disk blocks, and hash collisions for differing input data are so rare that I practically never see any, even with fairly large data sets.
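
A rough way to run this kind of collision check yourself, assuming a one-word-per-line list in words.txt and a hash command that reads standard input and prints the digest first (md5sum behaves this way; substitute whatever hash tool you're testing if it does the same):

# Hash every word, then count digests that appear more than once
while read -r WORD; do
    printf '%s' "$WORD" | md5sum | cut -d' ' -f1
done < words.txt | sort | uniq -d | wc -l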

If you are a programmer searching for a very fast hash function with a low collision rate, I’d like to invite you to try out jodyhash. I think you’ll be happy with it.
