hyperfine, benchmarks for CLIs

Some years ago, I thought that by looking directly at a code diff, I could estimate how much faster it would run. I would reason about complexity, or about how a given loop would be way faster with some values precomputed. And of course, it is never that simple. Cache locality, thread synchronization and lock contention are all nearly impossible to estimate without running the code. So we have to find a better way to know whether a PR indeed improves the program’s overall speed. Benchmarking to the rescue!

Benchmarking, in its simplest definition, means running some code and measuring how much time it took. There are plenty of ways to achieve that. One can run a specific function and see how it behaves with each change. That usually requires a language-specific framework. To name a few:

  • For Rust, criterion generates graphs and even shows improvements between runs
  • For Go, testing is the standard way to do it
  • For Python, timeit is the standard way to do it
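Before reaching for any of these, the definition above can be sketched by hand in a few lines of shell. This is only an illustration, not how any of the frameworks work internally: it assumes GNU date (for the %N nanosecond format), and `sleep 0.1` is a stand-in for whatever command you actually want to time.

```shell
# Time a command several times by hand and print the mean in milliseconds.
# Assumption: GNU date, which supports %N (nanoseconds).
runs=5
total=0
i=0
while [ "$i" -lt "$runs" ]; do
  start=$(date +%s%N)
  sleep 0.1                       # stand-in workload, replace with your command
  end=$(date +%s%N)
  # Convert the elapsed nanoseconds to milliseconds and accumulate.
  total=$(( total + (end - start) / 1000000 ))
  i=$(( i + 1 ))
done
mean_ms=$(( total / runs ))
echo "mean: ${mean_ms} ms over ${runs} runs"
```

The frameworks listed above automate this kind of loop for a single function, and add proper statistics on top.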

But that’s not the kind of benchmarking I want to show here. I wanted to find a simple, cross-language way to measure different programs. So I found hyperfine, a generic way to benchmark CLIs. Simply give it the command to run (or several different commands) and it will run each one a number of times and show how long it took. That’s not all: it also shows some very useful statistical information, such as the standard deviation.

For example, let’s start by running my Rust tests without any code modification. We first build the tests, then copy the built test binary (as both old-tests and new-tests) so that a later cargo test rebuild does not overwrite it. Now, we set up our hyperfine workflow:

hyperfine ./old-tests ./new-tests

Voilà, both tests are run: you can see how long each took and how much the timings varied, and it even says that the old tests were faster! But I didn’t change anything; the tests that ran are the same. That is because benchmarking is intrinsically an experimental process, and its results can vary for a number of reasons. That’s why you should take these small differences as measurements and not as absolutes.

So now, let’s change some code and rerun the tests. We also don’t want to copy the new tests by hand each time, so we ask hyperfine to run a preparation command before actually measuring:

hyperfine --prepare 'cargo test --no-run; cp target/debug/deps/drop-*[!.][!d] new-tests' ./old-tests ./new-tests

Ooh, it seems that our code changes indeed improve the overall speed, which is quite nice. We can now merge this PR more confidently, knowing that it will improve speed in some cases. And that’s the whole point of benchmarking.
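One caveat about the command above: `--prepare` runs before every timing run, so the build check and the copy are repeated each time. If that overhead bothers you, hyperfine also has a `--setup` option, which runs its command once before the series of timing runs of each benchmarked command. The sketch below is a variant under that assumption (a hyperfine version recent enough to have `--setup`), reusing the same build-and-copy step. The guard around it is only there so the snippet does nothing when hyperfine or the binaries are missing.

```shell
# Same build-and-copy step as above, but run once per benchmarked
# command via --setup instead of before every timing run.
setup_cmd='cargo test --no-run; cp target/debug/deps/drop-*[!.][!d] new-tests'

if command -v hyperfine >/dev/null 2>&1 && [ -x ./old-tests ]; then
  hyperfine --setup "$setup_cmd" ./old-tests ./new-tests
else
  echo "hyperfine or ./old-tests is missing, nothing to measure" >&2
fi
```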

There are plenty of options that you might find useful in your workflows. You can have `hyperfine` run a command with an argument that changes for every benchmark, such as the number of threads to use. You can also export the results as Markdown, to easily share them with others. Or have a set of warmup runs, to ensure that caches are all filled before actually measuring.
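As a rough sketch, these three options can be combined in a single invocation. The names here are assumptions for illustration: ./new-tests is the copied test binary from earlier, `--test-threads` is the Rust test harness option for the number of threads, and results.md and the 1 to 4 range are arbitrary choices.

```shell
# {threads} is replaced by hyperfine with each value of the scan.
bench_cmd='./new-tests --test-threads {threads}'

if command -v hyperfine >/dev/null 2>&1 && [ -x ./new-tests ]; then
  # 3 warmup runs per benchmark, one benchmark per thread count from
  # 1 to 4, and the summary table written to results.md.
  hyperfine --warmup 3 \
    --parameter-scan threads 1 4 \
    --export-markdown results.md \
    "$bench_cmd"
else
  echo "hyperfine or ./new-tests is missing, nothing to measure" >&2
fi
```

The resulting results.md contains a Markdown table with one row per thread count, ready to paste into a PR description.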

Now, go and better estimate how much the changes you’re pushing improve the overall speed. Let’s use fact-based development and not (only) complexity analysis. Or do both at the same time, to check that your reasoning is sound. Do test, do measure, do know.