After coming across cargo bench, I thought I’d track benchmarks on a raytracer I’ve been working on. cargo bench gives you a simple way of defining micro-benchmarks and running them.

Here’s an example like the one in the Rust book:

#![feature(test)]

extern crate test;

#[cfg(test)]
mod tests {
    use test::{Bencher, black_box};

    #[bench]
    fn bench_pow(b: &mut Bencher) {
        // Optionally include some setup
        let x: f64 = 211.0 * 11.0;
        let y: f64 = 301.0 * 103.0;

        b.iter(|| {
            // Inner closure, the actual test
            for _ in 1..100 {
                black_box(x.powf(y).powf(x));
            }
        });
    }
}

And we can run it with rustup run nightly cargo bench:

$ rustup run nightly cargo bench
running 1 test
test tests::bench_pow ... bench:          47 ns/iter (+/- 8)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out

Here we also had to use black_box, a function that is "opaque to the optimizer", so the compiler can't optimize away the computation inside the benchmark.

Currently this is an "unstable" feature because the design hasn't been finalized, so you'll need nightly Rust. People recommend putting your benchmarks under benches/ so that nightly's cargo bench picks them up while stable builds ignore them (stable errors on #![feature(test)]). I've had trouble getting this to work.
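
For reference, the layout people describe looks roughly like this (the crate and file names here are just examples of mine); the idea is that cargo bench compiles the files under benches/ on nightly, while a plain cargo build on stable never touches them:

my_raytracer/
├── Cargo.toml
├── src/
│   └── lib.rs        # ordinary code, no feature gates, builds on stable
└── benches/
    └── pow.rs        # #![feature(test)] and the #[bench] functions live here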

What happens under the covers

Looking at the source for libtest/lib.rs, we can see how the benchmark gets run. When the bencher runs, the test harness collects the functions marked with the #[bench] attribute and calls test::bench::benchmark on each one.

Then, to set up the test, the bencher calls the outer function, which contains your setup code. Your code calls iter on the Bencher with the inner closure. The iter function is the core of the test (there's a sketch of it after this list); it:

  1. Runs a single iteration to get a rough time estimate
  2. Determines how many iterations fit in one millisecond, with a minimum of 1. Call this N
  3. Loops:

    • take 50 samples of N iterations each
    • clamp outliers below the 5th or above the 95th percentile (winsorizing)
    • take another 50 samples of 5*N iterations each
    • after 100ms, check whether the measurements have converged and exit early
    • if we've been running for longer than 3 seconds, exit anyway
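
Here's a simplified, self-contained sketch of that sampling loop. This is not the real libtest code: the helper names (time_n, auto_bench) are mine, and the actual harness winsorizes the samples, checks convergence with the median absolute deviation, scales N up between rounds, and returns full summary statistics rather than a single number.

use std::time::{Duration, Instant};

// Average ns per iteration over n iterations of f.
fn time_n<F: FnMut()>(f: &mut F, n: u64) -> u64 {
    let start = Instant::now();
    for _ in 0..n {
        f();
    }
    (start.elapsed().as_nanos() as u64) / n
}

// Simplified version of the harness loop described above.
fn auto_bench<F: FnMut()>(mut f: F) -> u64 {
    // 1. One iteration for a rough ns/iter estimate.
    let ns_per_iter = time_n(&mut f, 1).max(1);

    // 2. Pick N so one sample takes roughly 1ms, with a minimum of 1.
    let mut n = (1_000_000 / ns_per_iter).max(1);

    let bench_start = Instant::now();
    loop {
        // 3a. 50 samples of N iterations each (the real harness winsorizes these)...
        let samples: Vec<u64> = (0..50).map(|_| time_n(&mut f, n)).collect();
        // 3b. ...then 50 samples of 5*N iterations each.
        let samples5: Vec<u64> = (0..50).map(|_| time_n(&mut f, 5 * n)).collect();

        let median = |mut v: Vec<u64>| { v.sort(); v[v.len() / 2] };
        let (m, m5) = (median(samples), median(samples5));
        let diff = if m > m5 { m - m5 } else { m5 - m };

        // 3c. After 100ms, exit early if the two medians agree within ~1%.
        let elapsed = bench_start.elapsed();
        if elapsed > Duration::from_millis(100) && diff * 100 < m5 {
            return m5;
        }
        // 3d. Never run for longer than ~3 seconds total.
        if elapsed > Duration::from_secs(3) {
            return m5;
        }
        n *= 2; // otherwise scale N up and measure again
    }
}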

There are some pretty neat things in here:

The benchmark harness always runs at least 301 iterations: since N has a minimum of 1, even when a single iteration takes longer than 1ms, you get one estimating run plus 50 samples of N plus 50 samples of 5*N, i.e. 1 + 50 + 250 = 301 iterations.

DIY Benchmarking

I wanted to measure the performance of what I'd actually be running: rendering a test scene. Even for a small test this takes a few seconds, and I'd like to run fewer than 301 iterations, even at the cost of some statistical accuracy. The current benchmark harness doesn't have any options for parameterizing benchmarks, and I haven't found a way to limit the number of iterations.

I learned that what I'm trying to build is a macro-benchmark, which measures a realistic workload or real-life situation, whereas a micro-benchmark tests an individual piece, like a single critical operation.

You can see from the source that Rust's benchmarking is designed to make writing converging micro-benchmarks simple. For my use case, I can forgo those features and get away with little more than a timing function.

use time::PreciseTime; // PreciseTime comes from the `time` crate (0.1.x)

// Run a function and return its result along with the elapsed time in seconds
pub fn time<F, T>(f: F) -> (T, f64)
  where F: FnOnce() -> T {
  let start = PreciseTime::now();
  let res = f();
  let end = PreciseTime::now();

  let runtime_nanos = start.to(end)
    .num_nanoseconds()
    .expect("Benchmark iter took greater than 2^63 nanoseconds");
  let runtime_secs = runtime_nanos as f64 / 1_000_000_000.0;
  (res, runtime_secs)
}

The rest is just the actual benchmark and another function on top that averages and prints the runs, which I could eventually replace with something like libtest/stats.rs.
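
For completeness, here's a sketch of what that averaging wrapper looks like; bench, render_scene, and the run count are stand-ins for my actual code, built on the time function above.

// Run render_scene a number of times, print each run, and print the average.
// (Sketch only: render_scene is assumed to return the number of rays traced.)
pub fn bench<F>(runs: usize, mut render_scene: F)
  where F: FnMut() -> usize {
  let mut total_rays_per_sec = 0.0;
  for _ in 0..runs {
    let (rays, secs) = time(|| render_scene());
    let rays_per_sec = rays as f64 / secs;
    println!("{} rays in {} sec, {:.2} rays/sec", rays, secs, rays_per_sec);
    total_rays_per_sec += rays_per_sec;
  }
  println!("Avg: {:.2} rays/sec from {} runs", total_rays_per_sec / runs as f64, runs);
}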

When I hook it up in main and run it:

$ cargo run --release
...
25000 rays in 0.120828293 sec, 206905.18 rays/sec
25000 rays in 0.123399486 sec, 202594.04 rays/sec
25000 rays in 0.120740197 sec, 207056.15 rays/sec
Avg: 203611.35 rays/sec from 10 runs

As a bonus, I can get summary statistics on the main program for free!

Benchmark Design

Since I started digging into this, I've also been reading about what makes a good benchmark. Benchmarks are really error-prone.

Even here, I don't know whether my test scene will stay representative as my program grows, so it may be biased. I'm also only measuring wall-clock time, so I'm including a lot of noise like the whims of the OS scheduler. And that's just what I already know about. Maybe the benchmark will just be a number that tells me which way the wind is blowing with performance.

I also don't get any insight beyond a number and a vague sense that "this change might have slowed things down". I'm trying to profile the program now, and, hey, maybe I could actually use cargo bench once I know what the critical operations are.