What is Continuous Benchmarking?

Continuous Benchmarking is a software development practice where members of a team benchmark their work frequently, usually each person benchmarks at least daily - leading to multiple benchmarks per day. Each benchmark is verified by an automated build to detect performance regressions as quickly as possible. Many teams find that this approach leads to significantly reduced performance regressions and allows a team to develop performant software more rapidly.

By now, everyone in the software industry is aware of continuous integration (CI). At a fundamental level, CI is about detecting and preventing software feature regressions before they make it to production. Similarly, continuous benchmarking (CB) is about detecting and preventing software performance regressions before they make it to production. For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change. This analogy is so apt, in fact, that the first paragraph of this section is just a Mad Libs version of Martin Fowler’s 2006 intro to Continuous Integration.

🐰 Performance bugs are bugs!

Benchmarking in CI

Myth: You can’t run benchmarks in CI

Most benchmarking harnesses use the system wall clock to measure latency or throughput. This is very helpful, as these are the exact metrics that we as developers care the most about. However, general purpose CI environments are often noisy and inconsistent when measuring wall clock time. When performing continuous benchmarking, this volatility adds unwanted noise into the results.

There are a few options for handling this:

Relative Continuous Benchmarking
Dedicated CI runners
Switching benchmark harnesses to one that counts instructions as opposed to wall time

Or simply embrace the chaos! Continuous benchmarking doesn’t have to be perfect. Yes, reducing the volatility and thus the noise in your continuous benchmarking environment will allow you to detect ever finer performance regressions. However, don’t let perfect be the enemy of good here!

[Graph: Embrace the Chaos! for Bencher]

You might look at this graph and think, “Wow, that’s crazy!” But ask yourself, can your current development process detect a factor of two or even a factor of ten performance regression before it affects your users? Probably not! Now that is crazy!

Even with all of the noise from a CI environment, tracking wall clock benchmarks can still pay great dividends in catching performance regressions before they reach your customers in production. Over time, as your software performance management matures, you can build from there. In the meantime, just use your regular CI.
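As a rough illustration of the first option listed above, relative continuous benchmarking, the sketch below times a baseline and a candidate in the same run on the same machine and reports only their ratio. Because both sides see the same noisy environment, the ratio tends to be more stable than either absolute number. The function names and workloads here are hypothetical stand-ins, not part of any particular benchmark harness.

```python
import statistics
import time


def median_seconds(func, repeats=15):
    """Return the median wall-clock time of calling func() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_slowdown(baseline, candidate, repeats=15):
    """Ratio of candidate time to baseline time, both measured in this run."""
    return median_seconds(candidate, repeats) / median_seconds(baseline, repeats)


# Hypothetical workloads standing in for the old and new versions of some code.
def baseline_sum():
    return sum(range(20_000))


def candidate_sum():
    total = 0
    for _ in range(10):  # deliberately does ten times the work
        total += sum(range(20_000))
    return total


if __name__ == "__main__":
    ratio = relative_slowdown(baseline_sum, candidate_sum)
    print(f"candidate takes {ratio:.1f}x the baseline time")
```

Using the median rather than the mean is one simple way to keep a single noisy sample from dominating the result.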

Performance Matters

Myth: You can’t notice 100ms of latency

It’s common to hear people claim that humans can’t perceive 100ms of latency. A Nielsen Norman Group article on response times is often cited for this claim.

0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.

Jakob Nielsen, 1 Jan 1993

But that simply is not true. On some tasks, people can perceive as little as 2ms of latency. An easy way to prove this is an experiment from Dan Luu: open your terminal and run sleep 0; echo "ping" and then run sleep 0.1; echo "pong". You noticed the difference, right‽

Another common point of confusion is the distinction between the perception of latency and human reaction times. Even though it takes around 200ms to respond to a visual stimulus, that is independent from the perception of the event itself. By analogy, you can notice that your train is two minutes late (perceived latency) even though the train ride takes two hours (reaction time).

Performance matters! Performance is a feature!

Every 100ms faster → 1% more conversions (Mobify, earning +$380,000/yr)
50% faster → 12% more sales (AutoAnything)
20% faster → 10% more conversions (Furniture Village)
40% faster → 15% more sign-ups (Pinterest)
850ms faster → 7% more conversions (COOK)
Every 1 second slower → 10% fewer users (BBC)

With the death of Moore’s Law, workloads that can run in parallel will need to be parallelized. However, most workloads need to run in series, and simply throwing more compute at the problem is quickly becoming an intractable and expensive solution.

Continuous Benchmarking is a key component to developing and maintaining performant modern software in the face of this change.

Moore's Law from https://davidwells.io/blog/rise-of-embarrassingly-parallel-serverless-compute

Continuous Benchmarking Tools

Before creating Bencher, we set out to find a tool that could:

Track benchmarks across multiple languages
Seamlessly ingest language standard benchmark harness output
Be extended to handle custom benchmark harness output
Be open source and self-hosted
Work with multiple CI hosts
Provide user authentication and authorization

Unfortunately, nothing that met all of these criteria existed. See prior art for a comprehensive list of the existing benchmarking tools that we took inspiration from.

Continuous Benchmarking in Big Tech

Tools like Bencher have been developed internally at Microsoft, Facebook (now Meta), Apple, Amazon, Netflix, and Google among countless others. As the titans of the industry, they understand the importance of monitoring performance during development and integrating these insights into the development process through CB. We built Bencher to bring continuous benchmarking from behind the walls of Big Tech to the open source community. For links to posts related to continuous benchmarking from Big Tech see prior art.

Bencher: Continuous Benchmarking

Bencher is a suite of continuous benchmarking tools. Have you ever had a performance regression impact your users? Bencher could have prevented that from happening. Bencher allows you to detect and prevent performance regressions before they make it to production.

Run: Run your benchmarks locally or in CI using your favorite benchmarking tools. The bencher CLI simply wraps your existing benchmark harness and stores its results.
Track: Track the results of your benchmarks over time. Monitor, query, and graph the results using the Bencher web console based on the source branch, testbed, benchmark, and measure.
Catch: Catch performance regressions in CI. Bencher uses state of the art, customizable analytics to detect performance regressions before they make it to production.
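To make the "Catch" step more concrete, here is a minimal sketch of one common way to flag a regression: compare the latest result against the mean and standard deviation of earlier results. This is an illustrative assumption, not a description of Bencher's actual analytics. The function name, the threshold of three standard deviations, and the sample data are all made up for the example.

```python
import statistics


def is_regression(history, latest, max_z=3.0):
    """Flag `latest` if it sits more than `max_z` standard deviations
    above the mean of the historical results (higher means slower)."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest > mean
    return (latest - mean) / stdev > max_z


# Made-up latencies in milliseconds from earlier CI runs (illustrative only).
history_ms = [101.0, 99.0, 100.0, 102.0, 98.0]

print(is_regression(history_ms, 101.5))  # a result within the usual noise
print(is_regression(history_ms, 200.0))  # a result roughly twice as slow
```

A noisier CI environment widens the standard deviation, so a check like this becomes less sensitive but still catches large regressions such as a factor of two.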

For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should be run in CI with Bencher to prevent performance regressions. Performance bugs are bugs!

Start catching performance regressions in CI — try Bencher Cloud for free.

Continuous Benchmarking vs Local Benchmark Comparison

There are several benchmark harnesses that allow you to compare results locally. Local comparison is great for iterating quickly when performance tuning. However, it should not be relied on to catch performance regressions on an ongoing basis. Just as being able to run unit tests locally doesn’t obviate the need for CI, being able to run and compare benchmarks locally doesn’t obviate the need for CB.

There are several features Bencher offers that local benchmark comparison tools cannot:

Comparison of the same benchmark between different testbeds
Comparison of benchmarks across languages and harnesses
Collaboration and sharing of benchmark results
Running benchmarks on dedicated testbeds to minimize noise
No more copypasta

Continuous Benchmarking vs Application Performance Management (APM)

Application Performance Management (APM) is a vital tool for modern software services. However, APM is designed to be used in production. By the time a performance regression is detected, it is already impacting your customers.

Most defects end up costing more than it would have cost to prevent them. Defects are expensive when they occur, both the direct costs of fixing the defects and the indirect costs because of damaged relationships, lost business, and lost development time.

— Kent Beck, Extreme Programming Explained

There are several features Bencher offers that APM tools cannot:

Detect and prevent performance regressions before they make it to production
Performance changes and impacts included in code review
No overhead in production environments
Effective for on-prem deployments
No changes to production source code

Continuous Benchmarking vs Observability

A rose by any other name would smell as sweet. See Continuous Benchmarking vs Application Performance Management above.

Continuous Benchmarking vs Continuous Integration (CI)

Continuous Benchmarking (CB) is complementary to Continuous Integration (CI). For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change.

While unit and acceptance testing are widely embraced as standard development practices, this trend has not continued into the realm of performance testing. Currently, the common tooling drives testers towards creating throw away code and a click-and-script mentality. Treating performance testing as a first-class citizen enables the creation of better tests that cover more functionality, leading to better tooling to create and run performance tests, resulting in a test suite that is maintainable and can itself be tested.

— Thoughtworks Technology Radar, 22 May 2013

Continuous Benchmarking vs Continuous Load Testing

In order to understand the difference between Continuous Benchmarking and Continuous Load Testing, you need to understand the difference between benchmarking and load testing.

Test Kind	Test Scope	Test Users
Benchmarking	Function - Service	One - Many
Load Testing	Service	Many

Benchmarking can test the performance of software from the function level (micro-benchmarks) all the way up to the service level (macro-benchmarks). Benchmarks are great for testing the performance of a particular part of your code in an isolated manner. Load testing only tests the performance of software at the service level and mocks multiple concurrent users. Load tests are great for testing the performance of the entire service under a specific load.

🍦 Imagine we wanted to track the performance of an ice-cream truck. Benchmarking could be used to measure how long it takes to scoop an ice-cream cone (micro-benchmark), and benchmarking could also be used to measure how long it takes a single customer to order, get their ice-cream, and pay (macro-benchmark). Load testing could be used to see how well the ice-cream truck serves 100 customers on a hot summer day.
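The ice-cream truck analogy can be sketched in code. The example below is a toy model under stated assumptions: scoop_cone and serve_customer are hypothetical functions that simulate work with short sleeps, and the number of servers is an invented parameter. It shows a micro-benchmark of one function, a macro-benchmark of one full customer flow, and a small load test with many concurrent customers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scoop_cone():
    """Hypothetical unit of work: scooping one cone."""
    time.sleep(0.001)


def serve_customer():
    """Hypothetical end-to-end flow: order, scoop, pay."""
    time.sleep(0.001)  # take the order
    scoop_cone()
    time.sleep(0.001)  # take payment


def timed(func):
    """Return the wall-clock seconds taken by a single call to func()."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def load_test(customers=100, servers=4):
    """Serve many customers concurrently and return the total elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=servers) as pool:
        futures = [pool.submit(serve_customer) for _ in range(customers)]
        for future in futures:
            future.result()  # re-raise any error from a worker
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"micro-benchmark (one scoop):    {timed(scoop_cone) * 1000:.1f} ms")
    print(f"macro-benchmark (one customer): {timed(serve_customer) * 1000:.1f} ms")
    print(f"load test (100 customers):      {load_test() * 1000:.1f} ms")
```

The two benchmarks measure a single path in isolation, while the load test measures how the whole service behaves when many customers arrive at once.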


Published: Sat, August 12, 2023 at 4:07:00 PM UTC | Last Updated: Wed, March 27, 2024 at 7:50:00 AM UTC