What is Continuous Benchmarking?
Continuous Benchmarking is a software development practice where members of a team benchmark their work frequently, usually each person benchmarks at least daily - leading to multiple benchmarks per day. Each benchmark is verified by an automated build to detect performance regressions as quickly as possible. Many teams find that this approach leads to significantly reduced performance regressions and allows a team to develop performant software more rapidly.
By now, everyone in the software industry is aware of continuous integration (CI). At a fundamental level, CI is about detecting and preventing software feature regressions before they make it to production. Similarly, continuous benchmarking (CB) is about detecting and preventing software performance regressions before they make it to production. For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change. This analogy is so apt in fact, that the first paragraph of this section is just a Mad Libs version of Martin Fowler’s 2006 intro to Continuous Integration.
🐰 Performance bugs are bugs!
Benchmarking in CI
Myth: You can’t run benchmarks in CI
Most benchmarking harnesses use the system wall clock to measure latency or throughput. This is very helpful, as these are the exact metrics that we as developers care the most about. However, general purpose CI environments are often noisy and inconsistent when measuring wall clock time. When performing continuous benchmarking, this volatility adds unwanted noise into the results.
There are a few options for handling this:
- Relative benchmarking
- Dedicated CI runners, like BuildJet
- Switching benchmark harnesses to one that counts instructions as opposed to wall time
Or simply embrace the chaos! Continuous benchmarking doesn’t have to be perfect. Yes, reducing the volatility and thus the noise in your continuous benchmarking environment will allow you to detect ever finer performance regressions. However, don’t let perfect be the enemy of good here!
You might look at this graph and think, “Wow, that’s crazy!” But ask yourself, can your current development process detect a factor of two or even a factor of ten performance regression before it affects your users? Probably not! Now that is crazy!
Even with all of the noise from a CI environment, tracking wall clock benchmarks can still pay great dividends in catching performance regressions before they reach your customers in production. Over time, as your software performance management matures you can build from there. In the meantime, just use your regular CI.
Myth: You can’t notice 100ms of latency
It’s common to hear people claim that humans can’t perceive 100ms of latency. A Nielsen Group article on response times is often cited for this claim.
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
- Jakob Nielsen, 1 Jan 1993
But that simply is not true.
On some tasks, people can perceive as little as 2ms of latency.
An easy way to prove this is an experiment from Dan Luu: open your terminal and run
sleep 0; echo "ping" and then run
sleep 0.1; echo "pong". You noticed the difference right‽
Another common point of confusion is the distinction between the perception of latency and human reaction times. Even though it takes around 200ms to respond to a visual stimulus, that is independent from the perception of the event itself. By analogy, you can notice that your train is two minutes late (perceived latency) even though the train ride takes two hours (reaction time).
Performance matters! Performance is a feature!
- Every 100ms faster → 1% more conversions (Mobify, earning +$380,000/yr)
- 50% faster → 12% more sales (AutoAnything)
- 20% faster → 10% more conversions (Furniture Village)
- 40% faster → 15% more sign-ups (Pinterest)
- 850ms faster → 7% more conversions (COOK)
- Every 1 second slower → 10% fewer users (BBC)
With the death of Moore’s Law, workloads that can run in parallel will need to parallelized. However, most workloads need to run in series, and simply throwing more compute at the problem is quickly becoming an intractable and expensive solution.
Continuous Benchmarking is a key component to developing and maintaining performant modern software in the face of this change.
Continuous Benchmarking Tools
Before creating Bencher, we set out to find a tool that could:
- Track benchmarks across multiple languages
- Seamlessly ingest language standard benchmark harness output
- Extensible for custom benchmark harness output
- Open source and able to self-host
- Work with multiple CI hosts
- User authentication and authorization
Unfortunately, nothing that met all of these criteria existed. See prior art from a comprehensive list of the existing benchmarking tools that we took inspiration from.
Continuous Benchmarking in Big Tech
Tools like Bencher have been developed internally at Microsoft, Facebook (now Meta), Apple, Amazon, Netflix, and Google among countless others. As the titans of the industry, they understand the importance of monitoring performance during development and integrating these insights into the development process through CB. We built Bencher to bring continuous benchmarking from behind the walls of Big Tech to the open source community. For links to posts related to continuous benchmarking from Big Tech see prior art.
Continuous Benchmarking vs Local Benchmark Comparison
There are several benchmark harnesses that allow you to compare results locally. Local comparison is great for iterating quickly when performance tuning. However, it should not be relied on to catch performance regressions on an ongoing basis. Just as being able to run unit tests locally doesn’t obviate the need for CI, being able to run and compare benchmarks locally doesn’t obviate the need for CB.
There are several features Bencher offers that local benchmark comparison tools cannot:
- Comparison of the same benchmark between different testbeds
- Comparison of benchmarks across languages and harnesses
- Collaboration and sharing of benchmark results
- Running benchmarks on dedicated testbeds to minimize noise
- No more copypasta
Continuous Benchmarking vs Application Performance Management (APM)
Application Performance Management (APM) is a vital tool for modern software services. However, APM is designed to be used in production. By the time a performance regression is detected, it is already impacting your customers.
Most defects end up costing more than it would have cost to prevent them. Defects are expensive when they occur, both the direct costs of fixing the defects and the indirect costs because of damaged relationships, lost business, and lost development time.
— Kent Beck, Extreme Programming Explained
There are several features Bencher offers that APM tools cannot:
- Detect and prevent performance regressions before they make it to production
- Performance changes and impacts included in code review
- No overhead in production environments
- Effective for on-prem deployments
- No changes to production source code
Continuous Benchmarking vs Observability
A rose by any other name would smell as sweet. See Continuous Benchmarking vs Application Performance Management above.
Continuous Benchmarking vs Continuous Integration (CI)
Continuous Benchmarking (CB) is complimentary to Continuous Integration (CI). For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change.
While unit and acceptance testing are widely embraced as standard development practices, this trend has not continued into the realm of performance testing. Currently, the common tooling drives testers towards creating throw away code and a click-and-script mentality. Treating performance testing as a first-class citizen enables the creation of better tests that cover more functionality, leading to better tooling to create and run performance tests, resulting in a test suite that is maintainable and can itself be tested.
Continuous Benchmarking vs Continuous Load Testing
|Test Kind||Test Scope||Test Users|
|Benchmarking||Function - Service||One - Many|
Benchmarking can test the performance of software from the function level (micro-benchmarks) all the way up to the service level (macro-benchmarks). Benchmarks are great for testing the performance of a particular part of your code in an isolated manner. Load testing only tests the performance of software at the service level and mocks multiple concurrent users. Load tests are great for testing the performance of the entire service under a specific load.
🍦 Imagine we wanted to track the performance of an ice-cream truck. Benchmarking could be used to measure how long it takes to scoop an ice-cream cone (micro-benchmark), and benchmarking could also be used to measure how long it takes a single customer to order, get their ice-cream, and pay (macro-benchmark). Load testing could be used to see how well the ice-cream truck serves 100 customer on a hot summer day.