How to Track Benchmarks in CI with Bencher
Most benchmark results are ephemeral. They disappear as soon as your terminal reaches its scrollback limit. Some benchmark harnesses let you cache results, but most only do so locally. Bencher allows you to track your benchmarks from both local and CI runs and compare the results, while still using your favorite benchmark harness.
There are three popular ways to compare benchmark results when doing Continuous Benchmarking, that is, benchmarking in CI:
- Statistical Continuous Benchmarking
  - Track benchmark results over time to create a baseline
  - Use this baseline along with Statistical Thresholds to create a statistical boundary
  - Compare the new results against this statistical boundary to detect performance regressions
- Relative Continuous Benchmarking
  - Run the benchmarks for the current baseline code
  - Switch over to the new version of the code
  - Run the benchmarks for the new version of the code
  - Use Percentage Thresholds to create a boundary for the baseline code
  - Compare the new version of the code results against the baseline code results to detect performance regressions
- Change Point Detection
  - Occasionally run the benchmarks for new versions of the code
  - Use a change point detection algorithm to detect performance regressions
  - Bisect to find the commit that introduced the performance regression
Statistical Continuous Benchmarking
Picking up where we left off in the
Quick Start and Docker Self-Hosted tutorials,
let’s add Statistical Continuous Benchmarking to our Save Walter White
project.
🐰 Make sure you have created an API token and set it as the `BENCHER_API_TOKEN` environment variable before continuing on!
Now we are ready to run our benchmarks in CI. Because every CI environment is a little bit different, the following example is meant to be more illustrative than practical. For more specific examples, see Continuous Benchmarking in GitHub Actions and Continuous Benchmarking in GitLab CI/CD.
First, we need to create and maintain a historical baseline for our `main` branch by benchmarking every change in CI (the full command is shown after this list):

- Use the `bencher run` CLI subcommand to run your `main` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the base Branch name. See the `--branch` docs for a full overview. (ex: `--branch main`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the Threshold for the `main` Branch, `ci-runner` Testbed, and `latency` Measure:
  - Set the `--threshold-measure` option to the built-in `latency` Measure that is generated by `bencher mock`. See the `--threshold-measure` docs for more details. (ex: `--threshold-measure latency`)
  - Set the `--threshold-test` option to a Student’s t-test (`t_test`). See the `--threshold-test` docs for a full overview. (ex: `--threshold-test t_test`)
  - Set the `--threshold-max-sample-size` option to the maximum sample size of `64`. See the `--threshold-max-sample-size` docs for more details. (ex: `--threshold-max-sample-size 64`)
  - Set the `--threshold-upper-boundary` option to the Upper Boundary of `0.99`. See the `--threshold-upper-boundary` docs for more details. (ex: `--threshold-upper-boundary 0.99`)
  - Set the `--thresholds-reset` flag so that only the specified Threshold is active. See the `--thresholds-reset` docs for a full overview. (ex: `--thresholds-reset`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
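Putting all of those options together, the full invocation might look something like this (a sketch; it relies on the `BENCHER_API_TOKEN` environment variable set up earlier):

```bash
# Benchmark the main branch and build up the statistical baseline.
# Assumes BENCHER_API_TOKEN is already set in the CI environment (see above).
bencher run \
    --project save-walter-white-1234abcd \
    --branch main \
    --testbed ci-runner \
    --threshold-measure latency \
    --threshold-test t_test \
    --threshold-max-sample-size 64 \
    --threshold-upper-boundary 0.99 \
    --thresholds-reset \
    --err \
    --adapter json \
    bencher mock
```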
The first time this command is run in CI, it will create the `main` Branch if it does not exist yet. The new `main` Branch will not have a start point or existing data. A Threshold will be created for the `main` Branch, `ci-runner` Testbed, and `latency` Measure. On subsequent runs, new data will be added to the `main` Branch. The specified Threshold will then be used to detect performance regressions.
Now, we are ready to catch performance regressions in CI. This is how we would track the performance of a new feature branch in CI, aptly named `feature-branch` (the full command is shown after this list):

- Use the `bencher run` CLI subcommand to run your `feature-branch` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the feature Branch name. See the `--branch` docs for a full overview. (ex: `--branch feature-branch`)
- Set the Start Point for the `feature-branch` Branch:
  - Set the `--start-point` option to the feature Branch start point. See the `--start-point` docs for a full overview. (ex: `--start-point main`)
  - Set the `--start-point-hash` option to the feature Branch start point `git` hash. See the `--start-point-hash` docs for a full overview. (ex: `--start-point-hash 32ae...dd8b`)
  - Set the `--start-point-clone-thresholds` flag to clone the Thresholds from the start point. See the `--start-point-clone-thresholds` docs for a full overview. (ex: `--start-point-clone-thresholds`)
  - Set the `--start-point-reset` flag to always reset the Branch to the start point. This will prevent benchmark data drift. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
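Putting those options together, the invocation might look something like this (a sketch, using the full example start point hash discussed below):

```bash
# Benchmark the feature branch against the main branch baseline.
# Assumes BENCHER_API_TOKEN is already set in the CI environment (see above).
bencher run \
    --project save-walter-white-1234abcd \
    --branch feature-branch \
    --start-point main \
    --start-point-hash 32aea434d751648726097ed3ac760b57107edd8b \
    --start-point-clone-thresholds \
    --start-point-reset \
    --testbed ci-runner \
    --err \
    --adapter json \
    bencher mock
```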
The first time this command is run in CI, Bencher will create the `feature-branch` Branch since it does not exist yet. The new `feature-branch` will use the `main` Branch at hash `32aea434d751648726097ed3ac760b57107edd8b` as its start point. This means that `feature-branch` will have a copy of all the data and Thresholds from the `main` Branch to compare the results of `bencher mock` against. On all subsequent runs, Bencher will reset the `feature-branch` Branch to the start point, and use the `main` Branch data and Thresholds to detect performance regressions.
Relative Continuous Benchmarking
Picking up where we left off in the
Quick Start and Docker Self-Hosted tutorials,
let’s add Relative Continuous Benchmarking to our Save Walter White
project.
🐰 Make sure you have created an API token and set it as the `BENCHER_API_TOKEN` environment variable before continuing on!
Relative Continuous Benchmarking runs a side-by-side comparison of two versions of your code. This can be useful when dealing with noisy CI/CD environments, where the resources available can be highly variable between runs. In this example we will be comparing the results from running on the `main` branch to results from running on a feature branch, aptly named `feature-branch`.
Because every CI environment is a little bit different,
the following example is meant to be more illustrative than practical.
For more specific examples, see Continuous Benchmarking in GitHub Actions
and Continuous Benchmarking in GitLab CI/CD.
First, we need to check out the `main` branch with `git` in CI.
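With a full clone of the repository already on the runner, this could be as simple as (in practice your CI system’s checkout step may differ, for example a shallow clone may need to fetch the branch first):

```bash
git checkout main
```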
Then we need to run our benchmarks on the `main` branch in CI (the full command is shown after this list):

- Use the `bencher run` CLI subcommand to run your `main` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the base Branch name. See the `--branch` docs for a full overview. (ex: `--branch main`)
- Set the `--start-point-reset` flag to always reset the base Branch. This will make sure that all of the benchmark data is from the current CI runner. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
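Putting those options together, the `main` run might look something like this (a sketch):

```bash
# Benchmark the main branch, resetting its Head on every run.
bencher run \
    --project save-walter-white-1234abcd \
    --branch main \
    --start-point-reset \
    --testbed ci-runner \
    --adapter json \
    bencher mock
```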
The first time this command is run in CI, it will create the `main` Branch since it does not exist yet. The new `main` Branch will not have a start point, existing data, or Thresholds. On subsequent runs, the old `main` Head will be replaced and a new `main` Head will be created without a start point, existing data, or Thresholds.
Next, we need to check out the `feature-branch` branch with `git` in CI.
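Again, with a full clone available, this could be as simple as:

```bash
git checkout feature-branch
```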
Finally, we are ready to run our `feature-branch` benchmarks in CI (the full command is shown after this list):

- Use the `bencher run` CLI subcommand to run your `feature-branch` benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the feature Branch name. See the `--branch` docs for a full overview. (ex: `--branch feature-branch`)
- Set the Start Point for the `feature-branch` Branch:
  - Set the `--start-point` option to the feature Branch start point. See the `--start-point` docs for a full overview. (ex: `--start-point main`)
  - Set the `--start-point-reset` flag to always reset the Branch to the start point. This will use only the latest relative benchmark results. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the Threshold for the `feature-branch` Branch, `ci-runner` Testbed, and `latency` Measure:
  - Set the `--threshold-measure` option to the built-in `latency` Measure that is generated by `bencher mock`. See the `--threshold-measure` docs for more details. (ex: `--threshold-measure latency`)
  - Set the `--threshold-test` option to a basic percentage (`percentage`). See the `--threshold-test` docs for a full overview. (ex: `--threshold-test percentage`)
  - Set the `--threshold-upper-boundary` option to the Upper Boundary of `0.25`. See the `--threshold-upper-boundary` docs for more details. (ex: `--threshold-upper-boundary 0.25`)
  - Set the `--thresholds-reset` flag so that only the specified Threshold is active. See the `--thresholds-reset` docs for a full overview. (ex: `--thresholds-reset`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
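Putting those options together, the `feature-branch` run might look something like this (a sketch):

```bash
# Benchmark the feature branch against the fresh main branch results.
bencher run \
    --project save-walter-white-1234abcd \
    --branch feature-branch \
    --start-point main \
    --start-point-reset \
    --testbed ci-runner \
    --threshold-measure latency \
    --threshold-test percentage \
    --threshold-upper-boundary 0.25 \
    --thresholds-reset \
    --err \
    --adapter json \
    bencher mock
```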
Every time this command is run in CI, it compares the results from `feature-branch` against only the most recent results from `main`. The specified Threshold is then used to detect performance regressions.
Change Point Detection
Change Point Detection uses a change point algorithm to evaluate a large window of recent results. This allows the algorithm to ignore outliers as noise and produce fewer false positives. Even though Change Point Detection is considered continuous benchmarking, it does not allow you to detect performance regressions in CI. That is, you cannot detect a performance regression before a feature branch merges. This is sometimes referred to as “out-of-band” detection.
For example, say you have a benchmark `bench_my_critical_path` with the following historical latencies: 5 ms, 6 ms, 5 ms, 5 ms, and 7 ms. If the next benchmark result was 11 ms, then a Statistical Continuous Benchmarking threshold and a Change Point Detection algorithm would interpret things very differently. The threshold would likely be exceeded and an alert would be generated. If this benchmark run was tied to a pull request, the build would likely be set to fail due to this alert. However, the change point algorithm wouldn’t do anything… yet. If on the next run things dropped back down to 5 ms, then it would probably not generate an alert. Conversely, if the next run or two resulted in 10 ms and 12 ms, only then would the change point algorithm trigger an alert.
Are you interested in using Change Point Detection with Bencher? If so, please leave a comment on the tracking issue or reach out to us directly.
🐰 Congrats! You have learned how to track benchmarks in CI with Bencher! 🎉