# How to Track Benchmarks in CI with Bencher
Most benchmark results are ephemeral. They disappear as soon as your terminal reaches its scrollback limit. Some benchmark harnesses let you cache results, but most only do so locally. Bencher allows you to track your benchmarks from both local and CI runs and compare the results, while still using your favorite benchmark harness.
There are three popular ways to compare benchmark results when doing Continuous Benchmarking, that is, benchmarking in CI:
- Statistical Continuous Benchmarking
  - Track benchmark results over time to create a baseline
  - Use this baseline along with a Statistical Threshold to create a statistical boundary
  - Compare the new results against this statistical boundary to detect performance regressions
- Relative Continuous Benchmarking
  - Run the benchmarks for the current baseline code
  - Switch over to the new version of the code
  - Run the benchmarks for the new version of the code
  - Use a Percentage Threshold to create a boundary for the baseline code
  - Compare the new version of the code results against the baseline code results to detect performance regressions
- Change Point Detection
  - Occasionally run the benchmarks for new versions of the code
  - Use a change point detection algorithm to detect performance regressions
  - Bisect to find the commit that introduced the performance regression
## Statistical Continuous Benchmarking
Picking up where we left off in the
Quick Start and Docker Self-Hosted tutorials,
let’s add Statistical Continuous Benchmarking to our Save Walter White
project.
🐰 Make sure you have created an API token and set it as the `BENCHER_API_TOKEN` environment variable before continuing on!
Now we are ready to run our benchmarks in CI. Because every CI environment is a little bit different, the following example is meant to be more illustrative than practical. For more specific examples, see Continuous Benchmarking in GitHub Actions and Continuous Benchmarking in GitLab CI/CD.
First, we need to create and maintain a historical baseline for our `main` branch by benchmarking every change in CI:
- Use the `bencher run` CLI subcommand to run your `main` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the base Branch name. See the `--branch` docs for a full overview. (ex: `--branch main`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the Threshold for the `main` Branch, `ci-runner` Testbed, and `latency` Measure:
  - Set the `--threshold-measure` option to the built-in `latency` Measure that is generated by `bencher mock`. See the `--threshold-measure` docs for more details. (ex: `--threshold-measure latency`)
  - Set the `--threshold-test` option to a Student's t-test (`t_test`). See the `--threshold-test` docs for a full overview. (ex: `--threshold-test t_test`)
  - Set the `--threshold-max-sample-size` option to the maximum sample size of `64`. See the `--threshold-max-sample-size` docs for more details. (ex: `--threshold-max-sample-size 64`)
  - Set the `--threshold-upper-boundary` option to the Upper Boundary of `0.99`. See the `--threshold-upper-boundary` docs for more details. (ex: `--threshold-upper-boundary 0.99`)
  - Set the `--thresholds-reset` flag so that only the specified Threshold is active. See the `--thresholds-reset` docs for a full overview. (ex: `--thresholds-reset`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
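Putting these options together, the assembled command might look something like this sketch (it assumes `BENCHER_API_TOKEN` is already exported in the CI environment):

```bash
bencher run \
    --project save-walter-white-1234abcd \
    --branch main \
    --testbed ci-runner \
    --threshold-measure latency \
    --threshold-test t_test \
    --threshold-max-sample-size 64 \
    --threshold-upper-boundary 0.99 \
    --thresholds-reset \
    --err \
    --adapter json \
    bencher mock
```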
The first time this command is run in CI, it will create the `main` Branch if it does not exist yet. The new `main` Branch will not have a start point or existing data. A Threshold will be created for the `main` Branch, `ci-runner` Testbed, and `latency` Measure. On subsequent runs, new data will be added to the `main` Branch. The specified Threshold will then be used to detect performance regressions.
Now, we are ready to catch performance regressions in CI. This is how we would track the performance of a new feature branch in CI, aptly named `feature-branch`:
- Use the `bencher run` CLI subcommand to run your `feature-branch` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the feature Branch name. See the `--branch` docs for a full overview. (ex: `--branch feature-branch`)
- Set the Start Point for the `feature-branch` Branch:
  - Set the `--start-point` option to the feature Branch start point. See the `--start-point` docs for a full overview. (ex: `--start-point main`)
  - Set the `--start-point-hash` option to the feature Branch start point `git` hash. See the `--start-point-hash` docs for a full overview. (ex: `--start-point-hash 32ae...dd8b`)
  - Set the `--start-point-clone-thresholds` flag to clone the Thresholds from the start point. See the `--start-point-clone-thresholds` docs for a full overview. (ex: `--start-point-clone-thresholds`)
  - Set the `--start-point-reset` flag to always reset the Branch to the start point. This will prevent benchmark data drift. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
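Assembled, the `feature-branch` run might look like this sketch, spelling out the full start point hash that is referenced below:

```bash
bencher run \
    --project save-walter-white-1234abcd \
    --branch feature-branch \
    --start-point main \
    --start-point-hash 32aea434d751648726097ed3ac760b57107edd8b \
    --start-point-clone-thresholds \
    --start-point-reset \
    --testbed ci-runner \
    --err \
    --adapter json \
    bencher mock
```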
The first time this command is run in CI, Bencher will create the `feature-branch` Branch since it does not exist yet. The new `feature-branch` Branch will use the `main` Branch at hash `32aea434d751648726097ed3ac760b57107edd8b` as its start point. This means that `feature-branch` will have a copy of all the data and Thresholds from the `main` Branch to compare the results of `bencher mock` against. On all subsequent runs, Bencher will reset the `feature-branch` Branch to the start point, and use the `main` Branch data and Thresholds to detect performance regressions.
## Relative Continuous Benchmarking
Picking up where we left off in the
Quick Start and Docker Self-Hosted tutorials,
let’s add Relative Continuous Benchmarking to our Save Walter White
project.
🐰 Make sure you have created an API token and set it as the `BENCHER_API_TOKEN` environment variable before continuing on!
Relative Continuous Benchmarking runs a side-by-side comparison of two versions of your code.
This can be useful when dealing with noisy CI/CD environments,
where the resources available can be highly variable between runs.
In this example we will be comparing the results from running on the `main` branch to results from running on a feature branch, aptly named `feature-branch`.
Because every CI environment is a little bit different,
the following example is meant to be more illustrative than practical.
For more specific examples, see Continuous Benchmarking in GitHub Actions
and Continuous Benchmarking in GitLab CI/CD.
First, we need to check out the `main` branch with `git` in CI:
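```bash
# A minimal sketch; your CI pipeline may clone or fetch the base branch differently
git checkout main
```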
Then we need to run our benchmarks on the `main` branch in CI:
- Use the `bencher run` CLI subcommand to run your `main` branch benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the base Branch name. See the `--branch` docs for a full overview. (ex: `--branch main`)
- Set the `--start-point-reset` flag to always reset the base Branch. This will make sure that all of the benchmark data is from the current CI runner. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
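Putting these options together, the baseline run might look like this sketch (again assuming `BENCHER_API_TOKEN` is set in the environment):

```bash
bencher run \
    --project save-walter-white-1234abcd \
    --branch main \
    --start-point-reset \
    --testbed ci-runner \
    --adapter json \
    bencher mock
```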
The first time this command is run in CI, it will create the `main` Branch since it does not exist yet. The new `main` Branch will not have a start point, existing data, or Thresholds. On subsequent runs, the old `main` Head will be replaced and a new `main` Head will be created without a start point, existing data, or Thresholds.
Next, we need to check out the `feature-branch` branch with `git` in CI:
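```bash
# A minimal sketch; your CI pipeline may check out the feature branch differently
git checkout feature-branch
```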
Finally, we are ready to run our `feature-branch` benchmarks in CI:
- Use the `bencher run` CLI subcommand to run your `feature-branch` benchmarks. See the `bencher run` CLI subcommand for a full overview. (ex: `bencher run`)
- Set the `--project` option to the Project slug. See the `--project` docs for more details. (ex: `--project save-walter-white-1234abcd`)
- Set the `--branch` option to the feature Branch name. See the `--branch` docs for a full overview. (ex: `--branch feature-branch`)
- Set the Start Point for the `feature-branch` Branch:
  - Set the `--start-point` option to the feature Branch start point. See the `--start-point` docs for a full overview. (ex: `--start-point main`)
  - Set the `--start-point-reset` flag to always reset the Branch to the start point. This will use only the latest relative benchmark results. See the `--start-point-reset` docs for a full overview. (ex: `--start-point-reset`)
- Set the `--testbed` option to the CI runner Testbed name. See the `--testbed` docs for more details. (ex: `--testbed ci-runner`)
- Set the Threshold for the `feature-branch` Branch, `ci-runner` Testbed, and `latency` Measure:
  - Set the `--threshold-measure` option to the built-in `latency` Measure that is generated by `bencher mock`. See the `--threshold-measure` docs for more details. (ex: `--threshold-measure latency`)
  - Set the `--threshold-test` option to a basic percentage (`percentage`). See the `--threshold-test` docs for a full overview. (ex: `--threshold-test percentage`)
  - Set the `--threshold-upper-boundary` option to the Upper Boundary of `0.25`. See the `--threshold-upper-boundary` docs for more details. (ex: `--threshold-upper-boundary 0.25`)
  - Set the `--thresholds-reset` flag so that only the specified Threshold is active. See the `--thresholds-reset` docs for a full overview. (ex: `--thresholds-reset`)
- Set the `--err` flag to fail the command if an Alert is generated. See the `--err` docs for a full overview. (ex: `--err`)
- Set the `--adapter` option to Bencher Metric Format JSON (`json`) that is generated by `bencher mock`. See benchmark harness adapters for a full overview. (ex: `--adapter json`)
- Specify the benchmark command arguments. See benchmark command for a full overview. (ex: `bencher mock`)
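Assembled, the `feature-branch` run might look like this sketch:

```bash
bencher run \
    --project save-walter-white-1234abcd \
    --branch feature-branch \
    --start-point main \
    --start-point-reset \
    --testbed ci-runner \
    --threshold-measure latency \
    --threshold-test percentage \
    --threshold-upper-boundary 0.25 \
    --thresholds-reset \
    --err \
    --adapter json \
    bencher mock
```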
Every time this command is run in CI, it compares the results from `feature-branch` against only the most recent results from `main`. The specified Threshold is then used to detect performance regressions.
## Change Point Detection
Change Point Detection uses a change point algorithm to evaluate a large window of recent results. This allows the algorithm to ignore outliers as noise and produce fewer false positives. Even though Change Point Detection is considered continuous benchmarking, it does not allow you to detect performance regressions in CI. That is, you cannot detect a performance regression before a feature branch merges. This is sometimes referred to as "out-of-band" detection.
For example, suppose you have a benchmark `bench_my_critical_path` with the following historical latencies: 5 ms, 6 ms, 5 ms, 5 ms, 7 ms. If the next benchmark result was 11 ms, then a Statistical Continuous Benchmarking threshold and a Change Point Detection algorithm would interpret things very differently. The threshold would likely be exceeded, and an alert would be generated. If this benchmark run was tied to a pull request, the build would likely be set to fail due to this alert. However, the change point algorithm wouldn't do anything… yet. If things dropped back down to 5 ms on the next run, then it would probably not generate an alert. Conversely, if the next run or two resulted in 10 ms and 12 ms, only then would the change point algorithm trigger an alert.
Are you interested in using Change Point Detection with Bencher? If so, please leave a comment on the tracking issue or reach out to us directly.
🐰 Congrats! You have learned how to track benchmarks in CI with Bencher! 🎉