Thresholds & Alerts


Thresholds are how you catch performance regressions with Bencher. A Threshold is assigned to a unique combination of: Branch, Testbed, and Measure. A Threshold uses a specific Test to detect performance regressions. The combination of a Test and its parameters is called a Model. A Model must have a Lower Boundary, Upper Boundary, or both.

  • Lower Boundary
    • A Lower Boundary is used when a smaller value would indicate a performance regression, such as with the Throughput Measure.
  • Upper Boundary
    • An Upper Boundary is used when a larger value would indicate a performance regression, such as with the Latency Measure.

Each Boundary is used to calculate a Boundary Limit. Then every new Metric is checked against each Boundary Limit. An Alert is generated when a new Metric is below a Lower Boundary Limit or above an Upper Boundary Limit.

When Continuous Benchmarking, that is benchmarking in CI, you will want to create Thresholds. Using the bencher run CLI subcommand, you already specify a Branch with the --branch option and a Testbed with the --testbed option. So the only other dimension you need to specify is a Measure, with the --threshold-measure option. Then you can use the --threshold-test option to specify the Test to use for that Measure. The --threshold-min-sample-size, --threshold-max-sample-size, and --threshold-window options allow you to control what data is used by the Test. Finally, the --threshold-lower-boundary and --threshold-upper-boundary options allow you to set the Lower Boundary and Upper Boundary. If you want to remove all Models that are not specified, you can do so with the --thresholds-reset flag.

  • If the Threshold does not exist, it will be created for you.
  • If the Threshold does exist and the specified Model is the same, then the Model is ignored.
  • If the Threshold does exist and the specified Model is different, then a new Model is created for the Threshold.
  • If a Threshold does exist and it is reset, then the current Model is removed from the Threshold.

For example, to only use a Threshold for the Latency Measure using a Student’s t-test Test with a maximum sample size of 64 and an Upper Boundary of 0.99, you could write something like this:

Terminal window
bencher run \
--project save-walter-white-1234abcd \
--branch main \
--testbed localhost \
--threshold-measure latency \
--threshold-test t_test \
--threshold-max-sample-size 64 \
--threshold-upper-boundary 0.99 \
--thresholds-reset \
--err \
--adapter json \
bencher mock

🐰 When working with feature branches, you may want to copy the existing Thresholds from the base Branch, that is the Start Point Branch. This is possible with the --start-point-clone-thresholds flag. Note that the --thresholds-reset flag will still remove any cloned Thresholds that are not explicitly specified.
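
For example, a sketch of a feature branch run that clones the Thresholds from its main Start Point might look like this (the feature-branch name here is hypothetical, and --start-point is the bencher run option that selects the Start Point Branch):

Terminal window
bencher run \
--project save-walter-white-1234abcd \
--branch feature-branch \
--start-point main \
--start-point-clone-thresholds \
--testbed localhost \
--err \
--adapter json \
bencher mock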

Multiple Thresholds

It is possible to create multiple Thresholds with the same bencher run invocation. When specifying multiple Thresholds, all of the same options must be used for each Threshold. To ignore an option for a specific Threshold, use an underscore (_).

For example, if you only want to use two Thresholds, one for the Latency Measure and one for the Throughput Measure, then you would likely want to set an Upper Boundary for the Latency Measure and a Lower Boundary for the Throughput Measure. Therefore, you would use --threshold-lower-boundary _ for the Latency Measure and --threshold-upper-boundary _ for the Throughput Measure. You could write something like this:

Terminal window
bencher run \
--project save-walter-white-1234abcd \
--branch main \
--testbed localhost \
--threshold-measure latency \
--threshold-test t_test \
--threshold-max-sample-size 64 \
--threshold-lower-boundary _ \
--threshold-upper-boundary 0.99 \
--threshold-measure throughput \
--threshold-test t_test \
--threshold-max-sample-size 64 \
--threshold-lower-boundary 0.99 \
--threshold-upper-boundary _ \
--thresholds-reset \
--err \
--adapter json \
bencher mock --measure latency --measure throughput

--threshold-measure <MEASURE>


Use the specified Measure name, slug, or UUID for a Threshold. If the value specified is a name or slug and the Measure does not already exist, it will be created for you. However, if the value specified is a UUID then the Measure must already exist.

For example, to use a Threshold for the Latency Measure, you could write --threshold-measure latency.

--threshold-test <TEST>


Use the specified Test to detect performance regressions.

There are several different Tests available:

  • Percentage (percentage)
  • z-score (z_score)
  • t-test (t_test)
  • Log Normal (log_normal)
  • Interquartile Range (iqr)
  • Delta Interquartile Range (delta_iqr)
  • Static (static)

For example, to use a Threshold with a Student’s t-test, you could write --threshold-test t_test.

Percentage

A Percentage Test (percentage) is the simplest statistical Test. If a new Metric is more than a certain percentage below the mean (Lower Boundary) or more than a certain percentage above the mean (Upper Boundary) of your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. Percentage Tests work best when the value of the Metric should stay within a known good range. The resulting Boundary Limits are shown as formulas after the list below.

  • Percentage Lower Boundary

    • A Percentage Test Lower Boundary can be any percentage greater than or equal to zero in decimal form (ex: use 0.10 for 10%). It is used when a smaller value would indicate a performance regression.
    • For example, if you had a Percentage Test with a Lower Boundary set to 0.10 and your historical Metrics had a mean of 100, the Lower Boundary Limit would be 90 and any value less than 90 would generate an Alert.
  • Percentage Upper Boundary

    • A Percentage Test Upper Boundary can be any percentage greater than or equal to zero in decimal form (ex: use 0.10 for 10%). It is used when a greater value would indicate a performance regression.
    • For example, if you had a Percentage Test with an Upper Boundary set to 0.10 and your historical Metrics had a mean of 100, the Upper Boundary Limit would be 110 and any value greater than 110 would generate an Alert.
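
That is, the Percentage Boundary Limits reduce to simple formulas over the mean ($\mu$) of your historical Metrics:

$$\text{Lower Boundary Limit} = \mu \times (1 - \text{Lower Boundary})$$
$$\text{Upper Boundary Limit} = \mu \times (1 + \text{Upper Boundary})$$

With $\mu = 100$ and a Boundary of 0.10, these give the Limits of 90 and 110 from the examples above.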

z-score

A z-score Test (z_score) measures the number of standard deviations (σ) a new Metric is from the mean of your historical Metrics using a z-score.

z-score Tests work best when:

  • There are no extreme differences between benchmark runs
  • Benchmark runs are totally independent of one another
  • The number of iterations for a single benchmark run is less than 10% of the historical Metrics
  • There are at least 30 historical Metrics (minimum Sample Size >= 30)

For z-score Tests, standard deviations are expressed as a decimal cumulative percentage. If a new Metric is below a certain left-side cumulative percentage (Lower Boundary) or above a certain right-side cumulative percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. The resulting Boundary Limits are derived in the formulas after the list below.

  • z-score Lower Boundary

    • A z-score Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
    • For example, if you used a z-score Test with a Lower Boundary of 0.977 and your historical Metrics had a mean of 100 and a standard deviation of 10, the Lower Boundary Limit would be 80.05 and any value less than 80.05 would generate an Alert.
  • z-score Upper Boundary

    • A z-score Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
    • For example, if you used a z-score Test with an Upper Boundary of 0.977 and your historical Metrics had a mean of 100 and a standard deviation of 10, the Upper Boundary Limit would be 119.95 and any value greater than 119.95 would generate an Alert.
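
Concretely, the z-score Boundary Limits follow from the inverse of the standard normal CDF ($\Phi^{-1}$) applied to the Boundary, where $\mu$ and $\sigma$ are the mean and standard deviation of your historical Metrics:

$$\text{Lower Boundary Limit} = \mu - \Phi^{-1}(\text{Lower Boundary}) \times \sigma$$
$$\text{Upper Boundary Limit} = \mu + \Phi^{-1}(\text{Upper Boundary}) \times \sigma$$

With a Boundary of 0.977, $\Phi^{-1}(0.977) \approx 1.995$, so $\mu = 100$ and $\sigma = 10$ yield the Limits of 80.05 and 119.95 from the examples above.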

t-test

A t-test Test (t_test) measures the confidence interval (CI) for how likely it is that a new Metric is above or below the mean of your historical Metrics using a Student’s t-test.

t-test Tests work best when:

  • There are no extreme differences between benchmark runs
  • Benchmark runs are totally independent of one another
  • The number of iterations for a single benchmark run is less than 10% of the historical Metrics

For t-test Tests, confidence intervals are expressed as a decimal confidence percentage. If a new Metric is below a certain left-side confidence percentage (Lower Boundary) or above a certain right-side confidence percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. The corresponding Limit formulas are sketched after the list below.

  • t-test Lower Boundary

    • A t-test Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
    • For example, if you used a t-test Test with a Lower Boundary of 0.977 and you had 25 historical Metrics with a mean of 100 and a standard deviation of 10, the Lower Boundary Limit would be 78.96 and any value less than 78.96 would generate an Alert.
  • t-test Upper Boundary

    • A t-test Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
    • For example, if you used a t-test Test with an Upper Boundary of 0.977 and you had 25 historical Metrics with a mean of 100 and a standard deviation of 10, the Upper Boundary Limit would be 121.04 and any value greater than 121.04 would generate an Alert.
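
The t-test Limits take the same shape, but based on the numbers in the examples above they appear to use the Student’s t-distribution quantile with $n - 1$ degrees of freedom in place of $\Phi^{-1}$, which widens the Limits for small sample sizes:

$$\text{Lower Boundary Limit} = \mu - t_{\text{Lower Boundary},\, n-1} \times \sigma$$
$$\text{Upper Boundary Limit} = \mu + t_{\text{Upper Boundary},\, n-1} \times \sigma$$

With $n = 25$ and a Boundary of 0.977, $t_{0.977,\, 24} \approx 2.104$, so $\mu = 100$ and $\sigma = 10$ yield the Limits of 78.96 and 121.04 from the examples above.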

Log Normal

A Log Normal Test (log_normal) measures how likely it is that a new Metric is above or below the center location of your historical Metrics using a Log Normal Distribution.

Log Normal Tests work best when:

  • Benchmark runs are totally independent of one another
  • The number of iterations for a single benchmark run is less than 10% of the historical Metrics
  • All data is positive (the natural log of a negative number is undefined)

For Log Normal Tests, the likelihood is expressed as a decimal percentage. If a new Metric is below a certain left-side percentage (Lower Boundary) or above a certain right-side percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.

  • Log Normal Lower Boundary

    • A Log Normal Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the center location and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
    • For example, if you used a Log Normal Test with a Lower Boundary of 0.977 and you had 25 historical Metrics centered around 100 and one previous outlier at 200, the Lower Boundary Limit would be 71.20 and any value less than 71.20 would generate an Alert.
  • Log Normal Upper Boundary

    • A Log Normal Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the center location and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
    • For example, if you used a Log Normal Test with an Upper Boundary of 0.977 and you had 25 historical Metrics centered around 100 and one previous outlier at 200, the Upper Boundary Limit would be 134.18 and any value greater than 134.18 would generate an Alert.

Interquartile Range

An Interquartile Range Test (iqr) measures how many multiples of the interquartile range (IQR) a new Metric is above or below the median of your historical Metrics. If a new Metric is more than a certain multiple of the IQR below the median (Lower Boundary) or more than a certain multiple of the IQR above the median (Upper Boundary) of your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. The Limit formulas are shown after the list below.

  • Interquartile Range Lower Boundary

    • An Interquartile Range Test Lower Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a smaller value would indicate a performance regression.
    • For example, if you had an Interquartile Range Test with a Lower Boundary set to 2.0 and your historical Metrics had a median of 100 and an interquartile range of 10, the Lower Boundary Limit would be 80 and any value less than 80 would generate an Alert.
  • Interquartile Range Upper Boundary

    • An Interquartile Range Test Upper Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a greater value would indicate a performance regression.
    • For example, if you had an Interquartile Range Test with an Upper Boundary set to 2.0 and your historical Metrics had a median of 100 and an interquartile range of 10, the Upper Boundary Limit would be 120 and any value greater than 120 would generate an Alert.
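
Equivalently, the Interquartile Range Boundary Limits are computed from the median and IQR of your historical Metrics:

$$\text{Lower Boundary Limit} = \text{median} - \text{Lower Boundary} \times \text{IQR}$$
$$\text{Upper Boundary Limit} = \text{median} + \text{Upper Boundary} \times \text{IQR}$$

With a median of 100, an IQR of 10, and a Boundary of 2.0, these give the Limits of 80 and 120 from the examples above.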

Delta Interquartile Range

A Delta Interquartile Range Test (delta_iqr) measures how many multiples of the average percentage change (Δ) interquartile range (IQR) a new Metric is above or below the median of your historical Metrics. If a new Metric is more than a certain multiple of the ΔIQR below the median (Lower Boundary) or more than a certain multiple of the ΔIQR above the median (Upper Boundary) of your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. The Limit formulas are sketched after the list below.

  • Delta Interquartile Range Lower Boundary

    • A Delta Interquartile Range Test Lower Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a smaller value would indicate a performance regression.
    • For example, if you had a Delta Interquartile Range Test with a Lower Boundary set to 2.0 and your historical Metrics had a median of 100, an interquartile range of 10, and an average delta interquartile range of 0.2 (20%), the Lower Boundary Limit would be 60 and any value less than 60 would generate an Alert.
  • Delta Interquartile Range Upper Boundary

    • A Delta Interquartile Range Test Upper Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a greater value would indicate a performance regression.
    • For example, if you had a Delta Interquartile Range Test with an Upper Boundary set to 2.0 and your historical Metrics had a median of 100, an interquartile range of 10, and an average delta interquartile range of 0.2 (20%), the Upper Boundary Limit would be 140 and any value greater than 140 would generate an Alert.
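
Consistent with the examples above, the Delta Interquartile Range Limits appear to scale the median by the Boundary multiplier times the average percentage change ($\Delta\text{IQR}$), rather than offsetting by the raw IQR:

$$\text{Lower Boundary Limit} = \text{median} \times (1 - \text{Lower Boundary} \times \Delta\text{IQR})$$
$$\text{Upper Boundary Limit} = \text{median} \times (1 + \text{Upper Boundary} \times \Delta\text{IQR})$$

With a median of 100, a $\Delta\text{IQR}$ of 0.2, and a Boundary of 2.0, these give $100 \times (1 - 0.4) = 60$ and $100 \times (1 + 0.4) = 140$, matching the examples above.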

Static

A Static Test (static) is the simplest Test. If a new Metric is below a set Lower Boundary or above a set Upper Boundary, an Alert is generated. That is, the Lower/Upper Boundary is an explicit Lower/Upper Boundary Limit. Either a Lower Boundary, Upper Boundary, or both must be set. Static Tests work best when the value of the Metric should stay within a constant range across all Benchmarks, such as code coverage. An example invocation is sketched after the list below.

🐰 If you want a different static Lower/Upper Boundary Limit for each Benchmark, then you should use a Percentage Test (percentage) with the Lower/Upper Boundary set to 0.0 and the Max Sample Size set to 2.

  • Static Lower Boundary

    • A Static Test Lower Boundary can be any floating point number. It is used when a smaller value would indicate a performance regression. The Lower Boundary must be less than or equal to the Upper Boundary, if both are specified.
    • For example, if you had a Static Test with a Lower Boundary set to 100, the Lower Boundary Limit would likewise be 100 and any value less than 100 would generate an Alert.
  • Static Upper Boundary

    • A Static Test Upper Boundary can be any floating point number. It is used when a greater value would indicate a performance regression. The Upper Boundary must be greater than or equal to the Lower Boundary, if both are specified.
    • For example, if you had a Static Test with an Upper Boundary set to 100, the Upper Boundary Limit would likewise be 100 and any value greater than 100 would generate an Alert.
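
For example, a sketch of a Static Threshold that generates an Alert whenever a hypothetical coverage Measure falls below 80 might look like this:

Terminal window
bencher run \
--project save-walter-white-1234abcd \
--branch main \
--testbed localhost \
--threshold-measure coverage \
--threshold-test static \
--threshold-lower-boundary 80 \
--err \
--adapter json \
bencher mock --measure coverage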

--threshold-min-sample-size <SAMPLE_SIZE>


Optionally specify the minimum number of Metrics required to run a Test. If this minimum is not met, the Test will not run. The specified sample size must be greater than or equal to 2. If the --threshold-max-sample-size option is also set, then the specified sample size must be less than or equal to --threshold-max-sample-size. This option cannot be used with the Static (static) Test.

For example, to use a Threshold with a minimum sample size of 10, you could write --threshold-min-sample-size 10. If there were fewer than 10 Metrics, the Test would not run. Conversely, if there were 10 or more Metrics, the Test would run.

--threshold-max-sample-size <SAMPLE_SIZE>


Optionally specify the maximum number of Metrics used to run a Test. If this maximum is exceeded, the oldest Metrics will be ignored. The specified sample size must be greater than or equal to 2. If the --threshold-min-sample-size option is also set, then the specified sample size must be greater than or equal to --threshold-min-sample-size. This option cannot be used with the Static (static) Test.

For example, to use a Threshold with a maximum sample size of 100, you could write --threshold-max-sample-size 100. If there were more than 100 Metrics, only the most recent 100 Metrics would be included. Conversely, if there were 100 or fewer Metrics, all of the Metrics would be included.

--threshold-window <WINDOW>


Optionally specify the window of time for Metrics used to perform the Test, in seconds. The specified window must be greater than 0. This option cannot be used with the Static (static) Test.

For example, to use a Threshold with a window of four weeks, or 2419200 seconds, you could write --threshold-window 2419200. If there were any Metrics older than four weeks, they would be excluded. Conversely, if all Metrics were from within the past four weeks, they would all be included.

--threshold-lower-boundary <BOUNDARY>


Specify the Lower Boundary. The constraints on the Lower Boundary depend on the Test used. Either the Lower Boundary, Upper Boundary, or both must be specified.

For details, see the documentation above for the specific Test you are using.

--threshold-upper-boundary <BOUNDARY>


Specify the Upper Boundary. The constraints on the Upper Boundary depend on the Test used. Either the Lower Boundary, Upper Boundary, or both must be specified.

For details, see the documentation above for the specific Test you are using.

--thresholds-reset


Reset all unspecified Thresholds for the given Branch and Testbed. If a Threshold already exists and is not specified, its current Model will be removed.

For example, if there were two Thresholds for the main Branch and localhost Testbed:

  • main Branch, localhost Testbed, latency Measure
  • main Branch, localhost Testbed, throughput Measure

If only the latency Measure is specified in the bencher run subcommand and the --thresholds-reset flag is used, then the Threshold for the throughput Measure would have its Model removed.

--err


Optionally error when an Alert is generated. An Alert is generated when a new Metric is below a Lower Boundary Limit or above an Upper Boundary Limit.

Suppressing Alerts

Sometimes it can be useful to suppress Alerts for a particular Benchmark. The best way to do this is by adding one of these special suffixes to that Benchmark’s name:

  • _bencher_ignore
  • BencherIgnore
  • -bencher-ignore

For example, if your Benchmark was named my_flaky_benchmark, then renaming it to my_flaky_benchmark_bencher_ignore would ignore just that particular Benchmark going forward. Ignored Benchmarks are still checked against existing Thresholds, and their Metrics are still stored; however, no Alert will be generated for them. The results from my_flaky_benchmark_bencher_ignore would still be stored as the Benchmark my_flaky_benchmark. If you remove the suffix and return to the original Benchmark name, then things will pick right back up where you left off.



🐰 Congrats! You have learned all about Thresholds & Alerts! 🎉


Keep Going: Benchmark Harness Adapters ➡



Published: Sat, August 12, 2023 at 4:07:00 PM UTC | Last Updated: Sat, October 19, 2024 at 1:27:00 PM UTC