Thresholds & Alerts
Thresholds can be created for the unique combination of a Metric Kind, Branch, and Testbed. They are statistical significance tests that use either a Z-score or a Student's t-test to detect performance regressions and generate Alerts. When a Metric is below a Threshold's Lower Boundary Limit or above a Threshold's Upper Boundary Limit, an Alert is generated for that Metric.
Thresholds work best when:
- There are no extreme differences between benchmark runs
- Benchmark runs are totally independent of one another
- The number of iterations for a single benchmark run is less than 10% of the historical Metrics
If there are fewer than 30 historical Metrics for the combination of Metric Kind, Branch, and Testbed, then a Student's t-test Threshold should be used rather than a Z-score Threshold.
🐰 Don’t Panic! This will all make sense in a minute.
Statistical Significance Test
Z-score
The Z-score measures the number of standard deviations (σ) a given Metric is above or below the mean of the historical Metrics. The standard deviation (σ) can also be expressed as a lower boundary or upper boundary cumulative percentage.
For example, two standard deviations above the mean (2σ) is the same as an upper boundary cumulative percentage of 97.7%.
When creating Z-score Thresholds, the decimal notation of the cumulative percentage is used.
In this example, the upper boundary cumulative percentage of 97.7% would be an Upper Boundary of 0.977.
In practice, a Threshold like this would be useful for the Latency Metric Kind. That is, a larger value would indicate a performance regression. When a smaller value would indicate a performance regression, such as with the Throughput Metric Kind, a lower boundary cumulative percentage should be used. A lower boundary cumulative percentage of 97.7% would correspond to two standard deviations below the mean (-2σ). This would be given in decimal notation as a Lower Boundary of 0.977.
🐰 Tip: When using a Z-score Threshold, set the Minimum Sample Size to at least 30.
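For a concrete sense of the conversion between standard deviations and cumulative percentages, here is a minimal Python sketch using scipy. This is purely illustrative of the math; Bencher performs these calculations internally.

```python
from scipy.stats import norm

# Two standard deviations above the mean (2σ) as a cumulative percentage
print(norm.cdf(2.0))    # ~0.9772, i.e. ~97.7%

# And back: a cumulative percentage of 0.977 as standard deviations
print(norm.ppf(0.977))  # ~1.995, i.e. ~2σ
```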
Student’s t-test
The Student’s t-test measures how likely it is that a given Metric is above or below the mean of the historical Metrics. This likelihood is called a confidence interval (CI). The confidence interval (CI) is expressed as a lower boundary or upper boundary confidence percentage.
For example, an upper boundary confidence percentage of 95.0% indicates that 95.0% of Metrics should be less than an expected maximum.
When creating t-test Thresholds, the decimal notation of the confidence percentage is used.
In this example, the upper boundary confidence percentage of 95.0% would be an Upper Boundary of 0.95.
In practice, a Threshold like this would be useful for the Latency Metric Kind. That is, a larger value would indicate a performance regression. When a smaller value would indicate a performance regression, such as with the Throughput Metric Kind, a lower boundary confidence percentage should be used. A lower boundary confidence percentage of 95.0% would indicate that Metrics should be greater than an expected minimum. This would be given in decimal notation as a Lower Boundary of 0.95.
🐰 Tip: Use a t-test Threshold if you have fewer than 30 historical Metrics.
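As an illustration, here is a minimal Python sketch of a t-test style expected maximum using scipy. The n - 1 degrees of freedom are the conventional choice for a sample statistic and are an assumption here, not a statement of how Bencher parameterizes its test internally.

```python
from scipy.stats import t

n = 10                      # number of historical Metrics (small sample)
mean, stddev = 100.0, 10.0  # sample statistics of the historical Metrics

# Conventional Student's t quantile with n - 1 degrees of freedom
# (whether Bencher parameterizes it exactly this way is an assumption)
t_score = t.ppf(0.95, df=n - 1)  # ~1.833 for df = 9
print(mean + t_score * stddev)   # ~118.33, an expected maximum
```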
Statistical Significance Boundary
The meaning of the statistical significance boundary depends on the statistical significance test:
- Z-score: Standard deviation (σ) expressed as a decimal cumulative percentage
- t-test: Confidence interval (CI) expressed as a decimal confidence percentage
Each Metric is checked against the Threshold's statistical significance boundary, if one exists. This can include a Lower Boundary, an Upper Boundary, or both. A Boundary Limit is calculated for each Boundary and compared against the current Metric. If the Metric falls outside of a Boundary Limit, an Alert will be generated.
🐰 Tip: To fail a CI build when a boundary is violated, use the `--err` flag for the `bencher run` CLI subcommand.
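To make the boundary check above concrete, here is a hypothetical Python sketch of the Lower/Upper Boundary logic for a Z-score Threshold. The function name and shape are invented for illustration and are not Bencher's API.

```python
from scipy.stats import norm

def metric_alerts(metric, mean, stddev, lower=None, upper=None):
    """Hypothetical sketch: compute a Boundary Limit for each configured
    Boundary and report whether the Metric falls outside of it."""
    if lower is not None and metric < mean - norm.ppf(lower) * stddev:
        return True  # below the Lower Boundary Limit
    if upper is not None and metric > mean + norm.ppf(upper) * stddev:
        return True  # above the Upper Boundary Limit
    return False

# A Metric of 125 against mean 100, stddev 10, Upper Boundary 0.977
print(metric_alerts(125, 100, 10, upper=0.977))  # True: 125 > ~119.95
```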
Lower Boundary
A lower boundary can be set for a Threshold. It is used when a smaller value would indicate a performance regression, such as with the Throughput Metric Kind. The value must be a decimal between 0.5 and 1.0.
For example, if you used a Z-score and your historical Metrics had a mean of 100 and a standard deviation of 10, then a Lower Boundary of 0.977 would create a Lower Boundary Limit at 80.05. Any value less than 80.05 would generate an Alert.
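You can verify this arithmetic with a couple of lines of Python, assuming a standard normal quantile for the cumulative percentage:

```python
from scipy.stats import norm

mean, stddev = 100.0, 10.0
z = norm.ppf(0.977)       # ~1.995 standard deviations
print(mean - z * stddev)  # ~80.05, the Lower Boundary Limit
```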
Upper Boundary
An upper boundary can be set for a Threshold. It is used when a larger value would indicate a performance regression, such as with the Latency Metric Kind. The value must be a decimal between 0.5 and 1.0.
For example, if you used a Z-score and your historical Metrics had a mean of 100 and a standard deviation of 10, then an Upper Boundary of 0.977 would create an Upper Boundary Limit at 119.95. Any value greater than 119.95 would generate an Alert.
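The same verification works for the upper side, under the same standard normal assumption:

```python
from scipy.stats import norm

mean, stddev = 100.0, 10.0
z = norm.ppf(0.977)       # ~1.995 standard deviations
print(mean + z * stddev)  # ~119.95, the Upper Boundary Limit
```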
Sample Size
Minimum Sample Size
A minimum sample size can be set for a Threshold. The Threshold will only run its statistical significance test if the number of historical Metrics is greater than or equal to the minimum sample size.
Maximum Sample Size
A maximum sample size can be set for a Threshold. The Threshold will only use the most recent historical Metrics, capped at the maximum sample size, for its statistical significance test.
Window Size
A window size in seconds can be set for a Threshold. The Threshold will only use historical Metrics from within the given time window for its statistical significance test.
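Taken together, these settings narrow which historical Metrics feed the statistical significance test. Here is a hypothetical Python sketch of how they might compose; the function and the order of application are assumptions for illustration, not Bencher's actual implementation.

```python
import time

def select_history(metrics, min_n=None, max_n=None, window_secs=None):
    """Hypothetical sketch: metrics is a list of (unix_timestamp, value)
    pairs, oldest first. Apply the Window Size, then the Maximum Sample
    Size, then enforce the Minimum Sample Size."""
    if window_secs is not None:
        cutoff = time.time() - window_secs
        metrics = [m for m in metrics if m[0] >= cutoff]
    if max_n is not None:
        metrics = metrics[-max_n:]   # keep only the most recent Metrics
    if min_n is not None and len(metrics) < min_n:
        return None                  # too few samples: skip the test
    return metrics
```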
Alerts
Alerts are generated when a Metric is below a Threshold’s Lower Boundary Limit or above a Threshold’s Upper Boundary Limit.
To fail a CI build in the event of an Alert, set the `--err` flag when using the `bencher run` CLI subcommand.
Suppressing Alerts
Sometimes it can be useful to suppress Alerts for a particular Benchmark. The best way to do this is by adding one of these special suffixes to that Benchmark’s name:
- `_bencher_ignore`
- `BencherIgnore`
- `-bencher-ignore`
For example, if your Benchmark was named `my_flaky_benchmark`, then renaming it to `my_flaky_benchmark_bencher_ignore` would ignore just that particular Benchmark going forward.
Ignored Benchmarks do not get checked against the Threshold even if one exists. However, the metrics for ignored Benchmarks are still stored. Continuing with our example, the results from `my_flaky_benchmark_bencher_ignore` would still be stored in the database under `my_flaky_benchmark`. If you remove the suffix and return to the original Benchmark name, things will pick right back up where you left off.
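For reference, the suffix matching described above could look like this hypothetical Python sketch; the helper name is invented for illustration only.

```python
# The three special ignore suffixes named above
IGNORE_SUFFIXES = ("_bencher_ignore", "BencherIgnore", "-bencher-ignore")

def base_name(benchmark):
    """Hypothetical helper: strip an ignore suffix to recover the name
    the Metrics are stored under."""
    for suffix in IGNORE_SUFFIXES:
        if benchmark.endswith(suffix):
            return benchmark[: -len(suffix)]
    return benchmark

print(base_name("my_flaky_benchmark_bencher_ignore"))  # my_flaky_benchmark
```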
🐰 Congrats! You have learned all about Thresholds & Alerts! 🎉