Thresholds & Alerts
Thresholds are how you catch performance regressions with Bencher. A Threshold is assigned to a unique combination of: Branch, Testbed, and Measure. A Threshold uses a specific Test to detect performance regressions. The combination of a Test and its parameters is called a Model. A Model must have a Lower Boundary, Upper Boundary, or both.
- Lower Boundary
  - A Lower Boundary is used when a smaller value would indicate a performance regression, such as with the Throughput Measure.
- Upper Boundary
  - An Upper Boundary is used when a larger value would indicate a performance regression, such as with the Latency Measure.
Each Boundary is used to calculate a Boundary Limit. Then every new Metric is checked against each Boundary Limit. An Alert is generated when a new Metric is below a Lower Boundary Limit or above an Upper Boundary Limit.
When running Continuous Benchmarking, that is, benchmarking in CI, you will want to create Thresholds.
Using the bencher run CLI subcommand, you already specify a Branch with the --branch option and a Testbed with the --testbed option. So the only other dimension you need to specify is a Measure, with the --threshold-measure option. Then you can use the --threshold-test option to specify the Test to use for that Measure. The --threshold-min-sample-size, --threshold-max-sample-size, and --threshold-window options allow you to control what data is used by the Test. Finally, the --threshold-lower-boundary and --threshold-upper-boundary options allow you to set the Lower Boundary and Upper Boundary. If you want to remove all Models that are not specified, you can do so with the --thresholds-reset flag.
- If the Threshold does not exist, it will be created for you.
- If the Threshold does exist and the specified Model is the same, then the Model is ignored.
- If the Threshold does exist and the specified Model is different, then a new Model is created for the Threshold.
- If a Threshold does exist and it is reset, then its current Model is removed from the Threshold.
For example, to only use a Threshold for the Latency Measure using a Student’s t-test Test with a maximum sample size of 64 and an Upper Boundary of 0.99, you could write something like this:
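A sketch of that invocation, combining the flags described above (the branch, testbed, and benchmark command shown here are placeholders):

```shell
# Hypothetical example: branch, testbed, and benchmark command are placeholders.
bencher run \
  --branch main \
  --testbed localhost \
  --threshold-measure latency \
  --threshold-test t_test \
  --threshold-max-sample-size 64 \
  --threshold-upper-boundary 0.99 \
  --thresholds-reset \
  "make benchmarks"
```

The --thresholds-reset flag makes this the *only* Threshold for the Branch and Testbed by removing the Models of any Thresholds not respecified here.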
🐰 When working with feature branches, you may want to copy the existing Thresholds from the base, Start Point Branch. This is possible with the --start-point-clone-thresholds flag. Note that the --thresholds-reset flag will still remove any cloned Thresholds that are not explicitly specified.
Multiple Thresholds
It is possible to create multiple Thresholds with the same bencher run invocation. When specifying multiple Thresholds, all of the same options must be used for each Threshold. To ignore an option for a specific Threshold, use an underscore (_).
For example, if you only want to use two Thresholds, one for the Latency Measure and one for the Throughput Measure, then you would likely want to set an Upper Boundary for the Latency Measure and a Lower Boundary for the Throughput Measure. Therefore, you would use --threshold-lower-boundary _ for the Latency Measure and --threshold-upper-boundary _ for the Throughput Measure. You could write something like this:
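A sketch of a two-Threshold invocation, with each option repeated once per Threshold and an underscore used to skip the unwanted Boundary (branch, testbed, and benchmark command are placeholders):

```shell
# Hypothetical example: two Thresholds in one run.
# Latency gets only an Upper Boundary; Throughput gets only a Lower Boundary.
bencher run \
  --branch main \
  --testbed localhost \
  --threshold-measure latency \
  --threshold-test t_test \
  --threshold-upper-boundary 0.99 \
  --threshold-lower-boundary _ \
  --threshold-measure throughput \
  --threshold-test t_test \
  --threshold-lower-boundary 0.99 \
  --threshold-upper-boundary _ \
  "make benchmarks"
```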
--threshold-measure <MEASURE>
Use the specified Measure name, slug, or UUID for a Threshold. If the value specified is a name or slug and the Measure does not already exist, it will be created for you. However, if the value specified is a UUID then the Measure must already exist.
For example, to use a Threshold for the Latency Measure, you could write --threshold-measure latency.
--threshold-test <TEST>
Use the specified Test to detect performance regressions.
There are several different Tests available:
- Percentage (percentage)
- z-score (z_score)
- t-test (t_test)
- Log Normal (log_normal)
- Interquartile Range (iqr)
- Delta Interquartile Range (delta_iqr)
- Static (static)
For example, to use a Threshold with a Student’s t-test, you could write --threshold-test t_test.
Percentage
A Percentage Test (percentage) is the simplest statistical Test. If a new Metric is below a certain percentage of the mean of your historical Metrics (Lower Boundary) or above a certain percentage of the mean (Upper Boundary), an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set. Percentage Tests work best when the value of the Metric should stay within a known good range.
- Percentage Lower Boundary
  - A Percentage Test Lower Boundary can be any percentage greater than or equal to zero in decimal form (ex: use 0.10 for 10%). It is used when a smaller value would indicate a performance regression.
  - For example, if you had a Percentage Test with a Lower Boundary set to 0.10 and your historical Metrics had a mean of 100, the Lower Boundary Limit would be 90, and any value less than 90 would generate an Alert.
- Percentage Upper Boundary
  - A Percentage Test Upper Boundary can be any percentage greater than or equal to zero in decimal form (ex: use 0.10 for 10%). It is used when a greater value would indicate a performance regression.
  - For example, if you had a Percentage Test with an Upper Boundary set to 0.10 and your historical Metrics had a mean of 100, the Upper Boundary Limit would be 110, and any value greater than 110 would generate an Alert.
z-score
A z-score Test (z_score) measures the number of standard deviations (σ) a new Metric is from the mean of your historical Metrics using a z-score.
z-score Tests work best when:
- There are no extreme differences between benchmark runs
- Benchmark runs are totally independent of one another
- The number of iterations for a single benchmark run is less than 10% of the historical Metrics
- There are at least 30 historical Metrics (minimum Sample Size >= 30)
For z-score Tests, standard deviations are expressed as a decimal cumulative percentage. If a new Metric is below a certain left-side cumulative percentage (Lower Boundary) or above a certain right-side cumulative percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.
- z-score Lower Boundary
  - A z-score Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
  - For example, if you used a z-score Test with a Lower Boundary of 0.977 and your historical Metrics had a mean of 100 and a standard deviation of 10, the Lower Boundary Limit would be 80.05, and any value less than 80.05 would generate an Alert.
- z-score Upper Boundary
  - A z-score Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
  - For example, if you used a z-score Test with an Upper Boundary of 0.977 and your historical Metrics had a mean of 100 and a standard deviation of 10, the Upper Boundary Limit would be 119.95, and any value greater than 119.95 would generate an Alert.
t-test
A t-test Test (t_test) measures the confidence interval (CI) for how likely it is that a new Metric is above or below the mean of your historical Metrics using a Student’s t-test.
t-test Tests work best when:
- There are no extreme differences between benchmark runs
- Benchmark runs are totally independent of one another
- The number of iterations for a single benchmark run is less than 10% of the historical Metrics
For t-test Tests, confidence intervals are expressed as a decimal confidence percentage. If a new Metric is below a certain left-side confidence percentage (Lower Boundary) or above a certain right-side confidence percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.
- t-test Lower Boundary
  - A t-test Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
  - For example, if you used a t-test Test with a Lower Boundary of 0.977 and you had 25 historical Metrics with a mean of 100 and a standard deviation of 10, the Lower Boundary Limit would be 78.96, and any value less than 78.96 would generate an Alert.
- t-test Upper Boundary
  - A t-test Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the mean and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
  - For example, if you used a t-test Test with an Upper Boundary of 0.977 and you had 25 historical Metrics with a mean of 100 and a standard deviation of 10, the Upper Boundary Limit would be 121.04, and any value greater than 121.04 would generate an Alert.
Log Normal
A Log Normal Test (log_normal) measures how likely it is that a new Metric is above or below the center location of your historical Metrics using a Log Normal Distribution.
Log Normal Tests work best when:
- Benchmark runs are totally independent of one another
- The number of iterations for a single benchmark run is less than 10% of the historical Metrics
- All data is positive (the natural log of a negative number is undefined)
For Log Normal Tests, the likelihood is expressed as a decimal percentage. If a new Metric is below a certain left-side percentage (Lower Boundary) or above a certain right-side percentage (Upper Boundary) for your historical Metrics, an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.
- Log Normal Lower Boundary
  - A Log Normal Test Lower Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the center location and 1.0 represents all possible left-side values (-∞). It is used when a smaller value would indicate a performance regression.
  - For example, if you used a Log Normal Test with a Lower Boundary of 0.977 and you had 25 historical Metrics centered around 100 and one previous outlier at 200, the Lower Boundary Limit would be 71.20, and any value less than 71.20 would generate an Alert.
- Log Normal Upper Boundary
  - A Log Normal Test Upper Boundary can be any positive decimal between 0.5 and 1.0, where 0.5 represents the center location and 1.0 represents all possible right-side values (∞). It is used when a greater value would indicate a performance regression.
  - For example, if you used a Log Normal Test with an Upper Boundary of 0.977 and you had 25 historical Metrics centered around 100 and one previous outlier at 200, the Upper Boundary Limit would be 134.18, and any value greater than 134.18 would generate an Alert.
Interquartile Range
An Interquartile Range Test (iqr) measures how many multiples of the interquartile range (IQR) a new Metric is above or below the median of your historical Metrics. If a new Metric is below a certain multiple of the IQR from the median (Lower Boundary) or above a certain multiple of the IQR from the median (Upper Boundary), an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.
- Interquartile Range Lower Boundary
  - An Interquartile Range Test Lower Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a smaller value would indicate a performance regression.
  - For example, if you had an Interquartile Range Test with a Lower Boundary set to 2.0 and your historical Metrics had a median of 100 and an interquartile range of 10, the Lower Boundary Limit would be 80, and any value less than 80 would generate an Alert.
- Interquartile Range Upper Boundary
  - An Interquartile Range Test Upper Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a greater value would indicate a performance regression.
  - For example, if you had an Interquartile Range Test with an Upper Boundary set to 2.0 and your historical Metrics had a median of 100 and an interquartile range of 10, the Upper Boundary Limit would be 120, and any value greater than 120 would generate an Alert.
Delta Interquartile Range
A Delta Interquartile Range Test (delta_iqr) measures how many multiples of the average percentage change (Δ) interquartile range (IQR) a new Metric is above or below the median of your historical Metrics. If a new Metric is below a certain multiple of the ΔIQR from the median (Lower Boundary) or above a certain multiple of the ΔIQR from the median (Upper Boundary), an Alert is generated. Either a Lower Boundary, Upper Boundary, or both must be set.
- Delta Interquartile Range Lower Boundary
  - A Delta Interquartile Range Test Lower Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a smaller value would indicate a performance regression.
  - For example, if you had a Delta Interquartile Range Test with a Lower Boundary set to 2.0 and your historical Metrics had a median of 100, an interquartile range of 10, and an average delta interquartile range of 0.2 (20%), the Lower Boundary Limit would be 60, and any value less than 60 would generate an Alert.
- Delta Interquartile Range Upper Boundary
  - A Delta Interquartile Range Test Upper Boundary can be any multiplier greater than or equal to zero (ex: use 2.0 for 2x). It is used when a greater value would indicate a performance regression.
  - For example, if you had a Delta Interquartile Range Test with an Upper Boundary set to 2.0 and your historical Metrics had a median of 100, an interquartile range of 10, and an average delta interquartile range of 0.2 (20%), the Upper Boundary Limit would be 140, and any value greater than 140 would generate an Alert.
Static
A Static Test (static) is the simplest Test. If a new Metric is below a set Lower Boundary or above a set Upper Boundary, an Alert is generated. That is, the Lower/Upper Boundary is an explicit Lower/Upper Boundary Limit. Either a Lower Boundary, Upper Boundary, or both must be set. Static Tests work best when the value of the Metric should stay within a constant range across all Benchmarks, such as code coverage.
🐰 If you want a different static Lower/Upper Boundary Limit for each Benchmark, then you should use a Percentage Test (percentage) with the Lower/Upper Boundary set to 0.0 and the Max Sample Size set to 2.
- Static Lower Boundary
  - A Static Test Lower Boundary can be any floating point number. It is used when a smaller value would indicate a performance regression. The Lower Boundary must be less than or equal to the Upper Boundary, if both are specified.
  - For example, if you had a Static Test with a Lower Boundary set to 100, the Lower Boundary Limit would likewise be 100, and any value less than 100 would generate an Alert.
- Static Upper Boundary
  - A Static Test Upper Boundary can be any floating point number. It is used when a greater value would indicate a performance regression. The Upper Boundary must be greater than or equal to the Lower Boundary, if both are specified.
  - For example, if you had a Static Test with an Upper Boundary set to 100, the Upper Boundary Limit would likewise be 100, and any value greater than 100 would generate an Alert.
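The per-Benchmark tip above could be sketched like this, assuming a hypothetical coverage Measure (branch, testbed, measure name, and benchmark command are all placeholders):

```shell
# Hypothetical example: a per-Benchmark "static" limit via a Percentage Test.
# With boundaries of 0.0 and a Max Sample Size of 2, each Benchmark's limit
# tracks its own most recent historical Metric exactly.
bencher run \
  --branch main \
  --testbed localhost \
  --threshold-measure coverage \
  --threshold-test percentage \
  --threshold-max-sample-size 2 \
  --threshold-lower-boundary 0.0 \
  --threshold-upper-boundary _ \
  "make coverage"
```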
--threshold-min-sample-size <SAMPLE_SIZE>
Optionally specify the minimum number of Metrics required to run a Test. If this minimum is not met, the Test will not run. The specified sample size must be greater than or equal to 2. If the --threshold-max-sample-size option is also set, then the specified sample size must be less than or equal to --threshold-max-sample-size. This option cannot be used with the Static (static) Test.
For example, to use a Threshold with a minimum sample size of 10, you could write --threshold-min-sample-size 10. If there were fewer than 10 Metrics, the Test would not run. Conversely, if there were 10 or more Metrics, the Test would run.
--threshold-max-sample-size <SAMPLE_SIZE>
Optionally specify the maximum number of Metrics used to run a Test. If this maximum is exceeded, the oldest Metrics will be ignored. The specified sample size must be greater than or equal to 2. If the --threshold-min-sample-size option is also set, then the specified sample size must be greater than or equal to --threshold-min-sample-size. This option cannot be used with the Static (static) Test.
For example, to use a Threshold with a maximum sample size of 100, you could write --threshold-max-sample-size 100. If there were more than 100 Metrics, only the most recent 100 Metrics would be included. Conversely, if there were 100 or fewer Metrics, all of the Metrics would be included.
--threshold-window <WINDOW>
Optionally specify the window of time for Metrics used to perform the Test, in seconds. The specified window must be greater than 0. This option cannot be used with the Static (static) Test.
For example, to use a Threshold with a window of four weeks, or 2419200 seconds, you could write --threshold-window 2419200. Any Metrics older than four weeks would be excluded. Conversely, if all Metrics are from within the past four weeks, they would all be included.
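The three data-control options above can be combined in a single Threshold. A sketch, pairing a z-score Test with its recommended minimum of 30 historical Metrics (branch, testbed, and benchmark command are placeholders):

```shell
# Hypothetical example: bound the Test's data by sample size and by time.
bencher run \
  --branch main \
  --testbed localhost \
  --threshold-measure latency \
  --threshold-test z_score \
  --threshold-min-sample-size 30 \
  --threshold-max-sample-size 100 \
  --threshold-window 2419200 \
  --threshold-upper-boundary 0.977 \
  "make benchmarks"
```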
--threshold-lower-boundary <BOUNDARY>
Specify the Lower Boundary. The constraints on the Lower Boundary depend on the Test used. Either the Lower Boundary, Upper Boundary, or both must be specified.
For details, see the documentation for the specific Test you are using:
- Percentage Lower Boundary
- z-score Lower Boundary
- t-test Lower Boundary
- Log Normal Lower Boundary
- Interquartile Range Lower Boundary
- Delta Interquartile Range Lower Boundary
- Static Lower Boundary
--threshold-upper-boundary <BOUNDARY>
Specify the Upper Boundary. The constraints on the Upper Boundary depend on the Test used. Either the Lower Boundary, Upper Boundary, or both must be specified.
For details, see the documentation for the specific Test you are using:
- Percentage Upper Boundary
- z-score Upper Boundary
- t-test Upper Boundary
- Log Normal Upper Boundary
- Interquartile Range Upper Boundary
- Delta Interquartile Range Upper Boundary
- Static Upper Boundary
--thresholds-reset
Reset all unspecified Thresholds for the given Branch and Testbed. If a Threshold already exists and is not specified, its current Model will be removed.
For example, suppose there were two Thresholds for the main Branch and localhost Testbed: one for the latency Measure and one for the throughput Measure. If only the latency Measure is specified in the bencher run subcommand and --thresholds-reset is used, then the throughput Threshold would have its Model removed.
--err
Optionally error when an Alert is generated. An Alert is generated when a new Metric is below a Lower Boundary Limit or above an Upper Boundary Limit.
Suppressing Alerts
Sometimes it can be useful to suppress Alerts for a particular Benchmark. The best way to do this is by adding one of these special suffixes to that Benchmark’s name:
- _bencher_ignore
- BencherIgnore
- -bencher-ignore
For example, if your Benchmark was named my_flaky_benchmark, then renaming it to my_flaky_benchmark_bencher_ignore would ignore just that particular Benchmark going forward.
Ignored Benchmarks do get checked against existing Thresholds; however, an Alert will not be generated for them. The Metrics for ignored Benchmarks are still stored: the results from my_flaky_benchmark_bencher_ignore would still be stored as the Benchmark my_flaky_benchmark. If you remove the suffix and return to the original Benchmark name, then things will pick right back up where you left off.
🐰 Congrats! You have learned all about Thresholds & Alerts! 🎉