Thresholds can be created for the unique combination of a Metric Kind, Branch, and Testbed. They are statistical significance tests that use either a Z-score or a Student’s t-test to detect performance regressions and generate Alerts. When a Metric is below a Threshold’s Lower Boundary Limit or above a Threshold’s Upper Boundary Limit, an Alert is generated for that Metric.

Thresholds work best when:

• There are no extreme differences between benchmark runs
• Benchmark runs are totally independent of one another
• The number of iterations for a single benchmark run is less than 10% of the historical Metrics

If there are fewer than 30 historical Metrics for the combination of Metric Kind, Branch, and Testbed, then a Student’s t-test Threshold should be used rather than a Z-score Threshold.

## Statistical Significance Test

### Z-score

The Z-score measures the number of standard deviations (σ) a given Metric is above or below the mean of the historical Metrics. The standard deviation (σ) can also be expressed as a lower boundary or upper boundary cumulative percentage.

For example, two standard deviations above the mean (2σ) is the same as an upper boundary cumulative percentage of 97.7%. When creating Z-score Thresholds, the decimal notation of the cumulative percentage is used. In this example, the upper boundary cumulative percentage of 97.7% would be an Upper Boundary of `0.977`. In practice, a Threshold like this would be useful for the Latency Metric Kind. That is, a larger value would indicate a performance regression.

When a smaller value would indicate a performance regression, such as with the Throughput Metric Kind, a lower boundary cumulative percentage should be used. A lower boundary cumulative percentage of 97.7% would correspond to two standard deviations below the mean (-2σ). This would be given in decimal notation as a Lower Boundary of `0.977`.
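To make this mapping concrete, here is a minimal sketch using SciPy’s standard normal distribution (SciPy is assumed here purely for illustration; it is not how Bencher computes this internally):

```python
# Mapping between standard deviations (sigma) and cumulative percentages
# for the standard normal distribution.
from scipy.stats import norm

# Two standard deviations above the mean (2 sigma) corresponds to an
# upper boundary cumulative percentage of ~97.7%.
print(norm.cdf(2.0))     # ~0.9772

# Going the other way: a cumulative percentage of 0.977 corresponds to
# just under two standard deviations.
print(norm.ppf(0.977))   # ~1.9954

# By symmetry, a Lower Boundary of 0.977 sits ~2 sigma below the mean.
print(-norm.ppf(0.977))  # ~-1.9954
```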

🐰 Tip: When using a Z-score Threshold, set the Minimum Sample Size to at least 30.

### Student’s t-test

The Student’s t-test measures how likely it is that a given Metric is above or below the mean of the historical Metrics. This likelihood is called a confidence interval (CI). The confidence interval (CI) is expressed as a lower boundary or upper boundary confidence percentage.

For example, an upper boundary confidence percentage of 95.0% indicates that 95.0% of Metrics should be less than an expected maximum. When creating t-test Thresholds, the decimal notation of the confidence percentage is used. In this example, the upper boundary confidence percentage of 95.0% would be an Upper Boundary of `0.95`. In practice, a Threshold like this would be useful for the Latency Metric Kind. That is, a larger value would indicate a performance regression.

When a smaller value would indicate a performance regression, such as with the Throughput Metric Kind, a lower boundary confidence percentage should be used. A lower boundary confidence percentage of 95.0% would indicate that Metrics should be greater than an expected minimum. This would be given in decimal notation as a Lower Boundary of `0.95`.
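As a rough illustration, one-sided limits like these can be computed from the Student’s t-distribution. The `history` values and the exact limit formula below are assumptions for the sketch, not Bencher’s internal implementation:

```python
# A sketch of one-sided confidence limits built from the Student's
# t-distribution with n - 1 degrees of freedom.
from scipy.stats import t
import statistics

history = [98.0, 102.0, 99.0, 101.0, 100.0]  # hypothetical historical Metrics
n = len(history)
mean = statistics.mean(history)
stddev = statistics.stdev(history)  # sample standard deviation

# An Upper Boundary of 0.95 (a 95.0% confidence percentage): Metrics
# above this limit would generate an Alert.
upper_limit = mean + t.ppf(0.95, df=n - 1) * stddev
print(upper_limit)

# The mirror image: a Lower Boundary of 0.95, for Metric Kinds like
# Throughput where smaller values indicate a regression.
lower_limit = mean - t.ppf(0.95, df=n - 1) * stddev
print(lower_limit)
```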

🐰 Tip: Use a t-test Threshold if you have fewer than 30 historical Metrics.

## Statistical Significance Boundary

The meaning of the statistical significance boundary depends on the statistical significance test:

• Z-score: Standard deviation (σ) expressed as a decimal cumulative percentage
• t-test: Confidence interval (CI) expressed as a decimal confidence percentage

Each Metric is checked against the Threshold’s statistical significance boundary, if one exists. This can include a Lower Boundary, an Upper Boundary, or both. A Boundary Limit is calculated for each Boundary and then compared against the current Metric. If the Metric falls outside of a Boundary Limit, an Alert is generated.
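As a hedged sketch of that flow, the check might look like the function below. The names and the Z-score limit formula are illustrative assumptions, not Bencher’s internals:

```python
# Check a single Metric against optional Lower and Upper Boundaries.
from scipy.stats import norm
import statistics

def check_metric(metric, history, lower_boundary=None, upper_boundary=None):
    """Return Alert messages for `metric` given historical values."""
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)
    alerts = []
    if lower_boundary is not None:
        # Boundary Limit for the Lower Boundary.
        lower_limit = mean - norm.ppf(lower_boundary) * stddev
        if metric < lower_limit:
            alerts.append(f"Alert: {metric} < Lower Boundary Limit {lower_limit:.2f}")
    if upper_boundary is not None:
        # Boundary Limit for the Upper Boundary.
        upper_limit = mean + norm.ppf(upper_boundary) * stddev
        if metric > upper_limit:
            alerts.append(f"Alert: {metric} > Upper Boundary Limit {upper_limit:.2f}")
    return alerts
```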

🐰 Tip: To fail a CI build when a boundary is violated, use the `--err` flag for the `bencher run` CLI subcommand.

### Lower Boundary

A lower boundary can be set for a Threshold. It is used when a smaller value would indicate a performance regression, such as with the Throughput Metric Kind. The value must be a decimal between `0.5` and `1.0`.

For example, if you used a Z-score and your historical Metrics had a mean of `100` and a standard deviation of `10`, then a Lower Boundary of `0.977` would create a Lower Boundary Limit at `80.05`. Any value less than `80.05` would generate an Alert.

### Upper Boundary

An upper boundary can be set for a Threshold. It is used when a larger value would indicate a performance regression, such as with the Latency Metric Kind. The value must be a decimal between `0.5` and `1.0`.

For example, if you used a Z-score and your historical Metrics had a mean of `100` and a standard deviation of `10`, then an Upper Boundary of `0.977` would create an Upper Boundary Limit at `119.95`. Any value greater than `119.95` would generate an Alert.
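Both worked examples can be reproduced with a few lines of SciPy (assumed here purely for illustration):

```python
# Reproducing the Lower and Upper Boundary Limit examples above:
# mean 100, standard deviation 10, and a Boundary of 0.977 on each side.
from scipy.stats import norm

mean, stddev = 100.0, 10.0
z = norm.ppf(0.977)  # ~1.9954 standard deviations

print(mean - z * stddev)  # Lower Boundary Limit: ~80.05
print(mean + z * stddev)  # Upper Boundary Limit: ~119.95
```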

## Sample Size

### Minimum Sample Size

A minimum sample size can be set for a Threshold. The Threshold will only run its statistical significance test if the number of historical Metrics is greater than or equal to the minimum sample size.

### Maximum Sample Size

A maximum sample size can be set for a Threshold. The Threshold will use only the most recent historical Metrics, capped at the maximum sample size, for its statistical significance test.

## Window Size

A window size in seconds can be set for a Threshold. The Threshold will use only the historical Metrics within the given time window for its statistical significance test.
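Taken together, the minimum sample size, maximum sample size, and window size determine which historical Metrics feed the statistical significance test. Here is a minimal sketch of how they could interact, assuming each historical Metric carries a Unix timestamp (the function and data layout are hypothetical):

```python
# Select the historical Metrics that feed the statistical test.
import time

def sample_history(history, min_sample_size=None, max_sample_size=None,
                   window=None):
    """`history` is a list of (timestamp, value) pairs, newest first.

    Returns None if the Minimum Sample Size is not met, in which case
    the statistical significance test would not run at all.
    """
    now = time.time()
    # Window Size: keep only Metrics inside the time window (in seconds).
    if window is not None:
        history = [(ts, v) for ts, v in history if now - ts <= window]
    # Maximum Sample Size: cap at the most recent Metrics.
    if max_sample_size is not None:
        history = history[:max_sample_size]
    # Minimum Sample Size: skip the test entirely if too few remain.
    if min_sample_size is not None and len(history) < min_sample_size:
        return None
    return history
```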

## Alerts

Alerts are generated when a Metric is below a Threshold’s Lower Boundary Limit or above a Threshold’s Upper Boundary Limit. To fail a CI build in the event of an Alert, set the `--err` flag when using the `bencher run` CLI subcommand.

Sometimes it can be useful to suppress Alerts for a particular Benchmark. The best way to do this is by adding one of these special suffixes to that Benchmark’s name:

• `_bencher_ignore`
• `BencherIgnore`
• `-bencher-ignore`

For example, if your Benchmark was named `my_flaky_benchmark` then renaming it to `my_flaky_benchmark_bencher_ignore` would ignore just that particular Benchmark going forward. Ignored Benchmarks do not get checked against the Threshold even if one exists. However, the metrics for ignored Benchmarks are still stored. Continuing with our example, the results from `my_flaky_benchmark_bencher_ignore` would still be stored in the database under `my_flaky_benchmark`. If you remove the suffix and return to the original Benchmark name, then things will pick right back up where you left off.
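As a rough illustration of this naming convention (not Bencher’s actual implementation), a helper that maps an ignored Benchmark name back to its base name might look like this:

```python
# Map an ignored Benchmark name back to the base name it is stored under.
IGNORE_SUFFIXES = ("_bencher_ignore", "BencherIgnore", "-bencher-ignore")

def base_benchmark_name(name: str) -> str:
    """Strip a trailing ignore suffix, if present."""
    for suffix in IGNORE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

assert base_benchmark_name("my_flaky_benchmark_bencher_ignore") == "my_flaky_benchmark"
```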