Confidence Interval Estimation

It is a fact universally acknowledged that a person in possession of a metric must be in want of a confidence interval. Indeed, the omnipresence of random noise means that anything purporting to represent the real world which does not feature an accompanying confidence interval should be regarded with suspicion. When performing a fairness analysis this concern becomes acute, since we divide our sample into smaller groups, increasing the relative effects of random noise. We are then generally interested in the difference or ratio of function evaluations on these groups, and noise always accumulates, even when the target values (the difference or ratio in this case) are getting smaller. Intersecting groups make the problem worse, since some intersections can have very low sample counts, or even be empty.

In Fairlearn, we offer bootstrapping as a means of estimating confidence intervals.


When analysing data, we do not (usually) have access to the entire population; instead we have a sample. How then should we estimate the confidence intervals associated with the metrics we compute? Bootstrapping is a simple approach based on resampling with replacement. The process is as follows:

  1. Create a number of bootstrap samples by:

    1. Creating a new data set of equal size to the original by random sampling with replacement

    2. Evaluating the metric on this data set

  2. Compute the distribution function of the set of bootstrap samples

  3. Estimate confidence intervals based on this distribution function
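The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Fairlearn's implementation: the names `quantile`, `bootstrap_quantiles` and `selection_rate_sketch` are hypothetical helpers, the data are made up, and the hand-written linear interpolation stands in for a library quantile routine.

```python
import random


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bootstrap_quantiles(metric, y_true, y_pred, n_boot, quantiles, seed):
    """Return the requested quantiles of `metric` over `n_boot` resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        # Step 1.1: draw n row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 1.2: evaluate the metric on the resampled rows
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    # Steps 2 and 3: sort to get the empirical distribution, then read off quantiles
    values.sort()
    return [quantile(values, q) for q in quantiles]


def selection_rate_sketch(y_true, y_pred):
    """Fraction of predictions equal to 1 (y_true is accepted but unused)."""
    return sum(y_pred) / len(y_pred)


# Made-up data, for illustration only
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
lower, median, upper = bootstrap_quantiles(
    selection_rate_sketch, y_true, y_pred,
    n_boot=200, quantiles=[0.159, 0.5, 0.841], seed=0,
)
print(lower, median, upper)
```

Fixing the seed makes the sketch reproducible. Changing the seed changes the reported bounds slightly, which is the Monte Carlo noise of the procedure itself.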

This is an easy and simple solution to a complex question, so we must immediately ask ourselves “Is this also wrong?” To answer this, we must first think carefully about what we have actually computed. Because we have been resampling our sample, the distribution of bootstrap samples will be based on our sample and not the entire population. Hence, we should say things like “there is a 95% likelihood that the metric of our sample lies between…” and not “there is a 95% likelihood that the metric lies between….” A full analysis is beyond the scope of this user guide, but it turns out that, so long as our sample is representative of the population, bootstrapping is a reasonable approach.

We then need to determine how many bootstrap samples are required. Bootstrapping is a Monte Carlo approach, so it introduces its own noise, and reducing this noise requires more bootstrap samples (assuming that a poor random number generator does not render the exercise futile). In practice, it has been found that around 100 bootstrap samples can give reasonable estimates. The number of bootstrap samples is trivial to increase, but always remember that while this may make the answers more precise (by reducing the noise due to the bootstrap sampling), it will not necessarily make them more accurate. This is because the accuracy of the bootstrapped confidence interval estimates depends on how well the data sample reflects the underlying population.
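The distinction between precision and accuracy can be illustrated with a small experiment. The sketch below is an assumption-laden illustration rather than a recommendation: the 30-value sample is synthetic, the helper names are hypothetical, and the 84th percentile from `statistics.quantiles` is used as a stand-in for the 0.841 quantile. It repeats the bootstrap with different seeds and measures how much the estimated upper bound moves from run to run.

```python
import random
import statistics

# A fixed, synthetic data sample of 30 values; this plays the role of "our sample"
data_rng = random.Random(1)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(30)]


def upper_bound(n_boot, seed):
    """84th percentile of the bootstrapped sample mean, for one random seed."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    ]
    # quantiles(..., n=100) returns the 1st..99th percentiles; index 83 is the 84th
    return statistics.quantiles(means, n=100, method="inclusive")[83]


def spread_across_seeds(n_boot, n_repeats=20):
    """Standard deviation of the estimated bound over repeated bootstrap runs."""
    bounds = [upper_bound(n_boot, seed) for seed in range(n_repeats)]
    return statistics.pstdev(bounds)


spread_small = spread_across_seeds(n_boot=20)
spread_large = spread_across_seeds(n_boot=1000)
print(f"spread with   20 resamples: {spread_small:.4f}")
print(f"spread with 1000 resamples: {spread_large:.4f}")
```

More resamples shrink the run-to-run spread, but every run is still tied to the same 30 values. If those values misrepresent the population, the tighter estimate is no closer to the truth.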

There is one final subtlety to remember: when we perform bootstrapping, we will be computing a separate confidence interval on each quantity. There is no reason to expect that the underlying distributions for any given pair of quantities are the same. If the confidence intervals for that pair of quantities overlap, we cannot conclude that the quantities are statistically identical. This is particularly important in a fairness analysis where the ‘good’ case is usually equality (to give a MetricFrame.difference() of zero or MetricFrame.ratio() of one). For fairness, we must confine ourselves to considering the size of the confidence intervals, and whether they are indicating that we need to gather more data.
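To make "a separate confidence interval on each quantity" concrete, the sketch below bootstraps two per-group selection rates and their difference, evaluating the difference inside each resample rather than deriving it from the two group intervals. It is an illustration only: the helper names are hypothetical, and skipping resamples in which one group is absent is an assumption of this sketch, not a description of how Fairlearn behaves.

```python
import random


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def selection_rate_by_group(y_pred, groups):
    """Selection rate per group label; groups absent from the data are omitted."""
    rates = {}
    for label in set(groups):
        members = [p for p, g in zip(y_pred, groups) if g == label]
        rates[label] = sum(members) / len(members)
    return rates


y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
groups = ['b', 'b', 'a', 'b', 'b', 'a', 'a', 'a', 'b',
          'a', 'b', 'a', 'b', 'b', 'a', 'b', 'b', 'b']

rng = random.Random(0)
n = len(y_pred)
rate_a, rate_b, difference = [], [], []
while len(difference) < 200:
    idx = [rng.randrange(n) for _ in range(n)]
    rates = selection_rate_by_group([y_pred[i] for i in idx], [groups[i] for i in idx])
    if len(rates) < 2:
        continue  # assumption: skip a resample that lost one group entirely
    rate_a.append(rates['a'])
    rate_b.append(rates['b'])
    # The difference is evaluated inside each resample, not from the two intervals
    difference.append(abs(rates['a'] - rates['b']))

for name, values in [('a', rate_a), ('b', rate_b), ('difference', difference)]:
    values.sort()
    low, high = quantile(values, 0.159), quantile(values, 0.841)
    print(f"{name}: [{low:.3f}, {high:.3f}] width {high - low:.3f}")
```

The printed widths are the quantity of interest here. On a data set this small they are large, which suggests gathering more data rather than drawing conclusions about equality.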

Bootstrapping MetricFrame

We will now work through a short example of using MetricFrame’s bootstrapping capabilities. We start by setting up a very simple and small dataset, and a couple of metrics:

>>> y_true = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
>>> sf_data = ['b', 'b', 'a', 'b', 'b', 'a', 'a', 'a', 'b',
...            'a', 'b', 'a', 'b', 'b', 'a', 'b', 'b', 'b']
>>> import pandas as pd
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.width', 80)
>>> from fairlearn.metrics import MetricFrame
>>> from fairlearn.metrics import count, selection_rate
>>> # Construct a function dictionary
>>> my_metrics = {
...     'sel' : selection_rate,
...     'count' : count
... }

With everything set up, we can now construct a MetricFrame with bootstrapping enabled. There are three relevant arguments for the constructor:

  • n_boot

  • ci_quantiles

  • random_state

Internally, MetricFrame will construct n_boot bootstrap samples (i.e. variations of the supplied dataset generated by sampling with replacement), according to the supplied random_state. Each quantity available (such as MetricFrame.overall or MetricFrame.difference()) is then evaluated for each of the bootstrap samples. The distribution of each is estimated via numpy.quantile() and the quantiles specified in ci_quantiles are extracted. Since the quantiles are estimated from a distribution, the bootstrapped results will always be floating point numbers, even if the input data are integers (such as counts). We create our MetricFrame thus:

>>> # Construct a MetricFrame with bootstrapping
>>> mf = MetricFrame(
...     metrics=my_metrics,
...     y_true=y_true,
...     y_pred=y_pred,
...     sensitive_features=sf_data,
...     n_boot=100,
...     ci_quantiles=[0.159, 0.5, 0.841],
...     random_state=20231019
... )
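The values 0.159 and 0.841 passed as ci_quantiles can be related to a normal distribution using only the standard library. This is a quick check rather than part of the Fairlearn API, and the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probability at one standard deviation below and above the mean
lower_q = standard_normal.cdf(-1.0)
upper_q = standard_normal.cdf(1.0)

print(round(lower_q, 3), round(upper_q, 3))  # 0.159 0.841
print(round(upper_q - lower_q, 3))           # 0.683
```

For a normally distributed quantity, the interval between these two quantiles therefore covers roughly 68% of the distribution. Bootstrapped metrics on small samples need not be normal, so this reading is approximate.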

The quantiles we have chosen (in ci_quantiles) correspond to the median and, for a normal distribution, to one standard deviation either side of the mean. The ‘normal’ functionality of MetricFrame is still available. For example:

>>> mf.overall
sel       0.555556
count    18.000000
dtype: float64
>>> mf.by_group
                          sel  count
sensitive_feature_0
a                    0.714286    7.0
b                    0.454545   11.0

Let us look at the features bootstrapping makes available. First, the MetricFrame.ci_quantiles property records the confidence interval quantiles which we requested:

>>> mf.ci_quantiles
[0.159, 0.5, 0.841]

Now, we can start looking at the quantities we have computed. These are obtained by adding _ci to the existing functionality. The result is a sequence with one element for each entry of MetricFrame.ci_quantiles, in the same order, where each element is of the same type as the result of the non-bootstrapped function. For example, consider MetricFrame.overall_ci:

>>> _ = [print(x, '\n--') for x in mf.overall_ci]
sel       0.444444
count    18.000000
dtype: float64
--
sel       0.555556
count    18.000000
dtype: float64
--
sel       0.666667
count    18.000000
dtype: float64
--

We see that, for the overall metrics, the bootstrapped count() value is unchanged in each case. This is as we would expect: each sample is constructed to have the same number of entries as the original. However, the selection_rate() metric has been found to have values of 0.444, 0.556 and 0.667 for the quantiles specified. These values are in line with expectations (although note that with small numbers and ‘proportion’ metrics like selection_rate(), the median can quickly deviate from the nominal value). Next, we can examine MetricFrame.by_group_ci:

>>> _ = [print(x, '\n--') for x in mf.by_group_ci]
                          sel  count
sensitive_feature_0
a                    0.500000    5.0
b                    0.333333    9.0
--
                          sel  count
sensitive_feature_0
a                    0.700000    6.5
b                    0.440972   11.5
--
                          sel  count
sensitive_feature_0
a                    0.891767    9.0
b                    0.583333   13.0
--

We now have much more to dig into. Firstly, the count() metric is showing a variation, reflecting the fact that the resampled data will generally have different proportions of labels a and b. Also, the sum of the count column no longer has to be an integer, or equal to the total number of samples. For the median the sum is as expected, but the individual counts are no longer integers; this is not surprising, since we requested an even number of bootstrap samples, so the median is interpolated between the two middle values. When we inspect the sel column, we see that the estimates of the median, while close to the nominal values (from MetricFrame.by_group above), are not equal to them. In all cases, though, the numbers reported are intuitively reasonable.
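The behaviour of the count column can be reproduced with a few lines of standard-library Python. The sketch below resamples only the sensitive feature labels. It is an illustration under the assumption that a plain median is a fair proxy for the 0.5 quantile, and the numbers it prints will not match the MetricFrame output above because the random draws differ.

```python
import random
import statistics

sf_data = ['b', 'b', 'a', 'b', 'b', 'a', 'a', 'a', 'b',
           'a', 'b', 'a', 'b', 'b', 'a', 'b', 'b', 'b']

rng = random.Random(0)
counts_a, counts_b = [], []
for _ in range(100):  # an even number of resamples
    resample = rng.choices(sf_data, k=len(sf_data))
    counts_a.append(resample.count('a'))
    counts_b.append(resample.count('b'))

# Within every resample the two group counts still add up to 18
assert all(a + b == len(sf_data) for a, b in zip(counts_a, counts_b))

# With an even number of values the median is the mean of the two middle ones,
# so integer counts can produce a half-integer median
print(statistics.median([5, 6, 6, 7, 7, 8]))  # 6.5

median_a = statistics.median(counts_a)
median_b = statistics.median(counts_b)
print(median_a, median_b, median_a + median_b)
```

Because each resample's count for b is 18 minus its count for a, the two medians mirror each other and always sum to 18. The same mirroring pairs a low quantile of one group with a high quantile of the other, which is why the lower and upper rows of the count column do not sum to 18.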

We provide methods such as MetricFrame.group_min_ci(), which are similar to their non-bootstrapped counterparts. However, they have no errors parameter. This parameter controls what happens when a metric returns a result for which less-than is not well defined (e.g. a confusion matrix). A bootstrapped MetricFrame will not even get this far, since the lack of a less-than operator will cause the numpy.quantile() call to fail.


Bootstrapping is a powerful and simple technique, but its limitations must be borne in mind:

  • Bootstrapping assumes the data sample is representative of the population

  • Overlapping confidence intervals do not imply statistical equality. This is very important in a fairness analysis where we are usually hoping for equality

  • Increasing the number of bootstrap samples will make the results more precise, but not necessarily more accurate. The accuracy of the results depends on the degree to which the supplied data are representative of the population

  • As a Monte Carlo technique, it can only be as good as the underlying random number generator

The first of these limitations is likely to give the most trouble in practice.