fairlearn.metrics package
Functionality for computing metrics, with a particular focus on disaggregated metrics.
For our purposes, a metric is a function with signature f(y_true, y_pred, ...) where y_true are the true values and y_pred are the values predicted by a machine learning algorithm. Other arguments may be present (most often sample weights), which will affect how the metric is calculated.
This module provides the concept of a disaggregated metric.
This is a metric where in addition to y_true
and y_pred
values, the user provides information about group membership
for each sample.
For example, a user could provide a ‘Gender’ column, and the
disaggregated metric would contain separate results for the subgroups
‘male’, ‘female’ and ‘nonbinary’ indicated by that column.
The underlying metric function is evaluated for each of these three
subgroups.
This extends to multiple grouping columns, calculating the metric
for each combination of subgroups.
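As a quick sketch of that idea (the data below is invented purely for illustration; the MetricFrame examples later on this page are the canonical ones), the metric is evaluated once per combination of the two grouping columns:
>>> from sklearn.metrics import accuracy_score
>>> from fairlearn.metrics import MetricFrame
>>> y_true = [0, 1, 1, 0, 1, 1]
>>> y_pred = [0, 1, 0, 0, 1, 1]
>>> gender = ['male', 'female', 'nonbinary', 'male', 'female', 'nonbinary']
>>> age_band = ['under_40', 'under_40', 'over_40', 'over_40', 'under_40', 'over_40']
>>> mf = MetricFrame(
...     metrics=accuracy_score,
...     y_true=y_true,
...     y_pred=y_pred,
...     sensitive_features={'Gender': gender, 'AgeBand': age_band})
>>> per_subgroup = mf.by_group   # one accuracy value per (Gender, AgeBand) pair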
- class fairlearn.metrics.MetricFrame(*, metrics, y_true, y_pred, sensitive_features, control_features=None, sample_params=None)
Bases: object
Collection of disaggregated metric values.
This data structure stores and manipulates disaggregated values for any number of underlying metrics. At least one sensitive feature must be supplied, which is used to split the data into subgroups. The underlying metric(s) is (are) calculated across the entire dataset (made available by the overall property) and for each identified subgroup (made available by the by_group property).
The only limitations placed on the metric functions are that:
- The first two arguments they take must be y_true and y_pred arrays.
- Any other arguments must correspond to sample properties (such as sample weights), meaning that their first dimension is the same as that of y_true and y_pred. These arguments will be split up along with the y_true and y_pred arrays.
The interpretation of the y_true and y_pred arrays is up to the underlying metric; it is perfectly possible to pass in lists of class probability tuples. We currently also support non-scalar return types for the metric functions (such as confusion matrices), although the aggregation functions are not well defined in that case.
Group fairness metrics are obtained by methods that implement various aggregators over group-level metrics, such as the maximum, minimum, or the worst-case difference or ratio.
This data structure also supports the concept of ‘control features.’ Like the sensitive features, control features identify subgroups within the data, but aggregations are not performed over the control features. Instead, the aggregations produce a result for each subgroup identified by the control feature(s). The name ‘control features’ refers to the statistical practice of ‘controlling’ for a variable.
Read more in the User Guide.
- Parameters
metrics (callable or dict) – The underlying metric functions which are to be calculated. This can either be a single metric function or a dictionary of functions. These functions must be callable as fn(y_true, y_pred, **sample_params). If any other arguments are required (such as beta for sklearn.metrics.fbeta_score()), then functools.partial() must be used. Note that the values returned by various members of the class change based on whether this argument is a callable or a dictionary of callables. This distinction remains even if the dictionary contains only a single entry.
y_true (List, pandas.Series, numpy.ndarray, pandas.DataFrame) – The ground-truth labels (for classification) or target values (for regression).
y_pred (List, pandas.Series, numpy.ndarray, pandas.DataFrame) – The predictions.
sensitive_features (List, pandas.Series, dict of 1d arrays, numpy.ndarray, pandas.DataFrame) – The sensitive features which should be used to create the subgroups. At least one sensitive feature must be provided. All names (whether on pandas objects or dictionary keys) must be strings. We also forbid DataFrames with column names of None. For cases where no names are provided, we generate names of the form sensitive_feature_[n].
control_features (List, pandas.Series, dict of 1d arrays, numpy.ndarray, pandas.DataFrame) – Control features are similar to sensitive features, in that they divide the input data into subgroups. Unlike the sensitive features, aggregations are not performed across the control features; for example, the overall property will have one value for each subgroup in the control feature(s), rather than a single value for the entire data set. Control features can be specified similarly to the sensitive features. However, their default names (if none can be identified in the input values) are of the form control_feature_[n]. See the section on intersecting groups in the User Guide to learn how to use control levels. Note that the types returned by members of the class vary based on whether control features are present.
sample_params (dict) – Parameters for the metric function(s). If there is only one metric function, then this is a dictionary of strings and array-like objects, which are split alongside the y_true and y_pred arrays, and passed to the metric function. If there are multiple metric functions (passed as a dictionary), then this is a nested dictionary, with the first set of string keys identifying the metric function name, and the values being the string-to-array-like dictionaries.
metric (callable or dict) – The underlying metric functions which are to be calculated. This can either be a single metric function or a dictionary of functions. These functions must be callable as fn(y_true, y_pred, **sample_params). If any other arguments are required (such as beta for sklearn.metrics.fbeta_score()), then functools.partial() must be used.
Deprecated since version 0.7.0: metric will be removed in version 0.10.0; use metrics instead.
Examples
We will now go through some simple examples (see the User Guide for a more in-depth discussion):
>>> from fairlearn.metrics import MetricFrame, selection_rate
>>> from sklearn.metrics import accuracy_score
>>> import pandas as pd
>>> y_true = [1,1,1,1,1,0,0,1,1,0]
>>> y_pred = [0,1,1,1,1,0,0,0,1,1]
>>> sex = ['Female']*5 + ['Male']*5
>>> metrics = {"selection_rate": selection_rate}
>>> mf1 = MetricFrame(
...     metrics=metrics,
...     y_true=y_true,
...     y_pred=y_pred,
...     sensitive_features=sex)
Access the disaggregated metrics via a pandas Series
>>> mf1.by_group
                     selection_rate
sensitive_feature_0
Female                          0.8
Male                            0.4
Access the largest difference, smallest ratio, and worst case performance
>>> print(f"difference: {mf1.difference()[0]:.3} " ... f"ratio: {mf1.ratio()[0]:.3} " ... f"max across groups: {mf1.group_max()[0]:.3}") difference: 0.4 ratio: 0.5 max across groups: 0.8
You can also evaluate multiple metrics by providing a dictionary
>>> metrics_dict = {"accuracy":accuracy_score, "selection_rate": selection_rate} >>> mf2 = MetricFrame( ... metrics=metrics_dict, ... y_true=y_true, ... y_pred=y_pred, ... sensitive_features=sex)
Access the disaggregated metrics via a pandas DataFrame
>>> mf2.by_group
                     accuracy  selection_rate
sensitive_feature_0
Female                    0.8             0.8
Male                      0.6             0.4
The largest difference, smallest ratio, and the maximum and minimum values across the groups are then all pandas Series, for example:
>>> mf2.difference()
accuracy          0.2
selection_rate    0.4
dtype: float64
You’ll probably want to view them transposed
>>> pd.DataFrame({'difference': mf2.difference(),
...               'ratio': mf2.ratio(),
...               'group_min': mf2.group_min(),
...               'group_max': mf2.group_max()}).T
            accuracy  selection_rate
difference      0.2              0.4
ratio           0.75             0.5
group_min       0.6              0.4
group_max       0.8              0.8
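Control features (described above) are passed in the same way; the age_band column below is invented purely for illustration. With control features present, overall holds one entry per control subgroup rather than a single value:
>>> age_band = ['under_40'] * 3 + ['over_40'] * 2 + ['under_40'] * 2 + ['over_40'] * 3
>>> mf3 = MetricFrame(
...     metrics=metrics_dict,
...     y_true=y_true,
...     y_pred=y_pred,
...     sensitive_features=sex,
...     control_features=age_band)
>>> overall_by_band = mf3.overall   # DataFrame: one row per age_band value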
More information about plotting metrics can be found in the plotting section of the User Guide.
- Attributes
by_group – Return the collection of metrics evaluated for each subgroup.
control_levels – Return a list of feature names which are produced by control features.
overall – Return the underlying metrics evaluated on the whole dataset.
sensitive_levels – Return a list of the feature names which are produced by sensitive features.
Methods
difference([method, errors]) – Return the maximum absolute difference between groups for each metric.
group_max([errors]) – Return the maximum value of the metric over the sensitive features.
group_min([errors]) – Return the minimum value of the metric over the sensitive features.
ratio([method, errors]) – Return the minimum ratio between groups for each metric.
- difference(method='between_groups', errors='coerce')
Return the maximum absolute difference between groups for each metric.
This method calculates a scalar value for each underlying metric by finding the maximum absolute difference between the entries in each combination of sensitive features in the by_group property.
As with other methods, the result type varies with the specification of the metric functions, and whether control features are present or not.
There are two allowed values for the method= parameter. The value between_groups computes the maximum difference between any two pairs of groups in the by_group property (i.e. group_max() - group_min()). Alternatively, to_overall computes the difference between each subgroup and the corresponding value from overall (if there are control features, then overall is multivalued for each metric). The result is the absolute maximum of these values.
Read more in the User Guide.
- Parameters
method (str) – How to compute the aggregate. Default is between_groups.
errors ({'raise', 'coerce'}, default 'coerce') – If 'raise', then invalid parsing will raise an exception; if 'coerce', then invalid parsing will be set as NaN.
- Returns
The exact type follows the table in MetricFrame.overall.
- Return type
Any or pandas.Series or pandas.DataFrame
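A minimal sketch of the two aggregation modes, reusing mf2 from the examples above:
>>> d_between = mf2.difference(method='between_groups')   # per metric: group_max() - group_min()
>>> d_overall = mf2.difference(method='to_overall')       # per metric: max |subgroup - overall|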
- group_max(errors='raise')
Return the maximum value of the metric over the sensitive features.
This method computes the maximum value over all combinations of sensitive features for each underlying metric function in the by_group property (it will only succeed if all the underlying metric functions return scalar values). The exact return type depends on whether control features are present, and whether the metric functions were specified as a single callable or a dictionary.
Read more in the User Guide.
- Parameters
errors ({'raise', 'coerce'}, default 'raise') – If 'raise', then invalid parsing will raise an exception; if 'coerce', then invalid parsing will be set as NaN.
- Returns
The maximum value over sensitive features. The exact type follows the table in MetricFrame.overall.
- Return type
Any or pandas.Series or pandas.DataFrame
- group_min(errors='raise')
Return the minimum value of the metric over the sensitive features.
This method computes the minimum value over all combinations of sensitive features for each underlying metric function in the by_group property (it will only succeed if all the underlying metric functions return scalar values). The exact return type depends on whether control features are present, and whether the metric functions were specified as a single callable or a dictionary.
Read more in the User Guide.
- Parameters
errors ({'raise', 'coerce'}, default 'raise') – If 'raise', then invalid parsing will raise an exception; if 'coerce', then invalid parsing will be set as NaN.
- Returns
The minimum value over sensitive features. The exact type follows the table in MetricFrame.overall.
- Return type
Any or pandas.Series or pandas.DataFrame
- ratio(method='between_groups', errors='coerce')
Return the minimum ratio between groups for each metric.
This method calculates a scalar value for each underlying metric by finding the minimum ratio (that is, the ratio is forced to be less than unity) between the entries in each column of the by_group property.
As with other methods, the result type varies with the specification of the metric functions, and whether control features are present or not.
There are two allowed values for the method= parameter. The value between_groups computes the minimum ratio between any two pairs of groups in the by_group property (i.e. group_min() / group_max()). Alternatively, to_overall computes the ratio between each subgroup and the corresponding value from overall (if there are control features, then overall is multivalued for each metric), expressing each ratio as a number less than 1. The result is the minimum of these values.
Read more in the User Guide.
- Parameters
method (str) – How to compute the aggregate. Default is between_groups.
errors ({'raise', 'coerce'}, default 'coerce') – If 'raise', then invalid parsing will raise an exception; if 'coerce', then invalid parsing will be set as NaN.
- Returns
The exact type follows the table in MetricFrame.overall.
- Return type
Any or pandas.Series or pandas.DataFrame
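A matching sketch for the two ratio modes, again reusing mf2 from the examples above:
>>> r_between = mf2.ratio(method='between_groups')   # per metric: group_min() / group_max()
>>> r_overall = mf2.ratio(method='to_overall')       # per metric: smallest subgroup/overall ratio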
- property by_group: Union[pandas.core.series.Series, pandas.core.frame.DataFrame]
Return the collection of metrics evaluated for each subgroup.
The collection is defined by the combination of classes in the sensitive and control features. The exact type depends on the specification of the metric function.
Read more in the User Guide.
- Returns
When a callable is supplied to the constructor, the result is a pandas.Series, indexed by the combinations of subgroups in the sensitive and control features.
When the metric functions were specified with a dictionary (even if the dictionary has only a single entry), the result is a pandas.DataFrame with columns named after the metric functions, and rows indexed by the combinations of subgroups in the sensitive and control features.
If a particular combination of subgroups was not present in the dataset (likely to occur as more sensitive and control features are specified), then the corresponding entry will be NaN.
- Return type
pandas.Series or pandas.DataFrame
- property control_levels: Optional[List[str]]
Return a list of feature names which are produced by control features.
If control features are present, then the rows of the by_group property have a pandas.MultiIndex index. This property identifies which elements of that index are control features.
- Returns
List of names, which can be used in calls to pandas.DataFrame.groupby() etc.
- Return type
List[str] or None
- property overall: Union[Any, pandas.core.series.Series, pandas.core.frame.DataFrame]
Return the underlying metrics evaluated on the whole dataset.
Read more in the User Guide.
- Returns
The exact type varies based on whether control features were provided and how the metric functions were specified.
Metrics    Control Features    Result Type
Callable   None                Return type of callable
Callable   Provided            Series, indexed by the subgroups of the conditional feature(s)
Dict       None                Series, indexed by the metric names
Dict       Provided            DataFrame. Columns are metric names, rows are subgroups of conditional feature(s)
The distinction applies even if the dictionary contains a single metric function. This allows for a consistent interface when calling programmatically, while also reducing typing for those using Fairlearn interactively.
- Return type
Any or pandas.Series or pandas.DataFrame
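A brief sketch of the first and third rows of this table, reusing the example data from above (no control features):
>>> mf_scalar = MetricFrame(metrics=accuracy_score, y_true=y_true,
...                         y_pred=y_pred, sensitive_features=sex)
>>> scalar_overall = mf_scalar.overall   # plain float: single callable, no control features
>>> series_overall = mf2.overall         # pandas.Series indexed by metric name: dict, no control features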
- property sensitive_levels: List[str]
Return a list of the feature names which are produced by sensitive features.
In cases where the by_group property has a pandas.MultiIndex index, this identifies which elements of the index are sensitive features.
Read more in the User Guide.
- Returns
List of names, which can be used in calls to pandas.DataFrame.groupby() etc.
- Return type
List[str]
- fairlearn.metrics.count(y_true, y_pred)
Calculate the number of data points in each group when working with MetricFrame.
The y_true argument is used to make this calculation. For consistency with other metric functions, the y_pred argument is required, but ignored.
Read more in the User Guide.
- Parameters
y_true (array_like) – The list of true labels
y_pred (array_like) – The predicted labels (ignored)
- Returns
The number of data points in each group.
- Return type
int
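count is intended for use inside a MetricFrame, where it reports the size of each subgroup alongside other metrics. A sketch reusing the example data from above:
>>> from fairlearn.metrics import count
>>> mf_counts = MetricFrame(
...     metrics={"accuracy": accuracy_score, "count": count},
...     y_true=y_true,
...     y_pred=y_pred,
...     sensitive_features=sex)
>>> sizes = mf_counts.by_group["count"]   # 5 samples each for 'Female' and 'Male'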
- fairlearn.metrics.demographic_parity_difference(y_true, y_pred, *, sensitive_features, method='between_groups', sample_weight=None)
Calculate the demographic parity difference.
The demographic parity difference is defined as the difference between the largest and the smallest group-level selection rate, \(E[h(X) | A=a]\), across all values \(a\) of the sensitive feature(s). A demographic parity difference of 0 means that all groups have the same selection rate.
Read more in the User Guide.
- Parameters
y_true (array-like) – Ground truth (correct) labels.
y_pred (array-like) – Predicted labels \(h(X)\) returned by the classifier.
sensitive_features – The sensitive features over which demographic parity should be assessed
method (str) – How to compute the differences. See fairlearn.metrics.MetricFrame.difference() for details.
sample_weight (array-like) – The sample weights
- Returns
The demographic parity difference
- Return type
float
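A minimal usage sketch with the example data from above; there the group selection rates are 0.8 and 0.4, so the difference is 0.4:
>>> from fairlearn.metrics import demographic_parity_difference
>>> dpd = demographic_parity_difference(y_true, y_pred,
...                                     sensitive_features=sex)   # 0.8 - 0.4 = 0.4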
- fairlearn.metrics.demographic_parity_ratio(y_true, y_pred, *, sensitive_features, method='between_groups', sample_weight=None)
Calculate the demographic parity ratio.
The demographic parity ratio is defined as the ratio between the smallest and the largest group-level selection rate, \(E[h(X) | A=a]\), across all values \(a\) of the sensitive feature(s). A demographic parity ratio of 1 means that all groups have the same selection rate.
Read more in the User Guide.
- Parameters
y_true (array-like) – Ground truth (correct) labels.
y_pred (array-like) – Predicted labels \(h(X)\) returned by the classifier.
sensitive_features – The sensitive features over which demographic parity should be assessed
method (str) – How to compute the ratios. See fairlearn.metrics.MetricFrame.ratio() for details.
sample_weight (array-like) – The sample weights
- Returns
The demographic parity ratio
- Return type
float
- fairlearn.metrics.equalized_odds_difference(y_true, y_pred, *, sensitive_features, method='between_groups', sample_weight=None)
Calculate the equalized odds difference.
The greater of two metrics: true_positive_rate_difference and false_positive_rate_difference. The former is the difference between the largest and smallest of \(P[h(X)=1 | A=a, Y=1]\), across all values \(a\) of the sensitive feature(s). The latter is defined similarly, but for \(P[h(X)=1 | A=a, Y=0]\). An equalized odds difference of 0 means that all groups have the same true positive, true negative, false positive, and false negative rates.
Read more in the User Guide.
- Parameters
y_true (array-like) – Ground truth (correct) labels.
y_pred (array-like) – Predicted labels \(h(X)\) returned by the classifier.
sensitive_features – The sensitive features over which equalized odds should be assessed
method (str) – How to compute the differences. See fairlearn.metrics.MetricFrame.difference() for details.
sample_weight (array-like) – The sample weights
- Returns
The equalized odds difference
- Return type
float
- fairlearn.metrics.equalized_odds_ratio(y_true, y_pred, *, sensitive_features, method='between_groups', sample_weight=None)
Calculate the equalized odds ratio.
The smaller of two metrics: true_positive_rate_ratio and false_positive_rate_ratio. The former is the ratio between the smallest and largest of \(P[h(X)=1 | A=a, Y=1]\), across all values \(a\) of the sensitive feature(s). The latter is defined similarly, but for \(P[h(X)=1 | A=a, Y=0]\). An equalized odds ratio of 1 means that all groups have the same true positive, true negative, false positive, and false negative rates.
Read more in the User Guide.
- Parameters
y_true (array-like) – Ground truth (correct) labels.
y_pred (array-like) – Predicted labels \(h(X)\) returned by the classifier.
sensitive_features – The sensitive features over which equalized odds should be assessed
method (str) – How to compute the ratios. See fairlearn.metrics.MetricFrame.ratio() for details.
sample_weight (array-like) – The sample weights
- Returns
The equalized odds ratio
- Return type
float
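A usage sketch on the example data from above; only the call pattern is shown, since the values depend on the per-group true and false positive rates:
>>> from fairlearn.metrics import equalized_odds_difference, equalized_odds_ratio
>>> eod = equalized_odds_difference(y_true, y_pred, sensitive_features=sex)
>>> eor = equalized_odds_ratio(y_true, y_pred, sensitive_features=sex)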
- fairlearn.metrics.false_negative_rate(y_true, y_pred, sample_weight=None, pos_label=None)
Calculate the false negative rate (also called miss rate).
Read more in the User Guide.
- Parameters
y_true (array-like) – The list of true values
y_pred (array-like) – The list of predicted values
sample_weight (array-like, optional) – A list of weights to apply to each sample. By default all samples are weighted equally
pos_label (scalar, optional) – The value to treat as the ‘positive’ label in the samples. If None (the default) then the largest unique value of the y arrays will be used.
- Returns
The false negative rate for the data
- Return type
float
- fairlearn.metrics.false_positive_rate(y_true, y_pred, sample_weight=None, pos_label=None)
Calculate the false positive rate (also called fall-out).
Read more in the User Guide.
- Parameters
y_true (array-like) – The list of true values
y_pred (array-like) – The list of predicted values
sample_weight (array-like, optional) – A list of weights to apply to each sample. By default all samples are weighted equally
pos_label (scalar, optional) – The value to treat as the ‘positive’ label in the samples. If None (the default) then the largest unique value of the y arrays will be used.
- Returns
The false positive rate for the data
- Return type
float
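A short sketch of the pos_label behaviour shared by these rate metrics (the toy arrays are invented): by default the largest unique value in the y arrays is treated as the positive label, but any label can be chosen explicitly:
>>> from fairlearn.metrics import false_positive_rate
>>> fpr_default = false_positive_rate([0, 0, 1, 1], [1, 0, 1, 0])               # pos_label inferred as 1
>>> fpr_flipped = false_positive_rate([0, 0, 1, 1], [1, 0, 1, 0], pos_label=0)  # treat 0 as 'positive'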
- fairlearn.metrics.make_derived_metric(*, metric, transform, sample_param_names=['sample_weight'])
Create a scalar-returning metric function based on aggregation of a disaggregated metric.
Many higher-order machine learning operations (such as hyperparameter tuning) make use of functions which return scalar metrics; this function creates such a scalar-returning function from a disaggregated metric.
This function takes a metric function, a string to specify the desired aggregation transform (matching the methods MetricFrame.group_min(), MetricFrame.group_max(), MetricFrame.difference() and MetricFrame.ratio()), and a list of parameter names to treat as sample parameters.
The result is a callable object which has the same signature as the original function, with a sensitive_features= parameter added. If the chosen aggregation transform accepts parameters (currently only method= is supported), these can also be given when invoking the callable object. The result of this function is identical to creating a MetricFrame object and then calling the method specified by the transform= argument (with the method= argument, if required).
See the Defining custom fairness metrics section in the User Guide for more details. A sample notebook is also available.
- Parameters
metric (callable) – The metric function from which the new function should be derived
transform (str) – Selects the aggregation transformation the resulting function should use. The list of possible options is: [‘difference’, ‘group_min’, ‘group_max’, ‘ratio’].
sample_param_names (List[str]) – A list of parameter names of the underlying metric which should be treated as sample parameters (i.e. having the same leading dimension as the y_true and y_pred parameters). This defaults to a list with the single entry sample_weight (as used by many scikit-learn metrics). If None or an empty list is supplied, then no parameters will be treated as sample parameters.
- Returns
Function with the same signature as the metric, but with additional sensitive_features= and method= arguments, to enable the required computation
- Return type
callable
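For example, a sketch following the fbeta_score pattern mentioned above (y_true, y_pred and sex are the arrays from the MetricFrame examples):
>>> import functools
>>> from sklearn.metrics import fbeta_score
>>> from fairlearn.metrics import make_derived_metric
>>> fbeta_06 = functools.partial(fbeta_score, beta=0.6)
>>> fbeta_06_diff = make_derived_metric(metric=fbeta_06,
...                                     transform='difference')
>>> result = fbeta_06_diff(y_true, y_pred,
...                        sensitive_features=sex,
...                        method='between_groups')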
- fairlearn.metrics.mean_prediction(y_true, y_pred, sample_weight=None)
Calculate the (weighted) mean prediction.
The true values are ignored, but required as an argument in order to maintain a consistent interface.
- Parameters
y_true (array_like) – The true labels (ignored)
y_pred (array_like) – The predicted labels
sample_weight (array_like) – Optional array of sample weights
- Return type
float
- fairlearn.metrics.plot_model_comparison(*, y_preds, y_true=None, sensitive_features=None, x_axis_metric=None, y_axis_metric=None, ax=None, axis_labels=True, point_labels=False, point_labels_position=(0, 0), legend=False, show_plot=False, **kwargs)
Create a scatter plot comparing multiple models along two metrics.
A typical use case is when one of the metrics is a performance metric (e.g., balanced_accuracy) and the other is a fairness metric (e.g., false_negative_rate_difference).
- Parameters
y_preds (array-like, dict of array-like) – An array-like containing predictions per model. Hence, predictions of model i should be in y_preds[i].
y_true (List, pandas.Series, numpy.ndarray, pandas.DataFrame) – The ground-truth labels (for classification) or target values (for regression).
sensitive_features (List, pandas.Series, dict of 1d arrays, numpy.ndarray, pandas.DataFrame, optional) – Sensitive features for the fairness metrics (if a fairness metric is specified for the x-axis or the y-axis).
x_axis_metric (Callable) – The metric function for the x-axis. The metric function must take y_true, y_pred, and optionally sensitive_features as arguments, and return a scalar value.
y_axis_metric (Callable) – The metric function for the y-axis, similar to x_axis_metric. The metric function must take y_true, y_pred, and optionally sensitive_features as arguments, and return a scalar value.
ax (matplotlib.axes.Axes, optional) – If supplied, the scatter plot is drawn on this Axes object. Else, a new figure with Axes is created.
axis_labels (bool, list) – If true, add the names of x and y axis metrics. You can also pass a list of size two (or a two-tuple) of strings to use as axis labels instead.
point_labels (bool, list) – If true, annotate text with inferred point labels. These labels are the keys of y_preds if y_preds is a dictionary, else simply the integers 0…number of points - 1. You can specify point_labels as a list of labels as well.
point_labels_position (list) – A list (or a two-tuple) containing precisely two numbers that define the offset of the point labels in the x and y direction, respectively. The offset value is in data coordinates, not pixels.
legend (bool) – If True, add a legend. Legend entries are created by passing the keyword argument label to calls to this function. If you want to customize the legend, you should manually call ax.legend (where ax is the Axes object) with your customization params.
show_plot (bool) – If true, call pyplot.show.
- Returns
ax – The Axes object that was drawn on.
- Return type
matplotlib.axes.Axes
Notes
To offer flexibility in stylistic features besides the aforementioned API options, one has at least three options: 1) supply matplotlib arguments to plot_model_comparison as you normally would to matplotlib.axes.Axes.scatter; 2) change the style of the returned Axes; or 3) supply an Axes with your own style already applied.
In case no Axes object is supplied, axis labels are automatically inferred from their class name.
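A minimal sketch (the two models and their predictions are invented; the derived y-axis metric is built with make_derived_metric, documented above; y_true and sex are the arrays from the MetricFrame examples):
>>> from sklearn.metrics import balanced_accuracy_score
>>> from fairlearn.metrics import (false_negative_rate, make_derived_metric,
...                                plot_model_comparison)
>>> fnr_diff = make_derived_metric(metric=false_negative_rate,
...                                transform='difference')
>>> y_preds = {'model_a': [0, 1, 1, 1, 1, 0, 0, 0, 1, 1],   # hypothetical outputs
...            'model_b': [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]}
>>> ax = plot_model_comparison(
...     y_preds=y_preds,
...     y_true=y_true,
...     sensitive_features=sex,
...     x_axis_metric=balanced_accuracy_score,
...     y_axis_metric=fnr_diff,
...     point_labels=True,
...     show_plot=False)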
- fairlearn.metrics.selection_rate(y_true, y_pred, *, pos_label=1, sample_weight=None)
Calculate the fraction of predicted labels matching the ‘good’ outcome.
The argument pos_label specifies the ‘good’ outcome. For consistency with other metric functions, the y_true argument is required, but ignored.
Read more in the User Guide.
- Parameters
y_true (array_like) – The true labels (ignored)
y_pred (array_like) – The predicted labels
pos_label (Scalar) – The label to treat as the ‘good’ outcome
sample_weight (array_like) – Optional array of sample weights
- Return type
float
- fairlearn.metrics.true_negative_rate(y_true, y_pred, sample_weight=None, pos_label=None)
Calculate the true negative rate (also called specificity or selectivity).
Read more in the User Guide.
- Parameters
y_true (array-like) – The list of true values
y_pred (array-like) – The list of predicted values
sample_weight (array-like, optional) – A list of weights to apply to each sample. By default all samples are weighted equally
pos_label (scalar, optional) – The value to treat as the ‘positive’ label in the samples. If None (the default) then the largest unique value of the y arrays will be used.
- Returns
The true negative rate for the data
- Return type
float
- fairlearn.metrics.true_positive_rate(y_true, y_pred, sample_weight=None, pos_label=None)
Calculate the true positive rate (also called sensitivity, recall, or hit rate).
Read more in the User Guide.
- Parameters
y_true (array-like) – The list of true values
y_pred (array-like) – The list of predicted values
sample_weight (array-like, optional) – A list of weights to apply to each sample. By default all samples are weighted equally
pos_label (scalar, optional) – The value to treat as the ‘positive’ label in the samples. If None (the default) then the largest unique value of the y arrays will be used.
- Returns
The true positive rate for the data
- Return type
float