Metrics with Multiple Features
This notebook demonstrates the new API for metrics, which supports multiple sensitive and conditional features. This example does not contain a proper discussion of how fairness relates to the dataset used, although it does highlight issues which users may want to consider when analysing their datasets.
We are going to consider a lending scenario, supposing that we have a model which predicts whether or not a particular customer will repay a loan. This could be used as the basis of deciding whether or not to offer that customer a loan. With traditional metrics, we would assess the model using:
The ‘true’ values from the test set
The model predictions from the test set
Our fairness metrics compute group-based fairness statistics. To use these, we also need categorical columns from the test set. For this example, we will include:
The sex of each individual (two unique values)
The race of each individual (three unique values)
The credit score band of each individual (three unique values)
Whether the loan is considered ‘large’ or ‘small’
An individual’s sex and race should not affect a lending decision, but it would be legitimate to consider an individual’s credit score and the relative size of the loan which they desired.
A real scenario will be more complicated, but this will serve to illustrate the use of the new metrics.
Getting the Data
This section may be skipped. It simply creates a dataset for illustrative purposes.
We will use the well-known UCI ‘Adult’ dataset as the basis of this demonstration. This is not for a lending scenario, but we will regard it as one for the purposes of this example. We will use the existing ‘race’ and ‘sex’ columns (trimming the former to three unique values), and manufacture credit score bands and loan sizes from other columns. We start with some uncontroversial import statements:
import functools
import numpy as np
import sklearn.metrics as skm
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import Pipeline
from fairlearn.metrics import MetricFrame
from fairlearn.metrics import selection_rate, count
Next, we import the data:
data = fetch_openml(data_id=1590, as_frame=True)
X_raw = data.data
y = (data.target == '>50K') * 1
For purposes of clarity, we consolidate the ‘race’ column to have three unique values:
Out:
/tmp/tmp8dm8m5o_/5f4919440d858d282f49b305702eb26df3476228/examples/plot_new_metrics.py:91: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
X_raw['race'] = X_raw['race'].map(race_transform).fillna('Other').astype('category')
['Black' 'Other' 'White']
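The definition of race_transform is not shown above. A minimal sketch of what such a helper might look like follows, on the assumption that it keeps 'White' and 'Black' and maps every other value to 'Other'. The sample values and the name toy_race are made up for illustration.

```python
import pandas as pd


def race_transform(input_str):
    """Keep 'White' and 'Black'; map every other value to 'Other' (assumed behaviour)."""
    if input_str in ("White", "Black"):
        return input_str
    return "Other"


# Made-up sample values standing in for the 'race' column
toy_race = pd.Series(
    ["White", "Black", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other"]
)
consolidated = toy_race.map(race_transform).fillna("Other").astype("category")
print(sorted(set(consolidated)))
```

Only three distinct values remain after the mapping, which is consistent with the output shown above.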
Now, we manufacture the columns for the credit score band and requested loan size. These are wholly constructed, and not part of the actual dataset in any way. They are simply for illustrative purposes.
def marriage_transform(m_s_string):
    """Perform some simple manipulations."""
    result = 'Low'
    if m_s_string.startswith("Married"):
        result = 'Medium'
    elif m_s_string.startswith("Widowed"):
        result = 'High'
    return result


def occupation_transform(occ_string):
    """Perform some simple manipulations."""
    result = 'Small'
    if occ_string.startswith("Machine"):
        result = 'Large'
    return result
col_credit = X_raw['marital-status'].map(marriage_transform).fillna('Low')
col_credit.name = "Credit Score"
col_loan_size = X_raw['occupation'].map(occupation_transform).fillna('Small')
col_loan_size.name = "Loan Size"
# Take an explicit copy, so that adding columns does not write to a slice of X_raw
A = X_raw[['race', 'sex']].copy()
A['Credit Score'] = col_credit
A['Loan Size'] = col_loan_size
A
Now that we have imported our dataset and manufactured a few features, we can perform some more conventional processing. To avoid the problem of data leakage, we need to split the data into training and test sets before applying any transforms or scaling:
(X_train, X_test, y_train, y_test, A_train, A_test) = train_test_split(
    X_raw, y, A, test_size=0.3, random_state=54321, stratify=y
)
# Ensure indices are aligned between X, y and A,
# after all the slicing and splitting of DataFrames
# and Series
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)
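To see why the indices are reset, recall that a split keeps the original row labels, so they no longer run from 0 to n-1. The sketch below uses a made-up DataFrame, with hand-picked rows standing in for the shuffled output of a split.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Hand-picked rows stand in for the shuffled output of a train/test split
toy_split = toy.iloc[[3, 0, 4]]
labels_before = list(toy_split.index)  # the original labels survive

toy_split = toy_split.reset_index(drop=True)
labels_after = list(toy_split.index)  # the labels now match the positions

print(labels_before, labels_after)
```

Resetting every piece in the same way keeps X, y and A aligned row by row.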
Next, we build two Pipeline objects to process the columns, one for numeric data, and the other for categorical data. Both impute missing values; the difference is whether the data are scaled (numeric columns) or one-hot encoded (categorical columns). Imputation of missing values should generally be done with care, since it could potentially introduce biases. Of course, removing rows with missing data could also cause trouble, if particular subgroups have poorer data quality.
numeric_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer()),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
With our preprocessor defined, we can now build a new pipeline which includes an Estimator:
unmitigated_predictor = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        (
            "classifier",
            LogisticRegression(solver="liblinear", fit_intercept=True),
        ),
    ]
)
With the pipeline fully defined, we can first train it with the training data (by calling fit() with X_train and y_train), and then generate the predictions y_pred from the test data (by calling predict() with X_test).
Analysing the Model with Metrics
After our data manipulations and model training, we have the following from our test set:
A vector of true values called y_test
A vector of model predictions called y_pred
A DataFrame of categorical features relevant to fairness called A_test
In a traditional model analysis, we would now look at some metrics evaluated on the entire dataset. Suppose in this case, the relevant metrics are fairlearn.metrics.selection_rate() and sklearn.metrics.fbeta_score() (with beta=0.6). We can evaluate these metrics directly:
print("Selection Rate:", selection_rate(y_test, y_pred))
print("fbeta:", skm.fbeta_score(y_test, y_pred, beta=0.6))
Out:
Selection Rate: 0.1947041561454992
fbeta: 0.6827826864569057
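As a reminder of what the first number means, the selection rate is the fraction of samples for which the model predicts the positive class; the true labels play no part. The following is a small numpy sketch with made-up predictions, assuming a positive label of 1. It is an illustration only, not Fairlearn's implementation.

```python
import numpy as np

toy_y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Fraction of predictions equal to the positive label (assumed to be 1)
toy_selection_rate = np.mean(toy_y_pred == 1)
print(toy_selection_rate)
```

Three of the ten made-up predictions are positive, so the selection rate here is 0.3.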
We know that there are sensitive features in our data, and we want to ensure that we’re not harming individuals due to membership in any of these groups. For this purpose, Fairlearn provides the fairlearn.metrics.MetricFrame class. Let us construct an instance of this class, and then look at its capabilities:
fbeta_06 = functools.partial(skm.fbeta_score, beta=0.6)
metric_fns = {'selection_rate': selection_rate, 'fbeta_06': fbeta_06, 'count': count}
grouped_on_sex = MetricFrame(metrics=metric_fns,
                             y_true=y_test,
                             y_pred=y_pred,
                             sensitive_features=A_test['sex'])
The fairlearn.metrics.MetricFrame object requires a minimum of four arguments:
The underlying metric function(s) to be evaluated
The true values
The predicted values
The sensitive feature values
These are all passed as arguments to the constructor. If more than one underlying metric is required (as in this case), then we must provide them in a dictionary.
The underlying metrics must have a signature fn(y_true, y_pred), so we have to use functools.partial() on fbeta_score() to furnish beta=0.6 (we will show how to pass in extra array arguments such as sample weights shortly).
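The binding of a scalar argument can be tried out without any library metric. In the sketch below, toy_fbeta is a hypothetical stand-in written from the standard F-beta definition for binary 0/1 labels; it is not scikit-learn's fbeta_score(), and the data are made up.

```python
import functools

import numpy as np


def toy_fbeta(y_true, y_pred, beta):
    """F-beta for binary 0/1 labels, written out from the standard definition."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator > 0 else 0.0


# Bind the scalar argument so that the result has the signature fn(y_true, y_pred)
toy_fbeta_06 = functools.partial(toy_fbeta, beta=0.6)

toy_y_true = [1, 1, 1, 0, 0, 0]
toy_y_pred = [1, 1, 0, 1, 0, 0]
toy_score = toy_fbeta_06(toy_y_true, toy_y_pred)
print(toy_score)
```

The bound function can now be called with only the two array arguments, which is the form the metrics dictionary expects.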
We will now take a closer look at the fairlearn.metrics.MetricFrame object. First, there is the overall property, which contains the metrics evaluated on the entire dataset. We see that this contains the same values calculated above:
assert grouped_on_sex.overall['selection_rate'] == selection_rate(y_test, y_pred)
assert grouped_on_sex.overall['fbeta_06'] == skm.fbeta_score(y_test, y_pred, beta=0.6)
print(grouped_on_sex.overall)
Out:
selection_rate 0.194704
fbeta_06 0.682783
count 14653
dtype: object
The other property in the fairlearn.metrics.MetricFrame object is by_group. This contains the metrics evaluated on each subgroup defined by the categories in the sensitive_features= argument. Note that fairlearn.metrics.count() can be used to display the number of data points in each subgroup. In this case, we have results for males and females:
grouped_on_sex.by_group
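To make concrete what a by_group table holds, here is a rough pandas equivalent for the selection rate and the count. The data are made up, and the groupby approach is an illustrative assumption rather than a description of how MetricFrame works internally.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"],
        "y_pred": [0, 0, 1, 1, 1, 0, 1, 0],
    }
)

# One row per subgroup: the mean of the 0/1 predictions is the selection rate
toy_by_group = toy.groupby("sex")["y_pred"].agg(selection_rate="mean", count="size")

# The same statistic over all rows corresponds to the 'overall' value
toy_overall = toy["y_pred"].mean()

print(toy_by_group)
print(toy_overall)
```

Each row of toy_by_group plays the role of one subgroup, and toy_overall plays the role of the overall property.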
We can immediately see a substantial disparity in the selection rate between males and females.
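We can also create another fairlearn.metrics.MetricFrame object, grouped_on_race, using race as the sensitive feature. The constructor call is the same as for grouped_on_sex, except that sensitive_features=A_test['race'].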
The overall property is unchanged:
assert (grouped_on_sex.overall == grouped_on_race.overall).all()
The by_group property now contains the metrics evaluated based on the ‘race’ column:
grouped_on_race.by_group
We see that there is also a significant disparity in selection rates when grouping by race.
Sample weights and other arrays
We noted above that the underlying metric functions passed to the fairlearn.metrics.MetricFrame constructor need to be of the form fn(y_true, y_pred); we do not support scalar arguments such as pos_label= or beta= in the constructor. Such arguments should be bound into a new function using functools.partial(), and the result passed in. However, we do support arguments which have one entry for each sample, with an array of sample weights being the most common example. These are divided into subgroups along with y_true and y_pred, and passed along to the underlying metric.
To use these arguments, we pass in a dictionary as the sample_params= argument of the constructor. Let us generate some random weights, and pass these along:
random_weights = np.random.rand(len(y_test))
example_sample_params = {
    'selection_rate': {'sample_weight': random_weights},
    'fbeta_06': {'sample_weight': random_weights},
}
grouped_with_weights = MetricFrame(metrics=metric_fns,
                                   y_true=y_test,
                                   y_pred=y_pred,
                                   sensitive_features=A_test['sex'],
                                   sample_params=example_sample_params)
We can inspect the overall values, and check they are as expected:
assert grouped_with_weights.overall['selection_rate'] == \
    selection_rate(y_test, y_pred, sample_weight=random_weights)
assert grouped_with_weights.overall['fbeta_06'] == \
    skm.fbeta_score(y_test, y_pred, beta=0.6, sample_weight=random_weights)
print(grouped_with_weights.overall)
Out:
selection_rate 0.194733
fbeta_06 0.679909
count 14653
dtype: object
We can also see the effect on the metric being evaluated on the subgroups:
grouped_with_weights.by_group
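The idea that per-sample arrays are divided into subgroups along with the predictions can be sketched with pandas and numpy. The data and weights below are made up, and the weighted mean is used as an assumed form of a weighted selection rate.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male"],
        "y_pred": [1, 0, 1, 1, 0],
        "weight": [1.0, 3.0, 2.0, 1.0, 1.0],
    }
)

# Each subgroup receives its own slice of the predictions and of the weights
toy_weighted = {
    name: np.average(rows["y_pred"], weights=rows["weight"])
    for name, rows in toy.groupby("sex")
}
print(toy_weighted)
```

Because the weights travel with their rows, a heavily weighted negative prediction pulls down the rate only in the subgroup it belongs to.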
Quantifying Disparities
We now know that our model is selecting individuals who are female far less often than individuals who are male. There is a similar effect when examining the results by race, with Black individuals being selected far less often than white individuals (and those classified as ‘other’). However, there are many cases where presenting all these numbers at once will not be useful (for example, a high-level dashboard which is monitoring model performance). Fairlearn provides several means of aggregating metrics across the subgroups, so that disparities can be readily quantified.
The simplest of these aggregations is group_min(), which reports the minimum value seen for a subgroup for each underlying metric (we also provide group_max()). This is useful if there is a mandate that “no subgroup should have an fbeta_score() of less than 0.6.” We can evaluate the minimum values easily:
grouped_on_race.group_min()
Out:
selection_rate 0.068198
fbeta_06 0.592125
count 692
dtype: object
As noted above, the selection rate varies greatly by race and by sex. This can be quantified in terms of a difference between the subgroup with the highest value of the metric, and the subgroup with the lowest value. For this, we provide the method difference(method='between_groups'):
grouped_on_race.difference(method='between_groups')
Out:
selection_rate 0.142518
fbeta_06 0.101591
count 11832
dtype: object
We can also evaluate the difference relative to the corresponding overall value of the metric. In this case we take the absolute value, so that the result is always positive:
grouped_on_race.difference(method='to_overall')
Out:
selection_rate 0.126507
fbeta_06 0.090657
count 13961
dtype: object
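The two kinds of difference can be written out in a few lines of pandas. This is a sketch of our understanding of the definitions, using made-up per-group values and a made-up overall value, not Fairlearn's own code.

```python
import pandas as pd

# Made-up per-group selection rates and a made-up overall value
toy_by_group = pd.Series({"A": 0.10, "B": 0.25, "C": 0.30})
toy_overall = 0.24

# 'between_groups': largest group value minus smallest group value
diff_between_groups = toy_by_group.max() - toy_by_group.min()

# 'to_overall': largest absolute gap between any group and the overall value
diff_to_overall = (toy_by_group - toy_overall).abs().max()

print(diff_between_groups, diff_to_overall)
```

The same arithmetic is consistent with the outputs above: the overall selection rate of 0.194704 minus the group minimum of 0.068198 gives roughly 0.1265, which agrees with the 'to_overall' value up to rounding.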
There are situations where knowing the ratios of the metrics evaluated on the subgroups is more useful. For this we have the ratio() method. We can take the ratios between the minimum and maximum values of each metric:
grouped_on_race.ratio(method='between_groups')
Out:
selection_rate 0.323648
fbeta_06 0.853555
count 0.055254
dtype: object
We can also compute the ratios relative to the overall value for each metric. Analogous to the differences, the ratios are always in the range \([0,1]\):
grouped_on_race.ratio(method='to_overall')
Out:
selection_rate 0.350263
fbeta_06 0.867223
count 0.047226
dtype: float64
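The ratios can be sketched in the same spirit. The per-group values and the overall value are made up, and the step of flipping ratios above 1 reflects our understanding of how the result is kept in the range [0, 1]; treat it as an assumption rather than the library's exact implementation.

```python
import pandas as pd

# Made-up per-group selection rates and a made-up overall value
toy_by_group = pd.Series({"A": 0.10, "B": 0.25, "C": 0.30})
toy_overall = 0.24

# 'between_groups': smallest group value divided by largest group value
ratio_between_groups = toy_by_group.min() / toy_by_group.max()

# 'to_overall': compare each group with the overall value, flip any ratio
# above 1 so that it lies in [0, 1], then report the smallest
raw = toy_by_group / toy_overall
ratio_to_overall = raw.where(raw <= 1, 1 / raw).min()

print(ratio_between_groups, ratio_to_overall)
```

For the selection rate above, dividing the group minimum of 0.068198 by the overall value of 0.194704 gives approximately 0.3503, in line with the 'to_overall' output.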
Intersections of Features
So far we have only considered a single sensitive feature at a time, and we have already found some serious issues in our example data. However, sometimes serious issues can be hiding in intersections of features. For example, the Gender Shades project found that facial recognition algorithms performed worse for Black individuals than for white individuals, and also worse for women than for men (despite a high overall accuracy score). Moreover, performance on Black women was especially poor. We can examine the intersections of sensitive features by passing multiple columns to the fairlearn.metrics.MetricFrame constructor. Here, grouped_on_race_and_sex is built with sensitive_features=A_test[['race', 'sex']].
The overall values are unchanged, but the by_group table now shows the intersections between subgroups:
assert (grouped_on_race_and_sex.overall == grouped_on_race.overall).all()
grouped_on_race_and_sex.by_group
The aggregations are still performed across all subgroups for each metric, so each continues to reduce to a single value. If we look at the group_min(), we see that we violate the mandate we specified for the fbeta_score() suggested above (for females with a race of ‘Other’ in fact):
grouped_on_race_and_sex.group_min()
Out:
selection_rate 0.032258
fbeta_06 0.503704
count 254
dtype: object
Looking at the ratio() method, we see that the disparity is worse (specifically between white males and black females, if we check in the by_group table):
grouped_on_race_and_sex.ratio(method='between_groups')
Out:
selection_rate 0.11893
fbeta_06 0.690978
count 0.029354
dtype: object
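A deliberately extreme toy example shows how an intersection can hide a disparity that neither feature reveals on its own. The data below are made up, and pandas groupby stands in for the subgroup evaluation.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "race": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
        "sex": ["F", "F", "M", "M", "F", "F", "M", "M"],
        "y_pred": [1, 1, 0, 0, 0, 0, 1, 1],
    }
)

# Selection rate by each feature alone, and by their intersection
by_race = toy.groupby("race")["y_pred"].mean()
by_sex = toy.groupby("sex")["y_pred"].mean()
by_race_and_sex = toy.groupby(["race", "sex"])["y_pred"].mean()

print(by_race_and_sex)
```

Grouped by race alone or by sex alone, every subgroup has a selection rate of 0.5, yet two of the four intersections are never selected at all.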
Control Features
There is a further way we can slice up our data. We have (completely made up) features for the individuals’ credit scores (in three bands) and also the size of the loan requested (large or small). In our loan scenario, it is acceptable that individuals with high credit scores are selected more often than individuals with low credit scores. However, within each credit score band, we do not want a disparity between (say) black females and white males. To examine these cases, we have the concept of control features.
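Control features are introduced by the control_features= argument to the fairlearn.metrics.MetricFrame object. Here, cond_credit_score is constructed with sensitive_features=A_test[['race', 'sex']] and control_features=A_test['Credit Score']: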
Out:
/home/circleci/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1592: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
This has an immediate effect on the overall property. Instead of having one value for each metric, we now have a value for each unique value of the control feature:
cond_credit_score.overall
The by_group property is similarly expanded:
cond_credit_score.by_group
The aggregates are also evaluated once for each group identified by the control feature:
cond_credit_score.group_min()
And:
cond_credit_score.ratio(method='between_groups')
In our data, we see that we have a dearth of positive results for non-whites in the high credit score band, which significantly affects the aggregates.
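The effect of a control feature on the aggregates can be mimicked with a two-level groupby: the metric is computed for every combination, and the aggregation over the sensitive feature is then repeated inside each control value. The data are made up, and this pandas sketch is an illustration rather than Fairlearn's implementation.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "credit": ["High", "High", "High", "High", "Low", "Low", "Low", "Low"],
        "sex": ["F", "F", "M", "M", "F", "F", "M", "M"],
        "y_pred": [1, 0, 1, 1, 0, 0, 1, 0],
    }
)

# Control feature first, sensitive feature second
toy_by_group = toy.groupby(["credit", "sex"])["y_pred"].mean()

# Aggregate over the sensitive feature separately within each control value
toy_group_min = toy_by_group.groupby(level="credit").min()
toy_group_max = toy_by_group.groupby(level="credit").max()
toy_ratio = toy_group_min / toy_group_max

print(toy_group_min)
print(toy_ratio)
```

Each aggregate now has one entry per credit band, so a disparity is only ever measured between individuals who share the same value of the control feature.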
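We can continue adding more control features. Here, cond_both uses both 'Loan Size' and 'Credit Score' as control features, by passing control_features=A_test[['Loan Size', 'Credit Score']]: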
Out:
/home/circleci/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1592: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
Found 36 subgroups. Evaluation may be slow
The overall property now splits into more values:
cond_both.overall
As does the by_group property, where NaN values indicate that there were no samples in the cell:
cond_both.by_group
The aggregates behave similarly. By this point, we are having significant issues with under-populated intersections. Consider the number of individuals in each cell:
Out:
Found 36 subgroups. Evaluation may be slow
Loan Size  Credit Score  race   sex
Large      High          Black  Female       5
                                Male         1
                         Other  Female       3
                                Male       NaN
                         White  Female      13
                                Male         1
           Low           Black  Female      52
                                Male        33
                         Other  Female       3
                                Male        14
                         White  Female     133
                                Male       225
           Medium        Black  Female       7
                                Male        38
                         Other  Female       9
                                Male        19
                         White  Female      28
                                Male       333
Small      High          Black  Female      49
                                Male        14
                         Other  Female      18
                                Male         4
                         White  Female     293
                                Male        69
           Low           Black  Female     517
                                Male       357
                         Other  Female     163
                                Male       147
                         White  Female    2784
                                Male      2857
           Medium        Black  Female      83
                                Male       281
                         Other  Female      58
                                Male       254
                         White  Female     620
                                Male      5168
Name: member_counts, dtype: object
Recall that NaN indicates that there were no individuals in a cell; member_counts() will not even have been called.
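The definition of member_counts() is not shown here; it presumably just returns the number of samples it is given. The pandas sketch below (with made-up data and hypothetical names) reproduces the idea of a per-cell count, and lists every combination explicitly so that an empty cell shows up as NaN.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Loan Size": ["Large", "Large", "Small", "Small", "Small"],
        "sex": ["Female", "Female", "Female", "Male", "Male"],
    }
)

# Count the rows in each populated cell
toy_counts = toy.groupby(["Loan Size", "sex"]).size()

# List every combination, so that empty cells appear explicitly as NaN
all_cells = pd.MultiIndex.from_product(
    [["Large", "Small"], ["Female", "Male"]], names=["Loan Size", "sex"]
)
toy_counts = toy_counts.reindex(all_cells)
print(toy_counts)
```

A check such as toy_counts.isna().sum() gives a quick tally of the empty cells before any aggregate is interpreted.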
Exporting from MetricFrame
Sometimes, we need to extract our data for use in other tools. For this, we can use the pandas.DataFrame.to_csv() method, since the by_group property will be a pandas.DataFrame (or in a few cases, it will be a pandas.Series, but that has a similar to_csv() method):
csv_output = cond_credit_score.by_group.to_csv()
print(csv_output)
Out:
Credit Score,race,sex,selection_rate,fbeta_06,count
High,Black,Female,0.0,0.0,54
High,Black,Male,0.06666666666666667,1.0,15
High,Other,Female,0.0,0.0,21
High,Other,Male,0.0,0.0,4
High,White,Female,0.0196078431372549,0.5295950155763239,306
High,White,Male,0.14285714285714285,0.7593052109181142,70
Low,Black,Female,0.007029876977152899,0.6267281105990783,569
Low,Black,Male,0.020512820512820513,0.56353591160221,390
Low,Other,Female,0.012048192771084338,0.5190839694656488,166
Low,Other,Male,0.037267080745341616,0.6938775510204082,161
Low,White,Female,0.015083990401097017,0.5257731958762887,2917
Low,White,Male,0.033419857235561325,0.5502497502497502,3082
Medium,Black,Female,0.2111111111111111,0.6396526772793053,90
Medium,Black,Male,0.20689655172413793,0.5775764439411097,319
Medium,Other,Female,0.23880597014925373,0.5,67
Medium,Other,Male,0.336996336996337,0.7320574162679426,273
Medium,White,Female,0.3734567901234568,0.6808811402992107,648
Medium,White,Male,0.40610798036720597,0.700837357443748,5501
The pandas.DataFrame.to_csv() method has a large number of arguments to control the exported CSV. For example, it can write directly to a CSV file, rather than returning a string (as shown above).
The overall property can be handled similarly, in the cases that it is not a scalar.
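When the exported text is read back into pandas, the index levels have to be named again as index columns. The following sketch uses a made-up table with a two-level index, standing in for a by_group result.

```python
import io

import pandas as pd

toy_index = pd.MultiIndex.from_tuples(
    [("High", "Female"), ("High", "Male"), ("Low", "Female"), ("Low", "Male")],
    names=["Credit Score", "sex"],
)
toy_by_group = pd.DataFrame(
    {"selection_rate": [0.25, 0.5, 0.125, 0.75], "count": [4, 8, 16, 32]},
    index=toy_index,
)

# With no path argument, to_csv() returns the CSV text as a string
csv_text = toy_by_group.to_csv()

# Reading it back: the first two columns rebuild the two-level index
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored)
```

Passing a path instead, for example toy_by_group.to_csv("metrics.csv") with a file name of our choosing, writes the same text to disk.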