Metrics with Multiple Features — Fairlearn 0.7.0 documentation

Getting the Data

This section may be skipped. It simply creates a dataset for illustrative purposes.

We will use the well-known UCI ‘Adult’ dataset as the basis of this demonstration. This is not for a lending scenario, but we will regard it as one for the purposes of this example. We will use the existing ‘race’ and ‘sex’ columns (trimming the former to three unique values), and manufacture credit score bands and loan sizes from other columns. We start with some uncontroversial import statements:

import functools
import numpy as np

import sklearn.metrics as skm
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import Pipeline

from fairlearn.metrics import MetricFrame
from fairlearn.metrics import selection_rate, count

Next, we import the data:

data = fetch_openml(data_id=1590, as_frame=True)
X_raw = data.data.copy()  # work on a copy so that adding or modifying columns does not trigger SettingWithCopyWarning
y = (data.target == '>50K') * 1

For purposes of clarity, we consolidate the ‘race’ column to have three unique values:

def race_transform(input_str):
    """Reduce values to White, Black and Other."""
    result = 'Other'
    if input_str == 'White' or input_str == 'Black':
        result = input_str
    return result


X_raw['race'] = X_raw['race'].map(race_transform).fillna('Other').astype('category')
print(np.unique(X_raw['race']))

Out:

['Black' 'Other' 'White']

Now, we manufacture the columns for the credit score band and requested loan size. These are wholly constructed, and not part of the actual dataset in any way. They are simply for illustrative purposes.

def marriage_transform(m_s_string):
    """Perform some simple manipulations."""
    result = 'Low'
    if m_s_string.startswith("Married"):
        result = 'Medium'
    elif m_s_string.startswith("Widowed"):
        result = 'High'
    return result


def occupation_transform(occ_string):
    """Perform some simple manipulations."""
    result = 'Small'
    if occ_string.startswith("Machine"):
        result = 'Large'
    return result


col_credit = X_raw['marital-status'].map(marriage_transform).fillna('Low')
col_credit.name = "Credit Score"
col_loan_size = X_raw['occupation'].map(occupation_transform).fillna('Small')
col_loan_size.name = "Loan Size"

A = X_raw[['race', 'sex']].copy()  # explicit copy, so adding columns below does not trigger SettingWithCopyWarning
A['Credit Score'] = col_credit
A['Loan Size'] = col_loan_size
A

	race	sex	Credit Score	Loan Size
0	Black	Male	Low	Large
1	White	Male	Medium	Small
2	White	Male	Medium	Small
3	Black	Male	Medium	Large
4	White	Female	Low	Small
...	...	...	...	...
48837	White	Female	Medium	Small
48838	White	Male	Medium	Large
48839	White	Female	High	Small
48840	White	Male	Low	Small
48841	White	Female	Medium	Small

48842 rows × 4 columns

Now that we have imported our dataset and manufactured a few features, we can perform some more conventional processing. To avoid the problem of data leakage, we need to split the data into training and test sets before applying any transforms or scaling:

(X_train, X_test, y_train, y_test, A_train, A_test) = train_test_split(
    X_raw, y, A, test_size=0.3, random_state=54321, stratify=y
)

# Ensure indices are aligned between X, y and A,
# after all the slicing and splitting of DataFrames
# and Series

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)
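To make the leakage point concrete, here is a minimal pure-Python sketch of standardization where the mean and standard deviation are estimated from the training rows only and then reused, unchanged, on the test rows. This mirrors the intent of fitting the StandardScaler in the pipeline below on X_train alone. It is an illustration with made-up numbers, not scikit-learn's implementation, and the helper names fit_standardizer and apply_standardizer are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Estimate centring and scaling statistics from the training data only."""
    mean = statistics.fmean(train_values)
    # Population standard deviation (ddof=0), as StandardScaler uses by default
    std = statistics.pstdev(train_values)
    return mean, std


def apply_standardizer(values, mean, std):
    """Apply previously fitted statistics; the values being transformed do not influence them."""
    if std == 0:
        std = 1.0  # avoid division by zero for a constant column
    return [(v - mean) / std for v in values]


train = [20.0, 30.0, 40.0, 50.0]  # made-up training column
test = [35.0, 60.0]               # made-up test column

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)
print(mean, round(std, 4))
print([round(v, 4) for v in test_scaled])
```

Because the test rows never contribute to mean or std, a test value such as 60.0 can lie outside the training range without changing how the training data were scaled.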

Next, we build two Pipeline objects to process the columns, one for numeric data, and the other for categorical data. Both impute missing values; the difference is whether the data are scaled (numeric columns) or one-hot encoded (categorical columns). Imputation of missing values should generally be done with care, since it could potentially introduce biases. Of course, removing rows with missing data could also cause trouble, if particular subgroups have poorer data quality.

numeric_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer()),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)

With our preprocessor defined, we can now build a new pipeline which includes an Estimator:

unmitigated_predictor = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        (
            "classifier",
            LogisticRegression(solver="liblinear", fit_intercept=True),
        ),
    ]
)

With the pipeline fully defined, we can first train it with the training data, and then generate predictions from the test data.

unmitigated_predictor.fit(X_train, y_train)
y_pred = unmitigated_predictor.predict(X_test)

Analysing the Model with Metrics

After our data manipulations and model training, we have the following from our test set:

A vector of true values called y_test
A vector of model predictions called y_pred
A DataFrame of categorical features relevant to fairness called A_test

In a traditional model analysis, we would now look at some metrics evaluated on the entire dataset. Suppose that, in this case, the relevant metrics are fairlearn.metrics.selection_rate() and sklearn.metrics.fbeta_score() (with beta=0.6). We can evaluate these metrics directly:

print("Selection Rate:", selection_rate(y_test, y_pred))
print("fbeta:", skm.fbeta_score(y_test, y_pred, beta=0.6))

Out:

Selection Rate: 0.1947041561454992
fbeta: 0.6827826864569057
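As a cross-check on what these two numbers mean, the sketch below computes them from first principles for binary 0/1 labels. The selection rate is the (optionally weighted) fraction of predictions equal to the positive label, and the F-beta score is (1 + β²)·P·R / (β²·P + R) for precision P and recall R. This is a plain-Python illustration on made-up data, assuming pos_label=1 and returning 0.0 when the score is undefined (as scikit-learn does by default, with a warning). It is not the Fairlearn or scikit-learn code, and manual_selection_rate and manual_fbeta are hypothetical names.

```python
def manual_selection_rate(y_true, y_pred, pos_label=1, sample_weight=None):
    """(Weighted) fraction of predictions equal to pos_label; y_true is not used."""
    w = sample_weight if sample_weight is not None else [1.0] * len(y_pred)
    selected = sum(wi for yp, wi in zip(y_pred, w) if yp == pos_label)
    return selected / sum(w)


def manual_fbeta(y_true, y_pred, beta, pos_label=1, sample_weight=None):
    """Binary F-beta score; beta < 1 weights precision more heavily than recall."""
    w = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    tp = fp = fn = 0.0
    for yt, yp, wi in zip(y_true, y_pred, w):
        if yp == pos_label and yt == pos_label:
            tp += wi  # true positive
        elif yp == pos_label:
            fp += wi  # false positive
        elif yt == pos_label:
            fn += wi  # false negative
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Equivalent to (1 + b2) * P * R / (b2 * P + R); reported as 0.0 when undefined
    return (1 + b2) * tp / denom if denom > 0 else 0.0


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
print(manual_selection_rate(y_true, y_pred))   # 3 of 8 predictions are positive
print(manual_fbeta(y_true, y_pred, beta=0.6))  # precision 2/3, recall 1/2
```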

We know that there are sensitive features in our data, and we want to ensure that we’re not harming individuals due to membership in any of these groups. For this purpose, Fairlearn provides the fairlearn.metrics.MetricFrame class. Let us construct an instance of this class, and then look at its capabilities:

fbeta_06 = functools.partial(skm.fbeta_score, beta=0.6)

metric_fns = {'selection_rate': selection_rate, 'fbeta_06': fbeta_06, 'count': count}

grouped_on_sex = MetricFrame(metrics=metric_fns,
                             y_true=y_test,
                             y_pred=y_pred,
                             sensitive_features=A_test['sex'])

The fairlearn.metrics.MetricFrame object requires a minimum of four arguments:

The underlying metric function(s) to be evaluated
The true values
The predicted values
The sensitive feature values

These are all passed as arguments to the constructor. If more than one underlying metric is required (as in this case), then we must provide them in a dictionary.

The underlying metrics must have a signature fn(y_true, y_pred), so we have to use functools.partial() on fbeta_score() to furnish beta=0.6 (we will show how to pass in extra array arguments such as sample weights shortly).
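The functools.partial() trick works because it returns a new callable with the scalar argument already fixed, so what remains takes just the two positional arguments that MetricFrame supplies. A minimal sketch, using a hypothetical toy metric (scaled_accuracy is illustrative only, not a Fairlearn or scikit-learn function):

```python
import functools
import inspect


def scaled_accuracy(y_true, y_pred, scale):
    """Hypothetical metric with an extra scalar argument, playing the role of beta."""
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return scale * correct / len(y_true)


# Bind the scalar argument; the remaining positional signature is fn(y_true, y_pred)
half_accuracy = functools.partial(scaled_accuracy, scale=0.5)

print(inspect.signature(half_accuracy))
print(half_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct, scaled by 0.5
```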

We will now take a closer look at the fairlearn.metrics.MetricFrame object. First, there is the overall property, which contains the metrics evaluated on the entire dataset. We see that this contains the same values calculated above:

assert grouped_on_sex.overall['selection_rate'] == selection_rate(y_test, y_pred)
assert grouped_on_sex.overall['fbeta_06'] == skm.fbeta_score(y_test, y_pred, beta=0.6)
print(grouped_on_sex.overall)

Out:

selection_rate    0.194704
fbeta_06          0.682783
count                14653
dtype: object

The other property in the fairlearn.metrics.MetricFrame object is by_group. This contains the metrics evaluated on each subgroup defined by the categories in the sensitive_features= argument. Note that fairlearn.metrics.count() can be used to display the number of data points in each subgroup. In this case, we have results for males and females:

grouped_on_sex.by_group

	selection_rate	fbeta_06	count
sex
Female	0.06883	0.634014	4838
Male	0.25675	0.689789	9815

We can immediately see a substantial disparity in the selection rate between males and females (0.257 for males versus 0.069 for females).
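Conceptually, overall evaluates each metric on all rows, while by_group splits the rows by the value of the sensitive feature and evaluates each metric on each slice. The pure-Python sketch below mimics that split on a tiny made-up dataset. The helper names are hypothetical, and this is not how MetricFrame is implemented internally.

```python
from collections import defaultdict


def manual_selection_rate(y_true, y_pred):
    """Fraction of predictions equal to 1 (y_true is unused)."""
    return sum(1 for yp in y_pred if yp == 1) / len(y_pred)


def evaluate_by_group(metric, y_true, y_pred, sensitive):
    """Split rows by sensitive-feature value and evaluate the metric on each slice."""
    rows = defaultdict(lambda: ([], []))
    for yt, yp, s in zip(y_true, y_pred, sensitive):
        rows[s][0].append(yt)
        rows[s][1].append(yp)
    return {group: metric(t, p) for group, (t, p) in sorted(rows.items())}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
sex = ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male']

overall = manual_selection_rate(y_true, y_pred)
by_group = evaluate_by_group(manual_selection_rate, y_true, y_pred, sex)
print(overall, by_group)
```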

We can also create another fairlearn.metrics.MetricFrame object using race as the sensitive feature:

grouped_on_race = MetricFrame(metrics=metric_fns,
                              y_true=y_test,
                              y_pred=y_pred,
                              sensitive_features=A_test['race'])

The overall property is unchanged:

assert (grouped_on_sex.overall == grouped_on_race.overall).all()

The by_group property now contains the metrics evaluated based on the ‘race’ column:

grouped_on_race.by_group

	selection_rate	fbeta_06	count
race
Black	0.068198	0.592125	1437
Other	0.16763	0.693717	692
White	0.210715	0.686081	12524

We see that there is also a substantial disparity in selection rates when grouping by race, ranging from 0.068 for Black individuals to 0.211 for White individuals.

Sample weights and other arrays

We noted above that the underlying metric functions passed to the fairlearn.metrics.MetricFrame constructor need to be of the form fn(y_true, y_pred); we do not support scalar arguments such as pos_label= or beta= in the constructor. Such arguments should be bound into a new function using functools.partial(), and the result passed in. However, we do support arguments which have one entry for each sample, with an array of sample weights being the most common example. These are divided into subgroups along with y_true and y_pred, and passed along to the underlying metric.

To use these arguments, we pass in a dictionary as the sample_params= argument of the constructor. Let us generate some random weights, and pass these along:

random_weights = np.random.rand(len(y_test))

example_sample_params = {
    'selection_rate': {'sample_weight': random_weights},
    'fbeta_06': {'sample_weight': random_weights},
}


grouped_with_weights = MetricFrame(metrics=metric_fns,
                                   y_true=y_test,
                                   y_pred=y_pred,
                                   sensitive_features=A_test['sex'],
                                   sample_params=example_sample_params)

We can inspect the overall values, and check they are as expected:

assert grouped_with_weights.overall['selection_rate'] == \
    selection_rate(y_test, y_pred, sample_weight=random_weights)
assert grouped_with_weights.overall['fbeta_06'] == \
    skm.fbeta_score(y_test, y_pred, beta=0.6, sample_weight=random_weights)
print(grouped_with_weights.overall)

Out:

selection_rate    0.194733
fbeta_06          0.679909
count                14653
dtype: object

We can also see the effect on the metric being evaluated on the subgroups:

grouped_with_weights.by_group

	selection_rate	fbeta_06	count
sex
Female	0.068897	0.648059	4838
Male	0.257318	0.684546	9815
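The routing of sample_params can be pictured as follows. Each per-sample array is sliced with exactly the same row selection as y_true and y_pred, and the slice is passed to the metric as a keyword argument. Metrics without an entry in sample_params (such as count here) receive no extra arguments. This is a hedged, pure-Python sketch on made-up data; evaluate_with_params and the other helpers are hypothetical names, not Fairlearn API.

```python
def manual_selection_rate(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions equal to 1 (y_true is unused)."""
    w = sample_weight if sample_weight is not None else [1.0] * len(y_pred)
    return sum(wi for yp, wi in zip(y_pred, w) if yp == 1) / sum(w)


def manual_count(y_true, y_pred):
    """Number of rows in the slice."""
    return len(y_true)


def evaluate_with_params(metrics, y_true, y_pred, sensitive, sample_params):
    """Evaluate each metric per group, slicing per-sample arrays with the same rows."""
    results = {}
    for group in sorted(set(sensitive)):
        idx = [i for i, s in enumerate(sensitive) if s == group]
        t = [y_true[i] for i in idx]
        p = [y_pred[i] for i in idx]
        for name, fn in metrics.items():
            # Slice every per-sample array registered for this metric
            kwargs = {k: [v[i] for i in idx] for k, v in sample_params.get(name, {}).items()}
            results[(group, name)] = fn(t, p, **kwargs)
    return results


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]
sex = ['Male', 'Female', 'Female', 'Male', 'Male', 'Female']
weights = [0.5, 1.0, 2.0, 1.5, 2.0, 1.0]

res = evaluate_with_params(
    {'selection_rate': manual_selection_rate, 'count': manual_count},
    y_true, y_pred, sex,
    {'selection_rate': {'sample_weight': weights}},
)
print(res)
```

For males the unweighted selection rate would be 2/3, but with these weights it becomes 2.0 / 4.0 = 0.5, which shows how the weights travel with their rows.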

Quantifying Disparities

We now know that our model is selecting individuals who are female far less often than individuals who are male. There is a similar effect when examining the results by race, with Black individuals being selected far less often than White individuals (and those classified as ‘Other’). However, there are many cases where presenting all these numbers at once will not be useful (for example, a high-level dashboard which is monitoring model performance). Fairlearn provides several means of aggregating metrics across the subgroups, so that disparities can be readily quantified.

The simplest of these aggregations is group_min(), which reports the minimum value seen for a subgroup for each underlying metric (we also provide group_max()). This is useful if there is a mandate that “no subgroup should have an fbeta_score() of less than 0.6.” We can evaluate the minimum values easily:

grouped_on_race.group_min()

Out:

selection_rate    0.068198
fbeta_06          0.592125
count                  692
dtype: object
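group_min() and group_max() reduce each column of by_group to its smallest or largest entry, and each metric is reduced independently. As a result, the minima can come from different subgroups (above, the count minimum is ‘Other’ while the other two minima are ‘Black’). The sketch below reproduces the output above from the by_group values for race, typed in by hand and therefore rounded. It is an illustration, not Fairlearn code.

```python
# Rounded by_group values for race, copied from the grouped_on_race.by_group table above
by_group = {
    'Black': {'selection_rate': 0.068198, 'fbeta_06': 0.592125, 'count': 1437},
    'Other': {'selection_rate': 0.16763, 'fbeta_06': 0.693717, 'count': 692},
    'White': {'selection_rate': 0.210715, 'fbeta_06': 0.686081, 'count': 12524},
}
metrics = ['selection_rate', 'fbeta_06', 'count']

# Each metric is reduced on its own, so minima may come from different subgroups
group_min = {m: min(row[m] for row in by_group.values()) for m in metrics}
group_max = {m: max(row[m] for row in by_group.values()) for m in metrics}
print(group_min)
print(group_max)

# Checking the example mandate "no subgroup has an fbeta score below 0.6"
print('fbeta mandate met:', group_min['fbeta_06'] >= 0.6)
```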

As noted above, the selection rate varies greatly by race and by sex. This can be quantified as the difference between the subgroup with the highest value of the metric and the subgroup with the lowest value. For this, we provide the method difference(method='between_groups'):

grouped_on_race.difference(method='between_groups')

Out:

selection_rate    0.142518
fbeta_06          0.101591
count                11832
dtype: object

We can also evaluate the difference relative to the corresponding overall value of the metric. Here the result is the largest absolute difference between any subgroup and the overall value, so it is never negative:

grouped_on_race.difference(method='to_overall')

Out:

selection_rate    0.126507
fbeta_06          0.090657
count                13961
dtype: object
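Both kinds of difference can be reproduced by hand from the rounded by_group and overall values for race printed above. The results agree with the printed outputs up to rounding in the last digit. This is a sketch, assuming (as the text states) that to_overall takes the largest absolute gap. The function names are hypothetical, not the Fairlearn implementation.

```python
# Rounded by_group and overall values for race, copied from the outputs above
by_group = {
    'selection_rate': {'Black': 0.068198, 'Other': 0.16763, 'White': 0.210715},
    'fbeta_06': {'Black': 0.592125, 'Other': 0.693717, 'White': 0.686081},
}
overall = {'selection_rate': 0.194704, 'fbeta_06': 0.682783}


def difference_between_groups(values):
    """Largest subgroup value minus smallest subgroup value."""
    return max(values) - min(values)


def difference_to_overall(values, overall_value):
    """Largest absolute gap between any subgroup and the overall value."""
    return max(abs(v - overall_value) for v in values)


for metric, groups in by_group.items():
    vals = list(groups.values())
    print(metric,
          round(difference_between_groups(vals), 6),
          round(difference_to_overall(vals, overall[metric]), 6))
```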

There are situations where knowing the ratios of the metrics evaluated on the subgroups is more useful. For this we have the ratio() method. We can take the ratios between the minimum and maximum values of each metric:

grouped_on_race.ratio(method='between_groups')

Out:

selection_rate    0.323648
fbeta_06          0.853555
count             0.055254
dtype: object

We can also compute the ratios relative to the overall value for each metric. Analogous to taking the absolute value for the differences, each ratio is formed as the smaller value divided by the larger, so the results always lie in the range [0, 1]:

grouped_on_race.ratio(method='to_overall')

Out:

selection_rate    0.350263
fbeta_06          0.867223
count             0.047226
dtype: float64
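The ratios can be reproduced the same way from the rounded race values above. The sketch assumes that each subgroup's ratio to the overall value is taken as min(v / overall, overall / v), with the worst subgroup reported. That assumption is consistent with the printed output (for example, the count entry is 692 / 14653) but is not taken from Fairlearn's source. The sketch also assumes all values are strictly positive, and the function names are hypothetical.

```python
# Rounded by_group and overall values for race, copied from the outputs above
by_group = {
    'selection_rate': {'Black': 0.068198, 'Other': 0.16763, 'White': 0.210715},
    'fbeta_06': {'Black': 0.592125, 'Other': 0.693717, 'White': 0.686081},
}
overall = {'selection_rate': 0.194704, 'fbeta_06': 0.682783}


def ratio_between_groups(values):
    """Smallest subgroup value over the largest (assumes the largest is positive)."""
    return min(values) / max(values)


def ratio_to_overall(values, overall_value):
    """Worst-case min(v / overall, overall / v) across subgroups (assumes positive values)."""
    return min(min(v / overall_value, overall_value / v) for v in values)


for metric, groups in by_group.items():
    vals = list(groups.values())
    print(metric,
          round(ratio_between_groups(vals), 6),
          round(ratio_to_overall(vals, overall[metric]), 6))
```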

Intersections of Features

So far we have only considered a single sensitive feature at a time, and we have already found some serious issues in our example data. However, serious issues can also hide in intersections of features. For example, the Gender Shades project found that commercial facial-analysis systems classifying gender performed worse for darker-skinned faces than for lighter-skinned faces, and worse for women than for men (despite high overall accuracy). Moreover, performance on darker-skinned women was the worst of all. We can examine the intersections of sensitive features by passing multiple columns to the fairlearn.metrics.MetricFrame constructor:

grouped_on_race_and_sex = MetricFrame(metrics=metric_fns,
                                      y_true=y_test,
                                      y_pred=y_pred,
                                      sensitive_features=A_test[['race', 'sex']])

The overall values are unchanged, but the by_group table now shows the intersections between subgroups:

assert (grouped_on_race_and_sex.overall == grouped_on_race.overall).all()
grouped_on_race_and_sex.by_group

		selection_rate	fbeta_06	count
race	sex
Black	Female	0.032258	0.630316	713
Black	Male	0.103591	0.580624	724
Other	Female	0.070866	0.503704	254
Other	Male	0.223744	0.728972	438
White	Female	0.075433	0.642076	3871
White	Male	0.271235	0.692069	8653

The aggregations are still performed across all subgroups for each metric, so each continues to reduce to a single value. Looking at group_min(), we see that the mandate for fbeta_score() suggested above is violated; the lowest value, 0.503704, belongs to females with a race of ‘Other’:

grouped_on_race_and_sex.group_min()

Out:

selection_rate    0.032258
fbeta_06          0.503704
count                  254
dtype: object

Looking at the ratio() method, we see that the disparity is worse than for either feature alone. Checking the by_group table, the selection-rate ratio of 0.11893 comes from Black females (0.032258) compared with White males (0.271235):

grouped_on_race_and_sex.ratio(method='between_groups')

Out:

selection_rate     0.11893
fbeta_06          0.690978
count             0.029354
dtype: object

Control Features

There is a further way we can slice up our data. We have (completely made up) features for the individuals’ credit scores (in three bands) and also the size of the loan requested (large or small). In our loan scenario, it is acceptable that individuals with high credit scores are selected more often than individuals with low credit scores. However, within each credit score band, we do not want a disparity between (say) Black females and White males. To examine these cases, we have the concept of control features.

Control features are introduced by the control_features= argument to the fairlearn.metrics.MetricFrame object:

cond_credit_score = MetricFrame(metrics=metric_fns,
                                y_true=y_test,
                                y_pred=y_pred,
                                sensitive_features=A_test[['race', 'sex']],
                                control_features=A_test['Credit Score'])

Out:

/home/circleci/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1592: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))

This has an immediate effect on the overall property. Instead of having one value for each metric, we now have a value for each unique value of the control feature:

cond_credit_score.overall

	selection_rate	fbeta_06	count
Credit Score
High	0.03617	0.664928	470
Low	0.022924	0.549994	7285
Medium	0.386924	0.695034	6898

The by_group property is similarly expanded:

cond_credit_score.by_group

			selection_rate	fbeta_06	count
Credit Score	race	sex
High	Black	Female	0.0	0.0	54
	Black	Male	0.066667	1.0	15
	Other	Female	0.0	0.0	21
	Other	Male	0.0	0.0	4
	White	Female	0.019608	0.529595	306
	White	Male	0.142857	0.759305	70
Low	Black	Female	0.00703	0.626728	569
	Black	Male	0.020513	0.563536	390
	Other	Female	0.012048	0.519084	166
	Other	Male	0.037267	0.693878	161
	White	Female	0.015084	0.525773	2917
	White	Male	0.03342	0.55025	3082
Medium	Black	Female	0.211111	0.639653	90
	Black	Male	0.206897	0.577576	319
	Other	Female	0.238806	0.5	67
	Other	Male	0.336996	0.732057	273
	White	Female	0.373457	0.680881	648
	White	Male	0.406108	0.700837	5501

The aggregates are also evaluated once for each group identified by the control feature:

cond_credit_score.group_min()

	selection_rate	fbeta_06	count
Credit Score
High	0.000000	0.000000	4
Low	0.007030	0.519084	161
Medium	0.206897	0.500000	67

And:

cond_credit_score.ratio(method='between_groups')

	selection_rate	fbeta_06	count
Credit Score
High	0.000000	0.000000	0.013072
Low	0.188635	0.748092	0.052239
Medium	0.509462	0.683007	0.012180

In our data, we see that there are very few positive predictions for non-White individuals in the ‘High’ credit score band (three of the four non-White subgroups have a selection rate of zero), which strongly affects the aggregates.
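With a control feature, every aggregate is computed separately within each level of that feature, and different credit score bands are never compared with each other. The sketch below reproduces the selection_rate columns of the group_min() and ratio(method='between_groups') outputs above from the rounded cond_credit_score.by_group values. It is illustrative only, not Fairlearn code, and assumes each band has a positive maximum selection rate.

```python
from collections import defaultdict

# (credit score band, race, sex, selection_rate), copied (rounded) from
# the cond_credit_score.by_group table above
rows = [
    ('High', 'Black', 'Female', 0.0), ('High', 'Black', 'Male', 0.066667),
    ('High', 'Other', 'Female', 0.0), ('High', 'Other', 'Male', 0.0),
    ('High', 'White', 'Female', 0.019608), ('High', 'White', 'Male', 0.142857),
    ('Low', 'Black', 'Female', 0.00703), ('Low', 'Black', 'Male', 0.020513),
    ('Low', 'Other', 'Female', 0.012048), ('Low', 'Other', 'Male', 0.037267),
    ('Low', 'White', 'Female', 0.015084), ('Low', 'White', 'Male', 0.03342),
    ('Medium', 'Black', 'Female', 0.211111), ('Medium', 'Black', 'Male', 0.206897),
    ('Medium', 'Other', 'Female', 0.238806), ('Medium', 'Other', 'Male', 0.336996),
    ('Medium', 'White', 'Female', 0.373457), ('Medium', 'White', 'Male', 0.406108),
]

# Collect the subgroup values separately for each level of the control feature
per_band = defaultdict(list)
for band, _race, _sex, rate in rows:
    per_band[band].append(rate)

# Aggregate within each band only
group_min = {band: min(v) for band, v in per_band.items()}
ratio_between = {band: min(v) / max(v) for band, v in per_band.items()}

for band in sorted(per_band):
    print(band, group_min[band], round(ratio_between[band], 6))
```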

We can continue adding more control features:

cond_both = MetricFrame(metrics=metric_fns,
                        y_true=y_test,
                        y_pred=y_pred,
                        sensitive_features=A_test[['race', 'sex']],
                        control_features=A_test[['Loan Size', 'Credit Score']])

Out:

/home/circleci/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1592: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
Found 36 subgroups. Evaluation may be slow

The overall property now splits into more values:

cond_both.overall

		selection_rate	fbeta_06	count
Loan Size	Credit Score
Large	High	0.0	0.0	23
	Low	0.004348	0.60177	460
	Medium	0.071429	0.388325	434
Small	High	0.038031	0.664928	447
	Low	0.024176	0.549299	6825
	Medium	0.408106	0.700288	6464

As does the by_group property, where NaN values indicate that there were no samples in the cell:

cond_both.by_group

				selection_rate	fbeta_06	count
Loan Size	Credit Score	race	sex
Large	High	Black	Female	0.0	0.0	5
		Black	Male	0.0	0.0	1
		Other	Female	0.0	0.0	3
		Other	Male	NaN	NaN	NaN
		White	Female	0.0	0.0	13
		White	Male	0.0	0.0	1
	Low	Black	Female	0.0	0.0	52
		Black	Male	0.030303	1.0	33
		Other	Female	0.0	0.0	3
		Other	Male	0.0	0.0	14
		White	Female	0.0	0.0	133
		White	Male	0.004444	0.557377	225
	Medium	Black	Female	0.0	0.0	7
		Black	Male	0.026316	0.295652	38
		Other	Female	0.111111	0.0	9
		Other	Male	0.0	0.0	19
		White	Female	0.0	0.0	28
		White	Male	0.087087	0.420976	333
Small	High	Black	Female	0.0	0.0	49
		Black	Male	0.071429	1.0	14
		Other	Female	0.0	0.0	18
		Other	Male	0.0	0.0	4
		White	Female	0.020478	0.529595	293
		White	Male	0.144928	0.759305	69
	Low	Black	Female	0.007737	0.626728	517
		Black	Male	0.019608	0.518293	357
		Other	Female	0.01227	0.519084	163
		Other	Male	0.040816	0.715789	147
		White	Female	0.015805	0.527656	2784
		White	Male	0.035702	0.550162	2857
	Medium	Black	Female	0.228916	0.648094	83
		Black	Male	0.231317	0.590371	281
		Other	Female	0.258621	0.524085	58
		Other	Male	0.362205	0.740024	254
		White	Female	0.390323	0.682328	620
		White	Male	0.426664	0.705861	5168

The aggregates behave similarly. By this point, we are having significant issues with under-populated intersections. Consider:

def member_counts(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return len(y_true)


counts = MetricFrame(metrics=member_counts,
                     y_true=y_test,
                     y_pred=y_pred,
                     sensitive_features=A_test[['race', 'sex']],
                     control_features=A_test[['Loan Size', 'Credit Score']])

counts.by_group

Out:

Found 36 subgroups. Evaluation may be slow

Loan Size  Credit Score  race   sex
Large      High          Black  Female       5
                                Male         1
                         Other  Female       3
                                Male       NaN
                         White  Female      13
                                Male         1
           Low           Black  Female      52
                                Male        33
                         Other  Female       3
                                Male        14
                         White  Female     133
                                Male       225
           Medium        Black  Female       7
                                Male        38
                         Other  Female       9
                                Male        19
                         White  Female      28
                                Male       333
Small      High          Black  Female      49
                                Male        14
                         Other  Female      18
                                Male         4
                         White  Female     293
                                Male        69
           Low           Black  Female     517
                                Male       357
                         Other  Female     163
                                Male       147
                         White  Female    2784
                                Male      2857
           Medium        Black  Female      83
                                Male       281
                         Other  Female      58
                                Male       254
                         White  Female     620
                                Male      5168
Name: member_counts, dtype: object

Recall that NaN indicates that there were no individuals in a cell; member_counts() will not even have been called.
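The NaN rows arise because the table is laid out over every combination of the observed values of each feature, whether or not any sample has that combination. The sketch below shows this on made-up rows, assuming (as the output above suggests) that the full Cartesian product is enumerated. None stands in for NaN, and no count is computed for an empty cell. MIN_CELL_SIZE is a hypothetical threshold for flagging under-populated cells; the text does not prescribe one.

```python
import itertools


def counts_over_product(*features):
    """Count rows for every combination of observed feature values; None marks empty cells."""
    levels = [sorted(set(f)) for f in features]
    rows = list(zip(*features))
    result = {}
    for combo in itertools.product(*levels):
        members = [r for r in rows if r == combo]
        # An empty cell is still listed, but nothing is evaluated for it
        result[combo] = len(members) if members else None
    return result


loan = ['Large', 'Large', 'Small', 'Small', 'Small']
race = ['Black', 'Black', 'White', 'Black', 'White']
cells = counts_over_product(loan, race)

MIN_CELL_SIZE = 2  # hypothetical threshold for flagging sparse cells
sparse = [k for k, n in cells.items() if n is not None and n < MIN_CELL_SIZE]
print(cells)
print(sparse)
```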

Exporting from MetricFrame

Sometimes, we need to extract our data for use in other tools. For this, we can use the pandas.DataFrame.to_csv() method, since the by_group property will be a pandas.DataFrame (or, in a few cases, a pandas.Series, which has a similar to_csv() method):

csv_output = cond_credit_score.by_group.to_csv()
print(csv_output)

Out:

Credit Score,race,sex,selection_rate,fbeta_06,count
High,Black,Female,0.0,0.0,54
High,Black,Male,0.06666666666666667,1.0,15
High,Other,Female,0.0,0.0,21
High,Other,Male,0.0,0.0,4
High,White,Female,0.0196078431372549,0.5295950155763239,306
High,White,Male,0.14285714285714285,0.7593052109181142,70
Low,Black,Female,0.007029876977152899,0.6267281105990783,569
Low,Black,Male,0.020512820512820513,0.56353591160221,390
Low,Other,Female,0.012048192771084338,0.5190839694656488,166
Low,Other,Male,0.037267080745341616,0.6938775510204082,161
Low,White,Female,0.015083990401097017,0.5257731958762887,2917
Low,White,Male,0.033419857235561325,0.5502497502497502,3082
Medium,Black,Female,0.2111111111111111,0.6396526772793053,90
Medium,Black,Male,0.20689655172413793,0.5775764439411097,319
Medium,Other,Female,0.23880597014925373,0.5,67
Medium,Other,Male,0.336996336996337,0.7320574162679426,273
Medium,White,Female,0.3734567901234568,0.6808811402992107,648
Medium,White,Male,0.40610798036720597,0.700837357443748,5501

The pandas.DataFrame.to_csv() method has a large number of arguments to control the exported CSV. For example, it can write directly to a CSV file, rather than returning a string (as shown above).

The overall property can be handled similarly, in cases where it is not a scalar.

Total running time of the script: ( 0 minutes 10.981 seconds)
