Note

Go to the end to download the full example code. or to run this example in your browser via JupyterLite

Credit Loan Decisions#

Package Imports#

import warnings

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from fairlearn.metrics import (
    MetricFrame,
    count,
    equalized_odds_difference,
    false_negative_rate,
    false_positive_rate,
    selection_rate,
)
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import EqualizedOdds, ExponentiatedGradient

warnings.simplefilter("ignore")

rand_seed = 1234
np.random.seed(rand_seed)

Fairness considerations of credit loan decisions#

Fairness and credit lending in the US#

In 2019, Apple received backlash on social media after its newly launched Apple Card product appeared to offer higher credit limits to men compared to women [1]. In multiple cases, married couples found the husband received a credit limit that was 10-20x higher than the wife’s even when the couple had joint assets.

From a regulatory perspective, financial institutions that operate within the United States are subject to legal regulations prohibiting discrimination on the basis of race, gender, or other protected classes [2]. With the increasing prevalence of automated decision-systems in the financial lending space, experts have raised concerns about whether these systems could exacerabate existing inequalities in financial lending.

Although the two concepts are intertwined, algorithmic fairness is not the same concept as anti-discrimination law. An AI system can comply with anti-discrimination law while exhibiting fairness-related concerns. On the other hand, some fairness interventions may be illegal under anti-discrimination law. Xiang and Raji[3] discuss the compatibilities and disconnects between anti-discrimination law and algorithmic notions of fairness. In this case study, we focus on fairness in financial services rather than compliance with financial anti-discrimination regulations.

Ernst & Young (EY) case study#

In this case study, we aim to replicate the work done in a white paper [4], co-authored by Microsoft and EY, on mitigating gender-related performance disparities in financial lending decisions. In their analysis, Microsoft and EY demonstrated how Fairlearn could be used to measure and mitigate unfairness in the loan adjudication process.

Using a dataset of credit loan outcomes (whether an individual defaulted on a credit loan), we train a fairness-unaware model to predict the likelihood an individual will default on a given loan. We use the Fairlearn toolkit for assessing the fairness of our model, according to several metrics. Finally, we perform two unfairness mitigation strategies on our model and compare the results to our original model.

Because the dataset used in the white paper is not publicly available, we will introduce a semi-synthetic feature into an existing publicly available dataset to replicate the outcome disparity found in the original dataset.

Credit decisions dataset#

As mentioned, we will not be able to use the original loans dataset, and instead will be working with a publicly available dataset of credit card defaults in Taiwan collected in 2005. This dataset represents binary credit card default outcomes for 30,000 applicants with information pertaining to an applicant’s payment history and bill statements over a six-month period from April 2005 to September 2005, as well as demographic information, such as sex, age, marital status, and education level of the applicant. A full summary of features is provided below:

features	description
sex, education, marriage, age	demographic features
pay_0, pay_2, pay_3, pay_4, pay_5, pay_6	repayment status (ordinal)
bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt_6	bill statement amount (Taiwan dollars)
pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6	previous statement amount (Taiwan dollars)
default payment next month	default information (1 = YES, 0 = NO)

Let’s pretend we are a data scientist at a financial institution who is tasked with developing a classification model to predict whether an applicant will default on a personal loan. A positive prediction by the model means the applicant would default on the credit loan. Defaulting on a loan means the client fails to make payments within a 30-day window, and the lender can take legal actions against the client.

Although we do not have a dataset of loan default history, we do have this data set of credit card payment history. We assume customers who make monthly credit card payments on time are more creditworthy, and thus less likely to default on a personal credit loan.

Decision point: task definition

Defaulting on a credit card payment can be viewed as a proxy for the fact that an applicant might not be a good candidate for a personal loan.
Because most customers did not default on their credit card payment, we will need to take this class imbalance into account during our modeling process.

As the data is read in-memory, we will change the column PAY_0 to PAY_1 to make the naming more consistent with the naming of the other columns. In addition, the target variable default payment next month is changed to default to reduce verbosity.

data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
dataset = (
    pd.read_excel(io=data_url, header=1)
    .drop(columns=["ID"])
    .rename(columns={"PAY_0": "PAY_1", "default payment next month": "default"})
)

dataset.shape

(30000, 24)

dataset.head()

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_1	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default
0	20000	2	2	1	24	2	2	-1	-1	-2	-2	3913	3102	689	0	0	0	0	689	0	0	0	0	1
1	120000	2	2	2	26	-1	2	0	0	0	2	2682	1725	2682	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	90000	2	2	2	34	0	0	0	0	0	0	29239	14027	13559	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	50000	2	2	1	37	0	0	0	0	0	0	46990	48233	49291	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	50000	1	2	1	57	-1	0	-1	0	0	0	8617	5670	35835	20940	19146	19131	2000	36681	10000	9000	689	679	0

From the dataset description [5], we see there are three categorical features:

SEX: Sex of the applicant (as a binary feature)
EDUCATION: Highest level of education achieved by the applicant.
MARRIAGE: Marital status of the applicant.

categorical_features = ["SEX", "EDUCATION", "MARRIAGE"]

for col_name in categorical_features:
    dataset[col_name] = dataset[col_name].astype("category")

Y, A = dataset.loc[:, "default"], dataset.loc[:, "SEX"]
X = pd.get_dummies(dataset.drop(columns=["default", "SEX"]))

A_str = A.map({1: "male", 2: "female"})

Dataset imbalances#

Before we start training a classifier model, we want to explore the dataset for any characteristics that may lead to fairness-related harms later on in the modeling process. In particular, we will focus on the distribution of sensitive feature SEX and the target label default.

As part of an exploratory data analysis, let’s explore the distribution of our sensitive feature SEX. We see that 60% of loan applicants were labeled as female and 40% as male, so we do not need to worry about imbalance in this feature.

A_str.value_counts(normalize=True)

SEX
female    0.603733
male      0.396267
Name: proportion, dtype: float64

Next, let’s explore the distribution of the loan default rate Y. We see that around 78% of individuals in the dataset do not default on their credit loan. While the target label does not display extreme imbalance, we will need to account for this imbalance in our modeling section. As opposed to the sensitive feature SEX, an imbalance in the target label may result in a classifier that over-optimizes for the majority class. For example, a classifier that predicts an applicant will not default would achieve an accuracy of 78%, so we will use the balanced_accuracy score as our evaluation metric to counteract the label imbalance.

Y.value_counts(normalize=True)

default
0    0.7788
1    0.2212
Name: proportion, dtype: float64

Check if this will lead to disparity in naive model#

Now that we have created our synthetic feature, let’s check how this new feature interacts with our sensitive_feature Sex and our target label default. We see that for both sexes, the Interest feature is higher for individuals who defaulted on their loan.

fig, (ax_1, ax_2) = plt.subplots(ncols=2, figsize=(10, 4), sharex=True, sharey=True)
X["Interest"][(A == 1) & (Y == 0)].plot(
    kind="kde", label="Payment on Time", ax=ax_1, title="INTEREST for Men"
)
X["Interest"][(A == 1) & (Y == 1)].plot(kind="kde", label="Payment Default", ax=ax_1)
X["Interest"][(A == 2) & (Y == 0)].plot(
    kind="kde",
    label="Payment on Time",
    ax=ax_2,
    legend=True,
    title="INTEREST for Women",
)
X["Interest"][(A == 2) & (Y == 1)].plot(
    kind="kde", label="Payment Default", ax=ax_2, legend=True
).legend(bbox_to_anchor=(1.6, 1))

<matplotlib.legend.Legend object at 0x7fb5d757dee0>

Training an initial model#

In this section, we will train a fairness-unaware model on the training data. However because of the imbalances in the dataset, we will first resample the training data to produce a new balanced training dataset.

def resample_training_data(X_train, Y_train, A_train):
    """Down-sample the majority class in the training dataset to produce a
    balanced dataset with a 50/50 split in the predictive labels.

    Parameters:
    X_train: The training split of the features
    Y_train: The training split of the target labels
    A_train: The training split of the sensitive features

    Returns:
    Tuple of X_train, Y_train, A_train where each dataset has been re-balanced.
    """
    negative_ids = Y_train[Y_train == 0].index
    positive_ids = Y_train[Y_train == 1].index
    balanced_ids = positive_ids.union(np.random.choice(a=negative_ids, size=len(positive_ids)))

    X_train = X_train.loc[balanced_ids, :]
    Y_train = Y_train.loc[balanced_ids]
    A_train = A_train.loc[balanced_ids]
    return X_train, Y_train, A_train

X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, Y, A_str, test_size=0.35, stratify=Y
)

X_train, y_train, A_train = resample_training_data(X_train, y_train, A_train)

At this stage, we will train a gradient-boosted tree classifier using the lightgbm package on the balanced training dataset. When we evaluate the model, we will use the unbalanced testing dataset.

lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.03,
    "num_leaves": 10,
    "max_depth": 3,
    "random_state": rand_seed,
    "n_jobs": 1,
    "verbose": -1,
}

estimator = Pipeline(
    steps=[
        ("preprocessing", StandardScaler()),
        ("classifier", lgb.LGBMClassifier(**lgb_params)),
    ]
)

estimator.fit(X_train, y_train)

Pipeline(steps=[('preprocessing', StandardScaler()),
                ('classifier',
                 LGBMClassifier(learning_rate=0.03, max_depth=3, metric='auc',
                                n_jobs=1, num_leaves=10, objective='binary',
                                random_state=1234, verbose=-1))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We compute the binary predictions and the prediction probabilities for the testing data points.

Y_pred_proba = estimator.predict_proba(X_test)[:, 1]
Y_pred = estimator.predict(X_test)

From the ROC Score, we see the model appears to be differentiating between true positives and false positives well. This is to be expected given the INTEREST feature provides a strong discriminant feature for the classification task.

roc_auc_score(y_test, Y_pred_proba)

0.8782234705862874

Feature Importance of the Unmitigated Classifier#

As a model validation check, let’s explore the feature importances of our classifier. As expected, our synthetic feature INTEREST has the highest feature importance because it is highly correlated with the target variable, by construction.

lgb.plot_importance(
    estimator.named_steps["classifier"],
    height=0.6,
    title="Feature Importance",
    importance_type="gain",
    max_num_features=15,
)

<Axes: title={'center': 'Feature Importance'}, xlabel='Feature importance', ylabel='Features'>

Fairness assessment of unmitigated model#

Now that we have trained our initial fairness-unaware model, let’s perform our fairness assessment for this model. When conducting a fairness assessment, there are three main steps we want to perform:

Identify who will be harmed.
Identify the types of harms we anticipate.
Define fairness metrics based on the anticipated harms.

Who will be harmed?#

Based on the incident with Apple credit card mentioned at the beginning of this notebook, we believe the model may incorrectly predict women will default on the credit loan. The system may unfairly allocate less loans to women and over-allocate loans to men.

Types of harm experienced#

When discussing fairness in AI systems, the first step is understanding what types of harms we anticipate the system may produce. Using the harms taxonomy in the Fairlearn User Guide, we expect this system to produce harms of allocation. In addition, we also anticipate the long-term impact on an individual’s credit score if an individual is unable to repay a loan they receive or if they are rejected for a loan application. A harm of allocation occurs when an AI system extends or withholds resources, opportunities, information. In this scenario, the AI system is extending or withholding financial assets from individuals. A review of historical incidents shows these types of automated lending decision systems may discriminate unfairly based on sex.

Negative impact of credit score

A secondary harm that is somewhat unique to credit lending decisions is the long-term impact on an individual’s credit score. In the United States, a FICO credit score is a number between 300 and 850 that represents a customer’s creditworthiness. An applicant’s credit score is used by many financial institutions for lending decisions. An applicant’s credit score usually increases after a successful repayment of a loan and decreases if the applicant fails to repay the loan.

When applying for a credit loan, there are three major outcomes:

The individual receives the credit loan and pays back the loan. In this scenario, we expect the individual’s credit score to increase as a result of the successful repayment of the loan.
The individual receives the credit loan but defaults on the loan. In this scenario, the individual’s credit score will drop drastically due to the failure to repay the loan. In the modeling process, this outcome is tied to a false negative (the model predicts the individual will repay the loan, but the individual is unsuccessful in doing so).
In certain countries, such as the United States, an individual receives a small drop (up to five points) to their credit score after a lender performs a hard inquiry on the applicant’s credit history. If the applicant applies for a loan but does not receive it, the small decrease in their credit score will impact their ability to successfully apply for a future loan. In the modeling process, this outcome is tied to the selection rate (the proportion of positive predictions outputted by the model).

Prevention of wealth accumulation

One other type of harm we anticipate in this scenario is the long-term effects of denying loans to applicants who would have successfully paid back the loan. By receiving a loan, an applicant is able to purchase a home, start a business, or pursue some other economic activity that they are not able to do otherwise. These outcomes are tied to false positive error rates in which the model predicts an applicant will default on the loan, but the individual would have successfully paid the loan back. In the United States, the practice of redlining [6], denying mortgage loans and other financial services to predominantly Black or other minority communities, has resulted in a vast racial wealth gap between white and Black Americans. Although the practice of redlining was banned in 1968 with the Fair Housing Act, the long-term impact of these practices [7] is reflected in the lack of economic investment in Black communities, and Black applicants are denied loans at a higher rate compared to white Americans.

Define fairness metrics based on harms#

Now that we have identified the relevant harms we anticipate users will experience, we can define our fairness metrics. In addition to the metrics, we will quantify the uncertainty around each metric using custom functions to compute the standard error for each metric at the \(\alpha=0.95\) confidence level.

def compute_error_metric(metric_value, sample_size):
    """Compute standard error of a given metric based on the assumption of
    normal distribution.

    Parameters:
    metric_value: Value of the metric
    sample_size: Number of data points associated with the metric

    Returns:
    The standard error of the metric
    """
    metric_value = metric_value / sample_size
    return 1.96 * np.sqrt(metric_value * (1.0 - metric_value)) / np.sqrt(sample_size)


def false_positive_error(y_true, y_pred):
    """Compute the standard error for the false positive rate estimate."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return compute_error_metric(fp, tn + fp)


def false_negative_error(y_true, y_pred):
    """Compute the standard error for the false negative rate estimate."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return compute_error_metric(fn, fn + tp)


def balanced_accuracy_error(y_true, y_pred):
    """Compute the standard error for the balanced accuracy estimate."""
    fpr_error, fnr_error = false_positive_error(y_true, y_pred), false_negative_error(
        y_true, y_pred
    )
    return np.sqrt(fnr_error**2 + fpr_error**2) / 2


fairness_metrics = {
    "count": count,
    "balanced_accuracy": balanced_accuracy_score,
    "balanced_acc_error": balanced_accuracy_error,
    "selection_rate": selection_rate,
    "false_positive_rate": false_positive_rate,
    "false_positive_error": false_positive_error,
    "false_negative_rate": false_negative_rate,
    "false_negative_error": false_negative_error,
}

Select a subset of metrics to report to avoid information overload

metrics_to_report = [
    "balanced_accuracy",
    "false_positive_rate",
    "false_negative_rate",
]

To compute the disaggregated performance metrics, we will use the MetricFrame object within the Fairlearn library. We will pass in our dictionary of metrics fairness_metrics, along with our test labels y_test and test predictions Y_pred. In addition, we pass in the sensitive_features A_test to disaggregate our model results.

Instantiate the MetricFrame for the unmitigated model

metricframe_unmitigated = MetricFrame(
    metrics=fairness_metrics,
    y_true=y_test,
    y_pred=Y_pred,
    sensitive_features=A_test,
)

metricframe_unmitigated.by_group[metrics_to_report]

metricframe_unmitigated.difference()[metrics_to_report]

metricframe_unmitigated.overall[metrics_to_report]

balanced_accuracy      0.801682
false_positive_rate    0.207656
false_negative_rate    0.188980
dtype: float64

def plot_group_metrics_with_error_bars(metricframe, metric, error_name):
    """Plot the disaggregated metric for each group with an associated
    error bar. Both metric and the error bar are provided as columns in the
    provided MetricFrame.

    Parameters
    ----------
    metricframe : MetricFrame
        The MetricFrame containing the metrics and their associated
        uncertainty quantification.
    metric : str
        The metric to plot
    error_name : str
        The associated standard error for each metric in metric

    Returns
    -------
    Matplotlib Plot of point estimates with error bars
    """
    grouped_metrics = metricframe.by_group
    point_estimates = grouped_metrics[metric]
    error_bars = grouped_metrics[error_name]
    lower_bounds = point_estimates - error_bars
    upper_bounds = point_estimates + error_bars

    x_axis_names = [str(name) for name in error_bars.index.to_flat_index().tolist()]
    plt.vlines(
        x_axis_names,
        lower_bounds,
        upper_bounds,
        linestyles="dashed",
        alpha=0.45,
    )
    plt.scatter(x_axis_names, point_estimates, s=25)
    plt.xticks(rotation=0)
    y_start, y_end = np.round(min(lower_bounds), decimals=2), np.round(
        max(upper_bounds), decimals=2
    )
    plt.yticks(np.arange(y_start, y_end, 0.05))
    plt.ylabel(metric)

plot_group_metrics_with_error_bars(
    metricframe_unmitigated, "false_positive_rate", "false_positive_error"
)

plot_group_metrics_with_error_bars(
    metricframe_unmitigated, "false_negative_rate", "false_negative_error"
)

metricframe_unmitigated.by_group[metrics_to_report].plot.bar(
    subplots=True, layout=[1, 3], figsize=[12, 4], legend=None, rot=0
)

balanced_accuracy, false_positive_rate, false_negative_rate

array([[<Axes: title={'center': 'balanced_accuracy'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_positive_rate'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_negative_rate'}, xlabel='SEX'>]],
      dtype=object)

Finally, let’s compute the equalized_odds_difference for this unmitigated model. The equalized_odds_difference is the maximum of the false_positive_rate_difference and false_negative_rate_difference. In our lending context, both false_negative_rate_disparities and false_positive_rate_disparities result in fairness-related harms. Therefore, we attempt to minimize both of these metrics by minimizing the equalized_odds_difference.

balanced_accuracy_unmitigated = balanced_accuracy_score(y_test, Y_pred)
equalized_odds_unmitigated = equalized_odds_difference(y_test, Y_pred, sensitive_features=A_test)

One key assumption here is we assume that false positives and false negatives have the equally adverse costs to each group. In practice, we would develop some weighting mechanism to assign a weight to each false negative and false positive event.

Mitigating Unfairness in ML models#

In the previous section, we identified disparities in the model’s performance with respect to SEX. In particular, we found that model produces a significantly higher false_negative_rate and false_positive_rate for the applicants labeled female compared to those labeled male. In the context of credit decision scenario, this means the model under-allocates loans to women who would have paid the loan, but over-allocates loans to women who go on to default on their loan.

In this section, we will discuss strategies for mitigating the performance disparities we found in our unmitigated model. We will apply two different mitigation strategies:

Postprocessing: In the postprocessing approach, the outputs of a trained classifier are transformed to satisfy some fairness criterion.
Reductions: In the reductions approach, we take in a model class and iteratively create a sequence of models that optimize some fairness constraint. Compared to the postprocessing approach, the fairness constraint is satisfied during the model training time rather than afterwards.

Postprocessing mitigations: ThresholdOptimizer#

In the Fairlearn package, postprocessing mitigation is offered through the ThresholdOptimizer algorithm, following Hardt, Price, and Srebro[8]. The ThresholdOptimizer takes in an existing (possibly pre-fit) machine learning model whose predictions acts as a scoring function to identify separate thresholds for each sensitive feature group. The ThresholdOptimizer optimizes a specified objective metric (in our case, balanced_accuracy) subject to some fairness constraint (equalized_odds), resulting in a thresholded version of the underlying machine learning model.

To instantiate our ThresholdOptimizer, we need to specify our fairness constraint as a model parameter. Because both false_negative_rate disparities and false_positive_rate disparities translate into real-world harms in our scenario, we will aim to minimize the equalized_odds difference as our fairness constraint.

postprocess_est = ThresholdOptimizer(
    estimator=estimator,
    constraints="equalized_odds",  # Optimize FPR and FNR simultaneously
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method="predict_proba",
)

One key limitation of the ThresholdOptimizer is the need for sensitive features during training and prediction time. If we do not have access to the sensitive_features during prediction time, we cannot use the ThresholdOptimizer.

We pass in A_train to the fit function with the sensitive_features parameter.

postprocess_est.fit(X=X_train, y=y_train, sensitive_features=A_train)

postprocess_pred = postprocess_est.predict(X_test, sensitive_features=A_test)

postprocess_pred_proba = postprocess_est._pmf_predict(X_test, sensitive_features=A_test)

Fairness assessment of postprocessing model#

def compare_metricframe_results(mframe_1, mframe_2, metrics, names):
    """Concatenate the results of two MetricFrames along a subset of metrics.

    Parameters
    ----------
    mframe_1: First MetricFrame for comparison
    mframe_2: Second MetricFrame for comparison
    metrics: The subset of metrics for comparison
    names: The names of the selected metrics

    Returns
    -------
    MetricFrame : MetricFrame
        The concatenation of the two MetricFrames, restricted to the metrics
        specified.

    """
    return pd.concat(
        [mframe_1.by_group[metrics], mframe_2.by_group[metrics]],
        keys=names,
        axis=1,
    )

bal_acc_postprocess = balanced_accuracy_score(y_test, postprocess_pred)
eq_odds_postprocess = equalized_odds_difference(
    y_test, postprocess_pred, sensitive_features=A_test
)

metricframe_postprocess = MetricFrame(
    metrics=fairness_metrics,
    y_true=y_test,
    y_pred=postprocess_pred,
    sensitive_features=A_test,
)

metricframe_postprocess.overall[metrics_to_report]

metricframe_postprocess.difference()[metrics_to_report]

balanced_accuracy      0.019392
false_positive_rate    0.002971
false_negative_rate    0.035814
dtype: float64

Now, let’s compare the performance of our thresholded classifier with the original unmitigated model.

compare_metricframe_results(
    metricframe_unmitigated,
    metricframe_postprocess,
    metrics=metrics_to_report,
    names=["Unmitigated", "PostProcess"],
)

metricframe_postprocess.by_group[metrics_to_report].plot.bar(
    subplots=True, layout=[1, 3], figsize=[12, 4], legend=None, rot=0
)

array([[<Axes: title={'center': 'balanced_accuracy'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_positive_rate'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_negative_rate'}, xlabel='SEX'>]],
      dtype=object)

We see that the ThresholdOptimizer algorithm achieves a much lower disparity between the two groups compared to the unmitigated model. However, this does come with the trade-off that the ThresholdOptimizer achieves a lower balanced_accuracy score for male applicants.

Reductions approach to unfairness mitigation#

In the previous section, we took a fairness-unaware model and used the ThresholdOptimizer to transform the model’s decision boundary to satisfy our fairness constraints. One key limitation of the ThresholdOptimizer is needing access to our sensitive_feature during prediction time.

In this section, we will use the reductions approach of Agarwal et. al (2018) [9] to produce models that satisfy the fairness constraint without needing access to the sensitive features at deployment time.

The main reduction algorithm in Fairlearn is ExponentiatedGradient. The algorithm creates a sequence of re-weighted datasets and retrains the wrapped classifier on each of the datasets. This re-training process is guaranteed to find a model that satisfies the fairness constraints while optimizing the performance metric.

The model returned by ExponentiatedGradient consists of several inner models, returned by a wrapped estimator.

To instantiate an ExponentiatedGradient model, we pass in two parameters:

a base estimator (object that supports training)
fairness constraints (object of type fairlearn.reductions.Moment)

When passing in a fairness constraint as a Moment, we can specify an epsilon value representing the maximum allowed difference or ratio between our largest and smallest value. For example, in the below code, EqualizedOdds(difference_bound=epsilon) means that we are using EqualizedOdds as our fairness constraint, and we will allow a maximal difference of epsilon between our largest and smallest equalized odds value.

def get_expgrad_models_per_epsilon(estimator, epsilon, X_train, y_train, A_train):
    """Instantiate and train an ExponentiatedGradient model on the
    balanced training dataset.

    Parameters
    ----------
    Estimator: Base estimator to contains a fit and predict function.
    Epsilon: Float representing maximum difference bound for the fairness Moment constraint

    Returns
    -------
    Predictors
        List of inner model predictors learned by the ExponentiatedGradient
        model during the training process.

    """
    exp_grad_est = ExponentiatedGradient(
        estimator=estimator,
        sample_weight_name="classifier__sample_weight",
        constraints=EqualizedOdds(difference_bound=epsilon),
    )
    # Is this an issue - Re-runs
    exp_grad_est.fit(X_train, y_train, sensitive_features=A_train)
    predictors = exp_grad_est.predictors_
    return predictors

Because the performance-fairness trade-off learned by the ExponentiatedGradient model is sensitive to our chosen epsilon value, we can treat epsilon as a hyperparameter and iterate over a range of potential values. Here, we will train two ExponentiatedGradient models, one with epsilon=0.01 and the second with epsilon=0.02, and store the inner models learned through each of the training processes.

In practice, we recommend choosing smaller values of epsilon on the order of the square root of the number of samples in the training dataset: \(\dfrac{1}{\sqrt{\text{numberSamples}}} \approx \dfrac{1}{\sqrt{25000}} \approx 0.01\)

epsilons = [0.01, 0.02]

all_models = {}
for eps in epsilons:
    all_models[eps] = get_expgrad_models_per_epsilon(
        estimator=estimator,
        epsilon=eps,
        X_train=X_train,
        y_train=y_train,
        A_train=A_train,
    )

for epsilon, models in all_models.items():
    print(f"For epsilon {epsilon}, ExponentiatedGradient learned {len(models)} inner models")

For epsilon 0.01, ExponentiatedGradient learned 17 inner models
For epsilon 0.02, ExponentiatedGradient learned 19 inner models

Here, we can see all the inner models learned for each value of epsilon. With the ExponentiatedGradient model, we specify an epsilon parameter that represents the maximal disparity in our fairness metric that our final model should satisfy. For example, an epsilon=0.02 means that the training value of the equalized odds difference of the returned model is at most 0.02 (if the algorithm converges).

Reviewing inner models of ExponentiatedGradient#

In many situations due to regulation or other technical restrictions, the randomized nature of ExponentiatedGradient algorithm may be undesirable. In addition, the multiple inner models of the algorithm introduce challenges for model interpretability. One potential workaround to avoid these issues is to select one of the inner models and deploy it instead.

In the previous section, we trained multiple ExponentiatedGradient models at different epsilon levels and collected all the inner models learned by this process. When picking a suitable inner model, we consider trade-offs between our two metrics of interest: balanced error rate and equalized odds difference. Since our focus is on these two metrics, we will filter out the models that are outperformed in both of the metrics by some other model (we refer to these as “dominated” models), and plot just the remaining “undominated” models.

def is_pareto_efficient(points):
    """Filter a NumPy Matrix to remove rows that are strictly dominated by
    another row in the matrix. Strictly dominated means the all the row values
    are greater than the values of another row.

    Parameters
    ----------
    Points: NumPy array (NxM) of model metrics.
        Assumption that smaller values for metrics are preferred.

    Returns
    -------
    Boolean Array
        Nx1 boolean mask representing the non-dominated indices.
    """
    n, m = points.shape
    is_efficient = np.ones(n, dtype=bool)
    for i, c in enumerate(points):
        if is_efficient[i]:
            is_efficient[is_efficient] = np.any(points[is_efficient] < c, axis=1)
            is_efficient[i] = True
    return is_efficient

def filter_dominated_rows(points):
    """Remove rows from a DataFrame that are monotonically dominated by
    another row in the DataFrame.

    Parameters
    ----------
    Points: DataFrame where each row represents the summarized performance
            (balanced accuracy, fairness metric) of an inner model.

    Returns
    -------
    pareto mask: Boolean mask representing indices of input DataFrame that are not monotonically dominated.
    masked_DataFrame: DataFrame with dominated rows filtered out.

    """
    pareto_mask = is_pareto_efficient(points.to_numpy())
    return pareto_mask, points.loc[pareto_mask, :]

def aggregate_predictor_performances(predictors, metric, X_test, Y_test, A_test=None):
    """Compute the specified metric for all classifiers in predictors.
    If no sensitive features are present, the metric is computed without
    disaggregation.

    Parameters
    ----------
    predictors: A set of classifiers to generate predictions from.
    metric: The metric (callable) to compute for each classifier in predictor
    X_test: The data features of the testing data set
    Y_test: The target labels of the teting data set
    A_test: The sensitive feature of the testing data set.

    Returns
    -------
    List of performance scores for each classifier in predictors, for the
    given metric.
    """
    all_predictions = [predictor.predict(X_test) for predictor in predictors]
    if A_test is not None:
        return [metric(Y_test, Y_sweep, sensitive_features=A_test) for Y_sweep in all_predictions]
    else:
        return [metric(Y_test, Y_sweep) for Y_sweep in all_predictions]

def model_performance_sweep(models_dict, X_test, y_test, A_test):
    """Compute the equalized_odds_difference and balanced_error_rate for a
    given list of inner models learned by the ExponentiatedGradient algorithm.
    Return a DataFrame containing the epsilon level of the model, the index
    of the model, the equalized_odds_difference score and the balanced_error
    for the model.

    Parameters
    ----------
    models_dict: Dictionary mapping model ids to a model.
    X_test: The data features of the testing data set
    y_test: The target labels of the testing data set
    A_test: The sensitive feature of the testing data set.

    Returns
    -------
    DataFrame where each row represents a model (epsilon, index) and its
    performance metrics
    """
    performances = []
    for eps, models in models_dict.items():
        eq_odds_difference = aggregate_predictor_performances(
            models, equalized_odds_difference, X_test, y_test, A_test
        )
        bal_acc_score = aggregate_predictor_performances(
            models, balanced_accuracy_score, X_test, y_test
        )
        for i, score in enumerate(eq_odds_difference):
            performances.append((eps, i, score, (1 - bal_acc_score[i])))
    performances_df = pd.DataFrame.from_records(
        performances,
        columns=["epsilon", "index", "equalized_odds", "balanced_error"],
    )
    return performances_df

performance_df = model_performance_sweep(all_models, X_test, y_test, A_test)

performance_subset = performance_df.loc[:, ["equalized_odds", "balanced_error"]]

mask, pareto_subset = filter_dominated_rows(performance_subset)

performance_df_masked = performance_df.loc[mask, :]

Now, let’s plot the performance trade-offs between all of our models.

for index, row in performance_df_masked.iterrows():
    bal_error, eq_odds_diff = row["balanced_error"], row["equalized_odds"]
    epsilon_, index_ = row["epsilon"], row["index"]
    plt.scatter(bal_error, eq_odds_diff, color="green", label="ExponentiatedGradient")
    plt.text(
        bal_error + 0.001,
        eq_odds_diff + 0.0001,
        f"Eps: {epsilon_}, Idx: {int(index_)}",
        fontsize=10,
    )
plt.scatter(
    1.0 - balanced_accuracy_unmitigated,
    equalized_odds_unmitigated,
    label="UnmitigatedModel",
)
plt.scatter(1.0 - bal_acc_postprocess, eq_odds_postprocess, label="PostProcess")
plt.xlabel("Weighted Error Rate")
plt.ylabel("Equalized Odds")
plt.legend(bbox_to_anchor=(1.85, 1))

<matplotlib.legend.Legend object at 0x7fb3d28d5d60>

With the above plot, we can see how the performance of the non-dominated inner models compares to the original unmitigated model. In many cases, we see that a reduction in the equalized_odds_difference is accompanied by a small increase in the weighted error rate.

Selecting a suitable inner model#

One strategy we can use to select a model is creating a threshold based on the balanced error rate of the unmitigated model. Then out of the filtered models, we select the model that minimizes the equalized_odds_difference. The process can be broken down into the three steps below:

Create threshold based on balanced_error of the unmitigated model.
Filter only models whose balanced_error are below the threshold.
Choose the model with smallest equalized_odds difference.

Within the context of fair lending in the United States, if a financial institution is found to be engaging in discriminatory behavior, they must produce documentation that demonstrates the model chosen is the least discriminatory model while satisfying profitability and other business needs. In our approach, the business need of profitability is simulated by thresholding based on the balanced_error rate of the unmitigated model, and we choose the least discriminatory model based on the smallest equalized_odds_difference value.

def filter_models_by_unmitigiated_score(
    all_models,
    models_frames,
    unmitigated_score,
    performance_metric="balanced_error",
    fairness_metric="equalized_odds",
    threshold=0.01,
):
    """Filter out models whose performance score is above the desired
    threshold. Out of the remaining model, return the models with the best
    score on the fairness metric.

    Parameters
    ----------
    all_models: Dictionary (Epsilon, Index) mapping (epilson, index number) pairs to a Model object
    models_frames: A DataFrame representing each model's performance and fairness score.
    unmitigated_score: The performance score of the unmitigated model.
    performance_metric: The model performance metric to threshold on.
    fairness_metric: The fairness metric to optimize for
    threshold: The threshold padding added to the :code:`unmitigated_score`.

    """
    # Create threshold based on balanced_error of unmitigated model and filter
    models_filtered = models_frames.query(
        f"{performance_metric} <= {unmitigated_score + threshold}"
    )
    best_row = models_filtered.sort_values(by=[fairness_metric]).iloc[0]
    # Choose the model with smallest equalized_odds difference
    epsilon, index = best_row[["epsilon", "index"]]
    return {
        "model": all_models[epsilon][index],
        "epsilon": epsilon,
        "index": index,
    }

best_model = filter_models_by_unmitigiated_score(
    all_models,
    models_frames=performance_df,
    unmitigated_score=(1.0 - balanced_accuracy_unmitigated),
    threshold=0.015,
)

print(
    f"Epsilon for best model: {best_model.get('epsilon')}, Index number: {best_model.get('index')}"
)
inprocess_model = best_model.get("model")

Epsilon for best model: 0.01, Index number: 14.0

Now we have selected our best inner model, let’s collect the model’s predictions on the test dataset and compute the relevant performance metrics.

y_pred_inprocess = inprocess_model.predict(X_test)

bal_acc_inprocess = balanced_accuracy_score(y_test, y_pred_inprocess)
eq_odds_inprocess = equalized_odds_difference(y_test, y_pred_inprocess, sensitive_features=A_test)

metricframe_inprocess = MetricFrame(
    metrics=fairness_metrics,
    y_true=y_test,
    y_pred=y_pred_inprocess,
    sensitive_features=A_test,
)

metricframe_inprocess.difference()[metrics_to_report]

metricframe_inprocess.overall[metrics_to_report]

metricframe_inprocess.by_group[metrics_to_report].plot.bar(
    subplots=True, layout=[1, 3], figsize=[12, 4], legend=None, rot=0
)

array([[<Axes: title={'center': 'balanced_accuracy'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_positive_rate'}, xlabel='SEX'>,
        <Axes: title={'center': 'false_negative_rate'}, xlabel='SEX'>]],
      dtype=object)

Discuss Performance and Trade-Offs#

Now we have trained two different fairness-aware models using the postprocessing approach and the reductions approach. Let’s compare the performance of these models to our original fairness-unaware model.

metric_error_pairs = [
    ("balanced_accuracy", "balanced_acc_error"),
    ("false_positive_rate", "false_positive_error"),
    ("false_negative_rate", "false_negative_error"),
]


def create_metricframe_w_errors(mframe, metrics_to_report, metric_error_pair):
    mframe_by_group = mframe.by_group.copy()
    for metric_name, error_name in metric_error_pair:
        mframe_by_group[metric_name] = mframe_by_group[metric_name].apply(lambda x: f"{x:.3f}")
        mframe_by_group[error_name] = mframe_by_group[error_name].apply(lambda x: f"{x:.3f}")
        mframe_by_group[metric_name] = mframe_by_group[metric_name].str.cat(
            mframe_by_group[error_name], sep="±"
        )
    return mframe_by_group[metrics_to_report]

Report model performance error bars for metrics#

Unmitigated model

create_metricframe_w_errors(metricframe_unmitigated, metrics_to_report, metric_error_pairs)

metricframe_unmitigated.overall[metrics_to_report]

balanced_accuracy      0.801682
false_positive_rate    0.207656
false_negative_rate    0.188980
dtype: float64

ExponentiatedGradient model

create_metricframe_w_errors(metricframe_inprocess, metrics_to_report, metric_error_pairs)

	balanced_accuracy	false_positive_rate	false_negative_rate
SEX
female	0.763±0.013	0.227±0.012	0.248±0.023
male	0.847±0.013	0.145±0.012	0.161±0.023

ThresholdOptimizer

metricframe_inprocess.overall[metrics_to_report]

create_metricframe_w_errors(metricframe_postprocess, metrics_to_report, metric_error_pairs)

metricframe_postprocess.overall[metrics_to_report]

balanced_accuracy      0.761550
false_positive_rate    0.270270
false_negative_rate    0.206629
dtype: float64

We see both of our fairness-aware models yield a slight decrease in the balanced_accuracy for male applicants compared to our fairness-unaware model. In the reductions model, we see a decrease in the false positive rate for female applicants. This is accompanied by an increase in the false negative rate for male applicants. However overall, the equalized odds difference for the reductions models is lower than that of the original fairness-unaware model.

Conclusion and Discussion#

In this case study, we walked through the process of assessing a credit decision model for gender-related performance disparities. Our analysis follows closely the work done in the Microsoft/EY white paper [4] where they used the Fairlearn toolkit to perform an audit of a fairness-unaware tree-based model. We applied a postprocessing and reductions mitigation techniques to mitigate the equalized odds difference in our model.

Through the reductions process, we generated a model that reduces the equalized odds difference of the original model without a drastic increase in the balanced error score. If this were a real model being developed a financial institution, the balanced error score would be a proxy for the profitability of the model. By maintaining a relatively similar balanced error score, we’ve produced a model that preserves profitability to the firm while producing more fair and equitable outcomes for women in this scenario.

Designing a Model Card#

A key facet of Responsible Machine Learning is responsible documentation practices. Mitchell et al.[10] proposed the model card framework for documenting and reporting model training details and deployment considerations.

A model card contains sections for documenting training and evaluation dataset descriptions, ethical concerns, and quantitative evaluation summaries. In practice, we would ideally create a model card for our model before deploying it in production. Although we will not be producing a model card in this case study, interested readers can learn more about creating model cards using the Model Card Toolkit from the Fairlearn PyCon tutorial.

Fairness under unawareness#

When proving credit models are compliant with fair lending laws, practitoners may run into the issue of not having access to the sensitive demographic features. As a result, financial institutions are often tasked with proving their models are compliant with fair lending laws by imputing these demographics features. However, Chen et al.[11] show these imputation methods often introduce new fairness-related issues.

References#

Total running time of the script: (0 minutes 44.170 seconds)

Estimated memory usage: 185 MB

Gallery generated by Sphinx-Gallery

	steps steps: list of tuples List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`. See :ref:`Combining Estimators ` for more details.	[('preprocessing', ...), ('classifier', ...)]
	transform_input transform_input: list of str, default=None The names of the :term:`metadata` parameters that should be transformed by the pipeline before passing it to the step consuming it. This enables transforming some input arguments to ``fit`` (other than ``X``) to be transformed by the steps of the pipeline up to the step which requires them. Requirement is defined via :ref:`metadata routing `. For instance, this can be used to pass a validation set through the pipeline. You can only set this if metadata routing is enabled, which you can enable using ``sklearn.set_config(enable_metadata_routing=True)``. .. versionadded:: 1.6	None
	memory memory: str or object with the joblib.Memory interface, default=None Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute ``named_steps`` or ``steps`` to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See :ref:`sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py` for an example on how to enable caching.	None
	verbose verbose: bool, default=False If True, the time elapsed while fitting each step will be printed as it is completed.	False

	copy copy: bool, default=True If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.	True
	with_mean with_mean: bool, default=True If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.	True
	with_std with_std: bool, default=True If True, scale the data to unit variance (or equivalently, unit standard deviation).	True

	boosting_type	'gbdt'
	num_leaves	10
	max_depth	3
	learning_rate	0.03
	n_estimators	100
	subsample_for_bin	200000
	objective	'binary'
	class_weight	None
	min_split_gain	0.0
	min_child_weight	0.001
	min_child_samples	20
	subsample	1.0
	subsample_freq	0
	colsample_bytree	1.0
	reg_alpha	0.0
	reg_lambda	0.0
	random_state	1234
	n_jobs	1
	importance_type	'split'
	metric	'auc'
	verbose	-1

Credit Loan Decisions#

Package Imports#

Fairness considerations of credit loan decisions#

Fairness and credit lending in the US#

Ernst & Young (EY) case study#

Credit decisions dataset#

Dataset imbalances#

Add synthetic noise that is related to the outcome and sex#

Check if this will lead to disparity in naive model#

Training an initial model#

Feature Importance of the Unmitigated Classifier#

Fairness assessment of unmitigated model#

Who will be harmed?#

Types of harm experienced#

Define fairness metrics based on harms#

Mitigating Unfairness in ML models#

Postprocessing mitigations: ThresholdOptimizer#

Fairness assessment of postprocessing model#

Reductions approach to unfairness mitigation#

Reviewing inner models of ExponentiatedGradient#

Selecting a suitable inner model#

Discuss Performance and Trade-Offs#

Report model performance error bars for metrics#

Conclusion and Discussion#

Designing a Model Card#

Fairness under unawareness#

References#