Adversarial Mitigation#
Fairlearn provides an implementation of the adversarial
mitigation method of Zhang et al.[1].
The input to the method consists of features \(X,\) labels \(Y,\)
and sensitive features \(A\). The goal is to fit an estimator that
predicts \(Y\) from \(X\) while enforcing fairness constraints with
respect to \(A\). Both classification and regression are supported
(classes AdversarialFairnessClassifier and AdversarialFairnessRegressor)
with two types of fairness constraints: demographic parity and equalized odds.
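For example, the constraint can be chosen when constructing the estimator. The following is a minimal sketch; the constraints keyword and its accepted string values should be checked against the API reference, and the list syntax for the models is explained in Models below:

from fairlearn.adversarial import AdversarialFairnessClassifier

# Assumption: `constraints` accepts "demographic_parity" (the default)
# and "equalized_odds".
mitigator = AdversarialFairnessClassifier(
    predictor_model=[50, "relu"],
    adversary_model=[3, "relu"],
    constraints="equalized_odds",
)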
To train an adversarial mitigation algorithm, the user needs to provide two neural networks, a predictor network and an adversary network, with learnable weights \(W\) and \(U,\) respectively. The predictor network is constructed to solve the underlying supervised learning task, without considering fairness, by minimizing the predictor loss \(L_P.\) However, to improve fairness, we not only minimize the predictor loss, but also aim to decrease the adversary’s ability to predict the sensitive features from the predictor’s predictions (when implementing demographic parity), or jointly from the predictor’s predictions and true labels (when implementing equalized odds).
Suppose the adversary has the loss term \(L_A.\) The algorithm updates the adversary weights \(U\) by descending along the gradient \(\nabla_U L_A\). However, when updating the predictor weights \(W\), the algorithm uses

\(\nabla_W L_P - \mathrm{proj}_{\nabla_W L_A} \nabla_W L_P - \alpha \nabla_W L_A\)

instead of just the gradient \(\nabla_W L_P\). Compared with standard stochastic gradient descent, there are two additional terms that seek to prevent the decrease of the adversary loss. The hyperparameter \(\alpha\) specifies the strength of enforcing the fairness constraint. For details, see Zhang et al.[1].
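The predictor update direction can be sketched in a few lines. This is an illustrative NumPy sketch of the formula above, not the library’s internals; the gradients are assumed to be flattened into vectors:

import numpy as np

def predictor_update_direction(grad_LP, grad_LA, alpha):
    """Illustrative sketch of the predictor update direction from Zhang et al.[1].

    grad_LP, grad_LA: flattened gradients of the predictor and adversary
    losses with respect to the predictor weights W.
    """
    # Projection of grad_LP onto grad_LA; subtracting it prevents the
    # predictor from moving in a direction that also helps the adversary.
    proj = (grad_LP @ grad_LA) / (grad_LA @ grad_LA + 1e-12) * grad_LA
    # The -alpha * grad_LA term actively pushes the adversary loss up.
    return grad_LP - proj - alpha * grad_LA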
In Models, we discuss the models that this implementation accepts. In Data types and loss functions, we discuss the input format of \(X,\) how \(Y\) and \(A\) are preprocessed, and how the loss functions \(L_P\) and \(L_A\) are chosen. Finally, in Training we give some useful tips to keep in mind when training this model, as adversarial methods such as these can be difficult to train.
Models#
One can implement the predictor and adversary neural networks as a torch.nn.Module (using PyTorch) or as a keras.Model (using TensorFlow). This implementation has a soft dependency on PyTorch and TensorFlow, and the user needs to have installed at least one of the two. The two backends cannot be mixed: for example, a PyTorch predictor cannot be combined with a TensorFlow loss function.
It is very important to define the neural network models with no activation function or discrete prediction function on the final layer. So, for instance, when predicting a categorical feature that is one-hot-encoded, the neural network should output a vector of real-valued scores, not the one-hot-encoded discrete prediction:
import tensorflow as tf

from fairlearn.adversarial import AdversarialFairnessClassifier

# Both networks output raw scores: no activation on the final layer.
predictor_model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(1)
])
adversary_model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(1)
])
mitigator = AdversarialFairnessClassifier(
    predictor_model=predictor_model,
    adversary_model=adversary_model
)
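An equivalent pair of models can be defined with PyTorch. This is an illustrative sketch, not required by the library; torch.nn.LazyLinear is used here only so the input width is inferred from the data, and again the final layers have no activation:

import torch

from fairlearn.adversarial import AdversarialFairnessClassifier

predictor_model = torch.nn.Sequential(
    torch.nn.LazyLinear(50),   # input width inferred at first forward pass
    torch.nn.ReLU(),
    torch.nn.Linear(50, 1),    # raw score, no final activation
)
adversary_model = torch.nn.Sequential(
    torch.nn.LazyLinear(3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 1),
)
mitigator = AdversarialFairnessClassifier(
    predictor_model=predictor_model,
    adversary_model=adversary_model,
)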
For simple or exploratory use cases, Fairlearn provides a very basic neural network builder. Instead of a neural network model, it is possible to pass a list \([k_1, k_2, \dots]\), where each \(k_i\) indicates either the number of nodes (if \(k_i\) is an integer), an activation function (if \(k_i\) is a string), or a layer or activation function instance directly (if \(k_i\) is a callable). The number of nodes in the input and output layers is automatically inferred from the data, and the final activation function (such as softmax for categorical predictors) is also inferred from the data. So, in the following example, the predictor model is a neural network with an input layer of the appropriate number of nodes, a hidden layer with 50 nodes and ReLU activations, and an output layer with an appropriate activation function. In the case of classification, the appropriate function is softmax for a one-hot-encoded \(Y\) and sigmoid for a binary \(Y\):
mitigator = AdversarialFairnessClassifier(
    predictor_model=[50, "relu"],
    adversary_model=[3, "relu"]
)
Data types and loss functions#
We require the data \(X\) to be provided as a matrix (2d array-like) of floats; this data is passed directly to the neural network models.
Labels \(Y\) and sensitive features \(A\) are automatically preprocessed based on their type: binary data is represented as 0/1, categorical data is one-hot encoded, and float data is left unchanged.
Zhang et al.[1] do not explicitly define loss functions. In AdversarialFairnessClassifier and AdversarialFairnessRegressor, the loss functions are automatically inferred from the data type of the labels and sensitive features. For binary and categorical target variables, the training loss is cross-entropy; for float target variables, it is the mean squared error. To summarize:
| label \(Y\) | derived label \(Y'\) | network output \(Z\) | probabilistic prediction | loss function | prediction |
|---|---|---|---|---|---|
| binary | 0/1 | \(\mathbb{R}\) | \(\mathbb{P}(Y'=1)=1/(1+e^{-Z})\) | \(-Y'\log\mathbb{P}(Y'=1)-(1-Y')\log\mathbb{P}(Y'=0)\) | 1 if \(Z\ge 0\), else 0 |
| categorical (\(k\) values) | one-hot encoding | \(\mathbb{R}^k\) | \(\mathbb{P}(Y'=\mathbf{e}_j)=e^{Z_j}/\sum_{\ell=1}^k e^{Z_{\ell}}\) | \(-\sum_{j=1}^k Y'_j\log\mathbb{P}(Y'=\mathbf{e}_j)\) | \(\text{argmax}_j\,Z_j\) |
| continuous (in \(\mathbb{R}^k\)) | unchanged | \(\mathbb{R}^k\) | not available | \(\Vert Z-Y\Vert^2\) | \(Z\) |
The label is treated as binary if it takes on two distinct int or str values, as categorical if it takes on \(k\) distinct int or str values (with \(k>2\)), and as continuous if it is a float or a vector of floats. Sensitive features are treated similarly.
Note: currently, all data needs to be passed to the model in the first call to fit.
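For instance, a minimal end-to-end call might look as follows. This is a sketch with synthetic toy data and hypothetical column names; the binary \(Y\) leads to a cross-entropy loss, and the categorical \(A\) is one-hot encoded internally:

import numpy as np
import pandas as pd

from fairlearn.adversarial import AdversarialFairnessClassifier

rng = np.random.default_rng(0)

# Synthetic toy data: X is a float matrix, Y is binary, A is categorical.
X = pd.DataFrame(rng.normal(size=(200, 5)).astype("float32"),
                 columns=[f"x{i}" for i in range(5)])
y = rng.integers(0, 2, size=200)                   # binary labels
A = rng.choice(["group_a", "group_b"], size=200)   # sensitive feature

mitigator = AdversarialFairnessClassifier(
    predictor_model=[50, "relu"],
    adversary_model=[3, "relu"],
)
# All data is passed in the first call to fit.
mitigator.fit(X, y, sensitive_features=A)
predictions = mitigator.predict(X)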
Training#
Adversarial learning is inherently difficult because of various issues, such as mode collapse, divergence, and diminishing gradients. Mode collapse is the scenario where the predictor learns to produce one output, and because it does this relatively well, it never learns any other output. Diminishing gradients are common as well, and can be due to an adversary that is trained too well in comparison to the predictor. Such problems have been studied extensively by others, so we encourage the user to consult more extensive sources for remedies. As a general rule of thumb, adversarial training works best with a lower and possibly decaying learning rate, while keeping the predictor and adversary losses balanced. Tracking validation accuracy every few iterations may also save you a lot of headaches if the model suddenly diverges or collapses.
Some pieces of advice regarding training with adversarial fairness:
For some tabular datasets, we found that single hidden layer neural networks are easier to train than deeper networks.
Validate your model! Provide this model with a callback function in the constructor’s keyword callbacks (see Fine Tuning AdversarialFairnessClassifier). Optionally, have this function return True to indicate early stopping.
Zhang et al.[1] found it useful to maintain a global step count and gradually increase \(\alpha\) while decreasing the learning rate \(\eta\), taking \(\alpha \eta \rightarrow 0\) as the global step count increases. In particular, use a callback function to perform these hyperparameter updates; an example can be seen in the example notebook, and a sketch of a validation callback follows this list.
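As an illustration, a simple validation callback might look like the following. This is a hedged sketch: the exact callback signature should be checked against the API reference, and make_validation_callback, its thresholds, and the held-out data X_val/y_val are hypothetical:

from sklearn.metrics import accuracy_score

def make_validation_callback(X_val, y_val, stop_below=0.55, every=50):
    """Build a callback to pass via the `callbacks` constructor keyword.

    Assumption (check the API reference): the callback receives the
    estimator as its first argument and may return True to stop early.
    """
    state = {"step": 0}

    def callback_fn(model, *args):
        state["step"] += 1
        if state["step"] % every == 0:
            # Track validation accuracy to catch divergence or mode collapse.
            accuracy = accuracy_score(y_val, model.predict(X_val))
            print(f"step {state['step']}: validation accuracy = {accuracy:.3f}")
            if accuracy < stop_below:   # hypothetical threshold
                return True             # request early stopping
        return False

    return callback_fn

The resulting function can then be passed at construction time, for example callbacks=[make_validation_callback(X_val, y_val)].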
Refer to the following examples for more details: