# Fairness in Machine Learning

## Fairness of AI systems

AI systems can behave unfairly for a variety of reasons. Sometimes it is because of societal biases reflected in the training data and in the decisions made during the development and deployment of these systems. In other cases, AI systems behave unfairly not because of societal biases, but because of characteristics of the data (e.g., too few data points about some group of people) or characteristics of the systems themselves. It can be hard to distinguish between these reasons, especially since they are not mutually exclusive and often exacerbate one another. Therefore, we define whether an AI system is behaving unfairly in terms of its impact on people — i.e., in terms of harms — and not in terms of specific causes, such as societal biases, or in terms of intent, such as prejudice.

**Usage of the word bias.** Since we define fairness in terms of harms
rather than specific causes (such as societal biases), we avoid the usage of
the words *bias* or *debiasing* in describing the functionality of Fairlearn.

## Types of harms

There are many types of harms (see, e.g., the keynote by K. Crawford at NeurIPS 2017). Some of these are:

- *Allocation harms* can occur when AI systems extend or withhold opportunities, resources, or information. Some of the key applications are in hiring, school admissions, and lending.
- *Quality-of-service harms* can occur when a system does not work as well for one person as it does for another, even if no opportunities, resources, or information are extended or withheld. Examples include varying accuracy in face recognition, document search, or product recommendation.
- *Stereotyping harms* can occur when a system suggests completions which perpetuate stereotypes. These are often seen when search engines propose completions to partially typed queries. See Umoja Noble [1] for an in-depth look at this issue. Note that even stereotypes which are nominally positive are problematic, since they still create expectations based on outward characteristics, rather than treating people as individuals.
- *Erasure harms* can occur when a system behaves as if groups (or their works) do not exist. For example, a text generator prompted about “Female scientists of the 1800s” might not produce a result. When asked about historical sites near St. Louis, Missouri, a search engine might fail to mention Cahokia. A similar query about southern Africa might overlook Great Zimbabwe, instead concentrating on colonial era sites. More subtly, a short biography of Alan Turing might not mention his sexuality.

This list is not exhaustive, and it is important to remember that harms are not mutually exclusive. A system can harm multiple groups of people in different ways, and can also visit multiple harms on a single group of people. The Fairlearn package is most applicable to allocation and quality-of-service harms, since these are the easiest to measure.

## Concept glossary

The concepts outlined in this glossary are relevant to sociotechnical contexts.

### Construct validity

In many cases, fairness-related harms can be traced back to the way a real-world problem is translated into a machine learning task. Which target variable do we intend to predict? What features will be included? What (fairness) constraints do we consider? Many of these decisions boil down to what social scientists refer to as measurement: the way we measure (abstract) phenomena.

The concepts outlined in this glossary give an introduction to the language of measurement modeling, as described in Jacobs and Wallach [2]. This framework can be a useful tool to test the validity of (implicit) assumptions of a problem formulation. In this way, it can help to mitigate fairness-related harms that can arise from mismatches between the formulation and the real-world context of an application.

#### Key Terms

- **Sociotechnical context** – The context surrounding a technical system, including both social aspects (e.g., people, institutions, communities) and technical aspects (e.g., algorithms, technical processes). The sociotechnical context of a system shapes who might benefit from or be harmed by AI systems.
- **Unobservable theoretical construct** – An idea or concept that is unobservable and cannot be directly measured but must instead be inferred through observable measurements defined in a measurement model.
- **Measurement model** – The method and approach used to measure the unobservable theoretical construct.
- **Construct reliability** – This can be thought of as the extent to which the measurements of an unobservable theoretical construct remain the same when measured at different points in time. A lack of construct reliability can be due either to a misalignment between the understanding of the unobservable theoretical construct and the methods being used to measure that construct, or to changes to the construct itself. Construct validity and construct reliability are complementary.
- **Construct validity** – This can be thought of as the extent to which the measurement model measures the intended construct in a way that is meaningful and useful.

#### Key Term Examples - Unobservable theoretical constructs and Measurement models

- **Fairness** is an example of an unobservable theoretical construct. Several measurement models exist for measuring fairness, including demographic parity. Individual measurements may also be combined into a single measurement model that ultimately measures fairness. See `fairlearn.metrics` for more examples of measurement models for measuring fairness.
- **Teacher effectiveness** is an example of an unobservable theoretical construct. Common measurement models include student performance on standardized exams and qualitative feedback from the teacher’s students.
- **Socioeconomic status** is an example of an unobservable theoretical construct. A common measurement model includes annual household income.
- **Patient benefit** is an example of an unobservable theoretical construct. A common measurement model involves patient care costs. See [3] for a related example.

**Note:**
We cite several examples of unobservable theoretical constructs and measurement models for the purpose of explaining the key terms outlined above.
Please refer to Jacobs and Wallach [2] for more detailed examples.

#### What is construct validity?

Though Jacobs and Wallach [2] explore both construct reliability and construct validity, we focus below on construct validity, noting that both play an important role in understanding fairness in sociotechnical contexts. Jacobs and Wallach [2] offer a fairness-oriented conceptualization of construct validity that is helpful in thinking about fairness in sociotechnical contexts. We capture the idea in seven key parts that, when combined, can serve as a framework for analyzing an AI task and attempting to establish construct validity:

- **Face validity** – On the surface, how plausible do the measurements produced by the measurement model look?
- **Content validity** – This has three subcomponents:
  - **Contestedness** – Is there a single understanding of the unobservable theoretical construct, or is that understanding contested (and thus context dependent)?
  - **Substantive validity** – Can we demonstrate that the measurement model contains the observable properties and other unobservable theoretical constructs related to the construct of interest (and only those)?
  - **Structural validity** – Does the measurement model appropriately capture the relationships between the construct of interest and the measured observable properties and other unobservable theoretical constructs?
- **Convergent validity** – Do the measurements obtained correlate with other existing measurements from measurement models for which construct validity has been established?
- **Discriminant validity** – Do the measurements obtained for the construct of interest correlate with related constructs as appropriate?
- **Predictive validity** – Are the measurements obtained from the measurement model predictive of measurements of any relevant observable properties or other unobservable theoretical constructs?
- **Hypothesis validity** – This describes the nature of the hypotheses that could emerge from the measurements produced by the measurement model, and whether those are “substantively interesting”.
- **Consequential validity** – Identify and evaluate the consequences and societal impacts of using the measurements obtained from the measurement model. Framed as questions: how is the world shaped by using the measurements, and what world do we wish to live in?

**Note:** The order in which the parts above are explored and the way in which they are used may vary depending on the specific
sociotechnical context. This is only intended to explain the key concepts that could be used in a
framework for analyzing a task.

## Fairness assessment and unfairness mitigation

In Fairlearn, we provide tools to assess fairness of predictors for classification and regression. We also provide tools that mitigate unfairness in classification and regression. In both assessment and mitigation scenarios, fairness is quantified using disparity metrics as we describe below.

### Group fairness, sensitive features

There are many approaches to conceptualizing fairness. In Fairlearn, we follow
the approach known as group fairness, which asks: *Which groups of individuals
are at risk for experiencing harms?*

The relevant groups (also called subpopulations) are defined using **sensitive
features** (or sensitive attributes), which are passed to a Fairlearn
estimator as a vector or a matrix called `sensitive_features` (even if it is
only one feature). The term suggests that the system designer should be
sensitive to these features when assessing group fairness. Although these
features may sometimes have privacy implications (e.g., gender or age), in
other cases they may not (e.g., whether or not someone is a native speaker of
a particular language). Moreover, the word sensitive does not imply that
these features should not be used to make predictions – indeed, in some cases
it may be better to include them.
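
As a rough illustration of how a vector or a matrix of sensitive features can define groups, the sketch below maps each distinct combination of sensitive feature values to the rows that belong to it. The helper `group_indices` and the toy values are hypothetical and are not part of Fairlearn; the sketch assumes that, with several sensitive features, each distinct combination of values forms its own group.

```python
from collections import defaultdict


def group_indices(sensitive_features):
    """Map each distinct combination of sensitive feature values to row indices.

    `sensitive_features` holds one entry per individual: either a single value
    (one sensitive feature) or a tuple of values (several sensitive features).
    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for i, value in enumerate(sensitive_features):
        # Normalize a single value to a 1-tuple so both cases share one key type.
        key = value if isinstance(value, tuple) else (value,)
        groups[key].append(i)
    return dict(groups)


# One sensitive feature, given as a vector (toy values).
single = ["a", "b", "a", "b", "b"]
# Two sensitive features, given as a matrix with one row per individual.
multiple = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "x"), ("b", "y")]

groups_single = group_indices(single)
groups_multiple = group_indices(multiple)
print(groups_single)
print(groups_multiple)
```

With two features, the number of groups can grow quickly, and some groups may contain very few individuals, which makes per-group estimates less reliable.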

Fairness literature also uses the term *protected attribute* in a similar
sense as sensitive feature. The term is based on anti-discrimination laws
that define specific *protected classes*. Since we seek to apply group
fairness in a wider range of settings, we avoid this term.

### Parity constraints

Group fairness is typically formalized by a set of constraints on the behavior
of the predictor called **parity constraints** (also called criteria). Parity
constraints require that some aspect (or aspects) of the predictor behavior be
comparable across the groups defined by sensitive features.

Let \(X\) denote a feature vector used for predictions, \(A\) be a single sensitive feature (such as age or race), and \(Y\) be the true label. Parity constraints are phrased in terms of expectations with respect to the distribution over \((X,A,Y)\). For example, in Fairlearn, we consider the following types of parity constraints.

*Binary classification*:

- *Demographic parity* (also known as *statistical parity*): A classifier \(h\) satisfies demographic parity under a distribution over \((X, A, Y)\) if its prediction \(h(X)\) is statistically independent of the sensitive feature \(A\). This is equivalent to \(\mathbb{E}[h(X) \mid A=a] = \mathbb{E}[h(X)] \quad \forall a\). [4]
- *Equalized odds*: A classifier \(h\) satisfies equalized odds under a distribution over \((X, A, Y)\) if its prediction \(h(X)\) is conditionally independent of the sensitive feature \(A\) given the label \(Y\). This is equivalent to \(\mathbb{E}[h(X) \mid A=a, Y=y] = \mathbb{E}[h(X) \mid Y=y] \quad \forall a, y\). [4]
- *Equal opportunity*: A relaxed version of equalized odds that only considers conditional expectations with respect to positive labels, i.e., \(Y=1\). [5]
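
The classification constraints above can be checked empirically by estimating the conditional expectations from data. The following is a minimal sketch in plain Python on hypothetical toy data; the helper names are illustrative and are not Fairlearn's API. It estimates the per-group selection rate (for demographic parity) and the per-group rates conditioned on the label (for equalized odds and equal opportunity).

```python
def conditional_mean(values, mask):
    """Average of `values` over the positions where `mask` is True."""
    selected = [v for v, m in zip(values, mask) if m]
    return sum(selected) / len(selected)


def selection_rates(y_pred, groups):
    """Empirical estimate of E[h(X) | A=a] for every group a."""
    return {
        a: conditional_mean(y_pred, [g == a for g in groups])
        for a in sorted(set(groups))
    }


def rates_given_label(y_true, y_pred, groups, label):
    """Empirical estimate of E[h(X) | A=a, Y=label] for every group a."""
    return {
        a: conditional_mean(
            y_pred, [g == a and y == label for g, y in zip(groups, y_true)]
        )
        for a in sorted(set(groups))
    }


# Hypothetical toy data: two groups, binary labels and binary predictions.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

sel = selection_rates(y_pred, groups)                        # demographic parity
tpr = rates_given_label(y_true, y_pred, groups, label=1)     # equal opportunity
fpr = rates_given_label(y_true, y_pred, groups, label=0)     # with tpr: equalized odds
print(sel, tpr, fpr)
```

In this toy data both groups have the same selection rate, so demographic parity holds, yet the rates conditioned on the label differ between groups, so equalized odds and equal opportunity do not. The constraints are therefore not interchangeable.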

*Regression*:

- *Demographic parity*: A predictor \(f\) satisfies demographic parity under a distribution over \((X, A, Y)\) if \(f(X)\) is independent of the sensitive feature \(A\). This is equivalent to \(\mathbb{P}[f(X) \geq z \mid A=a] = \mathbb{P}[f(X) \geq z] \quad \forall a, z\). [6]
- *Bounded group loss*: A predictor \(f\) satisfies bounded group loss at level \(\zeta\) under a distribution over \((X, A, Y)\) if \(\mathbb{E}[\mathrm{loss}(Y, f(X)) \mid A=a] \leq \zeta \quad \forall a\). [6]

Above, demographic parity seeks to mitigate allocation harms, whereas bounded group loss primarily seeks to mitigate quality-of-service harms. Equalized odds and equal opportunity can be used as a diagnostic for both allocation harms and quality-of-service harms.
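
For regression, the bounded group loss condition can be checked in the same empirical spirit: compute the average loss within each group and compare the worst one against the level \(\zeta\). The sketch below assumes squared error as the loss and uses hypothetical toy data; the function names are illustrative, not Fairlearn's API.

```python
def group_losses(y_true, y_pred, groups, loss):
    """Empirical estimate of E[loss(Y, f(X)) | A=a] for every group a."""
    totals, counts = {}, {}
    for y, p, g in zip(y_true, y_pred, groups):
        totals[g] = totals.get(g, 0.0) + loss(y, p)
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def satisfies_bounded_group_loss(losses, zeta):
    """True if every group's average loss is at most `zeta`."""
    return all(value <= zeta for value in losses.values())


def squared_error(y, p):
    # Assumed loss for this sketch; any per-example loss could be used instead.
    return (y - p) ** 2


# Hypothetical toy data: the predictor is noticeably worse on group "b".
reg_groups = ["a", "a", "b", "b"]
reg_true = [1.0, 2.0, 1.0, 2.0]
reg_pred = [1.5, 2.5, 1.0, 4.0]

losses = group_losses(reg_true, reg_pred, reg_groups, squared_error)
print(losses)
print(satisfies_bounded_group_loss(losses, zeta=1.0))
print(satisfies_bounded_group_loss(losses, zeta=2.0))
```

Because the condition bounds each group's loss separately, a predictor with a low overall loss can still violate it when one group is served much worse than the others.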

### Disparity metrics, group metrics

Disparity metrics evaluate how far a given predictor departs from satisfying a parity constraint. They can either compare the behavior across different groups in terms of ratios or in terms of differences. For example, for binary classification:

- *Demographic parity difference* is defined as \((\max_a \mathbb{E}[h(X) \mid A=a]) - (\min_a \mathbb{E}[h(X) \mid A=a])\).
- *Demographic parity ratio* is defined as \(\dfrac{\min_a \mathbb{E}[h(X) \mid A=a]}{\max_a \mathbb{E}[h(X) \mid A=a]}\).
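
Both quantities follow directly from the per-group selection rates. The sketch below computes them by hand on hypothetical toy data; `demographic_parity_summary` is an illustrative helper, not a Fairlearn function. Returning NaN when no group receives a positive prediction is an assumption of this sketch, since the ratio is undefined in that case.

```python
def demographic_parity_summary(y_pred, groups):
    """Return per-group selection rates plus their max-min difference and min/max ratio."""
    counts, positives = {}, {}
    for p, g in zip(y_pred, groups):
        counts[g] = counts.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + p
    rates = {g: positives[g] / counts[g] for g in counts}

    lowest, highest = min(rates.values()), max(rates.values())
    difference = highest - lowest
    # Assumption: report NaN when the largest selection rate is zero.
    ratio = lowest / highest if highest > 0 else float("nan")
    return rates, difference, ratio


# Hypothetical toy data: group "a" is selected three times as often as group "b".
dp_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
dp_pred = [1, 1, 1, 0, 1, 0, 0, 0]

rates, dp_difference, dp_ratio = demographic_parity_summary(dp_pred, dp_groups)
print(rates, dp_difference, dp_ratio)
```

A difference of 0 and a ratio of 1 both indicate that demographic parity holds exactly on the data at hand. The difference is on the scale of the selection rate, whereas the ratio is relative, so the two can rank predictors differently.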

The Fairlearn package provides the functionality to convert common accuracy
and error metrics from scikit-learn to *group metrics*, i.e., metrics that
are evaluated on the entire data set and also on each group individually.
Additionally, group metrics report the minimum and maximum metric values, the
groups for which these values were observed, and the difference and ratio
between the maximum and the minimum values. For more information refer to the
subpackage `fairlearn.metrics`.
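
To make the idea of a group metric concrete, the sketch below wraps an ordinary metric (here a hand-written accuracy) so that it is evaluated on the whole data set and on each group, and then reports the extremes, their difference, and their ratio. This is a plain-Python illustration on hypothetical toy data; `group_metric_report` and its return format are assumptions made for this example and do not reproduce the interface of `fairlearn.metrics`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def group_metric_report(metric, y_true, y_pred, groups):
    """Evaluate `metric` overall and per group, then summarize the disparity."""
    by_group = {}
    for a in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == a]
        by_group[a] = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])

    min_group = min(by_group, key=by_group.get)
    max_group = max(by_group, key=by_group.get)
    lowest, highest = by_group[min_group], by_group[max_group]
    return {
        "overall": metric(y_true, y_pred),
        "by_group": by_group,
        "min": (min_group, lowest),
        "max": (max_group, highest),
        "difference": highest - lowest,
        # Assumption: report NaN when the largest group value is zero.
        "ratio": lowest / highest if highest > 0 else float("nan"),
    }


# Hypothetical toy data: the classifier is less accurate on group "b".
gm_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gm_true = [1, 0, 1, 0, 1, 0, 1, 0]
gm_pred = [1, 0, 1, 1, 0, 1, 1, 0]

report = group_metric_report(accuracy, gm_true, gm_pred, gm_groups)
print(report)
```

The overall value alone hides the gap between groups; the per-group breakdown and the difference make a quality-of-service disparity visible.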