fairlearn.datasets package

This module contains datasets that can be used for benchmarking and education.

fairlearn.datasets.fetch_acs_income(*, cache=True, data_home=None, as_frame=False, return_X_y=False, states=None)

Load the ACS Income dataset (regression).

Download it if necessary.

  • Samples total: 1664500

  • Dimensionality: 10

  • Features: numeric, categorical

  • Target: numeric

Source: Paper: Ding et al. (2021) [1] and corresponding repository zykls/folktables

Read more in the User Guide.

New in version 0.8.0.

Parameters
  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.

  • states (list, default=None) – List containing two letter (capitalized) state abbreviations. If None, data from all 50 US states and Puerto Rico will be returned. Note that Puerto Rico is the only US territory included in this dataset. The state abbreviations and codes can be found on page 1 of the data dictionary at ACS PUMS [2].

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes.

    data : ndarray, shape (1664500, 10)

    Each row corresponds to the 10 feature values in order. If as_frame is True, data is a pandas object.

    target : numpy array of shape (1664500,)

    Integer denoting each person’s annual income. A threshold can be applied as a postprocessing step to frame this as a binary classification problem. If as_frame is True, target is a pandas object.

    feature_names : list of length 10

    List of ordered feature names used in the dataset.

    DESCR : string

    Description of the ACSIncome dataset.

  • (data, target) (tuple of (numpy.ndarray, numpy.ndarray)) – if return_X_y is True and as_frame is False

  • (data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
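The thresholding step mentioned for target can be sketched as follows. This is only an illustration, not part of fairlearn: binarize_income and load_binary_acs_income are hypothetical helpers, and the 50,000 cutoff and the "CA" state are assumptions chosen for the example. The loader is wrapped in a function and not called here because it downloads data on first use.

```python
def binarize_income(income, threshold=50_000):
    """Label each income 1 if it exceeds ``threshold``, else 0.

    Works on any iterable of numbers. For a NumPy array or pandas Series,
    ``(income > threshold).astype(int)`` is the vectorized equivalent.
    """
    return [int(value > threshold) for value in income]


def load_binary_acs_income(states=("CA",), threshold=50_000):
    """Fetch ACS Income for the given states and binarize the target.

    Requires fairlearn >= 0.8.0 and network access on the first call.
    """
    from fairlearn.datasets import fetch_acs_income

    X, y = fetch_acs_income(states=list(states), as_frame=True, return_X_y=True)
    return X, binarize_income(y, threshold)


# The comparison is strict, so an income equal to the threshold maps to 0.
print(binarize_income([12_000, 50_000, 81_500]))  # [0, 0, 1]
```

Restricting states keeps the download and the resulting frame much smaller than the full 1664500-row dataset.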

References

[1] Ding, F., Hardt, M., Miller, J., & Schmidt, L. (2021). “Retiring Adult: New Datasets for Fair Machine Learning.” Advances in Neural Information Processing Systems, 34.

[2] “2018 ACS PUMS Data Dictionary”. United States Census Bureau.

fairlearn.datasets.fetch_adult(*, cache=True, data_home=None, as_frame=False, return_X_y=False)

Load the UCI Adult dataset (binary classification).

Read more in the User Guide.

Download it if necessary.

  • Samples total: 48842

  • Dimensionality: 14

  • Features: numeric, categorical

  • Classes: 2

Source: UCI Repository [1], Paper: R. Kohavi (1996) [2]

Prediction task is to determine whether a person makes over $50,000 a year.

New in version 0.5.0.

Parameters
  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes.

    data : ndarray, shape (48842, 14)

    Each row corresponds to the 14 feature values in order. If as_frame is True, data is a pandas object.

    target : numpy array of shape (48842,)

    Each value represents whether the person earns more than $50,000 a year (>50K) or not (<=50K). If as_frame is True, target is a pandas object.

    feature_names : list of length 14

    List of ordered feature names used in the dataset.

    DESCR : string

    Description of the UCI Adult dataset.

  • (data, target) (tuple of (numpy.ndarray, numpy.ndarray)) – if return_X_y is True and as_frame is False

  • (data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
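Because the target holds the two labels >50K and <=50K, a quick check of class balance is often the first thing to do after loading. The sketch below is a hypothetical helper, not a fairlearn function; class_shares and adult_class_shares are names introduced here for illustration, and the loader is left uncalled because it needs network access on first use.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples carrying each label."""
    counts = Counter(str(label).strip() for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def adult_class_shares():
    """Fetch the Adult target and report the share of each class.

    Requires fairlearn and network access on the first call.
    """
    from fairlearn.datasets import fetch_adult

    _, y = fetch_adult(as_frame=True, return_X_y=True)
    return class_shares(y)


# Toy labels standing in for the real target column.
print(class_shares([">50K", "<=50K", "<=50K", "<=50K"]))
# {'>50K': 0.25, '<=50K': 0.75}
```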

References

[1] R. Kohavi and B. Becker, UCI Machine Learning Repository: Adult Data Set, 01-May-1996. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/adult.

[2] R. Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid,” in Second International Conference on knowledge discovery and data mining: proceedings: August 2-4, 1996, Portland, Oregon, 1996, pp. 202–207.

fairlearn.datasets.fetch_bank_marketing(*, cache=True, data_home=None, as_frame=False, return_X_y=False)

Load the UCI bank marketing dataset (binary classification).

Download it if necessary.

  • Samples total: 45211

  • Dimensionality: 17

  • Features: numeric, categorical

  • Classes: 2

Source: UCI Repository [3], Paper: Moro et al., 2014 [4]

The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

New in version 0.5.0.

Parameters
  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes.

    data : ndarray, shape (45211, 17)

    Each row corresponds to the 17 feature values in order. If as_frame is True, data is a pandas object.

    target : numpy array of shape (45211,)

    Each value indicates whether the client subscribed to a term deposit: ‘yes’ if the client subscribed and ‘no’ otherwise. If as_frame is True, target is a pandas object.

    feature_names : list of length 17

    List of ordered feature names used in the dataset.

    DESCR : string

    Description of the UCI bank marketing dataset.

  • (data, target) (tuple of (numpy.ndarray, numpy.ndarray)) – if return_X_y is True and as_frame is False

  • (data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
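Since each row of data lists its values in the same order as feature_names, the two can be zipped together to get one labelled record per sample. The following is a minimal sketch under that assumption; rows_to_records and bank_marketing_records are hypothetical helpers, and the column names in the toy call are placeholders rather than the dataset's real feature names.

```python
def rows_to_records(data, feature_names):
    """Pair every row with the ordered feature names."""
    records = []
    for row in data:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match feature_names")
        records.append(dict(zip(feature_names, row)))
    return records


def bank_marketing_records(limit=5):
    """Return the first ``limit`` samples as dictionaries.

    Requires fairlearn, pandas, and network access on the first call.
    """
    from fairlearn.datasets import fetch_bank_marketing

    bunch = fetch_bank_marketing(as_frame=True)
    rows = bunch.data.head(limit).to_numpy()
    return rows_to_records(rows, bunch.feature_names)


# Toy rows with placeholder column names.
print(rows_to_records([[30, "yes"], [41, "no"]], ["age", "housing"]))
```

With as_frame=True the same result is available directly from pandas via DataFrame.to_dict("records"); the explicit loop is shown only to make the ordering contract visible.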

References

[3] S. Moro, P. Cortez, and P. Rita, UCI Machine Learning Repository: Bank Marketing Data Set, 14-Feb-2014. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

[4] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

fairlearn.datasets.fetch_boston(*, cache=True, data_home=None, as_frame=False, return_X_y=False, warn=True)

Load the Boston housing dataset (regression).

Download it if necessary.

  • Samples total: 506

  • Dimensionality: 13

  • Features: real

  • Target: real, 5. - 50.

Source: OpenML [5], Paper: D. Harrison (1978) [6]

The Boston house-price data of D. Harrison, and D.L. Rubinfeld [6].

Referenced in Belsley, Kuh & Welsch, ‘Regression diagnostics…’, Wiley, 1980. N.B. [7].

This dataset has known fairness issues [8]. There’s a “lower status of population” (LSTAT) parameter that you need to look out for and a column that is derived from the proportion of people with a black skin color that live in a neighborhood (B) [9]. See the references at the bottom for more detailed information.

Here’s a table of all the variables in order:

  • CRIM – per capita crime rate by town

  • ZN – proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS – proportion of non-retail business acres per town

  • CHAS – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX – nitric oxides concentration (parts per 10 million)

  • RM – average number of rooms per dwelling

  • AGE – proportion of owner-occupied units built prior to 1940

  • DIS – weighted distances to five Boston employment centres

  • RAD – index of accessibility to radial highways

  • TAX – full-value property-tax rate per $10,000

  • PTRATIO – pupil-teacher ratio by town

  • B – 1000(Bk - 0.63)^2 where Bk is the proportion of Black people by town

  • LSTAT – % lower status of the population

  • MEDV – median value of owner-occupied homes in $1000’s
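The definition of B given in the variable list can be worked through numerically. The sketch below is an illustration only, with b_column as a hypothetical helper name. It evaluates 1000(Bk - 0.63)^2 and shows that two different proportions equally far from 0.63 yield the same B, so Bk cannot in general be recovered from the column.

```python
import math


def b_column(black_proportion):
    """Compute B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    if not 0.0 <= black_proportion <= 1.0:
        raise ValueError("Bk must be a proportion between 0 and 1")
    return 1000 * (black_proportion - 0.63) ** 2


# 0.53 and 0.73 both lie 0.10 away from 0.63, so they map to (nearly) the
# same value; floating-point rounding is why isclose is used instead of ==.
low, high = b_column(0.53), b_column(0.73)
print(math.isclose(low, high))  # True

# The value is exactly zero at Bk = 0.63 and largest at Bk = 0.
print(b_column(0.63))  # 0.0
```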

Read more in the User Guide.

New in version 0.5.0.

Parameters
  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.

  • warn (bool, default=True) – If True, it raises an extra warning to make users aware of the unfairness aspect of this dataset.

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes.

    data : ndarray, shape (506, 13)

    Each row corresponds to the 13 feature values in order. If as_frame is True, data is a pandas object.

    target : numpy array of shape (506,)

    Each value corresponds to the median house value (MEDV) in units of $1,000. If as_frame is True, target is a pandas object.

    feature_names : list of length 13

    List of ordered feature names used in the dataset.

    DESCR : string

    Description of the Boston housing dataset.

  • (data, target) (tuple of (numpy.ndarray, numpy.ndarray)) – if return_X_y is True and as_frame is False

  • (data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True

Notes

This dataset consists of 506 samples and 13 features. It is notorious for the fairness issues related to the B column. There’s more information in the references.

References

[5] J. Vanschoren, “boston,” OpenML, 29-Sep-2014. [Online]. Available: https://www.openml.org/d/531.

[6] D. Harrison and D. L. Rubinfeld, “Hedonic housing prices and the demand for clean air,” Journal of Environmental Economics and Management, vol. 5, no. 1, pp. 81–102, Mar. 1978.

[7] D. A. Belsley, E. Kuh, and R. E. Welsch, Regression diagnostics: identifying influential data and sources of collinearity. Hoboken, NJ: Wiley-Interscience, 1980.

[8] J. Sykes, “- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town · Issue #16155 · scikit-learn/scikit-learn,” GitHub, 18-Jan-2020. [Online]. Available: scikit-learn/scikit-learn#16155.

[9] M. Carlisle, “racist data destruction?,” Medium, 13-Jun-2019. [Online]. Available: https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8.

fairlearn.datasets.fetch_diabetes_hospital(*, cache=True, data_home=None, return_X_y=False)

Load the preprocessed Diabetes 130-Hospitals dataset (binary classification).

Download it if necessary.

  • Samples total: 101766

  • Dimensionality: 25

  • Features: numeric, categorical, string

  • Classes: 2

Source: UCI Repository [1], Paper: Strack et al., 2014 [2]

The “Diabetes 130-Hospitals” dataset represents 10 years of clinical care at 130 U.S. hospitals and delivery networks, collected from 1999 to 2008. Each record represents the hospital admission record for a patient diagnosed with diabetes whose stay lasted between one and fourteen days.

The original “Diabetes 130-Hospitals” dataset was collected by Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore in 2014.

This version of the dataset was derived by the Fairlearn team for the SciPy 2021 tutorial “Fairness in AI Systems: From social context to practice using Fairlearn”. In this version, the target variable “readmitted” is binarized into whether the patient was re-admitted within thirty days. The full pre-processing script is available here.

Read more in the User Guide.

Note

The dataset is always returned as a pandas object because it contains string attributes, which are not supported in the array representation and would result in a ValueError.

New in version 0.8.0.

Parameters
  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes.

    data : pandas.DataFrame, shape (101766, 25)

    Each row corresponds to the 25 feature values in order.

    target : pandas.Series of shape (101766,)

    Each value represents whether readmission of the patient occurred within 30 days of the release.

    feature_names : list of length 25

    List of ordered feature names used in the dataset.

    DESCR : string

    Description of the Diabetes 130-Hospitals dataset.

  • (data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True
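With a binarized readmission target, a natural first look is the readmission rate within each group of a feature. The sketch below is not a fairlearn API: positive_rate_by_group and readmission_rate_by are hypothetical helpers, the column name passed to the loader is up to the caller, and the code assumes the target values are 0/1 (or booleans). The loader is left uncalled because it downloads data on first use.

```python
from collections import defaultdict


def positive_rate_by_group(groups, outcomes):
    """Return the share of positive outcomes within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        positives[group] += int(outcome)  # assumes 0/1 or boolean outcomes
    return {group: positives[group] / totals[group] for group in totals}


def readmission_rate_by(column):
    """Fetch the dataset and compute the readmission rate per value of ``column``.

    Requires fairlearn >= 0.8.0 and network access on the first call.
    """
    from fairlearn.datasets import fetch_diabetes_hospital

    X, y = fetch_diabetes_hospital(return_X_y=True)
    return positive_rate_by_group(X[column], y)


# Toy data standing in for one feature column and the target.
print(positive_rate_by_group(["a", "a", "b"], [1, 0, 1]))
# {'a': 0.5, 'b': 1.0}
```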

References

[1] Beata Strack, Jonathan Deshazo, Chris Gennings, Juan Luis Olmo Ortiz, Sebastian Ventura, Krzysztof Cios, and John Clore. Diabetes 130-US hospitals for years 1999-2008 data set. 05 2014. URL: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008.

[2] Beata Strack, Jonathan Deshazo, Chris Gennings, Juan Luis Olmo Ortiz, Sebastian Ventura, Krzysztof Cios, and John Clore. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International, 2014:781670, 04 2014. doi:10.1155/2014/781670.