fairlearn.datasets package
This module contains datasets that can be used for benchmarking and education.
- fairlearn.datasets.fetch_adult(*, cache=True, data_home=None, as_frame=False, return_X_y=False)
Load the UCI Adult dataset (binary classification).
Download it if necessary.
Samples total: 48842
Dimensionality: 14
Features: real
Classes: 2
Source: UCI Repository [1], Paper: R. Kohavi (1996) [2]
Prediction task is to determine whether a person makes over $50,000 a year.
- Parameters
cache (bool, default=True) – Whether to cache downloaded datasets using joblib.
data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all data is stored in ‘~/.fairlearn-data’ subfolders.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.
return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.
- Returns
dataset (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray, shape (48842, 14)
Each row corresponds to the 14 feature values in order. If as_frame is True, data is a pandas object.
- target: numpy array of shape (48842,)
Each value represents whether the person earns more than $50,000 a year (>50K) or not (<=50K). If as_frame is True, target is a pandas object.
- feature_names: list of length 14
Array of ordered feature names used in the dataset.
- DESCR: string
Description of the UCI Adult dataset.
(data, target) (tuple of (numpy.ndarray, numpy.ndarray) or (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is False
(data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
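The attributes listed above can be read off the returned object directly. The following is a minimal sketch of that access pattern, not part of the library: `summarize` is a hypothetical helper, and the `toy` object is a hand-written stand-in with made-up values so that the snippet runs without downloading anything. The real call is shown in the comment.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Return a few basic facts about a Bunch-like dataset object."""
    n_rows = len(dataset.data)
    n_features = len(dataset.feature_names)
    # Collect the distinct target labels in sorted order.
    labels = sorted(set(dataset.target))
    return {"rows": n_rows, "features": n_features, "labels": labels}


# Hypothetical stand-in with made-up values. The real object would come from:
#   from fairlearn.datasets import fetch_adult
#   dataset = fetch_adult()  # downloads the data on first use
toy = SimpleNamespace(
    data=[[39, "State-gov"], [50, "Self-emp-not-inc"], [38, "Private"]],
    target=[">50K", "<=50K", "<=50K"],
    feature_names=["age", "workclass"],
    DESCR="toy stand-in for the UCI Adult dataset",
)

summary = summarize(toy)
print(summary)
```

Against the real dataset, the documented shapes suggest `rows` would be 48842 and `features` would be 14.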
References
- 1
R. Kohavi and B. Becker, UCI Machine Learning Repository: Adult Data Set, 01-May-1996. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/adult.
- 2
R. Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid,” in Second International Conference on knowledge discovery and data mining: proceedings: August 2-4, 1996, Portland, Oregon, 1996, pp. 202–207.
- fairlearn.datasets.fetch_bank_marketing(*, cache=True, data_home=None, as_frame=False, return_X_y=False)
Load the UCI bank marketing dataset (binary classification).
Download it if necessary.
Samples total: 45211
Dimensionality: 17
Features: numeric, categorical
Classes: 2
Source: UCI Repository [3], Paper: Moro et al., 2014 [4]
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
- Parameters
cache (bool, default=True) – Whether to cache downloaded datasets using joblib.
data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all data is stored in ‘~/.fairlearn-data’ subfolders.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.
return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.
- Returns
dataset (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray, shape (45211, 17)
Each row corresponds to the 17 feature values in order. If as_frame is True, data is a pandas object.
- target: numpy array of shape (45211,)
Each value represents whether the client subscribed to a term deposit: ‘yes’ if the client subscribed and ‘no’ otherwise. If as_frame is True, target is a pandas object.
- feature_names: list of length 17
Array of ordered feature names used in the dataset.
- DESCR: string
Description of the UCI bank marketing dataset.
(data, target) (tuple of (numpy.ndarray, numpy.ndarray) or (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is False
(data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
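Since this is a binary classification task, a natural first check is how often the positive class occurs. The sketch below computes that rate from a target sequence. `subscription_rate` is a hypothetical helper, the sample labels are made up, and the ‘yes’/‘no’ encoding is taken from the description above; the actual label values may differ between data versions, so it is worth inspecting the distinct values of the target first.

```python
from collections import Counter


def subscription_rate(target, positive_label="yes"):
    """Return the fraction of entries in `target` equal to `positive_label`."""
    counts = Counter(str(label) for label in target)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("target is empty")
    return counts[positive_label] / total


# Made-up labels standing in for a few rows of the real target, which
# could be obtained (with a download on first use) via:
#   from fairlearn.datasets import fetch_bank_marketing
#   X, y = fetch_bank_marketing(return_X_y=True)
sample_target = ["no", "no", "yes", "no"]
rate = subscription_rate(sample_target)
print(rate)
```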
References
- 3
S. Moro, P. Cortez, and P. Rita, UCI Machine Learning Repository: Bank Marketing Data Set, 14-Feb-2014. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
- 4
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
- fairlearn.datasets.fetch_boston(*, cache=True, data_home=None, as_frame=False, return_X_y=False, warn=True)
Load the Boston housing dataset (regression).
Download it if necessary.
Samples total: 506
Dimensionality: 13
Features: real
Target: real, 5.0 to 50.0
Source: OpenML [5], Paper: D. Harrison (1978) [6]
The Boston house-price data of D. Harrison and D. L. Rubinfeld [6].
Referenced in Belsley, Kuh & Welsch, ‘Regression diagnostics…’, Wiley, 1980 [7].
This dataset has known fairness issues [8]. Pay particular attention to the “lower status of population” (LSTAT) variable and to the B column, which is derived from the proportion of Black people living in a neighborhood [9]. See the references at the bottom for more detailed information.
Here’s a table of all the variables in order:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000’s
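To make the definition of the B column concrete, the following sketch evaluates the formula 1000(Bk - 0.63)^2 from the table and shows that it cannot be inverted uniquely: two different proportions can produce the same B value. The function names `b_column` and `candidate_bk` are hypothetical and are not part of fairlearn.

```python
import math


def b_column(bk):
    """Compute B = 1000 * (Bk - 0.63)**2 for a proportion Bk in [0, 1]."""
    return 1000.0 * (bk - 0.63) ** 2


def candidate_bk(b):
    """Return the proportions in [0, 1] that would produce the given B value."""
    root = math.sqrt(b / 1000.0)
    candidates = {0.63 - root, 0.63 + root}
    return sorted(c for c in candidates if 0.0 <= c <= 1.0)


# Two different proportions, equally far from 0.63, give the same B value.
low, high = b_column(0.53), b_column(0.73)

# Going backwards from B = 10 therefore yields two candidate proportions.
both = candidate_bk(10.0)
print(low, high, both)
```

Because the squared term discards the sign of Bk - 0.63, the original proportion can only be recovered when one of the two candidates falls outside the interval [0, 1].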
- Parameters
cache (bool, default=True) – Whether to cache downloaded datasets using joblib.
data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all data is stored in ‘~/.fairlearn-data’ subfolders.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.
return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.
warn (bool, default=True) – If True, an extra warning is raised to make users aware of the fairness issues of this dataset.
- Returns
dataset (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray, shape (506, 13)
Each row corresponds to the 13 feature values in order. If as_frame is True, data is a pandas object.
- target: numpy array of shape (506,)
Each value corresponds to the median value of owner-occupied homes (MEDV) in units of $1,000. If as_frame is True, target is a pandas object.
- feature_names: list of length 13
Array of ordered feature names used in the dataset.
- DESCR: string
Description of the Boston housing dataset.
(data, target) (tuple of (numpy.ndarray, numpy.ndarray) or (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is False
(data, target) (tuple of (pandas.DataFrame, pandas.Series)) – if return_X_y is True and as_frame is True
Notes
This dataset consists of 506 samples and 13 features. It is notorious for the fairness issues related to the B column. There’s more information in the references.
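When studying these issues, it can be useful to set the flagged columns aside and compare results with and without them. The sketch below shows one way to do that using only the ordered feature names. `drop_columns` is a hypothetical helper, and the rows are made-up values rather than real data. Removing the columns does not by itself resolve the fairness concerns, since other variables may still carry related information.

```python
def drop_columns(rows, feature_names, to_drop=("B", "LSTAT")):
    """Return (rows, feature_names) with the named columns removed."""
    # Indices of the columns to keep, in their original order.
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    kept_names = [feature_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names


# Made-up two-row example using a subset of the documented column names.
# The real inputs could be obtained (with a download on first use) via:
#   from fairlearn.datasets import fetch_boston
#   dataset = fetch_boston(warn=False)  # warn=False suppresses the extra warning
#   rows, names = dataset.data, list(dataset.feature_names)
names = ["CRIM", "RM", "B", "LSTAT"]
rows = [[0.1, 6.5, 396.9, 5.0], [0.2, 6.1, 390.0, 9.1]]
new_rows, new_names = drop_columns(rows, names)
print(new_names, new_rows)
```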
References
- 5
J. Vanschoren, “boston,” OpenML, 29-Sep-2014. [Online]. Available: https://www.openml.org/d/531.
- 6
D. Harrison and D. L. Rubinfeld, “Hedonic housing prices and the demand for clean air,” Journal of Environmental Economics and Management, vol. 5, no. 1, pp. 81–102, Mar. 1978.
- 7
D. A. Belsley, E. Kuh, and R. E. Welsch, Regression diagnostics: identifying influential data and sources of collinearity. Hoboken, NJ: Wiley-Interscience, 1980.
- 8
J. Sykes, “- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town · Issue #16155 · scikit-learn/scikit-learn,” GitHub, 18-Jan-2020. [Online]. Available: https://github.com/scikit-learn/scikit-learn/issues/16155.
- 9
M. Carlisle, “racist data destruction?,” Medium, 13-Jun-2019. [Online]. Available: https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8.