ACSIncome
Introduction
The ACSIncome dataset is one of five datasets created by Ding et al. [1] as an improved alternative to the popular UCI Adult dataset [2]. The UCI Adult dataset is commonly used as a benchmark when comparing algorithmic fairness interventions. ACSIncome offers several improvements, including more data points (1,664,500 vs. 48,842) and more recent data (2018 vs. 1994). In addition, the binary labels in the UCI Adult dataset indicate whether an individual earned more than $50,000 in that year. Ding et al. show that the choice of threshold affects the disparity in the proportion of positive labels, so ACSIncome lets users define any threshold rather than fixing it at $50,000.
Ding et al. compiled the data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS). Note that this source differs from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS), which was used to construct the original UCI Adult dataset. Ding et al. filtered the data so that ACSIncome includes only individuals older than 16 who worked at least 1 hour per week in the past year and had an income of at least $100.
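The inclusion criteria above can be sketched as a pandas filter. This is only an illustration of the three conditions as described in the text, not the code Ding et al. used; the helper name `filter_acs_income` and the toy rows are hypothetical, and the column names are taken from the feature table in this document.

```python
import pandas as pd


def filter_acs_income(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that satisfy the inclusion criteria described in the text."""
    mask = (
        (df["AGEP"] > 16)       # older than 16
        & (df["WKHP"] >= 1)     # worked at least 1 hour per week
        & (df["PINCP"] >= 100)  # income of at least $100
    )
    return df.loc[mask].reset_index(drop=True)


# Hypothetical toy rows, not real survey records.
raw = pd.DataFrame(
    {
        "AGEP": [16, 17, 45, 30],
        "WKHP": [20, 0, 40, 35],
        "PINCP": [5000, 12000, 62000, 50],
    }
)

filtered = filter_acs_income(raw)
print(filtered)
```

The data returned by Fairlearn is already filtered, so a step like this would only matter when starting from raw PUMS files.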
Dataset Description
Ding et al. provide data for 2014 through 2018 covering all 50 states and Puerto Rico. Note that Puerto Rico is the only US territory included in this dataset. We uploaded the 2018 data to OpenML. The dataset contains 1,664,500 rows. Each row describes a person and contains 10 features, described below:
| Column name | Description |
|---|---|
| AGEP | Age as an integer from 0 to 99. |
| COW | Class of worker. |
| SCHL | Educational attainment. |
| MAR | Marital status. |
| OCCP | Occupation. There are over 500 categories. Please see the data dictionary in the ACS PUMS documentation [3] for the full list of occupation codes. |
| POBP | Place of birth. There are over 200 categories, including the 50 US states and several countries. Please see the data dictionary in the ACS PUMS documentation [3] for the full list. |
| RELP | Relationship to the householder. |
| WKHP | Usual hours worked per week in the past 12 months, as an integer from 1 to 99. Any value above 99 is capped at 99. |
| SEX | Sex. |
| RAC1P | Race (recoded detailed race code). |
The target label is given by PINCP. To keep the task general, PINCP is provided as a raw integer value rather than a precomputed binary label. A threshold can be applied to PINCP to frame this as a binary classification task.
| Column name | Description |
|---|---|
| PINCP | Total annual income per person, denoted as an integer ranging from 104 to 1,423,000. |
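As a minimal sketch of how a threshold turns PINCP into a binary label, and of how the proportion of positives per group shifts with that threshold, consider the example below. The helper `binarize_income`, the toy incomes, and the group codes in the `SEX` column are hypothetical and chosen only for illustration; they are not drawn from the dataset.

```python
import pandas as pd


def binarize_income(pincp: pd.Series, threshold: float = 50_000) -> pd.Series:
    """Return 1 where income is strictly above the threshold, else 0."""
    return (pincp > threshold).astype(int)


# Hypothetical toy data, not real survey records.
toy = pd.DataFrame(
    {
        "SEX": [1, 1, 1, 2, 2, 2],
        "PINCP": [30_000, 55_000, 90_000, 20_000, 45_000, 80_000],
    }
)

# Proportion of positive labels per group at several thresholds.
rates = {}
for threshold in (25_000, 50_000, 75_000):
    y = binarize_income(toy["PINCP"], threshold)
    rates[threshold] = y.groupby(toy["SEX"]).mean()

# Rows are groups, columns are thresholds.
rates = pd.DataFrame(rates)
print(rates)
```

In this toy sample the gap between the two groups differs from one threshold to the next, which is the kind of sensitivity that motivates leaving the threshold to the user.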
Using the dataset
The dataset can be loaded via the fairlearn.datasets.fetch_acs_income() function. By default, the dataset is returned as a pandas.DataFrame.
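A loading helper might look like the sketch below. The wrapper `load_acs_income` and its default threshold are hypothetical. The keyword arguments `as_frame`, `return_X_y`, and `states` are assumed to behave as in the installed Fairlearn version, so check the API reference before relying on them. Calling the helper downloads the data, which requires network access.

```python
def load_acs_income(threshold=50_000, states=None):
    """Fetch ACSIncome and return the features plus a binary income label.

    `states` is assumed to accept a list of two-letter state codes
    (or None for all states); verify this against the Fairlearn docs.
    """
    # Imported lazily so that defining this helper does not require fairlearn.
    from fairlearn.datasets import fetch_acs_income

    # With return_X_y=True the fetcher is assumed to return (features, target),
    # where the target is the raw PINCP income.
    X, y = fetch_acs_income(as_frame=True, return_X_y=True, states=states)

    # Apply the user-chosen threshold to obtain a binary label.
    y_binary = (y > threshold).astype(int)
    return X, y_binary
```

For example, `X, y = load_acs_income(threshold=50_000)` would reproduce the UCI Adult-style labeling on the 2018 data.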