fairlearn.datasets.fetch_diabetes_hospital

fairlearn.datasets.fetch_diabetes_hospital(*, as_frame=True, cache=True, data_home=None, return_X_y=False)

Load the preprocessed Diabetes 130-Hospitals dataset (binary classification).

Download it if necessary.

Samples total: 101766
Dimensionality: 24
Features: numeric, categorical, string
Classes: 2

Source: UCI Repository [1]. Paper: Strack et al., 2014 [2].

The “Diabetes 130-Hospitals” dataset represents 10 years of clinical care at 130 U.S. hospitals and delivery networks, collected from 1999 to 2008. Each record represents the hospital admission record for a patient diagnosed with diabetes whose stay lasted between one and fourteen days.

The original “Diabetes 130-Hospitals” dataset was collected by Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore in 2014.

This version of the dataset was derived by the Fairlearn team for the SciPy 2021 tutorial “Fairness in AI Systems: From social context to practice using Fairlearn”. In this version, the target variable “readmitted” is binarized into whether the patient was re-admitted within thirty days. The full pre-processing script is available here.
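The binarization described above can be sketched as follows. This is a minimal illustration, not the Fairlearn pre-processing script: it assumes the raw "readmitted" labels are "<30", ">30", and "NO", as in the original UCI data, and the helper name `binarize_readmitted` and the sample values are hypothetical.

```python
from collections import Counter

# Assumed raw labels from the original UCI column "readmitted":
# "<30" (readmitted within 30 days), ">30" (readmitted later), "NO" (not readmitted).
RAW_READMITTED = ["NO", ">30", "<30", "NO", "<30", ">30"]


def binarize_readmitted(values):
    """Map raw readmission labels to 1 (within 30 days) or 0 (otherwise)."""
    return [1 if value == "<30" else 0 for value in values]


binary = binarize_readmitted(RAW_READMITTED)
print(binary)           # [0, 0, 1, 0, 1, 0]
print(Counter(binary))  # class balance of the toy sample
```

Under this encoding, readmissions after more than thirty days fall in the same class as no readmission at all, which is worth keeping in mind when interpreting the positive rate.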

Read more in the User Guide.

New in version 0.8.0.

Parameters
  • as_frame (bool, default=True) –

    If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical).

    Note

    If set to False, this will raise an exception because of a type mismatch in the OpenML dataset.

    New in version 0.9.0.

  • cache (bool, default=True) – Whether to cache downloaded datasets using joblib.

  • data_home (str, default=None) – Specify another download and cache folder for the datasets. By default, all fairlearn data is stored in ‘~/.fairlearn-data’ subfolders.

  • return_X_y (bool, default=False) – If True, returns (data.data, data.target) instead of a Bunch object.
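As a rough sketch of how the parameters above combine, the wrapper below requests the `(data, target)` tuple directly. The wrapper name `load_features_and_target` and its `fetcher` argument are hypothetical and exist only so the call can be exercised without a download; the keyword arguments are the ones listed in the signature above.

```python
def load_features_and_target(fetcher=None, data_home=None):
    """Return (X, y) for the Diabetes 130-Hospitals dataset.

    `fetcher` is a hypothetical hook for substituting a stand-in function;
    when omitted, fairlearn.datasets.fetch_diabetes_hospital is used.
    """
    if fetcher is None:
        # Imported lazily so the sketch can be read without fairlearn installed.
        from fairlearn.datasets import fetch_diabetes_hospital as fetcher

    # All parameters are keyword-only. as_frame=True is kept at its default
    # because as_frame=False is documented to raise an exception.
    X, y = fetcher(
        as_frame=True,
        cache=True,
        data_home=data_home,
        return_X_y=True,
    )
    return X, y
```

Calling `load_features_and_target()` with no arguments would download the data on first use and, with `cache=True`, reuse the cached copy afterwards; passing `data_home` redirects the download and cache folder.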

Returns

  • dataset (Bunch) – Dictionary-like object, with the following attributes:

    data : ndarray, shape (101766, 24)

    Each row corresponds to the 24 feature values in order. If as_frame is True, data is a pandas DataFrame.

    target : numpy array of shape (101766,)

    Each value indicates whether the patient was readmitted within 30 days of release.

    feature_names : list of length 24

    Ordered list of the feature names used in the dataset.

    DESCR : string

    Description of the Diabetes 130-Hospitals dataset.

    categories : dict or None

    Maps each categorical feature name to a list of values, such that the value encoded as i is ith in the list. If as_frame is True, this is None.

    frame : pandas DataFrame

    Only present when as_frame is True. DataFrame with data and target.

  • (data, target) (tuple if return_X_y is True)
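The attributes listed above can be inspected along the following lines. This is a sketch only: `summarize_bunch` is a hypothetical helper, and the toy object stands in for the returned Bunch so the example runs without a download; its feature names and target values are illustrative, not taken from the dataset.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_bunch(dataset):
    """Collect a few basic facts from a Bunch-like dataset object."""
    return {
        "n_samples": len(dataset.target),
        "n_features": len(dataset.feature_names),
        # Counter iterates over the target values (works for lists and pandas Series).
        "class_counts": dict(Counter(dataset.target)),
        # `frame` is only present when as_frame is True.
        "has_frame": getattr(dataset, "frame", None) is not None,
    }


# Toy stand-in with the same attribute names as the documented Bunch.
toy = SimpleNamespace(
    feature_names=["age", "gender", "time_in_hospital"],
    target=[0, 1, 0, 0],
    frame=None,
)
summary = summarize_bunch(toy)
print(summary)
```

With the real dataset, the same helper could be applied to the object returned by `fetch_diabetes_hospital()`, where the sample and feature counts would be expected to match the 101766 and 24 stated above.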

Notes

Our API largely follows the API of sklearn.datasets.fetch_openml().

References

[1] Beata Strack, Jonathan Deshazo, Chris Gennings, Juan Luis Olmo Ortiz, Sebastian Ventura, Krzysztof Cios, and John Clore. Diabetes 130-US hospitals for years 1999-2008 data set. 05 2014. URL: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008.

[2] Beata Strack, Jonathan Deshazo, Chris Gennings, Juan Luis Olmo Ortiz, Sebastian Ventura, Krzysztof Cios, and John Clore. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International, 2014:781670, 04 2014. doi:10.1155/2014/781670.