scikit-learn: API for returning datasets as DataFrames

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).
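As an illustration of the requested behaviour, here is a minimal sketch using `load_iris`, assuming a scikit-learn release that ships the `as_frame` parameter (0.23 or later) and an installed pandas. It uses the boolean `return_X_y=True`, not the `return_X_y='frame'` spelling floated above, which is only a suggestion in this issue.

```python
from sklearn.datasets import load_iris

# as_frame=True asks the loader for pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)

X = iris.data        # DataFrame whose columns come from feature_names
y = iris.target      # Series holding the target
frame = iris.frame   # single DataFrame with the features and the target

# Combined with return_X_y=True, the loader returns the (X, y) pair directly.
X2, y2 = load_iris(return_X_y=True, as_frame=True)

print(frame.head())
```

The returned object is still a Bunch; the DataFrame is exposed through its attributes rather than replacing it.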

About this issue

State: closed
Created 6 years ago
Reactions: 1
Comments: 79 (57 by maintainers)

Most upvoted comments

Just an FYI for everybody: the decision was to include the target in the dataframe.

amueller on Nov 2, 2019

ok, then @tianchuliang and I would take this for the sprint

bsipocz on Jun 6, 2020

Be sure to look at https://github.com/scikit-learn/scikit-learn/issues/10733#issuecomment-640086273 when working on this locally.

thomasjpfan on Jun 6, 2020

@jaketae go for it!

bsipocz on Jun 6, 2020

@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!

cmarmo on Jun 6, 2020

@thomasjpfan sorry I picked it up since it was still on the to do pane… okay will leave you guys to work on it

erodrago on Jun 6, 2020

@cmarmo @bsipocz and I will basically split the 4 half and half: I will work on the first 2, and @bsipocz will work on the second 2.

tianchuliang on Jun 6, 2020

When testing, be sure to set SKLEARN_SKIP_NETWORK_TESTS=0 and to use the fixtures in sklearn/datasets/tests/conftest.py.

thomasjpfan on Jun 6, 2020

@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you’re planning to work on it?

Suggesting for the sprint. 😃

cmarmo on Jun 6, 2020

Here are some dataset fetchers that would still benefit from a treatment similar to #15980 and #15950.

fetch_covtype, but we would need to hunt down the original website to find the feature names (and data types);
fetch_kddcup99: this one is “interesting” as it can already return a recarray with a structured dtype with named int or float fields that should be mapped to similarly named and typed columns in a dataframe;
fetch_20newsgroups_vectorized that should use the tokenized words as column names;
~~fetch_rcv1 similar but I am not sure we can retrieve the words…~~

There are weirder dataset loaders though:

fetch_olivetti_faces, fetch_lfw_pairs and fetch_lfw_people load pixel values hence would not really benefit from being loaded as a dataframe.
fetch_species_distributions is weird and would require its own issue to specify what we would want to do with it.
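To make the fetch_kddcup99 point concrete, here is a small sketch of how a record array with a structured dtype maps onto named, typed DataFrame columns. The three fields and their values are invented stand-ins for illustration, not real rows from the dataset, and this is not how the fetcher itself is implemented.

```python
import numpy as np
import pandas as pd

# Toy structured array: each field has a name and its own dtype,
# similar in spirit to what a recarray-returning loader provides.
records = np.array(
    [(0, 181, 0.0), (0, 239, 0.5), (2, 235, 1.0)],
    dtype=[("duration", "i8"), ("src_bytes", "i8"), ("serror_rate", "f8")],
)

# from_records uses the field names as column names and keeps
# the per-field dtypes for these numeric fields.
frame = pd.DataFrame.from_records(records)

print(frame.dtypes)
```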

ogrisel on Jun 6, 2020

@wconnell yeah that sounds about right.

@gitsteph yes we should return a bunch containing a dataframe I think.

amueller on Nov 2, 2019

@gitsteph not sure I follow. I think for many of the cases (though probably not for openml) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.

amueller on Nov 2, 2019

Referencing the example from the linked issue’s implementation (#13902), which adds as_frame to fetch_openml – it seems that some of these datasets have different formats that require custom handling. For example the openml dataset has an ARFF file format, while the one (_california_housing.py) I’m working on uses RST.

I agree that it’d be good to follow a standard and generalized naming convention and pattern, though I don’t think we could simply call a shared _bunch_to_dataframe without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds frame to the bunch, rather than converting the bunch to dataframe as the proposed name (_bunch_to_dataframe) implies. I propose an abstract method called _convert_data_dataframe, which is a generalization of openml.py’s _convert_arff_data_dataframe method.

gitsteph on Nov 2, 2019

Hey guys, I think we should write a general _bunch_to_dataframe() function that can be called in all of these loaders, instead of writing separate code in each loader. Are we on the same page?
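For readers following the discussion, a rough sketch of what such a shared helper might look like. The function name, its signature, and the toy Bunch-like object are all hypothetical and are not the code that was merged into scikit-learn. It only assumes the object exposes `data`, `target` and `feature_names`, and it puts the target in the frame, in line with the decision recorded elsewhere in this thread.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def bunch_to_dataframe(bunch, target_name="target"):
    """Hypothetical helper: build one DataFrame from a Bunch-like object."""
    # Feature columns are named after the bunch's feature_names.
    frame = pd.DataFrame(bunch.data, columns=list(bunch.feature_names))
    # The target is appended as one more column.
    frame[target_name] = bunch.target
    return frame


# Toy stand-in for a Bunch (attribute access only), with made-up values.
toy = SimpleNamespace(
    data=np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]),
    target=np.array([0, 0, 1]),
    feature_names=["sepal_length", "sepal_width"],
)

frame = bunch_to_dataframe(toy)
print(frame)
```

A helper like this covers loaders whose data is already a dense numeric array. Loaders with other source formats would still need their own handling, which is the concern raised in the reply to this comment.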

wconnell on Nov 2, 2019

you are welcome to

jnothman on Mar 1, 2018