scikit-learn: API for returning datasets as DataFrames

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 79 (57 by maintainers)

Most upvoted comments

Just an FYI for everybody: the decision was to include the target into the dataframe.

ok, then @tianchuliang and I would take this for the sprint

@jaketae go for it!

@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!

@thomasjpfan sorry I picked it up since it was still on the to do pane… okay will leave you guys to work on it

@cmarmo @bsipocz and I will split the 4 half and half: I will work on the first 2, and @bsipocz will work on the second 2.

When testing, be sure to set SKLEARN_SKIP_NETWORK_TESTS=0 and use the fixtures in sklearn/datasets/tests/conftest.py.
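As a rough illustration of how an environment variable like this can gate network-dependent tests, here is a minimal sketch. The helper name network_tests_enabled and the default value "1" are assumptions made for the example; this is not a copy of the code in conftest.py.

```python
import os


def network_tests_enabled(environ=None):
    """Return True when network-dependent tests should run.

    Assumption for this sketch: tests that hit the network are skipped
    unless SKLEARN_SKIP_NETWORK_TESTS is explicitly set to "0".
    """
    # Allow a plain dict to be passed in so the logic is easy to check.
    environ = os.environ if environ is None else environ
    return environ.get("SKLEARN_SKIP_NETWORK_TESTS", "1") == "0"


# With the variable set to "0", the fetcher tests would be allowed to run.
print(network_tests_enabled({"SKLEARN_SKIP_NETWORK_TESTS": "0"}))
```

Passing the environment as an argument keeps the check independent of the real process environment, which makes the behaviour easy to verify in isolation.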

@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you’re planning to work on it?

Suggesting for the sprint. 😃

Here are some dataset fetchers that would still benefit from a treatment similar to #15980 and #15950.

  • fetch_covtype but we would need to hunt the original website to find the feature names (and data types);
  • fetch_kddcup99: this one is “interesting” as it can already return a recarray with a structured dtype with named int or float fields that should be mapped to similarly named and typed columns in a dataframe;
  • fetch_20newsgroups_vectorized that should use the tokenized words as column names;
  • fetch_rcv1 similar but I am not sure we can retrieve the words…
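For the structured-dtype case mentioned for fetch_kddcup99, the mapping from named fields to named columns can be sketched with plain NumPy and pandas. The field names and values below are made up for illustration and are not taken from the real dataset; the sketch only shows that pandas keeps field names and numeric types when given a structured array.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a structured record array. The fields and
# values are invented for this example.
records = np.array(
    [(0, 181.0, b"tcp"), (2, 239.5, b"udp")],
    dtype=[("duration", "<i8"), ("src_bytes", "<f8"), ("protocol_type", "S3")],
)

# Each named field becomes a column with the same name; the integer and
# float fields keep their NumPy dtypes.
frame = pd.DataFrame.from_records(records)

# Byte-string fields arrive as Python bytes objects, so decode them to
# get ordinary strings in the resulting column.
frame["protocol_type"] = frame["protocol_type"].str.decode("utf-8")

print(frame.dtypes)
```

A real loader would also have to decide whether string-like fields should become categorical columns, which this sketch leaves open.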

There are weirder dataset loaders though:

  • fetch_olivetti_faces, fetch_lfw_pairs and fetch_lfw_people load pixel values hence would not really benefit from being loaded as a dataframe.
  • fetch_species_distributions is weird and would require its own issue to specify what we would want to do with it.

@wconnell yeah that sounds about right.

@gitsteph yes we should return a bunch containing a dataframe I think.

@gitsteph not sure I follow. I think for many of the cases (though probably not for openml) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.

Referencing the example from the linked issue’s implementation (#13902), which adds as_frame to fetch_openml – it seems that some of these datasets have different formats that require custom handling. For example the openml dataset has an ARFF file format, while the one (_california_housing.py) I’m working on uses RST.

I agree that it’d be good to follow a standard and generalized naming convention and pattern, though I don’t think we could simply call a shared _bunch_to_dataframe without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds frame to the bunch, rather than converting the bunch to dataframe as the proposed name (_bunch_to_dataframe) implies. I propose an abstract method called _convert_data_dataframe, which is a generalization of openml.py’s _convert_arff_data_dataframe method.

Hey guys, I think we should write a general _bunch_to_dataframe() function that can be called in all of these loaders, instead of writing separate code in each loader. Are we on the same page?
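To make the idea concrete, a shared helper along these lines might look like the sketch below. The name bunch_to_dataframe_sketch, its signature, and the toy data are all assumptions for illustration; this is not the function that was eventually merged. It follows the decision noted earlier in the thread to include the target in the dataframe, and it assumes the loader already exposes dense arrays and feature names.

```python
import numpy as np
import pandas as pd


def bunch_to_dataframe_sketch(data, target, feature_names, target_name="target"):
    """Hypothetical helper: build one frame holding features and target."""
    # One named column per feature, taken from the loader's feature names.
    frame = pd.DataFrame(np.asarray(data), columns=list(feature_names))
    # Append the target as an extra column so everything lives in one frame.
    frame[target_name] = np.asarray(target)
    return frame


# Toy stand-in for the contents of a loader's Bunch; values are made up.
bunch = {
    "data": np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]),
    "target": np.array([0, 0, 1]),
    "feature_names": ["sepal_length", "sepal_width"],
}

frame = bunch_to_dataframe_sketch(
    bunch["data"], bunch["target"], bunch["feature_names"]
)

# Features and target can still be pulled apart by column name.
X = frame[bunch["feature_names"]]
y = frame["target"]
print(frame)
```

Loaders whose raw files need custom parsing, as raised above for the ARFF case, would still need their own conversion step before a helper like this could be applied.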

you are welcome to