scikit-learn: API for returning datasets as DataFrames
Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets that otherwise provide a Bunch with feature_names. This would be controlled by an as_frame parameter (though return_X_y='frame' would make the common usage more succinct).
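For illustration, here is a minimal sketch of how such an API reads; it uses load_iris purely as an example loader and assumes a scikit-learn version where as_frame has landed (0.23 or later).

```python
from sklearn.datasets import load_iris

# With as_frame=True the feature matrix comes back as a pandas DataFrame
# with named columns, and the target as a pandas Series.
X, y = load_iris(as_frame=True, return_X_y=True)
print(X.columns.tolist())  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(type(X).__name__, type(y).__name__)  # DataFrame Series
```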
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 79 (57 by maintainers)
Just an FYI for everybody: the decision was to include the target in the dataframe.
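As a concrete illustration of that decision (again using load_iris as a stand-in and assuming scikit-learn >= 0.23), the returned Bunch carries a frame attribute that holds the features together with the target column:

```python
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
# bunch.frame contains the feature columns *and* the target in one DataFrame.
print(bunch.frame.columns.tolist()[-1])  # 'target'
```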
ok, then @tianchuliang and I would take this for the sprint
Be sure to look at https://github.com/scikit-learn/scikit-learn/issues/10733#issuecomment-640086273 when working on this locally.
@jaketae go for it!
@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!
@thomasjpfan sorry, I picked it up since it was still on the to-do pane… okay, will leave you guys to work on it
@cmarmo @bsipocz and I will basically split the 4 half and half; I will work on the first 2 and @bsipocz will work on the other 2
When testing, be sure to set SKLEARN_SKIP_NETWORK_TESTS=0 and use the fixtures in sklearn/datasets/tests/conftest.py.
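As a sketch of what that looks like when running the dataset tests locally (the pytest target path is only an example, and this assumes the environment variable behaves as described above):

```python
import os
import subprocess

# Enable the network-dependent fetch_* tests that the conftest fixtures
# would otherwise skip.
env = dict(os.environ, SKLEARN_SKIP_NETWORK_TESTS="0")
subprocess.run(["pytest", "sklearn/datasets/tests/"], env=env, check=False)
```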
Suggesting for the sprint. 😃
Here are some dataset fetchers that would still benefit from a treatment similar to #15980 and #15950:
- fetch_covtype: we would need to hunt through the original website to find the feature names (and data types);
- fetch_kddcup99: this one is "interesting" as it can already return a recarray with a structured dtype whose named int or float fields should be mapped to similarly named and typed columns in a dataframe (see the sketch below);
- fetch_20newsgroups_vectorized: this one should use the tokenized words as column names;
- fetch_rcv1: similar, but I am not sure we can retrieve the words…

There are weirder dataset loaders though:
- fetch_olivetti_faces, fetch_lfw_pairs and fetch_lfw_people load pixel values, hence would not really benefit from being loaded as a dataframe;
- fetch_species_distributions is weird and would require its own issue to specify what we would want to do with it.

@wconnell yeah that sounds about right.
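To make the fetch_kddcup99 point concrete, here is a minimal sketch of mapping a structured NumPy array to a DataFrame with similarly named and typed columns; the field names and dtypes below are made up for illustration and are not the actual kddcup99 fields:

```python
import numpy as np
import pandas as pd

# A toy structured array standing in for what the fetcher already holds.
rec = np.array(
    [(0, 181.0, b"tcp"), (1, 239.0, b"udp")],
    dtype=[("duration", "i8"), ("src_bytes", "f8"), ("protocol_type", "S4")],
)

# Each named field becomes a column with a matching dtype.
frame = pd.DataFrame.from_records(rec)
print(frame.dtypes)
```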
@gitsteph yes we should return a bunch containing a dataframe I think.
@gitsteph not sure I follow. I think for many of the cases (though not for openml, probably) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.
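As a hedged illustration of that point (using load_wine only because it already exposes data, target and feature_names), building a named-column DataFrame from the existing Bunch is essentially a one-liner:

```python
import pandas as pd
from sklearn.datasets import load_wine

bunch = load_wine()
# The Bunch already carries the arrays and the standardized feature names,
# so a DataFrame with named columns falls out directly.
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["target"] = bunch.target
print(df.head())
```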
Referencing the example from the linked issue's implementation (#13902), which adds as_frame to fetch_openml – it seems that some of these datasets have different formats that require custom handling. For example, the openml dataset uses the ARFF file format, while the one I'm working on (_california_housing.py) uses RST. I agree that it would be good to follow a standard, generalized naming convention and pattern, though I don't think we could simply call a shared _bunch_to_dataframe without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds frame to the bunch, rather than converting the bunch to a dataframe as the proposed name (_bunch_to_dataframe) implies. I propose an abstract method called _convert_data_dataframe, which is a generalization of openml.py's _convert_arff_data_dataframe method.

Hey guys, I think we should write a general _bunch_to_dataframe() function that can be called in all of these loaders, instead of writing separate code in each loader. Are we on the same page?

you are welcome to
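For what it's worth, here is a rough sketch of what such a shared helper could look like. The name _convert_data_dataframe follows the proposal above, but the signature, the target_columns parameter and the returned (frame, X, y) triple are assumptions for illustration, not the actual scikit-learn implementation:

```python
def _convert_data_dataframe(caller_name, data, target, feature_names, target_columns):
    """Hypothetical shared helper: combine arrays into one DataFrame and
    return (frame, X, y) so each loader can populate its Bunch."""
    # Each loader would pass its own name so that a missing-pandas error
    # can say which fetcher needed the optional dependency.
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(f"{caller_name} with as_frame=True requires pandas.") from exc

    data_df = pd.DataFrame(data, columns=feature_names)
    target_df = pd.DataFrame(target, columns=target_columns)
    # One frame holding the features plus the target column(s), matching the
    # decision above to include the target in the dataframe.
    frame = pd.concat([data_df, target_df], axis=1)
    X = frame[feature_names]
    y = frame[target_columns].squeeze(axis="columns")  # Series for a single target
    return frame, X, y
```

A loader such as _california_housing.py could then, hypothetically, call this helper when as_frame=True and stash frame, data and target on the Bunch it returns.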