kedro: Make import failures in `kedro-datasets` clearer
Description
[!NOTE]
See https://github.com/kedro-org/kedro/issues/2943#issuecomment-1944348860 for current status
Disambiguate the “module does not exist” error from the ImportError in messages like:
DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.
This task will require investigation into how the error messages are swallowed, which will inform the proper implementation. Probably change: https://github.com/kedro-org/kedro/blob/4194fbd16a992af0320a395c0060aaaea356efb2/kedro/io/core.py#L433
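One way the disambiguation could look is sketched below. This is a minimal illustration, not kedro's actual implementation: `load_dataset_class` and the local `DatasetError` stand-in are hypothetical names, and the only library behaviour relied on is that `ModuleNotFoundError.name` holds the name of the module that could not be found.

```python
import importlib


class DatasetError(Exception):
    """Stand-in for kedro's DatasetError so the sketch is self-contained."""


def load_dataset_class(class_path: str):
    """Import ``package.module.ClassName`` and say precisely what went wrong.

    Hypothetical helper: kedro's real loader is structured differently.
    """
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise DatasetError(f"'{class_path}' is not a dotted path to a class.")

    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # If the missing module is the requested module or one of its parent
        # packages, the path itself is wrong (typo or package not installed).
        if module_path == missing or module_path.startswith(missing + "."):
            raise DatasetError(
                f"Module '{module_path}' does not exist. Is this a typo?"
            ) from exc
        # Otherwise the module was found, but something it imports is missing.
        raise DatasetError(
            f"Module '{module_path}' was found, but its dependency "
            f"'{missing}' is not installed."
        ) from exc

    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise DatasetError(
            f"Class '{class_name}' not found in module '{module_path}'. "
            "Is this a typo?"
        ) from exc
```

Other `ImportError` subclasses are deliberately left to propagate here, so the original traceback stays visible instead of being folded into a generic message.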
Context
This week I’ve been battling some puzzling documentation errors and after a while I noticed that, if the dependencies of a particular dataset are not present, the ImportError is swallowed silently. Examples:
This was done in https://github.com/quantumblacklabs/private-kedro/pull/575 to solve https://github.com/quantumblacklabs/private-kedro/issues/563 at the same time dependencies were moved to extras_require.
I see how not suppressing these errors could be extremely annoying back then, because kedro.io used to re-export all the datasets in its __init__.py:
# kedro/io/__init__.py
"""``kedro.io`` provides functionality to read and write to a
number of data sets. At core of the library is ``AbstractDataSet``
which allows implementation of various ``AbstractDataSet``s.
"""
from .cached_dataset import CachedDataSet # NOQA
from .core import AbstractDataSet # NOQA
from .core import AbstractVersionedDataSet # NOQA
from .core import DataSetAlreadyExistsError # NOQA
from .core import DataSetError # NOQA
from .core import DataSetNotFoundError # NOQA
from .core import Version # NOQA
from .data_catalog import DataCatalog # NOQA
from .data_catalog_with_default import DataCatalogWithDefault # NOQA
from .lambda_data_set import LambdaDataSet # NOQA
from .memory_data_set import MemoryDataSet # NOQA
from .partitioned_data_set import IncrementalDataSet # NOQA
from .partitioned_data_set import PartitionedDataSet # NOQA
from .transformers import AbstractTransformer # NOQA
However, now our __init__.py is empty and datasets are meant to be imported separately:
So I think it would be much better if we did not silence those import errors.
More context
If one dependency is missing, the user would get an unhelpful “module X has no attribute Y” when trying to import a dataset rather than an actual error:
> pip uninstall biopython (kedro-dev)
Found existing installation: biopython 1.81
Uninstalling biopython-1.81:
Would remove:
/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/Bio/*
/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/BioSQL/*
/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/biopython-1.81.dist-info/*
Proceed (Y/n)? y
Successfully uninstalled biopython-1.81
> python (kedro-dev)
Python 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:26:08) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from kedro_datasets.biosequence import BioSequenceDataSet
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BioSequenceDataSet' from 'kedro_datasets.biosequence' (/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/kedro_datasets/biosequence/__init__.py)
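The session above can be reproduced without kedro at all. The sketch below assumes the package's `__init__.py` wraps its import in `contextlib.suppress(ImportError)`, which is one way the swallowing described in this issue can happen. The package `demo_datasets`, the class `DemoDataSet`, and the missing dependency name are all made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package: __init__.py silences ImportError from its submodule.
INIT_SOURCE = """
from contextlib import suppress

with suppress(ImportError):
    from .impl import DemoDataSet
"""

# The submodule depends on a package that is (assumed to be) not installed.
IMPL_SOURCE = """
import a_dependency_that_is_not_installed

class DemoDataSet:
    pass
"""


def import_demo_dataset() -> str:
    """Build the package in a temp dir, try the import, return the message."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_datasets"
        pkg.mkdir()
        (pkg / "__init__.py").write_text(INIT_SOURCE)
        (pkg / "impl.py").write_text(IMPL_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            from demo_datasets import DemoDataSet  # noqa: F401
        except ImportError as exc:
            return str(exc)
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_datasets")]:
                del sys.modules[name]
    return "import succeeded"


print(import_demo_dataset())
```

The printed message is of the "cannot import name" kind and never mentions the missing dependency, which is the confusion this issue describes.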
About this issue
- State: open
- Created a year ago
- Comments: 18 (17 by maintainers)
For the record, the problem that we used to have is fixed; this is failing for a separate reason. This happened because the installation fails, so NOTHING is installed, not just hdf5. It is prompting for `typo` because `kedro-datasets` doesn’t exist, not because of an individual dataset dependency. You will see the same error if you put `type: RANDOM.padnas.CSVDataset`. We have two options here: `kedro-datasets` as a special case, and we check separately if `kedro-datasets` is importable.

I tend to think this is a niche case and the priority is lower. I'm removing the estimate and moving it back to `inbox`.

The installation issue was tracked in https://github.com/kedro-org/kedro-plugins/issues/402
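The "check separately whether `kedro-datasets` is importable" idea could be sketched with `importlib.util.find_spec`, which looks a top-level package up without importing it. The function `diagnose_dataset_type` and its message wording are hypothetical and not part of kedro.

```python
from importlib.util import find_spec


def diagnose_dataset_type(dataset_type: str, package: str = "kedro_datasets") -> str:
    """Return a hint that separates 'package missing' from 'bad dataset type'.

    Hypothetical helper: ``package`` is the importable top-level module name.
    """
    if find_spec(package) is None:
        # The whole package is absent, so no dataset type can resolve.
        return (
            f"Package '{package}' is not installed, "
            f"so '{dataset_type}' cannot be loaded."
        )
    return (
        f"Package '{package}' is installed; '{dataset_type}' may be mistyped "
        "or its own dependencies may be missing."
    )


# Usage: the second argument can be any top-level module name.
print(diagnose_dataset_type("pandas.CSVDataset", package="json"))
print(diagnose_dataset_type("pandas.CSVDataset", package="no_such_package_xyz_123"))
```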
For clarity, the problem here is that the “is this a typo?” error message seems to hint that the dataset name is mistyped, when in fact the dependencies are missing.
I’m just reopening.
😅 from now on I may just rephrase `partially fix` to `part of #xxx`.

- `kedro new --starter=pandas-iris`
- `kedro run` (everything works fine)
- `type: biosequence.BiosequenceDataSet` in `catalog.yml` (deliberate typo)
- `kedro run`:
- `pip install kedro-datasets[biosequence]`
- `kedro run`
- `type: biosequence.BioSequenceDataSet`
- `kedro run` works ✔️

(notice that soon this will happen with `DataSet` vs `Dataset`)

Will revisit that when we have fixed the dataset installation issue. If this is not the case yet, I think we are in a good position to solve this finally.
I would push for this if it’s possible. It is definitely one of the most annoying things, and as a beginner you will hit this issue for sure (almost as annoying as the DataSet vs Dataset typo issue).