kedro: Make import failures in `kedro-datasets` clearer

Description

[!NOTE]
See https://github.com/kedro-org/kedro/issues/2943#issuecomment-1944348860 for current status

Disambiguate the “module does not exist” error from the ImportError in messages like:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

This task will require investigation into how the error messages are swallowed, which will inform the proper implementation. Probably change: https://github.com/kedro-org/kedro/blob/4194fbd16a992af0320a395c0060aaaea356efb2/kedro/io/core.py#L433
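
As a rough illustration of the disambiguation asked for above, here is a minimal sketch. It is not Kedro's actual loader; `load_class` and its messages are hypothetical names chosen for this example. It relies on the standard `ModuleNotFoundError.name` attribute to tell "the requested module does not exist" apart from "the module exists but something it imports is missing".

```python
import importlib


def load_class(class_path: str):
    """Import ``package.module.ClassName`` and say *why* it failed, if it does.

    Hypothetical helper for illustration only, not part of Kedro.
    """
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise LookupError(f"'{class_path}' is not a dotted 'module.Class' path.")
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # If the missing module is the requested one (or a parent package),
        # the path itself is wrong. Otherwise something it imports is absent.
        if missing and (module_path == missing or module_path.startswith(missing + ".")):
            raise LookupError(
                f"Module '{module_path}' does not exist. Is this a typo?"
            ) from exc
        raise LookupError(
            f"Module '{module_path}' exists, but its dependency '{missing}' is not installed."
        ) from exc
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise LookupError(
            f"Module '{module_path}' has no class '{class_name}'. Is this a typo?"
        ) from exc
```

Note that this only helps if the dataset package lets the `ImportError` propagate; if the package swallows it, the loader never sees the missing dependency.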

Context

This week I’ve been battling some puzzling documentation errors and after a while I noticed that, if the dependencies of a particular dataset are not present, the ImportError is swallowed silently. Examples:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/biosequence/__init__.py#L7-L8

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/networkx/__init__.py#L8-L15

This was done in https://github.com/quantumblacklabs/private-kedro/pull/575 to solve https://github.com/quantumblacklabs/private-kedro/issues/563 at the same time dependencies were moved to extras_require.
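
To make the effect of swallowing concrete, the following sketch builds a throwaway package in a temporary directory and imports from it. The package name `demo_datasets`, the class `ThingDataset`, and the dependency `missing_dependency_for_demo` are all made up for this example; the `suppress(ImportError)` guard is an assumption about the general shape of the linked `__init__.py` files, not a copy of them.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package contents. The __init__.py silences any ImportError
# raised while importing the dataset class.
INIT_PY = (
    "from contextlib import suppress\n"
    "with suppress(ImportError):\n"
    "    from .thing_dataset import ThingDataset\n"
)
THING_DATASET_PY = (
    "import missing_dependency_for_demo  # deliberately not installed\n"
    "class ThingDataset: ...\n"
)


def error_seen_by_user() -> str:
    """Return the error text a user gets when the dependency is absent."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_datasets"
        pkg.mkdir()
        (pkg / "__init__.py").write_text(INIT_PY)
        (pkg / "thing_dataset.py").write_text(THING_DATASET_PY)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            from demo_datasets import ThingDataset  # noqa: F401
        except ImportError as exc:
            return f"{type(exc).__name__}: {exc}"
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_datasets")]:
                del sys.modules[name]
        return "import succeeded"


if __name__ == "__main__":
    print(error_seen_by_user())
```

The returned text names only the class that could not be imported; the name of the dependency that is actually missing never reaches the user.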

I see how not suppressing these errors could be extremely annoying back then, because kedro.io used to re-export all the datasets in its __init__.py:

https://github.com/quantumblacklabs/private-kedro/blob/f7dd2478aec4de1b46afbaded9bce3c69bff6304/kedro/io/__init__.py#L29-L47

# kedro/io/__init__.py

"""``kedro.io`` provides functionality to read and write to a
number of data sets. At core of the library is ``AbstractDataSet``
which allows implementation of various ``AbstractDataSet``s.
"""

from .cached_dataset import CachedDataSet  # NOQA
from .core import AbstractDataSet  # NOQA
from .core import AbstractVersionedDataSet  # NOQA
from .core import DataSetAlreadyExistsError  # NOQA
from .core import DataSetError  # NOQA
from .core import DataSetNotFoundError  # NOQA
from .core import Version  # NOQA
from .data_catalog import DataCatalog  # NOQA
from .data_catalog_with_default import DataCatalogWithDefault  # NOQA
from .lambda_data_set import LambdaDataSet  # NOQA
from .memory_data_set import MemoryDataSet  # NOQA
from .partitioned_data_set import IncrementalDataSet  # NOQA
from .partitioned_data_set import PartitionedDataSet  # NOQA
from .transformers import AbstractTransformer  # NOQA

However, now our __init__.py is empty and datasets are meant to be imported separately:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/__init__.py#L1-L3

So I think it would be much better if we did not silence those import errors.
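
One possible shape for "not silencing" is to let the failure surface with a hint attached. The helper below is a hypothetical sketch, not existing kedro-datasets code; `import_or_explain` is an invented name, and the suggested `pip install` command simply reuses the `kedro-datasets[<extra>]` form that appears elsewhere in this issue.

```python
import importlib
from types import ModuleType


def import_or_explain(module_name: str, extra: str) -> ModuleType:
    """Import a dependency, or fail loudly with an installation hint.

    Hypothetical helper: ``extra`` is assumed to be the name of the
    kedro-datasets extra that provides ``module_name``.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays in the traceback.
        raise ImportError(
            f"'{module_name}' is required but could not be imported. "
            f"Try: pip install 'kedro-datasets[{extra}]'"
        ) from exc
```

For example, a dataset module could call `import_or_explain("Bio", "biosequence")` instead of guarding the import, so that a missing biopython produces an error naming both the module and the extra.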

More context

If one dependency is missing, the user would get an unhelpful “module X has no attribute Y” when trying to import a dataset rather than an actual error:

> pip uninstall biopython
Found existing installation: biopython 1.81
Uninstalling biopython-1.81:
  Would remove:
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/Bio/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/BioSQL/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/biopython-1.81.dist-info/*
Proceed (Y/n)? y
  Successfully uninstalled biopython-1.81
> python
Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from kedro_datasets.biosequence import BioSequenceDataSet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BioSequenceDataSet' from 'kedro_datasets.biosequence' (/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/kedro_datasets/biosequence/__init__.py)

About this issue

  • State: open
  • Created a year ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

For the record, the problem that we used to have is fixed; this is failing for a separate reason. It happens because the installation fails, so NOTHING is installed, not just hdf5. The typo prompt appears because kedro-datasets doesn’t exist, not because of an individual dataset dependency.

You will see the same error if you put type: RANDOM.padnas.CSVDataset. We have two options here:

  1. Either treat kedro-datasets as a special case and check separately whether kedro-datasets is importable.
  2. Or remove the typo warning; this is still not good because it will not warn that “kedro-datasets” is not installed either.
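
Option 1 could look roughly like the sketch below. It uses the standard library's `importlib.util.find_spec` to check whether a top-level package can be found without importing it. The function name `explain_missing_class` and the message wording are hypothetical, and this is not how Kedro currently builds its error.

```python
from importlib.util import find_spec


def explain_missing_class(type_name: str, package: str = "kedro_datasets") -> str:
    """Build an error hint for a dataset type that could not be resolved.

    Hypothetical helper: ``package`` is the top-level package that is
    expected to provide the dataset classes.
    """
    if find_spec(package) is None:
        # The whole package is absent, so the type name is probably fine.
        return (
            f"Class '{type_name}' not found because the package "
            f"'{package}' is not installed."
        )
    return f"Class '{type_name}' not found in '{package}'. Is this a typo?"
```

With this check, a failed installation of kedro-datasets would be reported as such instead of being presented as a possible typo in the catalog entry.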

I tend to think this is a niche case and the priority is lower. I have removed the estimate and moved it back to the inbox.

The installation issue was tracked in https://github.com/kedro-org/kedro-plugins/issues/402

For clarity, the problem here is that the “is this a typo?” error message seems to hint that the dataset name is mistyped, when in fact the dependencies are missing.

I’m just reopening.

😅 From now on I may just rephrase “partially fix” to “part of #xxx”.

❯ kedro info
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: N
You have opted out of product usage analytics, so none will be collected.

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.18.14

Kedro is a Python framework for
creating reproducible, maintainable
and modular data science code.

Installed plugins:
kedro_telemetry: 0.2.4 (entry points:cli_hooks,hooks)
kedro_viz: 6.3.1 (entry points:global,line_magic)
❯ pip list | grep datasets
kedro-datasets                1.8.0

  1. kedro new --starter=pandas-iris
  2. kedro run (everything works fine)
  3. Change type: biosequence.BiosequenceDataSet in catalog.yml (deliberate typo)
  4. kedro run:
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  5. (confusion)
  6. pip install kedro-datasets[biosequence]
  7. kedro run
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  8. (confusion)
  9. change to type: biosequence.BioSequenceDataSet
  10. kedro run works ✔️

(Notice that soon this will also happen with DataSet vs Dataset.)
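
For the capitalisation case in the steps above, one way to make the message more helpful is to suggest close matches among the classes a module actually exposes. The sketch below uses the standard library's `difflib.get_close_matches`; the function `suggest_class` is a hypothetical name, and nothing here is taken from Kedro's code.

```python
import difflib
import inspect
from types import ModuleType
from typing import Optional


def suggest_class(module: ModuleType, wrong_name: str) -> Optional[str]:
    """Return the closest class name defined or exposed by ``module``, if any."""
    # Collect the names of all classes visible as attributes of the module.
    class_names = [
        name for name, obj in vars(module).items() if inspect.isclass(obj)
    ]
    matches = difflib.get_close_matches(wrong_name, class_names, n=1)
    return matches[0] if matches else None
```

This only works when the module imported successfully and the class is present as an attribute. If the class is absent because its import was suppressed, there is nothing to suggest, which is another reason to surface the missing dependency separately.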

We will revisit this once we have fixed the dataset installation issue. If this is not the case yet, I think we are in a good position to solve this finally.

I would push for this if it’s possible. It is definitely one of the most annoying things, and as a beginner you will hit this issue for sure (it is almost as annoying as the DataSet vs Dataset typo issue).