kedro: Make sure all example code blocks for datasets are runnable
Description
Some of the code examples we provide in the API docs for datasets (https://docs.kedro.org/en/stable/kedro_datasets.html#module-kedro_datasets) aren’t actually runnable. Some datasets have simple, straightforward examples that can be copy-pasted and run straight away; others reference setup such as S3, but it isn’t made clear that those snippets can’t be run as is.
Implementation
Update all code snippets in the dataset API docs to basic examples that can be run. Where a simpler example doesn’t make sense, clarify that the snippet can’t be run as is and what additional setup would be needed.
Please also make sure the examples refer to kedro-datasets, not kedro.
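To make the goal concrete, here is a hedged sketch of what a copy-paste-runnable docstring example looks like and how it can be checked mechanically. The `InMemoryDataset` class and its docstring are made up for illustration (the real classes live in kedro-datasets, which isn’t assumed to be installed here); the validation step uses only the standard-library `doctest` module:

```python
import doctest


class InMemoryDataset:
    """Hypothetical stand-in for a dataset class (illustration only).

    The docstring example below needs no cloud or database setup,
    so it can be copy-pasted and run directly:

    >>> ds = InMemoryDataset()
    >>> ds.save([1, 2, 3])
    >>> ds.load()
    [1, 2, 3]
    """

    def __init__(self):
        self._data = None

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


# Run the docstring example the way a doctest-based CI check would:
# parse the docstring, execute each >>> line, and compare the output.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    InMemoryDataset.__doc__,
    {"InMemoryDataset": InMemoryDataset},
    "InMemoryDataset",
    None,
    0,
)
runner = doctest.DocTestRunner()
runner.run(test)
print(f"attempted={runner.tries}, failed={runner.failures}")
```

A snippet passes this check only if every line runs and produces exactly the output shown, which is the bar the issue asks all dataset examples to meet.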
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (13 by maintainers)
Failure details
- `kedro_datasets.dask.parquet_dataset.ParquetDataset`: It’s technically not runnable because `botocore` client creation fails, given AWS credentials from the environment and passed manually. However, the S3 use would fail regardless, so probably best to replace with a local example.
- `kedro_datasets.databricks.managed_table_dataset.ManagedTableDataset`: Seems it’s catching an error about a wrong write mode?
- `kedro_datasets.matplotlib.matplotlib_writer.MatplotlibWriter`: The examples are working, but need to make sure the output isn’t checked, using ELLIPSIS or something.
- `kedro_datasets.pandas.deltatable_dataset.DeltaTableDataset`: Not able to find `some_last_checkpoint`? Seems like a legit error at first glance.
- `kedro_datasets.pandas.gbq_dataset.GBQQueryDataset`: No actual BigQuery to connect to; assume this will have to be ignored.
- `kedro_datasets.pandas.gbq_dataset.GBQTableDataset`: Same as above.
- `kedro_datasets.pandas.generic_dataset.GenericDataset`: Haven’t looked into it, but I assume this is a bug due to not specifying params for reading/writing with pandas, and due to how the defaults are handled with `index`.
- `kedro_datasets.pandas.sql_dataset.SQLQueryDataset`: Not a valid connection string. This could potentially be done with SQLite or something.
- `kedro_datasets.pandas.sql_dataset.SQLTableDataset`: Same as above.
- `kedro_datasets.partitions.incremental_dataset.IncrementalDataset`: `key1`, etc. aren’t valid arguments to the filesystem constructor.
- `kedro_datasets.partitions.partitioned_dataset.PartitionedDataset`: Same as above.
- `kedro_datasets.pillow.image_dataset.ImageDataset`: Loading a nonexistent image. Maybe can use a public example image.
- `kedro_datasets.polars.lazy_polars_dataset.LazyPolarsDataset`: Seems like a bug, missing `file_format` argument.
- `kedro_datasets.redis.redis_dataset.PickleDataset`: Can’t connect to Redis; not sure if this is doable in a doctest.
- `kedro_datasets.spark.deltatable_dataset.DeltaTableDataset`: Delta connector needs to be installed? Not sure…
- `kedro_datasets.spark.spark_dataset.SparkDataset`: Example works; just need to ignore the output.
- `kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset`: No Hive support.
- `kedro_datasets.spark.spark_hive_dataset.SparkHiveDataset`: Easy first step: fix the import!
- `kedro_datasets.video.video_dataset.VideoDataset`: File doesn’t exist.

kedro-org/kedro-plugins#416 is a first attempt at validating using doctest. Example run: https://github.com/kedro-org/kedro-plugins/actions/runs/6634484238/job/18023981594?pr=416
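For the examples above whose output varies between runs (the MatplotlibWriter and SparkDataset cases), the standard-library `doctest` ELLIPSIS option can mask the unstable part of the expected output. A minimal sketch, using a process id as a stand-in for any run-to-run-variable value:

```python
import doctest

# A docstring example whose output differs on every run; the `...`
# placeholder plus the +ELLIPSIS directive lets doctest accept it.
snippet = """
>>> import os
>>> print("pid:", os.getpid())  # doctest: +ELLIPSIS
pid: ...
"""

# Parse and run the snippet exactly as a doctest runner would.
parser = doctest.DocTestParser()
test = parser.get_doctest(snippet, {}, "ellipsis_demo", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(f"attempted={runner.tries}, failed={runner.failures}")
```

The same directive applied to a `MatplotlibWriter` or `SparkDataset` example would let the snippet stay runnable without pinning unstable output.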
As @merelcht mentioned, some of the tests reference S3 or data files that don’t exist; many of these can probably be updated. In others, the issue is just that the correct output isn’t reflected. And in certain cases, the doctests seem to be catching legitimate mistakes (e.g. missing arguments).
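On the SQLite suggestion for `SQLQueryDataset`/`SQLTableDataset`: an in-memory SQLite database would give those examples a real connection with zero external setup. A rough sketch with the stdlib `sqlite3` module (the table name and rows are made up for illustration; the kedro-datasets classes would instead be pointed at a SQLite connection string):

```python
import sqlite3

# In-memory database: it exists only for the lifetime of the connection,
# so the example is fully self-contained and repeatable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shuttles (id INTEGER, name TEXT)")
con.execute("INSERT INTO shuttles VALUES (1, 'Columbia')")
rows = con.execute("SELECT id, name FROM shuttles").fetchall()
print(rows)  # [(1, 'Columbia')]
con.close()
```

This avoids both the invalid connection strings and any dependency on an external database server.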
Want to take a pause and check before investing more time on this: are we aligned on/okay with using doctest?
All fixable dataset docstrings are now fixed. The remaining examples all require complicated cloud/database client setup, which is overkill for the examples. I’ll close this as completed.
In addition, I think we should make sure the examples import from kedro-datasets, not kedro. I will add this to the requirements. @merelcht