datasets: Datasets created with `push_to_hub` can't be accessed in offline mode

Describe the bug

In offline mode, one can still access previously-cached datasets. However, this fails for datasets created with push_to_hub.

Steps to reproduce the bug

in Python:

import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")

in bash:

export HF_DATASETS_OFFLINE=1

in Python:

import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")

Expected results

datasets should find the previously-cached dataset.

Actual results

ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'teven/matched_passages_wikidata': Offline mode is enabled

Environment info

  • datasets version: 1.16.2.dev0
  • Platform: Linux-4.18.0-193.70.1.el8_2.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • PyArrow version: 3.0.0

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 18 (11 by maintainers)

Commits related to this issue

Most upvoted comments

Hi! Yes, you can save your dataset locally with my_dataset.save_to_disk("path/to/local") and reload it later with load_from_disk("path/to/local").
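
For reference, a minimal sketch of this workaround, using the dataset from the original report (the local path is just a placeholder):

import datasets

# While online: download the dataset once and save a plain copy to disk
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")
mpwiki.save_to_disk("path/to/local")

# Later, with HF_DATASETS_OFFLINE=1 set: reload without contacting the Hub
mpwiki = datasets.load_from_disk("path/to/local")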

(removing myself from the assignees since I'm not currently working on this)

I can confirm that this problem is still happening with datasets 2.17.0, installed from pip

Hi, I’m having the same issue. Is there any update on this?

It seems that this does not support datasets with uppercase names.

I started a PR to draft the logic to reload datasets from the cache if they were created with push_to_hub: https://github.com/huggingface/datasets/pull/6459

Feel free to try it out

Any idea @lhoestq who to tag to fix this? This is a very annoying bug, and it is becoming more and more common as the push_to_hub API gets used more.

importable_directory_path is used to find a dataset script that was previously downloaded and cached from the Hub

However, in your case there's no dataset script on the Hub, only parquet files. So the logic must be extended for this case.

In particular, I think you can add new logic for the case where hashes is None (i.e. if there is no dataset script associated with the dataset in the cache).

In this case you can check directly in the datasets cache for a directory named <namespace>___parquet and a subdirectory named <config_id>. The config_id must match {self.name.replace("/", "--")}-*.

In your case those two directories are allenai___parquet and allenai--my_dataset-def9ee5552a1043e, respectively.

Then you can find the most recent version of the dataset in subdirectories (e.g. sorting using the last modified time of the dataset_info.json file).

Finally, we will need to return the module that is used to load the dataset from the cache. It is the same module as the one that would normally have been used if you had an internet connection.
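
A rough sketch of this lookup, assuming the cache is laid out exactly as described in this comment; the helper name and its return value are hypothetical, not actual datasets internals:

from pathlib import Path

def find_cached_parquet_version(cache_dir, repo_id):
    # e.g. cache_dir = "~/.cache/huggingface/datasets", repo_id = "allenai/my_dataset"
    namespace, _ = repo_id.split("/")
    parquet_dir = Path(cache_dir).expanduser() / f"{namespace}___parquet"
    # config_id directories look like allenai--my_dataset-def9ee5552a1043e
    prefix = repo_id.replace("/", "--") + "-"
    config_dirs = [d for d in parquet_dir.glob(prefix + "*") if d.is_dir()]
    # each config_id directory holds one or more version subdirectories;
    # use the modification time of dataset_info.json as a proxy for "most recent"
    versions = [
        info.parent
        for config_dir in config_dirs
        for info in config_dir.rglob("dataset_info.json")
    ]
    if not versions:
        return None
    return max(versions, key=lambda d: (d / "dataset_info.json").stat().st_mtime)

The directory returned by this sketch is the one whose name would serve as the hash mentioned in the list below.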

At that point you can ping me, because we will need to pass all this:

  • module_path = _PACKAGED_DATASETS_MODULES["parquet"][0]
  • hash: this corresponds to the name of the directory that contains the .arrow file, inside <namespace>___parquet/<config_id>
  • builder_kwargs = {"hash": hash, "repo_id": self.name, "config_id": config_id}; note that config_id is currently not a valid argument for a DatasetBuilder (see the sketch after this list)
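
Purely as an illustration of the values listed above (the import path is internal API and an assumption, and the placeholder values stand in for the results of the cache lookup sketched earlier):

from datasets.packaged_modules import _PACKAGED_DATASETS_MODULES  # internal, import path is an assumption

# placeholders standing in for the results of the cache lookup
cached_hash = "0123456789abcdef"                          # name of the directory holding the .arrow file
config_id = "allenai--my_dataset-def9ee5552a1043e"        # must match {repo_id.replace("/", "--")}-*

module_path = _PACKAGED_DATASETS_MODULES["parquet"][0]
builder_kwargs = {
    "hash": cached_hash,
    "repo_id": "allenai/my_dataset",   # self.name in the module factory
    "config_id": config_id,            # not yet a valid DatasetBuilder argument, per the note above
}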

I think in the future we want to change this caching logic completely, since I don’t find it super easy to play with.