data: running FSSpecFileLister in ikernel doesn't work

🐛 Describe the bug

Hi. This bug report follows the conversation on discuss.pytorch.org. When the following code is run in a Jupyter kernel, fs.protocol is not consistent.

To reproduce, first update the url_to_fs call in the /torchdata/datapipes/iter/load/fsspec.py file so that it passes the credentials token:

   fs, path = fsspec.core.url_to_fs(self.root, token='/Path/to/creds/credentials.json')

Then run the following code:

from torchdata.datapipes.iter import FSSpecFileLister
image_bucket = "gs://path/to/folder"
datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)

The second time this code is run without restarting the kernel, the URI is returned without the gs:// prefix but with the full path of the environment.
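One way to confirm the symptom is to list the paths twice in the same kernel and check which ones no longer carry the gs:// scheme. The sketch below is only a diagnostic aid, not part of torchdata: the helper names `paths_missing_scheme` and `compare_passes` are hypothetical, and it assumes the lister yields plain path strings.

```python
from urllib.parse import urlsplit


def paths_missing_scheme(paths, scheme="gs"):
    """Return the paths whose URL scheme differs from `scheme`."""
    return [p for p in paths if urlsplit(p).scheme != scheme]


def compare_passes(make_paths, scheme="gs"):
    """Call `make_paths` twice and, for each pass, collect the paths
    that are missing the expected scheme.

    `make_paths` is any zero-argument callable returning an iterable
    of path strings (hypothetical hook, supplied by the caller).
    """
    return [paths_missing_scheme(list(make_paths()), scheme) for _ in range(2)]
```

In the kernel, `make_paths` could be something like `lambda: FSSpecFileLister(root=image_bucket, masks=['*.png'])`. An empty first list together with a non-empty second list would match the behaviour described in this report.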

Versions

Collecting environment information…
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.3.1 (x86_64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.3)
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.13 (main, May 24 2022, 21:28:31) [Clang 13.1.6 (clang-1316.0.21.2)] (64-bit runtime)
Python platform: macOS-12.3.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] facenet-pytorch==2.5.2
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] pytorch-ignite==0.4.9
[pip3] torch==1.11.0
[pip3] torchdata==0.3.0
[pip3] torchvision==0.12.0

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

For the token argument, we added kwargs to FSSpecFileLister. With TorchData 0.4.0 or nightly release, you should be able to add your token there: https://github.com/pytorch/data/blob/f1a128ec789f078852943e8c58377a99b42a7b45/torchdata/datapipes/iter/load/fsspec.py#L57
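As a rough sketch of what that could look like, the token can be passed as a keyword argument when constructing the lister instead of editing the installed fsspec.py. This assumes TorchData 0.4.0 or later, where (per the comment above) extra keyword arguments are forwarded to fsspec. The wrapper name `build_lister` and its `token_path` parameter are hypothetical. The `token` keyword is the one used in the url_to_fs call from the report.

```python
def build_lister(root, token_path, masks=("*.png",)):
    """Build an FSSpecFileLister that forwards a credentials token to fsspec.

    Hypothetical wrapper for illustration only.
    """
    # Imported lazily so this module still loads where torchdata is absent.
    from torchdata.datapipes.iter import FSSpecFileLister

    # Assumption: TorchData >= 0.4.0, where extra keyword arguments
    # are passed through to fsspec.
    return FSSpecFileLister(root=root, masks=list(masks), token=token_path)
```

For example, `build_lister("gs://path/to/folder", "/Path/to/creds/credentials.json")` mirrors the call from the report. Whether the file opener needs the same keyword depends on the installed version, so this sketch does not assume it.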

Based on the discussion on the forum, it seems that there are two issues.

  1. With multiprocessing enabled, your pipeline doesn’t return anything.
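from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FSSpecFileLister

# PIL_open is the reporter's own helper and is not shown in this issue.
def build_datapipes(root_dir=image_bucket):
    datapipe = FSSpecFileLister(root=root_dir, masks=['*.png'])
    file_dp = datapipe.open_file_by_fsspec(mode='rb')
    datapipe = file_dp.map(PIL_open)
    return datapipe

datapipe = build_datapipes()
# A single worker process is enough to trigger the problem.
dl = DataLoader(dataset=datapipe, batch_size=1, num_workers=1)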

Just want to confirm that you mean the process hangs forever, right?

  2. Re-iterating over your pipeline raises FileNotFoundError in an IPython kernel, but the problem does not occur when running it as a script…

datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)