datasets: File exists error when used with TPU

Hi,

I’m getting a “File exists” error when I use the text dataset for pre-training a RoBERTa model with transformers (3.0.2) and nlp (0.4.0) on a VM with a TPU (v3-8).

I modified line 131 in the original run_language_modeling.py as follows:

# line 131: return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
# Replacement: load the raw text file, tokenize it line by line, and expose input_ids as torch tensors.
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size),
                      batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
return dataset

When I run this with xla_spawn.py, I get the following error (it is printed once per TPU core, which I believe is fine).

It seems the current version doesn’t take distributed training processes into account, the way this example does? A rough workaround sketch is included after the log below.

08/25/2020 13:59:41 - WARNING - nlp.builder -   Using custom data configuration default
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Exception in device=TPU:6: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Exception in device=TPU:4: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Exception in device=TPU:1: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Exception in device=TPU:7: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Exception in device=TPU:3: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Exception in device=TPU:2: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Exception in device=TPU:0: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
      main()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
      File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
Traceback (most recent call last):
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
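
If that is the cause, I suppose the dataset creation needs to be guarded so that only one process builds the Arrow cache while the others wait. Here is only a rough sketch of the rendezvous pattern from the PyTorch/XLA examples (assuming torch_xla's xm.is_master_ordinal() and xm.rendezvous(); the get_dataset signature is simplified here):

import torch_xla.core.xla_model as xm
from nlp import load_dataset

def get_dataset(args, tokenizer, file_path):
    # Non-master ordinals wait here until the master has finished preparing the cache.
    if not xm.is_master_ordinal():
        xm.rendezvous("dataset_cache")

    dataset = load_dataset("text", data_files=file_path, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=args.block_size),
                          batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])

    # The master reaches the rendezvous last, releasing the waiting processes,
    # which then find the already-built cache instead of racing to create it.
    if xm.is_master_ordinal():
        xm.rendezvous("dataset_cache")

    return dataset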

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (9 by maintainers)

Most upvoted comments

I can see that the tokenizer in run_language_modeling.py is not instantiated the same way as in your separate script. Indeed, at line 196 we can see:

tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)

Could you try to make sure they are instantiated the exact same way, please?

Could you try to run dataset = load_dataset("text", data_files=file_path, split="train") once before calling the script?

It looks like several processes try to create the dataset in Arrow format at the same time. If the dataset has already been created, it should be fine.
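
For example, a one-off priming step like this (just a sketch; /path/to/train.txt stands in for whatever you pass as the training file) builds the cache before xla_spawn.py starts:

from nlp import load_dataset

# Build the Arrow cache once, in a single process, before launching the TPU script.
load_dataset("text", data_files="/path/to/train.txt", split="train")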

Could you also check that the args.block_size used in the lambda function is the same in both setups?

Yes, that could definitely explain why you have two different cache files. Let me know if using the same tokenizer on both sides fixes the issue.
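
A quick way to confirm the two setups really match (just a sketch; tokenizer_script / tokenizer_standalone, the tokenizer name, and the max_length are placeholders) is to tokenize the same line with both and compare the ids:

from transformers import AutoTokenizer

# Hypothetical stand-ins for the two instantiations being compared.
tokenizer_script = AutoTokenizer.from_pretrained("your-tokenizer-name-or-path", cache_dir="your-cache-dir")
tokenizer_standalone = AutoTokenizer.from_pretrained("your-tokenizer-name-or-path")

sample = "any line from your training file"
a = tokenizer_script(sample, add_special_tokens=True, truncation=True, max_length=512)      # use your block_size
b = tokenizer_standalone(sample, add_special_tokens=True, truncation=True, max_length=512)
print(a["input_ids"] == b["input_ids"])  # prints True if both tokenize identically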