accelerate: Fastai distributed training example KeyError

System Info

- `Accelerate` version: 0.14.0
- Platform: Linux-5.10.147-133.644.amzn2.x86_64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.20.2
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: None
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: False
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: False
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The following Python file is the fastai distributed training example code, with the addition of a test() wrapper function and a print statement.

fastai-tutorial.py:

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

def test():
    # Wrapper around untar_data so we can see how many times it gets called.
    print("downloading")
    return untar_data(URLs.IMAGEWOOF_320)

# Run the download on rank 0 first, then on the remaining ranks.
path = rank0_first(test)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy, top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fit_flat_cos(2, 1e-3, cbs=MixUp(0.1))
The script is launched with:

accelerate launch fastai-tutorial.py

Expected behavior

I expect the test() function to be called only once and the script to finish without errors.

Instead, "downloading" is printed twice and the script throws a KeyError at the end (see output below).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19

Most upvoted comments

I ran into exactly the same issue as well. I am wondering if we should/could avoid this happening on the accelerate side. For example, maybe we don't have to call upper() and store all keys? (I may be wrong, but I don't understand why we need to uppercase the keys.) Or check the keys before deleting? (Probably safer, but might be slower?)

If either approach is desired, I’d love to add a PR.
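
As an illustration only, here is a rough sketch of the "check before deleting" idea in plain Python. The dictionary name current_env is hypothetical and this is not accelerate's actual launcher code; it just shows how a guarded delete avoids the KeyError when a variable exists in both lower and upper case:

import os

# Hypothetical sketch, not accelerate's code: copy the environment and drop
# lower-case duplicates only when the upper-case key is actually present.
current_env = dict(os.environ)
for key in list(current_env):
    upper = key.upper()
    # If both http_proxy and HTTP_PROXY exist, uppercasing produces a duplicate;
    # only remove the lower-case variant if it is still there, so no KeyError is raised.
    if upper != key and upper in current_env:
        current_env.pop(key, None)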

Hi @muellerzr , I had this issue and fixed it by removing the extra environment parameters (i.e., http_proxy, https_proxy, and no_proxy) while keeping the upper-case ones (i.e., HTTP_PROXY and HTTPS_PROXY). It happened when I was running my script in an official PyTorch 1.13.1 docker image, with a config.json that defines the lower-case proxies.
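
For reference, a minimal sketch of that workaround applied from Python before accelerate is invoked (the variable names are the ones listed above; adjust them for your environment):

import os

# Drop the lower-case proxy variables and keep only the upper-case ones,
# as described in the comment above.
for var in ("http_proxy", "https_proxy", "no_proxy"):
    os.environ.pop(var, None)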

I ran into the same issue and fixed it by removing the lower-case environment variables. It would be great if the accelerate side could solve this.

@Lazystinkdog the team is off until Monday, I’ll be looking at it then 😃

@Lazystinkdog because the first time it was called on the process assigned to GPU 0. It must be called again on the process assigned to GPU 1 so that each GPU has access to that data/path. However, untar_data only downloads the tarfile on the first GPU. Since the second GPU sees that the file already exists, it only returns the path instead of downloading it again. Does this make sense?
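
Loosely, the behaviour described above can be pictured with the following simplified sketch (it is not fastai's actual implementation of rank0_first, and it assumes the torch.distributed process group is already initialised):

import torch.distributed as dist

def rank0_first_sketch(func):
    # Rank 0 runs func first (downloading and extracting the data),
    # the other ranks wait at a barrier, then they run func themselves.
    if dist.get_rank() == 0:
        result = func()
    dist.barrier()
    if dist.get_rank() != 0:
        result = func()  # untar_data sees the files already exist and just returns the path
    return result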

Hello, thank you for your fast answer. I did try to uninstall rich, but it is not installed.

Regarding the first one, untar_data() returns the path to the extracted data, so why would the test() function and subsequently the print statement get called again?