accelerate: Fastai distributed training example KeyError

System Info

- `Accelerate` version: 0.14.0
- Platform: Linux-5.10.147-133.644.amzn2.x86_64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.20.2
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: None
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: False
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: False
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The following Python file is the fastai distributed training example code, with the addition of a test() wrapper function and a print statement.

fastai-tutorial.py:

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

def test():
    # Wrapper around untar_data so we can see how many times it gets called.
    print("downloading")
    return untar_data(URLs.IMAGEWOOF_320)

# Run the download on rank 0 first, then on the remaining ranks.
path = rank0_first(test)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy, top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fit_flat_cos(2, 1e-3, cbs=MixUp(0.1))
The script is launched with:

accelerate launch fastai-tutorial.py

Expected behavior

I expect the test() function to be called only once and the script to finish without errors.

Instead, "downloading" is printed twice and the script throws a KeyError at the end (see output below).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19

Most upvoted comments

I ran into exactly the same issue as well. I am wondering if we should/could avoid this happening on the accelerate side. For example, maybe we don't have to call upper() and store all keys? (I may be wrong, but I don't understand why we need to uppercase the keys.) Or check the keys before deleting? (Probably safer, but might be slower?)

If either approach is desired, I’d love to add a PR.
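
As an illustration only, here is a rough sketch of the "check before deleting" idea in plain Python. The dictionary name current_env is hypothetical and this is not accelerate's actual launcher code; it just shows how a guarded delete avoids the KeyError when a variable exists in both lower and upper case:

import os

# Hypothetical sketch, not accelerate's code: copy the environment and drop
# lower-case duplicates only when the upper-case key is actually present.
current_env = dict(os.environ)
for key in list(current_env):
    upper = key.upper()
    # If both http_proxy and HTTP_PROXY exist, uppercasing produces a duplicate;
    # only remove the lower-case variant if it is still there, so no KeyError is raised.
    if upper != key and upper in current_env:
        current_env.pop(key, None)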

Hi @muellerzr , I had this issue and fixed it by removing the extra environment parameters (i.e., http_proxy, https_proxy, and no_proxy) while keeping the upper-case ones (i.e., HTTP_PROXY and HTTPS_PROXY). It happened when I was running my script in an official PyTorch 1.13.1 docker image, with a config.json that defines the lower-case proxies.
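
For reference, a minimal sketch of that workaround applied from Python before accelerate is invoked (the variable names are the ones listed above; adjust them for your environment):

import os

# Drop the lower-case proxy variables and keep only the upper-case ones,
# as described in the comment above.
for var in ("http_proxy", "https_proxy", "no_proxy"):
    os.environ.pop(var, None)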

I ran into the same issue and fixed it by removing the lower-case environment variables. It would be great if the accelerate side could solve this.

@Lazystinkdog the team is off until Monday, I’ll be looking at it then 😃

@Lazystinkdog because the first time it was called on the process assigned to GPU 0. It must be called again on the process assigned to GPU 1 so that each GPU has access to that data/path. However, untar_data only downloads the tarfile on the first GPU. Since the second GPU sees that the file already exists, it only returns the path instead of downloading it again. Does this make sense?
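
Loosely, the behaviour described above can be pictured with the following simplified sketch (it is not fastai's actual implementation of rank0_first, and it assumes the torch.distributed process group is already initialised):

import torch.distributed as dist

def rank0_first_sketch(func):
    # Rank 0 runs func first (downloading and extracting the data),
    # the other ranks wait at a barrier, then they run func themselves.
    if dist.get_rank() == 0:
        result = func()
    dist.barrier()
    if dist.get_rank() != 0:
        result = func()  # untar_data sees the files already exist and just returns the path
    return result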

Hello, thank you for your fast answer. I did try to uninstall rich, but it is not installed.

Regarding the first one, untar_data() returns the path to the extracted data, so why would the test() function and subsequently the print statement get called again?