accelerate: TPU cannot be run in Colab anymore
System Info
The current Accelerate PyTorch example in the README doesn't work.
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Go to https://github.com/huggingface/accelerate#launching-your-training-from-a-notebook and run the notebook. The XLA version won't match. Upon updating the first cell to:
```
!pip install datasets transformers
!pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
!pip install git+https://github.com/huggingface/accelerate
```
the following error is generated:
```
Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
```
It seems like an error regarding the distribution of data and models across the different TPU cores. Running the examples from the terminal doesn't solve the issue either, at least for me on the current Colab version, although this should be verified, as I might have done something wrong.
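For context, the launch pattern I am using follows the README's notebook example; roughly the following sketch, where `training_function` is just a stand-in for the example's actual training loop:

```python
from accelerate import Accelerator, notebook_launcher

def training_function():
    # stand-in for the README's training loop
    accelerator = Accelerator()
    accelerator.print(f"Running on {accelerator.num_processes} process(es)")
    # ... build dataloaders/model, call accelerator.prepare(...), train ...

# spawn one process per TPU core; the failure above happens inside this call
notebook_launcher(training_function, args=(), num_processes=8)
```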
Unfortunately, for Python 3.8, which is the default Python on Colab, only this wheel is available for PyTorch/XLA, as per the link.
Expected behavior
The PyTorch XLA version should be updated, e.g. using the following command (which is only exemplary); it should not throw the error above.
```
!pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
```
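A quick way to check whether the installed wheel actually matches and the runtime sees all eight cores is a sketch like the following (the exact device list reported depends on the Colab runtime):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch.__version__)                     # should report 1.13.0
print(torch_xla.__version__)                 # should match the installed wheel
print(xm.get_xla_supported_devices("TPU"))   # expect 8 TPU devices on a Colab TPU runtime
```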
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 22
So apparently my runtime environment wasn't properly deleted, but now it is working. Sorry for causing trouble. I am not sure if you want me to do so, but I can certainly test the changes across different notebooks and then create a PR changing it.
Weirdly enough, this copy of the pytorch/xla example seems to be working after a minor tweak of installing and uninstalling torchvision, so maybe some change in the implementation of common functions (pytorch/xla) was the reason. Unfortunately, it seems like your previous message has no steps to reproduce; I assume some kind of copy-pasting error.
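For reference, the torchvision tweak I mentioned was roughly the following notebook cell (a rough sketch from memory; I did not pin specific versions, only removed the preinstalled torchvision and reinstalled it after the XLA wheel):

```
!pip uninstall -y torchvision
!pip install torchvision
```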
Thank you for your suggestion. Unfortunately, neither using `install_xla()` nor updating the wheel inside a script helps. I will do my best to look into this issue, however I am not sure what my time availability will look like. The current bug looks like it tries to initialize the TPU once again on each core, since the error message above is repeated 8 times.
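If it helps to localize it, the symptom is consistent with the TPU being claimed in the main notebook process before the 8 workers spawn. A minimal sketch of the pattern that, as far as I understand, triggers exactly this message (the `train` body is only a placeholder):

```python
import torch_xla.core.xla_model as xm
from accelerate import notebook_launcher

# Touching the XLA device in the main notebook process claims the TPU here...
device = xm.xla_device()

def train():
    # ...so when each of the 8 spawned workers tries to set up replication,
    # it only sees 1 device and raises
    # "Cannot replicate if number of devices (1) is different from 8".
    pass

notebook_launcher(train, num_processes=8)
```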