accelerate: TPU cannot be run in Colab anymore
System Info
The current Accelerate PyTorch example in the README doesn't work.
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Go to https://github.com/huggingface/accelerate#launching-your-training-from-a-notebook and run the notebook. The XLA version won't match. Upon updating the first cell to:
```
!pip install datasets transformers
!pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
!pip install git+https://github.com/huggingface/accelerate
```
the following error is generated:
```
Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
```
It seems like an error regarding the distribution of data and models across the different TPU cores. Running the examples from the terminal doesn't solve the issue either, at least for me on the current Colab version, although this should be verified, as I might have done something wrong.
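For context, the launch pattern I am using follows the README's notebook example; roughly the following sketch, where `training_function` is just a stand-in for the example's actual training loop:

```python
from accelerate import Accelerator, notebook_launcher

def training_function():
    # stand-in for the README's training loop
    accelerator = Accelerator()
    accelerator.print(f"Running on {accelerator.num_processes} process(es)")
    # ... build dataloaders/model, call accelerator.prepare(...), train ...

# spawn one process per TPU core; the failure above happens inside this call
notebook_launcher(training_function, args=(), num_processes=8)
```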
Unfortunately, for Python 3.8, which is the default Python on Colab, only this wheel is available for PyTorch/XLA, as per the link.
Expected behavior
The PyTorch XLA version should be updated, e.g. using the following command (which is only exemplary); it should not throw the error above.
```
!pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
```
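A quick way to check whether the installed wheel actually matches and the runtime sees all eight cores is a sketch like the following (the exact device list reported depends on the Colab runtime):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch.__version__)                     # should report 1.13.0
print(torch_xla.__version__)                 # should match the installed wheel
print(xm.get_xla_supported_devices("TPU"))   # expect 8 TPU devices on a Colab TPU runtime
```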
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 22
So apparently my runtime environment wasn't properly deleted, but now it is working. Sorry for causing trouble. I am not sure if you want me to do so, but I can certainly test the changes across different notebooks and then create a PR changing it.
Weirdly enough, this copy of the pytorch/xla example seems to be working after a minor tweak of installing and uninstalling torchvision, so maybe some change in the implementation of common functions (pytorch/xla) was the reason. Unfortunately, it seems like your previous message has no steps to reproduce; I assume some kind of copy-pasting error.
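For reference, the torchvision tweak I mentioned was roughly the following notebook cell (a rough sketch from memory; I did not pin specific versions, only removed the preinstalled torchvision and reinstalled it after the XLA wheel):

```
!pip uninstall -y torchvision
!pip install torchvision
```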
Thank you for your suggestion. Unfortunately, neither using `install_xla()` nor updating the wheel inside a script helps. I will do my best to look into this issue, however I am not sure what my time availability will look like. The current bug looks like it tries to initialize the TPU once again on each core, since the error message above is repeated 8 times.
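If it helps to localize it, the symptom is consistent with the TPU being claimed in the main notebook process before the 8 workers spawn. A minimal sketch of the pattern that, as far as I understand, triggers exactly this message (the `train` body is only a placeholder):

```python
import torch_xla.core.xla_model as xm
from accelerate import notebook_launcher

# Touching the XLA device in the main notebook process claims the TPU here...
device = xm.xla_device()

def train():
    # ...so when each of the 8 spawned workers tries to set up replication,
    # it only sees 1 device and raises
    # "Cannot replicate if number of devices (1) is different from 8".
    pass

notebook_launcher(train, num_processes=8)
```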