transformers: xla_spawn.py crashes when training on TPU V3-32

Environment info

  • transformers version: 4.0.1
  • Platform: Google Cloud debian-9-torch-xla-v20201215
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7
  • Tensorflow version (GPU?):
  • Using GPU in script?: No; using TPUs
  • Using distributed or parallel set-up in script?:

Who can help

@sgugger @patrickvonplaten

Information

Model I am using (Bert, XLNet …):

I am using an ALBERT base model.

The problem arises when using:

  • the official example scripts: (give details below) Using examples/xla_spawn.py together with run_mlm.py, training crashes when we try to run it on a v3-32. We are supposed to set the number of cores to either 1 or 8, but we have 32 cores, and passing 32 raises an error. We have also tried setting that variable to 1 or 8, but in both cases it raises errors:
/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
[the UserWarning above is emitted once per spawned process; repeats trimmed]
Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1229 : Check failed: session.Run({tensorflow::Output(result, 0)}, &outputs) == ::tensorflow::Status::OK() (Internal: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Internal: Invalid system configuration: 2x2 host topology with 0 missing hosts, but 1 hosts in total.
         [[{{node configure_distributed_tpu/_0}}]]
         [[ConfigureDistributedTPU_G3]]
  (1) Internal: Invalid system configuration: 2x2 host topology with 0 missing hosts, but 1 hosts in total.
         [[{{node configure_distributed_tpu/_0}}]]
0 successful operations.
0 derived errors ignored. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace()
        xla::XrtComputationClient::InitializeAndFetchTopology(std::string const&, int, std::string const&, tensorflow::ConfigProto const&)
        xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
        xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
        xla::ComputationClient::Create()
        xla::ComputationClient::Get()
        _PyCFunction_FastCallDict
        _PyEval_EvalFrameDefault
        [repeated Python interpreter frames trimmed]
        PyEval_EvalCodeEx
        PyEval_EvalCode
        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main
*** End stack trace ***

Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 112, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17
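
For context (my own reading of the XRT error above, not an explanation from the maintainers): a v3-32 is a TPU pod made up of 4 hosts with 8 cores each, and xla_spawn.py calls xmp.spawn() on a single host, which can only fork processes for that host's local cores. The topology check then fails because only 1 of the expected hosts has joined. A minimal sketch of the arithmetic, with the host/core counts assumed from the v3-32 topology rather than read from torch_xla:

# Illustrative sketch only -- the numbers are assumptions based on the
# v3-32 topology, not values read from torch_xla or the XRT client.
TOTAL_CORES = 32        # a v3-32 pod
CORES_PER_HOST = 8      # each TPU host exposes 8 local cores
EXPECTED_HOSTS = TOTAL_CORES // CORES_PER_HOST  # 4 hosts, the "2x2 host topology"

# xmp.spawn() forks workers on the current host only, so the largest useful
# nprocs here is CORES_PER_HOST; the other hosts must each run their own
# copy of the script, which is what a pod launcher arranges.
connected_hosts = 1     # only the host that ran xla_spawn.py joined
if connected_hosts != EXPECTED_HOSTS:
    print(f"Invalid system configuration: {EXPECTED_HOSTS} hosts expected, "
          f"but {connected_hosts} host(s) in total")  # mirrors the error above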

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name): MLM
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Initialize a TPU v3-32 and run xla_spawn.py with the number of cores set to 32, 8, or 1. In all three cases it raises an error; see the example invocation below.
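
For reference, the invocation looks roughly like this (the script paths and run_mlm.py arguments below are placeholders for our actual setup, not the exact command):

python transformers/examples/xla_spawn.py --num_cores 8 \
    language-modeling/run_mlm.py \
    --model_type albert \
    --train_file /path/to/train.txt \
    --do_train \
    --output_dir /path/to/output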

Expected behavior

It should be possible to set the number of cores to the number of TPU cores we actually have. It does not make sense to only be able to train with either 1 or 8 cores…

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (4 by maintainers)

Most upvoted comments

Same thing, it has not been tested. We don’t have the resources set up to test on more than a single TPU (so 8 cores).

Sorry, I didn’t phrase that well. I meant that the launcher script xla_spawn has only been tested on a single TPU, not a TPU pod, as far as I know. So you may need to launch the script in a different way.

I am not aware of anyone launching any of the example scripts on a TPU pod successfully, so I don’t know if they work or not.
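
For anyone who finds this later: at the time, the torch_xla-documented way to run on a pod was the torch_xla.distributed.xla_dist launcher, which starts a copy of the training command on every host in the pod. Something along these lines (the pod name, conda env, and script arguments are placeholders, and I have not verified this path with xla_spawn.py/run_mlm.py):

python -m torch_xla.distributed.xla_dist \
    --tpu=my-v3-32-pod \
    --conda-env=torch-xla-1.7 \
    -- python run_mlm.py --model_type albert --output_dir /path/to/output

As I understand the torch_xla docs of that era, the cross-host fan-out is handled by the launcher, so the per-host process count stays at 8 rather than 32.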