transformers: xla_spawn.py crashes when training on TPU V3-32
Environment info
- transformers version: 4.0.1
- Platform: Google Cloud debian-9-torch-xla-v20201215
- Python version: 3.7
- PyTorch version (GPU?): 1.7
- Tensorflow version (GPU?):
- Using GPU in script?: No; using TPUs
- Using distributed or parallel set-up in script?:
Who can help
Information
Model I am using (Bert, XLNet …): ALBERT base.
The problem arises when using:
- the official example scripts: (give details below) Using examples/xla_spawn.py together with run_mlm.py, it crashes when we try to run on a v3-32. We are supposed to set the number of cores to either 1 or 8, but we have 32 cores, and passing 32 raises an error. We have also tried setting it to 1 or 8, but those raise errors as well:
/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
(the warning above is printed once per spawned process)
Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1229 : Check failed: session.Run({tensorflow::Output(result, 0)}, &outputs) == ::tensorflow::Status::OK() (Internal: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Internal: Invalid system configuration: 2x2 host topology with 0 missing hosts, but 1 hosts in total.
[[{{node configure_distributed_tpu/_0}}]]
[[ConfigureDistributedTPU_G3]]
(1) Internal: Invalid system configuration: 2x2 host topology with 0 missing hosts, but 1 hosts in total.
[[{{node configure_distributed_tpu/_0}}]]
0 successful operations.
0 derived errors ignored. vs. OK)
*** Begin stack trace ***
tensorflow::CurrentStackTrace()
xla::XrtComputationClient::InitializeAndFetchTopology(std::string const&, int, std::string const&, tensorflow::ConfigProto const&)
xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::ComputationClient::Create()
xla::ComputationClient::Get()
_PyCFunction_FastCallDict
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyEval_EvalCodeEx
PyObject_Call
_PyObject_GenericGetAttrWithDict
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyEval_EvalCodeEx
PyObject_Call
_PyEval_EvalFrameDefault
PyEval_EvalCodeEx
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyEval_EvalCodeEx
PyEval_EvalCode
PyRun_StringFlags
PyRun_SimpleStringFlags
Py_Main
main
__libc_start_main
*** End stack trace ***
Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 112, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17
The task I am working on is:
- an official GLUE/SQuAD task: (give the name): MLM
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Initialize a TPU v3-32 and run xla_spawn.py with the number of cores set to 32, 8, or 1. All three cases raise an error.
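For reference, a reproduction command of the shape we use (the model name, dataset flags, and output path below are illustrative placeholders, not our exact values):

```
# Hypothetical invocation; --num_cores is the xla_spawn.py flag that fails
# for every value we tried on the v3-32.
python transformers/examples/xla_spawn.py --num_cores 8 \
    transformers/examples/language-modeling/run_mlm.py \
    --model_name_or_path albert-base-v2 \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/test-mlm
```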
Expected behavior
There should not be any problem setting the number of cores to the number of TPU cores we actually have. It really does not make sense to only be able to train with either 1 core or 8 cores…
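For context (assuming the TPU was created through gcloud): a v3-32 pod consists of four hosts with 8 cores each, which matches the "2x2 host topology" in the error, and "but 1 hosts in total" suggests only one of those hosts joined the XRT session. The pod's host count can be checked with:

```
# The networkEndpoints list should show one entry per host (4 for a v3-32);
# $TPU_POD_NAME and $ZONE are placeholders for your TPU name and zone.
gcloud compute tpus describe $TPU_POD_NAME --zone=$ZONE
```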
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (4 by maintainers)
Same thing, it has not been tested. We don't have resources set up to test on more than a single TPU (so 8 cores).
Sorry, I didn't write it well. I meant that the launcher script xla_spawn has only been tested on one TPU, not a TPU pod, as far as I know. So you may need to launch the script in a different way. I am not aware of anyone launching any of the example scripts on a TPU pod successfully, so I don't know whether they work or not.
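For anyone landing here: each host in a v3-32 pod only sees its local 8 cores, so a single xmp.spawn call from one VM cannot drive all 32. The documented PyTorch/XLA route for pods at the time was its distributed launcher, torch_xla.distributed.xla_dist, which runs the given command on every host of the pod. Untested with the transformers examples, but a sketch of the likely direction (TPU name, conda env, and script arguments are placeholders):

```
# xla_dist SSHes into each pod host and runs the command there; each host
# would then spawn its own 8 local TPU processes via xla_spawn.py.
python -m torch_xla.distributed.xla_dist \
    --tpu=$TPU_POD_NAME \
    --conda-env=torch-xla-1.7 \
    -- python transformers/examples/xla_spawn.py --num_cores 8 \
         transformers/examples/language-modeling/run_mlm.py \
         --model_name_or_path albert-base-v2 --do_train --output_dir /tmp/test-mlm
```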