xlnet: Out of memory with TPU v3-8 when running tpu_squad_large.sh
Hello, thank you for the interesting paper and for releasing your code alongside the paper!
I am trying to train XLNet on SQuAD, but I am getting OOM errors when running `scripts/tpu_squad_large.sh`. This strikes me as odd, because the README says this script runs without issues. I have not modified the script's parameters, except for specifying the necessary data and model directories.
For context, my setup is as follows. I spun up a TPU v3-8 using `ctpu up` in the us-central1-a zone. I preprocessed the data as directed, using `scripts/prepro_squad.sh`, and moved it to a Google Cloud Storage bucket in the same region as the TPU. I have model checkpoint folders both locally (for SentencePiece) and in the cloud (for loading the model).
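For reference, a minimal sketch of that setup; the TPU name, bucket, and directory names below are placeholders rather than my actual values:

```bash
# Spin up a v3-8 TPU in us-central1-a (node name and zone are placeholders).
ctpu up --name=xlnet-squad --zone=us-central1-a --tpu-size=v3-8

# Copy the preprocessed SQuAD data and the pretrained checkpoint to a GCS
# bucket in the same region as the TPU (bucket and directory names are
# placeholders, not the ones I actually used).
gsutil -m cp -r proc_data/squad gs://my-xlnet-bucket/proc_data/squad
gsutil -m cp -r xlnet_cased_L-24_H-1024_A-16 gs://my-xlnet-bucket/xlnet_cased_L-24_H-1024_A-16
```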
I have worked with TPUs before, but only TPU v2 (not v3), so I may well be missing something.
When I run `scripts/tpu_squad_large.sh`, loading and initialization work fine, but the script breaks with what I believe is a memory issue:
```
# ... normal tensorflow logs ...
I0621 17:53:36.702727 140612788254144 tpu_estimator.py:536] Enqueue next (1000) batch(es) of data to infeed.
I0621 17:53:36.703403 140612788254144 tpu_estimator.py:540] Dequeue next (1000) batch(es) of data from outfeed.
I0621 17:56:15.833373 140611248187136 error_handling.py:70] Error recorded from outfeed: Bad hardware status: 0x1
# ... stack trace ...
Status code: Resource exhausted [9x]
Compilation failure: Ran out of memory in memory space hbm. Used 20.90G of 16.00G hbm. Exceeded hbm capacity by 4.90G.
Total hbm usage >= 20.90G:
    reserved   528.00M
    program    20.38G
    arguments  unknown size
Output size unknown.
```
Is there something I am doing incorrectly?
Also, have others managed to run `scripts/tpu_squad_large.sh` successfully (with batch size 48, etc.)?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 23 (6 by maintainers)
Could you provide the environment you are in? I tested this script myself on cloud TPU 4 days ago without a problem.

@lovedavidsilva alternatively, you can specify `--tf-version 1.14.1.dev20190518` when starting up the TPU with the `ctpu` utility. An example:
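(The original example command is not preserved here; the following is a minimal sketch of what such an invocation could look like, with the node name and zone as placeholders.)

```bash
# Request a v3-8 TPU pinned to the nightly TensorFlow build mentioned above.
# The node name and zone are placeholders, not values from the original comment.
ctpu up --name=xlnet-squad \
        --zone=us-central1-a \
        --tpu-size=v3-8 \
        --tf-version=1.14.1.dev20190518
```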
An update: I just tried again 30 seconds ago. It works OK with the following setup.
Please let me know whether it works.
Thank you for the quick and detailed response!
I was using a TPU v3-8 with TensorFlow version 1.13, rather than 1.14.1.dev20190518. I have switched to a TPU with version 1.14.1.dev20190518 and it is currently training without the error. I will update this thread and close the issue once the script finishes and I am able to fully reproduce the SQuAD results.
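In case it helps anyone hitting the same thing, one way to confirm which TensorFlow version a TPU node was created with is sketched below; the node name and zone are placeholders, not my actual setup.

```bash
# Print the TPU node's TensorFlow version from its metadata.
# Replace the node name and zone with your own.
gcloud compute tpus describe xlnet-squad \
    --zone=us-central1-a \
    --format='value(tensorflowVersion)'
```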
Thank you again!
I can confirm I get exactly the same error as you, with the same configuration and steps.