xlnet: Out of memory with TPU v3-8 when running tpu_squad_large.sh

Hello, thank you for the interesting paper and for releasing your code alongside the paper!

I am trying to train XLNet on SQuAD, but I am getting OOM errors when running scripts/tpu_squad_large.sh. This strikes me as odd, because you say in the README that this script runs without issues. I have not modified the script's parameters, except for specifying the necessary data and model directories.

For context, my setup is as follows. I spun up a TPU v3-8 using ctpu up in the us-central1-a region. I preprocessed the data as directed, using scripts/prepro_squad.sh, and moved it to a Google Cloud Storage bucket in the same region as the TPU. I have the model checkpoint folders both locally (for the SentencePiece model) and in the cloud (for loading the model checkpoint).
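
For reference, the setup commands looked roughly like the following (the TPU name, bucket name, and local directory names are placeholders, not my actual values):

# create the TPU v3-8 and VM in us-central1-a (name is a placeholder)
ctpu up --name my-xlnet-tpu --tpu-size v3-8 --zone us-central1-a

# preprocess SQuAD locally as described in the README
bash scripts/prepro_squad.sh

# copy the preprocessed data and the pretrained model to a GCS bucket
# in the same region as the TPU (bucket and directory names are placeholders)
gsutil -m cp -r <preprocessed_squad_dir> gs://my-xlnet-bucket/squad
gsutil -m cp -r <xlnet_model_dir> gs://my-xlnet-bucket/xlnet_model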

I have worked with TPUs before, but only with TPU v2 (not v3), so it is possible I am missing something v3-specific.

When I run scripts/tpu_squad_large.sh, loading and initialization work fine, but the script breaks with what I believe is a memory issue:

# ... normal tensorflow logs ...

I0621 17:53:36.702727 140612788254144 tpu_estimator.py:536] Enqueue next (1000) batch(es) of data to infeed.
I0621 17:53:36.703403 140612788254144 tpu_estimator.py:540] Dequeue next (1000) batch(es) of data from outfeed.
I0621 17:56:15.833373 140611248187136 error_handling.py:70] Error recorded from outfeed: Bad hardware status: 0x1

# ... stack trace ...

Status code: Resource exhausted [9x]
  Compilation failure: Ran out of memory in memory space hbm. Used 20.90G of 16.00G hbm. Exceeded hbm capacity by 4.90G.

  Total hbm usage >= 20.90G:
      reserved        528.00M
      program          20.38G
      arguments       unknown size

  Output size unknown.

Is there something I am doing incorrectly?

Also, have others managed to run scripts/tpu_squad_large.sh successfully (with batch size 48, etc.)?

Most upvoted comments

@lovedavidsilva alternatively, you can specify --tf-version 1.14.1.dev20190518 when starting up the TPU with the ctpu utility. For example:

ctpu up --name your-xlnet-tpu --tpu-size v3-8 --tpu-only --tf-version 1.14.1.dev20190518 --zone us-central1-b

Could you describe the environment you are running in? I tested this script myself on a Cloud TPU 4 days ago without a problem.

An update: I just tried again 30 seconds ago. It works fine with the following setup.

  • TPU v3-8 software version (shown in Compute Engine -> TPUs): 1.14.1.dev20190518
  • TensorFlow version (the one you chose when launching the VM): 1.13.1

Please let me know whether it works.
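
For what it's worth, you can also check the software version of an existing TPU from the command line; this is a rough sketch using the gcloud TPU commands (the TPU name and zone are placeholders):

# print details of an existing TPU; the software version should appear
# in the output (e.g. as a tensorflowVersion field)
gcloud compute tpus describe your-xlnet-tpu --zone us-central1-b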

Thank you for the quick and detailed response!

I was using a TPU v3-8 with software version 1.13, rather than 1.14.1.dev20190518. I have switched to a TPU with version 1.14.1.dev20190518, and it is currently training without the error. I will update this thread and close the issue when the script finishes and I am able to fully reproduce the SQuAD results.

Thank you again!

I can confirm I get exactly the same error as you, with the same configuration and steps.