transformers: Cannot run grid search using Trainer API and Ray Tune

Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): 2.6.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help

@richardliaw, @amogkam

@sgugger

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Hi, I am trying to run a grid search on my RoBERTa model.

Steps to reproduce the behavior:

hyperParameters = {
    "per_gpu_batch_size": [32],
    "learning_rate": [2e-5],
    "num_epochs": [2, 3],
}

def my_hp_space_ray(trial):
    from ray import tune

    return {
        "learning_rate": tune.choice(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.choice(hyperParameters.get("num_epochs")),
    }

training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",  # Can be "epoch" or "steps"
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)

trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_val,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
)
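
A note on the search space above: `tune.choice` samples one value per trial at random, so two trials are not guaranteed to cover both epoch settings. The sketch below is not from the original report. It enumerates the grid in plain Python to show what an exhaustive search over these values would run, and defines a hypothetical `my_hp_space_ray_grid` that wraps the same lists in Ray Tune's `tune.grid_search`. The helper name `expand_grid` is an assumption made for illustration.

```python
from itertools import product

# Same search values as in the report above.
hyperParameters = {
    "per_gpu_batch_size": [32],
    "learning_rate": [2e-5],
    "num_epochs": [2, 3],
}


def expand_grid(space):
    """Return every combination of the listed values, one dict per trial."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def my_hp_space_ray_grid(trial):
    """Hypothetical grid-search variant of my_hp_space_ray (requires Ray when called)."""
    # Imported lazily, as in the original snippet.
    from ray import tune

    return {
        "learning_rate": tune.grid_search(hyperParameters["learning_rate"]),
        "num_train_epochs": tune.grid_search(hyperParameters["num_epochs"]),
    }


# Plain-Python view of the combinations a grid search would cover.
grid = expand_grid({
    "learning_rate": hyperParameters["learning_rate"],
    "num_train_epochs": hyperParameters["num_epochs"],
})
print(len(grid), "combinations:", grid)
```

Ray Tune repeats a `grid_search` space once per sample, and `hyperparameter_search` appears to forward `n_trials` as that sample count, so `n_trials=1` would likely be the setting that runs each combination once. Treat that as something to verify against the installed versions.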

2021-08-30 21:07:06,743 ERROR trial_runner.py:773 -- Trial _objective_2e533_00001: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1762, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1794, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1184, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 529, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 414, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 344, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 257, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.90 GiB total capacity; 13.14 GiB already allocated; 312.75 MiB free; 13.29 GiB reserved in total by PyTorch)
Result for _objective_2e533_00001: {}

Expected behavior

I would like to tune the hyperparameters of my RoBERTa model using Ray Tune and the Trainer API. Is there a way to avoid running out of memory, even if the search takes longer to finish? Or is there another kind of hyperparameter tuning I should use instead?

I spent the whole day trying to figure this out, so any help would be hugely appreciated.
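
One general way to trade time for memory is to lower the per-device batch size and accumulate gradients over several steps, so the effective batch size stays at 32. The sketch below does not diagnose the cause of the error in this report (in this thread, updating transformers is what resolved it). The helper `accumulation_kwargs` is hypothetical; only the returned key names, `per_device_train_batch_size` and `gradient_accumulation_steps`, are actual `TrainingArguments` parameters.

```python
def accumulation_kwargs(effective_batch_size, max_per_device_batch_size, n_devices=1):
    """Split a target batch size into a per-device size and accumulation steps.

    Hypothetical helper: picks the largest per-device batch size, no bigger than
    the given maximum, that divides the target evenly across devices.
    """
    for per_device in range(max_per_device_batch_size, 0, -1):
        per_step = per_device * n_devices
        if effective_batch_size % per_step == 0:
            return {
                "per_device_train_batch_size": per_device,
                "gradient_accumulation_steps": effective_batch_size // per_step,
            }
    raise ValueError("effective_batch_size must be divisible by n_devices")


# Keep the effective batch size of 32 while holding at most 8 examples per step.
kwargs = accumulation_kwargs(effective_batch_size=32, max_per_device_batch_size=8)
print(kwargs)
```

The resulting dictionary could be unpacked into the existing call, for example `TrainingArguments("test", **kwargs, ...)`, in place of `per_device_train_batch_size=32`. Each optimizer update then takes more forward and backward passes, but each pass has to hold a quarter of the activations.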

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

@Yard1 @richardliaw updating transformers did help. Huge thanks!

I do have one last question, though: is there a way to specify TPUs instead of GPUs in the API?

At the moment I have it like this:

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=1,
    hp_space=my_hp_space_ray,
    resources_per_trial={"cpu": 1, "gpu": 1},
    fail_fast="raise",
)

Seems like the error is:

(pid=1376)     output = fn()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
(pid=1376)     trainable(config, **fn_kwargs)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
(pid=1376)     local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1331, in train
(pid=1376)     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1426, in _maybe_log_save_evaluate
(pid=1376)     metrics = self.evaluate()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2031, in evaluate
(pid=1376)     metric_key_prefix=metric_key_prefix,
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2260, in evaluation_loop
(pid=1376)     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
(pid=1376)   File "<ipython-input-38-b8a033e8f995>", line 5, in compute_metrics
(pid=1376) NameError: name 'metric' is not defined
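
The traceback shows `compute_metrics` looking up a name, `metric`, that did not exist where the trial ran. One way to rule out that class of error is to keep `compute_metrics` self-contained so it depends on no notebook-level variable. The sketch below is an assumption about what the function was meant to do (accuracy over classification logits), not the reporter's actual code, and it uses plain Python so it also works on NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels) without relying on any global name.

    Assumes the predictions are a single 2-D array of class logits.
    """
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Tiny check with three examples and two classes.
example = ([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])
print(compute_metrics(example))
```

If a metric object from a library is preferred, creating it inside the function body rather than in a separate notebook cell would serve the same purpose.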

As a tip, you could also call hyperparameter_search(..., fail_fast="raise") to make errors easier to see.