transformers: Cannot run grid search using Trainer API and Ray Tune
Environment info
- transformers version: 4.8.2
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.11
- PyTorch version (GPU?): 1.9.0+cu102 (True)
- Tensorflow version (GPU?): 2.6.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help
Information
Model I am using (Bert, XLNet…): RoBERTa
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
To reproduce
Hi, I am trying to do a grid search on my RoBERTa model.
Steps to reproduce the behavior:
```python
from transformers import Trainer, TrainingArguments

# Hyperparameter grid to search over
hyperParameters = {
    "per_gpu_batch_size": [32],
    "learning_rate": [2e-5],
    "num_epochs": [2, 3],
}

def my_hp_space_ray(trial):
    from ray import tune
    return {
        "learning_rate": tune.choice(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.choice(hyperParameters.get("num_epochs")),
    }

training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",  # can be "epoch" or "steps"
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)

# tokenizer, tokenized_datasets_train/val, model_init and compute_metrics
# are defined earlier in my script
trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_val,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
)
```
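For reference, `hyperparameter_search` returns a `BestRun` named tuple when it completes, so the call can also be captured and inspected; a small sketch of that, reusing the objects defined above:

```python
# Sketch only: same search call as above, but capturing the returned BestRun.
best_run = trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
)
print(best_run.run_id)           # id of the best Ray Tune trial
print(best_run.objective)        # objective value that trial reached
print(best_run.hyperparameters)  # hyperparameter dict sampled for that trial
```

In my case the search never gets that far; each trial fails with the error below.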
```
2021-08-30 21:07:06,743 ERROR trial_runner.py:773 -- Trial _objective_2e533_00001: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1762, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1794, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1184, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 529, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 414, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 344, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 257, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.90 GiB total capacity; 13.14 GiB already allocated; 312.75 MiB free; 13.29 GiB reserved in total by PyTorch)
Result for _objective_2e533_00001: {}
```
Expected behavior
Hi, I would like to hyperparameter-tune my RoBERTa model using Ray Tune and the Trainer API. Is there a way to avoid running out of memory, even if the search takes longer to finish? Or is there some other type of parameter tuning I should use instead?
I spent the whole day trying to figure it out, so any help would be hugely appreciated.
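For example, would letting the search space also pick smaller per-device batch sizes (compensating with gradient accumulation) be the right way to trade time for memory? A rough sketch of what I mean — the name `my_hp_space_ray_low_mem` and the specific values are just illustrative:

```python
def my_hp_space_ray_low_mem(trial):
    from ray import tune
    # Sketch only: sample a smaller per-device batch size to lower peak GPU
    # memory, and add gradient accumulation so the effective batch size
    # (per_device_train_batch_size * gradient_accumulation_steps) stays in a
    # similar range. Both keys are standard TrainingArguments fields, which
    # hyperparameter_search overrides per trial.
    return {
        "learning_rate": tune.choice([2e-5]),
        "num_train_epochs": tune.choice([2, 3]),
        "per_device_train_batch_size": tune.choice([8, 16]),
        "gradient_accumulation_steps": tune.choice([2, 4]),
    }
```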
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15 (2 by maintainers)
@Yard1 @richardliaw Updating transformers did help. Huge thanks!
Although I have one last question: is there a way to specify using TPUs instead of GPUs in the API?
At the moment I have it like this,
Seems like the error is:
As a tip, maybe you could also consider doing
`hyperparameter_search(..., fail_fast="raise")`
to make errors easier to see.
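Concretely, a sketch reusing the arguments from the snippet above — the extra keyword is passed through to Ray Tune:

```python
# Sketch: same search call as before, with fail_fast="raise" forwarded to
# Ray Tune so that a trial exception (like the CUDA OOM above) is re-raised
# immediately instead of only showing up in the trial logs.
trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
    fail_fast="raise",
)
```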