openfold: Training Runtime Error: StopIteration

Hi,

I’m using the released training data on AWS and the latest main branch to train the model.

  1. The directory structure of the released data is not recognized by the code.
  2. After restructuring the directories and putting all of the .hhr and .a3m files under the alignment directory (see the layout sketch below), the code crashes with the default settings at datapoint_idx = next(samples) in reroll (File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377).
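
(For reference, the expected per-chain layout appears to be roughly the following; the chain names and alignment filenames here are just examples and depend on which databases the alignments came from.)

alignment_dir/
    1abc_A/
        bfd_uniclust_hits.a3m
        mgnify_hits.a3m
        uniref90_hits.a3m
        pdb70_hits.hhr
    1abc_B/
        ...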

Any idea how to solve this?

Thanks,

Bo

The full traceback is below:

Traceback (most recent call last):
  File "train_openfold.py", line 548, in <module>
    main(args)
  File "train_openfold.py", line 341, in main
    ckpt_path=ckpt_path,
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
    self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
    self.reset_train_dataloader(model=model)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
    self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
    dataloader = source.dataloader()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
    return method()
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 694, in train_dataloader
    return self._gen_dataloader("train") 
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 671, in _gen_dataloader
    dataset.reroll()
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll
    datapoint_idx = next(samples)
StopIteration
srun: error: nid001680: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2466693.0


Most upvoted comments

The “exact sequence” warning is expected and is nothing to worry about. As for the precision thing, could we move this to #180? I’m pretty sure it’s the same thing. FP16 is not supported, so the final error is also expected.

Nah, thank you for OpenFold! 😃

I checked whether any other chains had the same issue, but none appear to apart from the aforementioned 6tif_AAA. Removing it from the alignment directory seems to fix the issue, but I soon run into another one. I don't think it's related; please let me know if you'd prefer me to open a separate issue.

I’m first getting a warning UserWarning: One of given dataloaders is None and it will be skipped.

I’m then getting several more warnings about exact sequences missing. An example is the following:

WARNING:root:The exact sequence MTTPRRALIVIDVQNEYVTGDLPIEYPDVQSSLANIARAMDAARAAGVPVVIVQNFAPAGSPLFARGSNGAELHPVVSERARDHYVEKSLPSAFTGTDLAGWLAARQIDTLTVTGYMTHNADASTINHAVHSGLAVEFLHDATGSVPYENSAGFASAEEIHRVFSVVLQSRFAAVASTDEWIAAVQGGTPLA was not found in 3oqp_A. Realigning the template to the actual sequence.

Finally, I get a RuntimeError: expected scalar type BFloat16 but found Half (not sure if related to #180):

Traceback (most recent call last):
  File "/fsx/openbioml/openfold/train_openfold.py", line 104, in training_step
    outputs = self(batch)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/train_openfold.py", line 67, in forward
    return self.model(batch)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/model.py", line 510, in forward
    _recycle=(num_iters > 1)
  File "/fsx/openbioml/openfold/openfold/model/model.py", line 243, in iteration
    inplace_safe=inplace_safe,
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/embedders.py", line 116, in forward
    tf_emb_i = self.linear_tf_z_i(tf)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type BFloat16 but found Half
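
For what it's worth, the error itself is just a dtype mismatch between the embedder's weights (bf16) and the tensor reaching it (fp16). A minimal plain-PyTorch sketch, unrelated to OpenFold's own code, that triggers the same kind of failure:

import torch

# A bf16 Linear layer fed an fp16 input fails inside F.linear with a
# dtype-mismatch error like the one in the traceback above.
linear = torch.nn.Linear(8, 8).to(torch.bfloat16)
x = torch.randn(2, 8, dtype=torch.float16)
try:
    linear(x)
except RuntimeError as e:
    print(e)  # e.g. "expected scalar type BFloat16 but found Half"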

A cursory Google search leads to https://discuss.pytorch.org/t/mixed-precision-training-on-cuda-with-bfloat16/156248, so I tried re-running with fp16 rather than bfloat16, and apart from the same warnings I now get a ValueError("Unsupported datatype"):

  File "/fsx/openbioml/openfold/train_openfold.py", line 104, in training_step
    outputs = self(batch)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/train_openfold.py", line 67, in forward
    return self.model(batch)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/model.py", line 510, in forward
    _recycle=(num_iters > 1)
  File "/fsx/openbioml/openfold/openfold/model/model.py", line 367, in iteration
    _mask_trans=self.config._mask_trans,
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/evoformer.py", line 996, in forward
    m, z = checkpoint_fn(b, m, z)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 743, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
    outputs = run_function(*inputs_cuda)
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/evoformer.py", line 519, in forward
    self.ckpt if torch.is_grad_enabled() else False,
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/msa.py", line 284, in forward
    flash_mask=mask,
  File "/fsx/openbioml/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/openbioml/openfold/openfold/model/primitives.py", line 488, in forward
    o = attention_core(q, k, v, *((biases + [None] * 2)[:2]))
  File "/fsx/openbioml/openfold/openfold/utils/kernel/attention_core.py", line 32, in forward
    raise ValueError("Unsupported datatype")
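
If I'm reading the last frame right, the ValueError comes from a dtype guard at the top of OpenFold's custom attention kernel (openfold/utils/kernel/attention_core.py), which, as far as I can tell, accepts only fp32 and bf16 inputs. A self-contained paraphrase of that check (illustrative only, not the verbatim OpenFold source):

import torch

# Illustrative paraphrase of the guard in attention_core.py: the custom
# kernel appears to reject anything other than fp32 or bf16 inputs.
SUPPORTED_DTYPES = [torch.float32, torch.bfloat16]

def check_dtype(q: torch.Tensor) -> None:
    if q.dtype not in SUPPORTED_DTYPES:
        raise ValueError("Unsupported datatype")

try:
    check_dtype(torch.randn(2, 4, dtype=torch.float16))
except ValueError as e:
    print(e)  # "Unsupported datatype"

That would line up with the note above that fp16 is not supported.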

Could something have gone wrong with the FlashAttention installation? Any tips? I would also greatly appreciate it if you could share the DeepSpeed config you used for the reproduction run you reported. I'm running this on 64 A100s, so the setup should be similar to yours (which IIRC was on 45 A100s). All of this is for an open-source project that I would love to chat with you about at some point! 😃

Delete their alignment_dirs and rerun. I’ll look into what’s causing this.

Could you verify programmatically that every single chain in your alignment_dir has a corresponding .mmcif file in the data_dir? Take all chain names in the former, split on _, and search for an mmcif file matching the PDB code.
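
A rough sketch of that check, assuming per-chain alignment subdirectories named like <pdb_id>_<chain> and mmCIF files named <pdb_id>.cif (adjust the paths and extensions to your setup):

import os

alignment_dir = "alignment_dir"   # per-chain subdirectories, e.g. "6tif_AAA"
data_dir = "data_dir"             # mmCIF files, e.g. "6tif.cif"

# PDB codes for which an mmCIF file exists in data_dir
mmcif_ids = {
    os.path.splitext(f)[0].lower()
    for f in os.listdir(data_dir)
    if f.endswith((".cif", ".mmcif"))
}

# Chains in alignment_dir whose PDB code has no matching mmCIF file
missing = sorted(
    chain for chain in os.listdir(alignment_dir)
    if chain.split("_")[0].lower() not in mmcif_ids
)

print(f"{len(missing)} chain(s) without a matching mmCIF file")
print("\n".join(missing))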