openfold: Training Runtime Error: StopIteration
Hi,
I’m using the released training data on AWS and the latest main branch to train the model.
- The directory structure of the released data is not recognized by the code.
- After restructuring the directories and putting all the .hhr and .a3m files under the alignment directory (see the layout sketch below), the code crashes at
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll datapoint_idx = next(samples)
with default settings.
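The layout I ended up with is one subdirectory per chain, along the lines of the following sketch (the alignment file names here are illustrative and may differ from yours):

    alignment_dir/
        6tif_AAA/
            bfd_uniclust_hits.a3m
            mgnify_hits.a3m
            uniref90_hits.a3m
            pdb70_hits.hhr
        <pdb_id>_<chain_id>/
            ...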
Any ideas on how to solve this?
Thanks,
Bo
The full traceback is below:
Traceback (most recent call last):
File "train_openfold.py", line 548, in <module>
main(args)
File "train_openfold.py", line 341, in main
ckpt_path=ckpt_path,
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
self.reset_train_dataloader(model=model)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
dataloader = source.dataloader()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
return method()
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 694, in train_dataloader
return self._gen_dataloader("train")
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 671, in _gen_dataloader
dataset.reroll()
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll
datapoint_idx = next(samples)
StopIteration
srun: error: nid001680: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2466693.0
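For context: reroll re-draws training datapoints from sample generators, so a StopIteration at next(samples) means a generator was exhausted, which typically happens when no usable entries survive dataset filtering. A toy illustration of the mechanism (not OpenFold's actual sampling code):

    def sample_indices(num_datapoints):
        # One pass over the datapoint indices; yields nothing if the
        # dataset resolved to zero usable entries.
        for i in range(num_datapoints):
            yield i

    samples = sample_indices(0)     # e.g. every chain was filtered out
    datapoint_idx = next(samples)   # raises StopIteration, as in the trace above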
The “exact sequence” warning is expected and is nothing to worry about. As for the precision thing, could we move this to #180? I’m pretty sure it’s the same thing. FP16 is not supported, so the final error is also expected.
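For later readers hitting the same dtype errors: with the PyTorch Lightning 1.5.x install visible in the traceback, bfloat16 training is selected via the Trainer's precision flag, while precision=16 is the unsupported fp16 path. A minimal self-contained sketch of where the flag goes (the tiny module and data are placeholders, not OpenFold's training setup; bf16 also requires an Ampere-class or newer GPU):

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset

    class TinyModule(pl.LightningModule):
        # Placeholder module, only here to make the example runnable.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randn(32, 1)), batch_size=4)
    trainer = pl.Trainer(gpus=1, precision="bf16", max_epochs=1)  # "bf16", not 16 (fp16)
    trainer.fit(TinyModule(), loader)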
Nah, thank you for OpenFold! 😃
I checked whether any other chains had the same issue, but none appear to, apart from the aforementioned 6tif_AAA. Removing it from the alignment directory seems to fix the issue, but I soon run into another. I don't think it's related – please let me know if you'd prefer that I open a separate issue.

I'm first getting a warning:
UserWarning: One of given dataloaders is None and it will be skipped.
I'm then getting several more warnings about exact sequences missing.
Finally, I get a RuntimeError: expected scalar type BFloat16 but found Half (not sure whether it's related to #180). A cursory Google search leads to https://discuss.pytorch.org/t/mixed-precision-training-on-cuda-with-bfloat16/156248, so I tried re-running with fp16 rather than bfloat16, and apart from the same warnings I now get a ValueError("Unsupported datatype").

Could something have gone wrong with flash attention's installation? Any tips? I would also greatly appreciate it if you could share the DeepSpeed config you used for the reproduction run you reported. I'm running this on 64 A100s, so the setup should be similar to yours (which, IIRC, was on 45 A100s). All of this is for an open-source project that I would love to chat with you about at some point! 😃
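For what it's worth, a generic DeepSpeed configuration with bfloat16 enabled looks like the sketch below. This only illustrates the relevant keys (bf16 rather than fp16); it is not the config used for the reported reproduction run:

    {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
        "bf16": { "enabled": true },
        "zero_optimization": { "stage": 2 }
    }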
Delete their alignment_dirs and rerun. I'll look into what's causing this.

Could you verify programmatically that every single chain in your alignment_dir has a corresponding .mmcif file in the data_dir? Take all the chain names in the former, split them on _, and search for an mmCIF file matching the PDB code.
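A quick way to run that check, assuming one subdirectory per chain in alignment_dir and mmCIF files named <pdb_id>.cif directly inside data_dir (adjust the paths and casing to your actual layout):

    import os

    alignment_dir = "alignment_dir"  # one subdirectory per chain, e.g. 6tif_AAA
    data_dir = "data_dir"            # assumed to contain <pdb_id>.cif files

    missing = []
    for chain in sorted(os.listdir(alignment_dir)):
        pdb_id = chain.split("_")[0].lower()  # "6tif_AAA" -> "6tif"
        if not os.path.exists(os.path.join(data_dir, f"{pdb_id}.cif")):
            missing.append(chain)

    print(f"{len(missing)} chain(s) without a matching mmCIF file")
    for chain in missing:
        print(chain)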