training-operator: Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org
After creating a PyTorchJob, the job's pods stay in the Pending status indefinitely, and the training-operator controller logs the following error:
Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org "xxx-pytorchjob": the object has been modified; please apply your changes to the latest version and try again
Gang scheduling is enabled with Volcano.
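For context, "the object has been modified; please apply your changes to the latest version and try again" is the message the Kubernetes API server returns when an update carries a stale resourceVersion (optimistic concurrency). The sketch below is only a toy illustration of that mechanism and of the usual re-read-and-retry response. `FakeStore`, `ConflictError`, and `update_with_retry` are hypothetical names made up for this example, not training-operator or Kubernetes client code, and the sketch does not claim to identify the root cause of this issue.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """Toy in-memory stand-in for the API server (hypothetical)."""

    def __init__(self):
        self._obj = {"resourceVersion": 1, "status": "Created"}

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return dict(self._obj)

    def update(self, obj):
        # Reject writes based on an outdated copy of the object.
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise ConflictError("the object has been modified")
        # Accept the write and bump the version.
        self._obj = dict(obj, resourceVersion=obj["resourceVersion"] + 1)
        return dict(self._obj)


def update_with_retry(store, mutate, max_attempts=5):
    """Re-read the latest object, apply mutate, and retry on conflict."""
    for _ in range(max_attempts):
        obj = store.get()      # always start from the latest version
        mutate(obj)            # apply the desired change in place
        try:
            return store.update(obj)
        except ConflictError:
            continue           # someone else wrote first; read again
    raise ConflictError("gave up after repeated conflicts")


if __name__ == "__main__":
    store = FakeStore()
    stale = store.get()                                # read at version 1
    store.update(dict(store.get(), status="Running"))  # another writer wins
    try:
        store.update(stale)                            # stale write is rejected
    except ConflictError as err:
        print("conflict:", err)

    def mark_succeeded(obj):
        obj["status"] = "Succeeded"

    print(update_with_retry(store, mark_succeeded))
```

An occasional conflict that succeeds on the next reconcile is normal in this model. A conflict that repeats on every reconcile suggests that something else keeps writing to the same object between the controller's read and its update.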
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
760ac1171dd30039a7363ffa03c77454bd714da5 is the commit ID; you can search for it in the git log. We can probably switch to a version tag later to make debugging easier.