training-operator: Reconcile PyTorch Job error: Operation cannot be fulfilled on pytorchjobs.kubeflow.org

After creating a PyTorchJob, the status of the job's pods always stays Pending, and the training-operator controller logs the error below:

Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org "xxx-pytorchjob": the object has been modified; please apply your changes to the latest version and try again

Gang scheduling is enabled with Volcano.
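
For readers unfamiliar with this message: "the object has been modified" is the standard Kubernetes optimistic-concurrency conflict, returned when an update carries a resourceVersion older than the one currently stored. The sketch below only illustrates that mechanism and the usual re-read-and-retry response. It is not the training-operator's code, and FakeStore, ConflictError, and update_status_with_retry are hypothetical names made up for this example.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """Hypothetical in-memory stand-in for the API server (holds one object)."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "status": "Created"}

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return dict(self.obj)

    def update(self, new_obj):
        # Reject writes based on an outdated copy of the object.
        if new_obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise ConflictError("the object has been modified")
        # Accept the write and bump the version.
        self.obj = dict(new_obj, resourceVersion=self.obj["resourceVersion"] + 1)


def update_status_with_retry(store, status, attempts=5):
    """Re-read the latest object before every attempt, then try to update it."""
    for _ in range(attempts):
        current = store.get()
        current["status"] = status
        try:
            store.update(current)
            return True
        except ConflictError:
            continue
    return False


store = FakeStore()
stale = store.get()           # this copy is about to become outdated

other = store.get()           # a second writer updates the object first
other["status"] = "Pending"
store.update(other)

stale["status"] = "Running"
try:
    store.update(stale)       # stale resourceVersion, so this is rejected
    conflicted = False
except ConflictError:
    conflicted = True

ok = update_status_with_retry(store, "Running")
```

An occasional conflict of this kind is normally harmless because the next reconcile works from a fresh copy. Whether that explains the pods staying Pending here is not established by the report above.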

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure whether that is enough to determine the version, or is there another way to get the version?

760ac1171dd30039a7363ffa03c77454bd714da5 is the commit ID; you can search for it in the git log. We can probably switch to version tags later to make debugging easier.
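
As a small illustration of the point above, the tag can be split off the image reference and checked for the shape of a full commit ID. This is a minimal sketch: split_image_ref and looks_like_commit_sha are hypothetical helper names, and the code assumes a plain "repository:tag" reference without an "@sha256:" digest.

```python
import re


def split_image_ref(ref):
    """Split 'repository:tag' into (repository, tag); tag is None if absent."""
    # Only a colon after the last slash separates the tag, so a registry
    # port such as 'host:5000/...' is not mistaken for one.
    name_start = ref.rfind("/") + 1
    colon = ref.find(":", name_start)
    if colon == -1:
        return ref, None
    return ref[:colon], ref[colon + 1:]


def looks_like_commit_sha(tag):
    """True if the tag has the shape of a full 40-character git commit ID."""
    return tag is not None and re.fullmatch(r"[0-9a-f]{40}", tag) is not None


image = (
    "public.ecr.aws/j1r0q0g6/training/training-operator:"
    "760ac1171dd30039a7363ffa03c77454bd714da5"
)
repo, tag = split_image_ref(image)
is_commit = looks_like_commit_sha(tag)
```

With the tag in hand, running `git log -1 <tag>` in a checkout of the training-operator repository shows the matching commit and its date.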