kueue: A job that resets reclaimable pods on eviction gets stuck due to Workload validation
If the workload is preempted, will ReclaimablePod be set to nil?
{"level":"error","ts":"2023-10-16T12:47:55.770288657Z","caller":"controller/controller.go:324","msg":"Reconciler error","controller":"job","controllerGroup":"batch.volcano.sh","controllerKind":"Job","Job":{"name":"hobot-job-xxx","namespace":"cpu-preempt"},"namespace":"cpu-preempt","name":"hobot-job-xxx","reconcileID":"f4abf36a-f5f2-4ec0-bbae-31b49cfef394","error":"admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[main]: Required value: cannot be removed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/gopath/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/gopath/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/gopath/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}
Is this the expected behavior, or am I using it incorrectly in a way that causes the webhook validation to fail for the evicted workload?
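To make the rejection in the log concrete, here is a minimal sketch of a check that would produce the same message. It only models what the log shows (an entry of `status.reclaimablePods` disappearing between the old and the new object); the type `ReclaimablePod` and the function `validateNoRemoval` are hypothetical names for illustration, not Kueue's actual webhook code, which may apply additional conditions.

```go
package main

import (
	"fmt"
	"sort"
)

// ReclaimablePod is a simplified, hypothetical stand-in for one entry of
// status.reclaimablePods: a pod set name plus a reclaimable count.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// validateNoRemoval returns one error message for every pod set that is
// present in oldPods but missing from newPods. This mirrors the message
// seen in the log; it is an illustrative assumption, not the real webhook.
func validateNoRemoval(oldPods, newPods []ReclaimablePod) []string {
	present := make(map[string]bool, len(newPods))
	for _, p := range newPods {
		present[p.Name] = true
	}
	var errs []string
	for _, p := range oldPods {
		if !present[p.Name] {
			errs = append(errs, fmt.Sprintf(
				"status.reclaimablePods[%s]: Required value: cannot be removed", p.Name))
		}
	}
	sort.Strings(errs) // deterministic order for readability
	return errs
}

func main() {
	old := []ReclaimablePod{{Name: "main", Count: 2}}
	// Resetting the list to nil drops the "main" entry, which the check rejects.
	for _, e := range validateNoRemoval(old, nil) {
		fmt.Println(e)
	}
}
```

Under this model, any reconcile pass that writes an empty list over a previously populated one is rejected every time, which would explain why the job stays stuck rather than failing once.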
About this issue
- State: closed
- Created 8 months ago
- Comments: 18 (18 by maintainers)
Skipping https://github.com/kubernetes-sigs/kueue/blob/525098b838eb28ee241cf1aaee5e53f161d30a0c/pkg/controller/jobframework/reconciler.go#L268-L278
while the workload has the `Evicted` condition set should do the trick.
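A rough sketch of that suggestion, assuming the linked lines are where the reconciler writes the job's reclaimable pod counts to the workload. The types `Condition` and `Workload` and the functions `isEvicted` and `syncReclaimablePods` are simplified, hypothetical stand-ins, not the actual Kueue types or the eventual fix.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// Workload is a hypothetical, reduced model of a Workload's status.
type Workload struct {
	Conditions      []Condition
	ReclaimablePods map[string]int32 // pod set name -> reclaimable count
}

// isEvicted reports whether the workload has the Evicted condition set to True.
func isEvicted(wl *Workload) bool {
	for _, c := range wl.Conditions {
		if c.Type == "Evicted" && c.Status == "True" {
			return true
		}
	}
	return false
}

// syncReclaimablePods copies the job's reclaimable counts to the workload,
// but skips the update while the workload is evicted. It returns true only
// when the workload was changed.
func syncReclaimablePods(wl *Workload, fromJob map[string]int32) bool {
	if wl == nil || isEvicted(wl) {
		return false // leave the existing entries alone during eviction
	}
	wl.ReclaimablePods = fromJob
	return true
}

func main() {
	wl := &Workload{
		Conditions:      []Condition{{Type: "Evicted", Status: "True"}},
		ReclaimablePods: map[string]int32{"main": 2},
	}
	// The job has reset its counts after eviction; the update is skipped.
	updated := syncReclaimablePods(wl, nil)
	fmt.Println(updated, wl.ReclaimablePods["main"])
}
```

With the guard in place, the reset counts never reach the API server while the workload is evicted, so the webhook has nothing to reject.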