gpu-operator: Nvidia container toolkit daemonset pod fails with ErrImagePull
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
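A quick way to run the last two checks, assuming shell access to a node for the kernel module check:

```sh
# On the node: confirm the required kernel modules are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Against the cluster: confirm the ClusterPolicy CRD and CR are present
kubectl describe clusterpolicies --all-namespaces
```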
1. Issue or feature description
We are using GPU operator v1.6.2 in one of our E2E tests in cluster-api-provider-aws. It was working two days ago, but it has now started failing: the nvidia-container-toolkit-daemonset pod fails to come up with the error below:
Failed to pull image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": failed to resolve reference "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59: not found
Has anything changed recently that could be causing this issue?
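One way to confirm the failure independently of the cluster is to ask the registry for the pinned digest directly (a sketch; assumes a Docker CLI with manifest support and reuses the digest from the error above):

```sh
# Query nvcr.io for the pinned digest; a "not found"/"no such manifest" error confirms
# the image was removed or re-pushed upstream rather than a node-side pull problem
docker manifest inspect nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
```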
2. Steps to reproduce the issue
Reference manifest used.
For the plugin-validator error: it needs an available GPU on the node, so it can fail if other pods are already consuming the GPUs. If you want to disable plugin validation, you can set it as below via the validator component in the ClusterPolicy.
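The original snippet is not reproduced here; as a sketch of what such a setting can look like, recent gpu-operator releases expose a WITH_WORKLOAD env var on the validator's plugin section (the field path, env var, and default ClusterPolicy name cluster-policy are assumptions to verify against your release's values.yaml):

```sh
# Hypothetical: turn off the workload part of plugin validation on the live ClusterPolicy
kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"validator":{"plugin":{"env":[{"name":"WITH_WORKLOAD","value":"false"}]}}}}'
```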
To get the templates for each release, you can run helm template against the chart, e.g.:
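A minimal sketch, assuming the chart is pulled from the NVIDIA NGC helm repository (repo URL and chart name are the publicly documented ones, but verify them for your setup):

```sh
# Add the NVIDIA helm repo and render the full manifest set for a given release
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm template gpu-operator nvidia/gpu-operator --version v1.11.1 --include-crds \
  > gpu-operator-v1.11.1.yaml
```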
Below is the helm-rendered template for v1.11.1, for reference.
Basically, all the manifests you have here need to be updated to the latest version (the Roles, the NFD manifests, the CRD, and the CR built from values.yaml).
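To see the CR defaults that values.yaml carries for a given release (same assumed chart coordinates as above):

```sh
# Print the chart's default values for v1.11.1; the ClusterPolicy CR is built from these
helm show values nvidia/gpu-operator --version v1.11.1
```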