karpenter-provider-aws: "inconsistent state error adding volume, StorageClass.storage.k8s.io "nfs" not found, please file an issue"

Version

Karpenter: v0.12.0

Kubernetes: v1.22.9-eks-a64ea69

Expected Behavior

Karpenter works as expected (no issues with nfs volumes)

Actual Behavior

Seeing the following in my Karpenter logs:

controller 2022-06-27T06:30:24.907Z ERROR controller.node-state inconsistent state error adding volume, StorageClass.storage.k8s.io "nfs" not found, please file an issue {"commit": "588d4c8", "node": "ip-XXX-XX-XX-XXX.us-west-2.compute.internal"}

Steps to Reproduce the Problem

Resource Specs and Logs

Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: workers
spec:
  # https://github.com/aws/karpenter/issues/1252#issuecomment-1166894316
  labels:
    vpc.amazonaws.com/has-trunk-attached: "false"
  taints:
  - key: purpose
    effect: NoSchedule
    value: workers
  - key: purpose
    value: workers
    effect: NoExecute
  requirements:
    - key: "purpose"
      operator: In
      values: ["inflate-workers", "workers"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["t3.medium", "t3.large", "t3.xlarge", "m5n.xlarge", "m6a.large", "c6a.large"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
  provider:
    instanceProfile: eks-random-strings-redacted
    securityGroupSelector:
      Name: v2-cert-03-eks-node
    subnetSelector:
      Name: v2-cert-03-private-us-west-2*
  ttlSecondsAfterEmpty: 30

Sample Deployment Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-deployment
  labels:
    app: some-deployment
    app-kubecost: dev
spec:
  revisionHistoryLimit: 1
  replicas:
  selector:
    matchLabels:
      app: some-deployment
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app: some-deployment
        app-kubecost: dev
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: purpose
                    operator: In
                    values:
                      - workers
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - some-deployment
                topologyKey: kubernetes.io/hostname
      tolerations:
        - key: purpose
          operator: Equal
          value: workers
          effect: NoSchedule
        - key: purpose
          operator: Equal
          value: workers
          effect: NoExecute
      imagePullSecrets:
        - name: uat-workers-dockercred
      containers:
        - name: worker
          image: "REDACTED.dkr.ecr.us-west-2.amazonaws.com/some_repo:some_immutable_tag"
          imagePullPolicy: IfNotPresent
          env:
            - name: VERBOSE
              value: "3"
          resources:
            requests:
              cpu: "2"
              memory: "2000Mi"
            limits:
              cpu: "2"
              memory: "2000Mi"
          volumeMounts:
            - name: files
              mountPath: /code/.configs.yaml
              subPath: configs.yaml
            - mountPath: "/protected_path/uat"
              name: nfs
      volumes:
        - name: files
          configMap:
            name: config-files
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs


Sample PVC template:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
spec:
  volumeName: nfs-{{ .Release.Namespace }}
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Sample PV template:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-{{ .Release.Namespace }}
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs
  mountOptions:
    - nfsvers=4.1
    - rsize=1048576
    - wsize=1048576
    - hard
    - timeo=600
    - retrans=2
    - noresvport
  nfs:
    server: {{ .Values.nfs.server }}
    path: {{ .Values.nfs.path }}
  claimRef:
    name: nfs
    namespace: {{ .Release.Namespace }}

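For context on the error above: the PVC and PV bind statically by name, but both reference storageClassName: nfs, and the error message suggests no StorageClass object with that name exists in the cluster, which is what Karpenter's lookup trips over. A minimal placeholder StorageClass along these lines (a workaround sketch, not a manifest from this issue; the provisioner and binding mode are assumptions for a manually provisioned NFS setup) would give that lookup something to resolve:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs
# Hypothetical placeholder so lookups of the "nfs" class by name succeed.
# PVs are created by hand (as in the template above), so no external
# provisioner is needed.
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate

Since the PVC above pins volumeName, it binds to the pre-created PV directly either way; the class object only needs to exist so that name-based lookups stop failing.
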
About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

Yes, this should work for you. 4c35c0fe3cc13f55f7edba361cb2f5e662ac9867 is the latest commit on main.

export COMMIT="4c35c0fe3cc13f55f7edba361cb2f5e662ac9867"
export CLUSTER_NAME="<INSERT_CLUSTER_NAME>"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text)"

helm install karpenter oci://public.ecr.aws/karpenter-snapshots/karpenter --version v0-${COMMIT} --namespace karpenter \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set clusterName=${CLUSTER_NAME} \
  --set clusterEndpoint=${CLUSTER_ENDPOINT} \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait # for the defaulting webhook to install before creating a Provisioner
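
To confirm the snapshot actually went in (a generic check, not from this thread; it assumes the default release name and namespace used above), the controller image tag should reference the commit:

# Image tag on the deployed controller should point at the snapshot build
kubectl get deployment karpenter -n karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# Karpenter pods should be Running after --wait returns
kubectl get pods -n karpenter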

@armenr Can you file a separate issue? It’s difficult to track multiple items in a single issue.