cert-manager: HTTP-01 challenge fails; solver pod is short-lived and all traces of it are removed

Describe the bug: The certificate is not being issued. The Order ends up in the invalid state and the Challenge in the expired state:

➜  accounting-service git:(xxxxxxxxx) ✗ kubectl get challenges
NAME                                                 STATE     DOMAIN                                            AGE
accounting-service-tls-lccsr-2824239067-3765429214   expired   accounting-service-public.stag.aws.worksome.net   52m
➜  accounting-service git:(xxxxxxxx) ✗ kubectl get orders
NAME                                      STATE     AGE
accounting-service-tls-lccsr-2824239067   invalid   52m
➜  accounting-service git:(xxxxxxxxxx) ✗ kubectl describe order accounting-service-tls-lccsr-2824239067
Name:         accounting-service-tls-lccsr-2824239067
Namespace:    default
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: accounting-service-tls
              cert-manager.io/certificate-revision: 1
              cert-manager.io/private-key-secret-name: accounting-service-tls-ff7pk
API Version:  acme.cert-manager.io/v1
Kind:         Order
Metadata:
  Creation Timestamp:  2022-01-11T13:32:39Z
  Generation:          1
  Managed Fields:
    API Version:  acme.cert-manager.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:cert-manager.io/certificate-name:
          f:cert-manager.io/certificate-revision:
          f:cert-manager.io/private-key-secret-name:
        f:ownerReferences:
          .:
          k:{"uid":"c6e54947-8ce7-48c5-92c8-8ec7331d1273"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:dnsNames:
        f:issuerRef:
          .:
          f:group:
          f:kind:
          f:name:
        f:request:
      f:status:
        .:
        f:authorizations:
        f:failureTime:
        f:finalizeURL:
        f:state:
        f:url:
    Manager:    controller
    Operation:  Update
    Time:       2022-01-11T13:34:25Z
  Owner References:
    API Version:           cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  CertificateRequest
    Name:                  accounting-service-tls-lccsr
    UID:                   c6e54947-8ce7-48c5-92c8-8ec7331d1273
  Resource Version:        17805529
  UID:                     71323017-b594-4020-ae74-de30a4f607d4
Spec:
  Dns Names:
    accounting-service-public.stag.aws.worksome.net
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   Issuer
    Name:   letsencrypt-staging
  Request:  <THE CERTIFICATE REQUEST BASE64-ENCODED GOES HERE>
Status:
  Authorizations:
    Challenges:
      Token:        HDU7Jy0sqG4bgp1ADI6nACKYYVs0g_5cVfdUsXcVOgg
      Type:         http-01
      URL:          https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/1401516058/Oc1iHA
      Token:        HDU7Jy0sqG4bgp1ADI6nACKYYVs0g_5cVfdUsXcVOgg
      Type:         dns-01
      URL:          https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/1401516058/Ax0QhQ
      Token:        HDU7Jy0sqG4bgp1ADI6nACKYYVs0g_5cVfdUsXcVOgg
      Type:         tls-alpn-01
      URL:          https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/1401516058/qtKlag
    Identifier:     accounting-service-public.stag.aws.worksome.net
    Initial State:  pending
    URL:            https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1401516058
    Wildcard:       false
  Failure Time:     2022-01-11T13:34:25Z
  Finalize URL:     https://acme-staging-v02.api.letsencrypt.org/acme/finalize/39574498/1503494108
  State:            invalid
  URL:              https://acme-staging-v02.api.letsencrypt.org/acme/order/39574498/1503494108
Events:
  Type    Reason   Age   From          Message
  ----    ------   ----  ----          -------
  Normal  Created  53m   cert-manager  Created Challenge resource "accounting-service-tls-lccsr-2824239067-3765429214" for domain "accounting-service-public.stag.aws.worksome.net"
➜  accounting-service git:(xxxxxxxxxxxxxx) ✗ kubectl describe challenge accounting-service-tls-lccsr-2824239067-3765429214
Name:         accounting-service-tls-lccsr-2824239067-3765429214
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  acme.cert-manager.io/v1
Kind:         Challenge
Metadata:
  Creation Timestamp:  2022-01-11T13:32:41Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
  Managed Fields:
    API Version:  acme.cert-manager.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.acme.cert-manager.io":
        f:ownerReferences:
          .:
          k:{"uid":"71323017-b594-4020-ae74-de30a4f607d4"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:authorizationURL:
        f:dnsName:
        f:issuerRef:
          .:
          f:group:
          f:kind:
          f:name:
        f:key:
        f:solver:
          .:
          f:http01:
            .:
            f:ingress:
              .:
              f:class:
        f:token:
        f:type:
        f:url:
        f:wildcard:
      f:status:
        .:
        f:presented:
        f:processing:
        f:reason:
        f:state:
    Manager:    controller
    Operation:  Update
    Time:       2022-01-11T13:32:44Z
  Owner References:
    API Version:           acme.cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  accounting-service-tls-lccsr-2824239067
    UID:                   71323017-b594-4020-ae74-de30a4f607d4
  Resource Version:        17805528
  UID:                     4b30bb88-72b4-4049-8ac4-a2ebc7d2d9fa
Spec:
  Authorization URL:  https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1401516058
  Dns Name:           accounting-service-public.stag.aws.worksome.net
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   Issuer
    Name:   letsencrypt-staging
  Key:      HDU7Jy0sqG4bgp1ADI6nACKYYVs0g_5cVfdUsXcVOgg.-yuHWS8lDxJT_DIqqUoRVEU3PIZV3RT-ln1_oYBMf0A
  Solver:
    http01:
      Ingress:
        Class:  nginx
  Token:        HDU7Jy0sqG4bgp1ADI6nACKYYVs0g_5cVfdUsXcVOgg
  Type:         HTTP-01
  URL:          https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/1401516058/Oc1iHA
  Wildcard:     false
Status:
  Presented:   false
  Processing:  false
  Reason:      Error accepting challenge: 400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending
  State:       expired
Events:
  Type    Reason     Age   From          Message
  ----    ------     ----  ----          -------
  Normal  Started    55m   cert-manager  Challenge scheduled for processing
  Normal  Presented  55m   cert-manager  Presented challenge using HTTP-01 challenge mechanism

The cert-manager controller creates the “cm-acme-http-solver” Pod, Service and Ingress, but they do not appear to work as expected: the self-check gets a 503 for about a minute, then the controller hits the “Error accepting challenge: 400” error shown above and removes the solver resources. Controller logs from one such attempt:

I0111 11:32:43.082104       1 pod.go:71] cert-manager/controller/challenges/http01/ensurePod "msg"="creating HTTP01 challenge solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:43.344491       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:43.344598       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:43.344653       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:32:48.404069       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:48.427112       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:48.427186       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:48.427279       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:32:48.440863       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:58.404531       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:58.404635       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:32:58.404689       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:32:58.423602       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:08.423918       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:08.424030       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:08.424086       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:33:08.433494       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:18.434851       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:18.434940       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:18.435001       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:33:23.444981       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:33.445392       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:33.445478       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:33.445536       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:33:38.452673       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:48.453060       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:48.453144       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:33:48.453209       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:33:53.462773       1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '503', expected '200'" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:03.463584       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:03.463669       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:03.463734       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:34:39.004532       1 sync.go:386] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="context deadline exceeded" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:34:39.004807       1 controller.go:163] cert-manager/controller/challenges "msg"="re-queuing item due to error processing" "error"="context deadline exceeded" "key"="default/accounting-service-tls-2bnht-2824239067-1373156968"
I0111 11:34:44.006058       1 pod.go:59] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "msg"="found one existing HTTP01 solver pod" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:44.006177       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-shxmm" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:44.006238       1 ingress.go:90] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-4wrgx" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:34:59.233881       1 sync.go:378] cert-manager/controller/challenges/acceptChallenge "msg"="error accepting challenge" "error"="400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:59.321416       1 pod.go:119] cert-manager/controller/challenges/cleanupPods "msg"="deleting pod resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
I0111 11:34:59.335495       1 pod.go:127] cert-manager/controller/challenges/cleanupPods "msg"="successfully deleted pod resource" "dnsName"="accounting-service-public.stag.aws.worksome.net" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-ng88s" "related_resource_namespace"="default" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"
E0111 11:34:59.440884       1 controller.go:102] ingress 'default/cm-acme-http-solver-4wrgx' in work queue no longer exists
I0111 11:34:59.727133       1 trigger_controller.go:160] cert-manager/controller/certificates-trigger "msg"="Not re-issuing certificate as an attempt has been made in the last hour" "key"="default/accounting-service-tls" "retry_delay"=3459272891786
E0111 11:35:00.030063       1 sync.go:70] cert-manager/controller/orders "msg"="failed to update status" "error"=null "resource_kind"="Order" "resource_name"="accounting-service-tls-2bnht-2824239067" "resource_namespace"="default" "resource_version"="v1"
I0111 11:35:00.030144       1 controller.go:161] cert-manager/controller/orders "msg"="re-queuing item due to optimistic locking on resource" "key"="default/accounting-service-tls-2bnht-2824239067" "error"="Operation cannot be fulfilled on orders.acme.cert-manager.io \"accounting-service-tls-2bnht-2824239067\": the object has been modified; please apply your changes to the latest version and try again"
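
For reference, the "propagation check failed" lines above come from the controller fetching the challenge URL and expecting a 200. Below is a minimal Python sketch of a comparable probe that can be run by hand while the solver resources still exist. It is not cert-manager's implementation: the function names (challenge_url, evaluate, probe) and the body comparison are assumptions made for illustration. Only the URL path and the 200 expectation are taken from the output in this report.

```python
import urllib.error
import urllib.request

# Path prefix as it appears in the solver Ingress rule.
ACME_CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def challenge_url(domain: str, token: str) -> str:
    """Build the plain-HTTP URL at which the token is expected to be served."""
    return f"http://{domain}{ACME_CHALLENGE_PREFIX}{token}"


def evaluate(status: int, body: str, expected_key: str) -> str:
    """Classify one probe result (message wording is illustrative)."""
    if status != 200:
        return f"wrong status code '{status}', expected '200'"
    if body.strip() != expected_key:
        return "response body does not match the expected key"
    return "ok"


def probe(domain: str, token: str, expected_key: str, timeout: float = 10.0) -> str:
    """Fetch the challenge URL once and classify the response."""
    url = challenge_url(domain, token)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as a 503) are raised as HTTPError.
        status, body = err.code, ""
    except (urllib.error.URLError, OSError) as err:
        return f"request failed: {err}"
    return evaluate(status, body, expected_key)
```

Calling probe() with the Dns Name, Token and Key values from the Challenge spec, in a loop, would show whether the 503 ever turns into a 200 before the controller removes the solver.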

I managed to describe the Pod, Ingress and Service that the cert-manager controller creates during their short lifespan:

➜  ~ kubectl describe pod cm-acme-http-solver-jnm6f; kubectl describe ingress cm-acme-http-solver-2j9fm; kubectl describe service cm-acme-http-solver-8vzb7
Name:                 cm-acme-http-solver-jnm6f
Namespace:            default
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 <none>
Labels:               acme.cert-manager.io/http-domain=3159757414
                      acme.cert-manager.io/http-token=607653218
                      acme.cert-manager.io/http01-solver=true
                      eks.amazonaws.com/fargate-profile=platform-staging-fargate-pod-profile
Annotations:          CapacityProvisioned: 0.25vCPU 0.5GB
                      Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
                      kubernetes.io/psp: eks.privileged
                      sidecar.istio.io/inject: false
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Challenge/accounting-service-tls-hprbm-2824239067-3707414063
NominatedNodeName:    1c983fda60-0ef78df90139477ab992d594d1188b1b
Containers:
  acmesolver:
    Image:      quay.io/jetstack/cert-manager-acmesolver:v1.6.1
    Port:       8089/TCP
    Host Port:  0/TCP
    Args:
      --listen-port=8089
      --domain=accounting-service-public.stag.aws.worksome.net
      --token=K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4
      --key=K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4.-yuHWS8lDxJT_DIqqUoRVEU3PIZV3RT-ln1_oYBMf0A
    Limits:
      cpu:     100m
      memory:  64Mi
    Requests:
      cpu:        10m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rbjd6 (ro)
Volumes:
  kube-api-access-rbjd6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason           Age   From               Message
  ----     ------           ----  ----               -------
  Warning  LoggingDisabled  63s   fargate-scheduler  Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Name:             cm-acme-http-solver-2j9fm
Labels:           acme.cert-manager.io/http-domain=3159757414
                  acme.cert-manager.io/http-token=607653218
                  acme.cert-manager.io/http01-solver=true
Namespace:        default
Address:          k8s-default-ingressn-5f8fde044d-bcdcaea9f98c3614.elb.eu-west-1.amazonaws.com
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
  Host                                             Path  Backends
  ----                                             ----  --------
  accounting-service-public.stag.aws.worksome.net
                                                   /.well-known/acme-challenge/K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4   cm-acme-http-solver-8vzb7:8089 (<none>)
Annotations:                                       nginx.ingress.kubernetes.io/whitelist-source-range: 0.0.0.0/0,::/0
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    50s (x2 over 64s)  nginx-ingress-controller  Scheduled for sync
Name:                     cm-acme-http-solver-8vzb7
Namespace:                default
Labels:                   acme.cert-manager.io/http-domain=3159757414
                          acme.cert-manager.io/http-token=607653218
                          acme.cert-manager.io/http01-solver=true
Annotations:              auth.istio.io/8089: NONE
Selector:                 acme.cert-manager.io/http-domain=3159757414,acme.cert-manager.io/http-token=607653218,acme.cert-manager.io/http01-solver=true
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.200.245.42
IPs:                      10.200.245.42
Port:                     http  8089/TCP
TargetPort:               8089/TCP
NodePort:                 http  30593/TCP
Endpoints:                <none>
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And I also managed to get some of the logs of this spawned solver pod in its short lifespan:

Logs from acmesolver in cm-acme-http-solver-jnm6f:

I0111 12:33:58.754880       1 solver.go:39] cert-manager/acmesolver "msg"="starting listener"  "expected_domain"="accounting-service-public.stag.aws.worksome.net" "expected_key"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4.-yuHWS8lDxJT_DIqqUoRVEU3PIZV3RT-ln1_oYBMf0A" "expected_token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "listen_port"=8089
I0111 12:34:09.444659       1 solver.go:64] cert-manager/acmesolver "msg"="validating request" "base_path"="/.well-known/acme-challenge" "host"="accounting-service-public.stag.aws.worksome.net" "path"="/.well-known/acme-challenge/K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" 
I0111 12:34:09.444701       1 solver.go:72] cert-manager/acmesolver "msg"="comparing host" "base_path"="/.well-known/acme-challenge" "host"="accounting-service-public.stag.aws.worksome.net" "path"="/.well-known/acme-challenge/K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "expected_host"="accounting-service-public.stag.aws.worksome.net"
I0111 12:34:09.444734       1 solver.go:79] cert-manager/acmesolver "msg"="comparing token" "base_path"="/.well-known/acme-challenge" "host"="accounting-service-public.stag.aws.worksome.net" "path"="/.well-known/acme-challenge/K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "expected_token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4"
I0111 12:34:09.444753       1 solver.go:87] cert-manager/acmesolver "msg"="got successful challenge request, writing key" "base_path"="/.well-known/acme-challenge" "host"="accounting-service-public.stag.aws.worksome.net" "path"="/.well-known/acme-challenge/K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" "token"="K6SJKwJBSPDYjW5MI4ICRuHMpBN8lDwU1NLhOX2chN4" 
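
The log lines above show the order of checks the solver performs on each request: base path, host, token, then writing the key. A minimal Python sketch of that flow follows. It is not cert-manager's actual code; the function name and the 404 responses on mismatch are assumptions made for illustration.

```python
BASE_PATH = "/.well-known/acme-challenge"


def handle_challenge_request(host, path, expected_domain, expected_token, expected_key):
    """Mimic the checks visible in the acmesolver log: path, host, token, then key.

    Returns a (status_code, body) tuple. The 404 on mismatch is an assumption.
    """
    prefix = BASE_PATH + "/"
    if not path.startswith(prefix):
        return 404, ""
    token = path[len(prefix):]

    # The Host header may carry a port; compare only the hostname part.
    hostname = host.split(":", 1)[0]
    if hostname != expected_domain:
        return 404, ""

    if token != expected_token:
        return 404, ""

    # "got successful challenge request, writing key"
    return 200, expected_key
```

The body returned on success corresponds to the --key argument of the solver container, which is the token followed by a dot and a second component.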

Issuer used:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # The ACME server URL
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: myemail@mycompany.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-staging.issuer.private-key
    # Enable the HTTP-01 challenge provider
    solvers:
    - http01:
        ingress:
          class: nginx

Expected behaviour: challenge is successful, resulting in “valid” order state and certificate being issued

Steps to reproduce the bug:

  • create an EKS cluster and set it up to work with Fargate
  • set up external-dns, ingress-nginx, aws-load-balancer-controller
  • deploy a sample application served behind ingress-nginx, with a HOST record configured in e.g. Route53 via external-dns
  • check that it can be reached over HTTP: curl http://$HOST
  • deploy Issuers: Let's Encrypt HTTP01 issuers for both staging and production
  • amend the Ingress: add the cert-manager.io/issuer: "letsencrypt-staging" annotation to the corresponding service Ingress
  • (expect certificate issuance and the endpoint responding successfully over https)

Anything else we need to know?:

  • I also tried setting the acme.cert-manager.io/http01-ingress-class: nginx and acme.cert-manager.io/http01-edit-in-place: "false" annotations on the service Ingress, but to no avail.
  • I also went through the troubleshooting info at https://cert-manager.io/docs/faq/acme/, but no luck.
  • curl http://$HOST feels slow at times. I wonder if that is because of DNS, and therefore whether it somehow affects the cert-manager solver or the overall process here.
  • This is a detailed bug report of my input in the comments of https://github.com/jetstack/cert-manager/issues/4709

Environment details:

  • Kubernetes version: v1.21.2-eks-06eac09, or more specifically
➜  cert-manager git:(xxxxxxxxxxxxx) ✗ kubectl version                                                                                                                                                                                      
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:33:37Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
  • Cloud-provider/provisioner: AWS / eks with fargate
  • cert-manager version: 1.6.1
  • Install method: installed via helm template cert-manager jetstack/cert-manager --version 1.6.1 -f values.yaml + kubectl apply of the resulting yaml (values.yaml is obtained by helm show values jetstack/cert-manager --version 1.6.1 and the only value changed in it is webhook.securePort: 10260) and of the corresponding CRDs

I can’t seem to debug further, as the solver pod+service+ingress live for just a minute or so.

/kind bug

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17

Most upvoted comments

I will dig some more. I assume there’s no (public) way to see what ACME is up to? (logs, or something)

Not directly logs as such. You can get some information about the state of authorizations, orders etc. by looking at the URLs on the cert-manager resources (i.e. the ACME authorization URL that gets put on the status of the Order). You can also increase the log level of the cert-manager controller with the --v=5 flag, which will, among other things, make it log the calls it makes to ACME.
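
As a rough illustration of reading such an authorization URL, here is a small Python sketch. It assumes the ACME server answers a plain unauthenticated GET on the authorization URL (Let's Encrypt has allowed this, but it is not guaranteed), and that the response follows the authorization object layout from RFC 8555 (status, identifier, challenges with an optional error). The function names are made up for this example.

```python
import json
import urllib.request


def summarize_authorization(authz: dict) -> list[str]:
    """Turn an ACME authorization object into a few readable lines."""
    ident = authz.get("identifier", {})
    lines = [
        f"authorization for {ident.get('value', '?')}: status={authz.get('status', '?')}"
    ]
    for chal in authz.get("challenges", []):
        line = f"  challenge type={chal.get('type', '?')} status={chal.get('status', '?')}"
        error = chal.get("error")
        if error:
            # A failed validation usually carries a problem document here.
            line += f" error={error.get('type', '?')}: {error.get('detail', '')}"
        lines.append(line)
    return lines


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch and decode a JSON document; assumes the URL allows a plain GET."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the URL copied from the Order status, print("\n".join(summarize_authorization(fetch_json(url)))) would show whether the http-01 challenge is pending, valid or invalid, and the error detail reported by the ACME server if there is one.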

Thank you for the great issue description. I have yet to read through the logs you posted.

I can’t seem to debug further, as the solver pod+service+ingress live for just a minute or so.

I have previously done this when debugging by modifying RBAC so that cert-manager doesn’t have permission to delete pods, services, ingresses and challenges; see https://github.com/jetstack/cert-manager/issues/4676#issuecomment-1003355941. The user there said it didn’t work for them, but it should be achievable by modifying RBAC.
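
One way to prepare such an RBAC change is sketched below in Python: take a Role or ClusterRole manifest as a dict (for example loaded from kubectl get ... -o json) and drop the delete verbs from any rule that touches the listed resources. Which roles cert-manager actually uses for this is not stated here, so that has to be checked in the cluster; the function name is hypothetical, and wildcard verbs or resources ("*") are deliberately not handled.

```python
import copy

# Resources whose deletion we want to block while debugging.
DEBUG_RESOURCES = frozenset({"pods", "services", "ingresses", "challenges"})


def without_delete(role: dict, resources=DEBUG_RESOURCES) -> dict:
    """Return a copy of a (Cluster)Role with delete verbs removed for `resources`.

    Note: a rule listing several resources loses delete for all of them,
    and rules using the "*" wildcard are left untouched.
    """
    patched = copy.deepcopy(role)
    for rule in patched.get("rules", []):
        if resources.intersection(rule.get("resources", [])):
            rule["verbs"] = [
                verb
                for verb in rule.get("verbs", [])
                if verb not in ("delete", "deletecollection")
            ]
    return patched
```

The patched manifest would then be applied back to the cluster for the duration of the debugging session and reverted afterwards, since reinstalling or upgrading cert-manager would likely restore the original rules anyway.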

The solver pod appears to be functioning as expected; the error at the end of its log output is it getting killed after cert-manager deleted the invalid Challenge.

Public records (crt.sh) suggest that no certificate has been issued for the domain. I’m not sure whether it is meant to show only issued certs or requests as well(?). MikeMcQ on community.letsencrypt.org says it displays only issued production certs.

As you say, there will be no certs on crt.sh, as it does not appear that ACME was able to successfully validate the challenge.

Looking at the last part of logs from controller:

E0111 11:34:39.004532 1 sync.go:386] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="context deadline exceeded" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"

This is cert-manager waiting for ACME to accept the authorization. At this point the self check must have succeeded and the challenge has been accepted with ACME; cert-manager is waiting for ACME to validate the challenge, but that wait times out, presumably because the ACME server’s request for the token served by the solver pod fails.

E0111 11:34:59.233881 1 sync.go:378] cert-manager/controller/challenges/acceptChallenge "msg"="error accepting challenge" "error"="400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending" "dnsName"="accounting-service-public.stag.aws.worksome.net" "resource_kind"="Challenge" "resource_name"="accounting-service-tls-2bnht-2824239067-1373156968" "resource_namespace"="default" "resource_version"="v1" "type"="HTTP-01"

This is cert-manager attempting again to accept the challenge and wait for ACME to validate it, but it gets back “authorization must be pending”. This could mean that the authorization was moved to a state other than ‘pending’ (i.e. ‘invalid’) in ACME because the previous attempt to validate the challenge failed (see the conversation on https://github.com/jetstack/cert-manager/issues/4676 for context), so I think the actual issue is the timeout from ACME that happened before.

As you suggest on the other issue, I think it is likely that the error is related to the DNS/networking/ingress setup, so that the ACME server’s query for the challenge URL is failing.
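
To test that hypothesis from outside the cluster, the ACME server’s request can be approximated by hand while the solver pod, service and ingress still exist. The Python sketch below builds the HTTP-01 URL from the domain and token and compares the response body with the expected key (the --key value passed to the solver). The function names are invented for this example, and a single successful probe from one network location does not prove that the ACME server’s own requests succeed.

```python
import urllib.error
import urllib.request


def challenge_url(domain: str, token: str) -> str:
    """URL an HTTP-01 validator requests for this domain and token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def body_matches(body: bytes, expected_key: str) -> bool:
    """True if the response body is the expected key, ignoring surrounding whitespace."""
    return body.decode("utf-8", errors="replace").strip() == expected_key


def probe(domain: str, token: str, expected_key: str, timeout: float = 10.0) -> str:
    """Fetch the challenge URL once and describe the outcome in one line."""
    url = challenge_url(domain, token)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as exc:
        # The ingress answered, but not with a 2xx (e.g. 404 or 503).
        return f"HTTP {exc.code} from {url}"
    except OSError as exc:
        # DNS failures, refused connections and timeouts end up here.
        return f"request to {url} failed: {exc}"
    return "ok" if body_matches(body, expected_key) else f"unexpected body from {url}"
```

A timeout or DNS error from probe(...) would point at the load balancer, security groups or DNS records, while an HTTP error or unexpected body would point at ingress routing to the solver service.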