prefect: Kubernetes Cluster AutoScaler can result in failed Flow Runs

Description

This only happens on K8 clusters that use the Cluster Autoscaler to dynamically provision K8 Nodes based on current workload demand in the cluster. In my case, I’ve observed this on AWS EKS with manual NodeGroups.

The Cluster Autoscaler will evaluate the utilization of each Node in the K8 cluster to determine if it can scale down the number of Nodes. By default a Node becomes eligible for removal after it has been unneeded for 10 minutes (configurable via the --scale-down-unneeded-time flag), and the decision is based on a number of factors, the main one being that the sum of the cpu and memory requests of all Pods running on the Node is smaller than 50% of the Node’s allocatable resources (configurable via --scale-down-utilization-threshold). There’s also a back-off period of 10 minutes after a Node is first added to the cluster (configurable via the --scale-down-delay-after-add flag).
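
For reference, these thresholds are just flags on the cluster-autoscaler container. A minimal sketch of how they might appear in its Deployment manifest (the container layout here is illustrative, and the values shown are the defaults described above):

# Excerpt from a cluster-autoscaler Deployment Pod template (illustrative)
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-unneeded-time=10m          # how long a Node must be unneeded before removal
      - --scale-down-utilization-threshold=0.5  # Pod requests must be below 50% of allocatable
      - --scale-down-delay-after-add=10m        # back-off after a new Node joins the cluster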

This functionality of K8 assumes that all Pods created by a controller object (Deployment, ReplicaSet, Job, StatefulSet, etc.) are ephemeral and can be relocated to a different Node with zero impact on their functionality. The Prefect Kubernetes Agent triggers Flow Runs via K8 Jobs (effectively declaring them ephemeral), which spawn the Pods that perform the actual execution.

The problem is intermittent, and only comes into play under the following situation:

  1. A new, long-running (i.e. >21 minutes) Prefect Flow Run is scheduled, and picked up by a Prefect Agent running in K8 w/Cluster Autoscaler
  2. The Prefect Agent creates a K8 Job, but no K8 Node has capacity available to run this Job within 1-5 minutes
  3. The Cluster Autoscaler scales up the Nodes in K8. The Node it adds has at least 2x the cpu/memory allocatable compared to what the single Prefect Flow Run would consume.
  4. The Prefect Flow Run is assigned to the new Node the Cluster Autoscaler just started
  5. Capacity frees up on the other Nodes in the K8 cluster, such that no other Pods are scheduled on the new Node the Cluster Autoscaler just started (in step 3)
  6. After 20 minutes, the Cluster Autoscaler determines that the Node from step 3 is a candidate to scale down because it meets the criteria (i.e. sum of cpu/memory of all pods is smaller than 50%, the pods were created by a controller object, etc.)
  7. The Pod (which is executing the Flow Run) is evicted from this Node and rescheduled to a different Node in K8, and the Cluster Autoscaler Node is removed from K8

Now the fun part:

  1. Because K8 evicted the Pod, Prefect Cloud will attempt to keep track of the original Pod (which no longer exists)
  2. When the evicted Pod starts again, it will begin the Flow Run execution from the beginning … and because this is a long-running Flow Run, it’ll take >21 minutes to complete
  3. Eventually Prefect will report "No heartbeat detected from the remote task; marking the run as failed." for the initial Pod it created, which has since been relocated by the Cluster Autoscaler.
  4. You can observe that the newly started Pod is actually reporting logs to Prefect Cloud, so it’s curious why no heartbeat was detected from that new process. I’m guessing it’s looking for some metadata to identify a heartbeat from the original Pod IP.
  5. Because no heartbeat was detected from that original Pod, the Lazarus process will attempt to reschedule the Flow Run … but the Flow Run was already restarted by K8; it’s just in a different Pod now
  6. This confusion will result in the Flow Run being marked as “Failed”

While less likely, it’s feasible for a Flow Run to be scheduled on a Node that has been up for a while, but is a candidate to be scaled down within the next minute. Then, after ~1 minute of executing the Flow Run on this Node, capacity on other Nodes could free up, such that the Cluster Autoscaler decides to relocate all the running Pods, which would result in the same type of failure. In this case, the failure could happen to any Flow Run, regardless of execution time.

Expected Behavior

The root of the problem is that the Cluster Autoscaler is moving the Pod created by the Job the Prefect Agent generated for this Flow Run, before that Flow Run has had time to complete. It seems the scheduler may not be capable of handling situations where the Flow Run is restarted by processes outside of its control.

A simple way to prevent the Cluster Autoscaler from attempting to evict a Pod from a Node would be to add this annotation to its manifest:

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

If the Cluster Autoscaler sees this annotation on a Pod, it will not consider that Node for scaling down. In theory you could add this to a job_spec_file, but it’d be exhausting to do this for every Flow you want to execute in a K8 cluster w/Cluster Autoscaler.
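
For illustration, here’s a sketch of where that annotation would sit in a Job spec like the ones the Agent creates (the names and image below are placeholders, not the Agent’s actual template). Note that it needs to go on the Pod template’s metadata, since the Cluster Autoscaler inspects Pod annotations:

apiVersion: batch/v1
kind: Job
metadata:
  name: prefect-flow-run-example        # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Tells the Cluster Autoscaler this Pod is not safe to evict, so its
        # Node will not be considered for scale-down while the Flow Run executes
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: flow
          image: example/long-running-flow:latest   # placeholder image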

Reproduction

There’s a public Docker image containing a flow that simply sleeps for a few minutes: https://hub.docker.com/repository/docker/szelenka/long-running-flow

In AWS, I have an EKS cluster with a simple NodeGroup of t3.2xlarge, where the Prefect Agent spawns Jobs with 2 vCPU and 2G RAM. Initially, we only have the Prefect Agent running on a single Node in K8.
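
For context, those per-run resources translate to requests roughly like the following in the Job’s Pod spec (a sketch; the Agent’s actual template isn’t reproduced here):

resources:
  requests:
    cpu: "2"       # 2 vCPU per Flow Run
    memory: 2Gi    # 2G RAM per Flow Run

For reference, a t3.2xlarge provides 8 vCPUs and 32 GiB of RAM.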

Through Prefect Cloud, we start 3 Flow Runs that run for 5 minutes each. This fills up the capacity of the single Node in K8.

Then we start another Flow Run, with a run time of 25 minutes. This causes the Cluster Autoscaler to scale up a new Node in K8, and the Job is assigned to that new Node.

Because this Flow Run takes longer than 20 minutes to complete, and the other Flow Runs have finished (freeing up capacity on the original Node), its Pod is evicted and relocated to that Node, triggering the failure described above.

Environment

{
  "config_overrides": {
    "cloud": {
      "use_local_secrets": true
    },
    "context": {
      "secrets": false
    }
  },
  "env_vars": [],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_version": "0.12.6",
    "python_version": "3.8.1"
  }
}

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 20 (9 by maintainers)

Most upvoted comments

@szelenka Thanks for a well-written issue! I think the best bet here would be to add an eviction-policy option to the agent, so you could say something along the lines of:

prefect agent install/start kubernetes --safe-to-evict=False

And then all created Jobs will have the annotation

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"