spire: Verify spire-agent high cpu load caused by k8s JSON parsing
- Version: 1.5.2
- Platform: Linux 5.19.0-35-generic #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64
- Subsystem: agent
During performance tests we observed that the spire-agents show high CPU utilization while the spire-servers hardly use any CPU:
```
NAME                CPU(cores)   MEMORY(bytes)
spire-agent-5nlk4   488m         47Mi
spire-agent-c7tpr   634m         51Mi
spire-agent-jgsvr   464m         58Mi
spire-agent-l9zzz   406m         39Mi
spire-agent-mdstq   487m         60Mi
spire-agent-qnpd2   390m         45Mi
spire-agent-r6r94   509m         40Mi
spire-server-0      7m           117Mi
spire-server-1      3m           48Mi
```
There are 46 registration entries with a 1h TTL. The system has 4 worker nodes with a maximum of 500 pods per worker node; during the test there are ~70 pods running per worker node.
Below is the agent configuration, which is very simple:
```hcl
agent {
    data_dir = "/run/spire/temp"
    log_level = "DEBUG"
    server_address = "spire-server"
    server_port = "8081"
    socket_path = "/run/spire/sockets/agent.sock"
    trust_bundle_path = "/run/spire/bundle/bundle.crt"
    trust_domain = "example.org"
}

plugins {
    NodeAttestor "k8s_psat" {
        plugin_data {
            cluster = "example-cluster"
        }
    }

    KeyManager "memory" {
        plugin_data {}
    }

    WorkloadAttestor "k8s" {
        plugin_data {
            skip_kubelet_verification = true
        }
    }
}
```
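Since the issue title points at k8s JSON parsing, here is a minimal, hypothetical sketch of the kind of work involved when a workload attestor fetches the kubelet's pod list and decodes it. This is not SPIRE's actual implementation; the endpoint, the read-only port, and the trimmed-down structs are illustrative assumptions. The point it tries to make visible is that the whole pod list is decoded per call, so the parsing cost grows with the number of pods on the node.

```go
// Hypothetical, simplified sketch of the per-attestation work of a k8s
// workload attestor: fetch the kubelet's pod list and decode the JSON.
// Endpoint, port, and struct fields are assumptions, not SPIRE's code.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// podList is a trimmed-down stand-in for the kubelet's /pods response.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
			UID  string `json:"uid"`
		} `json:"metadata"`
		Status struct {
			ContainerStatuses []struct {
				ContainerID string `json:"containerID"`
			} `json:"containerStatuses"`
		} `json:"status"`
	} `json:"items"`
}

func fetchPods(kubeletURL string) (*podList, error) {
	resp, err := http.Get(kubeletURL + "/pods")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// The whole pod list is decoded on every call, so the JSON parsing
	// cost grows with the number of pods running on the node.
	var pods podList
	if err := json.NewDecoder(resp.Body).Decode(&pods); err != nil {
		return nil, err
	}
	return &pods, nil
}

func main() {
	// Example against the kubelet read-only port (an assumption; the
	// authenticated secure port is more common in real clusters).
	pods, err := fetchPods("http://127.0.0.1:10255")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("decoded %d pods\n", len(pods.Items))
}
```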
By default the TTL was 1h for SVIDs and 12h for the CA, but we tried increasing and decreasing them without any significant effect on CPU utilization.
Based on our Slack discussion, we added pods to the cluster in 11 steps of 20 pods each. The diagram below shows how the CPU utilization of the spire-agent pod increased over time:
The utilization grows roughly linearly with the number of running pods. We saw the same behavior on every worker node, so at the peak the spire-agent utilized ~15% of the CPU. It then dropped back once we deleted all the pods we had created.
Meanwhile, the spire-agent logs do not contain any suspicious activity that would indicate the agents are busy at all.
We managed to reproduce it on other clusters as well, although the CPU utilization differed in ways we cannot explain. I understand that it depends on the number of running pods, but do all the processes running on that worker node count?
Do you have experience with crowded clusters running many pods, and what spire-agent's CPU utilization looks like there? Is this the expected behavior?
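To make the question about "all processes" concrete, our working assumption is that only processes calling the agent's Workload API trigger an attestation, and that each attestation maps the caller to a container via its cgroup before matching it against the pod list. A rough, hypothetical sketch of that mapping step (assuming containerd-style cgroup paths; not SPIRE's actual code):

```go
// Hypothetical sketch: map a calling process to its container ID by reading
// /proc/<pid>/cgroup. The 64-hex-character pattern is an assumption about
// containerd-style cgroup paths, not a guarantee for every runtime.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

var containerIDPattern = regexp.MustCompile(`([0-9a-f]{64})`)

func containerIDFromCgroup(pid int) (string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := containerIDPattern.FindString(scanner.Text()); m != "" {
			return m, nil
		}
	}
	return "", fmt.Errorf("no container ID found for pid %d", pid)
}

func main() {
	// Inspect the current process as an example caller.
	id, err := containerIDFromCgroup(os.Getpid())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("container ID:", id)
}
```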
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (20 by maintainers)
Commits related to this issue
- Fix for #3957 Signed-off-by: Szilard Vincze <szilard.vincze@est.tech> — committed to Nordix/spire by szvincze a year ago
- Fix for spiffe#3957 Signed-off-by: Szilard Vincze <szilard.vincze@est.tech> — committed to Nordix/spire by szvincze a year ago
- Fix for #3957 Signed-off-by: Szilard Vincze <szilard.vincze@est.tech> — committed to azdagron/spire by szvincze a year ago
- Fix for spiffe#3957 Signed-off-by: Szilard Vincze <szilard.vincze@est.tech> — committed to azdagron/spire by szvincze a year ago
The memory limit is 210Mi, but the actual memory consumption is way below that, and we are not tuning the Go GC at all.
On the one hand, I tried to reproduce the issue by decreasing the memory limit on another cluster where I had not observed high load at all. I managed to get similar profiling output, but the memory limit was very, very low in that case…
On the other hand, we increased and even removed the memory limit on the problematic cluster, but the CPU consumption remained the same as before.
I would highly appreciate it if you could give a hint on how to do it.
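For completeness, and purely as an illustration of what "tuning the Go GC" could mean here: for spire-agent this would normally be done from the outside via the GOGC and GOMEMLIMIT environment variables on the container rather than by code changes. The sketch below only shows the in-process equivalents; the values are arbitrary examples, not recommendations.

```go
// Illustrative only: in-process equivalents of the GOGC and GOMEMLIMIT
// environment variables. The values are arbitrary examples.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC equivalent: trigger GC when the heap grows 50% over the live
	// set (the default is 100).
	old := debug.SetGCPercent(50)

	// GOMEMLIMIT equivalent (Go 1.19+): a soft cap on the runtime's total
	// memory, here 200 MiB to stay under a 210Mi container limit.
	debug.SetMemoryLimit(200 << 20)

	fmt.Printf("previous GOGC value: %d\n", old)
}
```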