spinnaker: clouddriver-ecs: applications search very slow

Running /search?pageSize=500&q=foobar&type=applications with ecs.enabled = true results in very slow response times (10-20+ seconds), occasionally timing out. Scaling up either the instance (from 1 CPU/3 GB to 2 CPU/4 GB) or the DB (CPU + RAM) does not lower the response time.

However, when I disable ecs again (i.e. with only aws still enabled), the search is almost instant again.

Two things I noticed:

When I have ecs disabled, the search issues exactly 1 query (which is what I would expect):

[Screenshot: Screen Shot 2020-10-01 at 02 11 05]

However, when I enable ecs, it runs over 3k SQL queries per search request:

[Screenshot: Screen Shot 2020-10-01 at 02 09 51]

[Screenshot: Screen Shot 2020-10-01 at 02 10 03]

Many of these queries run against alarms (??), taskDefinitions, services, etc.:

[Screenshot: Screen Shot 2020-10-01 at 02 10 33]

I noticed that the “application” column is empty in many of the ecs tables:

[Screenshot: Screen Shot 2020-10-01 at 02 14 14]

ECS services are also missing from cats_v1_applications.

This is with Spinnaker versions 1.22.1 and 1.23.1. We definitely had the issue earlier as well (i.e. the search with ecs never worked), including when using redis as the database for clouddriver. We only gained visibility into it now, after adding datadog-apm.

About this issue

State: open
Created 4 years ago
Reactions: 3
Comments: 30

Most upvoted comments

From what I can tell (though I haven’t tested this) the issue is that getClusterSummaries() is returning the full data set when it should be just returning summary data. That flag is then used during the search results to pull in details of matching server groups.
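To illustrate the summary-versus-full distinction, here is a rough Python sketch. It is not clouddriver code: `FakeProvider`, `cluster_summaries`, and `server_group` are hypothetical names, and the data is made up. The idea is only that the search matches against cheap name-only summaries and loads full details solely for the hits.

```python
from dataclasses import dataclass, field


@dataclass
class ServerGroup:
    name: str
    instances: list = field(default_factory=list)


class FakeProvider:
    """Hypothetical provider; not clouddriver's real provider interface."""

    def __init__(self, groups):
        self._groups = {g.name: g for g in groups}
        self.detail_loads = 0  # counts the expensive lookups

    def cluster_summaries(self):
        # Names only: cheap to produce, enough to match a search term.
        return sorted(self._groups)

    def server_group(self, name):
        # The expensive call, made once per match rather than once per group.
        self.detail_loads += 1
        return self._groups[name]


def search(provider, term):
    """Match on summaries first, then fetch details only for the hits."""
    hits = [n for n in provider.cluster_summaries() if term in n]
    return [provider.server_group(n) for n in hits]
```

With 100 server groups and a term that matches one of them, this performs a single detail load instead of 100.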

So there’s kind of two issues:

The summary data fetching methods are pulling in far too much data for what’s asked and doing it in an inefficient way. They basically get a list of applications from Front50, then check every cluster, alarm, lb, etc. to see if they match one of the applications.
It’s possible to speed up the search even more by just returning some summary stub data and if that flag is true, the search logic will query each server group it needs more data on.
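To make the first issue concrete, here is a minimal Python sketch using an in-memory SQLite table. It is an assumption-laden toy, not the actual ECS provider logic: the `services` table, its columns, and all function names are made up. It only shows how looking up each resource individually inflates the query count compared with pushing the filter into one query.

```python
import sqlite3


def build_db():
    """Create a toy table; schema is illustrative, not clouddriver's."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE services (id TEXT PRIMARY KEY, application TEXT)")
    conn.execute("CREATE INDEX services_app ON services (application)")
    rows = [(f"svc-{i}", f"app{i % 50}") for i in range(1000)]
    conn.executemany("INSERT INTO services VALUES (?, ?)", rows)
    return conn


class CountingDb:
    """Wraps a connection and counts the SELECTs issued through it."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def select(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def search_fan_out(db, term):
    """List every application, then fetch each resource row one by one."""
    apps = [r[0] for r in db.select("SELECT DISTINCT application FROM services")]
    matches = {}
    for app in apps:
        ids = db.select("SELECT id FROM services WHERE application = ?", (app,))
        for (service_id,) in ids:
            # One extra round trip per resource, whether or not the app matches.
            db.select("SELECT * FROM services WHERE id = ?", (service_id,))
        if term in app:
            matches[app] = len(ids)
    return matches


def search_filtered(db, term):
    """Push the filter and the aggregation into a single query."""
    rows = db.select(
        "SELECT application, COUNT(*) FROM services "
        "WHERE application LIKE ? GROUP BY application",
        (f"%{term}%",),
    )
    return dict(rows)
```

With 1,000 rows spread over 50 applications, `search_fan_out` issues 1 + 50 + 1000 = 1051 SELECTs, while `search_filtered` returns the same result with one.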

The AWS provider has a bit of optimisation around this that I’m trying to understand how to map to how the ECS provider does things.

deverton on Jul 22, 2021

@ScOut3R we decided against Halyard for this reason, and instead deployed to ECS/Fargate, which made it a lot less operationally troublesome for us. I wrote a high-level article on it here: https://www.lifeofguenter.de/2020/10/running-spinnaker-on-ecsfargate.html

Thanks @lifeofguenter, will have a look!

You can simplify your Dockerfile by using ADD https://dtdg.co/latest-java-tracer /opt/datadog.jar instead of adding and calling curl.

ScOut3R on Jul 16, 2021

@ScOut3R: https://docs.datadoghq.com/tracing/setup_overview/setup/java/?tab=containers

So in our case we do the following:

ARG spinnaker_version

FROM us-docker.pkg.dev/spinnaker-community/docker/clouddriver:spinnaker-${spinnaker_version}

ENV JAVA_OPTS="-XX:InitialRAMPercentage=50.0 -XX:MinRAMPercentage=50.0 -XX:MaxRAMPercentage=70.0 -javaagent:/opt/datadog.jar"

USER root

RUN set -ex && \
    apk add --no-progress --no-cache \
      curl && \
    curl -fsSLo /opt/datadog.jar https://dtdg.co/latest-java-tracer

USER spinnaker

... (mainly copying over config files)

That is ingenious @lifeofguenter, thank you! Now I just need to figure out how to use custom docker images with Halyard and Kubernetes. 😃

ScOut3R on Jul 16, 2021