linkerd2: When downstream deployment is updated, upstream pods don't connect to new downstream pods

Given: deploy_a -> deploy_b (a is a client/connects to b)

Sometimes, not always, when a rolling update of deploy_b is performed (new pods are created), every request made by deploy_a starts failing, as if the proxies in the deploy_a pods were not aware that the destination pods for deploy_b had changed. Killing all deploy_a pods (i.e. letting them respawn) fixes the problem.
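
For anyone hitting the same thing, the workaround above could be scripted roughly as follows. This is only a minimal sketch: it assumes the deploy_a pods carry a label such as app=deploy_a and run in the default namespace (both hypothetical), and the helper names are made up for illustration.

```python
import subprocess
from typing import List


def build_delete_command(selector: str, namespace: str = "default") -> List[str]:
    """Build the kubectl invocation that deletes every pod matching the selector."""
    # Assumption: the selector (e.g. "app=deploy_a") is hypothetical; use the
    # labels that the deploy_a pod template actually defines.
    return [
        "kubectl", "delete", "pods",
        "--selector", selector,
        "--namespace", namespace,
    ]


def restart_client_pods(selector: str, namespace: str = "default") -> None:
    """Delete the matching pods so the Deployment controller recreates them."""
    # check=True raises CalledProcessError if kubectl exits with a non-zero status.
    subprocess.run(build_delete_command(selector, namespace), check=True)


# Print the command instead of running it, so it can be reviewed first.
print(" ".join(build_delete_command("app=deploy_a")))
```

Calling restart_client_pods("app=deploy_a") would actually delete the pods. This only clears the symptom; it does not explain why the proxies keep using the old deploy_b endpoints.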

Unfortunately, I don’t have much more to add, as I have not yet had time to dive deeper into the issue and better understand what is going on.

Also, I apologize if this has already been filed; I did not search existing issues.

I just wanted to put it out there because I’ve been bitten by this since early Conduit versions and have been waiting to have details before filing it…

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 74 (54 by maintainers)

Most upvoted comments

Would anyone mind if we close this issue and open some more specific ones? There’s so much in here at this point it is tough to figure out what is actually going on.

@JCMais unfortunately we didn’t get to spend time on this last week after all, but we’ll be digging into it this week.

No problem from me, but please link any newly created issues here so it’s easier to follow.

Latest update: we haven’t been able to repro this on our side. However, we’re going to add more logging to the destination service, which may shed more light on the issue.

In the meantime we’re still going to work on repro-ing. @JCMais @markstgodard @bourquep @aleerizw are any of you seeing this issue on edge-18.11.1? If so, can you tell us what cloud provider, K8s version, # of services, and any other pertinent details about where and when it occurs?

Thanks for your patience, everyone. This is high priority for us.

Those are part of the client pod (the same place you’re seeing WARN proxy={..} [...] Error attempting to establish underlying session layer: No route to host (os error 113)).

@bourquep any specific replication steps? Does it happen sometimes, all the time, based on the phase of the moon?