vespa: Container restart fails after deployment files are garbage collected

Occasionally the Vespa container node is restarted for cluster maintenance or health reasons. Sometimes the container then fails to restart, and the application has to be redeployed (i.e. using vespa deploy) to restore service. I’ve been tracking this in the logs, and the failed restart appears to be caused by this entry (vespa-container-0):

[2023-04-03 14:57:38.604] INFO    configproxy      configproxy.com.yahoo.vespa.filedistribution.FileReferenceDownloader	file 'b46a99f799a311de' not found or timed out (error code 1) at tcp/vespa-0.vespa-internal.default.svc.cluster.local:19070

“b46a99f799a311de” is the application bundle. The logs show a few retries and eventually a full restart with backoff. This cycle repeats until the app bundle is redeployed.

There is connectivity between container and config server. On the config server side the error appears as:

[2023-04-03 14:57:38.603] INFO    configserver     Container.com.yahoo.vespa.config.server.filedistribution.FileServer	Failed downloading 'file 'b46a99f799a311de', client: Connection { Socket[addr=/10.96.3.4,port=52188,localport=19070] }'

Looking at the filesystem on the config server, /opt/vespa/var/db/vespa/filedistribution/ is empty. This would be explained by a log entry from a few days earlier:

[2023-03-31 10:43:40.094] INFO    configproxy      configproxy.com.yahoo.vespa.config.proxy.filedistribution.FileReferencesAndDownloadsMaintainer	Files that can be deleted in /opt/vespa/var/db/vespa/filedistribution (not used since 2023-03-17T10:43:40.064497184Z): [b099f8e64547a6d3, b46a99f799a311de, 3bfa63fe6fb5e795, cdfb8f194d6c5d2d, 1a3342246e80853f]

So, it appears as if a housekeeping process is clearing out those deployment artifacts.
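The correlation above was done by eye. As a rough sketch, it could be automated by scanning the logs for file references that appear both in a "Files that can be deleted" message and in a "not found or timed out" message. The regular expressions below are written against the two log lines quoted in this issue only; they are an assumption, not a documented Vespa log format, and the function name is hypothetical.

```python
import re

# Assumed message shapes, taken from the log lines quoted in this issue.
# The bracketed list in the maintainer message holds the file references.
DELETABLE_RE = re.compile(
    r"Files that can be deleted in \S+ \(not used since [^)]*\): \[([^\]]*)\]"
)
# Download failure as reported by FileReferenceDownloader.
NOT_FOUND_RE = re.compile(r"file '([0-9a-f]+)' not found or timed out")


def find_collected_references(log_lines):
    """Return file references that were flagged as deletable and also failed to download."""
    deletable = set()
    failed = set()
    for line in log_lines:
        match = DELETABLE_RE.search(line)
        if match:
            refs = (ref.strip() for ref in match.group(1).split(","))
            deletable.update(ref for ref in refs if ref)
        match = NOT_FOUND_RE.search(line)
        if match:
            failed.add(match.group(1))
    return sorted(deletable & failed)


# Abbreviated copies of the two log lines from this issue.
SAMPLE = [
    "[2023-03-31 10:43:40.094] INFO configproxy Files that can be deleted in "
    "/opt/vespa/var/db/vespa/filedistribution (not used since "
    "2023-03-17T10:43:40.064497184Z): [b099f8e64547a6d3, b46a99f799a311de, "
    "3bfa63fe6fb5e795, cdfb8f194d6c5d2d, 1a3342246e80853f]",
    "[2023-04-03 14:57:38.604] INFO configproxy file 'b46a99f799a311de' "
    "not found or timed out (error code 1)",
]
```

Calling find_collected_references(SAMPLE) returns only the application bundle reference, b46a99f799a311de, which matches the manual reading of the logs. The sketch does not check the order of the two messages.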

Environment:

  • Infrastructure: Kubernetes (GKE)

  • Vespa version: 8.131.17

Slack thread: https://vespatalk.slack.com/archives/C038YTM5SUQ/p1680597019763339

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

A fix (https://github.com/vespa-engine/vespa/pull/26248) for this issue was released in 8.132.43. The fix makes sure the bundle is not removed from the config server. I think we should increase the lifetime of the bundle on container nodes as well; I’ll look into it.
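To check whether a given deployment already includes this fix, the version can be compared against 8.132.43. A minimal sketch, assuming Vespa versions are plain dotted integers (major.minor.micro); the helper names are hypothetical:

```python
def parse_version(version):
    """Turn a dotted version string such as '8.131.17' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# First release containing the fix, as stated in the comment above.
FIXED_IN = parse_version("8.132.43")


def has_fix(version):
    """Return True if the given version is 8.132.43 or later."""
    return parse_version(version) >= FIXED_IN
```

With this, has_fix("8.131.17") is False for the version reported in this issue, and has_fix("8.132.43") is True.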

Good point! https://github.com/vespa-engine/sample-apps/tree/master/examples/operations/multinode-HA/gke may be more up to date, so I will look into removing basic-search-on-gke.