spegel: imagePull from private repo is slow second time
Pulling an image a second time, from another node in the same cluster, should be fast, but it took the same time as downloading from the remote registry, so it looks like the caching didn't work.
Here are the logs from the spegel pod for the xxx.yyy.io/app-image-k8:dev_123 image (repo and image details masked).
Any pointers on what the issue could be?
{"level":"error","ts":1696247240.4278097,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/app-image-k8/manifests/dev_123","status":404,"method":"HEAD","latency":5.000973923,"ip":"10.14.130.153","handler":"mirror","error":"could not resolve mirror for key: xxx.yyy.io/app-image-k8:dev_123","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
{"level":"error","ts":1696247731.0288842,"caller":"registry/registry.go:211","msg":"mirror failed attempting next","error":"expected mirror to respond with 200 OK but received: 500 Internal Server Error","stacktrace":"github.com/xenitab/spegel/internal/registry.(*Registry).handleMirror.func2\n\t/build/internal/registry/registry.go:211\nnet/http/httputil.(*ReverseProxy).modifyResponse\n\t/usr/local/go/src/net/http/httputil/reverseproxy.go:324\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/local/go/src/net/http/httputil/reverseproxy.go:490\ngithub.com/xenitab/spegel/internal/registry.(*Registry).handleMirror\n\t/build/internal/registry/registry.go:217\ngithub.com/xenitab/spegel/internal/registry.(*Registry).registryHandler\n\t/build/internal/registry/registry.go:137\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/xenitab/spegel/internal/registry.(*Registry).metricsHandler\n\t/build/internal/registry/registry.go:271\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.CustomRecoveryWithWriter.func1\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/slok/go-http-metrics/middleware/gin.Handler.func1.1\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/gin/gin.go:17\ngithub.com/slok/go-http-metrics/middleware.Middleware.Measure\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/middleware.go:117\ngithub.com/slok/go-http-metrics/middleware/gin.Handler.func1\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/gin/gin.go:16\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:28\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
About this issue
- State: closed
- Created 9 months ago
- Reactions: 1
- Comments: 31 (11 by maintainers)
discard_unpacked_layers = true in the containerd config is the issue: the unpacked layer blobs get discarded, so they have to be re-pulled every time, which makes spegel slow. https://github.com/awslabs/amazon-eks-ami/blob/915ce2222e692a4d7b5904ab020d3c28e1e3e511/files/containerd-config.toml#L10
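For reference, the relevant section of that containerd config looks roughly like this (a sketch; flipping the value to false keeps the unpacked layer blobs in containerd's content store so spegel can serve them to other nodes):

[plugins."io.containerd.grpc.v1.cri".containerd]
  # EKS AL2 AMIs ship this as true, which drops layer blobs after unpacking
  discard_unpacked_layers = false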
@phillebaba I’m also facing the same issue in EKS (v1.27) with the AWS CNI.
What I found is that spegel successfully mirrors all non-blob layers, but when a mirror request comes in for a blob layer, spegel returns 500 with:
content digest sha256 not found <sha256_value>
This happens because the containerd library spegel uses returns 404 for all blob layers. I SSH'd into the EKS node and tried the ctr CLI, and even with ctr I see the same behaviour: I can get the index/manifest/config layers with ctr content get sha256:<sha256_value_of_non_blob_layer>, but when I try to get a blob layer by digest, ctr returns "not found". I used quay.io/prometheus/alertmanager:v0.24.0 as a reference, but the behaviour is the same with all registries.
After verifying that the alertmanager image is already present on the node, I used the regctl CLI to list the layers for the linux/amd64 architecture. I was able to get the content of the image's manifest digest, but when I try to retrieve a blob layer, ctr returns "not found", which is the same behaviour we get from containerd's Go package. Even using code similar to this function https://github.com/XenitAB/spegel/blob/4a190529ade0eabbfcdc107c8173ef39b6c2f3b8/internal/oci/containerd.go#L150, I can list all the layers, but I still cannot retrieve a blob layer's data.
If I pull an image with ctr, e.g. ctr image pull quay.io/quay/busybox:latest, everything works and I can retrieve the content of the blob layers. But for an image that is already present on the machine (pulled via the kubelet), I cannot get the blob data. If I re-pull the same alertmanager image via ctr, it freshly pulls all the blob layers but reports all the non-blob layers as "exists".
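A rough sketch of the verification steps described above, assuming the image was pulled by the kubelet into containerd's k8s.io namespace (the digests are placeholders):

# confirm the image is present in the kubelet's containerd namespace
ctr -n k8s.io images ls | grep alertmanager
# index/manifest/config digests resolve fine from the content store
ctr -n k8s.io content get sha256:<config_or_manifest_digest>
# but layer blob digests come back "not found" when discard_unpacked_layers is true
ctr -n k8s.io content get sha256:<layer_blob_digest>
# pulling directly with ctr keeps the blobs, so content get then succeeds for them
ctr -n k8s.io images pull quay.io/quay/busybox:latest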
Yup, it's required. The old config file gets re-written to the default after this script runs. Blame Jeff Bezos for that one.
@infa-nang thanks for the great detective work determining the root cause of this issue!
The kind cluster used for the e2e tests sets the same configuration. https://github.com/XenitAB/spegel/blob/4a190529ade0eabbfcdc107c8173ef39b6c2f3b8/e2e/kind-config.yaml#L7-L11
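To reproduce the EKS behaviour in a local kind cluster, a config patch of roughly this shape applies the same containerd setting (a sketch, not the contents of the linked e2e file):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".containerd]
      discard_unpacked_layers = true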
There are now three main tasks that need to be done.
I have found a regression causing this issue and am working on a fix. In the meantime the best solution is to reduce the mirrorResolveTimeout to a low value like “50ms”.
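If spegel is installed via its Helm chart, the workaround would be a values override along these lines (the key name spegel.mirrorResolveTimeout is an assumption based on the option name above; verify it against the chart's values.yaml):

spegel:
  mirrorResolveTimeout: "50ms"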
@phillebaba Updated above. @tico24 I had to omit the newline character from the sed inline amend to get it to work. Not sure if karpenter was mangling it when launching the instance, but I confirmed via an SSM shell on the instance that the field had a literal \n prefix character in the bootstrap.sh file.
Additionally, I can confirm spegel's logs now show far more 200s and far fewer 500s/404s, so that's encouraging. Container start speed is anecdotally better. I'm also using NVMe-backed instances with the containerd layer cache on the local NVMe disk for even more performance.
Another caveat: I've only deployed AL2-backed AMIs so far, so this will need to be tested against other AMI types…
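For reference, the user-data amendment being described is roughly of this shape; the config path is an assumption based on the AL2 EKS AMI layout, not the commenter's exact script:

#!/bin/bash
# flip discard_unpacked_layers before bootstrap.sh renders the containerd config
sed -i 's/discard_unpacked_layers = true/discard_unpacked_layers = false/' \
  /etc/eks/containerd/containerd-config.toml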
Just confirming that my world is much happier after tweaking the containerd config. If it’s useful, here’s how I did it using the eks terraform module:
I downgraded the AWS CNI from 0.15.x to 0.14.x and then 0.13.x, in case the newer feature whereby it enforces NetworkPolicies was causing the issue.
Unfortunately it is still not working with AWS CNI 0.13.