kaniko: Image build process freezes on `Taking snapshot of full filesystem...`

Actual behavior: While building an image with gcr.io/kaniko-project/executor:debug in a GitLab CI runner hosted on Kubernetes (deployed via the Helm chart), the build freezes on Taking snapshot of full filesystem... until the runner times out (1 hr). The behaviour is intermittent: for the same project, the image build stage sometimes succeeds.

The issue arises with multistage as well as single-stage Dockerfiles.

Expected behavior: The image build should not freeze at Taking snapshot of full filesystem... and should succeed every time.

To Reproduce: As the behaviour is intermittent, I am not sure how it can be reproduced.
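For context, the build is invoked from the CI job roughly like this; a minimal sketch assuming a standard GitLab CI setup (the Dockerfile path, tag, and registry variables below are illustrative, not taken from this report):

# Typical kaniko invocation inside a GitLab CI job script (placeholder values)
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}"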

Description Yes/No
  • Please check if this is a new feature you are proposing: [ ]
  • Please check if the build works in docker but not in kaniko: [Yes]
  • Please check if this error is seen when you use the --cache flag: [ ]
  • Please check if your dockerfile is a multistage dockerfile: [ ]

@tejal29

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 19
  • Comments: 47 (4 by maintainers)

Most upvoted comments

The issue is still present for me too. Any updates?

I am experiencing this problem while building an image of less than a GB. Interestingly, it fails silently: the GitLab CI job is marked as successful, but no image is actually pushed.

We are using kaniko for several other projects, but this error only happens on two of them. Both are monorepos and use lerna to extend yarn commands to sub-packages.

I must say it was working at some point, and it does work normally when using docker to build the image.

Here is a snippet of the build logs:

INFO[0363] RUN yarn install --network-timeout 100000    
INFO[0363] cmd: /bin/sh                                 
INFO[0363] args: [-c yarn install --network-timeout 100000] 
INFO[0363] util.Lookup returned: &{Uid:1000 Gid:1000 Username:node Name: HomeDir:/home/node} 
INFO[0363] performing slow lookup of group ids for node 
INFO[0363] Running: [/bin/sh -c yarn install --network-timeout 100000] 
yarn install v1.22.5
info No lockfile found.
[1/4] Resolving packages...
INFO[0368] Pushed image to 1 destinations               
... A bunch of yarn logs ...
[4/4] Building fresh packages...
success Saved lockfile.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna info bootstrap root only
yarn install v1.22.5
[1/4] Resolving packages...
success Already up-to-date.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna WARN bootstrap Skipping recursive execution
Done in 20.00s.
Done in 616.92s.
INFO[0982] Taking snapshot of full filesystem...   

It is interesting to note that RUN yarn install --network-timeout 100000 is not the last step in the Dockerfile.

Neither --snapshotMode=redo nor --use-new-run solved the problem.
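For anyone who wants to try the same two flags, they are passed directly to the executor; a sketch with placeholder context/destination values (only the flags themselves come from the comment above):

# --snapshotMode=redo inspects file metadata instead of hashing full file contents;
# --use-new-run enables the experimental run implementation that avoids some full snapshots.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}" \
  --snapshotMode=redo \
  --use-new-run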

We could fix the GitLab CI/CD pipeline error

Taking snapshot of full filesystem....
Killed

with --compressed-caching=false and v1.8.0-debug. The image is around 2 GB; Alpine reported around 4 GB across roughly 100 packages.
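A hedged sketch of what that workaround looks like in a job script; only the flag and the executor version come from the comment above, the rest are placeholder values:

# gcr.io/kaniko-project/executor:v1.8.0-debug with compressed caching disabled,
# which can reduce kaniko's peak memory use during the build.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}" \
  --compressed-caching=false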

Adding a data point: I initially observed the build process freezing when I did not set any memory/CPU requests/limits. After I added memory/CPU requests & limits, the process started to get OOM-killed. I increased the memory limit to 6 GB, but it still reaches OOM kill. Looking at the memory usage, it skyrockets at the end, when the log reaches "Taking snapshot of full filesystem". EDIT: I tried building the same image in local Docker, and the maximum memory usage is less than 1 GB.

[screenshot: container memory usage spiking at the "Taking snapshot of full filesystem" step]

Logs:

+ dockerfile=v2/container/driver/Dockerfile
+ context_uri=
+ context_artifact_path=/tmp/inputs/context_artifact/data
+ context_sub_path=
+ destination=gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver
+ digest_output_path=/tmp/outputs/digest/data
+ cache=true
+ cache_ttl=24h
+ context=
+ '[['  '!='  ]]
+ context=dir:///tmp/inputs/context_artifact/data
+ dirname /tmp/outputs/digest/data
+ mkdir -p /tmp/outputs/digest
+ /kaniko/executor --dockerfile v2/container/driver/Dockerfile --context dir:///tmp/inputs/context_artifact/data --destination gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver --snapshotMode redo --image-name-with-digest-file /tmp/outputs/digest/data '--cache=true' '--cache-ttl=24h'
E0730 12:20:40.314406      21 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0000] Resolved base name golang:1.15-alpine to builder 
INFO[0000] Using dockerignore file: /tmp/inputs/context_artifact/data/.dockerignore 
INFO[0000] Retrieving image manifest golang:1.15-alpine 
INFO[0000] Retrieving image golang:1.15-alpine from registry index.docker.io 
E0730 12:20:40.518068      21 metadata.go:166] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url 
http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] No base image, nothing to extract            
INFO[0001] Built cross stage deps: map[0:[/build/v2/build/driver]] 
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Checking for cached layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e... 
INFO[0001] No cached layer found for cmd RUN apk add --no-cache make bash 
INFO[0001] Unpacking rootfs as cmd RUN apk add --no-cache make bash requires it. 
INFO[0009] RUN apk add --no-cache make bash             
INFO[0009] Taking snapshot of full filesystem...        
INFO[0016] cmd: /bin/sh                                 
INFO[0016] args: [-c apk add --no-cache make bash]      
INFO[0016] Running: [/bin/sh -c apk add --no-cache make bash] 
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/5) Installing ncurses-terminfo-base (6.2_p20210612-r0)
(2/5) Installing ncurses-libs (6.2_p20210612-r0)
(3/5) Installing readline (8.1.0-r0)
(4/5) Installing bash (5.1.4-r0)
Executing bash-5.1.4-r0.post-install
(5/5) Installing make (4.3-r0)
Executing busybox-1.33.1-r2.trigger
OK: 9 MiB in 20 packages
INFO[0016] Taking snapshot of full filesystem...        
INFO[0017] Pushing layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e to cache now 
INFO[0017] WORKDIR /build                               
INFO[0017] cmd: workdir                                 
INFO[0017] Changed working directory to /build          
INFO[0017] Creating directory /build                    
INFO[0017] Taking snapshot of files...                  
INFO[0017] Pushing image to gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e 
INFO[0017] COPY api/go.mod api/go.sum api/              
INFO[0017] Taking snapshot of files...                  
INFO[0017] COPY v2/go.mod v2/go.sum v2/                 
INFO[0017] Taking snapshot of files...                  
INFO[0017] RUN cd v2 && go mod download                 
INFO[0017] cmd: /bin/sh                                 
INFO[0017] args: [-c cd v2 && go mod download]          
INFO[0017] Running: [/bin/sh -c cd v2 && go mod download] 
INFO[0018] Pushed image to 1 destinations               
INFO[0140] Taking snapshot of full filesystem...        
Killed

Version: gcr.io/kaniko-project/executor:v1.6.0-debug. Args: I added --snapshotMode redo and --cache=true. Environment: GKE 1.19, using Kubeflow Pipelines to run the kaniko containers.

Same issue; nothing changed except the kaniko version.

It sounds like whatever bug is causing that is still present, so it won’t be fixed by releasing the latest image as v1.8.0. We just need someone to figure out why it gets stuck and fix it.

Hold on a second, maybe I spoke too soon!

My pipeline currently builds multiple images in parallel. I hadn’t realized before that one of them, which previously got stuck in taking snapshot, now goes through smoothly with --snapshotMode=redo --use-new-run and gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631-debug.

The images that actually get stuck are basically the same Postgres image built with different build-arg values, so this ends up running (and caching) the same layers in parallel.

I consequently removed this parallelism and built these Postgres images in sequence. I still ended up with Postgres images stuck in taking snapshot, this time in parallel with a totally different Node.js image that was also stuck taking snapshots.

So from my tests, it looks like when images are built in parallel against the same registry mirror used as a cache, and one image takes a snapshot at the same time as another, it gets stuck.

It may be a coincidence, maybe not. I repeat: this is from my tests; it could be totally unrelated to the problem.

Edit: my guess was wrong. I reverted to kaniko:1.3.0-debug and added enough memory requests & limits, but I’m still observing the image build freezing problem from time to time.

Hello everyone! I found a solution here: https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation. It consists of adding the --single-snapshot option to kaniko:

/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}" --single-snapshot

I have the same issue on GitLab CI/CD, but only when cache is set to true.