kaniko: Image build process freezes on `Taking snapshot of full filesystem...`

Actual behavior: While building an image with gcr.io/kaniko-project/executor:debug in a GitLab CI runner hosted on Kubernetes (deployed via the Helm chart), the build freezes on Taking snapshot of full filesystem... until the runner times out (1 hr). The behaviour is intermittent: for the same project, the image build stage sometimes succeeds.

The issue arises with multistage as well as single-stage Dockerfiles.

Expected behavior: The image build should not freeze at Taking snapshot of full filesystem... and should succeed every time.

To Reproduce: As the behaviour is intermittent, I am not sure how it can be reproduced.
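For context, the build is invoked from the CI job roughly like this; a minimal sketch assuming a standard GitLab CI setup (the Dockerfile path, tag, and registry variables below are illustrative, not taken from this report):

# Typical kaniko invocation inside a GitLab CI job script (placeholder values)
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}"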

Description Yes/No
  • Please check if this is a new feature you are proposing: [ ]
  • Please check if the build works in docker but not in kaniko: [Yes]
  • Please check if this error is seen when you use the --cache flag: [ ]
  • Please check if your dockerfile is a multistage dockerfile: [ ]

@tejal29

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 19
  • Comments: 47 (4 by maintainers)

Most upvoted comments

The issue is still present for me too. Any updates?

I am experiencing this problem while building an image of less than a GB. Interestingly, it fails silently: the GitLab CI job is marked as successful, but no image is actually pushed.

We are using kaniko for several other projects, but this error only happens on two of them. Both are monorepos and use lerna to extend yarn commands to sub-packages.

I must say it was working at some point, and it does work normally when using docker to build the image.

Here is a snippet of the build logs:

INFO[0363] RUN yarn install --network-timeout 100000    
INFO[0363] cmd: /bin/sh                                 
INFO[0363] args: [-c yarn install --network-timeout 100000] 
INFO[0363] util.Lookup returned: &{Uid:1000 Gid:1000 Username:node Name: HomeDir:/home/node} 
INFO[0363] performing slow lookup of group ids for node 
INFO[0363] Running: [/bin/sh -c yarn install --network-timeout 100000] 
yarn install v1.22.5
info No lockfile found.
[1/4] Resolving packages...
INFO[0368] Pushed image to 1 destinations               
... A bunch of yarn logs ...
[4/4] Building fresh packages...
success Saved lockfile.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna info bootstrap root only
yarn install v1.22.5
[1/4] Resolving packages...
success Already up-to-date.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna WARN bootstrap Skipping recursive execution
Done in 20.00s.
Done in 616.92s.
INFO[0982] Taking snapshot of full filesystem...   

It is interesting to note that RUN yarn install --network-timeout 100000 is not the last step in the Dockerfile.

Neither --snapshotMode=redo nor --use-new-run solved the problem.
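For anyone who wants to try the same two flags, they are passed directly to the executor; a sketch with placeholder context/destination values (only the flags themselves come from the comment above):

# --snapshotMode=redo inspects file metadata instead of hashing full file contents;
# --use-new-run enables the experimental run implementation that avoids some full snapshots.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}" \
  --snapshotMode=redo \
  --use-new-run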

We could fix the GitLab CI/CD pipeline error

Taking snapshot of full filesystem....
Killed

with --compressed-caching=false and v1.8.0-debug. The image is around 2 GB; Alpine reported around 4 GB across roughly 100 packages.
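A hedged sketch of what that workaround looks like in a job script; only the flag and the executor version come from the comment above, the rest are placeholder values:

# gcr.io/kaniko-project/executor:v1.8.0-debug with compressed caching disabled,
# which can reduce kaniko's peak memory use during the build.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}" \
  --compressed-caching=false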

Adding a data point: I initially observed the build process freezing when I did not set any memory/CPU requests/limits. After I added memory/CPU requests & limits, the process started to get OOM-killed. I increased the memory limit to 6 GB, but it still reaches OOM kill. Looking at the memory usage, it skyrockets at the end, when the log reaches "Taking snapshot of full filesystem". EDIT: I tried building the same image in local Docker, and the maximum memory usage is less than 1 GB.

[screenshot: container memory usage spiking at the "Taking snapshot of full filesystem" step]

Logs:

+ dockerfile=v2/container/driver/Dockerfile
+ context_uri=
+ context_artifact_path=/tmp/inputs/context_artifact/data
+ context_sub_path=
+ destination=gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver
+ digest_output_path=/tmp/outputs/digest/data
+ cache=true
+ cache_ttl=24h
+ context=
+ '[['  '!='  ]]
+ context=dir:///tmp/inputs/context_artifact/data
+ dirname /tmp/outputs/digest/data
+ mkdir -p /tmp/outputs/digest
+ /kaniko/executor --dockerfile v2/container/driver/Dockerfile --context dir:///tmp/inputs/context_artifact/data --destination gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver --snapshotMode redo --image-name-with-digest-file /tmp/outputs/digest/data '--cache=true' '--cache-ttl=24h'
E0730 12:20:40.314406      21 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0000] Resolved base name golang:1.15-alpine to builder 
INFO[0000] Using dockerignore file: /tmp/inputs/context_artifact/data/.dockerignore 
INFO[0000] Retrieving image manifest golang:1.15-alpine 
INFO[0000] Retrieving image golang:1.15-alpine from registry index.docker.io 
E0730 12:20:40.518068      21 metadata.go:166] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url 
http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] No base image, nothing to extract            
INFO[0001] Built cross stage deps: map[0:[/build/v2/build/driver]] 
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Checking for cached layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e... 
INFO[0001] No cached layer found for cmd RUN apk add --no-cache make bash 
INFO[0001] Unpacking rootfs as cmd RUN apk add --no-cache make bash requires it. 
INFO[0009] RUN apk add --no-cache make bash             
INFO[0009] Taking snapshot of full filesystem...        
INFO[0016] cmd: /bin/sh                                 
INFO[0016] args: [-c apk add --no-cache make bash]      
INFO[0016] Running: [/bin/sh -c apk add --no-cache make bash] 
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/5) Installing ncurses-terminfo-base (6.2_p20210612-r0)
(2/5) Installing ncurses-libs (6.2_p20210612-r0)
(3/5) Installing readline (8.1.0-r0)
(4/5) Installing bash (5.1.4-r0)
Executing bash-5.1.4-r0.post-install
(5/5) Installing make (4.3-r0)
Executing busybox-1.33.1-r2.trigger
OK: 9 MiB in 20 packages
INFO[0016] Taking snapshot of full filesystem...        
INFO[0017] Pushing layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e to cache now 
INFO[0017] WORKDIR /build                               
INFO[0017] cmd: workdir                                 
INFO[0017] Changed working directory to /build          
INFO[0017] Creating directory /build                    
INFO[0017] Taking snapshot of files...                  
INFO[0017] Pushing image to gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e 
INFO[0017] COPY api/go.mod api/go.sum api/              
INFO[0017] Taking snapshot of files...                  
INFO[0017] COPY v2/go.mod v2/go.sum v2/                 
INFO[0017] Taking snapshot of files...                  
INFO[0017] RUN cd v2 && go mod download                 
INFO[0017] cmd: /bin/sh                                 
INFO[0017] args: [-c cd v2 && go mod download]          
INFO[0017] Running: [/bin/sh -c cd v2 && go mod download] 
INFO[0018] Pushed image to 1 destinations               
INFO[0140] Taking snapshot of full filesystem...        
Killed

Version: gcr.io/kaniko-project/executor:v1.6.0-debug. Args: I added --snapshotMode redo and --cache=true. Environment: GKE 1.19, using Kubeflow Pipelines to run the kaniko containers.

Same issue; nothing changed except the kaniko version.

It sounds like whatever bug is causing that is still present, so it won’t be fixed by releasing the latest image as v1.8.0. We just need someone to figure out why it gets stuck and fix it.

Hold on a second, maybe I spoke too soon!

My pipeline currently builds multiple images in parallel. I hadn’t realized before that one of them, which previously got stuck in taking snapshot, now goes through smoothly with --snapshotMode=redo --use-new-run and gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631-debug.

The images that actually get stuck are basically the same Postgres image built with different build-arg values, so this ends up running (and caching) the same layers in parallel.

I consequently removed this parallelism and built these Postgres images in sequence. I still ended up with Postgres images stuck in taking snapshot, this time in parallel with a totally different Node.js image that was also stuck taking snapshots.

So from my tests, it looks like when images are built in parallel against the same registry mirror used as a cache, and one image takes a snapshot at the same time as another, it gets stuck.

It may be a coincidence, maybe not. I repeat: this is from my tests; it could be totally unrelated to the problem.

Edit: my guess was wrong. I reverted to kaniko:1.3.0-debug and added enough memory requests & limits, but I’m still observing the image build freezing problem from time to time.

Hello everyone! I found a solution here: https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation. It consists of adding the --single-snapshot option to kaniko:

/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}" --single-snapshot

I have the same issue on GitLab CI/CD, but only when cache is set to true.