containerd: panic.log on Windows is filled with containerd logging (with no panics) - causes disk overload

Description

OS: Windows Server 2019 - LTSC

panic.log file under:

 Directory: C:\ProgramData\containerd\root


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        1/27/2022  10:20 PM                io.containerd.content.v1.content
d-----        3/16/2022  11:46 PM                io.containerd.grpc.v1.cri
d-----        1/27/2022  10:20 PM                io.containerd.metadata.v1.bolt
d-----        1/27/2022  10:20 PM                io.containerd.runtime.v2.task
d-----        1/27/2022  10:20 PM                io.containerd.snapshotter.v1.windows
d-----        1/27/2022  10:20 PM                io.containerd.snapshotter.v1.windows-lcow
d-----        1/27/2022  10:20 PM                opt
-a----        3/16/2022  11:44 PM              0 panic.log

The directory listing shows it as 0 bytes (presumably because the size in the directory entry is not updated while containerd still holds the file handle open), but measuring the actual size/content shows that it is filled (even with some debug-level logs) although the containerd service is not panicking:

PS C:\ProgramData\containerd\root>  (gci panic.log | measure Length -s).Sum /1GB
2.69636131729931

This is from a node that has been running for ~45 days, with no restarts of the service and not much activity at all. Containerd version: 1.5.8.

Examining some of the entries written to panic.log, they appear to be ordinary (up to debug-level) logs from the containerd service, for example:

PS C:\ProgramData\containerd\root> Get-Content .\panic.log -Tail 5
time="2022-05-25T18:06:38.830624200Z" level=info msg=Span duration=5.0235ms endTime="2022-05-25 18:06:38.8356477 +0000 GMT m=+4205687.175091601" name=Stats parentSpanID=c2b2fb5f8a71e937 pod-id=4e690ea17a48cc1ced2e9fe419da0a4679a3bf26c636a3aa64d491a0acb142fb spanID=87dab501b9cef418 startTime="2022-05-25 18:06:38.8306242 +0000 GMT m=+4205687.170068101" tid=f381669289799dc5cc10f0a7370aa543fa0f72a1ef8e108b655bd0f26dee90d4 traceID=17eba89d208fec619f27024c91ab428c
time="2022-05-25T18:06:38.830624200Z" level=info msg=Span duration=5.0235ms endTime="2022-05-25 18:06:38.8356477 +0000 GMT m=+4205687.175091601" name=containerd.task.v2.Task.Stats parentSpanID=0000000000000000 spanID=c2b2fb5f8a71e937 startTime="2022-05-25 18:06:38.8306242 +0000 GMT m=+4205687.170068101" traceID=17eba89d208fec619f27024c91ab428c
time="2022-05-25T18:06:38.836395500Z" level=info msg=Span duration=5.3631ms endTime="2022-05-25 18:06:38.8417586 +0000 GMT m=+2482569.648015901" name=HcsGetComputeSystemProperties parentSpanID=f223ca604c5919fd propertyQuery="{\"PropertyTypes\":[\"Statistics\"]}" spanID=e35daa901aaf24c1 startTime="2022-05-25 18:06:38.8363955 +0000 GMT m=+2482569.642652801" traceID=7fad47e61bd3b2957b6c35db551bb7d9
time="2022-05-25T18:06:38.836395500Z" level=info msg=Span duration=5.3631ms endTime="2022-05-25 18:06:38.8417586 +0000 GMT m=+2482569.648015901" name=Stats parentSpanID=01eae9307e030e39 pod-id=a6f12b46c6f044b1feac3c1bcdff084f0dc3c1a83b993b7f7c27fcd71ea5196e spanID=f223ca604c5919fd startTime="2022-05-25 18:06:38.8363955 +0000 GMT m=+2482569.642652801" tid=f32f795e0e3c4895bf121f45a4095345b8128ce59e429d0165ceeac2e328d176 traceID=7fad47e61bd3b2957b6c35db551bb7d9
time="2022-05-25T18:06:38.836395500Z" level=info msg=Span duration=5.3631ms endTime="2022-05-25 18:06:38.8417586 +0000 GMT m=+2482569.648015901" name=containerd.task.v2.Task.Stats parentSpanID=0000000000000000 spanID=01eae9307e030e39 startTime="2022-05-25 18:06:38.8363955 +0000 GMT m=+2482569.642652801" traceID=7fad47e61bd3b2957b6c35db551bb7d9

The toml used doesn't specify any debug overrides, so this is the normal info-level logging - but that output is expected to be captured only in the containerd log file. What I'm seeing is that logs are written normally to the containerd log, yet panic.log is somehow capturing the same output stream as well.

The problem is that panic.log is not rotated, only replaced on new service starts. If the service keeps running for a long time and logs excessively (for example during a continuously failing deployment or probe), the file grows beyond 10GB in a matter of weeks, as we've seen in one case.
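
For illustration only, here is a rough sketch in Go of what size-capped rotation for such a file could look like (not containerd code; the file name, the 64 MiB limit and the .old backup convention are all assumptions made for the sketch):

// sizecapped.go - illustrative sketch only, not containerd's implementation.
// A writer that rotates its file once it grows past a fixed limit, so that a
// long-running service cannot fill the disk with one unbounded file.
package main

import (
	"os"
	"sync"
)

type cappedFile struct {
	mu    sync.Mutex
	path  string
	f     *os.File
	size  int64
	limit int64
}

func openCapped(path string, limit int64) (*cappedFile, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return nil, err
	}
	st, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	return &cappedFile{path: path, f: f, size: st.Size(), limit: limit}, nil
}

// Write appends to the file; once the limit would be exceeded, the current
// file is renamed to <path>.old (replacing any previous backup) and a fresh
// file is started, so at most roughly twice the limit is ever kept on disk.
func (c *cappedFile) Write(p []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.size+int64(len(p)) > c.limit {
		c.f.Close()
		os.Remove(c.path + ".old") // drop the previous backup, if any
		if err := os.Rename(c.path, c.path+".old"); err != nil {
			return 0, err
		}
		f, err := os.OpenFile(c.path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
		if err != nil {
			return 0, err
		}
		c.f, c.size = f, 0
	}
	n, err := c.f.Write(p)
	c.size += int64(n)
	return n, err
}

func main() {
	w, err := openCapped("panic.log", 64<<20) // 64 MiB cap, arbitrary for this sketch
	if err != nil {
		panic(err)
	}
	w.Write([]byte("example log line\n"))
}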

Steps to reproduce the issue

  1. Run containerd as a service on a Windows Server host
  2. Keep it running even for a short time
  3. Examine the size/content of the panic.log file, to see the written logs there

Describe the results you received and expected

panic.log is expected to hold output from panicking processes/threads, not the normal logs written by the containerd service. It also seems, based on the code path of the Windows service setup, that there is no way to control or disable the creation of this panic.log.
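
For context, here is a minimal sketch of the stderr-redirection pattern a Go program can use when it runs as a Windows service (illustrative only - the function name and details are assumptions, not containerd's exact code). Both the Win32 STD_ERROR_HANDLE and os.Stderr are pointed at a panic file so that runtime panics, which Go always writes to stderr, are preserved; the side effect is that anything else the process sends to stderr, including logrus output, lands in the same file:

//go:build windows

// service_stderr_windows.go - illustrative sketch only, not containerd's code.
package main

import (
	"os"

	"github.com/sirupsen/logrus"
	"golang.org/x/sys/windows"
)

// initPanicFile redirects the process's stderr to a file so that Go runtime
// panics survive when the process runs as a Windows service with no console.
func initPanicFile(path string) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return err
	}
	// Point the Win32 standard error handle at the panic file.
	if err := windows.SetStdHandle(windows.STD_ERROR_HANDLE, windows.Handle(f.Fd())); err != nil {
		f.Close()
		return err
	}
	// Point Go's os.Stderr at the same file as well.
	os.Stderr = os.NewFile(f.Fd(), "/dev/stderr")
	return nil
}

func main() {
	if err := initPanicFile(`C:\ProgramData\containerd\root\panic.log`); err != nil {
		logrus.Fatal(err)
	}
	// Side effect of the redirection: a logger writing to stderr now writes
	// into the panic file too, which is how ordinary info/debug lines can end
	// up there alongside any real panic output.
	logrus.SetOutput(os.Stderr)
	logrus.Info("this info-level line lands in panic.log")
}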

What version of containerd are you using?

containerd v1.5.8

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

PS C:\etc\kubernetes\node\bin> cat .\containerd.config.toml
[plugins.scheduler]
  schedule_delay = '0s'
  startup_delay = '0s'
[plugins.cri]
  sandbox_image = 'gcr.io/gke-release/pause-win:1.6.1'
[plugins.cri.containerd]
  snapshotter = 'windows'
  default_runtime_name = 'runhcs-wcow-process'
  disable_snapshot_annotations = true
  discard_unpacked_layers = true
[plugins.cri.cni]
  bin_dir = 'C:\etc\kubernetes\cni'
  conf_dir = 'C:\etc\kubernetes\cni\config'

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (13 by maintainers)

Most upvoted comments

Thanks @dcantah for the update! I believe this should be fine to close, but I agree that proper rotation in the future might be the best way.

I have a change in mind to be able to send the shim logs to the same logfile (and avoid writing to panic.log as well).

That would be awesome! That way we don't lose the information and keep panic.log for its actual use case. Thanks, Danny, for looking into this!