containerd: Container running a high-memory-load workload is forcibly terminated with "shim reaped".
Hello. I’m sorry for my poor English.
I run a TensorFlow program in Docker, and the program is forcibly terminated with "shim reaped".
The workload is memory-intensive, but as far as the output of top shows, there is plenty of memory to spare: my server has 256 GB of RAM and the program uses at most 600 MB - 4 GB while running. The program runs without problems when not using Docker. I have tried most of the memory-related docker run options, and I have tried the devicemapper, overlay, and overlay2 storage drivers.
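For example, the memory-related options I experimented with were along these lines (the image name, script, and limit values here are placeholders, not my exact command):

# Placeholder image/script; limits set far above the observed 4 GB peak
docker run --rm \
    --memory=16g --memory-swap=16g \
    --oom-kill-disable \
    my-tensorflow-image python train.py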
Could you tell me whether this is a problem with how I am using Docker, or whether it is a known issue that is already planned to be fixed?
BUG REPORT INFORMATION
Steps to reproduce the issue:
It occurs when executing high-load TensorFlow processing (train()) in Docker.
Describe the results you received:
In the console: Error 137
In syslog (using overlay2):
Mar 12 13:02:44 gpu1 dockerd[9951]: time="2018-03-12T13:02:44+09:00" level=info msg="shim reaped" id=f24980703b5d13089200d6bd84bf4c8648a9d4216633e5da8ecd237f0ff6e0bb module="containerd/tasks"
Mar 12 13:02:44 gpu1 dockerd[9951]: time="2018-03-12T13:02:44.340010941+09:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 12 13:03:05 gpu1 sshd[13321]: Set /proc/self/oom_score_adj to 0
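For reference, exit code 137 is 128 + 9, i.e. the main process was killed with SIGKILL. Whether Docker recorded an OOM kill for this container can be checked with something like the following, using the container ID from the log above:

# f24980703b5d is the container ID prefix from the syslog messages above
sudo docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' f24980703b5d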
Describe the results you expected:
I expect the processing to finish without error.
Output of containerd --version:
containerd github.com/containerd/containerd v1.0.2 cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
Output of sudo docker info:
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 1
Server Version: 18.02.0-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c (expected: 9b55aab90508bd389d7654c4baf173a981477d55)
runc version: N/A (expected: 9f9c96235cc97674e935002fc3d78361b696a69e)
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.1.50
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.8GiB
Name: gpu1.dssci.ssk.yahoo.co.jp
ID: 47CX:FHBR:53ZM:IH3N:ZUEC:256P:D76K:QDSV:OHJ4:R4QR:JIWA:Z5EX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 19 (5 by maintainers)
I have to admit, I wouldn't mind seeing this re-opened, as I am having the same problem, with exactly the same output from journalctl.
This problem seems intermittent: sometimes the software inside the container runs for hours before being forcefully restarted, other times less than half an hour. @yohiram - did you ever find an answer, or any better understanding of what is going on?
You should be able to use
sudo journalctl -k
to see if it really is the OOM killer killing your task; that could help pinpoint the reason. You can also look in your application's logs for the reason it is being killed. The logs you are seeing from docker/containerd are standard log messages for when a task is killed, nothing important there. Also, could this be something with GPU memory/resources and nothing to do with system RAM?
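For example, something along these lines will pull typical OOM-killer messages out of the kernel log (the grep pattern is just a suggestion):

# Kernel messages only; OOM kills show up as "Out of memory: Kill process ..."
sudo journalctl -k | grep -iE 'out of memory|oom[-_ ]?kill'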
Hi all: I resolved it by disabling Transparent Huge Pages. It's OK now.
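For anyone wanting to try the same thing, THP can be disabled roughly like this on CentOS 7 (these are the standard sysfs paths; verify them on your own kernel):

# Show the current setting; the active value is in [brackets]
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable THP until the next reboot
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# To persist across reboots, add transparent_hugepage=never to the kernel
# command line (GRUB_CMDLINE_LINUX in /etc/default/grub) and regenerate grub.cfg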
I am facing the same issue in a Google Cloud Compute docker environment with high-memory Node.js tasks. Have you guys found a solution?
I have the same output in journalctl. My JNLP slave for Jenkins dies spontaneously when many builds are created and aborted in a short period of time.
And I cannot reproduce this if I disable the network-related config in docker-compose.yml.