rook: OSDs crashlooping after being OOMKilled: bind unable to bind

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Updated Ceph from v14.2.2-20190722 to v14.2.4-20190917, which seems to have changed memory management behavior; nodes started getting system OOMKills, followed by OSDs crashlooping.

2019-09-26 01:20:10.118 7f70aa104dc0 -1 Falling back to public interface
2019-09-26 01:20:10.128 7f70aa104dc0 -1  Processor -- bind unable to bind to v2:10.244.15.18:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-09-26 01:20:10.128 7f70aa104dc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
2019-09-26 01:20:15.137 7f70aa104dc0 -1  Processor -- bind unable to bind to v2:10.244.15.18:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-09-26 01:20:15.137 7f70aa104dc0 -1  Processor -- bind was unable to bind. Trying again in 5 seconds
2019-09-26 01:20:20.144 7f70aa104dc0 -1  Processor -- bind unable to bind to v2:10.244.15.18:7300/0 on any port in range 6800-7300: (99) Cannot assign requested address
2019-09-26 01:20:20.144 7f70aa104dc0 -1  Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
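The bind failures suggest the OSD is still trying to claim 10.244.15.18, the pod IP it had before the OOM kill, while the restarted pod has a different address. A rough way to confirm this (a sketch; the rook-ceph namespace and the app=rook-ceph-osd label are the Rook defaults, adjust if customized):

# Current IPs of the OSD pods (compare against the address in the bind error):
kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide

# Check whether the OSD deployments still carry the pre-restart address:
kubectl -n rook-ceph get deploy -l app=rook-ceph-osd -o yaml | grep -n '10\.244\.15\.18'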

How to reproduce it (minimal and precise): Get an OSD OOMKilled by the system.
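Since memory pressure is the trigger, one mitigation worth trying (a sketch, not a confirmed fix; the rook-ceph namespace and cluster name are the Rook defaults, and the sizes are illustrative) is to give the OSD containers an explicit memory request/limit, keeping in mind that a Nautilus OSD aims for roughly osd_memory_target (4 GiB by default) plus some overhead:

# Cap OSD memory so the kernel OOM killer is less likely to fire;
# the limit should sit comfortably above osd_memory_target:
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"osd":{"requests":{"memory":"4Gi"},"limits":{"memory":"6Gi"}}}}}'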

  • Rook version (use rook version inside of a Rook Pod): 1.1.1
  • Storage backend version (e.g. for ceph do ceph -v): v14.2.4-20190917
  • Kubernetes version (use kubectl version): 1.15.1

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 54 (54 by maintainers)

Most upvoted comments

This shouldn’t be a concern anymore; OSDs now do their normal IP detection based on the NIC available inside the container. We no longer force any IP.
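To verify this on an upgraded cluster, a quick check (a sketch; label and namespace are again the Rook defaults) is that the OSD deployments no longer embed a pod IP in their container args:

# Print each OSD deployment's name and container args; no hard-coded
# pod IP should appear once the operator regenerates the deployments:
kubectl -n rook-ceph get deploy -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].args}{"\n"}{end}'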

Can I get this fix soon, please?

The fix is present in v1.1.2: https://github.com/rook/rook/releases/tag/v1.1.2