reth: Reth killed by OOM killer w/ 64GB RAM

Describe the bug

reth syncs and tracks the chain head with no issues for days, but every week or two it gets killed by the Linux OOM killer. This happens during normal operation, not during sync, so I’ve opened a new issue since the other OOM issues look to be sync-related. The machine is running Ubuntu 22.04 and has 64 GB of RAM. There are several other processes using a decent amount of memory, but there should be plenty left over for reth. You can see the detailed breakdown of per-process memory usage in the attached dmesg.txt.

The machine is running reth, lighthouse, an Arbitrum node that uses reth, and an RPC client program that primarily calls eth_call and eth_subscribe on reth. It’s not a particularly heavy RPC load, but there may be some bursts of activity. I don’t see signs of anything crazy happening right before the OOM in any of the client logs.
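For context, the RPC load from that client is roughly the shape of the sketch below. This is not the actual client code; it assumes the default HTTP endpoint on localhost:8545, and the address and calldata are placeholders.

    // Rough sketch of the eth_call bursts the client sends (not the real client code).
    // Assumed Cargo deps: reqwest = { version = "0.11", features = ["blocking", "json"] },
    // serde_json = "1".
    use serde_json::json;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();
        // Fire a small burst of eth_call requests against reth's HTTP RPC endpoint.
        for i in 0..10 {
            let body = json!({
                "jsonrpc": "2.0",
                "id": i,
                "method": "eth_call",
                "params": [{
                    // Placeholder address and calldata, not the real workload.
                    "to": "0x0000000000000000000000000000000000000000",
                    "data": "0x"
                }, "latest"]
            });
            let resp: serde_json::Value = client
                .post("http://localhost:8545")
                .json(&body)
                .send()?
                .json()?;
            println!("{resp}");
        }
        Ok(())
    }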

I’ve had this happen twice now. You can see the kernel logs for both OOM events in the attached logs. In the most recent one, reth was using 39612664 kB of anon-rss at the time it was killed. I’ve also pasted a screenshot from Grafana of the memory stats before the latest OOM. Eyeballing the rest of the Grafana stats, I don’t see anything concerning in that time period. There are 10 in-flight requests and 14 read-only transactions open right before the crash, but that’s not unusual compared to the days prior, where no issues were observed. I’d be happy to send more data from Grafana if you have a nice way for me to export it. The reth log before the crash is attached as well, but nothing in it caught my eye.
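If it would help, I could also pull the raw series behind the dashboard straight from Prometheus with something like the sketch below. It assumes Prometheus on the default localhost:9090, and the metric name and time window are just placeholders for whichever series you want.

    // Hedged sketch: dump a raw time series via Prometheus's query_range API
    // so the numbers behind the Grafana screenshot can be shared as text.
    // Assumed Cargo deps: reqwest = { version = "0.11", features = ["blocking", "json"] },
    // serde_json = "1".
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let resp: serde_json::Value = reqwest::blocking::Client::new()
            .get("http://localhost:9090/api/v1/query_range")
            .query(&[
                ("query", "jemalloc_active_bytes"), // placeholder metric name
                ("start", "2023-11-01T00:00:00Z"),  // placeholder window around the OOM
                ("end", "2023-11-01T06:00:00Z"),
                ("step", "30s"),
            ])
            .send()?
            .json()?;
        println!("{}", serde_json::to_string_pretty(&resp)?);
        Ok(())
    }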

From the Grafana jemalloc stats, there are two quick spikes in memory usage roughly 20 minutes before the crash that look like they might be similar in nature to the crash but didn’t quite trigger the OOM killer. At the crash, RSS goes from 20 GB to 33 GB in one tick of the graph and appears to peak around 40 GB. At the same time, the jemalloc stats show “active”, “allocated”, and “mapped” all jumping from <30 GB to >400 GB.

Attachments: reth.log, dmesg.txt, reth-oom-dashboard

Let me know if there’s anything else I can do to help track this down. Do you have any guesses as to what would be using ~40GB of anon-rss?

Steps to reproduce

  1. Sync reth
  2. Run some load against it
  3. Wait
  4. Boom!

Node logs

attached above

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 0.1.0-alpha.10
Commit SHA: a9fa2818
Build Timestamp: 2023-10-28T03:44:10.194397854Z
Build Features: default,jemalloc
Build Profile: release

What database version are you on?

1

What type of node are you running?

Archive (default)

What prune config do you use, if any?

None

If you’ve built Reth from source, provide the full command you used

cargo install --locked --path bin/reth --bin reth

Code of Conduct

  • I agree to follow the Code of Conduct

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

@mattsse I’ll close this unless there’s some reason to keep it open. I’ve verified that the changes above have greatly decreased memory consumption when tracing.