rippled: rippled v1.3.1 OOM (out of memory) crash on testnet

Hi!

I've been running rippled v1.2.4 on testnet for quite some time without any issues on a 10 GB VM.

But two or three months ago I started to see massive OOMs and huge disk reads.

After updating to rippled v1.3.1 the situation is still the same: there is an OOM crash roughly every hour.

Here is my rippled.cfg for testnet:

[server]
port_rpc_admin_local
port_peer
port_ws_admin_local
port_rpc_external
port_ws_external
#ssl_key = /etc/ssl/private/server.key
#ssl_cert = /etc/ssl/certs/server.crt

#enable signing
[signing_support]
true

[port_rpc_admin_local]
port = 5005
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http

[port_peer]
port = 51235
ip = 0.0.0.0
protocol = peer

[port_ws_admin_local]
port = 6006
ip = 127.0.0.1
admin = 127.0.0.1
protocol = ws

[port_rpc_external]
port = 51234
ip = 192.168.70.86
protocol = http
user = ripplerpc_testnet
password = pass
admin = 192.168.70.2
#admin_user = ripplerpc_testnet
#admin_password = adminpass

[port_ws_external]
port = 51233
ip = 192.168.70.86
protocol = http
user = ripplerpc_testnet
password = pass
admin = 192.168.70.2
#admin_user = ripplerpc_testnet
#admin_password = adminpass
protocol = ws

#-------------------------------------------------------------------------------

[node_size]
tiny

[node_db]
type=RocksDB
path=/home/ripple/.ripple/db_testnet/rocksdb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=64
file_size_mult=2
online_delete=2000
advisory_delete=0

[database_path]
/home/ripple/.ripple/db_testnet

[debug_logfile]
/home/ripple/.ripple/debug_testnet.log

[sntp_servers]
time.windows.com
time.apple.com
time.nist.gov
pool.ntp.org

[ips]
r.altnet.rippletest.net 51235

[validator_list_sites]
https://vl.altnet.rippletest.net

[validator_list_keys]
ED264807102805220DA0F312E71FC2C69E1552C9C5790F6C25E3729DEB573D5860

[rpc_startup]
{ "command": "log_level", "severity": "warning" }

[ssl_verify]
1

The crash log:

[9517417.712334] Out of memory in UB 785: OOM killed process 1136 (rippled: main) score 0 vm:12445792kB, rss:8542132kB, swap:2930984kB
[9523304.685633] Out of memory in UB 785: OOM killed process 1181 (rippled: main) score 0 vm:12600648kB, rss:10099244kB, swap:1374516kB
[9539038.259031] Out of memory in UB 785: OOM killed process 1214 (rippled: main) score 0 vm:12619392kB, rss:9740884kB, swap:1731872kB
[9547999.656154] Out of memory in UB 785: OOM killed process 1292 (rippled: main) score 0 vm:12631060kB, rss:8109076kB, swap:3363836kB
[9548834.788951] Out of memory in UB 785: OOM killed process 1374 (rippled: main) score 0 vm:12490436kB, rss:9968068kB, swap:1505032kB
[9554955.520002] Out of memory in UB 785: OOM killed process 1400 (rippled: main) score 0 vm:12546684kB, rss:10385836kB, swap:1079028kB
[9568000.591282] Out of memory in UB 785: OOM killed process 1433 (rippled: main) score 0 vm:12535856kB, rss:9847404kB, swap:1617708kB
[9586333.723690] Out of memory in UB 785: OOM killed process 1471 (rippled: main) score 0 vm:12518188kB, rss:8828636kB, swap:2628092kB
[9594660.762773] Out of memory in UB 785: OOM killed process 1551 (rippled: main) score 0 vm:12523148kB, rss:9851188kB, swap:1605008kB
[9600658.329453] Out of memory in UB 785: OOM killed process 1581 (rippled: main) score 0 vm:12575460kB, rss:9754816kB, swap:1702112kB
[9606191.889044] Out of memory in UB 785: OOM killed process 1614 (rippled: main) score 0 vm:12491040kB, rss:10405776kB, swap:1050684kB
[9618393.215965] Out of memory in UB 785: OOM killed process 1644 (rippled: main) score 0 vm:12501732kB, rss:8257096kB, swap:3199896kB
[9642889.916292] Out of memory in UB 785: OOM killed process 1682 (rippled: main) score 0 vm:12531844kB, rss:8513004kB, swap:2942972kB
[9649820.280216] Out of memory in UB 785: OOM killed process 1817 (rippled: main) score 0 vm:12605872kB, rss:9383708kB, swap:2072036kB
[9655551.374678] Out of memory in UB 785: OOM killed process 1851 (rippled: main) score 0 vm:12553380kB, rss:8081320kB, swap:3375464kB

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 19 (4 by maintainers)

Most upvoted comments

Hello everyone, I found the reason for my issue; it may be the same reason for yours, @gituser.

glibc holds on to memory instead of returning it to the OS after destructors are called. You can call malloc_trim(0) to release the memory glibc has not freed; see https://stackoverflow.com/questions/15529643/what-does-malloc-trim0-really-mean
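
A minimal sketch of that workaround in C++, assuming you can add a small maintenance hook to your own build: the startHeapTrimmer name and the 5-minute interval are hypothetical, and malloc_trim is glibc-specific (declared in <malloc.h>):

#include <malloc.h>   // glibc-specific: declares malloc_trim
#include <chrono>
#include <thread>

// Hypothetical helper: a detached background thread that periodically asks
// glibc to return freed-but-retained heap pages to the OS. This does not fix
// a genuine leak; it only releases memory the allocator keeps after free().
void startHeapTrimmer()
{
    std::thread([] {
        for (;;)
        {
            std::this_thread::sleep_for(std::chrono::minutes(5));
            malloc_trim(0);
        }
    }).detach();
}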

I have a similar problem when using rippled:

There is a project using rippled 0.80.2 that has been running for about a year, with 180 GB of data (SQLite + RocksDB). Environment: AWS c5d.2xlarge, 8 CPUs, 16 GB RAM, NVMe SSD. The SLE counts are below:

{
   "account" : 2524069,
   "amendments" : 1,
   "directory" : 2518953,
   "escrow" : 0,
   "fee" : 0,
   "hashes" : 136,
   "offer" : 1,
   "payment_channel" : 0,
   "signer_list" : 0,
   "state" : 4738656,
   "table" : 0,
   "ticket" : 0
}

I am puzzled about the memory use of the rippled process. The key point is that the tree node tracking size (treenode_track_size in the output below) returned by get_counts dropped from above 12,000,000 to below 2,000,000 while the other counters did not grow, yet the process memory did not go down at all (always about 10 GB). So I suspect there is a memory leak in rippled.

The JSON returned by get_counts the first time:

{
   "id" : 1,
   "result" : {
      "AL_hit_rate" : 15.70247936248779,
      "HashRouterEntry" : 1479,
      "InboundLedger" : 99,
      "Ledger" : 109,
      "NodeObject" : 17812,
      "RCLCxPeerPos::Data" : 40,
      "SLE_hit_rate" : 0.3076923076923077,
      "STArray" : 128,
      "STLedgerEntry" : 29,
      "STObject" : 1332,
      "STTx" : 244,
      "STValidation" : 392,
      "Transaction" : 173,
      "dbKBLedger" : 4204,
      "dbKBTotal" : 9604,
      "dbKBTransaction" : 4204,
      "fullbelow_size" : 2559678,
      "historical_perminute" : 22,
      "ledger_hit_rate" : 11.01928329467773,
      "node_hit_rate" : 28.22913932800293,
      "node_read_bytes" : 4163320653,
      "node_reads_hit" : 14754404,
      "node_reads_total" : 15643782,
      "node_writes" : 741054,
      "node_written_bytes" : 289595474,
      "status" : "success",
      "treenode_cache_size" : 12238,
      "treenode_track_size" : 12009843,
      "uptime" : "5 minutes, 5 seconds",
      "write_load" : 0
   }
}

The second time:

{
   "id" : 1,
   "result" : {
      "AL_hit_rate" : 30.69767570495605,
      "HashRouterEntry" : 1509,
      "Ledger" : 14,
      "NodeObject" : 8539,
      "RCLCxPeerPos::Data" : 40,
      "SLE_hit_rate" : 0.5398058252427185,
      "STArray" : 102,
      "STLedgerEntry" : 141,
      "STObject" : 1301,
      "STTx" : 297,
      "STValidation" : 584,
      "Transaction" : 274,
      "dbKBLedger" : 4204,
      "dbKBTotal" : 9604,
      "dbKBTransaction" : 4204,
      "fullbelow_size" : 1374097,
      "historical_perminute" : 0,
      "ledger_hit_rate" : 15.36214923858643,
      "node_hit_rate" : 28.27434730529785,
      "node_read_bytes" : 4203252193,
      "node_reads_hit" : 14893786,
      "node_reads_total" : 15783684,
      "node_writes" : 743631,
      "node_written_bytes" : 291259310,
      "status" : "success",
      "treenode_cache_size" : 163,
      "treenode_track_size" : 1625090,
      "uptime" : "7 minutes, 38 seconds",
      "write_load" : 0
   }
}

It could very well be that your node doesn’t manage to write its files fast enough and fills its caches until it runs out of memory. NuDB caches much less (and also does not rewrite/compact data into different levels), so switching to it is one way you could try to find out what is happening; see the sketch below.
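
For reference, a rough sketch of how the [node_db] stanza from the config above might look with NuDB instead of RocksDB. The nudb subdirectory name is just an assumption, and the RocksDB-specific tuning keys (open_files, filter_bits, cache_mb, file_size_mb, file_size_mult) would no longer apply:

[node_db]
type=NuDB
path=/home/ripple/.ripple/db_testnet/nudb
online_delete=2000
advisory_delete=0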

Other cryptocurrency clients have their issues too (I ran into some Parity-based servers that were hilariously bad at dealing with RocksDB), but many don’t carry as much state or read/write load as the XRPL, so they can cache more or take their time doing a compaction since the next block is still minutes away.

I’m not saying there can’t be an issue within rippled; it’s just that, in my experience, a resource-starved server (too little network I/O, disk I/O, CPU or RAM) tends to spiral out of control. I had other issues with a spotty, slow internet connection, which led to rippled constantly re-syncing from scratch, which in turn generated even more traffic. It may well be that your node tries to write to the database but can’t, because it is still compacting or writing slowly, so it caches and caches and caches…

Running rippled without a fast SSD is a futile task anyway if you want any kind of historical data. Why do you need exactly 2000 ledgers? Maybe you can increase or decrease that number to see if it helps. You could also start logging below the “warning” level and monitor the machine so you have a bit more insight into what’s happening, as sketched below.
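
For example, assuming the admin ports from the config at the top of this issue, the [rpc_startup] stanza could request a more verbose log level at startup (the "info" severity here is just a suggestion):

[rpc_startup]
{ "command": "log_level", "severity": "info" }

and the cache counters can be watched over time by POSTing a get_counts request to the admin RPC port (127.0.0.1:5005 in that config):

{ "method" : "get_counts", "params" : [ { } ] }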