solana: solana-validator leaks memory (but at a very slow pace)

Problem

solana-validator (tested on v1.4.19) definitely leaks memory, requiring a periodic restart about once a week or so.

The pace seems consistent across nodes, at a rate of 1-2 GB/day.

Proposed Solution

Debug.

We don’t know whether this existed on the v1.3 line as well, but the leak is observed on both RPC and non-RPC nodes. In all cases, the growth shows up in RssAnon. This excludes AppendVec (mmap) as a culprit, since it is accounted under RssFile.

So the remaining culprits are: gossip, blockstore, runtime, rocksdb, etc.
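As a concrete way to watch the RssAnon vs RssFile split described above, here is a minimal sketch (assuming the Linux /proc/self/status layout; this is illustrative, not part of the validator code) that reads both counters:

```rust
use std::fs;

/// Returns (RssAnon, RssFile) in kB as reported by /proc/self/status.
/// Assumes a Linux kernel recent enough to expose both fields.
fn rss_split() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let field = |name: &str| {
        status
            .lines()
            .find(|line| line.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse::<u64>()
            .ok()
    };
    Some((field("RssAnon:")?, field("RssFile:")?))
}

fn main() {
    if let Some((anon, file)) = rss_split() {
        // Heap and anonymous-mmap growth shows up in RssAnon;
        // mmap'ed files such as AppendVecs are accounted under RssFile.
        println!("RssAnon: {} kB, RssFile: {} kB", anon, file);
    }
}
```

Sampling these two counters per node over a few days is what separates an AppendVec/mmap explanation from an ordinary heap leak.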

For runtime and blockstore, I think we can just run a long ledger-tool verify session.

CC: @carllin

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 46 (46 by maintainers)


Most upvoted comments

We do seemingly have another memory leak/growth (master & v1.14) that is in the early stages of investigation at the moment. That being said, I’m in favor of closing this issue due to its age. The releases and the code are so different now that I think a new investigation would be worth a new issue (currently being discussed in Discord).

We could always reference this issue as “prior work” in a new issue.

I’ve had a node running v1.9.9 against mainnet-beta for a couple of weeks; it showed a 4-5 GB/day ramp shortly after starting, but memory has looked pretty stable for the last 2+ weeks: [memory usage graph]

I’ll add details later.

Hi @ryoqun - are you still actively looking into this / do you have any updates? There was some chatter last week on Discord about the memleaks that seem to be present in both v1.9 and master, and I wanted to avoid any duplicate work.

Hi, I haven’t been actively investigating the possible memory leak bug which I thought I had found; it seems it was a false alarm… Also, I was testing the v1.8.x line.

Hi @behzadnouri, sorry for bothering you, but it looks like the unprefixed_malloc_on_supported_platforms feature should be enabled in tikv-jemalloc-sys. Without it, jemalloc is used only for Rust code but not for the bundled C/C++ libraries (like rocksdb). This seems wrong.
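For context, a rough sketch of how jemalloc is typically wired into a Rust binary and why that feature matters; the Cargo snippet in the comments is illustrative, not the validator’s actual dependency declaration:

```rust
// jemalloc replaces Rust's global allocator, so all Rust-side allocations go
// through it:
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// However, bundled C/C++ libraries such as rocksdb call libc malloc/free
// directly. They only end up in jemalloc if it also exports the unprefixed
// malloc/free symbols, which is what the feature below enables, e.g.:
//
// [dependencies]
// tikv-jemalloc-sys = { version = "...", features = ["unprefixed_malloc_on_supported_platforms"] }

fn main() {
    // This allocation is served by jemalloc via the global allocator.
    let v: Vec<u8> = vec![0; 1 << 20];
    println!("allocated {} bytes through jemalloc", v.len());
}
```

Without the unprefixed symbols, anything rocksdb allocates bypasses jemalloc entirely, so leaks in the C/C++ parts are effectively invisible to jemalloc-based statistics and profiling.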

Memory leak reports on v1.7: https://discord.com/channels/428295358100013066/439194979856809985/877222446061727764 https://discord.com/channels/428295358100013066/689412830075551748/877190383275212900


https://gist.githubusercontent.com/behzadnouri/6acaae1c9664f0a3445827d25f28305a/raw/9687afbf2a49591de4201984bd338f807eb2aff5/heaptrack-validator-2021-08-18-v1.7-34107 ☝️ heaptrack on a testnet validator running v1.7 with some recent cluster-slots patches (which I am backporting to v1.7 now).

  • Running with heaptrack, the validator could not catch up with the cluster, so some numbers (e.g. cluster-slots & repair) have probably become misleading.
  • The node was not running cuda, as I could not get cuda and heaptrack working together. So if the cuda code is causing the leak, it would not show up there. Though some reports indicate that the leak is not from cuda: https://discord.com/channels/428295358100013066/689412830075551748/877995739454767104
  • cluster-slots still shows high memory consumption. This is at least partly due to the node falling behind. Inspecting cluster-slots data on testnet, I still see some wrong epoch-slots, but the total size of the hash-map (including inner entries) seems to hover around 200k for the most part, so it probably does not keep growing.
  • Stake was not yet activated, and that could also be a source of bias.

@behzadnouri I owe you one for passing this to you while it’s still very half-baked, but could you take a look at it?

@ryoqun sure, I will look into cluster_slots.rs. I think the code is new but the logic is the same as before. It definitely needs some more digging. Thanks.

@ryoqun, hmmm, weird - is the node caught up with the cluster? I can only imagine seeing those far-future slots if:

  1. Node is very far behind and others are completing slots in the future
  2. Pollution in gossip from another network
  3. Malicious spam
  4. Flate2 errors

I think we can distinguish between the above by seeing how many nodes in ClusterSlots have completed a slot > root + 10,000. If it’s only a few, it might be some pollution; if it’s a lot AND we’re sure we’re near the tip, then probably something is wrong with the compression/decompression path.
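A rough illustration of that check; the ClusterSlots layout below (slot → node → stake) is a simplified stand-in, not the actual types in core/src/cluster_slots.rs:

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

// Simplified stand-ins for the real cluster-slots types.
type Pubkey = [u8; 32];
type Slot = u64;
type SlotPubkeys = HashMap<Pubkey, u64 /* stake */>;
type ClusterSlotsMap = BTreeMap<Slot, SlotPubkeys>;

/// Counts distinct nodes claiming to have completed a slot more than
/// `window` slots past our root. A handful suggests gossip pollution or
/// spam; many nodes (while we are sure we are near the tip) points at the
/// compression/decompression path for epoch slots.
fn far_future_claimants(cluster_slots: &ClusterSlotsMap, root: Slot, window: Slot) -> usize {
    let mut nodes = HashSet::new();
    for (_slot, slot_pubkeys) in cluster_slots.range(root + window + 1..) {
        nodes.extend(slot_pubkeys.keys().copied());
    }
    nodes.len()
}
```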

For context, when thinking about whether we can apply a blanket filter like *slot > root + 10000, the two primary places where ClusterSlots is used are:

  1. Propagation status for your own leader slots: https://github.com/solana-labs/solana/blob/master/core/src/replay_stage.rs#L1545-L1549. Here it’s fine to ignore far-future slots, since you only care about your own leader slots and slots built on top of your leader slot, which should be within a reasonable range of your current root.

  2. For weighting repairs, to find validators who have actually completed that slot: https://github.com/solana-labs/solana/blob/master/core/src/cluster_slots.rs#L110. This currently magnifies the weight of nodes that have completed the slot by a factor of 2. I imagine this might be useful in catch-up scenarios where validators are trying to repair slots that are far in the future, for instance if a node is > 10,000 slots behind. To get around this, we may be able to leverage information based on votes in the cluster about which future slots are actually relevant. This is already done here: https://github.com/solana-labs/solana/blob/master/core/src/repair_weight.rs#L142-L148 to find the best orphans to repair. We could do something like ignore *slot > root + 10000 && slot > best_orphan (see the sketch below).
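A hedged sketch of what such a filter could look like; keep_slot, best_orphan, and the 10,000-slot window are placeholders from this discussion, not the actual repair_weight/cluster_slots API:

```rust
type Slot = u64;

// Illustrative threshold taken from the discussion above.
const CLUSTER_SLOTS_WINDOW: Slot = 10_000;

/// Decide whether an epoch-slots entry should be kept in ClusterSlots.
/// `best_orphan` would come from repair_weight's orphan selection
/// (core/src/repair_weight.rs); here it is just a parameter.
fn keep_slot(slot: Slot, root: Slot, best_orphan: Option<Slot>) -> bool {
    let within_window = slot <= root + CLUSTER_SLOTS_WINDOW;
    // A far-future slot is kept only while repair still cares about it,
    // i.e. it does not exceed the best orphan we want to repair.
    let relevant_for_repair = best_orphan.map_or(false, |orphan| slot <= orphan);
    within_window || relevant_for_repair
}
```

Entries would then be dropped with something like cluster_slots.retain(|slot, _| keep_slot(*slot, root, best_orphan)), which matches the "*slot > root + 10000 && slot > best_orphan" condition above.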

@behzadnouri could this PR (#14467) be a fix for one of the above suspected memory leaks, especially something like this: #14366 (comment)? Or is it a completely different one? (I’m not seeing crds in my backtraces.) Also, did you find #14467 via metrics?

I doubt that is the case. You mention:

requiring a periodic restart about once a week or so.

but the issue with https://github.com/solana-labs/solana/pull/14467 does not go away with a restart. If you restart a node, it quickly syncs back up to the table it previously had in memory. Also, as you mentioned, those stack traces do not show the relevant crds or ClusterInfo frames either. There is one with ClusterInfo_handle_pull_requests, which is not good, but that seems unrelated to the crds table size issue.

did you find #14467 via metrics?

yes, it is cluster_info_stats.table_size.

[graph: cluster_info_stats.table_size]

Hmm, does HashMap need a periodic shrink_to_fit? I doubt it.

@carllin does that ring a bell? If heaptrack is right, it says we’re either keeping references to these heap-allocated objects somehow, or the HashMap has too many elements, or the HashMap doesn’t shrink its capacity after a moderate .retain:

I think that might be right, if we are not OK with the excess capacity. When I looked, I don’t think either Vec or HashMap shrinks its capacity on .resize, .retain, .remove, etc. (small demonstration below).
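A self-contained demonstration of that behaviour on std’s HashMap (this is just standard-library behaviour, not validator code):

```rust
use std::collections::HashMap;

fn main() {
    // Fill a map with ~1M entries, then retain only a small fraction.
    let mut map: HashMap<u64, [u8; 32]> = (0..1_000_000).map(|k| (k, [0u8; 32])).collect();
    let before = map.capacity();

    map.retain(|k, _| *k % 1_000 == 0); // keeps ~1k entries

    // Capacity is unchanged: retain/remove never give back the backing storage.
    println!("len = {}, capacity = {} (was {})", map.len(), map.capacity(), before);

    // Only an explicit shrink releases the excess capacity to the allocator.
    map.shrink_to_fit();
    println!("after shrink_to_fit: capacity = {}", map.capacity());
}
```

So if a long-lived map such as the crds table or ClusterSlots briefly balloons, that excess capacity sticks around until something calls shrink_to_fit or the map is rebuilt.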

Not that slow on tds with 1.4.20:

[memory usage graph]

Anon pages as well:

[RssAnon graph]

(accounts not on a tmpfs)