iavl: IAVL panic with missing value

A user is reporting the following panics with the SDK (0.38.4) and options pruning=everything:

I[2020-06-08|14:01:38.646] starting ABCI with Tendermint                module=main 
panic: Value missing for hash 190e17e5722f481b33536af9525e21abd9e9a02e8e4fed46e3921010899ab65c corresponding to nodeKey n�r/HSj�R^!���.�O�F����\
goroutine 1 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc000df6580, 0xc002a77f60, 0x20, 0x20, 0x0)
        /home/ubuntu/go/pkg/mod/github.com/tendermint/iavl@v0.13.2/nodedb.go:109 +0x7bf

This does not happen when pruning is disabled.

We’ve seen similar panics related to concurrency, see e.g. #240.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Commits related to this issue

Most upvoted comments

This turned out to be caused by a subtle flaw in the test case - consider this snippet:

key := make([]byte, 16)
value := make([]byte, 16)
random.Read(key)
random.Read(value)
// if we hit an existing key, regenerate and set a new key
for tree.Set(key, value) {
    random.Read(key)
}

Notice that the inner random.Read(key) call reads into the same byte-slice as originally used. The Set() call will have placed this byte-slice into the IAVL node, and I think it’s somehow made its way into the orphans list. So when we change the key, this changes the nodes in the tree (giving them the wrong node position for their key), and I think somehow also changes which orphan nodes are scheduled for removal, causing the node to be missing in the next version.

Changing the loop as follows makes the panic go away:

for tree.Set(key, value) {
    key = make([]byte, 16)
    random.Read(key)
}

On one hand I’m relieved that this was a problem with the test rather than IAVL, but on the other hand I’m back to square one for debugging this crash. I would not be surprised if the user crash is caused by a similar problem in the SDK, since this is a very easy mistake to make, and I don’t believe this is even documented for IAVL (although it is for tm-db).

As a side-note, MemDB in tm-db 0.5.0 also has this behavior (given and returned keys and values are pointers to the actual database items, for performance). In MemDB 0.4.0, as an implementation detail, the key was copied due to a string-cast for use as a map key, while the value was stored as a pointer. It is quite possible that this change has triggered similar bugs, although MemDB should not be involved at all with pruning=everything for GoLevelDB-based IAVL trees.

I’m considering further approaches to debugging this.

I think there are multiple related problems here (including unclean shutdowns, as you point out).

However, even after eliminating all other error sources, there is an underlying problem with orphans in IAVL - it only occurs when replacing a key, not when writing new keys. I’m going to focus on that for now, then we can work our way up and fix any related problems.

I don’t feel like the application should have to do anything special for this to be safe. We may want to add a Close() method to the tree, which the application should call before exiting, but IAVL should also do whatever it can to prevent corruption in a default configuration.

I’ve found the problem, and written a minimal test case: https://github.com/cosmos/iavl/blob/60acfb1ff36c057882d418b0bbe04aa8d759edfc/mutable_tree_test.go#L336

It boils down to orphan entries being incorrectly persisted to disk for changes that are only done in memory, and then reloading the tree and making different changes causes these orphan entries to remove nodes that are now in use when the versions the orphans were visible in are deleted. Consider this history, with keepEvery=2 keepRecent=1:

  • Save version 1 (memory): set a=1 b=2
  • Save version 2 (disk): set c=3 d=4
  • Save version 3 (memory): set d=0 → orphan d=4 visible in versions <= 2 (persisted to disk)
  • Restart, load version 2 from disk
  • Save version 3 (memory): set a=0
  • Save version 4 (disk): set b=0
  • Delete version 2 (disk) → orphan for d=4 visible for version <= 2, no longer needed so remove node d=4
  • Load d=4 → panic, because the node no longer exists

This usually isn’t a problem, because the replayed in-memory versions will be the same as before (since it’s the same blocks). However, if the pruning settings have been changed, then these orphans can get removed before the change that orphaned them is made, causing this crash. There may be other corner-cases as well.

Will begin work on a fix.

I’ve been able to reproduce this again, have written a test-case here:

https://github.com/cosmos/iavl/blob/d554b885042a47ea93d6b998f8b869bab37dfdff/mutable_tree_test.go#L270

The behavior is basically this:

Use KeepEvery: 5 and KeepRecent: 1. When we save a version to disk (every 5), delete the previous saved version (so after 10 delete 5). This follows SDK pruning behavior. Stop producing versions after version 12.

Open a new tree, loading version 10, and write 12 new versions with the same key/value pairs as from the previous run. After version 15 is saved and version 10 is deleted, IAVL panics while generating version 16:

panic: Value missing for hash e61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b corresponding to nodeKey 6ee61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b [recovered]
	panic: Value missing for hash e61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b corresponding to nodeKey 6ee61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b

I don’t think there are any flaws in the test case this time, but would appreciate a second pair of eyes @alexanderbez.

I tested this with the original 0.13.0 release as well (before the tm-db changes), and the panic is present there as well.

The expected behavior here is that if pruning is `everything", the ABCI application would replay ever block from genesis against an empty IAVL tree to restore state to the current block height and it would never look for a snapshot.

You mean this is the expected buggy behavior, not the desired behavior, right? With pruning=everything, I would expect the application to retain the latest version on disk, and start from that.