iavl: IAVL panic with missing value
A user is reporting the following panics with the SDK (0.38.4) and options pruning=everything:
I[2020-06-08|14:01:38.646] starting ABCI with Tendermint module=main
panic: Value missing for hash 190e17e5722f481b33536af9525e21abd9e9a02e8e4fed46e3921010899ab65c corresponding to nodeKey n�r/HSj�R^!���.�O�F����\
goroutine 1 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc000df6580, 0xc002a77f60, 0x20, 0x20, 0x0)
/home/ubuntu/go/pkg/mod/github.com/tendermint/iavl@v0.13.2/nodedb.go:109 +0x7bf
This does not happen when pruning is disabled.
We’ve seen similar panics related to concurrency, see e.g. #240.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- document byte slice immutability (#262) See #261. — committed to cosmos/iavl by erikgrinaker 4 years ago
This turned out to be caused by a subtle flaw in the test case - consider this snippet:
Notice that the inner
random.Read(key)call reads into the same byte-slice as originally used. TheSet()call will have placed this byte-slice into the IAVL node, and I think it’s somehow made its way into the orphans list. So when we change the key, this changes the nodes in the tree (giving them the wrong node position for their key), and I think somehow also changes which orphan nodes are scheduled for removal, causing the node to be missing in the next version.Changing the loop as follows makes the panic go away:
On one hand I’m relieved that this was a problem with the test rather than IAVL, but on the other hand I’m back to square one for debugging this crash. I would not be surprised if the user crash is caused by a similar problem in the SDK, since this is a very easy mistake to make, and I don’t believe this is even documented for IAVL (although it is for tm-db).
As a side-note,
MemDBin tm-db 0.5.0 also has this behavior (given and returned keys and values are pointers to the actual database items, for performance). In MemDB 0.4.0, as an implementation detail, the key was copied due to a string-cast for use as a map key, while the value was stored as a pointer. It is quite possible that this change has triggered similar bugs, although MemDB should not be involved at all withpruning=everythingfor GoLevelDB-based IAVL trees.I’m considering further approaches to debugging this.
I think there are multiple related problems here (including unclean shutdowns, as you point out).
However, even after eliminating all other error sources, there is an underlying problem with orphans in IAVL - it only occurs when replacing a key, not when writing new keys. I’m going to focus on that for now, then we can work our way up and fix any related problems.
I don’t feel like the application should have to do anything special for this to be safe. We may want to add a
Close()method to the tree, which the application should call before exiting, but IAVL should also do whatever it can to prevent corruption in a default configuration.I’ve found the problem, and written a minimal test case: https://github.com/cosmos/iavl/blob/60acfb1ff36c057882d418b0bbe04aa8d759edfc/mutable_tree_test.go#L336
It boils down to orphan entries being incorrectly persisted to disk for changes that are only done in memory, and then reloading the tree and making different changes causes these orphan entries to remove nodes that are now in use when the versions the orphans were visible in are deleted. Consider this history, with keepEvery=2 keepRecent=1:
This usually isn’t a problem, because the replayed in-memory versions will be the same as before (since it’s the same blocks). However, if the pruning settings have been changed, then these orphans can get removed before the change that orphaned them is made, causing this crash. There may be other corner-cases as well.
Will begin work on a fix.
I’ve been able to reproduce this again, have written a test-case here:
https://github.com/cosmos/iavl/blob/d554b885042a47ea93d6b998f8b869bab37dfdff/mutable_tree_test.go#L270
The behavior is basically this:
Use
KeepEvery: 5andKeepRecent: 1. When we save a version to disk (every 5), delete the previous saved version (so after 10 delete 5). This follows SDK pruning behavior. Stop producing versions after version 12.Open a new tree, loading version 10, and write 12 new versions with the same key/value pairs as from the previous run. After version 15 is saved and version 10 is deleted, IAVL panics while generating version 16:
I don’t think there are any flaws in the test case this time, but would appreciate a second pair of eyes @alexanderbez.
I tested this with the original 0.13.0 release as well (before the tm-db changes), and the panic is present there as well.
You mean this is the expected buggy behavior, not the desired behavior, right? With
pruning=everything, I would expect the application to retain the latest version on disk, and start from that.