cosmos-sdk: bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different)
Summary of Bug
TL;DR: capabilitytypes.FwdCapabilityKey produces different results on darwin-arm64
A binary compiled for the Apple M1 chip (darwin-arm64) is not compatible with other architectures, causing consensus failure. It happens after a new IBC channel is created (I’m not sure whether it’s ibc.core.client.v1.MsgUpdateClient or ibc.core.channel.v1.MsgChannelOpenTry that causes it).
Failed to process committed block (1005:4FB45CC400ADE7B6F640EBE90C51C5EB58C0161D65C9F0B56771914C2F2E6A7D): wrong Block.Header.LastResultsHash. Expected 520376AE1EC01E88011892D398C88ADA48EB941F6DC975DCA45125C4398A8D44, got E30709AEEC492763438DA9DB0F64E9C69EFDAF237AE10778FFA4C03009347904
This issue seems to affect only darwin-arm64 and not linux-arm64.
Version
v0.45.2
Steps to Reproduce
- Start chain on a Linux machine (arm64 or amd64, doesn’t matter)
- Create an IBC channel
- Join with a Mac; the node will halt on the block in which the IBC channel was created
Debugging
I tracked this issue down to a mismatch in gas used, caused by a difference in key encoding.
The operations below show the gas consumed and the data each operation was working with.
My macbook: gas_used:177437
Server: gas_used:177371
Diff: 66
Mac
gas consumed, operation, key/value
1000 ReadFlat ibc/fwd/0x14000ffda48
63 ReadPerByte ibc/fwd/0x14000ffda48 <- +3
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x140084adec8
630 WritePerByte ibc/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x140084adec8
78 ReadPerByte transfer/fwd/0x140084adec8 <- +3
0 ReadPerByte
...
2000 WriteFlat transfer/fwd/0x140084adec8
780 WritePerByte transfer/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
Linux server
1000 ReadFlat ibc/fwd/0x40010081a8
60 ReadPerByte ibc/fwd/0x40010081a8
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x4005b534a8
600 WritePerByte ibc/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x4005b534a8
75 ReadPerByte transfer/fwd/0x4005b534a8
0 ReadPerByte
...
2000 WriteFlat transfer/fwd/0x4005b534a8
750 WritePerByte transfer/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
It’s clear that the issue is caused by the key containing one extra leading hex digit: darwin-arm64 heap addresses start at 0x14..., so the formatted pointer is one byte longer than on linux-arm64, and every per-byte gas charge on that key is correspondingly higher.
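The 66-gas discrepancy is fully explained by that single extra byte. Using the per-byte costs visible in the traces above (63 vs 60 for a read implies 3 gas/byte; 630 vs 600 for a write implies 30 gas/byte, matching the SDK's default KVStore gas config), a quick sanity check:

```go
package main

import "fmt"

func main() {
	// Per-byte gas costs, derived from the traces above (and matching
	// the SDK's default KVGasConfig values).
	const readPerByte, writePerByte = 3, 30

	// The darwin-arm64 forward key is one byte longer than the
	// linux-arm64 one. In the traces it is read twice and written
	// twice (once each under "ibc" and "transfer").
	extraBytes := 1
	diff := 2*readPerByte*extraBytes + 2*writePerByte*extraBytes
	fmt.Println(diff) // 66, matching gas_used 177437 - 177371
}
```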
To quickly replicate:

```go
package main

import (
	"fmt"

	capabilitytypes "github.com/cosmos/cosmos-sdk/x/capability/types"
)

func main() {
	// FwdCapabilityKey embeds the capability's in-memory address in the
	// key, so the output differs per platform (and per run).
	cap := &capabilitytypes.Capability{}
	fmt.Println(string(capabilitytypes.FwdCapabilityKey("ibc", cap)))
}
```
Run the code above on linux-arm64 (or any other platform) and then on darwin-arm64:
ibc/fwd/0x4000c68290 <- linux-arm64
ibc/fwd/0x140004a7f00 <- darwin-arm64
I don’t know enough about this to suggest a fix (unless removing any leading characters that exceed the expected length is an acceptable suggestion lol). We should find the Go docs that explain this behavior and act accordingly, but I couldn’t find them.
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned
About this issue
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 34 (33 by maintainers)
Hmm I don’t understand how the forward mapping itself could cause an apphash mismatch.
The forward key is stored in the memstore which is not committed in the final apphash. https://github.com/cosmos/cosmos-sdk/blob/main/store/mem/store.go#L56
The fact that the two machines have different pointers is expected behavior. It would be true even if they were 2 Linux machines.
This is the intention. See from the ADR
@ValarDragon:
The pointers are only being stored in the local node’s state. They are not expected to be the same across nodes. Nor is it expected to be the same between different runs (ie. if the node stops and restarts). In fact, it is intentionally part of the design that different nodes would have different pointers to the capability.
What is expected is that while the node is running, the memory location of the capability does not change from the time it was last initialized by the state machine. This possibly has changed with M1???
The reason it was designed this way, was explicitly to keep the capability outside of the state machine.
The reasoning was if the capability was an object in state, then a malicious module or user could simply copy that capability and pass it in.
Instead, the capability key was the in-memory pointer itself. Each node would generate a capability and store it in memory.
In order to authenticate an action that requires a given capability, a caller must pass in the capability (the exact same pointer), rather than just a copy of the capability. Each node will then check this pointer against their stored pointer value.
The caller can get the exact capability on the given node by requesting it from the capability keeper who will authenticate that the caller is authorized to get the original capability pointer.
Since every node is retrieving in-memory capabilities from its own local store and verifying them against that same store, the authentication is not happening directly in the state machine. Each node independently does its own in-memory capability check and then returns to the state machine whether the authentication passed.
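The pointer-identity scheme described above can be sketched in a few lines. This is not the SDK's actual API, just a hypothetical illustration of why a value-equal copy of a capability fails authentication while the original pointer passes:

```go
package main

import "fmt"

// Capability mimics a capability object; Index alone is not a secret.
type Capability struct{ Index uint64 }

// Keeper holds the node-local mapping from names to the exact
// capability pointers it originally handed out.
type Keeper struct{ granted map[string]*Capability }

// Authenticate passes only for the identical pointer, never for a copy.
func (k Keeper) Authenticate(name string, cap *Capability) bool {
	return k.granted[name] == cap
}

func main() {
	orig := &Capability{Index: 1}
	k := Keeper{granted: map[string]*Capability{"ports/transfer": orig}}

	copyCap := *orig // same Index, different memory location
	fmt.Println(k.Authenticate("ports/transfer", orig))     // true
	fmt.Println(k.Authenticate("ports/transfer", &copyCap)) // false
}
```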
Arguably this is overengineering given the current SDK security model between modules, but it was a requirement put into the ICS specification.
I’d be interested in seeing the exact tx response that occurs during the channel handshake on the Mac. It would be the first handshake msg to be executed on the chain (Either INIT or TRY). You should be able to retrieve it @facundomedica by querying the txhash that got committed in the channel handshake.
My guess is that the Linux machine is authenticating the capability locally correctly. But the Mac is not authenticating the capability correctly.
This would imply to me that the capability got stored in some memory location during initialization and then the Mac changed the memory location while the node was still running. So that when it was later retrieved the new pointer did not match the stored pointer. I’m not sure if that’s possible but it would certainly break capabilities.
Thanks @facundomedica for an excellent breakdown and investigation!
Yes lets get this in!! @facundomedica would you like to open a pr?
I don’t want to yak-shave @ValarDragon. If you think the current threat model is moot and not needed, let’s open a separate issue/discussion for that 🙌
@facundomedica, I like @ValarDragon’s suggestion. We can use a go-native map outside the gas metered memstore to hold the PRNs and thus avoid the gas costs. Can you go with that?
This concrete problem could be solved by an in-memory map that’s outside of what’s metered by stores (a map `usedNonces map[int64]struct{}`). I just think it’s bad form for us to rely on pointers as a source of RNG. This becomes the sort of thing that has edge cases / oddities that haunt us far into the future.
I also question the entire threat model here though. I’d be in favor of just removing this entire pointer logic in the IBC spec. (As I noted in my post – pointers in golang do almost nothing for security over a counter here)
Anyone have context on who was originally in favor of it? Want to make sure we get their perspective / can talk about what they perceive as the benefit.
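The suggested fix can be sketched as follows. All names here are hypothetical; the point is only that a plain Go map outside the gas-metered memstore tracks used nonces deterministically, so key length (and hence pointer width) can never leak into gas accounting:

```go
package main

import "fmt"

// capabilityTracker keeps used nonces in an ordinary Go map, which is
// never touched by the gas meter. (Hypothetical type and method names.)
type capabilityTracker struct {
	usedNonces map[int64]struct{}
}

func newTracker() *capabilityTracker {
	return &capabilityTracker{usedNonces: make(map[int64]struct{})}
}

// claim records a nonce, failing if it was already used.
func (t *capabilityTracker) claim(nonce int64) bool {
	if _, ok := t.usedNonces[nonce]; ok {
		return false
	}
	t.usedNonces[nonce] = struct{}{}
	return true
}

func main() {
	t := newTracker()
	fmt.Println(t.claim(1)) // true: first use
	fmt.Println(t.claim(1)) // false: already used
}
```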
Kindly cc-ing an ARM expert @elias-orijtech
@facundomedica your debugging skills are absolutely unrivaled. Thanks for opening up this issue. Unfortunately, I think there are some cases where we need to use pointers, especially for IBC caps. On the plus side, pretty much all operators run in the cloud or on bare metal with the standard amd64 architecture, which is most likely why we haven’t seen this on mainnet.
I’m not really sure what to do. In the meantime, I’ll do some digging in the golang repo to see if there are any similar issues.
Tagging folks for visibility.
@faddat @marbar3778 @jackzampolin @ValarDragon