celestia-node: share/p2p/shrexeds: Error handling
A lot of new errors are appearing client-side against BSR.
- Application Error times out after 60s
2023-05-09T15:30:49.015+0200 DEBUG share/getters getters/shrex.go:171 eds: request failed {"hash": "FB78B3FEC22B77F98C4A23BB27EB557E202172AD74A018800CB9A09299A99EC8", "peer": "12D3KooWBc9nKfmKJZKqkto9UHvQce1ZZZy3MoNdZzDQpgi2SFkD", "attempt": 1, "err": "failed to read status from stream: Application error 0x0 (remote)", "finished (s)": 57.119983625}
- Failed to read status from stream: deadline exceeded after 60s
- Why does it happen at this point? If rate-limited, the err should be io.EOF. The deadline should be smaller here (see the sketch after the logs below).
2023-05-09T16:14:34.920+0200 DEBUG shrex/eds shrexeds/client.go:57 client: eds request to peer failed {"peer": "12D3KooWFC3v38kFFJYWVNjR92iX962YpSVpAbGsDwYcgVqvTWk8", "hash": "9EB61119363E061DF7525B5DDEC77D13D63B36C4D479DBF29C4A2E3926148165", "error": "failed to read status from stream: deadline exceeded"}
2023-05-09T16:14:34.920+0200 DEBUG share/getters getters/shrex.go:172 eds: request failed {"hash": "9EB61119363E061DF7525B5DDEC77D13D63B36C4D479DBF29C4A2E3926148165", "peer": "12D3KooWFC3v38kFFJYWVNjR92iX962YpSVpAbGsDwYcgVqvTWk8", "attempt": 1, "err": "context deadline exceeded", "finished (s)": 60.000769791}
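A minimal sketch of the shorter-deadline idea, assuming go-libp2p's network.Stream deadline API; readStatusTimeout, readStatus, and the single status byte are illustrative names, not the actual celestia-node protocol:

```go
package shrexeds

import (
	"errors"
	"fmt"
	"io"
	"time"

	"github.com/libp2p/go-libp2p/core/network"
)

// readStatusTimeout bounds only the status read; illustrative value.
const readStatusTimeout = 5 * time.Second

// readStatus reads a single status byte under its own short deadline, so a
// silent server surfaces as "deadline exceeded" quickly instead of consuming
// the whole 60s request budget.
func readStatus(stream network.Stream) (byte, error) {
	if err := stream.SetReadDeadline(time.Now().Add(readStatusTimeout)); err != nil {
		return 0, fmt.Errorf("setting read deadline: %w", err)
	}
	// Clear the deadline once the status is read so it does not affect the
	// rest of the response.
	defer stream.SetReadDeadline(time.Time{}) //nolint:errcheck

	var buf [1]byte
	if _, err := io.ReadFull(stream, buf[:]); err != nil {
		if errors.Is(err, io.EOF) {
			// A clean EOF here means the server closed the stream without a
			// status, e.g. because it rate-limited the request.
			return 0, fmt.Errorf("stream closed before status: %w", err)
		}
		return 0, fmt.Errorf("failed to read status from stream: %w", err)
	}
	return buf[0], nil
}
```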
- Somehow we get a peer that doesn't support the protocol (almost immediate error)
- How is this possible with our current nodes?
- We should probably just blacklist such peers (see the sketch after the logs below)
2023-05-09T15:24:15.781+0200 DEBUG shrex/eds shrexeds/client.go:57 client: eds request to peer failed {"peer": "12D3KooWCV61dtrSkykyEJ4Z8i6LTgvYhaWVuEjChs8Crk4phLXJ", "hash": "2C159F3964034F0DE630D9309735AD80E267E012C7BA271A6D93B17E610A444E", "error": "failed to open stream: protocols not supported: [/blockspacerace-0/shrex/eds/v0.0.1]"}
2023-05-09T15:24:15.781+0200 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWCV61dtrSkykyEJ4Z8i6LTgvYhaWVuEjChs8Crk4phLXJ", "hash": "2C159F3964034F0DE630D9309735AD80E267E012C7BA271A6D93B17E610A444E", "err": "failed to open stream: protocols not supported: [/blockspacerace-0/shrex/eds/v0.0.1]"}
2023-05-09T15:24:15.781+0200 DEBUG share/getters getters/shrex.go:171 eds: request failed {"hash": "2C159F3964034F0DE630D9309735AD80E267E012C7BA271A6D93B17E610A444E", "peer": "12D3KooWCV61dtrSkykyEJ4Z8i6LTgvYhaWVuEjChs8Crk4phLXJ", "attempt": 3, "err": "failed to open stream: protocols not supported: [/blockspacerace-0/shrex/eds/v0.0.1]", "finished (s)": 0.019803125}
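A hedged sketch of the blacklisting idea; the blacklister interface and openOrBlacklist are hypothetical, and the string match is used only because the concrete "protocols not supported" sentinel differs between go-multistream versions:

```go
package shrexeds

import (
	"context"
	"strings"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// blacklister is an illustrative stand-in for a peer manager that can drop
// peers permanently.
type blacklister interface {
	Blacklist(peer.ID)
}

// openOrBlacklist opens the shrex stream and, if the remote does not speak
// the protocol at all, removes the peer instead of retrying it later.
func openOrBlacklist(
	ctx context.Context,
	h host.Host,
	bl blacklister,
	p peer.ID,
	proto protocol.ID,
) (network.Stream, error) {
	stream, err := h.NewStream(ctx, p, proto)
	if err != nil {
		if strings.Contains(err.Error(), "protocols not supported") {
			// The peer is reachable but does not serve shrex/eds;
			// no point in keeping it around for future attempts.
			bl.Blacklist(p)
		}
		return nil, err
	}
	return stream, nil
}
```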
- Stateless reset with token after 22s
- Has something to do with QUIC
2023-05-09T15:02:30.163+0200 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWD4uxio9FnK2AUBoydXo5LqiiXbC7QZR658wUt6Hf5xfD", "hash": "D7A4B103159BCE6FC9AD0B3E8995F1932B2F9C76357BFD1E886298119E4D12E1", "err": "failed to read status from stream: received a stateless reset with token a7b84d1da38f64eec20c9e2b8b395729"}
2023-05-09T15:02:30.163+0200 DEBUG share/getters getters/shrex.go:170 eds: request failed {"hash": "D7A4B103159BCE6FC9AD0B3E8995F1932B2F9C76357BFD1E886298119E4D12E1", "peer": "12D3KooWD4uxio9FnK2AUBoydXo5LqiiXbC7QZR658wUt6Hf5xfD", "attempt": 1, "err": "failed to read status from stream: received a stateless reset with token a7b84d1da38f64eec20c9e2b8b395729", "finished (s)": 22.121177}
- Failed to open stream: context deadline exceeded (after 60s)
- We should not let the opening of the stream take more than maybe 5 seconds (see the sketch after the logs below).
2023-05-09T14:32:34.940+0200 DEBUG shrex/eds shrexeds/client.go:57 client: eds request to peer failed {"peer": "12D3KooWBbc9ixMAH4SzF3EH6SwSfo8gRW1bVrDUX19oxC4DaRbz", "hash": "C71DA660939D4B174129A6F5839DB73B9F353544365C5EEADF2C7F7324F4C38A", "error": "failed to open stream: context deadline exceeded"}
2023-05-09T14:32:34.940+0200 DEBUG share/getters getters/shrex.go:170 eds: request failed {"hash": "C71DA660939D4B174129A6F5839DB73B9F353544365C5EEADF2C7F7324F4C38A", "peer": "12D3KooWBbc9ixMAH4SzF3EH6SwSfo8gRW1bVrDUX19oxC4DaRbz", "attempt": 1, "err": "context deadline exceeded", "finished (s)": 60.000322042}
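A minimal sketch of capping stream opening at roughly 5 seconds, assuming a context.WithTimeout wrapper around host.NewStream; openStreamTimeout and openStream are illustrative names, not existing celestia-node parameters:

```go
package shrexeds

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

const openStreamTimeout = 5 * time.Second

// openStream derives a shorter context just for opening the stream, so a
// dead peer costs at most ~5s instead of the full 60s request deadline.
func openStream(ctx context.Context, h host.Host, p peer.ID, proto protocol.ID) (network.Stream, error) {
	openCtx, cancel := context.WithTimeout(ctx, openStreamTimeout)
	defer cancel()

	return h.NewStream(openCtx, p, proto)
}
```

The overall request context still bounds the rest of the exchange; only the dial and protocol negotiation phase gets the shorter budget.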
7. RequestEDS checks the child context for a ctx error but returns the parent context's error, which can lead to returning nil, nil and cause a panic. Resolved
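A hedged illustration of the nil, nil pattern described above (not the actual RequestEDS code): if the child context is checked but the parent's Err() is returned, both return values can be nil when only the child deadline fired.

```go
package shrexeds

import (
	"context"
	"time"
)

// requestWithTimeout shows the bug pattern and its fix.
func requestWithTimeout(parent context.Context) (*string, error) {
	ctx, cancel := context.WithTimeout(parent, time.Second)
	defer cancel()

	result := doRequest(ctx) // placeholder for the actual request

	if ctx.Err() != nil {
		// BUG: parent.Err() may still be nil even though ctx has expired,
		// so the caller would get nil, nil and later dereference nil.
		// return nil, parent.Err()

		// Fix: return the error of the context that was actually checked.
		return nil, ctx.Err()
	}
	return result, nil
}

func doRequest(context.Context) *string { return nil } // stub
```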
- "failed to read eds from ods bytes: share: reading next car entry: stream reset"
2023-05-13T17:15:36.413+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWB7GvRuTWm6DUd4pt5VTRfYfciWKLkUpCKrCtJnGNKEv7", "hash": "80730570DB5B2F39DCC2BE8091A3322E3B0B404ABAA2EC22F570E9615A60161C", "err": "failed to read eds from ods bytes: share: reading next car entry: stream reset"}
- Not an actual error, but a server node could always return ErrNotFound and never receive any punishment for it
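A hypothetical sketch of tracking ErrNotFound responses per peer so that a server that always answers "not found" can eventually be cooled down; notFoundTracker and its threshold are illustrative and not part of the existing peer manager.

```go
package shrexeds

import (
	"sync"

	"github.com/libp2p/go-libp2p/core/peer"
)

// notFoundThreshold is an illustrative cutoff before a peer is cooled down.
const notFoundThreshold = 10

type notFoundTracker struct {
	mu     sync.Mutex
	counts map[peer.ID]int
}

func newNotFoundTracker() *notFoundTracker {
	return &notFoundTracker{counts: make(map[peer.ID]int)}
}

// RecordNotFound bumps the counter and reports whether the peer crossed the
// threshold and should be deprioritized.
func (t *notFoundTracker) RecordNotFound(p peer.ID) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[p]++
	return t.counts[p] >= notFoundThreshold
}

// RecordSuccess resets the counter once the peer serves data again.
func (t *notFoundTracker) RecordSuccess(p peer.ID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.counts, p)
}
```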
- No recent network activity
- Happens for multiple peers at the same time, so it is probably related to client connectivity issues. It needs to be handled accordingly (no remote peer punishment).
2023-05-13T20:03:30.190+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWBASepmTQtATM196L1wWH6kVd4TBgpUN57koyfjT2vheX", "hash": "91DFD05708F4D96E2C1B13B040C07C962830D970922E8192423DF4F579B80EFC", "err": "failed to read status from stream: timeout: no recent network activity"}
2023-05-13T20:03:30.212+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWB7GvRuTWm6DUd4pt5VTRfYfciWKLkUpCKrCtJnGNKEv7", "hash": "91DFD05708F4D96E2C1B13B040C07C962830D970922E8192423DF4F579B80EFC", "err": "failed to read status from stream: timeout: no recent network activity"}
2023-05-13T20:03:30.212+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWB7GvRuTWm6DUd4pt5VTRfYfciWKLkUpCKrCtJnGNKEv7", "hash": "3F2C7DE3EB02FE219BF7CE931030E6E79C0E687000187B5FF456DEE2D9FB010D", "err": "failed to read eds from ods bytes: share: reading next car entry: timeout: no recent network activity"}
2023-05-13T20:03:30.212+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWB7GvRuTWm6DUd4pt5VTRfYfciWKLkUpCKrCtJnGNKEv7", "hash": "A5723C5C033FB1F6A54BE906639C901FBC3C01278F5EDD8C26CB651BB6D96997", "err": "failed to read status from stream: timeout: no recent network activity"}
2023-05-13T20:03:30.212+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWB7GvRuTWm6DUd4pt5VTRfYfciWKLkUpCKrCtJnGNKEv7", "hash": "76044121A20F6AB9FAE153F52CE671C4363385478FC1A12BF8717113F1A96F6A", "err": "failed to read status from stream: timeout: no recent network activity"}
- Failed to open stream: failed to dial
- Happens right after 10, so it must be related to a connectivity issue on the client side.
- Needs to be handled too (a classification sketch covering this and the previous error follows the log below).
2023-05-13T21:03:59.627+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWCAMGzWrcnbPradgkJEe2yTEWNYi5LPM4YqohDxDdE9Ng", "hash": "83676B211F75ECC843F411D3E2A57B8BD368DD1B8786CB8A73AFDFB08FA1E45B", "err": "failed to open stream: failed to dial 12D3KooWCAMGzWrcnbPradgkJEe2yTEWNYi5LPM4YqohDxDdE9Ng:\n * [/ip4/68.183.28.164/tcp/2121] dial tcp4 0.0.0.0:2121->68.183.28.164:2121: i/o timeout\n * [/ip4/68.183.28.164/udp/2121/quic-v1] timeout: no recent network activity"}
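A minimal classification sketch for these client-side connectivity failures ("no recent network activity" and "failed to dial"), so they can be retried with another peer instead of punishing the remote; isLocalConnectivityErr is an illustrative helper and the string matching is an assumption, not an existing API.

```go
package shrexeds

import "strings"

// isLocalConnectivityErr reports whether the error most likely originates on
// the client side (idle QUIC connection, failed dial) rather than from a
// misbehaving server. The matched substrings are taken from the logs above.
func isLocalConnectivityErr(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "timeout: no recent network activity") ||
		strings.Contains(msg, "failed to dial")
}
```

A caller could check this before reporting the peer and simply move on to the next peer from the pool when it returns true.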
- Stream reset on reading response status.
- The only reason for a stream reset at this stage is an invalid request. But since the running client works fine with the majority of peers, it seems to be a server-side issue.
2023-05-13T21:03:02.747+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWBTFMz9doATz6gxb9VUZ5d2iJ1udBfJCkho6ixgeu6nhE", "hash": "4EA6E0541DAEF1C7206EC5B5CA9C00F6546F6E32E12D6711F6B4C60E69BD3C8D", "err": "failed to read status from stream: stream reset"}
- Routing not found
- It is extremely unlikely that the peer disconnected at the exact point of the shrexeds request. The fact that three peers got disconnected in a short sequence requires investigation (see the sketch after the logs below).
2023-05-13T21:03:08.633+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWCAMGzWrcnbPradgkJEe2yTEWNYi5LPM4YqohDxDdE9Ng", "hash": "9F78CE8A2E753B78A4380C530F4AA8B6F8125CC38F38EE9BDDAD8932E083672C", "err": "failed to open stream: failed to find peers: routing: not found"}
2023-05-13T21:03:20.434+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWCAMGzWrcnbPradgkJEe2yTEWNYi5LPM4YqohDxDdE9Ng", "hash": "4657B89D6D7618A6928B1612A26DB9D03BA3D6182A4388D4A0165DDD759C0FB1", "err": "failed to open stream: failed to find peers: routing: not found"}
2023-05-13T21:03:42.354+0300 WARN shrex/eds shrexeds/client.go:72 client: eds request to peer failed {"peer": "12D3KooWBCfAdTYhtpuzbKqnc5WM3qh7Zkp3DmjrbAwgTsgiKQPW", "hash": "B366AFE879E7979DF8A132A3D3E5833F48FCF7F9434F6C4E4AE81CC0B1F7277C", "err": "failed to open stream: failed to find peers: routing: not found"}
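A hedged sketch of attributing "routing: not found" to a dropped connection: check connectedness before opening the stream, assuming go-libp2p's Connectedness API; logDisconnected is a placeholder for the module's logging or metrics.

```go
package shrexeds

import (
	"context"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// openCheckingConnectedness records when the peer from the pool is already
// disconnected. In that case NewStream has to re-discover its addresses via
// routing, which is where "failed to find peers: routing: not found" comes from.
func openCheckingConnectedness(
	ctx context.Context,
	h host.Host,
	p peer.ID,
	proto protocol.ID,
) (network.Stream, error) {
	if h.Network().Connectedness(p) != network.Connected {
		logDisconnected(p)
	}
	return h.NewStream(ctx, p, proto)
}

func logDisconnected(peer.ID) {} // placeholder for real logging/metrics
```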
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15 (14 by maintainers)
To summarize: most of the errors above are unrelated to shrex, and shrex should not be responsible for handling them. They are various manifestations of issues on the remote side, which shrex can do nothing about. It should simply ignore such a peer, let it go, and move on.
However, I really appreciate the effort here to document each.
Happens for multiple peers at the same time. Probably related to closing connections over the limit at the libp2p level.