hermes: Update client may cause "new header has a time from the future" chain error

Crate

relayer

Summary of Bug

@mircea-c has reported that the following errors are seen once in a while when updating the sifchain client (07-tendermint-395) on cosmoshub:

Oct 11 09:13:03.057 DEBUG ThreadId(42) [cosmoshub-4] simulated gas: Err(
    0: GRPC call return error status status: InvalidArgument, message: "failed to execute message; message index: 0: cannot update client with ID 07-tendermint-395: failed to verify header: invalid header: new header has a time from the future 2021-10-11 13:12:56.673817244 +0000 UTC (now: 2021-10-11 13:12:51.047099028 +0000 UTC; max clock drift: 5s): invalid request", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

After looking at the sifchain configuration (client created with clock_drift = 5s) and comparing block times (5-6s on sifchain, 7-13s on the hub), the following seems to be the issue. Here is an example (each - is 1 second, b marks blocks on sifchain, B marks blocks on the hub).

           1:ev 2:clUpdate
           |    |  |
           |    | 3: fetch header (h)
           v    v  |
b----b-----b----b-----b----b      (sifchain 5/6 sec blocktimes)
                |  | 4: simulate/ send Tx
                |  v
B-------B------------B------B     (cosmoshub 7-13 sec)
        ^       |
        | 5: create context with latest header (H)
        |       |
        v       v
        <-------> 6: check that h.timestamp - drift < H.timestamp (aka now)

The steps are:

  1. An IBC event occurs that requires a client update.
  2. Hermes starts to build the client update message.
  3. Hermes fetches the header h from sifchain (this happens for the header at event height + 1).
  4. Hermes sends a simulate request to the hub (this can happen on send_tx as well).
  5. The chain receives the request and creates a context with the latest header H on the hub.
  6. The client handler is called with that context and the message. Among other things it checks that h.timestamp is not “in the future”, i.e. that H.timestamp + client_state.clock_drift > h.timestamp.

(One 13-second block example on the hub: height 7969646 at ts 16:31:24, height 7969647 at ts 16:31:37.)
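
As a rough illustration of the step-6 check with the numbers from the log above (a sketch in Rust, not the actual on-chain handler code):

    use std::time::Duration;

    // True when the host chain would reject the header, i.e. when
    // h.timestamp >= H.timestamp + max_clock_drift.
    fn header_from_the_future(h_ts: Duration, host_ts: Duration, max_clock_drift: Duration) -> bool {
        h_ts >= host_ts + max_clock_drift
    }

    fn main() {
        // Timestamps from the error message, as millisecond offsets within 13:12.
        let h = Duration::from_millis(56_673);   // sifchain header: 13:12:56.673
        let now = Duration::from_millis(51_047); // latest hub header: 13:12:51.047
        let drift = Duration::from_secs(5);
        // 51.047s + 5s = 56.047s < 56.673s, so the update is rejected.
        assert!(header_from_the_future(h, now, drift));
    }

With the hub's 7-13s block times, now can easily lag behind the sifchain header by more than 5 seconds, which is exactly the failure above.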

One problem is that a clock_drift of 5s is not enough for clients of sifchain hosted on the hub, even though it might be appropriate for clients of sifchain on other chains; this alone hints at the difficulty of sizing the parameter properly. Also, changing the configuration is not enough to fix the issue: it would require new clients, connections and channels to be created, and applications to switch to these.

The other solution is to make hermes’ update-client code more robust. This issue is to investigate and, if possible, fix this on the hermes side. cc @adizere @romac. A number of ideas were mentioned: hermes could perform the H.timestamp + client_state.clock_drift > h.timestamp check itself when updating tendermint clients, or it could drop the tx on this specific tx simulate (or tx_send()) error, leaving the update to be picked up by the clear-packet mechanism or by the retries for channel/connection handshakes.
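
A minimal sketch of the first idea, a relayer-side pre-check before submitting the update; the function and its inputs are hypothetical (not existing hermes APIs) and are assumed to come from querying the latest header on the destination chain and the on-chain client state:

    use std::time::{Duration, SystemTime};

    /// Mirrors the on-chain check: true if submitting the update now would trigger
    /// "new header has a time from the future" on the destination chain.
    fn update_would_be_rejected(
        src_header_time: SystemTime,  // h.timestamp, from the header fetched on the source chain
        dst_latest_time: SystemTime,  // H.timestamp, from the latest header on the destination chain
        client_clock_drift: Duration, // clock_drift recorded in the client state
    ) -> bool {
        src_header_time >= dst_latest_time + client_clock_drift
    }

When this returns true, hermes could log a warning and wait for the next destination block, or drop the tx as described above, instead of submitting a message that is guaranteed to fail.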

Version

all

Steps to Reproduce

We should be able to reproduce this with a local setup: create two chains with different block times, then create a client with a small clock_drift value.

Acceptance Criteria

Ideally this error should never be seen with correct headers. The impact on hermes operation should be minimal with “bad” headers, e.g. one whose timestamp is far in the future.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Most upvoted comments

Jacob was reporting this same error (osmosis-labs/osmosis#520 (comment)) potentially also on a sifchain channel, not really sure.

Two questions:

  • Is there any safety issue with increasing max_clock_drift to say 15 seconds?

I think it’s fine and advisable for new clients. Existing clients would need to be upgraded.

  • Is it possible to make the error print the counterparty chain’s block height and chain-id, to ease debugging? (e.g. we can also go talk to that block proposer to fix their local time)

It’s not straightforward at exactly the place where the error is issued, but we will find a way to print more info. The problem, though, might not be an unsynchronized clock in this case. I noticed that once in a while we have the same proposer for two consecutive blocks, and when this happens the block time doubles. I gave one example in the description; here is another one on the hub: 7988809 - 08:37:19, 7988810 - 08:37:31. In the few cases I checked, the second block is an empty block.

In the example above, the third block, 7988810, has client update tx-es.

When suggesting 5s as the default we only accounted for the actual clock drift and not the block time. Not sure if this is something the on-chain client handler should/could take into account when checking for the drift. cc @colin-axner.

From the hermes side we should consider the following:

  • change the default
  • document that clock_drift should include a rough estimate of the block time (maybe x2, to account for the longer block times we see with empty blocks)
  • on client update, verify that the latest header timestamp on the destination chain is recent enough that the update header will not be considered in the future; otherwise print a warning log and wait for the next block on the destination (see the sketch after this list)
  • also use the destination chain’s clock_drift when creating the client (I think)
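
A sketch of the third item, assuming a pre-check like the one in the description and a hypothetical query for the latest destination header time:

    use std::{thread, time::{Duration, SystemTime}};

    /// Hypothetical helper: block until the destination chain has a header recent
    /// enough that `src_header_time` is no longer "in the future" for the client.
    fn wait_for_recent_dst_header(
        src_header_time: SystemTime,
        client_clock_drift: Duration,
        dst_block_time: Duration,                       // rough destination block time, used as the poll interval
        query_latest_dst_time: impl Fn() -> SystemTime, // stands in for an existing header query
    ) {
        while src_header_time >= query_latest_dst_time() + client_clock_drift {
            // The update would still be rejected; warn and wait for the next block.
            eprintln!("update header is in the future for the destination; waiting one block");
            thread::sleep(dst_block_time);
        }
    }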

Sorry for being late here. In the light client spec, clockDrift is a parameter dealing with only approximately synchronized clocks. It is used only to constrain what time from the future is allowed, i.e., an acceptable time is smaller than now + clockDrift. In the IBC case, we don’t have time that is necessarily synced with some source (for example NTP time), as we use bft time, so we could in theory have arbitrary clock drifts between chains.

If I am not wrong, @ancazamfir was considering the following scenario: we take the top header h of chain A and the top header H of chain B, and we try to understand under what condition header h will be considered acceptable, from a time perspective, on chain B. I don’t see another way of making this work in every case unless the relayer checks the top header time on chain B and makes sure that the condition is satisfied. This is, at least from a theoretical perspective, the only sound approach in my view, as we could have a period with no blocks on chain B.

If you try to optimise for the normal case, where blocks are continuously being produced on both chains, we need to take into account that, from the perspective of the time of the latest block on chain B, the time of the header from chain A must not be too far in the future. For example, in the worst case the latest block H on chain B was committed roughly at time t - chainB.average_block_time, where t is the time at which header h is committed on chain A (from a bird’s-eye perspective), i.e., the latest block on chain B is lagging behind chain A. For h to be accepted we would then need the following to hold:

h.time + clock_drift_A < (h.time - clock_drift_B - chainB.average_block_time) + clock_drift_B, which reduces to clock_drift_A + chainB.average_block_time < 0; this is impossible, as both values are positive.

My point here is that even if we assume that blocks on both chains are produced continuously, so we know the average block time, and that bft times on both chains are in sync, we can still end up in a scenario where the destination chain has not yet produced a block whose time is close to now, while such a block has already been produced on the source chain. In this case, no matter how we choose the parameters, the client update might fail. Does this make sense, or am I misunderstanding something?
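
Making the implied worst-case requirement explicit (a sketch; maxClockDrift stands for the clock_drift stored in the client state, and avgBlockTime_B for chain B's average block time):

    \[
      \mathrm{maxClockDrift} \;>\; \mathrm{clockDrift}_A + \mathrm{clockDrift}_B + \mathrm{avgBlockTime}_B
    \]

which is essentially the sizing rule proposed in the next comment.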

Maybe the total clock_drift value used when creating a client needs to consider the worst case, i.e. it should be src_chain.clock_drift + dst_chain.clock_drift + dst_chain.block_time. This means we need a separate block_time configured per chain, unless we can estimate it via some query(?). cc @milosevic for help here.
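
For illustration, a sketch of that sizing rule; the struct and field names here are made up for the example and are not actual hermes config types:

    use std::time::Duration;

    // Hypothetical per-chain settings; `block_time` would be a new, separately
    // configured (or estimated) value.
    struct ChainSettings {
        clock_drift: Duration,
        block_time: Duration,
    }

    /// Worst-case clock_drift to record in a client of `src` hosted on `dst`.
    fn client_clock_drift(src: &ChainSettings, dst: &ChainSettings) -> Duration {
        src.clock_drift + dst.clock_drift + dst.block_time
    }

For example, assuming 5s of clock drift on each side plus the ~13s worst-case hub block time seen above, this would give roughly 23s instead of 5s.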