indexer: Catchpoint Stuck on Phase 1
Subject of the issue
I’m running a mainnet indexer instance and recently upgraded from 2.11 => 2.13. I went through the procedure for updating to the latest catchpoint, but it seems to have gotten stuck after processing the accounts. Here are the logs:
Jul 29 21:32:04 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:04Z"}
Jul 29 21:32:09 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:09Z"}
Jul 29 21:32:14 ip-10-0-1-62 algorand-indexer[10434]: {"event":"ConnectedOut","file":"wsNetwork.go","function":"github.com/algorand/go-algorand/network.(*WebsocketNetwork).tryConnect","level":"info","line":2094,"local":"","msg":"Made outgoing connection to peer relay-mumbai-mai
Jul 29 21:32:14 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:14Z"}
Jul 29 21:32:19 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:19Z"}
Jul 29 21:32:24 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:24Z"}
Jul 29 21:32:29 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:29Z"}
Jul 29 21:32:34 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:34Z"}
Jul 29 21:32:39 ip-10-0-1-62 algorand-indexer[10434]: {"level":"info","msg":"catchup phase 1 of 4 (Processed Accounts): 14233616 / 14233616","time":"2022-07-29T21:32:39Z"}
Everything looks normal except the WebSocket connection message. One thing to note is that my indexer instance is in a VPC on AWS; the firewall rules allow only incoming SSH connections, with no outgoing connections. The indexer has been stuck in this state for an hour now.
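For reference, here is a minimal sketch of how outbound relay connectivity from the VPC can be checked, assuming the standard mainnet bootstrap SRV record and relay port; the `nc` target below is a placeholder for a host/port pair taken from the `dig` output:

```sh
# Resolve the mainnet relay bootstrap SRV record used during catchup.
dig +short SRV _algobootstrap._tcp.mainnet.algorand.network

# Test outbound TCP to one of the returned relays (placeholder host/port;
# substitute a pair from the dig output above).
nc -zv r1.algorand-mainnet.network 4160
```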
Your environment
12885426177
3.8.1.stable [rel/stable] (commit #73615e0b)
go-algorand is licensed with AGPLv3.0
source code available at https://github.com/algorand/go-algorand
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 19 (7 by maintainers)
Had the low IOPS limit increased and got it running. There are other issues besides this one; I will open a new issue.
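For anyone else who lands here, a minimal sketch of raising provisioned IOPS on an attached EBS volume with the AWS CLI; the volume ID and target IOPS are placeholders, and `--iops` only applies to gp3/io1/io2 volumes:

```sh
# Raise the volume's provisioned IOPS in place (no detach required).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --iops 10000

# Watch the modification until it reaches the optimizing/completed state.
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```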
@Blackglade thanks for bearing with me. I ran some tests and took a closer look at your error messages, and I have a few things to share:
Unusual messages
Processed accounts resetting
I was able to reproduce this using an EBS drive once it started throttling the IOPS.
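If you want to check whether the drive itself is the bottleneck, a random-read `fio` benchmark approximates this access pattern; the file path, block size, and runtime below are illustrative choices, not tuned values:

```sh
# 4k random reads with direct I/O (bypassing the page cache); compare the
# reported IOPS against the volume's provisioned or baseline IOPS.
fio --name=randread --filename=/mainnet/fio-test --size=1G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=16 --numjobs=4 --runtime=60 --time_based --group_reporting

# Clean up the test file afterwards.
rm /mainnet/fio-test
```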
Follow-up question / recommendation
What type of EBS drive are you using, and how many IOPS has it been provisioned with?
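Both can be read straight off the volume with the AWS CLI (the volume ID below is a placeholder):

```sh
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
    --query 'Volumes[*].{Type:VolumeType,IOPS:Iops,Size:Size}' --output table
```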
New recommendation: deployments should use `NVMe` drives. This is based on testing with `standard (magnetic)` / `gp2` / `io1` / `io2` / `NVMe` drives, and matches the recommendation for algod. It should have been the recommendation for the new version of Indexer from the beginning, so I’m really sorry to have put you through this. Thanks again for the detailed logs that made it very clear that there was a problem.

Furthermore, for this testing I also put together a utility to make it easier to test this process. I hope to make it available in a future release to assist with debugging hardware configurations.
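As a rough sketch of what the NVMe recommendation looks like in practice on AWS, an instance type with local NVMe storage can host the indexer data directory on the instance-store drive; the device name and mount point below are placeholders, so check `lsblk` on your instance:

```sh
# Identify the local NVMe instance-store device (name varies by instance type).
lsblk -o NAME,MODEL,SIZE

# Format and mount it, then point the indexer's data directory at it.
# Note: instance-store data does not survive a stop/terminate; the indexer
# can re-catch up from a catchpoint if the drive is lost.
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mainnet
sudo mount /dev/nvme1n1 /mainnet
```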
@Blackglade thanks for the logs. I don’t recall seeing anything like this during testing, so I’ll need to ask around next week.