vitess: Bug Report: vttablet hanging when running PlannedReparentShard

Overview of the Issue

During a PlannedReparentShard (PRS), vttablet sometimes hangs. The last lines it logs before stalling are:

I0317 04:56:50.637023    9434 rpc_replication.go:388] DemotePrimary
I0317 04:56:50.638689    9434 rpc_replication.go:438] DemotePrimary disabling query service
I0317 04:56:50.638703    9434 state_manager.go:214] Starting transition to PRIMARY Not Serving, timestamp: 2022-03-17 04:25:38.205679188 +0000 UTC
I0317 04:56:50.638773    9434 tablegc.go:212] TableGC: closing
...
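
To pin down the last line a tablet logged before it stalled, the log header can be parsed mechanically. The following is a minimal sketch, assuming the glog-style layout visible in the fragment above (severity letter, month and day, time, thread ID, file:line, then the message). The names `parseGlog`, `lastEntry`, and `entry` are hypothetical helpers for illustration and are not part of Vitess.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// glogLine matches headers such as:
//   I0317 04:56:50.638773    9434 tablegc.go:212] TableGC: closing
var glogLine = regexp.MustCompile(
	`^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^:\s]+):(\d+)\] (.*)$`)

// entry holds the fields of one parsed log line (hypothetical type).
type entry struct {
	Severity string // I, W, E or F
	Date     string // mmdd, as printed in the log
	Clock    string // hh:mm:ss.uuuuuu
	Source   string // file name that emitted the line
	Line     int    // line number within Source
	Message  string
}

// parseGlog returns the parsed entry, or false if the line has no glog header.
func parseGlog(line string) (entry, bool) {
	m := glogLine.FindStringSubmatch(line)
	if m == nil {
		return entry{}, false
	}
	n, err := strconv.Atoi(m[6])
	if err != nil {
		return entry{}, false
	}
	return entry{Severity: m[1], Date: m[2], Clock: m[3], Source: m[5], Line: n, Message: m[7]}, true
}

// lastEntry returns the final parseable line, skipping anything else (e.g. "...").
func lastEntry(lines []string) (entry, bool) {
	for i := len(lines) - 1; i >= 0; i-- {
		if e, ok := parseGlog(lines[i]); ok {
			return e, true
		}
	}
	return entry{}, false
}

func main() {
	lines := []string{
		"I0317 04:56:50.638689    9434 rpc_replication.go:438] DemotePrimary disabling query service",
		"I0317 04:56:50.638773    9434 tablegc.go:212] TableGC: closing",
		"...",
	}
	if e, ok := lastEntry(lines); ok {
		fmt.Printf("%s %s:%d %s\n", e.Clock, e.Source, e.Line, e.Message)
	}
}
```

Comparing the last entry's source file across several hung tablets is one way to check whether they all stop at the same step.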

Here’s a sample debug blocking profile covering this: https://gist.github.com/derekperkins/dd6d54809a98b582c03909061e639766

Reproduction Steps

We suspect that it involves:

  1. Setting up many active vstreams: MoveTables, Reshard, OnlineDDL, Messaging
  2. Running PRS while those vstreams are active

Binary Version

v13.0.0

Operating System and Environment details

N/A

Log Fragments

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

@derekperkins if possible, please also grab the output of mysql> show full processlist; from the primary tablet’s mysqld instance at the moment the messages stop flowing from the shard. That would be very helpful.

Thanks!

I’m rebuilding now on the same branch, with #9942 cherry-picked on top. I will deploy as soon as the build completes: https://hub.docker.com/repository/registry-1.docker.io/vitess/base/builds/c1a58d39-4848-4237-bc1f-f16512ec7948

update: it’s deployed now. Most of our heavy usage starts at midnight UTC, so hopefully I’ll have some logs by tomorrow