OpenSearch: [BUG] Timeout on org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation

Describe the bug Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for test: #1445.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=AF1232B890DC88C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=he-IL -Dtests.timezone=Japan -Druntime.java=17

To Reproduce Steps to reproduce the behavior:

  1. Run the above command in the repo. It's a flaky test.

Expected behavior The cluster becomes green and the test does not time out.

Plugins Core OpenSearch.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

Okay, I can see that none of the shards were unassigned; only one replica was still initializing, and it would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! 😃

  1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
  1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
  1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
  1> --------[foo][6], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=cnF4_Bs4QqCuxxj_HFAqgw]
  1> --------[foo][95], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pAGuKsjjR-G3RbvFpvYwQQ]
  1> --------[foo][20], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pEms1UhDSxW0LUngYrIsPQ]
  1> --------[foo][14], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=aqRw-O9sQ3i3odC_TtwSwg]
  1> --------[foo][76], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9pd7AEihRFO-_Cwoa_UAGA]
  1> --------[foo][7], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OUN6zdcYTFK5F3MfFQOdrg]
  1> --------[foo][81], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=H5BXr0llSS-pR3ai8VrySg]
  1> --------[foo][89], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-_-RKj8SyCfKIFnvmQL4w]
  1> --------[foo][24], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pq4U5n3tQlOGFmrxcNiJ3Q]
  1> --------[foo][19], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qs8kmVy1RAKaRKmCdWJsXQ]
  1> --------[foo][59], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=AllB9KA8ST-R6hiADkbp8Q]
  1> --------[foo][58], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=Cc110Ve-QAyxnMs8gwN5jQ]
  1> --------[foo][26], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6bx8XZ49QZCVLwMAHzFO9A]
  1> --------[foo][66], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=m-0y5bnxREajEeNHSDVlPQ]
  1> --------[foo][86], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=p0Q47DxfTleki8s42UIqhg]
  1> --------[foo][9], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-F_n75mR3KpWbHkMXqBVg]
  1> --------[foo][17], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qjWibVYFSFildJbykxn3rg]
  1> --------[foo][44], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bA-J1mDcR_u7CdCgneZxzQ]
  1> --------[foo][94], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qr73XHAKRKebttLJrfbVPw]
  1> --------[foo][11], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=CTR9cjS3QfSFG0qt7VRrbQ]
  1> --------[foo][28], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=izZjpNhoS-2AFvj4Y4Y2JQ]
  1> --------[foo][30], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fzERcAZ8TemBOeaUKWsMxw]
  1> --------[foo][53], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fiMnSwICTN--mEf8Fn3oqQ]
  1> --------[foo][71], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=7xbaRZI2QPKdvBrE5iT0JQ]
  1> --------[foo][32], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bFplE9pGTY-uTVHdl5RByQ]
  1> --------[foo][43], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wsdW3kXSRZakgjBEUoLNOw]
  1> --------[foo][52], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=zxDyzd0gTISJfR05fNkZSw]
  1> --------[foo][3], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=j_ePRsVJRm2DPaDHu-q-tg]
  1> --------[foo][92], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OBJ-9NKNTU6kZQ9R0a0s6A]
  1> --------[foo][82], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=5L3LOKOiR5mGoFwWoCxkHg]
  1> --------[foo][41], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wTks5SaaQ2a7i74Cy7WwXA]
  1> --------[foo][50], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=D2n8YmDQSbab6YaDEmNk-w]
  1> --------[foo][33], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=i0GmrglcSyWLeM7jBmtU7Q]
  1> --------[foo][57], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LngaeINzSQCPywze9NU2Rw]
  1> --------[foo][18], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=VxHWMVCUThy08tOoJ6VAww]
  1> --------[foo][61], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9EPpu_ThT8u1HXDIX-DVzQ]
  1> --------[foo][48], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LE_hapDHTdyGqdUm1xNk_Q]
  1> --------[foo][51], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=UWCrnlekRJ6kTKufo_cmUg]
  1> --------[foo][85], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=0OIL6unHQMyhG0MpuEXrdQ]
  1> --------[foo][38], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=ir4h6zK0SDGCzYEEu5QBUw]
  1> ---- unassigned

  1> pending tasks:
  1> tasks: (1):
  1> 933/URGENT/shard-started StartedShardEntry{shardId [[foo][98]], allocationId [AHHzR21gRwyH9m8Z2tDMoA], primary term [1], message [after peer recovery]}/36ms
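
The reading of the dump above (nothing unassigned, one replica still initializing) can be checked mechanically. The following is a minimal Python sketch, not part of the OpenSearch test framework: the regular expression and the helper name `parse_shards` are assumptions based only on the line format visible in this dump, and the sample holds just three of the shard lines.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the dump in this issue:
#   [index][shard], node[id], [P|R], ..., s[STATE], a[id=...]
SHARD_LINE = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\], "
    r"node\[(?P<node>[^\]]*)\], "
    r"\[(?P<role>[PR])\].*?, s\[(?P<state>[A-Z_]+)\]"
)


def parse_shards(dump):
    """Return one dict per shard line; lines that do not match are skipped."""
    shards = []
    for line in dump.splitlines():
        match = SHARD_LINE.search(line)
        if match:
            shards.append(match.groupdict())
    return shards


# Three lines copied from the dump, plus the trailing "unassigned" marker.
SAMPLE = """\
  1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
  1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
  1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
  1> ---- unassigned
"""

shards = parse_shards(SAMPLE)
state_counts = Counter(s["state"] for s in shards)
# Shards that would keep the cluster from reporting green.
not_started = [
    (s["index"], int(s["shard"]), s["state"])
    for s in shards
    if s["state"] != "STARTED"
]
print(state_counts)
print(not_started)
```

Run against the full dump, a tally like this should single out `[foo][98]` as the only shard not yet `STARTED`, which matches the pending `shard-started` task listed for it.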

Hey @jainankitk! I spent some time on the bug, and the fix mostly came down to adding a timeout to the ensureGreen call so that the nodes have time to become available. I raised a PR for this, and it’s merged now.
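
To illustrate why a longer timeout is enough when the last shard only needs a few more seconds, here is a generic poll-until-deadline sketch in Python. It is not the implementation of ensureGreen; `wait_until`, `FakeClock`, and the 35-second recovery time and 30-second short timeout are hypothetical values chosen for the illustration. Only the 60-second figure comes from this thread.

```python
import time


def wait_until(condition, timeout_s, poll_interval_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


class FakeClock:
    """Deterministic clock so the example runs instantly (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def cluster_green_after(clock, seconds):
    """Pretend the last replica finishes recovery after `seconds` (made-up value)."""
    return lambda: clock.time() >= seconds


# Short timeout (hypothetical 30 s): recovery needs 35 s, so the wait gives up.
short_clock = FakeClock()
timed_out = not wait_until(cluster_green_after(short_clock, 35), 30,
                           clock=short_clock.time, sleep=short_clock.sleep)

# 60 s timeout, as proposed in this thread: the same recovery completes in time.
long_clock = FakeClock()
became_green = wait_until(cluster_green_after(long_clock, 35), 60,
                          clock=long_clock.time, sleep=long_clock.sleep)
print(timed_out, became_green)
```

The same slow recovery fails under the short deadline and passes under the longer one, which is the behavior described in the comments: the cluster was healthy, just not green yet.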

Thank you, appreciate that!