OpenSearch: [BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0

Describe the bug Nodes cannot join back to the cluster after upgrading from 2.0.1 to 3.0.0.

To Reproduce Failing GHA: https://github.com/cliu123/security/runs/6926965970?check_suite_focus=true

Expected behavior A clear and concise description of what you expected to happen.

Plugins Please list all plugins currently enabled.

Error logs

 WARN ][o.o.c.NodeConnectionsService] [securityBwcCluster0-2] failed to connect to {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} (tried [1] times)
»  org.opensearch.transport.ConnectTransportException: [securityBwcCluster0-1][127.0.0.1:43131] connect_exception
»  	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1076) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:215) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:55) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
»  	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
»  	at org.opensearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:70) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:81) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
»  	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
»  	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]
»  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.0.1/127.0.0.1:43131
»  Caused by: java.net.ConnectException: Connection refused
»  	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
»  	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
»  	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
»  	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
»  	... 7 more
»   ↓ last 40 non error or warning messages from /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster0-2/logs/opensearch.stdout.log ↓
» [2022-06-16T22:10:14,761][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] node name [securityBwcCluster0-2], node ID [jjX5m3WkSLe8eAwhuQhn7A], cluster name [securityBwcCluster0], roles [cluster_manager, remote_cluster_client, data, ingest]
» [2022-06-16T22:10:19,058][INFO ][o.o.t.NettyAllocator     ] [securityBwcCluster0-2] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=512mb}]
» [2022-06-16T22:10:19,129][INFO ][o.o.d.DiscoveryModule    ] [securityBwcCluster0-2] using discovery type [zen] and seed hosts providers [settings, file]
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] initialized
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] starting ...
» [2022-06-16T22:10:19,727][INFO ][o.o.t.TransportService   ] [securityBwcCluster0-2] publish_address {127.0.0.1:42529}, bound_addresses {[::1]:37963}, {127.0.0.1:42529}
» [2022-06-16T22:10:19,932][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] startInitialJoin: coordinator becoming CANDIDATE in term 0 (was null, lastKnownLeader was [Optional.empty])
» [2022-06-16T22:10:19,960][INFO ][o.o.h.AbstractHttpServerTransport] [securityBwcCluster0-2] publish_address {127.0.0.1:46681}, bound_addresses {[::1]:41935}, {127.0.0.1:46681}
» [2022-06-16T22:10:19,962][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] started
» [2022-06-16T22:10:19,962][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] Node started
» [2022-06-16T22:10:19,968][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] 0 OpenSearch Security modules loaded so far: []
» [2022-06-16T22:10:20,876][INFO ][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] setting initial configuration to VotingConfiguration{jjX5m3WkSLe8eAwhuQhn7A,Us2JKk3aSU-lIJVCnUQ02w,{bootstrap-placeholder}-securityBwcCluster0-0}
» [2022-06-16T22:10:21,093][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,103][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] joinLeaderInTerm: for [{securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] with term 1
» [2022-06-16T22:10:21,104][DEBUG][o.o.c.c.CoordinationState] [securityBwcCluster0-2] handleStartJoin: leaving term [0] due to StartJoinRequest{term=1,node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-06-16T22:10:21,129][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}} starting election
» [2022-06-16T22:10:21,131][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=645, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,134][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} requesting pre-votes from [{securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]
» [2022-06-16T22:10:21,157][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,159][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}, {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,150][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}]}, node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}} failed
»  org.opensearch.transport.RemoteTransportException: [securityBwcCluster0-1][127.0.0.1:43131][internal:cluster/request_pre_vote]
»  Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader
»  	at org.opensearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:162) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:100) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:103) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:798) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
»  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 47 (36 by maintainers)

Most upvoted comments

@cliu123 did you try running opensearch-min without security plugin? OpenSearch 3.0.0 is passing bwc tests with 2.0.1. Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions

That’s a good point! I haven’t tried that as I don’t see any security specific errors. But I’ll try.