OpenSearch: [BUG] _cluster/stats API returning incorrect cluster_manager count

Describe the bug _cluster/stats API returns wrong count of nodes with cluster_manager role.

To Reproduce Steps to reproduce the behavior:

  1. Create a multi-node cluster (I tried it on OS 2.3), let's say with 3 nodes that have the cluster_manager role.
  2. Check the response of _cat/nodes, which should show the correct roles of each node.
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles                                   cluster_manager name
10.0.3.37            40          77   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.5.179           45          77   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.4.180           12          76   0    0.00    0.02     0.01 -         ml                                           -               ml-node
10.0.4.224           41          76   0    0.00    0.00     0.00 mmr       cluster_manager,master,remote_cluster_client -               manager-node
10.0.4.16            37          78   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.3.181           13          76   0    0.00    0.00     0.00 mmr       cluster_manager,master,remote_cluster_client -               manager-node
10.0.5.122           17          76   0    0.01    0.01     0.00 mmr       cluster_manager,master,remote_cluster_client *               seed
  3. Check the response of _cluster/stats
.
.
.
"nodes" : {
    "count" : {
      "total" : 7,
      "cluster_manager" : 6,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 3,
      "master" : 6,
      "ml" : 1,
      "remote_cluster_client" : 6
    },
    "versions" : [
      "2.3.0"
    ],
.
.
.

Expected behavior Count of cluster_manager and master should be 3 in above case.
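
As a rough cross-check of the expected numbers, the per-role counts can be derived directly from the node.roles column of the _cat/nodes output above. This is only a sketch: the role strings are copied by hand from that output, and `count_roles` is a hypothetical helper, not part of any OpenSearch client.

```python
from collections import Counter

# node.roles column copied by hand from the _cat/nodes output above,
# one entry per node (7 nodes in total).
NODE_ROLES = [
    "data,ingest,remote_cluster_client",
    "data,ingest,remote_cluster_client",
    "ml",
    "cluster_manager,master,remote_cluster_client",
    "data,ingest,remote_cluster_client",
    "cluster_manager,master,remote_cluster_client",
    "cluster_manager,master,remote_cluster_client",
]


def count_roles(node_roles):
    """Count how many nodes carry each role, counting a node once per role."""
    counts = Counter()
    for roles in node_roles:
        # set() guards against a role being listed twice on one node.
        counts.update(set(roles.split(",")))
    return dict(counts)


expected = count_roles(NODE_ROLES)
print(expected["cluster_manager"], expected["master"])  # 3 3
```

Counted this way, data, ingest, ml and remote_cluster_client agree with the _cluster/stats response (3, 3, 1 and 6), while cluster_manager and master come out as 3 instead of the reported 6.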

Plugins None

Screenshots None

Host/Environment (please complete the following information):

  • Version: 2.3

Additional context The above response was correct till OS 1.3.x

About this issue

  • State: closed
  • Created a year ago
  • Comments: 31 (25 by maintainers)

Most upvoted comments

Thanks @sandeshkr419. Understood your concern. Looks like we first need to be on the same page with @tlfeng about what backward compatibility we are aiming for here. IMHO we shouldn't break any API behavior in 2.x.

@andrross @shwetathareja @tlfeng Gentle reminder to review the PR and let me know any additional steps that are required for merging?

Narrowed down the issue.

This issue does not occur when node.roles is used to initialize the node. It occurs when the legacy settings (legacySettings) are used to initialize the node, which is how I was creating the cluster using https://github.com/opensearch-project/opensearch-cluster-cdk. That project uses the legacy setting 'node.master': true: https://github.com/opensearch-project/opensearch-cluster-cdk/blob/main/lib/opensearch-config/node-config.ts#L12 (note: legacy, not deprecated)

I have modified the fix so that the master role is removed when the legacy settings are used. Please note that there is no setting such as 'node.cluster_manager'. The new way to initialize nodes is by providing node roles, like: https://github.com/opensearch-project/opensearch-cluster-cdk/blob/main/lib/opensearch-config/node-config.ts#L47 This is the reason why the ‘master’ role is removed in my changes whenever roles are determined by legacy settings.

I have added test cases for a better understanding of the scenarios, asserting both the node.roles attached to nodes and the _cluster/stats response.

While I’m adding more test cases, I’m seeking early comments on the draft code changes. @shwetathareja @andrross @tlfeng Within this scope, I will also improve the other test cases to assert both of these instead of relying only on _cluster/stats.

Also, in response to @andrross comments:

 If the user specifies master in the node configuration, then _cat/nodes should return master. If the user specifies cluster_manager then cluster_manager should be returned.

Whatever node.roles are specified by the user, whether ‘master’ or ‘cluster_manager’, the node obeys that, so I think we can close on this.

The _cluster/stats API is something of an exception where counts for both are returned regardless of which one is specified in the configuration.

Since the changes were in getting roles from legacySettings(), no changes will be required in _cluster/stats API.
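
To illustrate how the reported count of 6 could arise, here is a toy model of the behavior described in the comments above. It is an assumption-laden sketch, not the OpenSearch source: `stats_counts` and `roles_from_legacy_settings` are hypothetical stand-ins, and the premise that a stats counter bumps both names for every 'master' or 'cluster_manager' role on a node is inferred only from the observed numbers.

```python
from collections import Counter

# Assumption: the stats counter treats these two role names as aliases
# and reports a count for both, whichever one a node carries.
ALIASES = ("cluster_manager", "master")


def stats_counts(nodes):
    """Toy model of the role counting; NOT the actual OpenSearch code."""
    counts = Counter()
    for roles in nodes:
        for role in roles:
            if role in ALIASES:
                # Every alias role on the node bumps both names.
                for alias in ALIASES:
                    counts[alias] += 1
            else:
                counts[role] += 1
    return counts


def roles_from_legacy_settings(node_master, fixed):
    """Hypothetical stand-in for resolving roles from legacy 'node.master'."""
    if not node_master:
        return []
    # Before the fix both roles are attached; after it, 'master' is dropped.
    return ["cluster_manager"] if fixed else ["cluster_manager", "master"]


# Three manager nodes configured through the legacy setting.
before = stats_counts([roles_from_legacy_settings(True, fixed=False)] * 3)
after = stats_counts([roles_from_legacy_settings(True, fixed=True)] * 3)
print(before["cluster_manager"], before["master"])  # 6 6
print(after["cluster_manager"], after["master"])    # 3 3
```

Under these assumptions, three nodes that each carry both roles are counted twice per name, which matches the 6 reported in the issue; dropping 'master' from the legacy-derived roles brings both counts back to 3 without touching the counting logic.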