OpenSearch: [BUG] _cluster/stats API returning incorrect cluster_manager count
Describe the bug
The _cluster/stats API returns the wrong count of nodes with the cluster_manager role.
To Reproduce
Steps to reproduce the behavior:
- Create a multi-node cluster on OS 2.3 (I tried it on 2.3), let's say with 3 nodes with the cluster_manager role.
- Check the response of _cat/nodes, which should show the correct roles of each node.
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles                                   cluster_manager name
10.0.3.37            40          77   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.5.179           45          77   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.4.180           12          76   0    0.00    0.02     0.01 -         ml                                           -               ml-node
10.0.4.224           41          76   0    0.00    0.00     0.00 mmr       cluster_manager,master,remote_cluster_client -               manager-node
10.0.4.16            37          78   0    0.00    0.00     0.00 dir       data,ingest,remote_cluster_client            -               data-node
10.0.3.181           13          76   0    0.00    0.00     0.00 mmr       cluster_manager,master,remote_cluster_client -               manager-node
10.0.5.122           17          76   0    0.01    0.01     0.00 mmr       cluster_manager,master,remote_cluster_client *               seed
- Check the response of _cluster/stats:
.
.
.
"nodes" : {
  "count" : {
    "total" : 7,
    "cluster_manager" : 6,
    "coordinating_only" : 0,
    "data" : 3,
    "ingest" : 3,
    "master" : 6,
    "ml" : 1,
    "remote_cluster_client" : 6
  },
  "versions" : [
    "2.3.0"
  ],
.
.
.
Expected behavior
The counts of cluster_manager and master should each be 3 in the case above.
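The expected numbers can be checked by hand from the _cat/nodes output. As a rough illustration, the Python sketch below tallies the node.roles column from the table in this report and compares it with the nodes.count section of the _cluster/stats response. The data is copied from the report; the helper names tally_roles and mismatches are made up for this example and are not part of any OpenSearch client.

```python
from collections import Counter

# node.roles column copied from the _cat/nodes output in this report,
# one entry per node.
NODE_ROLES = [
    "data,ingest,remote_cluster_client",
    "data,ingest,remote_cluster_client",
    "ml",
    "cluster_manager,master,remote_cluster_client",
    "data,ingest,remote_cluster_client",
    "cluster_manager,master,remote_cluster_client",
    "cluster_manager,master,remote_cluster_client",
]

# nodes.count section copied from the _cluster/stats response in this report.
REPORTED = {
    "total": 7,
    "cluster_manager": 6,
    "coordinating_only": 0,
    "data": 3,
    "ingest": 3,
    "master": 6,
    "ml": 1,
    "remote_cluster_client": 6,
}


def tally_roles(role_lists):
    """Count how many nodes carry each role (hypothetical helper)."""
    counts = Counter()
    for roles in role_lists:
        # A node contributes at most once to each role it carries.
        counts.update(set(roles.split(",")))
    return counts


def mismatches(expected, reported):
    """Return {role: (expected, reported)} for roles whose counts differ."""
    return {
        role: (count, reported.get(role))
        for role, count in sorted(expected.items())
        if reported.get(role) != count
    }


if __name__ == "__main__":
    expected = tally_roles(NODE_ROLES)
    for role, (want, got) in mismatches(expected, REPORTED).items():
        print(f"{role}: expected {want}, _cluster/stats reported {got}")
```

With the data above, only cluster_manager and master differ (3 expected, 6 reported); data, ingest, ml and remote_cluster_client agree with the table.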
Plugins None
Screenshots None
Host/Environment (please complete the following information):
- Version: 2.3
Additional context The above response was correct up to OS 1.3.x.
About this issue
- State: closed
- Created a year ago
- Comments: 31 (25 by maintainers)
Thanks @sandeshkr419. Understood your concern. Looks like we first need to get on the same page with @tlfeng about what backward compatibility we are aiming for here. IMHO we shouldn't break any API behavior in 2.x.
@andrross @shwetathareja @tlfeng Gentle reminder to review the PR. Please let me know if any additional steps are required for merging.
Narrowed down the issue.
This issue does not occur when node.roles is used to initialize the node. It occurs when the legacy legacySettings is used to initialize the node, as in how I was creating the cluster using https://github.com/opensearch-project/opensearch-cluster-cdk. This utilizes the legacy 'node.master': true setting: https://github.com/opensearch-project/opensearch-cluster-cdk/blob/main/lib/opensearch-config/node-config.ts#L12 (note: legacy, not deprecated).
I have modified the fix so that I remove the master role when the legacy settings are used. Please note that there is no such setting as 'node.cluster_manager'. The new way to initialize the nodes is by providing node roles, like: https://github.com/opensearch-project/opensearch-cluster-cdk/blob/main/lib/opensearch-config/node-config.ts#L47. This is the reason why the 'master' role is being removed in my changes whenever roles are decided by legacy settings.
I have added test cases for a better understanding of the scenarios, asserting both the node.roles attached to nodes and the _cluster/stats response.
While I'm adding more test cases, I am seeking early comments on the draft code changes. @shwetathareja @andrross @tlfeng I will be improving other test cases as well to assert both things instead of just relying on _cluster/stats within this scope.
Also, in response to @andrross's comments:
Whatever node.roles are specified by the user, whether 'master' or 'cluster_manager', the node obeys that, so I think we can close on this.
Since the changes were in getting roles from legacySettings(), no changes will be required in the _cluster/stats API.
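To make the role resolution described in this comment easier to follow, here is a minimal Python sketch. It is not the OpenSearch implementation: resolve_roles and the remove_master_for_legacy flag are hypothetical names, settings are modelled as a plain dict, and the pre-fix branch (both role names attached when 'node.master': true is set) is an assumption inferred from the _cat/nodes output in the report.

```python
def resolve_roles(settings, remove_master_for_legacy=True):
    """Hypothetical model of how a node's manager roles get decided.

    settings is a plain dict standing in for the node configuration.
    """
    # Explicit node.roles is used as given: whether the user writes
    # 'master' or 'cluster_manager', the node obeys that.
    if "node.roles" in settings:
        return list(settings["node.roles"])

    # Otherwise roles are derived from legacy boolean settings.
    # Only 'node.master' is modelled here; there is no
    # 'node.cluster_manager' setting.
    roles = []
    if settings.get("node.master", False):
        roles.append("cluster_manager")
        if not remove_master_for_legacy:
            # Assumed pre-fix behaviour: the node also carries 'master'.
            roles.append("master")
    return roles


if __name__ == "__main__":
    legacy = {"node.master": True}
    print(resolve_roles(legacy, remove_master_for_legacy=False))
    print(resolve_roles(legacy))
    print(resolve_roles({"node.roles": ["master"]}))
```

Under these assumptions the change is confined to the legacy branch, which matches the point above that the _cluster/stats API itself needs no changes.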