security: [BUG] Errors/Broken operations during rolling upgrade of clusters from 1.3 to 2.0

What is the bug? Errors/Broken search results during rolling upgrade of clusters from 1.3 to 2.0

How can one reproduce the bug?

Create a 1.3 cluster with at least 2 nodes.
Create an index with 2 primaries so that at least one primary is allocated per node.
Upgrade one of the nodes to OpenSearch 2.0.
From the 1.3 node, run a search query that hits all the shards.
See that the search fails on one of the shards for this request:

"took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 1,
        "index" : "test-index",
        "node" : "O7kxX-lMTAKvXBj91-LQ8Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
          "caused_by" : {
            "type" : "class_not_found_exception",
            "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        ".....
          }
        }
      }
    ]

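Notice that the failed shard count is 1 and that the failure reason is a ClassNotFoundException for com.amazon.opendistroforelasticsearch.security.user.User.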

What is the expected behavior? Rolling upgrade of clusters should complete without any issues.

What is your host/environment?

OS: 1.3 to OS: 2.0 upgrade
Plugins: Security

About this issue

State: closed
Created 2 years ago
Comments: 28 (19 by maintainers)

Most upvoted comments

@ronniepg Understood and that’s what this PR is targeting: https://github.com/opensearch-project/security/pull/2268

There is logic in 1.3 that keeps backwards compatibility with ODFE: it always rewrites package names from org.opensearch to com.amazon.opendistroforelasticsearch during serialization, so that ODFE nodes picking up a message are able to understand it. That's a problem when you are going from OS 1 to OS 2, because OS 2 does not understand the com.amazon.opendistroforelasticsearch packages. The PR above aims to apply the serialization logic conditionally, only if there are ODFE nodes in the cluster. If you have only OS 1 nodes and are going to OS 2, it should not perform the package rewrite on serialization for the transport action.

cwperks on Nov 18, 2022
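
To illustrate the idea described in the comment above, here is a minimal sketch of a conditional package rewrite. It is not the code from the linked PR: the class name PackageRewriteSketch, the method classNameForWire, and the boolean flag clusterHasOdfeNodes are all hypothetical names chosen for this example. Only the two package prefixes come from the discussion.

```java
// Hypothetical sketch; names are illustrative and not the security plugin's API.
final class PackageRewriteSketch {
    // Package prefixes mentioned in the issue discussion.
    static final String OS_PACKAGE = "org.opensearch";
    static final String ODFE_PACKAGE = "com.amazon.opendistroforelasticsearch";

    /**
     * Returns the class name to put on the wire.
     * The rewrite to the ODFE package happens only when the caller reports
     * that ODFE nodes are present (assumption: the caller knows this).
     */
    static String classNameForWire(String className, boolean clusterHasOdfeNodes) {
        if (clusterHasOdfeNodes && className.startsWith(OS_PACKAGE + ".")) {
            // Swap the prefix and keep the rest of the class name unchanged.
            return ODFE_PACKAGE + className.substring(OS_PACKAGE.length());
        }
        // No ODFE nodes (for example an OS 1 to OS 2 upgrade): leave the name as is.
        return className;
    }

    public static void main(String[] args) {
        String user = "org.opensearch.security.user.User";
        System.out.println(classNameForWire(user, true));
        System.out.println(classNameForWire(user, false));
    }
}
```

With the flag set to false, the class name stays in the org.opensearch package, which is what an OS 2 node would need in order to avoid the ClassNotFoundException shown in the report.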

@peternied I will look into this today and see if there’s a possibility of getting the min node version from the ClusterInfoHolder to conditionally apply the serialization logic that replaces the package name with opendistro package name.

If the min node in the cluster is OS 1, then there is no need to perform the rewrite logic.

If the min node in the cluster is ODFE, then apply the rewrite logic.

cwperks on Nov 17, 2022
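
The min-node rule from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the NodeKind enum and the shouldRewritePackages method are hypothetical, and the sketch models node versions as a simple ordered enum instead of calling ClusterInfoHolder or any real OpenSearch version API.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; it does not use ClusterInfoHolder or real OpenSearch classes.
final class MinNodeSketch {
    // Declared oldest to newest, so natural enum ordering matches version ordering.
    enum NodeKind { ODFE, OPENSEARCH_1, OPENSEARCH_2 }

    /**
     * Applies the rule from the discussion: rewrite package names only when
     * the oldest ("min") node in the cluster is an ODFE node.
     */
    static boolean shouldRewritePackages(List<NodeKind> nodesInCluster) {
        if (nodesInCluster.isEmpty()) {
            throw new IllegalArgumentException("cluster must contain at least one node");
        }
        // Enums compare by declaration order, so min() yields the oldest node kind.
        NodeKind minNode = Collections.min(nodesInCluster);
        return minNode == NodeKind.ODFE;
    }

    public static void main(String[] args) {
        System.out.println(shouldRewritePackages(List.of(NodeKind.OPENSEARCH_1, NodeKind.OPENSEARCH_2)));
        System.out.println(shouldRewritePackages(List.of(NodeKind.ODFE, NodeKind.OPENSEARCH_1)));
    }
}
```

A real implementation would have to obtain the minimum node version from cluster state; how that value is exposed is left open here, as in the comment.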