security: [BUG] Errors/Broken operations during rolling upgrade of clusters from 1.3 to 2.0

What is the bug? Errors/Broken search results during rolling upgrade of clusters from 1.3 to 2.0

How can one reproduce the bug?

  1. Create a 1.3 cluster with at least 2 nodes.
  2. Create an index with 2 primary shards so that at least one primary is allocated to each node.
  3. Upgrade one of the nodes to OpenSearch 2.0.
  4. Send a search query to the 1.3 node so that it searches all the shards.
  5. Observe that the search fails on part of the request. Example response:
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 1,
        "index" : "test-index",
        "node" : "O7kxX-lMTAKvXBj91-LQ8Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
          "caused_by" : {
            "type" : "class_not_found_exception",
            "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        ".....
          }
        }
      }
    ]

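Note that the failed shard count is 1, and that the failure is a ClassNotFoundException for com.amazon.opendistroforelasticsearch.security.user.User.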
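When repeating step 5 during an upgrade, it can help to check the failed shard count programmatically instead of reading the response by eye. The following is a minimal sketch, not part of the original report: the class name ShardFailureCheck and the method failedShards are hypothetical, and it uses a regular expression rather than a real JSON parser, so it simply reads the first "failed" field it finds in the body.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenSearch.
public class ShardFailureCheck {
    // Matches the first occurrence of: "failed" : <number>
    private static final Pattern FAILED = Pattern.compile("\"failed\"\\s*:\\s*(\\d+)");

    /** Returns the first "failed" count found in the body, or -1 if none is present. */
    static int failedShards(String responseBody) {
        Matcher m = FAILED.matcher(responseBody);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Shortened version of the "_shards" section from the response above.
        String body = "{ \"_shards\" : { \"total\" : 2, \"successful\" : 1, "
                + "\"skipped\" : 0, \"failed\" : 1 } }";
        int failed = failedShards(body);
        System.out.println(failed > 0 ? "shard failures: " + failed : "no shard failures");
    }
}
```

A proper JSON library would be the safer choice in a real test harness; the regex keeps the sketch free of dependencies.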

What is the expected behavior? Rolling upgrade of clusters should complete without any issues.

What is your host/environment?

  • OS: 1.3 to OS: 2.0 upgrade
  • Plugins: Security

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 28 (19 by maintainers)

Most upvoted comments

@ronniepg Understood and that’s what this PR is targeting: https://github.com/opensearch-project/security/pull/2268

There is logic in 1.3 that keeps backwards compatibility with ODFE: it always rewrites package names from org.opensearch to com.amazon.opendistroforelasticsearch, so that ODFE nodes picking up the messages are able to understand them. That is a problem when going from OS 1 to OS 2, because OS 2 does not understand the com.amazon.opendistroforelasticsearch packages. The PR above aims to apply the serialization logic only if there are ODFE nodes in the cluster. If the cluster has only OS 1 nodes and is being upgraded to OS 2, it should not perform the package rewrite when serializing the transport action.

@peternied I will look into this today and see whether the min node version can be obtained from the ClusterInfoHolder, so that the serialization logic that replaces the package name with the opendistro package name is applied conditionally.

If the min node version in the cluster is OS 1, then there is no need to perform the rewrite logic.

If the min node version in the cluster is ODFE, then apply the rewrite logic.
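The rule proposed in the comments above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the class PackageRewriteSketch, the NodeKind enum, and the method classNameForWire are all hypothetical, and the "min node version" is modelled as the smallest value of an ordered enum rather than a real version taken from ClusterInfoHolder.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not the security plugin's API.
public class PackageRewriteSketch {
    static final String OS_PACKAGE = "org.opensearch";
    static final String ODFE_PACKAGE = "com.amazon.opendistroforelasticsearch";

    /** Stand-in for node versions, declared oldest first so enum ordering gives the minimum. */
    enum NodeKind { ODFE, OS_1, OS_2 }

    /** Rewrite is needed only when the oldest node in the cluster is an ODFE node. */
    static boolean needsOdfeRewrite(List<NodeKind> nodes) {
        return !nodes.isEmpty() && Collections.min(nodes) == NodeKind.ODFE;
    }

    /** Returns the class name to put on the wire for the given cluster composition. */
    static String classNameForWire(String className, List<NodeKind> nodes) {
        if (needsOdfeRewrite(nodes) && className.startsWith(OS_PACKAGE + ".")) {
            // Swap only the package prefix; keep the rest of the name unchanged.
            return ODFE_PACKAGE + className.substring(OS_PACKAGE.length());
        }
        return className;
    }

    public static void main(String[] args) {
        String user = "org.opensearch.security.user.User";
        // OS 1 to OS 2 rolling upgrade: no ODFE nodes, so the name is left as-is.
        System.out.println(classNameForWire(user, List.of(NodeKind.OS_1, NodeKind.OS_2)));
        // ODFE to OS 1 rolling upgrade: ODFE nodes present, so the name is rewritten.
        System.out.println(classNameForWire(user, List.of(NodeKind.ODFE, NodeKind.OS_1)));
    }
}
```

With the unconditional rewrite, the first case would also produce the com.amazon.opendistroforelasticsearch name, which is the class the 2.0 node fails to find in the error above.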