security: [BUG] [CI] Investigate Flaky test failure for windows CI tasks

What is the bug?

Windows CI has been flaky in recent runs, specifically for the citest and dlicRestApiTest tasks.

See these runs for example:

- https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/1 <-- 3 tasks failed
- https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/2 <-- 2 tasks failed (1 prev failed task passed)
- https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/3 <-- 1 task failed
- https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/5 <-- all tasks passed

Failure Details
(citest, windows-latest, 11) (citest, windows-latest, 17)
1st run
- org.opensearch.security.SecurityAdminTests.testSecurityAdmin
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesDisabled
- org.opensearch.security.multitenancy.test.MultitenancyTests.testTenantParametersSubstitution
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.httpclient.HttpClientTest.testPlainConnection
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- com.amazon.dlic.auth.http.saml.HTTPSamlAuthenticatorTest.initialConnectionFailureTest
- com.amazon.dlic.auth.ldap2.LdapBackendIntegTest2.testIntegLdapAuthenticationSSL
- org.opensearch.security.SecurityAdminTests.testIsLegacySecurityIndexOnV7Index
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesDisabled
- org.opensearch.security.multitenancy.test.TenancyPrivateTenantEnabledTests.testPrivateTenantDisabled_Update_EndToEnd
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testSensitiveMethodRedaction
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testScroll
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testScroll
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
2nd run
- org.opensearch.security.SecurityAdminTests.testSecurityAdminRegularUpdate
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.TenancyMultitenancyEnabledTests.testMultitenancyDisabled_endToEndTest
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- com.amazon.dlic.auth.ldap2.LdapBackendIntegTest2.testAttributesWithImpersonation
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testWriteHistory
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.multitenancy.test.MultitenancyTests.testTenantParametersSubstitution
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMt
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMtMulti
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
3rd run
- org.opensearch.security.SecurityAdminTests.testSecurityAdmin
- org.opensearch.security.SecurityAdminTests.testSecurityAdminInvalidCert
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiNewUser
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testUpdate
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMt
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.TenancyMultitenancyEnabledTests.testMultitenancyDisabled_endToEndTest
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiNewUser
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testUpdate
Test Methods with failures across all mentioned runs
- org.opensearch.security.SecurityAdminTests.testSecurityAdmin
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesDisabled
- org.opensearch.security.multitenancy.test.MultitenancyTests.testTenantParametersSubstitution
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.httpclient.HttpClientTest.testPlainConnection
- com.amazon.dlic.auth.http.saml.HTTPSamlAuthenticatorTest.initialConnectionFailureTest
- com.amazon.dlic.auth.ldap2.LdapBackendIntegTest2.testIntegLdapAuthenticationSSL
- org.opensearch.security.SecurityAdminTests.testIsLegacySecurityIndexOnV7Index
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.multitenancy.test.TenancyPrivateTenantEnabledTests.testPrivateTenantDisabled_Update_EndToEnd
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testSensitiveMethodRedaction
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testScroll
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMt
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMtMulti
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testWriteHistory
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiNewUser
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testUpdate
- org.opensearch.security.SecurityAdminTests.testSecurityAdminInvalidCert
- org.opensearch.security.multitenancy.test.TenancyMultitenancyEnabledTests.testMultitenancyDisabled_endToEndTest

What is the expected behavior?

No flakiness.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (18 by maintainers)

Most upvoted comments

We run 3 nodes in a single JVM, with 3 security plugins, and only one of them wins. @DarshitChanpura I think at this moment this is a 100% test-related problem that should not leak into production.

I think so too, I was not able to replicate the issue on the RC.

That being said, I think the code path here in the SecurityInterceptor, along with the corresponding code in the receiver, is dead code. In reality, the user is not serialized on local node requests because those requests take the bypass route described here.

I think I nailed it folks:

    public static DiscoveryNode getLocalNode() {
        return localNode;
    }

We run 3 nodes in a single JVM, with 3 security plugins, but localNode is static, so only one of them wins. @DarshitChanpura I think at this moment this is a 100% test-related problem that should not leak into production.
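Why a static accessor produces this behavior can be shown with a minimal sketch. The class names below (StaticLocalNodeDemo, StaticPlugin, InstancePlugin) and the use of plain strings for nodes are assumptions made for illustration. They are not taken from the security plugin, and this is not a proposed fix.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticLocalNodeDemo {

    /** Hypothetical plugin that keeps its local node in a static field. */
    static class StaticPlugin {
        // One field for the whole JVM, shared by every plugin instance.
        private static String localNode;

        StaticPlugin(String nodeName) {
            localNode = nodeName; // each node that starts overwrites the previous value
        }

        static String getLocalNode() {
            return localNode;
        }
    }

    /** Hypothetical plugin that keeps its local node per instance. */
    static class InstancePlugin {
        private final String localNode;

        InstancePlugin(String nodeName) {
            this.localNode = nodeName;
        }

        String getLocalNode() {
            return localNode;
        }
    }

    /** Starts three "nodes" in one JVM and records what each one reports as its local node. */
    static List<String> startStaticCluster() {
        List<String> seen = new ArrayList<>();
        List<String> names = List.of("node1", "node2", "node3");
        for (String name : names) {
            new StaticPlugin(name);
        }
        for (int i = 0; i < names.size(); i++) {
            seen.add(StaticPlugin.getLocalNode());
        }
        return seen;
    }

    static List<String> startInstanceCluster() {
        List<InstancePlugin> plugins = new ArrayList<>();
        for (String name : List.of("node1", "node2", "node3")) {
            plugins.add(new InstancePlugin(name));
        }
        List<String> seen = new ArrayList<>();
        for (InstancePlugin plugin : plugins) {
            seen.add(plugin.getLocalNode());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("static:   " + startStaticCluster());   // [node3, node3, node3]
        System.out.println("instance: " + startInstanceCluster()); // [node1, node2, node3]
    }
}
```

In the static variant, the last node to start wins and every node reports it as the local node. That only happens when several nodes share one JVM, as they do in the test cluster.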

In doing more local testing, I think I see an identity crisis happening where core thinks the local node is node1 and the security plugin thinks it's node2. I'm looking into how this can happen now. Both the security plugin and core should be aware of the node they are running on without any conflicts.

I added logging statements in TransportService.getConnection and in SecurityInterceptor.sendRequestDecorate, which is called directly after getConnection on remote node requests. The logs showed a discrepancy between what core thought the localNode was and what the security plugin thought the localNode was.
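The comparison behind that logging can be sketched roughly as follows. This is a standalone illustration only: the class LocalNodeCheck and its method are hypothetical, nodes are represented as plain strings, and the actual logging statements added to TransportService and SecurityInterceptor are not shown in this thread.

```java
import java.util.Objects;
import java.util.Optional;

public class LocalNodeCheck {

    /**
     * Compares the two views of the local node and returns a log message
     * when they disagree, or an empty Optional when they match.
     */
    static Optional<String> describeMismatch(String coreLocalNode, String pluginLocalNode) {
        if (Objects.equals(coreLocalNode, pluginLocalNode)) {
            return Optional.empty();
        }
        return Optional.of(
            "localNode mismatch: core=" + coreLocalNode + ", security plugin=" + pluginLocalNode
        );
    }

    public static void main(String[] args) {
        // Values mirror the discrepancy described in the comment: core sees node1, the plugin sees node2.
        describeMismatch("node1", "node2").ifPresent(System.out::println);
        // No output when both sides agree.
        describeMismatch("node1", "node1").ifPresent(System.out::println);
    }
}
```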

@DarshitChanpura it looks like a few fixes for Windows support were put into main that may not have been backported to 2.x.

Particularly this one: https://github.com/opensearch-project/security/pull/2180

There is a list of issues that @peternied added to this PR description: https://github.com/opensearch-project/security/pull/2291 - it may be best to backport the ones that can be backported.