rancher: [BUG] Active Directory authentication fails completely in Rancher 2.7.5 due to wrong objectGUID escaping
Rancher Server Setup
- Rancher version: 2.7.5
- Installation option (Helm Chart): K8S 1.24.x RKE1
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: 1.24.x
- Cluster Type (Local/Downstream): Local
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- All users are affected!
Describe the bug
After upgrading to Rancher 2.7.5 (from 2.7.4), LDAP-based login using the Active Directory integration fails for all users.
To Reproduce
Upgrade to 2.7.5 and then log in with a fresh browser.
Result
Login failure. In the Rancher log (GUID data removed):
LDAP Result Code 201 Filter Compile Error: ldap: invalid characters for escape in filter: encoding/hex: invalid byte: U+004E N
Expected Result
Login success
Additional context
The following change introduced the problem: https://github.com/rancher/rancher/commit/cdd90ea6c687d8c83cfd0289e51be32da5c21d14
I think the source of the problem is that guidString := html.EscapeString(fmt.Sprintf("%x", entry.GetRawAttributeValue(common.AttributeObjectGUID))) is used to build a URL from the objectGUID value. This might result in a string that is not fully hex-encoded. All our objectGUIDs contain the “N” byte with hex value ‘4E’. The HTML escaping might not escape this byte, and when the value is later used in LDAP, the filter string will contain “…\N…”, which is not a valid escape sequence and leads to the error message above. This is still an assumption, because I have not fully understood the relevant code yet, but I am pretty sure that it points at least in the right direction.
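To illustrate the hypothesis above, here is a minimal Go sketch using only the standard library. It is not Rancher's code and does not use the go-ldap API; the helper names escapeGUIDForFilter and hasInvalidEscape and the sample bytes are made up for illustration. It assumes that an LDAP search filter expects each binary byte to be written as a backslash followed by two hex digits, and shows that a backslash followed by "N" would not be a valid escape.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeGUIDForFilter (hypothetical helper) writes every raw byte as a
// backslash followed by two hex digits, e.g. 0x4e becomes `\4e`.
func escapeGUIDForFilter(raw []byte) string {
	var b strings.Builder
	for _, c := range raw {
		fmt.Fprintf(&b, "\\%02x", c)
	}
	return b.String()
}

func isHexDigit(c byte) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
}

// hasInvalidEscape (hypothetical helper) reports whether any backslash in
// the filter is not followed by exactly two hex digits.
func hasInvalidEscape(filter string) bool {
	for i := 0; i < len(filter); i++ {
		if filter[i] != '\\' {
			continue
		}
		if i+2 >= len(filter) || !isHexDigit(filter[i+1]) || !isHexDigit(filter[i+2]) {
			return true
		}
		i += 2 // skip the two hex digits just checked
	}
	return false
}

func main() {
	// Shortened sample value; a real objectGUID has 16 bytes.
	raw := []byte{0x4e, 0x00, 0xab, 0x10}

	// Plain hex rendering, as fmt.Sprintf("%x", ...) produces it: no backslashes.
	fmt.Printf("%x\n", raw)

	// Per-byte escaped form placed in a filter.
	good := "(objectGUID=" + escapeGUIDForFilter(raw) + ")"
	fmt.Println(good, hasInvalidEscape(good))

	// A backslash followed by a non-hex character such as 'N' is rejected.
	bad := `(objectGUID=\N1)`
	fmt.Println(bad, hasInvalidEscape(bad))
}
```

If the assumption holds, a fix would need to make sure the value that reaches the LDAP filter is in the per-byte escaped form rather than a mix of raw and escaped characters.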
About this issue
- State: closed
- Created a year ago
- Reactions: 7
- Comments: 51 (7 by maintainers)
This bug is more fatal than I initially thought, because most users cannot log in even after rolling back to Rancher 2.7.4. It seems that a migration of the user identifier from DN to objectGUID had already been running in the background, and now none of the migrated users can log in anymore. We urgently need a fix for this.
After upgrading from 2.7.1 to 2.7.5, logging in with an existing account creates duplicate users because the identifier in activedirectory_user:// changed from DN to GUID. This is not what I expected from just a patch…
How can it be that this issue has existed for 2 weeks and no hotfix has been released? My client is also affected by this.
Very frustrating to have hit this bug after being forced to upgrade to 2.7.5 from 2.7.3 due to the OOM bug, in addition to having to hold off upgrading to K8S 1.24 due to the high CPU bug! Seems to be a very buggy trend here!
I agree, just reverting these settings won’t help any customer that has already upgraded Rancher to 2.7.5. The fix must contain logic that reverts the already migrated user accounts to their original state. Or at least there should be a release without this change, so that users can benefit from the fixes for the cluster agent and cilium bugs that were included in 2.7.5.
Is this being actively investigated? We were eagerly awaiting this release to fix the high memory consumption bug on the Rancher pod and cattle-cluster-agents, as well as to introduce k8s 1.25 support to our teams. We won’t deploy 2.7.5 on our production instance because of this issue. We are still running 2.6.6, as it is the most stable release we have experienced so far.
We are also affected. Fortunately, we upgraded our test environment first. We postponed our production upgrade due to the Kubernetes 1.24 high CPU bug with RKE and then the cattle cluster agent memory leak in 2.7.2+. We were hoping for 2.7.5, but this is a real show stopper.
We are also suffering from this issue and can see that a new account is created for each user who logs in.
We also encountered this bug after upgrading from 2.7.4 to 2.7.5. We have a lot of users who have broken permissions or can’t log into Rancher at all.
I agree that this is a major bug! We also opened a critical issue in the SUSE customer center.
My comment how we solved this for us.
@crobby could you please make a statement
I think that, at least for all users who have already tried to install 2.7.5 somewhere with several downstream clusters active, reverting to 2.7.4 will not solve the issue created by the upgrade. I already did that using a full HA Rancher 2.7.4 pre-upgrade backup, but the issue is now in the API key/token handling of the downstream cluster users (scoped keys no longer work for users affected by the failed GUID migration - even after deleting and recreating those users!).
So far, we have not found and are not aware of any workaround. So we need a forward fix that restores full AD functionality with 2.7.6+.
@crobby please revert the change that caused this issue and release a fix asap