istio: Problem with JWT token authentication in the pilot & envoy

Describe the bug

When you use an end-user authentication Policy to secure a service, Pilot refreshes its JWKS cache once an hour. When the IDP rotates its keys and you fetch a token signed with a new key, Envoy rejects it because the JWKS it holds does not yet contain that key. The request results in a 401, with a KID_ALG_UNMATCH error in the Envoy logs.

Looking at the implementation of the filter (https://github.com/istio/proxy/blob/master/src/envoy/http/jwt_auth/jwt.cc#L430), the cache appears to be Envoy’s only ‘source of truth’ for JWT authentication, and Pilot is responsible for pushing new JWKS configuration to the Envoys. In the worst case there is therefore a gap of up to one hour before an Envoy receives the new JWKS, and it rejects new tokens for that whole period. Shouldn’t Envoy fetch the new JWKS itself when it detects a key missing from the cache, regardless of Pilot’s refresh job? Restarting the pod that runs Envoy helps; after that, new tokens are authenticated properly.

Below, I attach debug logs from an Envoy injected into an application, and from Pilot as well. They show that Pilot refreshes the keys every hour.

Expected behavior

If a request is made to the service with a token signed by a new key, Envoy is able to fetch the new JWKS configuration in order to authenticate the token.

Steps to reproduce the bug

  1. Install istio
  2. Install dex and set the key rotation period to, for example, 10 minutes (to speed up reproduction of the problem)
  3. Secure sample service with EndUser Policy pointing to dex
  4. Fetch a token after key rotation
  5. Make a request with a new token to the service

In my case I use Dex as the IDP. To rule out a problem in Dex itself, I have verified that the new keys are available at its public keys endpoint.
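One way to confirm this kind of mismatch is to compare the kid in the token header with the kids the IDP publishes. The following is a minimal Python sketch, assuming a compact JWS token and a JWKS document that has already been downloaded as JSON. The helper names and the sample kid values are hypothetical, and nothing here verifies a signature.

```python
import base64
import json


def jwt_header(token: str) -> dict:
    """Decode the (unverified) JOSE header of a compact JWT."""
    header_b64 = token.split(".")[0]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def kid_in_jwks(token: str, jwks: dict) -> bool:
    """Return True if the token's kid appears among the JWKS keys."""
    kid = jwt_header(token).get("kid")
    if kid is None:
        return False
    return any(key.get("kid") == kid for key in jwks.get("keys", []))


def _b64url(obj: dict) -> str:
    """Encode a dict as unpadded base64url JSON (only used for sample data)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical sample data: a token signed with "new-key", a stale JWKS that
# only knows "old-key", and a fresh JWKS published after rotation.
SAMPLE_TOKEN = ".".join(
    [_b64url({"alg": "RS256", "kid": "new-key"}), _b64url({"sub": "demo"}), "sig"]
)
STALE_JWKS = {"keys": [{"kid": "old-key", "kty": "RSA"}]}
FRESH_JWKS = {"keys": STALE_JWKS["keys"] + [{"kid": "new-key", "kty": "RSA"}]}

if __name__ == "__main__":
    print("kid:", jwt_header(SAMPLE_TOKEN).get("kid"))
    print("in stale JWKS:", kid_in_jwks(SAMPLE_TOKEN, STALE_JWKS))
    print("in fresh JWKS:", kid_in_jwks(SAMPLE_TOKEN, FRESH_JWKS))
```

If the kid is present in the JWKS served by the IDP but the proxy still rejects the token, the stale copy is on the verifier side rather than in the IDP.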

Version

Istio Version: 1.0.5

Kubectl Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-28T15:18:13Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-04-10T12:46:31Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"linux/amd64"}

Installation

Via Helm chart with security enabled

Environment

Minikube & cluster env like GKE

Cluster state

Attached: envoy.log, pilot.log

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Looking forward to this. Right now I’m still handling the verification in my service itself; this is really an essential addition to end-user authentication and authorization in Istio.

@piotrmsc @yangminzhu We have the same requirement: when a kid is missing from the JWKS cache, we would like the cache to be refreshed, since a key rotation may have taken place.

Has this change already been implemented in Istio? Also, is the JWKS cache TTL configurable in Istio?

We are hitting this issue when trying to integrate with https://www.gluu.org/. Apparently they have no grace period for key rotation in their implementation. Once they generate a new key, it is their expectation that they can immediately use it.

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2019-10-22. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

We will address this issue in the near future. We’re avoiding adding more flags to Istio control plane components and will probably introduce a new way to handle this.

I don’t think it should go stale… the issue still exists, and the mentioned workaround PR (https://github.com/istio/istio/pull/11424) is closed with a “will be done in the future” message.
@quanjielin @diemtvu it looks like you’re at kubecon Barcelona now, maybe we could discuss this? (I’m also here)

It will help, but unfortunately it will not solve the issue. Even if you set the refresh time to 1 minute, that increases the load on Pilot, and tokens will still be rejected for that period. I understand the problem with giving Envoy access to the internet; however, Pilot needs this access as well, so if the cloud policy blocks such calls it will not be possible to use AuthN policy at all, because even Pilot will not be able to make the well-known request. Still, I understand that it’s better to allow a single component access to the external world rather than N components…

But what if the model changed? Pilot would still be responsible for providing keys to the Envoys; however, Envoy, as the verifier, could ask Pilot for new config whenever it detects a key missing from its cache, independently of the periodic refresh. This way the system would keep running all the time, able to authenticate new ID tokens regardless of cache refreshing.

Or another option: what if letting Envoy fetch the keys itself (the old behavior) were made configurable? By default the current solution, with Envoy relying on Pilot’s cache refresh, stays on, but it can be extended with the old behavior through a config change.

When do you plan to release this PR with the configurable refresh time? 😃

Hey @liminw thx for the reply 😃

I have a question regarding what you mentioned. According to https://openid.net/specs/openid-connect-core-1_0.html#RotateSigKeys:

“The verifier knows to go back to the jwks_uri location to re-retrieve the keys when it sees an unfamiliar kid value.”

In this case the verifier is Envoy, not Pilot; however, Envoy’s config is based on what Pilot provides. So on an unmatched kid Envoy should re-retrieve the keys, which is currently missing.

The draft does not say that the issuer must keep keys for as long as an ID token is valid; it says the issuer should keep old/expired keys for a reasonable time. In my opinion the JWKS handling is broken, as Envoy relies only on Pilot here. Envoy should be able to re-retrieve the JWKS by itself. What are your thoughts? 😃
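To make the behavior described in that quote concrete, here is a minimal Python sketch of a verifier-side key cache that goes back to the key source when it sees an unfamiliar kid. It is not how the Istio proxy filter or Pilot is implemented. The class name, the `fetch_jwks` callable, and the `min_refetch_interval` cooldown are all assumptions made for illustration; the cooldown is there so that tokens with bogus kids cannot trigger a fetch on every request.

```python
import time
from typing import Callable, Dict, Optional


class RefreshingJwksCache:
    """Hypothetical key cache that refetches the JWKS on an unknown kid."""

    def __init__(
        self,
        fetch_jwks: Callable[[], dict],
        min_refetch_interval: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_jwks            # returns a JWKS dict: {"keys": [...]}
        self._min_interval = min_refetch_interval
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._last_fetch: Optional[float] = None

    def _refresh(self) -> None:
        jwks = self._fetch()
        # Index the keys by kid; keys without a kid are ignored in this sketch.
        self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
        self._last_fetch = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        """Return the key for kid, refetching once if it is unfamiliar."""
        if kid in self._keys:
            return self._keys[kid]
        now = self._clock()
        # Unfamiliar kid: go back to the source, unless we fetched very recently.
        if self._last_fetch is None or now - self._last_fetch >= self._min_interval:
            self._refresh()
        return self._keys.get(kid)
```

In a real deployment `fetch_jwks` could call the IDP’s jwks_uri directly or ask a central component for fresh config; the sketch deliberately leaves that choice open.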

To rotate keys, you need to strictly follow the lifecycle of a key. In the following steps, from step 2 to step 3 is the duration you can use a key to sign tokens.

  1. Add a new key (k1) to the key server. At this time, you cannot use k1 to sign any token yet. Wait for the duration of cache refresh, and k1 is fetched by Pilot.
  2. Start using k1 to sign tokens.
  3. Stop using k1 to sign tokens. At this time, you have started to retire k1.
  4. Wait until all tokens signed by k1 have expired (longest token expiration time). After that, you can delete k1 from the key server.
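The four steps above can be turned into a small timeline calculation. This is a sketch under stated assumptions: the function and field names are hypothetical, `cache_refresh` stands for the verifier-side refresh period (one hour in the report above), and any extra delay for pushing keys from Pilot to the Envoys is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RotationPlan:
    publish_at: datetime    # step 1: key added to the key server
    sign_from: datetime     # step 2: earliest safe time to start signing
    sign_until: datetime    # step 3: stop signing with this key
    delete_after: datetime  # step 4: earliest safe time to delete the key


def plan_rotation(
    publish_at: datetime,
    cache_refresh: timedelta,
    signing_period: timedelta,
    max_token_lifetime: timedelta,
) -> RotationPlan:
    """Compute the key lifecycle described in steps 1 to 4."""
    # Wait one full cache refresh so the new key has been fetched.
    sign_from = publish_at + cache_refresh
    sign_until = sign_from + signing_period
    # The last token signed at sign_until can live for max_token_lifetime.
    delete_after = sign_until + max_token_lifetime
    return RotationPlan(publish_at, sign_from, sign_until, delete_after)


if __name__ == "__main__":
    plan = plan_rotation(
        publish_at=datetime(2019, 1, 1, 0, 0),
        cache_refresh=timedelta(hours=1),
        signing_period=timedelta(hours=6),
        max_token_lifetime=timedelta(hours=24),
    )
    print(plan)
```

With these example values the key is published at 00:00, used for signing from 01:00 to 07:00, and can be deleted after 07:00 the next day.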