pomerium: controlplane: dropping event due to full channel: session invalidated

What happened?

Some user sessions return error 500 for all routes; others fail only intermittently.

What did you expect to happen?

Policy to proceed where allowed.

How’d it happen?

  1. Updated Pomerium from v0.15 to v0.18.
  2. In post-upgrade testing, the second Pomerium node appeared not to be handling authorization properly, so I stopped all services on the second node.
  3. I am now running on a single node, node 1, and some users get error 500 for all routes.
  4. I get intermittent error 500s, more often with streaming requests. Static sites sometimes work and sometimes fail, but when one fails, a refresh lets it through.

What’s your environment like?

  • pomerium: 0.18.0-1658889797+89a105c8
  • envoy: 1.21.3+4861429dfffb599f28b9399c34ea2a2c268bfb6d10aca0a53bc9b67d847a4595
  • Pomerium console version: 0.18.0-1658971952 + 7b8e18a8 + 2022-07-27T21:30:12-04:00
  • Server Operating System/Architecture/Cloud: 2x CentOS Stream on-premise virtual machines; separate redis (did not see the redis deprecation) and postgres VMs

What’s your config.yaml?

authenticate_service_url: https://example.com

certificates:
  - cert: /path/to/cert.pem
    key: /path/to/key.pem

signing_key: SECRET
metrics_address: localhost:9999

http_redirect_addr: :80

idp_provider: google
idp_client_id: clientid
idp_client_secret: secret
idp_service_account: serviceaccount

idp_refresh_directory_timeout: 10m
idp_refresh_directory_internal: 20m

cookie_secret: cookiesecret

shared_secret: sharedsecret

databroker_storage_type: redis
databroker_storage_connection_string: redis://:@redis:6379

policy:
  - from: https://console.example.com
    to: https://127.0.0.1:8701
    pass_identity_headers: true
    allowed_groups:
      - group@example.com
    allowed_users:
      - example@example.com

What did you see in the logs?

  {
    "level": "error",
    "config_file_source": "/etc/pomerium/config.yaml",
    "bootstrap": true,
    "service": "identity_manager",
    "error": "identity/oidc: user info endpoint: 401 Unauthorized: {\"error\":\"invalid_request\",\"error_description\":\"Invalid Credentials\"}",
    "user_id": "xxx",
    "session_id": "xxx",
    "time": "2022-08-24T12:01:19-04:00",
    "message": "failed to update user info, deleting session"
  }

  {
    "level": "warn",
    "event": {
      "time": {
        "seconds": 1661363152,
        "nanos": 831339500
      },
      "message": "identity/oidc: user info endpoint: 401 Unauthorized: {\"error\":\"invalid_request\",\"error_description\":\"Invalid Credentials\"}",
      "id": "identity_manager_last_user_refresh_errors"
    },
    "time": "2022-08-24T13:45:52-04:00",
    "message": "controlplane: dropping event due to full channel"
  }

  {
    "level": "info",
    "service": "envoy",
    "upstream-cluster": "",
    "method": "POST",
    "authority": "example.com",
    "path": "/path/to/page",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    "referer": "example.com",
    "forwarded-for": "192.168.1.1",
    "request-id": "xxx",
    "duration": 9999.325248,
    "size": 0,
    "response-code": 500,
    "response-code-details": "ext_authz_error",
    "time": "2022-08-24T17:13:34-04:00",
    "message": "http-request"
  }
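
To see how often the ext_authz_error failures occur and on which paths, the access log can be filtered for entries shaped like the one above. This is a minimal sketch, assuming the log is available as one JSON object per line; the function name find_authz_failures and the sample lines are hypothetical, and only the field names are taken from the entry above.

```python
import json


def find_authz_failures(lines):
    """Return envoy access-log entries that ended in a 500 with ext_authz_error."""
    failures = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if (
            entry.get("service") == "envoy"
            and entry.get("response-code") == 500
            and entry.get("response-code-details") == "ext_authz_error"
        ):
            failures.append(entry)
    return failures


# Made-up sample lines that reuse the field names from the log entry above.
sample = [
    '{"level":"info","service":"envoy","path":"/path/to/page",'
    '"duration":9999.325248,"response-code":500,'
    '"response-code-details":"ext_authz_error"}',
    '{"level":"info","service":"envoy","path":"/ok",'
    '"duration":12.5,"response-code":200,'
    '"response-code-details":"via_upstream"}',
    "not json",
]

for entry in find_authz_failures(sample):
    print(entry["path"], entry["duration"])
```

Grouping the matches by path or by time makes it easier to tell whether the failures cluster around streaming requests or around a directory sync.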

Additional context

For my user account the problems are intermittent and I can get past them.

For another user, if I impersonate them with the console, I can replicate the failure 100% of the time.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

If redis couldn’t keep up then it adds up that postgres seems to be acting better.

Also, I noticed in my original config copy

idp_refresh_directory_internal: 20m

It should be idp_refresh_directory_interval (interval, not internal), and I missed that for who knows how long.
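
A misspelled key like this can go unnoticed for a long time, so a quick check of the config's top-level keys against a list of expected names may help. The following is a rough sketch, not a Pomerium tool: KNOWN_KEYS is an assumption built only from the keys in the config above (with the corrected spelling) and is not a complete list of Pomerium options, and the helper names are hypothetical.

```python
import difflib

# Assumption: keys copied from the config in this report, using the corrected
# spelling of idp_refresh_directory_interval. Not an exhaustive option list.
KNOWN_KEYS = [
    "authenticate_service_url",
    "certificates",
    "signing_key",
    "metrics_address",
    "http_redirect_addr",
    "idp_provider",
    "idp_client_id",
    "idp_client_secret",
    "idp_service_account",
    "idp_refresh_directory_timeout",
    "idp_refresh_directory_interval",
    "cookie_secret",
    "shared_secret",
    "databroker_storage_type",
    "databroker_storage_connection_string",
    "policy",
]


def top_level_keys(text):
    """Collect unindented 'key:' names without needing a YAML parser."""
    keys = []
    for line in text.splitlines():
        # Skip blank lines, nested or list lines, and comments.
        if not line or line[0] in " \t#-":
            continue
        if ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def suspicious_keys(text, known=KNOWN_KEYS):
    """Map each unknown key to its closest known key, if one is similar."""
    findings = {}
    for key in top_level_keys(text):
        if key not in known:
            findings[key] = difflib.get_close_matches(key, known, n=1, cutoff=0.8)
    return findings


config_text = """\
idp_refresh_directory_timeout: 10m
idp_refresh_directory_internal: 20m
policy:
  - from: https://console.example.com
"""

for key, guess in suspicious_keys(config_text).items():
    print(key, "-> did you mean", guess)
```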

I performed the redis flushall and am now waiting for a sync. I am getting a lot of 403s and

{
  "level": "warn",
  "error": "record not found",
  "time": "2022-08-24T19:48:59-04:00",
  "message": "clearing session due to missing session or service account"
}

During a large user/group sync, I believe this will appear: "allow-why-false":["groups-unauthorized","non-pomerium-route"]

In the past, the sync from Google has taken a very long time before group policy works.