eventing: Lost events due to 502 Bad Gateway error

Describe the bug Continuously feeding events to broker with reasonable rate (10 msec interval) with Knative Service configured (by eventing trigger) to receive them I regularly see about 2% loss of events if knative service is scaled to zero (sometimes it passes).

I was using default eventing broker, which by default uses in memory channel. I see the events are lost in default-broker-filter pod logs (full logs in attachment):

{"level":"warn","ts":1578062528.6414907,"logger":"fallback","caller":"http/transport.go:502","msg":"got an error from receiver fn","error":"error sending cloudevent: 502 Bad Gateway"}
{"level":"warn","ts":1578062528.641511,"logger":"fallback","caller":"http/transport.go:594","msg":"error returned from invokeReceiver","error":"error sending cloudevent: 502 Bad Gateway"}

Also, I see related errors in activator pod (full logs in attachment):

{"level":"info","ts":"2020-01-03T14:42:07.525Z","logger":"activator","caller":"net/throttler.go:244","msg":"Updating Revision Throttler with: clusterIP = \"\", trackers = 0, backends = 0 activator pos 0/1","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:07.525Z","logger":"activator","caller":"net/throttler.go:230","msg":"Set capacity to 0 (backends: 0, index: 0/1)","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:08.031Z","logger":"activator","caller":"net/throttler.go:244","msg":"Updating Revision Throttler with: clusterIP = \"\", trackers = 1, backends = 1 activator pos 0/1","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:08.031Z","logger":"activator","caller":"net/throttler.go:230","msg":"Set capacity to 1000 (backends: 1, index: 0/1)","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2181\nTCP: inuse 1145 orphan 41 tw 494 alloc 3538 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56452->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2178\nTCP: inuse 1145 orphan 41 tw 494 alloc 3535 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56414->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2178\nTCP: inuse 1145 orphan 41 tw 494 alloc 3535 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56428->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}

Expected behavior Events should not be lost, at least until in memory channel do not crash.

To Reproduce

Preparation of Openshift 4.2 (should be similar on other vendors):

  1. Run OCP 4.2 and have oc logged in as admin
  2. Checkout repo: git clone --branch v0.1.0 https://github.com/cardil/eventing-upgrade-scratchpad.git
  3. Run oc apply -f install/eventing-openshift.yaml
  4. Wait until oc get clusterserviceversion -n openshift-operators report succeeded for all operators
  5. Run oc apply -f install/eventing-cr.yaml
  6. Wait until oc get knativeserving/knative-serving -n knative-serving --template='{{range .status.conditions}}{{printf "%s=%s\n" .type .status}}{{end}}' and oc get knativeeventing/knative-eventing -n knative-eventing --template='{{range .status.conditions}}{{printf "%s=%s\n" .type .status}}{{end}}' reports as Ready
  7. Run oc apply -f install/event-example.yaml
  8. Wait until oc get broker default -n event-example reports Ready

Reproduction:

  1. Run for y in tests/config.yaml tests/trigger.yaml tests/receiver.yaml tests/forwarder.yaml; do oc apply -f $y; done
  2. Wait until oc get deployment $(oc get ksvc -n event-example -o jsonpath='{.items[0].status.latestCreatedRevisionName}')-deployment -n event-example report that Knative Service scaled to zero READY: 0/0
  3. Run oc apply -f tests/sender.yaml and wait until forwarder pods scale up READY: 14/14
  4. Run oc delete -f tests/sender.yaml. Receiver pod should receive finished event and report state. You should observe missing events. Sometimes this pass, in that case delete receiver pod (deployment will recreate it) and reperform from step 2 of reproduction.

Knative release version 0.10.x

Additional context Priority: medium.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (21 by maintainers)

Most upvoted comments

the dispatcher is in the fire-and-forget mode (the default mode unless retries are configured)

Was this an intentional default? Seems to me like the default should lean toward safety, with the more dangerous mode being an explicit choice by the user.

/label proposal/0.13

Upgrade Events Loss

The main reason behind this is that the dispatcher can lose events while dispatching if a network error occurs and the dispatcher is in the fire-and-forget mode (the default mode unless retries are configured). During upgrade the filter pod gets a rolling update and while the event gets published in the channel, there can be a time when the broker filter pod is unreachable and the dispatcher loses the event (see above) Following trace shows the above mentioned scenario

Screenshot from 2020-10-20 14-30-53

Kafka Channel Guarantees

Given the current KafkaChannel, with no subscriber based offset, it is impossible to provide at least once delivery guarantees. Retries and dead letter queue are our best options until this issue is addressed knative-sandbox/eventing-kafka#118

Upgrade Test Retries

I added delivery retry to upstream upgrade tests which should make the event loss less likely to happen. Gonna keep an eye on it. #4353