eventing: Lost events due to 502 Bad Gateway error
Describe the bug Continuously feeding events to broker with reasonable rate (10 msec interval) with Knative Service configured (by eventing trigger) to receive them I regularly see about 2% loss of events if knative service is scaled to zero (sometimes it passes).
I was using default eventing broker, which by default uses in memory channel. I see the events are lost in default-broker-filter pod logs (full logs in attachment):
{"level":"warn","ts":1578062528.6414907,"logger":"fallback","caller":"http/transport.go:502","msg":"got an error from receiver fn","error":"error sending cloudevent: 502 Bad Gateway"}
{"level":"warn","ts":1578062528.641511,"logger":"fallback","caller":"http/transport.go:594","msg":"error returned from invokeReceiver","error":"error sending cloudevent: 502 Bad Gateway"}
Also, I see related errors in activator pod (full logs in attachment):
{"level":"info","ts":"2020-01-03T14:42:07.525Z","logger":"activator","caller":"net/throttler.go:244","msg":"Updating Revision Throttler with: clusterIP = \"\", trackers = 0, backends = 0 activator pos 0/1","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:07.525Z","logger":"activator","caller":"net/throttler.go:230","msg":"Set capacity to 0 (backends: 0, index: 0/1)","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:08.031Z","logger":"activator","caller":"net/throttler.go:244","msg":"Updating Revision Throttler with: clusterIP = \"\", trackers = 1, backends = 1 activator pos 0/1","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"info","ts":"2020-01-03T14:42:08.031Z","logger":"activator","caller":"net/throttler.go:230","msg":"Set capacity to 1000 (backends: 1, index: 0/1)","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2181\nTCP: inuse 1145 orphan 41 tw 494 alloc 3538 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56452->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2178\nTCP: inuse 1145 orphan 41 tw 494 alloc 3535 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56414->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}
{"level":"error","ts":"2020-01-03T14:42:08.615Z","logger":"activator","caller":"network/error_handler.go:31","msg":"error reverse proxying request; sockstat: sockets: used 2178\nTCP: inuse 1145 orphan 41 tw 494 alloc 3535 mem 2947\nUDP: inuse 0 mem 35\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","knative.dev/controller":"activator","knative.dev/pod":"activator-947bd7448-9mxxv","knative.dev/key":"event-example/wathola-forwarder-d96jq","error":"read tcp 10.128.4.9:56428->10.131.0.14:8012: read: connection reset by peer","stacktrace":"knative.dev/serving/pkg/network.ErrorHandler.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/error_handler.go:31\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/lib/golang/src/net/http/httputil/reverseproxy.go:251\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:155\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:105\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:177\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:630\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:168\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/net/throttler.go:419\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/handler.go:95\nknative.dev/serving/pkg/activator/handler.(*RequestEventHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/requestevent_handler.go:67\nknative.dev/serving/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nknative.dev/serving/pkg/activator/config.(*storeMiddleware).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/config/store.go:83\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/http/request_log.go:135\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/probe_handler.go:41\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/healthz_handler.go:38\nknative.dev/serving/pkg/network.(*handler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/network/probe_handler.go:40\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/pkg/activator/handler/metric_handler.go:91\nknative.dev/serving/vendor/golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\t/opt/app-root/src/go/src/knative.dev/serving/vendor/golang.org/x/net/http2/h2c/h2c.go:98\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1890"}
Expected behavior Events should not be lost, at least until in memory channel do not crash.
To Reproduce
Preparation of Openshift 4.2 (should be similar on other vendors):
- Run OCP 4.2 and have oc logged in as admin
- Checkout repo:
git clone --branch v0.1.0 https://github.com/cardil/eventing-upgrade-scratchpad.git - Run
oc apply -f install/eventing-openshift.yaml - Wait until
oc get clusterserviceversion -n openshift-operatorsreport succeeded for all operators - Run
oc apply -f install/eventing-cr.yaml - Wait until
oc get knativeserving/knative-serving -n knative-serving --template='{{range .status.conditions}}{{printf "%s=%s\n" .type .status}}{{end}}'andoc get knativeeventing/knative-eventing -n knative-eventing --template='{{range .status.conditions}}{{printf "%s=%s\n" .type .status}}{{end}}'reports as Ready - Run
oc apply -f install/event-example.yaml - Wait until
oc get broker default -n event-examplereports Ready
Reproduction:
- Run
for y in tests/config.yaml tests/trigger.yaml tests/receiver.yaml tests/forwarder.yaml; do oc apply -f $y; done - Wait until
oc get deployment $(oc get ksvc -n event-example -o jsonpath='{.items[0].status.latestCreatedRevisionName}')-deployment -n event-examplereport that Knative Service scaled to zeroREADY: 0/0 - Run
oc apply -f tests/sender.yamland wait until forwarder pods scale upREADY: 14/14 - Run
oc delete -f tests/sender.yaml. Receiver pod should receive finished event and report state. You should observe missing events. Sometimes this pass, in that case delete receiver pod (deployment will recreate it) and reperform from step 2 of reproduction.
Knative release version
0.10.x
Additional context Priority: medium.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (21 by maintainers)
Was this an intentional default? Seems to me like the default should lean toward safety, with the more dangerous mode being an explicit choice by the user.
/label proposal/0.13
Upgrade Events Loss
The main reason behind this is that the dispatcher can lose events while dispatching if a network error occurs and the dispatcher is in the fire-and-forget mode (the default mode unless retries are configured). During upgrade the filter pod gets a rolling update and while the event gets published in the channel, there can be a time when the broker filter pod is unreachable and the dispatcher loses the event (see above) Following trace shows the above mentioned scenario
Kafka Channel Guarantees
Given the current KafkaChannel, with no subscriber based offset, it is impossible to provide at least once delivery guarantees. Retries and dead letter queue are our best options until this issue is addressed knative-sandbox/eventing-kafka#118
Upgrade Test Retries
I added delivery retry to upstream upgrade tests which should make the event loss less likely to happen. Gonna keep an eye on it. #4353