kserve: aggressive probe error: dial tcp 127.0.0.1:8080: connect: connection refused

/kind bug

What steps did you take and what happened:

I am seeing the following error in the logs of the PyTorch GPU example model server in KFServing v0.4.

aggressive probe error: dial tcp 127.0.0.1:8080: connect: connection refused
aggressive probe error: dial tcp 127.0.0.1:8080: connect: connection refused
…

{"level":"error","ts":"2020-10-21T08:15:01.782Z","logger":"queueproxy","caller":"network/error_handler.go:33","msg":"error reverse proxying request; sockstat: sockets: used 16\nTCP: inuse 5 orphan 1 tw 12 alloc 51 mem 5\nUDP: inuse 0 mem 2\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"3372d58","knative.dev/key":{"knative.dev/key":"default/pytorch-cifar10-gpu-predictor-default-2zprm"},"knative.dev/pod":"pytorch-cifar10-gpu-predictor-default-2zprm-deployment-858m8cs5","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20200812224206-44c860147a87/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:259\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:205\nmain.proxyHandler.func1\n\tknative.dev/serving/cmd/queue/main.go:149\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2041\nknative.dev/serving/pkg/queue.ForwardedShimHandler.func1\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2041\nknative.dev/serving/pkg/http/handler.(*timeToFirstByteTimeoutHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/http/handler/timeout.go:86"}

What did you expect to happen: I suspect that requests are not actually being served by multiple GPU nodes; inference takes almost the same time with 1 replica as with 3 replicas.

Anything else you would like to add:

Environment:

  • Istio Version: 1.7
  • Knative Version: v0.17.1
  • KFServing Version: 0.4
  • Kubeflow version: Using only KFServing
  • Kfdef:
  • Minikube version:
  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.401", GitCommit:"eb94c181eea5290e9da1238db02cfef263542f5f", GitTreeState:"clean", BuildDate:"2020-09-09T00:57:35Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (3 by maintainers)

Most upvoted comments

Issue-Label Bot is automatically applying the labels:

Label Probability
area/inference 0.99


Any updates? I'm experiencing the same issue with KFServing 0.5.

Can you help elaborate on the issue? The following error is quite normal while the queue-proxy is still waiting for the model server to come up: aggressive probe error: dial tcp 127.0.0.1:8080: connect: connection refused
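
For context, here is a minimal Python sketch of what such an aggressive TCP probe does (purely illustrative; the real queue-proxy probe is implemented in Go inside Knative Serving): it retries a dial to 127.0.0.1:8080 until the model server is listening, and every failed attempt produces one of the "connection refused" lines above.

import socket
import time

def wait_for_model_server(host="127.0.0.1", port=8080, interval=0.1):
    # Illustrative only: retry a TCP dial until the model server is listening.
    # Each failed attempt corresponds to one
    # "aggressive probe error: dial tcp 127.0.0.1:8080: connect: connection refused" line.
    while True:
        try:
            with socket.create_connection((host, port), timeout=1):
                print("model server is up; pod can be marked ready")
                return
        except OSError as err:
            print(f"aggressive probe error: dial tcp {host}:{port}: {err}")
            time.sleep(interval)

if __name__ == "__main__":
    wait_for_model_server()

In other words, a burst of these lines during container startup is expected; they only indicate a problem if they never stop, i.e. the model server never binds to the port the probe targets.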

I had a similar issue when deploying a Knative service. I have an HTTP service in a pod along with a queue-proxy. The problem for me was that my HTTP service did not listen on localhost by default, which meant the queue-proxy could not connect to it. Once I forced my service to listen on localhost, e.g. localhost:8080, the error disappeared.
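
To make the previous comment concrete, here is a minimal sketch of a hypothetical user container (not the commenter's actual service): containers in a pod share a network namespace, and the queue-proxy dials the user container at 127.0.0.1:8080, so the server must bind to 127.0.0.1 or to all interfaces (0.0.0.0); binding only to the pod's external interface keeps producing "connection refused" on the loopback probe.

from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    # Hypothetical minimal handler, standing in for the real model server.
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    # Bind to 0.0.0.0 (or 127.0.0.1) so the queue-proxy sidecar can reach the
    # server over loopback; binding only to the pod IP would break the probe.
    HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()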

As noted above, this error logging comes from Knative probe failures while waiting for the model server to come up and start serving requests.

The other related error reported in the original issue is due to a request timeout; I'd recommend tuning the timeout setting if needed.

{"level":"error","ts":"2020-10-21T08:15:01.782Z","logger":"queueproxy","caller":"network/error_handler.go:33","msg":"error reverse proxying request; sockstat: sockets: used 16\nTCP: inuse 5 orphan 1 tw 12 alloc 51 mem 5\nUDP: inuse 0 mem 2\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"3372d58","knative.dev/key":{"knative.dev/key":"default/pytorch-cifar10-gpu-predictor-default-2zprm"},"knative.dev/pod":"pytorch-cifar10-gpu-predictor-default-2zprm-deployment-858m8cs5","error":"context canceled