dapr: Seeing 500 errors from Dapr when restarting pods running Go servers

Problem Statement

When restarting Kubernetes pods that run a Go HTTP server (the base net/http package) and are called through Dapr service invocation (e.g. http://localhost:3500/v1.0/invoke/{app}/method/{foo}), there is a window of time during the restart in which Dapr returns an HTTP 500. This happens even when there are 2 replicas in the deployment. With a different platform, e.g. Node.js or Python Flask, no such issue is seen: no 500s are received at any point.

Notes.

  • Adding readiness and liveness probes does not solve the issue.
  • Neither does creating a custom HTTP server object, e.g. srv := &http.Server{}, with different timeouts.
  • I noticed Node.js sets the Connection: keep-alive header by default; adding this to the Go server made no difference.

At this point I don’t know if this is a bug with Dapr, something specific to the Go HTTP implementation, or some other issue, potentially TCP timeouts, keep-alives, etc.

I’ve created a repo with some simple steps to reproduce this: https://github.com/benc-uk/dapr-go-error

The minimal Go HTTP server I’ve used for testing is:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "HELLO there, from %s!", r.URL.Path[1:])
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 21 (19 by maintainers)

Most upvoted comments

I will say there is a noticeable delay in the direct test against the Go service at the same point where the 500s would be seen on the Dapr call.

It looks as if curl waits longer for the response than Dapr would.

This leads me to think it’s something related to timeouts or keep-alives, probably at the TCP rather than the HTTP level.

@benc-uk hmm this is odd and definitely warrants a deeper investigation. How confident are you that this is only an issue when invoked via Dapr? Might be worth dumping the requests received in the node app from dapr and curl and seeing if there are any obvious distinctions.

It’s expected that if the HTTP server in the app container is not ready yet when the pod comes up, the caller will get a 500.

This depends on the user app (in this case and your example, Go or Node), so the Node HTTP server probably comes up faster, but during restarts errors are expected until the HTTP server in the user app is running.