agones: Sidecar occasionally fails to start up

What happened: Out of 10 gameservers in a new fleet, 3 never left the ‘Scheduled’ state. Further investigation revealed that the game server pod started successfully, but the agones sidecar logged a fatal error.

Truncated Logs:

{"message":"Starting SDKServer grpc service…","severity":"info"}
{"message":"Starting SDKServer grpc-gateway…","severity":"info"}
{"error":"listen tcp 127.0.0.1:59358: bind: address already in use","message":"Could not serve http server","severity":"fatal"}

After deleting the affected gameservers, new gameservers and sidecars do start up without any issue.

What you expected to happen: If the sidecar of a gameserver fails to initialize, either recycle the gameserver or repair the sidecar.

How to reproduce it (as minimally and precisely as possible): No consistent repro steps. Seems to be most replicable when rapidly scaling up a new gameserverset, but is not consistent in its frequency.

Anything else we need to know?: Fleet Yaml:

apiVersion: stable.agones.dev/v1alpha1
kind: Fleet
metadata:
  name: battleserver
spec:
  replicas: 10
  template:
    spec:
      health:
        disabled: true
        initialDelaySeconds: 60
      ports:
      - containerPort: 443
        name: default
        protocol: TCP
      template:
        spec:
          containers:
          - image: <REMOVED>
            name: battleserver

Environment:

  • Agones version: 0.10.0
  • Kubernetes version (use kubectl version): Client: “v1.14.2” Server: “v1.13.6-gke.13”
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): Helm
  • Troubleshooting guide log(s):
  • Others:

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 32 (28 by maintainers)

Most upvoted comments

Update:

The ports used by the sdkserver sidecar can be configured through the gameserver/fleet yaml (starting with the impending 1.1 release) and I’ve added support to all of the SDKs to connect to the port that is being used by the sidecar.

For backwards compatibility, the default ports are not going to be changed for the 1.1 release, but they will be changed in a future release to a lower numbered, non-ephemeral port so that by default we don’t encounter this issue.
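As a rough illustration of what "connect to the port that is being used by the sidecar" could look like on the SDK side, here is a minimal Python sketch. It assumes the sidecar's ports are exposed to the game server container through the environment variables `AGONES_SDK_GRPC_PORT` and `AGONES_SDK_HTTP_PORT`; those names are not stated in this thread, and `sidecar_port` is a hypothetical helper, so check the SDK you use. The fallbacks are the defaults quoted in this issue.

```python
import os

# Defaults quoted in this thread. The environment variable names used below
# are an assumption and may differ from what a given SDK actually reads.
DEFAULT_GRPC_PORT = 59357
DEFAULT_HTTP_PORT = 59358


def sidecar_port(env_var, default, environ=None):
    """Return the port named by env_var, or default if it is unset or invalid."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var, "")
    try:
        port = int(raw)
    except ValueError:
        # Unset (empty string) or non-numeric value: keep the default.
        return default
    # Reject values outside the valid TCP port range.
    return port if 0 < port < 65536 else default


grpc_port = sidecar_port("AGONES_SDK_GRPC_PORT", DEFAULT_GRPC_PORT)
http_port = sidecar_port("AGONES_SDK_HTTP_PORT", DEFAULT_HTTP_PORT)
print(f"gRPC target: localhost:{grpc_port}, HTTP target: localhost:{http_port}")
```

Falling back to the old defaults keeps a game server built against a newer SDK working with a sidecar that does not advertise its ports.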

@aLekSer I just realised something else this will enable us to do - run the SDK conformance tests in parallel on different ports 👍 (because the sdk conformance tests take a long time right now, just because they are in serial) 😄

Given the Murphy’s Law argument I’m convinced that we should make the sdkserver ports configurable.

Next question: how to express this in the API?

We already have a section called ‘ports’, but I think it would be confusing to put it in there. We are adding a section called ‘logging’ with a subfield for ‘sdkserver’ but I’m wondering if we should invert it and instead do something like:

spec:
  sdkserver:
    grpcPort: 7777
    httpPort: 7778
    logging: Error

Basically, add a configuration block for things related to the sdkserver sidecar. We previously made logging a block so that we could potentially add logging fields for other system services later, but if we have other system services, they may need other configuration anyway and it makes sense to me to keep things grouped by what you are configuring.

The logging blob was added after 1.0 and isn’t released yet, so we have a short window where we can change it. If it sticks for 1.1.0 then we will need to think of a different way to represent this – or maybe there is a better way to represent it irrespective of the logging parameters?

I’m sure everyone on this thread understands, but just to clarify for anyone else, there are actually two ports that need to be set: 59357 (gRPC) and 59358 (http)
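One quick way to see why these two defaults are exposed to collisions is to compare them with the kernel's ephemeral port range. The sketch below assumes Linux, where the range can be read from `/proc/sys/net/ipv4/ip_local_port_range` and defaults to 32768 through 60999; the helper names are made up for illustration, and other operating systems use different ranges.

```python
from pathlib import Path

# Linux default for net.ipv4.ip_local_port_range; other systems may differ.
LINUX_DEFAULT_RANGE = (32768, 60999)
RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def ephemeral_range():
    """Read the local ephemeral port range, falling back to the Linux default."""
    try:
        low, high = RANGE_FILE.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        # File missing (non-Linux host) or in an unexpected format.
        return LINUX_DEFAULT_RANGE


def in_range(port, port_range):
    """True if port lies inside the inclusive (low, high) range."""
    low, high = port_range
    return low <= port <= high


local_range = ephemeral_range()
for name, port in (("gRPC", 59357), ("http", 59358)):
    print(
        name,
        port,
        "in local ephemeral range:", in_range(port, local_range),
        "in Linux default range:", in_range(port, LINUX_DEFAULT_RANGE),
    )
```

Any port inside that range can be handed to an outgoing connection at any moment, which is the argument for moving the defaults to a lower-numbered port.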

That’s a good point. I guess the only reason that would make it a good long-term strategy is to head off the inevitable “my game server starts up on the same port as the SDK, and I can’t change it, what do I do?”

I just figure via Murphy’s Law that will happen at some point 🤷‍♂️

@roberthbailey Here’s a small C program to demonstrate the problem.

https://github.com/drichardson/examples/blob/master/network/ephemeral.c

Here’s the result of running it several times in a row on an Ubuntu 18.04 GCE instance:

doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44080
bind failed for server. 98: Address already in use
doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44082
bind failed for server. 98: Address already in use
doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44084
bind failed for server. 98: Address already in use
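For readers who would rather not build the C program, the same effect can be sketched in Python. This is not a port of the linked code, only an approximation of the idea: it connects to a throwaway listener on loopback instead of example.com so it runs without network access, and the function name is invented for this example. The kernel hands the client an ephemeral port, and a later attempt to bind a server to that same address and port fails.

```python
import errno
import socket


def bind_to_port_held_by_client():
    """Return (client_port, errno or None) after binding to a port a client holds."""
    # Stand-in for a remote peer; the C demo connects to example.com instead.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # Outgoing connection: the kernel picks the client's source port
    # from the ephemeral range.
    client = socket.create_connection(listener.getsockname())
    client_port = client.getsockname()[1]

    # Now try to start a "server" on the port the client already occupies.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", client_port))
        error = None
    except OSError as exc:
        error = exc.errno
    finally:
        server.close()
        client.close()
        listener.close()
    return client_port, error


port, error = bind_to_port_held_by_client()
print(f"Client was allocated port {port}")
if error == errno.EADDRINUSE:
    print("bind failed for server: address already in use")
```

In the sidecar's case the conflicting socket would belong to some unrelated outgoing connection that happened to be assigned 59358 before the http server tried to listen on it.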