socket.io: "Session ID unknown" after handshake on high server load [Socket.io 1.0.6]

I am running a multi-node server (16 workers running Socket.io 1.0.6, accessed via nginx configured as a reverse proxy with sticky sessions) for ~5k users. While the server load is low (2~3 on a 20-core server / 2k users), everyone is able to connect instantly. When the load gets higher (5~6 / 5k users), new users are not able to connect and receive data instantly; it takes 2~4 handshakes before they connect successfully.

This is what happens (high load):

  • User opens the website; receives HTML and JS
  • User’s browser attempts to initialize a socket.io connection to the server (io.connect(...))
  • A handshake request is sent to the server; the server responds with a SID and other information ({"sid":"f-re6ABU3Si4pmyWADCx","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000})
  • The client initiates a polling request including this SID: GET .../socket.io/?EIO=2&transport=polling&t=1408648886249-1&sid=f-re6ABU3Si4pmyWADCx
  • Instead of sending data, the server responds with 400 Bad Request: {"code":1,"message":"Session ID unknown"}
  • The client performs a new handshake (GET .../socket.io/?EIO=2&transport=polling&t=1408648888050-3, notice the previously received SID is omitted)
  • The server responds with new connection data, including a new SID: ({"sid":"DdRxn2gv6vrtZOBiAEAS","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000})
  • The client performs a new polling request, including the new SID: GET .../socket.io/?EIO=2&transport=polling&t=1408648888097-4&sid=DdRxn2gv6vrtZOBiAEAS
  • The server responds with the data emitted by the worker code.

Depending on the server load, the "Session ID unknown" response and the resulting new handshake can repeat 1~3 times before the client actually receives data.
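
For reference, this two-step handshake can be reproduced outside the browser with a minimal Node.js sketch (the hostname is a placeholder, and the sid is pulled out with a regex rather than a real Engine.IO parser):

  var http = require('http');

  function get(path, cb) {
    http.get({ host: 'example.com', path: path }, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () { cb(res.statusCode, body); });
    });
  }

  // Step 1: handshake without a sid; the open packet contains a fresh sid.
  get('/socket.io/?EIO=2&transport=polling&t=' + Date.now() + '-0', function (status, body) {
    var sid = /"sid":"([^"]+)"/.exec(body)[1];
    // Step 2: poll with that sid. If this request is routed to a worker that
    // did not create the session, the response is
    // 400 {"code":1,"message":"Session ID unknown"}.
    get('/socket.io/?EIO=2&transport=polling&t=' + Date.now() + '-1&sid=' + sid,
        function (pollStatus, pollBody) {
          console.log(pollStatus, pollBody);
        });
  });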

About this issue

  • State: closed
  • Created 10 years ago
  • Reactions: 9
  • Comments: 30 (7 by maintainers)

Most upvoted comments

For me, this happened with nginx (SSL + HTTP/2) and the client falling back to polling; the config that worked forces the WebSocket transport:

  const ioSocket = io('', {
    // Send auth token on connection; you will need to DI the Auth service above
    // 'query': 'token=' + Auth.getToken()
    path: '/socket.io',
    transports: ['websocket'],
    secure: true,
  });

DON’T FORGET TO CONFIGURE CLIENT AS WELL
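
For reference, a sketch of the matching nginx location block (address and port are placeholders, not from the comment above); without the Upgrade/Connection headers the WebSocket handshake fails and clients stay stuck on polling:

  location /socket.io/ {
      proxy_pass http://127.0.0.1:3000;
      proxy_http_version 1.1;                    # required for WebSocket proxying
      proxy_set_header Upgrade $http_upgrade;    # pass the upgrade request through
      proxy_set_header Connection "upgrade";
      proxy_set_header Host $host;
  }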

Configuring only the Node.js backend to use the WebSocket transport won't do much; socket.io clients also need to be set up with the same configuration. So, in my opinion, the below should work:

in Node.js (server side):

  const io = require('socket.io')(server, {
    transports: ['websocket', 'polling']
  });

and in the JS client:

  socket = io.connect(SocketServerRootURL, {
    transports: ['websocket', 'polling']
  });

['websocket', 'polling'] will make socket.io try WebSocket as the first transport and fall back to polling (in case some browsers/clients do not support WebSockets). For a cluster environment, it is better to use ['websocket'] only.

You are using polling, and you can't have sticky sessions with polling without things getting overly complex. A WebSocket connection opens once and stays open; with polling in a cluster, socket.io does not know which worker holds the session, so it either creates a new connection every time or fails.

We finally figured this out. The root cause in our case:

  • When receiving a 5xx error, nginx proxy by default will take the errant upstream out of rotation for 10 seconds
  • when upstream-A is unavailable, ip_hash will route all of A’s requests instead to upstream-B
  • unfortunately, when upstream-B gets the new requests, it spits out 5xx errors (correctly) because the SID is not found in this.clients
  • that makes them get taken out of rotation as well, and their requests get routed to upstream-C
  • recurse…

We solved it by changing the nginx max_fails to something more reasonable (and upping the open file-descriptor limit for our app's user, which was a secondary failure point, exacerbated by the constant reconnects).
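
A sketch of that kind of upstream tweak (addresses are placeholders): max_fails=0 tells nginx never to take an upstream out of rotation, while the defaults (max_fails=1, fail_timeout=10s) eject a worker after a single error:

  upstream socketio_workers {
      ip_hash;                           # sticky routing by client IP
      server 127.0.0.1:3001 max_fails=0; # never eject this upstream on errors
      server 127.0.0.1:3002 max_fails=0;
  }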

Hi, I am having this same problem with nginx, node and socket.io. There is a way for nginx to use 'sticky' session IDs passed along in an HTTP cookie that would solve it, but it's part of their commercial offering. I was hoping socket.io-redis would address this by storing the session ID in Redis and using it from another socket.io-redis-enabled node, but it doesn't work. Maybe this is something that could be made to work using the Redis adapter?
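
For context, attaching the adapter looks roughly like this (host/port are placeholders); the adapter relays emitted events between nodes, but it does not replicate Engine.IO sessions, so sticky routing is still required:

  var io = require('socket.io')(3000);
  var redis = require('socket.io-redis');

  // Broadcasts now reach clients connected to other nodes, but each node
  // still only recognizes the session IDs it created itself.
  io.adapter(redis({ host: 'localhost', port: 6379 }));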

I had this problem hosting my project on Heroku when I switched to multiple dynos; I solved it by enabling sticky sessions with heroku features:enable http-session-affinity.

I have the same problem… Any solution?

For future readers:

Please note that using transports: ['websocket'] disables HTTP long-polling, so there’s no fallback if the WebSocket connection cannot be achieved (which might be acceptable or not, depending on your use case).

Reference: https://socket.io/docs/v4/client-options/#transports
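
The same docs describe a WebSocket-first pattern that keeps a fallback: start with ['websocket'] and revert to the default upgrade mechanism on connection error. A sketch:

  const socket = io({ transports: ['websocket'] });

  socket.on('connect_error', () => {
    // Fall back to long-polling first, then let Engine.IO upgrade as usual.
    socket.io.opts.transports = ['polling', 'websocket'];
  });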

@over2000 if HTTP long-polling makes too many requests, then that surely means something is wrong with the setup, like CORS. Please check our troubleshooting guide: https://socket.io/docs/v4/troubleshooting-connection-issues/

Do you guys have any further debugging information? Could it be a problem in the stickiness logic? The only way for {"code":1,"message":"Session ID unknown"} to be returned is if the SID is simply not in the in-memory data structure.
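
Paraphrased (not the verbatim engine.io source), the check behind this error looks like the sketch below; each worker keeps its own in-memory clients map, so a polling request routed to the wrong worker cannot find the sid:

  // Paraphrased sketch of the engine.io request check, not the actual source.
  Server.prototype.verify = function (req, fn) {
    var sid = req._query.sid;
    if (sid && !this.clients[sid]) {
      // Sent to the client as 400 {"code":1,"message":"Session ID unknown"}
      return fn(1 /* UNKNOWN_SID */, false);
    }
    fn(null, true);
  };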