node-slack-sdk: Socket mode: Bolt doesn't handle WebSocket disconnections gracefully

Description

Bolt crashes when it loses its WebSocket connection in socket mode, and there doesn’t seem to be a great way for the calling code to handle the situation. Disconnections from an EC2 instance in us-east-1 are happening frequently enough for this to be a significant issue. This happens even when a bot is idle.

What type of issue is this? (place an x in one of the [ ])

  • bug
  • enhancement (feature request)
  • question
  • documentation related
  • example code related
  • testing related
  • discussion

Requirements (place an x in each of the [ ])

  • I’ve read and understood the Contributing guidelines and have done my best effort to follow them.
  • I’ve read and agree to the Code of Conduct.
  • I’ve searched for any related issues and avoided creating a duplicate issue.

Bug Report

Filling out the following details about bugs will help us solve your issue sooner.

Reproducible in:

package version: 3.3.0

node version: 16.1.0

OS version(s): Ubuntu 20.04.2 LTS (focal)

Steps to reproduce:

I have a fairly simple bot in the following form:

const app = new App({
	token: "...",
	appToken: "...",
	socketMode: true,
});

app.message(/^foobar$/, async ({ message, say, client }) => {
	// Doesn't matter what's here; it doesn't need to ever trigger for the bug to manifest
});

await app.start();

The crash happens if I:

  1. Start the bot on an EC2 us-east-1 instance.
  2. Leave it running for an arbitrary period of time–sometimes it takes an hour, sometimes a few days.
  3. At some point, the WebSocket will close. I’m not sure why, but I assume it’s something on Slack’s end, as the EC2 instance is left untouched.

Expected result:

The bot will either:

  • Handle connectivity interruptions gracefully, without the need for the programmer to manually reconnect or instantiate a new App.
  • Throw an exception or trigger an event that can be handled by the calling code in a trivial manner–one that doesn’t require instantiating a new App. This would probably need to be documented and demonstrated in the examples, since a socket mode Bolt bot isn’t really viable without some mechanism for handling connectivity interruptions.

Actual result:

[ERROR]  socket-mode:SocketModeClient:0 failed to send message on websocket: WebSocket is not open: readyState 3 (CLOSED)
/git/slackbot/node_modules/@slack/socket-mode/dist/errors.js:29
    const error = errorWithCode(new Error(`Failed to send message on websocket: ${original.message}`), ErrorCode.WebsocketError);
                                ^

Error: Failed to send message on websocket: WebSocket is not open: readyState 3 (CLOSED)
    at Object.websocketErrorWithOriginal (/git/slackbot/node_modules/@slack/socket-mode/dist/errors.js:29:33)
    at /git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:347:48
    at sendAfterClose (/git/slackbot/node_modules/ws/lib/websocket.js:758:5)
    at WebSocket.send (/git/slackbot/node_modules/ws/lib/websocket.js:345:7)
    at /git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:344:32
    at new Promise (<anonymous>)
    at SocketModeClient.send (/git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:329:16)
    at ack (/git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:469:24)
    at App.processEvent (/git/slackbot/node_modules/@slack/bolt/dist/App.js:428:19)
    at runMicrotasks (<anonymous>) {
  code: 'slack_socket_mode_websocket_error',
  original: Error: WebSocket is not open: readyState 3 (CLOSED)
      at sendAfterClose (/git/slackbot/node_modules/ws/lib/websocket.js:754:17)
      at WebSocket.send (/git/slackbot/node_modules/ws/lib/websocket.js:345:7)
      at /git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:344:32
      at new Promise (<anonymous>)
      at SocketModeClient.send (/git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:329:16)
      at ack (/git/slackbot/node_modules/@slack/socket-mode/dist/SocketModeClient.js:469:24)
      at App.processEvent (/git/slackbot/node_modules/@slack/bolt/dist/App.js:428:19)
      at runMicrotasks (<anonymous>)
      at processTicksAndRejections (node:internal/process/task_queues:96:5)
      at async SocketModeClient.<anonymous> (/git/slackbot/node_modules/@slack/bolt/dist/receivers/SocketModeReceiver.js:88:13)
}

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 21 (8 by maintainers)

Commits related to this issue

Most upvoted comments

👋 Hi, all the developers watching this thread!

We’ve improved the SocketModeClient internals a lot to be more robust. You can try the updated one by installing the RC version by running the following command:

npm i @slack/socket-mode@next
# or npm i @slack/socket-mode@1.3.0-rc.0

If you are interested in all the changes included in the release, refer to https://github.com/slackapi/node-slack-sdk/releases/tag/%40slack%2Fsocket-mode%401.3.0-rc.0 for the details.

If everything goes well, we will release v1.3.0 next week!

I also notice connection failures lately. Twice today actually. Here are the debug logs. App is stuck at this state until i manually intervene and do an app restart. Though stuck at this state, it is still able to receive events but eventually it errors before it reaches the code app.event('app_mention', handler)

[DEBUG]  socket-mode:SocketModeClient:0 calling ack events_api
[DEBUG]  socket-mode:SocketModeClient:0 send() in state: connected,ready
[DEBUG]  socket-mode:SocketModeClient:0 sending message on websocket: {"envelope_id":"xxx","payload":{}}
[ERROR]  socket-mode:SocketModeClient:0 failed to send message on websocket: WebSocket is not open: readyState 3 (CLOSED)
(node:18) UnhandledPromiseRejectionWarning: Error: Failed to send message on websocket: WebSocket is not open: readyState 3 (CLOSED)
    at Object.websocketErrorWithOriginal (/opt/app/node_modules/@slack/socket-mode/dist/errors.js:29:33)
    at /opt/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:347:48
    at sendAfterClose (/opt/app/node_modules/@slack/socket-mode/node_modules/ws/lib/websocket.js:768:5)
    at WebSocket.send (/opt/app/node_modules/@slack/socket-mode/node_modules/ws/lib/websocket.js:345:7)
    at /opt/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:344:32
    at new Promise (<anonymous>)
    at SocketModeClient.send (/opt/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:329:16)
    at ack (/opt/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:469:24)
    at App.processEvent (/opt/app/node_modules/@slack/bolt/dist/App.js:428:19)
    at runMicrotasks (<anonymous>)
(node:18) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 122)

version 1.3.0 is now available. Please upgrade to the latest version 👋

Anyone trying this, note my mistake in the first commit ⬆️ – if you’re using bolt, then socket-mode is a transitive dependency.

We encountered a very similar problem. But WebSocket disconnections occur constantly.

[ERROR]  socket-mode:SocketModeClient:0 A websocket error occurred: Client network socket disconnected before secure TLS connection was established
[INFO]  socket-mode:SocketModeClient:0 A ping wasn't received from the server before the timeout of 30000ms!
[ERROR]  socket-mode:SocketModeClient:0 cannot send message when client is not ready
/usr/src/app/node_modules/@slack/socket-mode/dist/errors.js:63
    return errorWithCode(new Error('Cannot send message when client is not ready'), ErrorCode.NoReplyReceivedError);
                         ^

Error: Cannot send message when client is not ready
    at sendWhileNotReadyError (/usr/src/app/node_modules/@slack/socket-mode/dist/errors.js:63:26)
    at /usr/src/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:343:60
    at new Promise (<anonymous>)
    at SocketModeClient.send (/usr/src/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:334:16)
    at ack (/usr/src/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:473:24)
    at App.processEvent (/usr/src/app/node_modules/@slack/bolt/dist/App.js:531:19)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async SocketModeClient.<anonymous> (/usr/src/app/node_modules/@slack/bolt/dist/receivers/SocketModeReceiver.js:142:13) {
  code: 'slack_socket_mode_no_reply_received_error'
}

Meanwhile, i did this as a workaround. Not sure if it is the right approach to depend on app.receiver.client.badConnection, but based on my testing seems to works well. Alternatively, would be great for the library to support a health check ability for the bot. My deploy setup would restart the bot if /health fails with 500

  httpApp.get('/health', (req, res) => {
    if (app.receiver.client.badConnection) {
      res.status(500).send('Internal Server Error');
      return;
    }

    res.status(200).send(app.receiver.client.badConnection);
  });

We encountered this exact issue as well. WebSocket disconnections seem to happen pretty rarely but still, I think that this is a pretty major issue for socket mode bots.

Something I thought I would point out is that the default handling of unhandled promise rejections has changed between Node 14.x and 16.x. In 14.x (latest LTS) you just get UnhandledPromiseRejectionWarning but in 16.x the application will be terminated. This probably explains the different behavior reported by @Zenexer and @usubram .

We’re running in 14.x and decided that the easiest fix for now was to add a handler for unhandledRejection that forces the application to crash (so our process manager could then restart it):

process.on("unhandledRejection", error => {
	throw error;
});

But anyhow, restarting the whole app just to reconnect lost WebSocket connection doesn’t seem like a good long term solution…