discord.js: Bot randomly exiting process or going offline/unresponsive

Which package is this bug report for?

discord.js

Issue description

After around 5-6 hours of running, my discord bot decides to randomly exit with an exit code of 0. There is nothing in my code that should cause that, and adding a debug event listener shows the bot failing around a websocket reacquisition near the 6 hour mark.

I have tried running the bot directly with TypeScript, compiling the code to JavaScript and running that, packaging it as a Docker container and running that locally, and running the same container on my production server. Every environment and method results in the same issue.

Bot code

https://github.com/NeonWizard/sockbot-discord

Logs

[WS => Shard 0] Heartbeat acknowledged, latency of 86ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 89ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 85ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 83ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 84ms.
[WS => Shard 0] [RECONNECT] Discord asked us to reconnect
[WS => Shard 0] [DESTROY]
    Close Code    : 4000
    Reset         : false
    Emit DESTROYED: true
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN      
[WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
[WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 5000ms
[WS => Shard 0] [WebSocket] Clearing the close timeout.
[WS => Shard 0] [WebSocket] Close Emitted: false
[WS => Shard 0] [WebSocket] did not close properly, assuming a zombie connection.
Emitting close and reconnecting again.
[WS => Shard 0] [CLOSE]
    Event Code: 1011
    Clean     : false
    Reason    : INTERNAL_ERROR
[WS => Shard 0] Session id is present, attempting an immediate reconnect...
[WS => Shard 0] [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none
[WS => Shard 0] Setting a HELLO timeout for 20s.
[WS => Shard 0] [CONNECTED] Took 146ms
[WS => Shard 0] Clearing the HELLO timeout.
[WS => Shard 0] Setting a heartbeat interval for 41250ms.
[WS => Shard 0] [RESUME] Session REDACTED, sequence 51
[WS => Shard 0] [RESUMED] Session REDACTED | Replayed 1 events.       
[WS => Shard 0] [ResumeHeartbeat] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 88ms.
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [CLOSE]
    Event Code: 4000
    Clean     : true
    Reason    :
[WS => Shard 0] Session id is present, attempting an immediate reconnect...
[WS => Shard 0] An open connection was found, attempting an immediate identify.
[WS => Shard 0] [RESUME] Session REDACTED, sequence 52
root@ruby:/home/spooky#

Code sample

https://github.com/NeonWizard/sockbot-discord

Package version

discord.js@14.1.2

Node.js version

v16.16.0, typescript@4.7.4

Operating system

Windows (WSL), Linux (Ubuntu), Docker (node:lts-alpine)

Priority this issue should have

High (immediate attention needed)

Which partials do you have configured?

No Partials

Which gateway intents are you subscribing to?

Guilds, GuildMessages, GuildMessageReactions, MessageContent

I have tested this issue on a development release

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 11
  • Comments: 85 (45 by maintainers)

Most upvoted comments

My Findings and Pinging Test Bot

@NeonWizard To test the theory that discord.js 14.x has stability issues (bots failing to remain online), I created a ping bot on discord.js 14.5.0 that posted to a Discord channel every 15 minutes while it was online. It also logged to the channel whenever the bot logged in, and it used a setInterval to try automatically destroying the client and instantiating a new one.

Here’s a log of today’s results, starting from 12:25 AM. No more than 3 pings were sent during any stretch in which the bot stayed online, meaning it disconnected within about an hour of logging in (3 pings at 15 minutes each is 45 minutes, and the fourth ping at 60 minutes never arrived). So it looks like there might be some timer that’s disconnecting the bot after an hour. I redacted the bot name and tag.

<PingingBot>
BOT
 — Today at 12:25 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 12:40 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 12:55 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:10 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:29 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 1:43 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:58 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 2:13 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:49 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 7:04 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:19 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:34 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:34 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 8:49 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 10:14 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 10:29 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 10:44 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:29 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 1:44 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:59 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 2:14 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 3:39 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 3:54 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:09 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:24 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:44 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 4:59 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:14 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:29 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:49 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 6:04 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:19 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:34 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:59 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 8:14 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 8:29 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:44 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:59 PM
15 minute ping interval successfully sent.

I stopped gathering data at 10:11 PM. So this bot was offline for an hour after that last ping.
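
The ping-count reasoning above can be made explicit. This is a minimal sketch, assuming the first ping is sent one full interval after login and that a session ends before its next ping is due; `uptimeBoundsMinutes` is a hypothetical helper and not part of the test bot.

```javascript
// Hypothetical helper: bound a session's uptime from the number of pings seen.
// Assumes the first ping fires one full interval after login.
function uptimeBoundsMinutes(pingCount, intervalMinutes = 15) {
  return {
    atLeast: pingCount * intervalMinutes,        // the last ping was sent
    lessThan: (pingCount + 1) * intervalMinutes, // the next ping never arrived
  };
}

console.log(uptimeBoundsMinutes(3)); // { atLeast: 45, lessThan: 60 }
```

Under those assumptions, a session with three pings lasted between 45 and 60 minutes, while the 8:34 AM session with a single ping lasted between 15 and 30 minutes.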

Upon looking into the client instance, I saw something with a timeout of one hour – going by the observation that my bot did not give more than 3 pings.

  sweepers: Sweepers {
    options: { threads: [Object] },
    intervals: {
      applicationCommands: null,
      bans: null,
      emojis: null,
      invites: null,
      guildMembers: null,
      messages: null,
      presences: null,
      reactions: null,
      stageInstances: null,
      stickers: null,
      threadMembers: null,
      threads: Timeout {
        _idleTimeout: 3600000,
        _idlePrev: [TimersList],
        _idleNext: [Timeout],
        _idleStart: 3686,
        _onTimeout: [Function (anonymous)],
        _timerArgs: undefined,
        _repeat: 3600000,
        _destroyed: false,
        [Symbol(refed)]: false,
        [Symbol(kHasPrimitive)]: false,
        [Symbol(asyncId)]: 8,
        [Symbol(triggerId)]: 1
      },
      users: null,
      voiceStates: null
    }
  },

Under rest.requestManager there also appeared to be a handlerTimer with an _idleTimeout value of one hour.

      handlerTimer: Timeout {
        _idleTimeout: 3600000,
        _idlePrev: [Timeout],
        _idleNext: [TimersList],
        _idleStart: 3393,
        _onTimeout: [Function (anonymous)],
        _timerArgs: undefined,
        _repeat: 3600000,
        _destroyed: false,
        [Symbol(refed)]: false,
        [Symbol(kHasPrimitive)]: false,
        [Symbol(asyncId)]: 7,
        [Symbol(triggerId)]: 1
      },

When my pinging bot disconnected, on replit this is what I saw in the console:

Preparing to connect to the gateway...
> DEBUG: Guilds we are connected to listed below
(end guild output)

> DEBUG: options => shardCount 1
> DEBUG: presence => status online
client.user is null (bot is not logged in).

This means that the client status was online, just that the user is null (i.e. bot is not logged in).

So it’s not an issue with the client instance from what I see, it’s that the user (bot) is getting disconnected. And the idle timeout value of one hour looks suspicious to me considering the three 15 minute pings I was seeing at most at any given time the bot was online.

Not only that, but even when my code was checking to log back in every 2 minutes, the bot could take over an hour to go online again.

What a successful login looked like

When I was actually able to log in the bot, there were several web socket manager debug logs. I removed the other statements to focus on just the WS output.

[WS => Manager] Fetched Gateway Information
    URL: wss://gateway.discord.gg
    Recommended Shards: 1

[WS => Manager] Session Limit Information
    Total: 1000
    Remaining: 997

[WS => Manager] Spawning shards: 0

[WS => Shard 0] [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none

[WS => Shard 0] Setting a HELLO timeout for 20s.

[WS => Shard 0] [CONNECTED] Took 123ms

[WS => Shard 0] Clearing the HELLO timeout.

[WS => Shard 0] Setting a heartbeat interval for 41250ms.

[WS => Shard 0] [IDENTIFY] Shard 0/1 with intents: 33283

[WS => Shard 0] [READY] Session <hash>.

[WS => Shard 0] [ReadyHeartbeat] Sending a heartbeat.

[WS => Shard 0] Shard received all its guilds. Marking as fully ready.

===
Logged in as <PingingBot>#0000!
===

[WS => Shard 0] Heartbeat acknowledged, latency of 34ms.

[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.

[WS => Shard 0] Heartbeat acknowledged, latency of 25ms.

Note: WS is the WebSocketManager (https://discord.js.org/#/docs/discord.js/main/class/WebSocketManager), which holds the WebSocketShard instances (https://discord.js.org/#/docs/discord.js/main/class/WebSocketShard).

  ws: WebSocketManager {
    _events: [Object: null prototype] {},
    _eventsCount: 0,
    _maxListeners: undefined,
    gateway: 'wss://gateway.discord.gg/',
    totalShards: 1,
    shards: Collection(1) [Map] { 0 => [WebSocketShard] },
    status: 3,
    destroyed: false,
    reconnecting: false,
    [Symbol(kCapture)]: false
  },

Findings in Thread

Some other findings in this thread:

  • @zLupa mentioned discordjs 13.8.1 is not showing these issues.
  • The person in the OP was using discord.js@14.1.2
  • @djipgen reported v13.10.3 has the same problem.

The common data point seems to be that something introduced this disconnection issue somewhere between v13.8.1 and v13.10.3, and that it then carried over into v14.x.

You can see the diff between v13.8.1 and v13.10.3 here: https://github.com/discordjs/discord.js/compare/13.8.1...13.10.3

Update: My bot is going zombie mode about every 1-2 hours now.

On my side, I have good results with the combination of:

  • Better understanding of the <ClientOptions>.closeTimeout option (5_000 ⇒ 10_000)
  • Fix from @legendhimself
  • Fix #8927

I think @legendhimself’s Pull Request could now be submitted. In my opinion, we could also update the documentation for <ClientOptions>.closeTimeout to explain that the right value depends on the ratio between the number of shards and the connection, so different ratios require different configurations.

The current default value is more than reasonable for small bots, but becomes unsustainable for slightly larger bots (or when that ratio is too low).

The amount of data received also seems to affect this ratio. It is as if the Discord API puts the data to be sent to you in a global queue and is limited in how much it can send you in a given amount of time. That would make us receive the heartbeat acks later, so we would need more time before declaring a shard connection a zombie.

I had observed that I was having a lot more trouble receiving GUILD_MEMBERS_CHUNK after a global fetch of members with presence intent enabled. This could potentially be related to this.

I started getting exits with status 0 today which are classified by systemctl as service success, after 3-5 days of running the bot with no visible disconnections. I configured the process to restart on failure, but idk what to do if the process either goes zombie mode like described above or if it exits with status 0.

This is discouraging.

I was having both of those issues (exiting with status code 0 and going zombie mode) when I originally created this ticket. I think this should be reopened and revisited.

Got a report stating the linked pull request didn’t resolve the issue. Feel free to open a pull request @legendhimself and we’ll see how things go from there.

not sure if that pr solved the issue for everyone and it got closed without even testing. Weird stuff

@DraftProducts There are few or no changes in the websocket code of djs from v13 to v14 (we still run the bots on v13).

Why do zombie connections happen? Mostly due to a session timeout. Why does the session timeout happen? No idea.

Discord itself claimed to have been using a 3rd party library for their websocket server and they have no idea why these weird WS issues happen. I have friends with their bots in millions of servers and they’ve tried other libs but still they find it impossible to have good uptime because of these weird WS issues.

To fix this issue we need to see why the WS is going into a timeout state or some unrecoverable state for which we are generating a new session (we cannot resume from timeout or unrecoverable states; doing so causes zombie issues / no ack).
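
The distinction between resumable and non-resumable closes can be sketched as a small decision function. This is only an illustration under stated assumptions: the fatal codes and the 4007/4009 entries are taken from Discord's gateway close-code documentation, treating 1011 as non-resumable follows the claim made in this thread, and `reconnectStrategy` is a hypothetical name, not a discord.js API.

```javascript
// Close codes after which the Discord gateway documentation says not to reconnect.
const FATAL_CLOSE_CODES = new Set([4004, 4010, 4011, 4012, 4013, 4014]);

// Codes after which the old session is assumed unusable. 4007 and 4009 come from
// the gateway documentation; 1011 is included per the discussion in this thread.
const NEW_SESSION_CLOSE_CODES = new Set([1011, 4007, 4009]);

// Hypothetical helper: decide how to reconnect after a close event.
function reconnectStrategy(closeCode, { sessionId = null, zombie = false } = {}) {
  if (FATAL_CLOSE_CODES.has(closeCode)) return 'fatal';
  // A zombie connection or a missing session cannot be resumed: identify again.
  if (zombie || !sessionId || NEW_SESSION_CLOSE_CODES.has(closeCode)) return 'identify';
  return 'resume';
}
```

With this sketch, a clean 4000 close with a session id resumes, while the same close on a connection already flagged as a zombie starts a fresh session.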

I’m also having this problem, does anyone have a sure fix yet?

@dzlandis I’ve asked people to try my fix, Seems like no one has tried it. edits: legendhimslef@2e1c68e repo: https://github.com/legendhimslef/discord.js/tree/fix/ws

I am the one to fix the old zombie connection #7581 issue for which I added the 1011 error code. Usually, 1011 happens on a Session timeout; if this happens, the WebSocket needs to reset and make a fresh reconnect.

Please try my changes and if it works I will probably make a pull request if @vladfrangu allows it.

Unfortunately, I don’t think I am going to use this because I don’t think it would actually be helpful in fixing this problem. My bot has been running normally for 2 days now (with my automatic restart every once in a while setup) and I haven’t had a full idle crash yet while not changing anything with my code. That being said, I don’t know if I would be able to tell you if it is actually working or not. If I get an idle crash again (making it a third time), I will consider trying it out 👍🏻

@NeonWizard Restarting the bot doesn’t seem like a good fix. There is a limit on logins per day ie 1000.
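
A restart-based workaround can guard against that limit before exiting. The sketch below assumes the `session_start_limit` object that Discord returns from its GET /gateway/bot endpoint (fields `total`, `remaining`, `reset_after`); `canRestart` and the `reserve` margin are hypothetical and only illustrate the check.

```javascript
// Hypothetical helper: decide whether another restart is affordable.
// `limit` is assumed to have the shape of `session_start_limit` from
// Discord's GET /gateway/bot response: { total, remaining, reset_after }.
function canRestart(limit, reserve = 100) {
  // Keep a safety margin so manual restarts and deployments remain possible.
  return limit.remaining > reserve;
}

console.log(canRestart({ total: 1000, remaining: 997, reset_after: 0 })); // true
```

The example values mirror the "Session Limit Information" debug output shown earlier in this thread (Total: 1000, Remaining: 997).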

It’s definitely a temporary fix until DiscordJS can fix it internally. But websockets are reacquired pretty infrequently, and this issue happens only on a small portion of reacquisitions. My bot only ends up restarting 1-2 times a day max.

I’ll disable my hotfix when I get the chance and try out your patch, and I’ll get back to you

Got the same error here: shards exit silently after a certain time and the application stops responding to interactions (djs v14).

This is a really old issue and was fixed for v13. You might need to create a new issue similar to this for v14.

This issue was for discord v14. I’m still not sure why it’s closed.

rather than submitting a PR here to fix the issue not just for them

@kyranet I just wanted the fixes to be tested before I submitted the pr. That’s the only reason I asked them to use my pr fork to “test”

I have tried my fixes with @DraftProducts 's bot with 350+ shards.

For people who got their token reset when using my fix: the internet speed and the closeTimeout client option (default 5_000 ms) seem to be the issue. Increase closeTimeout to 30_000 ms. That makes the process wait 30 s before marking a connection as a zombie (no heartbeat).

Why are we increasing the timeout? Well, we are giving it more time, it might not be a zombie connection but just a delayed heartbeat ack. So increasing closeTimeout fixed the token resetting for @DraftProducts for now.
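
A minimal sketch of that configuration, assuming the discord.js versions discussed in this thread, where `closeTimeout` is a client option. The intent bit values follow Discord's documented bit positions for the intents listed in the original report, and `buildClientOptions` is a hypothetical helper.

```javascript
// Gateway intent bit values, copied from Discord's documented bit positions.
const INTENT_BITS = {
  Guilds: 1 << 0,
  GuildMessages: 1 << 9,
  GuildMessageReactions: 1 << 10,
  MessageContent: 1 << 15,
};

// Hypothetical helper: build client options with a longer closeTimeout.
function buildClientOptions(closeTimeoutMs = 30_000) {
  const intents = Object.values(INTENT_BITS).reduce((bits, bit) => bits | bit, 0);
  return { intents, closeTimeout: closeTimeoutMs };
}

// Usage (assumes discord.js is installed):
// const { Client } = require('discord.js');
// const client = new Client(buildClientOptions());
```

The 30 s value is the one suggested above; a smaller bot may be fine with the 5 s default or the 10 s value reported elsewhere in this thread.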

This is my current hotfix. Basically, I update a variable with the current timestamp every time a heartbeat is acknowledged. Normally, a heartbeat should be acknowledged around every 40 seconds, although with reconnects it could take a bit longer. Then I have an interval that runs every 5 minutes and checks whether the last heartbeat is older than 5 minutes. If it is, we can assume that the client is no longer connected, and the process is killed. Since I’m running this with pm2, the process is restarted automatically.
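
The hotfix described above could look roughly like the following. It is a reconstruction from the description, not the original code: `createHeartbeatWatchdog` is a hypothetical helper, it relies on the "Heartbeat acknowledged" debug message seen in the logs in this thread, and the clock is injectable so the logic can be checked without waiting.

```javascript
// Hypothetical helper: track heartbeat acknowledgements via discord.js debug messages.
// `now` is injectable so the logic can be exercised without real timers.
function createHeartbeatWatchdog({ staleAfterMs = 5 * 60_000, now = Date.now } = {}) {
  let lastAck = now();
  return {
    // Feed every 'debug' message in; only acknowledgements refresh the timestamp.
    onDebug(message) {
      if (message.includes('Heartbeat acknowledged')) lastAck = now();
    },
    // True when no acknowledgement has been seen for longer than staleAfterMs.
    isStale() {
      return now() - lastAck > staleAfterMs;
    },
  };
}

// Wiring (assumes an existing discord.js `client` and an external process manager):
// const watchdog = createHeartbeatWatchdog();
// client.on('debug', (message) => watchdog.onDebug(message));
// setInterval(() => { if (watchdog.isStale()) process.exit(1); }, 5 * 60_000);
```

Exiting with a non-zero code is a choice made here so that supervisors configured to restart only on failure also pick it up. With several shards in one process, an acknowledgement from any shard refreshes the timestamp, so a per-shard variant would be needed to catch a single zombie shard.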

I got busy working on projects outside of Discord, but my bots been running into this issue multiple times a day for the last week (only some clusters of shards)

I’ll be pushing out your fix and seeing if that solves it tomorrow… Are the changes in the commit “fix: reset on abnormal close” what you’re wanting tested?

who solved?

Hi, just popping in here to say I’m experiencing the same issue and it is very frustrating!

Just to reiterate what appears to be said here, my bot ends up in an idle state after errors 4000, 1000, 1006, and/or 1011 happen (at least from what I’ve seen so far). Eventually, my bot goes fully offline but the process remains idle, which prevents my restart-on-exit systems from preventing downtime.

Here are some of my logs, which appear to match what others have shared here (but I see no harm in adding more data for others to see). I'm running DJS v14.6.0, but these issues also occurred in v14.2.0 before I updated.

[WS => Shard 30] Shard received all its guilds. Marking as fully ready.
[WS => Shard 5] A connection object was found. Cleaning up before continuing.
     State: CLOSED
[WS => Shard 5] [DESTROY]
     Close Code    : 1000
     Reset         : false
     Emit DESTROYED: false
[WS => Shard 5] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: CLOSED
[WS => Shard 5] WS State: CLOSED
[WS => Shard 5] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
         Timeout: 5000ms
[WS => Shard 5] [CONNECT]
     Gateway    : wss://gateway.discord.gg/
     Version    : 10
     Encoding   : json
     Compression: none
[WS => Shard 5] Setting a HELLO timeout for 20s.
[WS => Shard 5] [CONNECTED] Took 46ms
[WS => Shard 5] Clearing the HELLO timeout.
[WS => Shard 5] Setting a heartbeat interval for 41250ms.
[WS => Shard 5] [RESUME] Session <redacted> sequence 1551
[WS => Shard 5] [RESUMED] Session <redacted>| Replayed 353 events.
[WS => Shard 5] [ResumeHeartbeat] Sending a heartbeat.
[WS => Shard 5] Heartbeat acknowledged, latency of 29ms.
[WS => Shard 3] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 17] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 3] Heartbeat acknowledged, latency of 22ms.
[WS => Shard 17] Heartbeat acknowledged, latency of 24ms.
[WS => Shard 24] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 24] Heartbeat acknowledged, latency of 25ms.
[WS => Shard 10] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 10] Heartbeat acknowledged, latency of 32ms.
[WS => Shard 5] [WebSocket] Clearing the close timeout.
[WS => Shard 5] [WebSocket] Close Emitted: false
[WS => Shard 5] [WebSocket] did not close properly, assuming a zombie connection.
 Emitting close and reconnecting again.
[WS => Shard 5] [CLOSE]
     Event Code: 1011
     Clean     : false
     Reason    : INTERNAL_ERROR
[WS => Shard 5] Session id is present, attempting an immediate reconnect...
[WS => Shard 4] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 4] Heartbeat acknowledged, latency of 37ms.
[WS => Shard 25] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 25] Heartbeat acknowledged, latency of 43ms.
[WS => Shard 18] [HeartbeatTimer] Sending a heartbeat.

---> Each heartbeat continues infinitely while commands are unreachable until bot is manually restarted.

And here’s another scenario where the same thing appears to be happening:

08:01:39: :: 2022-11-23T13:01:39.901Z :: [WS => Shard 24] [HeartbeatTimer] Sending a heartbeat.
08:01:39: :: 2022-11-23T13:01:39.923Z :: [WS => Shard 24] Heartbeat acknowledged, latency of 22ms.
08:02:16: :: 2022-11-23T13:02:16.119Z :: [WS => Shard 26] [HeartbeatTimer] Sending a heartbeat.
08:02:16: :: 2022-11-23T13:02:16.145Z :: [WS => Shard 26] Heartbeat acknowledged, latency of 25ms.
08:02:21: :: 2022-11-23T13:02:21.151Z :: [WS => Shard 24] [HeartbeatTimer] Sending a heartbeat.
08:02:21: :: 2022-11-23T13:02:21.175Z :: [WS => Shard 24] Heartbeat acknowledged, latency of 23ms.
08:02:57: :: 2022-11-23T13:02:57.369Z :: [WS => Shard 26] [HeartbeatTimer] Sending a heartbeat.
08:02:57: :: 2022-11-23T13:02:57.395Z :: [WS => Shard 26] Heartbeat acknowledged, latency of 25ms.
08:03:02: :: 2022-11-23T13:03:02.401Z :: [WS => Shard 24] [HeartbeatTimer] Sending a heartbeat.
08:03:02: :: 2022-11-23T13:03:02.426Z :: [WS => Shard 24] Heartbeat acknowledged, latency of 24ms.
08:03:14: :: 2022-11-23T13:03:14.857Z :: Posted stats to Top.gg!
08:03:38: :: 2022-11-23T13:03:38.619Z :: [WS => Shard 26] [HeartbeatTimer] Sending a heartbeat.
08:03:38: :: 2022-11-23T13:03:38.642Z :: [WS => Shard 26] Heartbeat acknowledged, latency of 22ms.
08:03:43: :: 2022-11-23T13:03:43.651Z :: [WS => Shard 24] [HeartbeatTimer] Sending a heartbeat.
08:03:43: :: 2022-11-23T13:03:43.676Z :: [WS => Shard 24] Heartbeat acknowledged, latency of 25ms.
08:04:19: :: 2022-11-23T13:04:19.870Z :: [WS => Shard 26] [HeartbeatTimer] Sending a heartbeat.
08:04:19: :: 2022-11-23T13:04:19.894Z :: [WS => Shard 26] Heartbeat acknowledged, latency of 24ms.
08:04:24: :: 2022-11-23T13:04:24.475Z :: [WS => Shard 26] [RECONNECT] Discord asked us to reconnect
08:04:24: :: 2022-11-23T13:04:24.475Z :: [WS => Shard 26] [DESTROY]
08:04:24:     Close Code    : 4000
08:04:24:     Reset         : false
08:04:24:     Emit DESTROYED: true
08:04:24: :: 2022-11-23T13:04:24.475Z :: [WS => Shard 26] Clearing the heartbeat interval.
08:04:24: :: 2022-11-23T13:04:24.475Z :: [WS => Shard 26] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
nowledged, latency of 22ms.
08:10:06: :: 2022-11-23T13:10:06.504Z :: [WS => Shard 24] Clearing the heartbeat interval.
08:10:06: :: 2022-11-23T13:10:06.504Z :: [WS => Shard 24] [CLOSE]
08:10:06:     Event Code: 1006
08:10:06:     Clean     : false
08:10:06:     Reason    :
08:10:06: :: 2022-11-23T13:10:06.504Z :: [WS => Shard 24] Session id is present, attempting an immediate reconnect...

---> Eventually, after some time from this error, the bot went fully offline while remaining idle as described earlier.

I’d really love to see this issue get resolved ASAP as this has been a huge pain as of recently and has caused lots of complaints from my users. I’m glad I found this issue so I know its not just me.

Here’s my hotfix. It listens to every debug message and exits the process as soon as djs clears the heartbeat interval, so that the bot restarts right away. This will work until a proper fix is implemented.

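// Hotfix: exit as soon as discord.js logs that it cleared the heartbeat interval.
// Assumes something external starts the process again after it exits.
client.on('debug', (message) => {
    if (message.includes('Clearing the heartbeat interval.')) {
        process.exit(0);
    }
});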

the 1000 login per day limit will not be an issue here, I’ve been running this for a week now and I got no problems yet, my bot only restarts like 7 times per day

@NeonWizard and others who experience the same issue, can you all try this fix and check if it works?

This resets the session if a zombie connection ever appears regardless of any event code or close code. changes: https://github.com/legendhimslef/discord.js/commit/2e1c68e28a3b7bb30d327cb265857d1c0c662a9e

yarn add 'https://gitpkg.now.sh/legendhimslef/discord.js/packages/discord.js?fix/ws'

This is really interesting. From what I had experienced before, zombie connections only used to happen with event code 4009, but here it also happened with 4000. In the 4000 case, the session reset wasn’t happening the way it does for 4009. With the above fix, the session should be reset every time a zombie connection happens. A normal reconnection won’t have reset set to true.

Don’t use this code in production. Please use it at your own risk, Use it only for testing.

@Dossar That’s a nice test. The Internet is wack there is no way that any connection is gonna be active 24/7. I mean there is a reason why people use TCP over UDP even tho UDP is way faster than TCP. TCP helps them get back the lost packets. I said this to show you that the internet is not 100% reliable but we found ways to make it reliable. In this case, your bot was actually connected to the Discord API and it gets disconnected and gets reconnected back and this is normal. The disconnect -> reconnect -> resume happens in a few minutes or instantly sometimes and sometimes it’s not resumable so it won’t receive the missed events.

I have a bot in 60k+ servers with a huge number of interactions; around 2-5k active users are playing the bot at any given time. It’s a game bot, so this kind of activity is expected. From what I’ve seen, staying below v14 is working for us, even though there were few or no major changes in the WS code. We get tons of disconnects and reconnects similar to the user who created the issue.

Below are the logs of a successful reconnect

[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RECONNECT] Discord asked us to reconnect
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [DESTROY]
    Close Code    : 4000
    Reset         : false
    Emit DESTROYED: true
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Clearing the heartbeat interval.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Close: Tried closing. | WS State: CLOSING
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 2500ms
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Clearing the close timeout.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CLOSE]
    Event Code: 4000
    Clean     : true
    Reason    : 
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Session id is present, attempting an immediate reconnect...
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 9
    Encoding   : json
    Compression: none
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Setting a HELLO timeout for 20s.
Message from shard | Job: getFanArray
Message from shard | Job: remaining
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CONNECTED] Took 158ms
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Clearing the HELLO timeout.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Setting a heartbeat interval for 41250ms.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RESUME] Session c3bc0d408809a14ee812583979cfb462, sequence 17901
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RESUMED] Session c3bc0d408809a14ee812583979cfb462 | Replayed 1 events.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [ResumeHeartbeat] Sending a heartbeat.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Heartbeat acknowledged, latency of 105ms.

But there are times at which just after the above log ended the WS again sends a CLOSE event similar to the user who created this issue and it again goes through the reconnect process and reconnects. This usually takes a few minutes. From what I’ve known by talking to some of the Discord devs, Discord actually uses some external package for all the WebSocket-related stuff and they don’t have it natively. So even the Discord devs have no idea of the weird disconnects and reconnect. Before this DJS library didn’t handle the 4009 WS close code sent by Discord which caused people to have zombie shards [process is alive but not connected to the gateway until the process restarted]. Discord still sends 4009 close code but now the DJS handles the reconnect.

But if the reconnect is taking several minutes and causing huge downtimes then it might be a bug inside of the DJS. We don’t have any such issues with our bot and we are staying below v14 [13.9.1] until the discord/ws package is fully ready and launched. We won’t have the package in production until a few weeks of testing.

To sum it up, I think DJS can only work towards better handling and reconnecting to the Discord API. We cannot control the disconnects which are caused by many reasons [Internet, Discord, etc].

So in the meantime, what can treat this issue? How can you automatically diagnose when the bot has internally disconnected from the websocket and automatically restart the process? This issue is severe and has been drastically affecting my end users.
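
One possible approach, offered only as a sketch: poll the client's websocket status and exit after several consecutive checks in which it is not ready. It assumes that `client.ws.status` equals 0 when the client is ready in discord.js v14; whether that field reflects every zombie state described in this thread has not been established here. `createStatusMonitor` is a hypothetical helper.

```javascript
// Hypothetical helper: count consecutive health checks where the client was not ready.
// `getStatus` is assumed to return `client.ws.status`, where 0 is taken to mean Ready.
function createStatusMonitor(getStatus, { maxFailures = 3, readyStatus = 0 } = {}) {
  let failures = 0;
  // Call on a timer; returns true once the client was not ready maxFailures times in a row.
  return function check() {
    failures = getStatus() === readyStatus ? 0 : failures + 1;
    return failures >= maxFailures;
  };
}

// Wiring (assumes an existing discord.js `client` and an external process manager):
// const check = createStatusMonitor(() => client.ws.status);
// setInterval(() => { if (check()) process.exit(1); }, 60_000);
```

Requiring several consecutive failures avoids restarting during the ordinary reconnect and resume cycles shown in the logs above.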