nextdns: EdgeOS NextDNS service died/killed after errors (also a question about multiple forwarders)

I’ve had this happen twice so far, the first about an hour after setting things up, and the second just now, ~26 hours later.

It seems to start with oodles (like dozens every second) of these errors:

Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA unagi.amazon.com. (qry=34/res=12) 15866ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A unagi.amazon.com. (qry=34/res=12) 15866ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A connectivitycheck.gstatic.com. (qry=47/res=12) 15858ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A fcmconnection.googleapis.com. (qry=46/res=12) 15858ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A connectivitycheck.gstatic.com. (qry=47/res=12) 15858ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A smile.amazon.com. (qry=34/res=12) 24457ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA smile.amazon.com. (qry=34/res=12) 24457ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A api.box.com. (qry=29/res=12) 24457ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A connectivitycheck.gstatic.com. (qry=47/res=12) 24458ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A connectivitycheck.gstatic.com. (qry=47/res=12) 24458ms : doh resolve: context deadline exceeded
Jul 03 19:17:41 hostname nextdns[10843]: Query 127.0.0.1 UDP A pixel.advertising.com. (qry=39/res=12) 24458ms : doh resolve: context deadline exceeded

While this is happening, browsing starts to get slow (though it doesn’t seem to fail entirely). The errors then start getting replaced with these:

Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP A unagi.amazon.com. (qry=34/res=12) 8657ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA unagi.amazon.com. (qry=34/res=12) 8657ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP A unagi.amazon.com. (qry=34/res=12) 8656ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA unagi.amazon.com. (qry=34/res=12) 8656ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP A connectivitycheck.gstatic.com. (qry=47/res=12) 8655ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP A smile.amazon.com. (qry=34/res=12) 8655ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA smile.amazon.com. (qry=34/res=12) 8654ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files
Jul 03 19:17:47 hostname nextdns[10843]: Query 127.0.0.1 UDP AAAA d.joinhoney.com. (qry=33/res=12) 8654ms : doh resolve: dial tcp 1.1.1.1:443: socket: too many open files

And then finally the service goes down:

Jul 03 19:19:25 hostname systemd[1]: nextdns.service: Main process exited, code=killed, status=9/KILL
Jul 03 19:19:25 hostname systemd[1]: nextdns.service: Unit entered failed state.
Jul 03 19:19:25 hostname systemd[1]: nextdns.service: Failed with result 'signal'.
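To see how the three stages above progress over time, the journal output can be tallied by error type. The following is only a rough sketch: the function name `tally_errors` and the category labels are made up for illustration, and the matching relies on nothing more than the substrings visible in the log excerpts above.

```python
from collections import Counter

# Substrings taken from the log excerpts in this post; the labels on the
# left are hypothetical names chosen for this sketch.
PATTERNS = {
    "timeout": "context deadline exceeded",
    "fd_exhausted": "too many open files",
    "killed": "code=killed",
}


def tally_errors(lines):
    """Count how many log lines fall into each error category."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
                break  # a line is counted under at most one category
    return counts
```

One way to use it would be to save the journal for the unit to a file (for example with `journalctl -u nextdns`) and pass the open file object to `tally_errors`, since it accepts any iterable of lines.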

Starting the service again seems to resolve the problem. I could add a restart clause to the systemd configuration to just restart it (and probably will long-term if there’s not a resolution), but figure I’ll just let it keep crashing for now to see if I can root out any problems.

Does anyone know what the “context deadline exceeded” and “too many open files” errors mean, and what is likely to cause them?

Also, I’ve added both 1.1.1.1 and 1.0.0.1 to my nextdns config like so:

forwarder https://1.1.1.1/dns-query,https://1.0.0.1/dns-query

But it seems like the errors I’m running into with 1.1.1.1 are preventing DNS lookups, so I assume nextdns isn’t trying the second forwarder (or these errors somehow prevent it from doing so?).

Context

Version: 1.7.0
Platform: EdgeOS 2.0.8 on EdgeRouter X SFP

About this issue

State: closed
Created 4 years ago
Reactions: 2
Comments: 17 (3 by maintainers)

Most upvoted comments

Try setting bogus-priv false as a workaround. I haven’t had to restart the device or the service since changing that setting.

ralban on Aug 7, 2020

@ralban I’ve had some success adding/modifying the RestartSec, Restart, and RuntimeMaxSec lines in /etc/systemd/system/nextdns.service:

[Unit]
Description=NextDNS DNS53 to DoH proxy.
ConditionFileIsExecutable=/config/nextdns/nextdns
After=network.target
Before=nss-lookup.target
Wants=nss-lookup.target

[Service]
StartLimitInterval=5
StartLimitBurst=10
Environment=SERVICE_RUN_MODE=1
ExecStart=/config/nextdns/nextdns run
RestartSec=5
LimitMEMLOCK=infinity
Restart=always
RuntimeMaxSec=23000

[Install]
WantedBy=multi-user.target

The two Restart* lines will ensure that if the service dies (which mine does after a few too many socket errors), it will be restarted after 5 seconds. In practice this seems to mean that there’s about 30-120 seconds of downtime from the user perspective as DNS queries will start to slow and fail before the service fully dies, but that’s quick enough that streaming shouldn’t be interrupted and it beats having to manually restart.

The RuntimeMaxSec line will proactively shut down the service every 6-7 hours or so (23000 seconds is about 6.4 hours), and Restart=always will start it up again. My hope here is that if there’s some sort of leak, proactively restarting the service will keep things cleaner and avoid the errors I’m seeing. I’m not 100% sure how effective this is. It seems like it’s helping, because at first I was getting a crash about once every 24 hours and now I don’t think I’ve seen one for a few days. On the other hand, because the service now restarts when it crashes, I might just not be noticing the crashes 😛

Sammy1Am on Jul 19, 2020

If you are able to run lsof on the pid of nextdns, it would be helpful. A nextdns trace would help too.

rs on Jul 7, 2020
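As a rough complement to the lsof request above, the number of file descriptors a process holds can also be read straight from /proc on Linux and compared against its limit. This is a minimal sketch with several assumptions: Python 3 is available on the machine doing the check, the script has permission to read the target's /proc entries (typically root for another user's process), and the helper names `open_fd_count` and `soft_fd_limit` are invented here. It does not replace the full lsof output, which also shows what each descriptor points to.

```python
import os


def open_fd_count(pid):
    """Count entries in /proc/<pid>/fd, i.e. descriptors the process holds."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit(pid):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits.

    Returns None if the limit is 'unlimited' or the line is not found.
    """
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields: "Max", "open", "files", soft, hard, units
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None
```

Calling both functions with the nextdns pid every minute or so and logging the results would show whether the descriptor count climbs steadily toward the limit before the "too many open files" errors begin, which is what a leak would look like.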