coredns: coredns doesn't perform better despite having more cores

We are running CoreDNS 1.9.3 (retrieved from the official releases on GitHub), and have been having difficulty increasing the performance of a single CoreDNS instance.

With GOMAXPROCS set to 1, we observe ~60k qps and full utilization of one core.

With GOMAXPROCS set to 2, we seem to hit a performance limit of ~90-100k qps while consuming almost two full cores.

With GOMAXPROCS set to 4, we observe that coredns uses all 4 cores, but throughput does not increase and latency stays the same.

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

We have the following Corefile:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

db.example.org:

$ORIGIN example.org.
@       3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2017042745 7200 3600 1209600 3600
        3600 IN NS a.iana-servers.net.
        3600 IN NS b.iana-servers.net.

www     IN A     127.0.0.1
        IN AAAA  ::1

We are using dnsperf: https://github.com/DNS-OARC/dnsperf

And the following command:

  dnsperf -d test.txt -s 127.0.0.1 -p 55 -Q 10000000 -c 1 -l 10000000 -S .1 -t 8

test.txt:

www.example.com AAAA

Is there anything we could be missing?

Thanks!

Most upvoted comments

Yes @lobshunter, that is correct. I think the LWN article explains the improvements and a few caveats (especially with TCP) of using the SO_REUSEPORT option. Last week, I validated the improvements by simply starting multiple servers on the same port (since we already set that option at ListenPacket, as seen here) after making the following code changes:

diff --git a/core/dnsserver/register.go b/core/dnsserver/register.go
index 8de55906..ac581eca 100644
--- a/core/dnsserver/register.go
+++ b/core/dnsserver/register.go
@@ -3,6 +3,8 @@ package dnsserver
 import (
 	"fmt"
 	"net"
+	"os"
+	"strconv"
 	"time"
 
 	"github.com/coredns/caddy"
@@ -157,36 +159,43 @@ func (h *dnsContext) MakeServers() ([]caddy.Server, error) {
 	}
 	// then we create a server for each group
 	var servers []caddy.Server
-	for addr, group := range groups {
-		// switch on addr
-		switch tr, _ := parse.Transport(addr); tr {
-		case transport.DNS:
-			s, err := NewServer(addr, group)
-			if err != nil {
-				return nil, err
-			}
-			servers = append(servers, s)
-
-		case transport.TLS:
-			s, err := NewServerTLS(addr, group)
-			if err != nil {
-				return nil, err
-			}
-			servers = append(servers, s)
-
-		case transport.GRPC:
-			s, err := NewServergRPC(addr, group)
-			if err != nil {
-				return nil, err
-			}
-			servers = append(servers, s)
-
-		case transport.HTTPS:
-			s, err := NewServerHTTPS(addr, group)
-			if err != nil {
-				return nil, err
-			}
-			servers = append(servers, s)
-		}
-	}
+	numSock, err := strconv.ParseInt(os.Getenv("NUM_SOCK"), 10, 64)
+	if err != nil {
+		numSock = 1
+	}
+	for i := 0; i < int(numSock); i++ {
+		for addr, group := range groups {
+			// switch on addr
+			switch tr, _ := parse.Transport(addr); tr {
+			case transport.DNS:
+				s, err := NewServer(addr, group)
+				if err != nil {
+					return nil, err
+				}
+				servers = append(servers, s)
+
+			case transport.TLS:
+				s, err := NewServerTLS(addr, group)
+				if err != nil {
+					return nil, err
+				}
+				servers = append(servers, s)
+
+			case transport.GRPC:
+				s, err := NewServergRPC(addr, group)
+				if err != nil {
+					return nil, err
+				}
+				servers = append(servers, s)
+
+			case transport.HTTPS:
+				s, err := NewServerHTTPS(addr, group)
+				if err != nil {
+					return nil, err
+				}
+				servers = append(servers, s)
+			}
+		}
+	}
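
For context, setting SO_REUSEPORT before bind(2) is what lets several servers share port 55 and have the kernel spread packets across them. A minimal standalone sketch of how that is typically done in Go with net.ListenConfig (illustrative only, not CoreDNS's actual listener code; it assumes golang.org/x/sys/unix is available):

// reuseport.go: minimal sketch of an SO_REUSEPORT UDP listener.
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReuseport opens a UDP socket with SO_REUSEPORT set before bind(2),
// so any number of such sockets can share the same address and the kernel
// will load-balance incoming packets across them.
func listenReuseport(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var opErr error
			if err := c.Control(func(fd uintptr) {
				// Must be set before bind for the port sharing to take effect.
				opErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return opErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	pc, err := listenReuseport(":55")
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()
	log.Println("listening on", pc.LocalAddr())
}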

Essentially, I've just exposed an env var NUM_SOCK representing the number of sockets (and thereby servers) to use for serving requests. To validate the improvement, I used the same Corefile as in the issue description above:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

1. With a single listen socket, I'm able to achieve ~130K qps throughput from dnsperf on a private cloud instance.

$ NUM_SOCK=1 taskset -c 2-35 ./coredns-fix
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         5919568
  Queries completed:    5919470 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  98 (0.00%)

  Response codes:       NOERROR 5919470 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         45.693927
  Queries per second:   129546.099200

  Average Latency (s):  0.000756 (min 0.000016, max 0.006743)
  Latency StdDev (s):   0.000400
CoreDNS CPU Utilization: 275%
DNS Perf CPU Utilization: 480%

2. With two listen sockets, I'm able to achieve ~235K qps throughput from dnsperf.

$ NUM_SOCK=2 taskset -c 2-35 ./coredns-fix
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                      *:55                *:*
UNCONN 0      0                      *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         17760093
  Queries completed:    17759997 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  96 (0.00%)

  Response codes:       NOERROR 17759997 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         75.404526
  Queries per second:   235529.588768

  Average Latency (s):  0.000411 (min 0.000018, max 0.006754)
  Latency StdDev (s):   0.000379
CoreDNS CPU Utilization: 570%
DNS Perf CPU Utilization: 780%

3. With four listen sockets, I'm able to achieve ~400K qps throughput from dnsperf.

$ NUM_SOCK=4 taskset -c 2-35 ./coredns-fix
.:55
.:55
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         20535534
  Queries completed:    20535443 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  91 (0.00%)

  Response codes:       NOERROR 20535443 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         51.342591
  Queries per second:   399968.965337

  Average Latency (s):  0.000235 (min 0.000020, max 0.003655)
  Latency StdDev (s):   0.000197
CoreDNS CPU Utilization: 1371%
DNS Perf CPU Utilization: 1191%

So, I think the bottleneck was indeed the throughput limit of a single socket, and we're able to scale throughput almost linearly as we increase the number of listen sockets. I'll create a pull request after validating TCP traffic (non-TLS) when I get some more time. Thanks.

A memo: I found an interesting approach that uses SO_REUSEPORT and multiple net.ListenUDP calls. According to the author's benchmark, it outperforms the single-listener, multiple-ReadFromUDP approach.

I shall give it a try when I get time.
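
For reference, a rough sketch of the pattern that memo describes, reusing the hypothetical listenReuseport helper sketched earlier: N SO_REUSEPORT sockets on one port, each drained by its own goroutine, rather than one socket with many concurrent ReadFrom callers:

// serveN starts n SO_REUSEPORT listeners on the same address, each with its
// own read loop, so the kernel load-balances packets across n sockets.
// handle stands in for whatever per-query processing the server does
// (hypothetical signature, not from the linked post).
func serveN(addr string, n int, handle func(pkt []byte, from net.Addr, pc net.PacketConn)) error {
	for i := 0; i < n; i++ {
		pc, err := listenReuseport(addr) // helper sketched above
		if err != nil {
			return err
		}
		go func(pc net.PacketConn) {
			buf := make([]byte, 65535) // one buffer per socket/goroutine
			for {
				m, from, err := pc.ReadFrom(buf)
				if err != nil {
					return // socket closed
				}
				// handle runs synchronously, so buf is safe to reuse
				// on the next iteration.
				handle(buf[:m], from, pc)
			}
		}(pc)
	}
	return nil
}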

I could try to find a way. But I do agree with the Redis team's philosophy: scaling horizontally is paramount, and CoreDNS can scale horizontally pretty well. So it's not a critical issue that it doesn't scale vertically.

PS: @Lobshunter86 is me, too.

Could be something like that.

Generally, if giving it more CPU doesn't help, it is because you are hitting other bottlenecks. The question is whether those are in the CoreDNS code (for example, some mutex contention or something), or in the underlying OS or hardware. In this case it looks like writing to the UDP socket. Look into tuning UDP performance on your kernel. You may want to look at your UDP write buffer sizes, for example.
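
For example (the values below are illustrative only; appropriate sizes depend on kernel and workload), the socket buffer ceilings and the UDP error counters can be inspected and raised with:

$ sysctl net.core.rmem_max net.core.wmem_max   # current socket buffer ceilings
$ sudo sysctl -w net.core.rmem_max=8388608     # example: raise receive ceiling to 8 MiB
$ sudo sysctl -w net.core.wmem_max=8388608     # example: raise send ceiling to 8 MiB
$ netstat -su | grep -i 'buffer errors'        # nonzero counters indicate drops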