go-carbon: [Q] Context Canceled Errors on Large Queries to Carbonserver

I have a few large queries that are causing me issues. Originally, several of the larger queries with a lot of wildcards would fail with errors like this:

"error": "find failed, can't expand globs"

I increased max-globs in the config, and that seemed to fix the problem for most of the larger queries, but there's still one that causes issues, so I increased max-globs to an absurd value. This time I get a different error, though:

ERROR [access] find failed "format": "carbonapi_v3_pb", "runtime_seconds": 4.00071045, "reason": "Internal error while processing request", "error": "could not expand globs - context canceled", "http_code": 500

INFO [carbonserver] slow_expand_globs

My question is: what do I need to change to get this query to run? Skimming through the carbonserver code, it looks like this error isn't necessarily related to the value of max-globs, but is more of a timeout issue. On our Python-based Graphite this query takes about 8 or 9 seconds, and the timeout in our config is set much higher than that.

Our go-carbon setup runs on a 16-core box with EBS volumes for storage, and it's currently hovering around 8 million metrics. Here is our go-carbon config:

[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 14

[whisper]
data-dir = "/opt/graphite/storage/whisper"
schemas-file = "/opt/go-carbon/storage-schemas.conf"
aggregation-file = "/opt/go-carbon/storage-aggregation.conf"
workers = 6
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = true
enabled = true
hash-filenames = true
compressed = true
remove-empty-file = false

[cache]
max-size = 0
write-strategy = "noop"

[udp]
listen = ":2003"
enabled = true
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false

[carbonserver]
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = false
query-cache-size-mb = 2000
find-cache-enabled = false
trigram-index = true
scan-frequency = "2m30s"
trie-index = false
cache-scan = false
max-globs = 100000000
fail-on-max-globs = true
max-metrics-globbed = 100000000
max-metrics-rendered = 1000000
graphite-web-10-strict-mode = true
stats-percentiles = [99, 98, 95, 75, 50]

[pprof]
listen = "localhost:7007"
enabled = false

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 30 (15 by maintainers)

Most upvoted comments

@jdblack: the fix was done in https://github.com/go-graphite/go-carbon/pull/445; it was merged and released.

> the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.

I would treat that as a bug: an empty directory magically showing metrics which actually aren't there.

> the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.

Well, that's debatable. If a directory contains no whisper files, then it should not be indexed or visible, by the definition of a metric, IMO.

We’re running 0.15.6.

Are you using trie with file compression enabled too? Like I said, it's been a while since I looked at the trie index, but I believe the problem only occurred with file compression enabled. Any query basically just returns no metrics; it effectively behaves as if we don't have a single metric under our metrics directory.

And as far as the query goes, I think that’s part of the problem. It looks like expandGlobBraces builds out all the possible metric/file paths from a query that uses curly braces. Our query looks roughly like this:

metric.{10 values}.{46 values}.{46 values}.foo.{65 values}.{423 values}.bar

By my math (10 × 46 × 46 × 65 × 423), that means expandGlobBraces needs to build roughly 582 million metric paths. And then it also builds out every one of those paths again with .wsp on the end, so double that to about 1.16 billion metric paths.

If I pull each set of those curly braces out and put a wildcard (*) in their place, the query runs in about a second, so I think building out all those metric paths is what's causing the timeout here. For a little context, these queries come from Grafana, which handles multiselect variables in its queries using the curly-brace notation.
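
To illustrate the blow-up, here is a rough, self-contained sketch of what brace expansion has to do. This is not the go-carbon code (expandBraces is just an illustrative stand-in), but the shape is the same: every {...} group multiplies the number of output paths by its number of alternatives.

package main

import (
    "fmt"
    "strings"
)

// expandBraces naively expands the first {a,b,c} group it finds and
// recurses on the result, so the output size is the product of all
// group sizes (a Cartesian product).
func expandBraces(pattern string) []string {
    open := strings.Index(pattern, "{")
    if open == -1 {
        return []string{pattern}
    }
    end := strings.Index(pattern[open:], "}")
    if end == -1 {
        return []string{pattern}
    }
    prefix := pattern[:open]
    group := pattern[open+1 : open+end]
    suffix := pattern[open+end+1:]

    var out []string
    for _, alt := range strings.Split(group, ",") {
        out = append(out, expandBraces(prefix+alt+suffix)...)
    }
    return out
}

func main() {
    // Two groups of 3 alternatives each -> 3 * 3 = 9 expanded paths.
    for _, p := range expandBraces("metric.{a,b,c}.foo.{x,y,z}.bar") {
        fmt.Println(p)
    }
}

With group sizes of 10, 46, 46, 65, and 423, that product is the ~582 million paths above, and every one of them has to be allocated as a string, presumably before the filesystem is even consulted.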

I think I've narrowed the issue down to the expandGlobBraces function in carbonserver.go.

Running that function locally on my machine takes several minutes with the input I’m throwing at it. I’m going to do some more debugging and figure out what can be done to speed things up.
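
As a quick sanity check on the combinatorics, here is a small standalone sketch (again, not go-carbon code; countBraceExpansions is a name made up for this example) that only multiplies the group sizes together instead of materializing the paths. It confirms the ~582 million figure essentially instantly, which is why I suspect the minutes are going into building and holding all of those strings rather than into any filesystem work.

package main

import (
    "fmt"
    "strings"
)

// countBraceExpansions returns how many paths a curly-brace query would
// expand to, without materializing any of them: it just multiplies the
// sizes of the {...} groups together. Literal segments don't affect it.
func countBraceExpansions(pattern string) uint64 {
    total := uint64(1)
    for {
        open := strings.Index(pattern, "{")
        if open == -1 {
            return total
        }
        end := strings.Index(pattern[open:], "}")
        if end == -1 {
            return total
        }
        group := pattern[open+1 : open+end]
        total *= uint64(len(strings.Split(group, ",")))
        pattern = pattern[open+end+1:]
    }
}

func main() {
    // Build a pattern with the same group sizes as the problem query.
    sizes := []int{10, 46, 46, 65, 423}
    parts := []string{"metric"}
    for _, n := range sizes {
        alts := make([]string, n)
        for i := range alts {
            alts[i] = fmt.Sprintf("v%d", i)
        }
        parts = append(parts, "{"+strings.Join(alts, ",")+"}")
    }
    parts = append(parts, "bar")
    pattern := strings.Join(parts, ".")

    fmt.Println(countBraceExpansions(pattern)) // 581794200, computed instantly
}

A cheap up-front count like this is also the kind of guard one could imagine using to reject or rewrite a query before attempting the full expansion, but that's just an idea at this point.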