VictoriaMetrics: error while fetching data from remote storage: snappy: decoded block is too large

Describe the bug

vmctl exits with the following error:

2023/04/24 20:46:22 remote read error: request failed for: error while fetching data from remote storage: error while sending request to http://localhost:10080/api/v1/read: Post "http://localhost:10080/api/v1/read": EOF; Data len 36(36)

when importing data via the Prometheus remote-read protocol from a Thanos Store Gateway.

This may not be a VM bug (the EOF on the vmctl side coincides with the panic in thanos-remote-read, see the logs below), but it is a blocker for migrating Thanos data into VM. Any help is welcome.

To Reproduce

  • Start thanos-remote-read pointing at a Thanos Store Gateway pod:
./bin/thanos-remote-read -store 10.68.6.4:10901 -log.level debug
  • Start vmctl to read from thanos-remote-read and write to a vmstorage pod:
./vmctl-prod  remote-read --remote-read-src-addr=http://localhost:10080 --remote-read-filter-time-start=2023-04-16T00:00:00Z --remote-read-step-interval=hour --vm-addr=http://10.68.5.59:8482 --vm-concurrency=2 --remote-read-filter-time-end=2023-04-16T12:00:00Z --verbose

Some start/end time ranges go through, but others stop processing.

Version

./vmctl-prod --version

vmctl version vmctl-20230407-010146-tags-v1.90.0-0-gb5d18c0d2
2023/04/24 20:57:46 Total time: 934.204µs

vmstorage is v1.90.0-cluster running in Google GKE.

Logs

thanos-remote-read:

./bin/thanos-remote-read -store 10.68.6.4:10901 -log.level debug

info: starting up thanos-remote-read...
ts=2023-04-24T20:44:26.867171185Z caller=main.go:278 level=info traceID=00000000000000000000000000000000 msg="thanos request" request="min_time:1681603200000 max_time:1681606799999 matchers:<type:RE name:\"__name__\" value:\".*\" > aggregates:RAW "

2023/04/24 20:46:22 http: panic serving 127.0.0.1:39718: snappy: decoded block is too large
goroutine 51 [running]:
net/http.(*conn).serve.func1()
        /usr/local/go/src/net/http/server.go:1854 +0xbf
panic({0xaaf440, 0xc000088ed0})
        /usr/local/go/src/runtime/panic.go:890 +0x263
github.com/golang/snappy.Encode({0x0?, 0xc2d9ed2240?, 0xb96dd0?}, {0xc417f00000?, 0xc0000342d0?, 0xc7a9e0?})
        /go/pkg/mod/github.com/golang/snappy@v0.0.1/encode.go:20 +0x2ba
main.(*API).remoteRead(0xc00218d7b0?, {0xc81c60, 0xc000034280}, 0xc00012a500, {0xc7a9e0, 0xc0001a4180})
        /go/pkg/mod/github.com/!g-!research/thanos-remote-read@v0.4.0/main.go:232 +0x626
main.setup.func2({0xc81c60?, 0xc000034280?}, 0x100?)
        /go/pkg/mod/github.com/!g-!research/thanos-remote-read@v0.4.0/main.go:163 +0x30
main.errorWrap.func1({0xc81c60, 0xc000034280}, 0xc78601?)
        /go/pkg/mod/github.com/!g-!research/thanos-remote-read@v0.4.0/main.go:169 +0x2b
net/http.HandlerFunc.ServeHTTP(0xc81fe0?, {0xc81c60?, 0xc000034280?}, 0xc786a8?)
        /usr/local/go/src/net/http/server.go:2122 +0x2f
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP(0xc0001b8180, {0x7f18948766f8?, 0xc0000341e0}, 0xc00012a100)
        /go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.16.0/handler.go:179 +0x971
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0xc81300?, 0xc00013e000?}, 0xc00012a100)
        /go/pkg/mod/github.com/prometheus/client_golang@v1.5.1/prometheus/promhttp/instrument_server.go:100 +0x94
net/http.HandlerFunc.ServeHTTP(0xc00013e000?, {0xc81300?, 0xc00013e000?}, 0xb94581?)
        /usr/local/go/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x0?, {0xc81300, 0xc00013e000}, 0xc00012a100)
        /usr/local/go/src/net/http/server.go:2500 +0x149
net/http.serverHandler.ServeHTTP({0xc7e3a8?}, {0xc81300, 0xc00013e000}, 0xc00012a100)
        /usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc0002a4360, {0xc81fe0, 0xc0001931d0})
        /usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3089 +0x5ed

vmctl:

./vmctl-prod  remote-read --remote-read-src-addr=http://localhost:10080 --remote-read-filter-time-start=2023-04-16T00:00:00Z --remote-read-step-interval=hour --vm-addr=http://10.68.5.59:8482 --vm-concurrency=2 --remote-read-filter-time-end=2023-04-16T12:00:00Z --verbose

Selected time range "2023-04-16 00:00:00 +0000 UTC" - "2023-04-16 12:00:00 +0000 UTC" will be split into 12 ranges according to "hour" step. Continue? [Y/n]
VM worker 0:_ ? p/s
VM worker 1:_ ? p/s
Processing ranges: 0 / 12 [_________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________] 0.00%
2023/04/24 20:46:22 Import finished!
2023/04/24 20:46:22 VictoriaMetrics importer stats:
  idle duration: 0s;
  time spent while importing: 1m56.725018692s;
  total samples: 0;
  samples/s: 0.00;
  total bytes: 0 B;
  bytes/s: 0 B;
  import requests: 0;
  import requests retries: 0;
2023/04/24 20:46:22 remote read error: request failed for: error while fetching data from remote storage: error while sending request to http://localhost:10080/api/v1/read: Post "http://localhost:10080/api/v1/read": EOF; Data len 36(36)

Screenshots

No response

Used command-line flags

No response

Additional information

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

I tried to understand what the issue is with snappy, but failed to identify a solution 😃 What I can tell is that the issue is 100% inside thanos-remote-read when encoding the response (even if the error text says it's a decode error… which is why it's puzzling me).
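If it helps, the confusing wording probably comes from golang/snappy itself: Encode panics with ErrTooLarge, whose text is "snappy: decoded block is too large", whenever MaxEncodedLen reports that the payload cannot be represented as a single snappy block (anything approaching 4 GiB). A minimal sketch of that behaviour, assuming the golang/snappy v0.0.1 pinned in the stack trace acts like current releases (a hypothetical demo, not thanos-remote-read code):

package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	// golang/snappy reuses one error value for "payload too big": its text
	// says "decoded block", but Encode panics with the same value.
	fmt.Println(snappy.ErrTooLarge) // snappy: decoded block is too large

	// MaxEncodedLen returns a negative value when the input is too large to
	// fit in one snappy block; Encode turns that into the panic seen in the
	// thanos-remote-read handler. (64-bit platform assumed for the 1<<32.)
	fmt.Println(snappy.MaxEncodedLen(64 << 20)) // 64 MiB source: positive length
	fmt.Println(snappy.MaxEncodedLen(1 << 32))  // ~4 GiB source: -1
}

If that reading is right, the fix has to land on the thanos-remote-read side (chunking or streaming the response), or the request has to be narrowed (time range / matchers) until a single response stays under that limit.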

For 1), I'm all-in. I looked at some code and it shouldn't be that hard. I used https://github.com/sepich/thanos-kit to dump the Thanos blocks into Prometheus-style metrics, and added the external labels from the meta.json file. It seems to be working this way, as long as you work on one block at a time…
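For reference, a minimal sketch (not thanos-kit's actual code) of pulling the external labels out of a block's meta.json so they can be attached to the dumped series; it assumes the standard Thanos block layout where external labels sit under the "thanos" section:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// blockMeta models only the parts of a Thanos block meta.json needed here:
// the block ULID and the external labels recorded under "thanos.labels".
type blockMeta struct {
	ULID   string `json:"ulid"`
	Thanos struct {
		Labels map[string]string `json:"labels"`
	} `json:"thanos"`
}

func main() {
	// Usage: go run . /path/to/block/meta.json
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}

	var meta blockMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		panic(err)
	}

	// Print the external labels as name="value" pairs so they can be appended
	// to every series dumped from this block before importing into VM.
	fmt.Printf("# block %s external labels:\n", meta.ULID)
	for name, value := range meta.Thanos.Labels {
		fmt.Printf("%s=%q\n", name, value)
	}
}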

So far i’m stopping working on Thanos data migration as we have way too much useless data and I will look into a way to define what needs to be migrated (who needs 3 years of up metric ?)