thanos: store: i/o timeouts with new memcached config

👋 Hello again and thanks for the quick response in https://github.com/thanos-io/thanos/issues/1974

Thanos version used: latest master with https://github.com/thanos-io/thanos/pull/1975

Object Storage Provider: AWS S3

What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts go away when I raise the timeout to 10s 🤕

The memcached servers look OK and the number of connections doesn’t exceed max_idle_connections.

Thanos Store process and Memcached servers are in the same AZ.

I tried the default config and also tweaked most of the parameters, to no avail.

Any recommendations?

Arguments:

thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml

Config:

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s

Logs:

level=warn ts=2020-01-10T11:57:31.215656814Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35492->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.215238251Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35490->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.076957889Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35496->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.831121025Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:34270->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.830995908Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50162->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754847072Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50160->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754715844Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:46124->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148112488Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34404->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148818833Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:47196->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.108056739Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:48012->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.105953129Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34276->10.42.8.43:11211: i/o timeout"


Most upvoted comments

So, I have fixed this issue for myself, as it turned out to be an implementation error on my end xD

After investigation, since the config (per above) did not seem to make any difference, I noticed that the problem was NOT the memcached index cache, but the bucket cache.

When I looked into it further, I noticed that I had made a mistake in its config. This was my config:

config:
  addresses: [*redacted*]
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB
  • I made the assumption that the bucket cache was not “that special” in comparison to the index cache
  • I missed the line in the docs that stated: “the config field for memcached supports all the same configuration as memcached for index cache”

So after changing my bucket cache config file to the following, the timeouts were gone:

type: MEMCACHED
config:
  addresses: [*redacted*]
  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB

To be honest, a few improvements could be made IMO:

  • Update docs here: https://thanos.io/tip/components/store.md/#caching-bucket with an example of actually implementing the config for memcached
  • Give better log info regarding which memcached type is hit (is the log for index or bucket??)
  • When doing batch requests, only the last error is output. Add a debug log for each error, rather than saving only the last one (see the sketch below this list).
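
On that last point, here is a minimal sketch of the idea (this is not the actual Thanos client code; sendBatch and storeBatches are purely illustrative names): collect and surface every batch error instead of overwriting a single lastErr variable and only reporting the final one.

package main

import (
    "errors"
    "fmt"
)

// sendBatch is a stand-in for a single memcached batch operation; in the real
// client each batch can fail independently.
func sendBatch(batch []string) error {
    if len(batch) == 0 {
        return errors.New("empty batch")
    }
    return nil
}

// storeBatches logs every failed batch at debug level and joins all errors,
// rather than keeping only the last one.
func storeBatches(batches [][]string) error {
    var errs []error
    for i, b := range batches {
        if err := sendBatch(b); err != nil {
            fmt.Printf("debug: batch %d failed: %v\n", i, err)
            errs = append(errs, err)
        }
    }
    return errors.Join(errs...) // nil if every batch succeeded (Go 1.20+)
}

func main() {
    err := storeBatches([][]string{{"key-1"}, {}, {"key-2"}})
    fmt.Println("combined error:", err)
}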

I’m a bit tired, but will make issues/PRs tomorrow or something 😉

Hi, I am getting the same error.

error log

{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51390->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CZDVM6GTEW87W89YW4JDEW:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":4125,"ts":"2021-04-05T09:15:15.625479696Z"}
{"caller":"memcached_client.go:382","err":"read tcp 10.1.38.29:41902->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CHPCYEEKE1TDV4GT526A27:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16751,"ts":"2021-04-05T09:15:15.63488144Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51376->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2BX37C68PA5C9WH9DJBX1PC:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":4036,"ts":"2021-04-05T09:15:15.635151769Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:41920->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CRJ46EPJPAJS9EYWEYKR03:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16705,"ts":"2021-04-05T09:15:15.635206581Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51396->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2C3YYEE3MY9ZY93HG4QEP06:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16707,"ts":"2021-04-05T09:15:15.635329868Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47862->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2DD59YDBWQV1QFNYZKWH745:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16649,"ts":"2021-04-05T09:15:15.659835826Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:49368->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2DTWRERTMH5F622XX5904YY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16698,"ts":"2021-04-05T09:15:15.685193363Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47844->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2E1RFPASEG1CYTZK6M4V3BY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16717,"ts":"2021-04-05T09:15:15.685274897Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47868->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CZDVEB4E6W15BMRM0NMYYJ:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16692,"ts":"2021-04-05T09:15:15.718721764Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:41866->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2E1RFPASEG1CYTZK6M4V3BY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16717,"ts":"2021-04-05T09:15:15.719999645Z"}

thanos and system version

thanos, version 0.19.0 (branch: HEAD, revision: ba6c5c4726ff52807c7383c68f2159b1af7980bb)
  build user:       root@b7176da2c64e
  build date:       20210331-12:23:43
  go version:       go1.15.10
  platform:         linux/amd64

uname -ra
Linux hostname 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

systemd

[Unit]
Description=thanos store  service
After=network.target

[Service]
LimitNOFILE=65535
Type=simple
Restart=always
RestartSec=1

ExecStart=/data/app/thanos/bin/thanos store --data-dir=/data/thanos \
                --log.format=json \
                --index-cache-size=512MB  \
                --objstore.config-file=/data/app/thanos/conf.d/object.json \
                --sync-block-duration=1m \
                --ignore-deletion-marks-delay=124h  \
                --store.enable-index-header-lazy-reader \
                --min-time=-1w \
                --max-time=-20h \
                --index-cache.config-file=/data/app/thanos/conf.d/index.json \
                --store.caching-bucket.config-file=/data/app/thanos/conf.d/cache.json

[Install]
WantedBy=multi-user.target

index-cache-config

type: MEMCACHED
config:
  addresses:
    - "ip:port"
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB

store-cache-bucket

type: MEMCACHED
config:
  addresses:
    - "ip:port"
  max_idle_connections: 1000
  max_async_concurrency: 200
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB


Thanks!

I talked offline to @kuberkaul and shared some tips. In parallel, I’m investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (will work on it shortly).
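
For context, the gate mentioned above is essentially a concurrency limiter. Below is a rough, simplified sketch, similar in spirit to but not the actual Thanos gate, showing how a gate bounds the number of in-flight GetMulti batches and how batches queued behind it can exhaust their deadline before ever reaching memcached:

package main

import (
    "context"
    "fmt"
    "time"
)

// gate is a minimal concurrency limiter (illustrative only, not the real
// Thanos implementation).
type gate chan struct{}

func newGate(maxConcurrent int) gate { return make(gate, maxConcurrent) }

// Start blocks until a slot is free or the context is done.
func (g gate) Start(ctx context.Context) error {
    select {
    case g <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func (g gate) Done() { <-g }

func main() {
    // With a capacity of 2, only two batches run at once; the rest queue in
    // Start() and may run out of time before they ever reach memcached.
    g := newGate(2)
    for i := 0; i < 5; i++ {
        i := i
        go func() {
            ctx, cancel := context.WithTimeout(context.Background(), time.Second)
            defer cancel()
            if err := g.Start(ctx); err != nil {
                fmt.Println("batch", i, "timed out waiting for the gate:", err)
                return
            }
            defer g.Done()
            time.Sleep(600 * time.Millisecond) // simulated GetMulti round trip
            fmt.Println("batch", i, "done")
        }()
    }
    time.Sleep(3 * time.Second) // crude wait for the goroutines in this sketch
}

With a capacity of 2 and a 600ms simulated round trip, one of the five batches never gets a slot before its 1s deadline, so it fails while still queued on the gate.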

Is there any update on this?

I am also getting the error below while trying to query historical data.

level=warn ts=2022-03-08T21:07:40.894536137Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:342336000:342352000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899116373Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=21 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:376128000:376144000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899390159Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=101 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:364272000:364288000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899632943Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=45 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:335872000:335888000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899892628Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=78 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:339424000:339440000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.900115794Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=16 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366992000:367008000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901373702Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=3 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366416000:366432000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901437644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=86 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:343088000:343104000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901620772Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:354096000:354112000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901786015Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=77 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:378976000:378992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901964644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=53 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:400560000:400576000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902197572Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:383280000:383296000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902400874Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=18 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:377664000:377680000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917716417Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=48 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:230000000:230016000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917780801Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=113 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:238448000:238464000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917982688Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=126 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:235616000:235632000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918115958Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=51 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:225808000:225824000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918333518Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=122 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:232976000:232992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947289769Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=104 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:388032000:388048000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947434684Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:355552000:355568000 err="memcache: connect timeout to 10.98.42.174:11211"

I have two questions:

  1. Where does the bucket cache config help? Does it help to fetch historical data (which is not available in the prometheus-thanos sidecar pod) more quickly when querying, or does it help to load the indexes into memory while the thanos store pods are starting?
  2. How can I figure out what is causing the errors above?

My thanos store memcached configurations are given below.

index caching config - memcached

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
  max_item_size: 0

bucket caching config - memcached

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_item_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h

I had set the value 0 for max_get_multi_batch_size, max_async_buffer_size, max_item_size, and max_get_multi_concurrency to allow unlimited capacity, so that the limits themselves shouldn’t cause any issue.

Thanks in advance

I’ve run some benchmarks in our cluster, using two tools:

  1. memtier_benchmark by RedisLabs
  2. bench: a quick and dirty benchmark tool based on the Thanos MemcachedClient (code here)

Both benchmarks run against the same memcached server (1 instance), running in the same Kubernetes cluster where the benchmarking tool is running.
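
For anyone who wants to reproduce something similar without the bench tool above, here is a rough sketch of such a GetMulti benchmark. It uses the plain github.com/bradfitz/gomemcache client rather than the Thanos MemcachedClient, and the server address, client count, request count and key count are placeholders to adjust:

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"

    "github.com/bradfitz/gomemcache/memcache"
)

const (
    clients    = 200
    reqsPerCli = 100 // the real run used 10000 per client; lowered to keep the sketch quick
    keysPerReq = 100
)

func main() {
    mc := memcache.New("10.32.21.9:11211") // replace with your memcached address
    mc.MaxIdleConns = 400
    mc.Timeout = time.Second

    // Pre-populate the keys so GetMulti actually hits.
    keys := make([]string, keysPerReq)
    for i := range keys {
        keys[i] = fmt.Sprintf("bench-%d", i)
        val := make([]byte, 100+rand.Intn(100)) // 100-200 byte values
        if err := mc.Set(&memcache.Item{Key: keys[i], Value: val}); err != nil {
            fmt.Println("set failed:", err)
        }
    }

    var (
        wg     sync.WaitGroup
        mu     sync.Mutex
        maxLat time.Duration
        total  time.Duration
        errs   int
    )
    start := time.Now()
    for c := 0; c < clients; c++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for r := 0; r < reqsPerCli; r++ {
                t0 := time.Now()
                _, err := mc.GetMulti(keys)
                lat := time.Since(t0)
                mu.Lock()
                total += lat
                if lat > maxLat {
                    maxLat = lat
                }
                if err != nil {
                    errs++
                }
                mu.Unlock()
            }
        }()
    }
    wg.Wait()

    ops := clients * reqsPerCli
    fmt.Printf("GETMULTI ops: %d errors: %d\n", ops, errs)
    fmt.Printf("GETMULTI avg: %v max: %v\n", total/time.Duration(ops), maxLat)
    fmt.Printf("ops/s: %.2f\n", float64(ops)/time.Since(start).Seconds())
}

This only approximates the real tool (every request fetches the same 100 keys), but it is enough to compare rough GetMulti throughput and latency against memtier_benchmark.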

The results

$ memtier_benchmark --server=10.32.21.9 --port=11211 --protocol=memcache_text --requests=10000 --clients=200 --ratio=1:1 --data-size-range=100-200 --key-pattern=S:S --multi-key-get=100

4         Threads
200       Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        77044.82          ---          ---      5.23000     13995.64
Gets        77044.82     77044.82         0.00      5.22700     15107.53
Waits           0.00          ---          ---      0.00000          ---
Totals     154089.65     77044.82         0.00      5.22900     29103.17

Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
[REDACTED]
SET      56.000       100.00
---
[REDACTED]
GET      56.000       100.00
---
$ ./bench
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 5 ms max: 76 ms
GETMULTI ops: 2000000 ops/s: 36058.11037331169
Total time: 55.466023574s

Comparison on GetMulti()

  • Raw performances: 77044 ops/s, 5.2ms avg latency, 56ms max latency
  • Benchmark based on Thanos client: 36058 ops/s, 5ms avg latency, 76ms max latency

The performance of the Thanos client doesn’t look bad to me. It’s about 50% of the raw performance, which looks “reasonable” to me (it shows there’s room for improvement, but it’s not bad compared to a C++ tool designed to squeeze out performance with no business logic around it).

What’s next

In addition to the latest comments above:

  • Could you run the same benchmark?
  • Could you make sure that your Thanos store doesn’t hit its CPU limit? Another reason why you may see these i/o timeouts is the Thanos store not having enough CPU resources.