qdrant: Timeout on collection deletion

Hey guys!

Current Behavior

HTTP deletion with default timeout

curl -X DELETE 127.0.0.1:6333/collections/collection-to-be-deleted

{"status":{"error":"Service internal error: Waiting for consensus operation commit failed. Timeout set at: 10 seconds"},"time":10.001176907}

gRPC deletion with timeout set to 100 seconds

grpcurl -plaintext -d '{"timeout": 100, "collection_name": "collection-to-be-deleted"}' 127.0.0.1:6334 qdrant.Collections/Delete
ERROR:
  Code: Internal
  Message: Service internal error: Waiting for consensus operation commit failed. Timeout set at: 100 seconds

Steps to Reproduce

  1. Create a collection using gRPC:
grpcurl -plaintext \
  -d '{"collection_name": "delete-me", "hnsw_config": {"m": 16, "ef_construct": 100}, "optimizers_config": {"indexing_threshold": 100000}, "replication_factor": 2, "write_consistency_factor": 2, "vectors_config": {"params": {"size": 512, "distance": "Cosine"}}}'  \
  localhost:6334 qdrant.Collections/Create 
  2. Create two integer indexes (foo, bar). The command below creates foo; repeat it with "field_name": "bar":
grpcurl -plaintext \
  -d '{"collection_name": "delete-me", "field_name": "foo", "field_type": 1, "wait": true}' \
  localhost:6334 qdrant.Points/CreateFieldIndex
  3. Fill the collection with a batch of points (415 in our case), as in the following example:
grpcurl -plaintext \
  -d '{"collection_name": "delete-me", "points": [{"id": {"uuid": "400c9f7e-37c6-451d-b600-50cfcc9876fa"}, "vectors": {"vector": {"data": []}}, "payload": {"foo": {"integer_value": 1}, "bar": {"integer_value": 2}}}],  "wait": true}' \
   localhost:6334 qdrant.Points/Upsert

Keep in mind that I dropped the vector content since it’s too big to attach as an example, and that the ID should be unique per point.

  4. Try to delete the collection with the default timeout:

grpcurl -plaintext -d '{"collection_name": "delete-me"}' localhost:6334 qdrant.Collections/Delete
  5. If the request times out, increase the timeout by passing it in the request body:
grpcurl -plaintext -d '{"collection_name": "delete-me", "timeout": 100}' localhost:6334 qdrant.Collections/Delete
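
Since the upsert example above omits the vector data, here is a minimal sketch of how a complete payload could be generated for each point. The helper names (make_vector, make_uuid, make_point_payload) and the pseudo-random vector values are assumptions for illustration; the original report does not say what the real vectors contained. The vector length of 512 matches the collection created in the first step.

```shell
# Sketch only: helper names and random vector values are assumptions,
# not taken from the original report.

make_vector() {
  # Print `size` comma-separated pseudo-random floats in [0, 1).
  local size="$1" seed="$2"
  awk -v n="$size" -v seed="$seed" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) printf "%s%.6f", (i ? "," : ""), rand()
  }'
}

make_uuid() {
  # Prefer uuidgen; fall back to the Linux kernel's random UUID source.
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]'
  else
    cat /proc/sys/kernel/random/uuid
  fi
}

make_point_payload() {
  # Build one Upsert request body with a fresh UUID and a 512-dim vector.
  local collection="$1" seed="$2"
  printf '{"collection_name": "%s", "points": [{"id": {"uuid": "%s"}, "vectors": {"vector": {"data": [%s]}}, "payload": {"foo": {"integer_value": 1}, "bar": {"integer_value": 2}}}], "wait": true}\n' \
    "$collection" "$(make_uuid)" "$(make_vector 512 "$seed")"
}

# Usage against a live cluster (commented out: needs grpcurl and a running node):
# for i in $(seq 1 415); do
#   grpcurl -plaintext -d "$(make_point_payload delete-me "$i")" \
#     localhost:6334 qdrant.Points/Upsert
# done
```

Using the loop index as the seed keeps the vectors different from point to point while making a run repeatable.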

Expected Behavior

Collection deletion should not get stuck on timeouts.

Possible Solution

Context (Environment)

  1. qdrant version: v1.6.1
  2. 3 qdrant nodes in the same network, deployed on separate bare-metal machines
  3. Logs from qdrant during timeouts:
ERROR qdrant::tonic::logging: gRPC /qdrant.Collections/Delete unexpectedly failed with Internal error "Service internal error: Waiting for consensus operation commit failed. Timeout set at: 30 seconds"
  4. This issue happens in only one of our environments
  5. The log message appears to be produced by the following method: https://github.com/qdrant/qdrant/blob/5c36256caf94f7a90e31248415fd643818a09543/lib/storage/src/content_manager/consensus_manager.rs#L526
  6. Cluster info reports that everything is OK with the cluster:
[user@hostname ~]$ curl 192.168.0.1:6333/cluster | jq
{
  "result": {
    "status": "enabled",
    "peer_id": 1325853725882637,
    "peers": {
      "5300386240383757": {
        "uri": "http://192.168.0.1:6335/"
      },
      "1325853725882637": {
        "uri": "http://192.168.0.2:6335/"
      },
      "4135484918875342": {
        "uri": "http://192.168.0.3:6335/"
      }
    },
    "raft_info": {
      "term": 665,
      "commit": 5140,
      "pending_operations": 0,
      "leader": 1325853725882637,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2023-12-01T11:18:20.594696103Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.000008452
}
  7. Additionally, we’ve tried to create a new collection from the problematic one by creating a snapshot and recovering from it:
 curl -X POST 127.0.0.1:6333/collections/collection-to-be-deleted/snapshots
 
 curl -X GET 127.0.0.1:6333/collections/collection-to-be-deleted/snapshots/collection-to-be-deleted-1325853725882637-2023-12-01-09-33-55.snapshot > snapshot
 
 curl -X POST 127.0.0.1:6333/collections/recovered-collection/snapshots/upload?wait=true -H 'Accept: application/json' -H 'Content-Type: multipart/form-data' --form snapshot=@./snapshot

curl -X DELETE 127.0.0.1:6333/collections/recovered-collection  # No timeout, operation is successful
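
The cluster info shown above comes from a single node. When a consensus commit times out, it may be worth comparing the Raft fields reported by every peer. The sketch below is one way to do that; the helper names raft_field and raft_summary are hypothetical, and it uses grep rather than jq only to avoid an extra dependency. It is a diagnostic aid under these assumptions, not a confirmed way to find the root cause.

```shell
# Sketch only: helper names are hypothetical.

raft_field() {
  # Extract the first numeric value of a JSON field from /cluster output on stdin.
  local field="$1"
  grep -o "\"$field\": *[0-9]*" | head -n 1 | grep -o '[0-9][0-9]*$'
}

raft_summary() {
  # Read one /cluster response from stdin and print its Raft fields on one line.
  local json f
  json="$(cat)"
  for f in term commit pending_operations leader; do
    printf '%s=%s ' "$f" "$(printf '%s' "$json" | raft_field "$f")"
  done
  echo
}

# Usage against the three nodes from the report (commented out: needs a live cluster):
# for node in 192.168.0.1 192.168.0.2 192.168.0.3; do
#   printf '%s: ' "$node"
#   curl -s "$node:6333/cluster" | raft_summary
# done
```

If the peers report different term, commit, or leader values, that difference would be worth attaching to the report.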

Detailed Description

We’ve started to see an issue with deleting collections via both HTTP and gRPC. The Qdrant server cannot complete the delete operation within the default timeout (10 seconds), and the issue persists even when we set the timeout to a high value (for instance, 100 seconds).
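
The HTTP example in this report only uses the default timeout. Qdrant's REST API documents a `timeout` query parameter (in seconds) for collection deletion, so the same escalation could be tried over HTTP roughly as follows. The helper name delete_url and the chosen timeout values are assumptions for illustration.

```shell
# Sketch only: delete_url is a hypothetical helper.

delete_url() {
  # Build the DELETE URL with an explicit commit timeout in seconds.
  local host="$1" collection="$2" timeout="$3"
  printf 'http://%s/collections/%s?timeout=%s\n' "$host" "$collection" "$timeout"
}

# Usage against a live node (commented out: needs a running cluster):
# for t in 10 30 100; do
#   curl -sS -X DELETE "$(delete_url 127.0.0.1:6333 collection-to-be-deleted "$t")"
#   echo
# done
```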

The collection we are trying to delete (there are several of them, to be honest) has only 415 points and is no longer used by clients: there are no upsert/search operations on it.

I understand that the issue is quite difficult to reproduce, but it would be nice if we could get some help with resolving it 😃

Possible Implementation

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 1
  • Comments: 29 (10 by maintainers)

Most upvoted comments

Qdrant 1.7.4 has just been released, which may improve the situation around this.

Hi @mautini, we usually recommend creating payload indexes before uploading the data. Creating an index on an existing collection is expected to be a long-running operation, and it can time out, especially if you run it with wait=true.

So I think the timeout during index creation is not related to the original issue.

No issue on my side since 1.7.4

I saw a similar issue when creating or deleting collections though listing collections worked fine.

My cluster version was v1.6.1. Upgrading the cluster to v1.7.1 solved the issue. I hope this helps people seeing similar issues.