qdrant: Timeout on collection deletion
Hey guys!
Current Behavior
HTTP deletion with default timeout
curl -X DELETE 127.0.0.1:6333/collections/collection-to-be-deleted
{"status":{"error":"Service internal error: Waiting for consensus operation commit failed. Timeout set at: 10 seconds"},"time":10.001176907}
gRPC deletion with timeout set to 100 seconds
grpcurl -plaintext -d '{"timeout": 100, "collection_name": "collection-to-be-deleted"}' 127.0.0.1:6334 qdrant.Collections/Delete
ERROR:
Code: Internal
Message: Service internal error: Waiting for consensus operation commit failed. Timeout set at: 100 seconds
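For comparison with the gRPC call above, the HTTP request can also carry an explicit timeout. The sketch below assumes the REST delete endpoint accepts a timeout query parameter in seconds; delete_collection_url and delete_collection are hypothetical helper names, not part of Qdrant or curl.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of Qdrant or curl.

# Build the REST URL for deleting a collection with an explicit timeout.
# Assumption: the endpoint accepts ?timeout=<seconds>.
delete_collection_url() {
  local host="$1" collection="$2" timeout="$3"
  printf 'http://%s/collections/%s?timeout=%s\n' "$host" "$collection" "$timeout"
}

# Send the DELETE request; curl prints the JSON response from the node.
delete_collection() {
  curl -sS -X DELETE "$(delete_collection_url "$@")"
}

# Usage (needs a reachable node):
# delete_collection 127.0.0.1:6333 collection-to-be-deleted 100
```

If the failure is the same as over gRPC, the response should report the consensus timeout at the value passed here rather than at the default.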
Steps to Reproduce
- Create a collection using gRPC:
grpcurl -plaintext \
-d '{"collection_name": "delete-me", "hnsw_config": {"m": 16, "ef_construct": 100}, "optimizers_config": {"indexing_threshold": 100000}, "replication_factor": 2, "write_consistency_factor": 2, "vectors_config": {"params": {"size": 512, "distance": "Cosine"}}}' \
localhost:6334 qdrant.Collections/Create
- Create two integer indexes (foo and bar); the command for foo is shown, bar is created the same way:
grpcurl -plaintext \
-d '{"collection_name": "delete-me", "field_name": "foo", "field_type": 1, "wait": true}' \
localhost:6334 qdrant.Points/CreateFieldIndex
- Fill the collection with a batch of points (415 in our case), like the following example:
grpcurl -plaintext \
-d '{"collection_name": "delete-me", "points": [{"id": {"uuid": "400c9f7e-37c6-451d-b600-50cfcc9876fa"}, "vectors": {"vector": {"data": []}}, "payload": {"foo": {"integer_value": 1}, "bar": {"integer_value": 2}}}], "wait": true}' \
localhost:6334 qdrant.Points/Upsert
Keep in mind that I dropped the vector content, since it is too big to attach as an example, and that the ID must be unique per point.
- Try to delete the collection with the default timeout:
grpcurl -plaintext -d '{"collection_name": "delete-me"}' localhost:6334 qdrant.Collections/Delete
- If the request times out, increase the timeout by passing it in the request body:
grpcurl -plaintext -d '{"collection_name": "delete-me", "timeout": 100}' localhost:6334 qdrant.Collections/Delete
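Since the vector content was dropped from the upsert example, the step that fills the collection can be sketched as follows. This is a minimal sketch that generates a random vector per point; make_upsert_body is a hypothetical helper, the random values stand in for the real embeddings, and the collection name, payload fields, and vector size of 512 are taken from the steps above.

```shell
#!/usr/bin/env bash
# make_upsert_body COLLECTION UUID DIM [SEED]
# Hypothetical helper: prints an Upsert request body for one point.
# The vector is random filler, not the data from the original report.
make_upsert_body() {
  local collection="$1" uuid="$2" dim="$3" seed="${4:-1}"
  local vector
  # Generate DIM comma-separated pseudo-random floats in [0, 1).
  vector="$(awk -v n="$dim" -v seed="$seed" 'BEGIN {
    srand(seed)
    for (i = 1; i <= n; i++) printf "%s%.6f", (i > 1 ? ", " : ""), rand()
  }')"
  printf '{"collection_name": "%s", "points": [{"id": {"uuid": "%s"}, "vectors": {"vector": {"data": [%s]}}, "payload": {"foo": {"integer_value": 1}, "bar": {"integer_value": 2}}}], "wait": true}\n' \
    "$collection" "$uuid" "$vector"
}

# Usage (needs uuidgen, grpcurl, and a reachable node); "-d @" makes
# grpcurl read the request body from stdin:
# for i in $(seq 1 415); do
#   make_upsert_body delete-me "$(uuidgen)" 512 "$i" |
#     grpcurl -plaintext -d @ localhost:6334 qdrant.Points/Upsert
# done
```

Passing the loop counter as the seed gives every point a different vector, and a fresh UUID per iteration keeps the point IDs unique.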
Expected Behavior
Collection deletion should not get stuck on timeouts.
Possible Solution
Context (Environment)
- qdrant version: v1.6.1
- 3 qdrant nodes in the same network, deployed on separate bare-metal machines
- Logs from qdrant during timeouts:
ERROR qdrant::tonic::logging: gRPC /qdrant.Collections/Delete unexpectedly failed with Internal error "Service internal error: Waiting for consensus operation commit failed. Timeout set at: 30 seconds"
- This issue happens on only one of our environments
- The error message appears to be produced in the following method: https://github.com/qdrant/qdrant/blob/5c36256caf94f7a90e31248415fd643818a09543/lib/storage/src/content_manager/consensus_manager.rs#L526
- Cluster info returns that everything is OK with the cluster:
[user@hostname ~]$ curl 192.168.0.1:6333/cluster | jq
{
"result": {
"status": "enabled",
"peer_id": 1325853725882637,
"peers": {
"5300386240383757": {
"uri": "http://192.168.0.1:6335/"
},
"1325853725882637": {
"uri": "http://192.168.0.2:6335/"
},
"4135484918875342": {
"uri": "http://192.168.0.3:6335/"
}
},
"raft_info": {
"term": 665,
"commit": 5140,
"pending_operations": 0,
"leader": 1325853725882637,
"role": "Leader",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2023-12-01T11:18:20.594696103Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.000008452
}
- Additionally, we’ve tried to create a new collection out of the problem one by creating a snapshot and recovering from it:
curl -X POST 127.0.0.1:6333/collections/collection-to-be-deleted/snapshots
curl -X GET 127.0.0.1:6333/collections/collection-to-be-deleted/snapshots/collection-to-be-deleted-1325853725882637-2023-12-01-09-33-55.snapshot > snapshot
curl -X POST 127.0.0.1:6333/collections/recovered-collection/snapshots/upload?wait=true -H 'Accept: application/json' -H 'Content-Type: multipart/form-data' --form snapshot=@./snapshot
curl -X DELETE 127.0.0.1:6333/collections/recovered-collection # No timeout, operation is successful
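The cluster info above comes from a single node. One way to dig further, on the assumption that a peer may be lagging behind the leader, is to query /cluster on every peer and compare the Raft term and commit index. The sketch below uses jq, as in the command above; raft_summary is a hypothetical helper, and the hosts are the three example addresses from the cluster output.

```shell
#!/usr/bin/env bash
# raft_summary is a hypothetical helper; it requires jq.
# Reads a /cluster response on stdin and prints one tab-separated line:
# peer_id, role, term, commit, pending_operations.
raft_summary() {
  jq -r '.result
    | [.peer_id, .raft_info.role, .raft_info.term, .raft_info.commit, .raft_info.pending_operations]
    | @tsv'
}

# Usage (needs reachable nodes):
# for host in 192.168.0.1 192.168.0.2 192.168.0.3; do
#   curl -s "$host:6333/cluster" | raft_summary
# done
```

If all peers report the same term and commit while the delete still times out, that would be worth including in the report.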
Detailed Description
We’ve started to see an issue with deleting collections through both the HTTP and gRPC APIs. The Qdrant server cannot execute the delete operation within the default timeout (10 seconds), and the issue persists even when we set the timeout to a high value (for instance, 100 seconds).
The collection we are trying to delete (there are several of them, to be honest) has only 415 points and is no longer used by clients: no upsert or search operations run on it.
I understand that the issue is quite difficult to reproduce, but it would be nice if we could get some help with resolving it 😃
Possible Implementation
About this issue
- State: open
- Created 7 months ago
- Reactions: 1
- Comments: 29 (10 by maintainers)
Qdrant 1.7.4 has just been released, which may improve the situation around this.
Hi @mautini, we usually recommend creating payload indexes before uploading the data. Creating an index on an existing collection is expected to be a long-running operation and it can time out, especially if you run it with wait=true. So I think the timeout during index creation is not related to the original issue.
@timvisee hello! I posted the questions on Discord. Can you check it out?
https://discord.com/channels/907569970500743200/1225339095929196589/1225339095929196589
No issue on my side since 1.7.4
I saw a similar issue when creating or deleting collections, though listing collections worked fine.
My cluster was on v1.6.1. I upgraded it to v1.7.1 and that solved the issue. Hope this helps people seeing similar issues.