qdrant: Verbose logging to indicate when optimizer / vacuum / indexing process is triggered, plus number of workers as they are dynamically allocated.

Is your feature request related to a problem? Please describe.

The ETL indexing operation is preventing concurrent searching.

In its most basic form, a single indexing operation involves deleting a set of points from a collection before inserting their replacements. This represents a set of document pages that have changed. Multiple indexing operations can be executed at the same time. Both the point deletion and insertion calls (made via the Python qdrant-client package) DO NOT wait for their task to complete and instead just check for an acknowledged message (i.e. we don’t want to hold up the ETL process; qdrant can work its way through its backlog).
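For reference, a minimal sketch of that indexing operation as described above, using qdrant-client (the collection name, page id, point id and vector values are placeholders):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Delete the old points for a changed page. wait=False returns as soon as the
# operation is acknowledged; qdrant applies it in the background.
client.delete(
    collection_name="documents",  # placeholder collection name
    points_selector=models.FilterSelector(
        filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="pageId",
                    match=models.MatchValue(value="page-123"),  # placeholder page id
                )
            ]
        )
    ),
    wait=False,
)

# Insert the replacement points, again without waiting for completion.
client.upsert(
    collection_name="documents",
    points=[
        models.PointStruct(
            id=1,                  # placeholder point id
            vector=[0.0] * 1536,   # 1536-dim vector, per the collection config
            payload={"pageId": "page-123", "documentGuid": "doc-1"},  # placeholder payload
        )
    ],
    wait=False,
)
```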

However, if an attempt is made to carry out a search on the same collection targeted by the indexing operation, the search either takes a very long time to complete (> 30s), or it just gives up and times out. If an indexing operation is not being undertaken, the same search returns more or less instantaneously.

If the point deletion and insertion calls are changed to wait until their respective operations have completed, the log output from qdrant suggests that the deletion is the one taking a considerable amount of time to complete (30+ s). So let’s consider the possible issues that might cause this…
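To illustrate, a minimal sketch of timing the blocking variant of the delete call (placeholder collection name and page id):

```python
import time

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

page_filter = models.Filter(
    must=[models.FieldCondition(key="pageId", match=models.MatchValue(value="page-123"))]
)

start = time.perf_counter()
client.delete(
    collection_name="documents",  # placeholder collection name
    points_selector=models.FilterSelector(filter=page_filter),
    wait=True,  # block until the deletion has actually been applied
)
print(f"delete took {time.perf_counter() - start:.1f}s")
```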

  • Is the filter used by the point deletion targeting indexed fields from the payload? Yes, it’s targeting the field ‘pageId’:

Collection details:

```json
{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "vectors_count": 21135,
    "indexed_vectors_count": 9293,
    "points_count": 12380,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": { "size": 1536, "distance": "Cosine" },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 1
      },
      "wal_config": { "wal_capacity_mb": 32, "wal_segments_ahead": 0 },
      "quantization_config": null
    },
    "payload_schema": {
      "domain": { "data_type": "text", "points": 10712 },
      "type": { "data_type": "text", "points": 10712 },
      "relevantDateTime": { "data_type": "float", "points": 10712 },
      "documentGuid": { "data_type": "text", "points": 10712 },
      "pageId": { "data_type": "text", "points": 10712 }
    }
  },
  "status": "ok",
  "time": 0.0019367
}
```

  • Deletion operations are computationally expensive, and qdrant gets around this by marking the point as deleted rather than actually carrying out the removal. So what could cause the deletion to take so long? Has the optimizer / vacuum process been triggered? Unfortunately there’s no log output that indicates that (see the status-polling sketch after this list). Has this caused all of the available threads to be eaten up, so that incoming requests can’t be serviced? How do I know how many workers qdrant has allocated from its worker pool to a specific job (i.e. search / indexing / optimization / vacuum etc.)? Given this allocation can be dynamic, it would be very helpful to be able to see it change in the logs. At the moment all we see is the number of available workers at start-up.

  • Is the deletion operation triggering an index adjustment? Similar problem to the above, no logging to help diagnose this.
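In the absence of dedicated log output, the closest workaround I know of is to poll the collection info around an indexing run and watch the status fields change (the collection status goes yellow while optimization is in progress). A minimal sketch, with a placeholder collection name:

```python
import time

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Poll the collection while an indexing operation is in flight to see whether
# the optimizer kicks in (status flips to yellow, indexed_vectors_count moves).
for _ in range(30):
    info = client.get_collection("documents")  # placeholder collection name
    print(
        f"status={info.status} "
        f"optimizer_status={info.optimizer_status} "
        f"indexed_vectors={info.indexed_vectors_count} "
        f"points={info.points_count} "
        f"segments={info.segments_count}"
    )
    time.sleep(1)
```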

Any suggestions / advice would be really helpful. Thanks - Kev.

Describe the solution you’d like

More verbose logging to be able to diagnose issues as described above.

Describe alternatives you’ve considered

The only other alternative is to try and build my own image of qdrant, adding debug output to achieve the same ends. However, my knowledge of its internals and of the language it’s written in is zero, so I’d have to commit the time to do this.

Additional context

Architecture

[Rest API]: Azure Function App v4 / python 3.10

qdrant-client = 1.8.0
llama-index = 0.9.48
fastapi = 0.104.1

[qdrant deployment]: Azure Container App

docker image: qdrant/qdrant v1.8.1 (single node)

```
Memory: 8GB
CPU Cores: 4
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel® Xeon® Platinum 8370C CPU @ 2.80GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
Stepping: 6
CPU(s) scaling MHz: 122%
CPU max MHz: 2800.0000
CPU min MHz: 800.0000
BogoMIPS: 5586.87
Virtualization features:
  Hypervisor vendor: Microsoft
  Virtualization type: full
Caches (sum of all):
  L1d: 96 KiB (2 instances)
  L1i: 64 KiB (2 instances)
  L2: 2.5 MiB (2 instances)
  L3: 48 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-3
```

[qdrant data storage]: Azure Storage Account (Premium File Storage)

File share mapped to by Azure Container App Environment.

[qdrant config]

```yaml
log_level: DEBUG

storage:
  storage_path: ./storage
  snapshots_path: ./snapshots
  temp_path: null
  on_disk_payload: true
  update_concurrency: null

  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0

  node_type: "Normal"

  performance:
    max_search_threads: 0
    max_optimization_threads: 0
    optimizer_cpu_budget: 1
    update_rate_limit: null

  optimizers:
    deleted_threshold: 0.2
    vacuum_min_vector_number: 1000
    default_segment_number: 0
    max_segment_size_kb: null
    memmap_threshold_kb: null
    indexing_threshold_kb: 20000
    flush_interval_sec: 5
    max_optimization_threads: null

  hnsw_index:
    m: 16
    ef_construct: 100
    full_scan_threshold_kb: 10000
    max_indexing_threads: 0
    on_disk: false
    payload_m: null

  shard_transfer_method: null

service:
  max_request_size_mb: 32
  max_workers: 0
  host: 0.0.0.0
  http_port: 6333
  grpc_port: 6334
  enable_cors: true
  enable_tls: false
  verify_https_client_certificate: false

cluster:
  enabled: false
  p2p:
    port: 6335
    enable_tls: false
  consensus:
    tick_period_ms: 100

telemetry_disabled: true

tls:
  cert: ./tls/cert.pem
  key: ./tls/key.pem
  ca_cert: ./tls/cacert.pem
  cert_ttl: 3600
```

About this issue

  • State: closed
  • Created 3 months ago
  • Comments: 23 (8 by maintainers)

Most upvoted comments

So I’ve managed to fix the error / issue with our ‘live’ qdrant collection by creating a new collection based upon the existing one. The idea here is to adopt the latest default optimization options given the available hardware, plus, by creating a new collection, hopefully get rid of any residual collection errors.

It may be worth noting that if you significantly change your hardware profile (especially CPUs), it could be beneficial to copy your existing collections to new ones in order to achieve the best optimizations / segment arrangement.
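For anyone wanting to do the same, a minimal sketch of creating a fresh collection populated from an existing one via the init_from option (collection names are placeholders; the vector parameters mirror the collection details above):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Create a fresh collection and copy the points over from the old one, so the
# current default optimizer settings and segment layout are applied.
client.create_collection(
    collection_name="documents_v2",  # placeholder name for the new collection
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    on_disk_payload=True,
    init_from=models.InitFrom(collection="documents"),  # placeholder name of the old collection
)
```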

And again Tim, thanks for all your help and in-depth comments / advice to help resolve this issue - because without them, I’d still be floundering around in the dark!

Thanks Kev.

  • It’s fixed in so far as it starts up; however, the indexer does not kick into action. Should this not happen automatically with indexing_threshold=1 and max_optimization_threads=1?

No, we don’t start indexing automatically on start to prevent crash loops. We actually want to clearly show this with a grey collection status in Qdrant 1.10. To trigger indexing again you can send another update operation.

If I send a request to change the collection’s max optimization threads to 1 (when it already equals 1), the index op triggers again and the same exception is raised.

Yes, that’s one of the things you can do to trigger it again. It is expected behavior. Here’s a documentation draft with a simple example request describing this behavior: https://deploy-preview-779--condescending-goldwasser-91acf0.netlify.app/documentation/concepts/collections/#grey-collection-status
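For completeness, a sketch of the kind of update request discussed here, via qdrant-client (placeholder collection name):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Any update operation wakes the optimizer up again after a restart; re-sending
# the existing max_optimization_threads value is enough.
client.update_collection(
    collection_name="documents",  # placeholder collection name
    optimizers_config=models.OptimizersConfigDiff(max_optimization_threads=1),
)
```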

I have mentioned this before, but you likely have way too few points for optimizations to take place (and to make sense). You can try to lower the indexing threshold. Though I wouldn’t recommend keeping it that way in the end, you can temporarily set it to 1 to see optimizations taking place.
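A sketch of temporarily lowering the threshold and then restoring the previous value (placeholder collection name; 20000 matches the optimizer config shown earlier):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Temporarily force indexing/optimization to run even on a small collection...
client.update_collection(
    collection_name="documents",  # placeholder collection name
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=1),
)

# ...then restore the original threshold once the optimizer has been observed.
client.update_collection(
    collection_name="documents",
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)
```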

You can see which specific optimizations were triggered, and when they took place, in the /telemetry?details_level=9 output.
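For example, fetched over plain HTTP (placeholder host):

```python
import json

import requests

# details_level controls how much per-collection / per-segment detail is included.
resp = requests.get(
    "http://localhost:6333/telemetry",  # placeholder host
    params={"details_level": 9},
)
print(json.dumps(resp.json(), indent=2))
```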

I have no good explanation at this time for the behavior you’re seeing.

Thanks Tim for all your comments, I’ll have a look into the telemetry first to see if I can see anything obvious.