OpenSearch: [BUG] Searchable Snapshot: Search hangs when parallel searches target the same remote index
Describe the bug
When performing an aggregation on a nested field of a searchable snapshot index (restored with storage_type: remote_snapshot), the search task hangs for days if no timeout is defined (the default behavior). This can block the node from handling future searches once the search thread pool queue fills up.
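While the searches are stuck, the saturation of the search thread pool can be observed with the cat thread pool API (a diagnostic sketch, not part of the original report):

curl 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'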
To Reproduce
Steps to reproduce the behavior:
- Create a document which contains a nested field
- Restore the index as a remote_snapshot (a sketch of the restore request follows the query below)
- Perform a terms agg on the nested field (see example below)
- The search tasks will keep running and never complete
{
  "aggs": {
    "entTmSrs": {
      "filter": {
        "match_all": {}
      },
      "aggs": {
        "_nest_agg": {
          "nested": {
            "path": "nested_doc"
          },
          "aggs": {
            "_key_match": {
              "filter": {
                "term": {
                  "nested_doc.key": "some_value"
                }
              },
              "aggs": {
                "nested_doc.status": {
                  "terms": {
                    "field": "nested_doc.some_field",
                    "size": 5,
                    "min_doc_count": 1,
                    "missing": "(none)"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
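For reference, restoring an index as a searchable snapshot looks roughly like the following; the repository, snapshot, and index names are placeholders, not the actual values from this cluster:

curl -XPOST 'localhost:9200/_snapshot/my-azure-repo/my-snapshot/_restore' -H 'Content-Type: application/json' -d'
{
  "indices": "my-index",
  "storage_type": "remote_snapshot"
}'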
Expected behavior
The search request should complete, or have a reasonable default timeout so that it does not deadlock future searches on the node. The same query on a local index (not remote_snapshot) takes <100ms. We expect the queries to take longer against a remote snapshot, but not to run for 2 days and keep retrying.
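As a possible workaround (a sketch, not an official recommendation), a cluster-wide default search timeout can be set; note, however, that a timeout may not interrupt a search blocked inside low-level I/O, which appears to be the case here:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.default_search_timeout": "30s"
  }
}'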
Dump of stuck tasks
curl localhost:9200/_cat/tasks
indices:data/read/search VCwgCfiNTBKetKNgHY9j5A:12586568 - transport 1676066479789 22:01:19 1.9d 10.2.0.18 763d0e53942b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275069 VCwgCfiNTBKetKNgHY9j5A:12586568 transport 1676066507519 22:01:47 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search B4qWdIFETWe3AglIN4-Krg:13826364 - transport 1676066490202 22:01:30 1.9d 10.2.0.6 9cb9b11d3b13
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275062 B4qWdIFETWe3AglIN4-Krg:13826364 transport 1676066506248 22:01:46 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search xYZuUqqDQACxQraz2oF_Rw:10224070 - transport 1676066501292 22:01:41 1.9d 10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275074 xYZuUqqDQACxQraz2oF_Rw:10224070 transport 1676066507805 22:01:47 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search FeQkJipvT-qgPloq89yosw:19005585 - transport 1676066502092 22:01:42 1.9d 10.2.0.3 es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275098 FeQkJipvT-qgPloq89yosw:19005585 transport 1676066508604 22:01:48 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search CioczzjnTKSQEh8uLvEpgA:5319933 - transport 1676066502318 22:01:42 1.9d 10.2.0.22 f694462bab56
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275096 CioczzjnTKSQEh8uLvEpgA:5319933 transport 1676066508514 22:01:48 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search 8qSnQ7U4SFK-7_MPyvawow:4579861 - transport 1676066530687 22:02:10 1.9d 10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275310 8qSnQ7U4SFK-7_MPyvawow:4579861 transport 1676066537199 22:02:17 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search xYZuUqqDQACxQraz2oF_Rw:10224230 - transport 1676066531030 22:02:11 1.9d 10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275314 xYZuUqqDQACxQraz2oF_Rw:10224230 transport 1676066537542 22:02:17 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search FeQkJipvT-qgPloq89yosw:19006130 - transport 1676066531796 22:02:11 1.9d 10.2.0.3 es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275329 FeQkJipvT-qgPloq89yosw:19006130 transport 1676066538308 22:02:18 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search xYZuUqqDQACxQraz2oF_Rw:10224240 - transport 1676066532799 22:02:12 1.9d 10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275350 xYZuUqqDQACxQraz2oF_Rw:10224240 transport 1676066539311 22:02:19 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search 8qSnQ7U4SFK-7_MPyvawow:4579910 - transport 1676066533031 22:02:13 1.9d 10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275354 8qSnQ7U4SFK-7_MPyvawow:4579910 transport 1676066539543 22:02:19 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search iaqRNoTCStaRJNbnv2S7Sw:5330461 - transport 1676066540623 22:02:20 1.9d 10.2.0.8 b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275402 iaqRNoTCStaRJNbnv2S7Sw:5330461 transport 1676066546819 22:02:26 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search iaqRNoTCStaRJNbnv2S7Sw:5330467 - transport 1676066540646 22:02:20 1.9d 10.2.0.8 b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275404 iaqRNoTCStaRJNbnv2S7Sw:5330467 transport 1676066546842 22:02:26 1.9d 10.2.0.19 2616efc46d6b
indices:data/read/search iaqRNoTCStaRJNbnv2S7Sw:5330494 - transport 1676066545520 22:02:25 1.9d 10.2.0.8 b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275419 iaqRNoTCStaRJNbnv2S7Sw:5330494 transport 1676066551716 22:02:31 1.9d 10.2.0.19 2616efc46d6b
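The stuck tasks can in principle be cancelled via the task management API (a workaround sketch; the task ID below is the first parent task from the dump above), although cancellation may not take effect if the task is blocked in uninterruptible I/O:

curl -XPOST 'localhost:9200/_tasks/VCwgCfiNTBKetKNgHY9j5A:12586568/_cancel'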
Host/Environment:
- OS: Ubuntu
- Version: OpenSearch 2.4
Additional context
Using Azure Blob Storage as the snapshot repo.
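For reference, the snapshot repository was registered via the repository-azure plugin, roughly like this (repository and container names are placeholders):

curl -XPUT 'localhost:9200/_snapshot/my-azure-repo' -H 'Content-Type: application/json' -d'
{
  "type": "azure",
  "settings": {
    "container": "my-snapshot-container"
  }
}'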
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (7 by maintainers)
Commits related to this issue
- Fix race with eviction when reading from FileCache The previous implementation had an inherent race condition where a zero-reference count IndexInput read from the cache could be evicted before the I... — committed to andrross/OpenSearch by andrross a year ago
- Fix race with eviction when reading from FileCache (#6592) The previous implementation had an inherent race condition where a zero-reference count IndexInput read from the cache could be evicted be... — committed to opensearch-project/OpenSearch by andrross a year ago
- Fix race with eviction when reading from FileCache (#6592) The previous implementation had an inherent race condition where a zero-reference count IndexInput read from the cache could be evicted befo... — committed to opensearch-project/OpenSearch by github-actions[bot] a year ago
- Fix race with eviction when reading from FileCache (#6592) (#6630) The previous implementation had an inherent race condition where a zero-reference count IndexInput read from the cache could be evi... — committed to opensearch-project/OpenSearch by opensearch-trigger-bot[bot] a year ago
- Fix race with eviction when reading from FileCache (#6592) The previous implementation had an inherent race condition where a zero-reference count IndexInput read from the cache could be evicted befo... — committed to mingshl/OpenSearch-Mingshl by andrross a year ago
@kartg Let's keep this open until we get full verification of the fix.