OpenSearch: [BUG] Searchable Snapshot: Search hangs when parallel searches to same remote index

Describe the bug When performing an aggregation on a nested field of a searchable snapshot index (restored with storage_type: remote_snapshot), the search task hangs for days if no timeout is defined (the default behavior). This can block the node from handling future searches once the search thread pool queue fills up.

To Reproduce Steps to reproduce the behavior:

  1. Create a document that contains a nested field
  2. Restore the index as a remote_snapshot (a setup sketch for steps 1-2 follows the aggregation example below)
  3. Perform a terms aggregation on the nested field (see the example below)
  4. The search tasks keep running and never complete
{
    "aggs": {
        "entTmSrs": {
            "filter": {
                "match_all": {}
            },
            "aggs": {
                "_nest_agg": {
                    "nested": {
                        "path": "nested_doc"
                    },
                    "aggs": {
                        "_key_match": {
                            "filter": {
                                "term": {
                                    "nested_doc.key": "some_value"
                                }
                            },
                            "aggs": {
                                "nested_doc.status": {
                                    "terms": {
                                        "field": "nested_doc.some_field",
                                        "size": 5,
                                        "min_doc_count": 1,
                                        "missing": "(none)"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "size": 0
}
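
For reference, a minimal setup sketch for steps 1-2. The index, repository, and snapshot names are placeholders, and the nested field layout is only assumed to mirror the fields used in the aggregation above; the relevant part is restoring with "storage_type": "remote_snapshot".

# 1. Create an index with a nested field and index one document (placeholder names)
curl -XPUT "localhost:9200/test-index" -H 'Content-Type: application/json' -d '
{
    "mappings": {
        "properties": {
            "nested_doc": {
                "type": "nested",
                "properties": {
                    "key": { "type": "keyword" },
                    "some_field": { "type": "keyword" }
                }
            }
        }
    }
}'

curl -XPOST "localhost:9200/test-index/_doc?refresh=true" -H 'Content-Type: application/json' -d '
{
    "nested_doc": [ { "key": "some_value", "some_field": "status_a" } ]
}'

# 2. Snapshot the index, then restore it as a searchable (remote) snapshot
curl -XPUT "localhost:9200/_snapshot/azure-repo/snap-1?wait_for_completion=true" -H 'Content-Type: application/json' -d '
{ "indices": "test-index" }'

curl -XPOST "localhost:9200/_snapshot/azure-repo/snap-1/_restore" -H 'Content-Type: application/json' -d '
{
    "indices": "test-index",
    "storage_type": "remote_snapshot",
    "rename_pattern": "(.+)",
    "rename_replacement": "$1-remote"
}'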

Expected behavior The search request should complete, or have a reasonable default timeout so it cannot deadlock future searches on the node. The same query on a local index (not remote_snapshot) takes <100 ms. We expect these queries to take longer against a remote snapshot, but not to run for 2 days and keep retrying.
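
As a stopgap, a timeout can be set explicitly, either per request (adding a "timeout" field next to "size" in the body above) or cluster-wide. A sketch of the cluster-wide default, assuming the dynamic search.default_search_timeout setting actually interrupts the remote_snapshot fetch (which is exactly what is in question here):

# Cluster-wide default search timeout (dynamic setting; value is a placeholder)
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{ "transient": { "search.default_search_timeout": "60s" } }'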

Dump of stuck tasks

curl localhost:9200/_cat/tasks
indices:data/read/search              VCwgCfiNTBKetKNgHY9j5A:12586568 -                               transport 1676066479789 22:01:19 1.9d        10.2.0.18 763d0e53942b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275069  VCwgCfiNTBKetKNgHY9j5A:12586568 transport 1676066507519 22:01:47 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              B4qWdIFETWe3AglIN4-Krg:13826364 -                               transport 1676066490202 22:01:30 1.9d        10.2.0.6  9cb9b11d3b13
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275062  B4qWdIFETWe3AglIN4-Krg:13826364 transport 1676066506248 22:01:46 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224070 -                               transport 1676066501292 22:01:41 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275074  xYZuUqqDQACxQraz2oF_Rw:10224070 transport 1676066507805 22:01:47 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              FeQkJipvT-qgPloq89yosw:19005585 -                               transport 1676066502092 22:01:42 1.9d        10.2.0.3  es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275098  FeQkJipvT-qgPloq89yosw:19005585 transport 1676066508604 22:01:48 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              CioczzjnTKSQEh8uLvEpgA:5319933  -                               transport 1676066502318 22:01:42 1.9d        10.2.0.22 f694462bab56
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275096  CioczzjnTKSQEh8uLvEpgA:5319933  transport 1676066508514 22:01:48 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              8qSnQ7U4SFK-7_MPyvawow:4579861  -                               transport 1676066530687 22:02:10 1.9d        10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275310  8qSnQ7U4SFK-7_MPyvawow:4579861  transport 1676066537199 22:02:17 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224230 -                               transport 1676066531030 22:02:11 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275314  xYZuUqqDQACxQraz2oF_Rw:10224230 transport 1676066537542 22:02:17 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              FeQkJipvT-qgPloq89yosw:19006130 -                               transport 1676066531796 22:02:11 1.9d        10.2.0.3  es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275329  FeQkJipvT-qgPloq89yosw:19006130 transport 1676066538308 22:02:18 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224240 -                               transport 1676066532799 22:02:12 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275350  xYZuUqqDQACxQraz2oF_Rw:10224240 transport 1676066539311 22:02:19 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              8qSnQ7U4SFK-7_MPyvawow:4579910  -                               transport 1676066533031 22:02:13 1.9d        10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275354  8qSnQ7U4SFK-7_MPyvawow:4579910  transport 1676066539543 22:02:19 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330461  -                               transport 1676066540623 22:02:20 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275402  iaqRNoTCStaRJNbnv2S7Sw:5330461  transport 1676066546819 22:02:26 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330467  -                               transport 1676066540646 22:02:20 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275404  iaqRNoTCStaRJNbnv2S7Sw:5330467  transport 1676066546842 22:02:26 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330494  -                               transport 1676066545520 22:02:25 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275419  iaqRNoTCStaRJNbnv2S7Sw:5330494  transport 1676066551716 22:02:31 1.9d        10.2.0.19 2616efc46d6b
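
For anyone else stuck in this state, the tasks API can at least be used to try cancelling the hung searches by ID or by action (the task ID below is the first parent task from the dump above); whether cancellation actually frees the blocked search threads in this state is unclear.

# Cancel a single stuck parent search task by ID
curl -XPOST "localhost:9200/_tasks/VCwgCfiNTBKetKNgHY9j5A:12586568/_cancel"

# Or cancel every in-flight search task on the cluster
curl -XPOST "localhost:9200/_tasks/_cancel?actions=indices:data/read/search*"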

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version: OpenSearch 2.4

Additional context Using Azure Blob Storage as the snapshot repository.
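
For context, the repository was registered via the repository-azure plugin, roughly along these lines (container and base_path are placeholders; the storage account credentials sit in the OpenSearch keystore, not in this request):

# Repository registration sketch (repository-azure plugin; names are placeholders)
curl -XPUT "localhost:9200/_snapshot/azure-repo" -H 'Content-Type: application/json' -d '
{
    "type": "azure",
    "settings": {
        "container": "snapshots",
        "base_path": "opensearch"
    }
}'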

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

@kartg Let’s keep this open until we get full verification of the fix