OpenSearch: [BUG] StackOverflow crash - large regex produced by Discover filter not limited by index.max_regex_length
Describe the bug It seems to be possible to crash OpenSearch nodes by providing a very large string when attempting to filter on a field value (a StackOverflow related to regexp processing). When filtering on a field value, a query containing a ‘suggestions’ aggregation is sent to the cluster in the background before the filter is saved, in order to populate an autocomplete drop-down. This aggregation includes a regex constructed by taking the large string and appending “.*”. The resulting regexp does not appear to respect the default index.max_regex_length limit of 1000 - the query is submitted and instantly crashes nodes.
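For illustration, the suggestion query has roughly the following shape (the index/field names and the surrounding structure here are placeholders for readability, not copied from the actual capture):

```json
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "customer_id",
        "include": "<the pasted 50k-char string>.*"
      }
    }
  }
}
```

The `include` parameter is interpreted as a regular expression, which is where the oversized pattern reaches Lucene's regex parser.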
To Reproduce Reproduced using the latest OpenSearch Docker images -
{
"name" : "opensearch-node1",
"cluster_name" : "opensearch-cluster",
"cluster_uuid" : "ftk3wyp1RqOa0Yq5SS4ELA",
"version" : {
"distribution" : "opensearch",
"number" : "1.2.4",
"build_type" : "tar",
"build_hash" : "e505b10357c03ae8d26d675172402f2f2144ef0f",
"build_date" : "2022-01-14T03:38:06.881862Z",
"build_snapshot" : false,
"lucene_version" : "8.10.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "The OpenSearch Project: https://opensearch.org/"
}
- load sample data in the dashboard
- in the Discover pane, add a Filter, select a Field, select ‘is’, and paste a huge string into the search box - 50k characters will do the trick. The node receiving the query will crash.
Expected behavior The query should be rejected before it is allowed to crash the cluster. The bug is present in some versions of Elasticsearch, but does not appear to be present in the latest version (7.16). It is present in 7.10.2, the last version tracked before the OpenSearch fork - so it probably needs to be addressed in the OpenSearch codebase now. In Elasticsearch 7.16 the following response is returned -
{"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"shard":0,"index":"kibana_sample_data_ecommerce","status":"INTERNAL_SERVER_ERROR","reason":{"type":"broadcast_shard_operation_failed_exception","reason":"java.lang.IllegalArgumentException: input automaton is too large: 1001.......................
Plugins Nothing beyond the default plugins (security etc…)
Host/Environment (please complete the following information):
~ ❯❯❯ uname -a
5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~ ❯❯❯ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 26 (23 by maintainers)
No problem, and thanks for following up on it .
My original attempt at a mitigation failed miserably, so I’m back at the drawing board.
To reiterate, this change does not seek to prevent the StackOverflow error from regex parsing; doing so is not feasible since the root cause lies within the Lucene implementation and the overflow threshold is dictated by a JVM setting. Instead, we’re seeking to correctly enforce the index.max_regex_length index-level setting. I believe this is currently enforced for the search query itself (via QueryStringQueryParser), but not for aggregations.

The complexity here stems from the fact that the logic for parsing the “include” regex is set up at bootstrap/startup time across multiple term parsers (example). Since there is no notion of an “index” in this context, the index-level setting/limit cannot be retrieved/applied here.
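As a sketch of the enforcement itself (the class name, method name, and message wording below are hypothetical, not taken from the actual patch), the check would simply compare the regex source length against the configured index-level limit before the pattern is ever handed to Lucene:

```java
// Hypothetical guard: reject an over-long regex before Lucene parses it.
// Names and message wording are illustrative, not OpenSearch's actual code.
final class RegexLengthGuard {
    static void checkRegexLength(String regex, int maxRegexLength) {
        if (regex.length() > maxRegexLength) {
            throw new IllegalArgumentException(
                "The length of regex [" + regex.length() + "] used in the request has exceeded "
                    + "the allowed maximum of [" + maxRegexLength + "]");
        }
    }

    public static void main(String[] args) {
        checkRegexLength("short.*", 1000); // within the limit: no exception
        try {
            checkRegexLength("a".repeat(50_000) + ".*", 1000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected over-long regex");
        }
    }
}
```

The point of the discussion above is not the check itself (it is trivial) but that it must run somewhere the index settings are actually in scope.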
The right location to enforce this would be at the runtime point where a search query against an index arrives at the node and must be parsed. In the ideal case, the QueryContext object would then be passed to the IncludeExclude parsing implementation, which would own the enforcement of the regex length limit.

I think we’re just hitting the fact that the recursive algorithm uses one stack frame per regex operation. See this simple test not using OpenSearch at all:
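The original snippet is not preserved in this thread; a minimal stand-in (pure Java, no OpenSearch or Lucene) showing the same failure mode - one stack frame per nested operator - might look like this:

```java
// Minimal recursive-descent "parser" that consumes one stack frame per
// nesting level, the same shape as a regex parser recursing per operator.
public final class DeepRecursionDemo {
    // Returns the index just past the matching close paren for the
    // group starting at position i.
    static int parseGroup(String s, int i) {
        if (i < s.length() && s.charAt(i) == '(') {
            int afterInner = parseGroup(s, i + 1); // recurse into nested group
            return afterInner + 1;                 // consume the matching ')'
        }
        return i; // base case: no more opening parens
    }

    public static void main(String[] args) {
        // A shallow input parses fine...
        System.out.println(parseGroup("((()))", 0)); // prints 6

        // ...but deep nesting exhausts the thread stack, because the depth
        // limit comes from the JVM (-Xss), not from any input validation.
        int depth = 1_000_000;
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < depth; j++) sb.append('(');
        for (int j = 0; j < depth; j++) sb.append(')');
        try {
            parseGroup(sb.toString(), 0);
            System.out.println("parsed without overflow");
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError: depth bounded by -Xss, not by validation");
        }
    }
}
```

This is why the comment above argues the fix must be an up-front length limit rather than catching the overflow: the crash point depends on the JVM stack size, not on anything the parser can check mid-recursion.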
@kartg Can you please open an issue in Lucene? A StackOverflow may not be a problem, that’s a valid exception, though.

Hey @dreamer-89,
Thanks for picking this up. Just tested again using clean container images and reproduced the crash. Just a quick test - but hopefully enough to point you in the right direction for reproducing the issue.
Disabled HTTPS for convenience to grab a traffic capture in order to view the request (alternatively, enabling audit logging for the REST interface should work just as well).
Steps taken -
Here’s a sample from a traffic capture I ran while doing this test -
Note the first query contains “include”: “.*” - this autocompleted the drop-down when no text was entered (e.g. all _id values). The second query was sent when I pasted 50K chars into the value box - I did not submit the request, just pasted the chars.
I attached the OpenSearch logs in an earlier comment - see ‘opensearch-regex-fatal-error.log’. Here’s a snippet from the container stdout viewable in the terminal (please check the earlier attachment for the full error). We can see opensearch-node1 die, after which point opensearch-dashboards cannot connect anymore.
Hope that helps! Any other questions, shout.
Cheers