OpenSearch: [BUG] StackOverflow crash - large regex produced by Discover filter not limited by index.max_regex_length
Describe the bug It seems to be possible to crash OpenSearch nodes by providing a very large string when attempting to filter on a field value (a StackOverflow related to regexp processing). When filtering on a field value, a query containing a ‘suggestions’ aggregation is sent to the cluster in the background before the filter is saved, in order to populate an autocomplete drop-down. This aggregation includes a regex constructed by taking the large string and appending “.*”. The resulting regexp does not appear to respect the default index.max_regex_length limit of 1000 - the query is submitted and instantly crashes nodes.
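For illustration, the suggestion query has roughly the following shape (the index/field names and the surrounding structure here are placeholders for readability, not copied from the actual capture):

```json
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "customer_id",
        "include": "<the pasted 50k-char string>.*"
      }
    }
  }
}
```

The `include` parameter is interpreted as a regular expression, which is where the oversized pattern reaches Lucene's regex parser.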
To Reproduce Reproduced using the latest OpenSearch Docker images -
{
"name" : "opensearch-node1",
"cluster_name" : "opensearch-cluster",
"cluster_uuid" : "ftk3wyp1RqOa0Yq5SS4ELA",
"version" : {
"distribution" : "opensearch",
"number" : "1.2.4",
"build_type" : "tar",
"build_hash" : "e505b10357c03ae8d26d675172402f2f2144ef0f",
"build_date" : "2022-01-14T03:38:06.881862Z",
"build_snapshot" : false,
"lucene_version" : "8.10.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "The OpenSearch Project: https://opensearch.org/"
}
- load sample data in the dashboard
- in the Discover pane, add a Filter, select a Field, select ‘is’, and paste a huge string into the search box - 50k characters will do the trick. The node receiving the query will crash.
Expected behavior The query should be rejected before it is allowed to crash the cluster. The bug is present in some versions of Elasticsearch, but does not appear to be present in the latest version (7.16). It is present in 7.10.2, the last version tracked before the OpenSearch fork - so it probably needs to be addressed in the OpenSearch codebase now. In Elasticsearch 7.16 the following response is returned -
{"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"shard":0,"index":"kibana_sample_data_ecommerce","status":"INTERNAL_SERVER_ERROR","reason":{"type":"broadcast_shard_operation_failed_exception","reason":"java.lang.IllegalArgumentException: input automaton is too large: 1001.......................
Plugins Nothing beyond the default plugins (security etc…)
Host/Environment (please complete the following information):
~ ❯❯❯ uname -a
5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~ ❯❯❯ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 26 (23 by maintainers)
No problem, and thanks for following up on it .
My original attempt at a mitigation failed miserably, so I’m back at the drawing board.
To reiterate, this change does not seek to prevent the StackOverflow error from regex parsing; doing so is not feasible since the root cause lies within the Lucene implementation and the overflow threshold is dictated by a JVM setting. Instead, we’re seeking to correctly enforce the index.max_regex_length index-level setting. I believe this is currently enforced for the search query itself (via QueryStringQueryParser), but not for aggregations.

The complexity here stems from the fact that the logic for parsing the “include” regex is set up at bootstrap/startup time across multiple term parsers (example). Since there is no notion of an “index” in this context, the index-level setting/limit cannot be retrieved/applied here.
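As a sketch of the enforcement itself (the class name, method name, and message wording below are hypothetical, not taken from the actual patch), the check would simply compare the regex source length against the configured index-level limit before the pattern is ever handed to Lucene:

```java
// Hypothetical guard: reject an over-long regex before Lucene parses it.
// Names and message wording are illustrative, not OpenSearch's actual code.
final class RegexLengthGuard {
    static void checkRegexLength(String regex, int maxRegexLength) {
        if (regex.length() > maxRegexLength) {
            throw new IllegalArgumentException(
                "The length of regex [" + regex.length() + "] used in the request has exceeded "
                    + "the allowed maximum of [" + maxRegexLength + "]");
        }
    }

    public static void main(String[] args) {
        checkRegexLength("short.*", 1000); // within the limit: no exception
        try {
            checkRegexLength("a".repeat(50_000) + ".*", 1000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected over-long regex");
        }
    }
}
```

The point of the discussion above is not the check itself (it is trivial) but that it must run somewhere the index settings are actually in scope.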
The right location to enforce this would be at the runtime point where a search query against an index arrives at the node and must be parsed. In the ideal case, the QueryContext object would then be passed to the IncludeExclude parsing implementation, which would own the enforcement of the regex length limit.

I think we’re just hitting the fact that the recursive algorithm uses one stack frame per regex operation. See this simple test not using OpenSearch at all:
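The original snippet is not preserved in this thread; a minimal stand-in (pure Java, no OpenSearch or Lucene) showing the same failure mode - one stack frame per nested operator - might look like this:

```java
// Minimal recursive-descent "parser" that consumes one stack frame per
// nesting level, the same shape as a regex parser recursing per operator.
public final class DeepRecursionDemo {
    // Returns the index just past the matching close paren for the
    // group starting at position i.
    static int parseGroup(String s, int i) {
        if (i < s.length() && s.charAt(i) == '(') {
            int afterInner = parseGroup(s, i + 1); // recurse into nested group
            return afterInner + 1;                 // consume the matching ')'
        }
        return i; // base case: no more opening parens
    }

    public static void main(String[] args) {
        // A shallow input parses fine...
        System.out.println(parseGroup("((()))", 0)); // prints 6

        // ...but deep nesting exhausts the thread stack, because the depth
        // limit comes from the JVM (-Xss), not from any input validation.
        int depth = 1_000_000;
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < depth; j++) sb.append('(');
        for (int j = 0; j < depth; j++) sb.append(')');
        try {
            parseGroup(sb.toString(), 0);
            System.out.println("parsed without overflow");
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError: depth bounded by -Xss, not by validation");
        }
    }
}
```

This is why the comment above argues the fix must be an up-front length limit rather than catching the overflow: the crash point depends on the JVM stack size, not on anything the parser can check mid-recursion.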
@kartg Can you please open an issue in Lucene? A StackOverflow may not be a problem, that’s a valid exception, though.

Hey @dreamer-89,
Thanks for picking this up. Just tested again using clean container images and reproduced the crash. Just a quick test - but hopefully enough to point you in the right direction for reproducing the issue.
Disabled HTTPS for convenience to grab a traffic capture in order to view the request (alternatively, enabling audit logging for the REST interface should work just as well).
Steps taken -
Here’s a sample from a traffic capture I ran while doing this test -
Note the first query contains “include”: “.*” - this autocompleted the drop-down when no text was entered (e.g. all _id values). The second query was sent when I pasted 50K chars into the value box - I did not submit the request, just pasted the chars.
I attached the OpenSearch logs in an earlier comment - see ‘opensearch-regex-fatal-error.log’. Here’s a snippet from the container stdout viewable in the terminal (please check the earlier attachment for the full error). We can see opensearch-node1 die, after which point opensearch-dashboards cannot connect anymore.
Hope that helps! Any other questions, shout.
Cheers