OpenSearch: [BUG] OpenSearch 2.0 possible JIT Compiler memory leak or related issue
Describe the bug: When running the Index Management Dashboards cypress tests against an OpenSearch 2.0 cluster, all of the tests pass, but if I leave the cluster up for 0-10 minutes then at some point the memory usage starts to spike. More specifically, the Compiler and Arena Chunk memory grow incredibly fast until the entire instance runs out of memory and the OS kills the process.
I haven't tried this with other plugins or with OpenSearch by itself yet, so I can't say with 100% certainty that this is a problem purely in core and not something in Index Management or Job Scheduler. So far while debugging, after the Index Management tests were run I deleted the .opendistro-ism-config index, which is where all the IM jobs are stored and which deschedules anything IM-related, and the spike still happened after a few minutes. That, along with the major changes made in core versus the minimal changes in Index Management, makes me think it's currently something in core, but I'm still trying to root cause it. If no other plugin is able to replicate this at all, then perhaps it is something in IM after all.
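For reference, descheduling the Index Management jobs during debugging just amounted to deleting the ISM config system index. A minimal sketch of that call, assuming an unsecured local cluster on port 9200:

```bash
# Deleting the ISM config index removes the stored IM jobs, which deschedules them.
# Assumes an unsecured local cluster; add credentials/TLS flags if security is enabled.
curl -X DELETE "http://localhost:9200/.opendistro-ism-config"
```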
Here's the NMT baseline: https://gist.github.com/dbbaughe/b3eb7457e8380f39db7cfa13ae22c5e7
Here's a diff taken while it was spiking, about a minute before going OOM: https://gist.github.com/dbbaughe/85975886d291e8edc0c63bdf85247b72
Committed memory went from 5GB to 50GB, with a 32GB increase in Compiler and a 16GB increase in Arena Chunk.
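For context, here is a sketch of how a baseline/diff pair like the ones in the gists can be captured, assuming the cluster was started with NativeMemoryTracking enabled (the pgrep pattern for finding the node's PID is an assumption; use whatever PID `jps -l` or `ps` reports):

```bash
# Find the OpenSearch node's PID (assumed process pattern; `jps -l` works too)
PID=$(pgrep -f org.opensearch.bootstrap.OpenSearch | head -n1)

# Record a baseline, let the tests run / wait for the spike, then diff against it
jcmd "$PID" VM.native_memory baseline
# ... run the cypress/integ tests, wait for memory to start climbing ...
jcmd "$PID" VM.native_memory summary.diff   # or detail.diff for per-callsite data
```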
This has happened on my dev machine using JDK 11 and with Log4jHotPatch disabled.
dmesg:
[535702.922079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=30080,uid=6606827
[535702.931813] Out of memory: Killed process 30080 (java) total-vm:70035316kB, anon-rss:60797644kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:120164kB oom_score_adj:0
[535704.307968] oom_reaper: reaped process 30080 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
To Reproduce: Steps to reproduce the behavior (see further below for updated, shortened steps):
1. Set JDK to 11.
2. Pull down Index Management.
   - 2.1 While on the main branch, run `./gradlew run` and make sure JDK 11 is being used.
   - 2.2 If you'd like to see NMT info, use `./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=summary"` or `./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=detail"`.
   - 2.3 While the cluster is running you can execute `jcmd <pid> VM.native_memory summary` or `jcmd <pid> VM.native_memory detail` to see the information, and `jcmd <pid> VM.native_memory baseline` to take a baseline and then compare against it with `jcmd <pid> VM.native_memory summary.diff` or `jcmd <pid> VM.native_memory detail.diff`.
3. Pull down OpenSearch Dashboards and make sure you're on the 2.0 version (as of my testing that was the main branch).
   - 3.0 Note: I specifically had the backend running on my dev machine and Dashboards running on my Mac, with an ssh tunnel to my dev machine so they could communicate, i.e. `ssh -L 9200:localhost:9200 <dev host>`.
   - 3.1 cd into the plugins directory and pull down Index Management Dashboards.
   - 3.2 Make sure your node version is set to 14.18.2 (`nvm use 14.18.2`).
   - 3.3 Run `yarn osd bootstrap` while inside the Index Management Dashboards directory.
   - 3.4 Once that's done, from the OpenSearch Dashboards directory run `yarn start --no-base-path --no-watch`.
   - 3.5 Once OpenSearch Dashboards is up, from the Index Management Dashboards plugin directory run `yarn run cypress run` to run the cypress test suite.
   - 3.6 After the cypress tests are done, it usually takes anywhere from 0 to 10 minutes until memory starts spiking, and the process goes OOM a short while later. During this time you can run the jcmd commands above (see the monitoring sketch after these steps) to check whether the committed memory is growing.
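While waiting for the spike, a small polling loop makes the growth easy to spot. This is just an illustrative sketch (the PID placeholder and the 30-second interval are assumptions); it greps the two NMT categories that were growing in my case:

```bash
# Poll NMT every 30 seconds and print the Compiler and Arena Chunk lines,
# so you can watch whether their committed sizes keep climbing.
PID=<opensearch-pid>   # placeholder: substitute the node's actual PID
while true; do
  date
  jcmd "$PID" VM.native_memory summary | grep -E "Compiler|Arena Chunk"
  sleep 30
done
```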
I'm trying to see if I can scope the reproduction steps down to just some backend API calls. Leaving the cluster up by itself, with no calls to it, did not cause it to eventually run OOM.
Looks like I was also able to reproduce this by just running our backend integ tests against the cluster:
[599373.011209] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=23132,uid=6606827
[599373.022039] Out of memory: Killed process 23132 (java) total-vm:68547404kB, anon-rss:59905372kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:118544kB oom_score_adj:0
[599374.185090] oom_reaper: reaped process 23132 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
So the shortened steps would be steps 1 and 2 above, and then, from another terminal in the same directory (Index Management), just:

`./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest`
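To keep load on the cluster until the spike shows up, something like the loop below works; this is just a convenience sketch around the command above, not part of the original steps:

```bash
# Re-run the Index Management integ tests against the already-running cluster
# until interrupted, to wait out the 0-10 minute window before the memory spike.
while ./gradlew integTestRemote \
    -Dtests.rest.cluster=localhost:9200 \
    -Dtests.cluster=localhost:9200 \
    -Dtests.clustername=integTest; do
  echo "Test run finished; starting another pass."
done
```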
It wasn't able to make it through the IM tests this time… We do have CI running in Index Management, and checking it, there is a recent failure from 17 hours ago for a commit merged to main:
https://github.com/opensearch-project/index-management/actions/runs/2099364761
Expected behavior: Not going OOM.
Plugins: Index Management / Job Scheduler
Yeah I did, but I was only cross-checking against similar issues that others had reported… I don't have enough expertise with those compiler logs to find something on my own without spending a significant amount of time.
I’ll close this.
@uschindler Care to open a feature request describing the startup checks? OpenSearch 2.0 requires a minimum of JDK 11, so that can work.
If it does turn out to be fixed with a newer JDK 11 version… I'm wondering what our recourse here is. I had gone to https://jdk.java.net/11/, which pointed me to the OpenJDK Archive, and the latest JDK 11 there is 11.0.2, which I figured was the newest one. I would assume others may be doing the same… unless I went to the completely wrong resource to grab a JDK 11 download.
Do we just release OS 2.0 with a caveat that only JDK 11.x.x.x and up is supported? @dblock @nknize Ideally we could still try to track down what the issue is… but if it's something the JDK fixes itself, it might not be worth the effort.
Edit: It seems Nick mentioned there's a minimum JDK variable we can set, so it can be enforced more strictly.
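For illustration, here is a rough sketch of the kind of pre-flight check being discussed, comparing the runtime's JDK 11 update number against a minimum before starting the cluster. The 11.0.14 threshold is only an assumption based on the testing below, not a confirmed fix version:

```bash
# Rough pre-flight check: refuse to start if the JDK is an 11.0.x build older than the minimum update.
MIN_UPDATE=14   # assumed threshold, not a confirmed fix version
VERSION=$(java -version 2>&1 | awk -F '"' '/version/ {print $2}')   # e.g. "11.0.2"
FEATURE=$(echo "$VERSION" | cut -d. -f1)
UPDATE=$(echo "$VERSION" | cut -d. -s -f3 | tr -cd '0-9')
if [ "$FEATURE" -eq 11 ] && [ "${UPDATE:-0}" -lt "$MIN_UPDATE" ]; then
  echo "JDK $VERSION is below the minimum supported 11.0.$MIN_UPDATE" >&2
  exit 1
fi
```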
I was running these tests on JDK 11.0.2 on three different EC2 boxes and was able to reproduce the failure fairly consistently on all of them. When running

`./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.*"`

on repeat, I was typically able to reproduce the failure in around five minutes. After upgrading to 11.0.14, I have not yet been able to reproduce the OOM failure while looping the same set of tests for over half an hour. I am going to keep trying for a bit longer, and with the full test suite, to see if I can reproduce the OOM failure in any way on 11.0.14.
Ah, I think I had a brainfart moment and put a dot between the 1 and the 4, and figured 2 was greater than 1… lol. We can try testing with a later version.