OpenSearch: [BUG] 2.9.0 Restoring snapshot with remote_snapshot fails with exception from functioning S3 repository
Describe the bug: I'm trying to use the searchable snapshots feature, but restoring a snapshot with storage_type remote_snapshot fails. A picture says more than a thousand words, so I made a video detailing the steps and effects:
https://download.ict-one.nl/searchable_snapshots.mp4
Config:

```yaml
#s3.client.default.disable_chunked_encoding: false # Disables chunked encoding for compatibility with some storage services, but you probably don't need to change this value.
s3.client.default.endpoint: appliance.domain.nl:443 # S3 has alternate endpoints, but you probably don't need to change this value.
s3.client.default.max_retries: 3 # number of retries if a request fails
s3.client.default.path_style_access: true # whether to use the deprecated path-style bucket URLs.
# You probably don't need to change this value, but for more information, see https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html#path-style-access.
s3.client.default.protocol: https # http or https
s3.client.default.read_timeout: 50s # the S3 connection timeout
s3.client.default.use_throttle_retries: true
s3.client.default.region: us-east-2 # appears to be mandatory
```
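For reference, this is the shape of the restore request that triggers the failure (the repository and snapshot names are placeholders, not taken from the video; the index name matches the one in the logs):

```
POST /_snapshot/<repository>/<snapshot>/_restore
{
  "indices": "alert-suricata-alert-2023.03.27",
  "storage_type": "remote_snapshot"
}
```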
Exception:
```
[2023-08-07T16:44:36,439][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-search-nodes-1] [alert-suricata-alert-2023.03.27][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [alert-suricata-alert-2023.03.27][0]: Recovery failed on {opensearch-search-nodes-1}{FRC9rEwWSSS0NQXYZZqHCw}{b1gzqNL1RS2trAN9OO9R5w}{10.244.45.28}{10.244.45.28:
	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$30(IndexShard.java:3554) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$8(StoreRecovery.java:510) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:113) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2620) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
	... 11 more
Caused by: java.lang.ArithmeticException: long overflow
	at java.lang.Math.addExact(Math.java:903) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:121) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:100) ~[?:?]
	at org.opensearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:143) ~[?:?]
	at org.opensearch.index.store.remote.utils.TransferManager.lambda$createIndexInput$1(TransferManager.java:87) ~[opensearch-2.9.0.jar:2.9.0]
	at java.security.AccessController.doPrivileged(AccessController.java:318) ~[?:?]
	at org.opensearch.index.store.remote.utils.TransferManager.createIndexInput(TransferManager.java:83) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.utils.TransferManager$DelayedCreationCachedIndexInput.getIndexInput(TransferManager.java:135) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.utils.TransferManager.fetchBlob(TransferManager.java:72) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.fetchBlock(OnDemandBlockSnapshotIndexInput.java:147) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.demandBlock(OnDemandBlockIndexInput.java:340) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.seekInternal(OnDemandBlockIndexInput.java:311) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.seek(OnDemandBlockIndexInput.java:209) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.seek(OnDemandBlockSnapshotIndexInput.java:28) ~[opensearch-2.9.0.jar:2.9.0]
	at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:533) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.<init>(Lucene90CompoundReader.java:87) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.getCompoundReader(Lucene90CompoundFormat.java:86) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:103) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:92) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:94) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:774) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:146) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.engine.ReadOnlyEngine.open(ReadOnlyEngine.java:235) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:147) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.indices.IndicesService.lambda$getEngineFactory$10(IndicesService.java:854) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2340) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2302) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2272) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:630) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:115) ~[opensearch-2.9.0.jar:2.9.0]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.9.0.jar:2.9.0]
	... 8 more
```
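The root cause in the trace is the `java.lang.ArithmeticException: long overflow` thrown by `Math.addExact` inside `S3RetryingInputStream.openStream`. Unlike plain `+`, `Math.addExact` throws instead of silently wrapping when the sum of two `long`s exceeds `Long.MAX_VALUE`. A minimal standalone sketch of that behavior (the offset values here are hypothetical, not taken from OpenSearch internals):

```java
// Minimal sketch (not OpenSearch code): Math.addExact throws
// ArithmeticException("long overflow") instead of wrapping around,
// which matches the exception seen at S3RetryingInputStream.openStream.
public class LongOverflowDemo {
    public static void main(String[] args) {
        long position = 1L;            // hypothetical read offset
        long length = Long.MAX_VALUE;  // hypothetical "read to end" marker
        try {
            // end of a ranged read: position + length overflows a long
            long end = Math.addExact(position, length);
            System.out.println("end=" + end);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // prints "long overflow"
        }
    }
}
```

So if the requested length passed into the ranged S3 read is near `Long.MAX_VALUE`, adding it to any positive offset produces exactly this failure.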
To Reproduce: See the video above for the steps.
Expected behavior: The snapshot is restored and available for searching.
Plugins: repository-s3 plugin
Host/Environment: Default OpenSearch 2.9.0 Docker image with the S3 plugin enabled. The S3 storage is a Scality appliance, but since a normal restore works, I don't suspect that's the issue.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (8 by maintainers)
This has been fixed. The 2.9.1 patch release did not happen, but this fix will be included in the upcoming 2.10 release.
Will try. I ran the Docker image with the -ea option in the OPENSEARCH_JAVA_OPTS env var but didn't hit anything, and I'm not sure the -ea flag was picked up properly. Running OpenSearch on my Mac is somewhat troublesome, so I need to revive my Windows laptop. Will try to look into it ASAP.
Update: assertions did work, but I did not hit them. I created a totally fresh cluster and tested again; the result is a different error than in the first post. I made a full capture of the process, including logging from all nodes. I contacted Andrew on Slack to discuss whether I can share the capture with the dev team under TLP:AMBER.
Behavior on 2.8.0 is definitely different from 2.9.0. I’m able to restore the snapshot:
And it searches:
But the index state is yellow, and thus the cluster state is yellow as well. I'm guessing this is because the index has 1 primary shard and 1 replica (the default) but can't assign the replica. Is this default correct?
Update: Added a second search node; the cluster state is now green. Should this be necessary, and shouldn't this be documented? Will look into 2.9.0 now.
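For reference (not something tried in the thread, and assuming replica counts can be updated on a remote_snapshot index): instead of adding a second search node, the yellow state could presumably also be cleared by dropping the replica on the restored index:

```
PUT /alert-suricata-alert-2023.03.27/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
```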
Update: created a small pull request on the docs to address the points I ran into.