lucene: Corruption read on term dictionaries in Lucene 9.9
Description
It seems that https://github.com/apache/lucene/pull/12699/ has inadvertantly broken reading term dictionaries created in Lucene 9.8<=.
To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & force-merge.
Then attempt to read the created index using a wildcard query:
Path path = Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index");
try (FSDirectory dir = FSDirectory.open(path);
DirectoryReader reader = DirectoryReader.open(dir)) {
IndexSearcher searcher = new IndexSearcher(reader);
searcher.count(new WildcardQuery(new Term("body", "*fo*")));
}
This will result in a trace similar to below:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
at org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136)
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110)
at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114)
at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
at org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111)
at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179)
at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220)
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
at org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636)
at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
at Corruption.main(Corruption.java:18)
We are currently not sure if this effects Lucene 9.9 created indices & reading via Lucene 9.9.
EDIT: This failure does NOT occur for indices created by 9.9 and read by 9.9.
NOTE: This also fails with just a prefix wildcard query. It seems to be all multi-term queries could be affected.
Will provide more example stack traces in issue comments.
Version and environment details
Lucene 9.9 reading Lucene 9.8 indices.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- Update Lucene to remove terms enum optimzation This commit updates Lucene to a patched version of 9.9.0 without https://github.com/apache/lucene/pull/12699. See https://github.com/apache/lucene/issue... — committed to rjernst/elasticsearch by rjernst 7 months ago
- Update Lucene to remove terms enum optimzation (#103229) This commit updates Lucene to a patched version of 9.9.0 without https://github.com/apache/lucene/pull/12699. See https://github.com/apache/lu... — committed to elastic/elasticsearch by rjernst 7 months ago
- #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 — committed to mikemccand/lucene by mikemccand 7 months ago
- #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 (#12913) * #12901: add TestBackwardsCompatibility test case that reveals the block tree Inte... — committed to apache/lucene by mikemccand 7 months ago
- #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 (#12913) * #12901: add TestBackwardsCompatibility test case that reveals the block tree Inte... — committed to apache/lucene by mikemccand 7 months ago
- #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 (#12913) * #12901: add TestBackwardsCompatibility test case that reveals the block tree Inte... — committed to apache/lucene by mikemccand 7 months ago
- Check `Terms#intersect` in CheckIndex. This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world d... — committed to jpountz/lucene by jpountz 7 months ago
- Check `Terms#intersect` in CheckIndex. (#12925) This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with r... — committed to apache/lucene by jpountz 7 months ago
- Check `Terms#intersect` in CheckIndex. (#12925) This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with r... — committed to apache/lucene by jpountz 7 months ago
- Check `Terms#intersect` in CheckIndex. (#12925) This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with r... — committed to apache/lucene by jpountz 7 months ago
BTW, please do not beat yourself up over this 😉
We of course try our best not to introduce bugs when making exciting optimizations, but, we are only human, this block tree code is insanely hairy / complex, especially
IntersectTermsEnum(intersecting FST index + on-disk blocks with Automaton), and we have tests (which are inadequate now) that we improve every time we hit situations like this 😃Remember: in life and software, if you are not making mistakes, you are not trying hard enough!
I’ve managed to repro (thanks @benwtrent!) and indeed the bug is happening because this assumption is (very, very rarely?) invalid:
I see a case where a single byte prefix
[0xfe]appears before the final arc that has the rest of the floor/frame data. So the accumulator fails to put this0xfefirst andreadVLongthen fails. This is on a 9.8.0 written index where thevLongis still LSB encoded.I don’t think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?
One thing I am curious about: it’s possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug and maybe that’s why we are not seeing it with newly created (9.9.0) indices? We could test this by backporting #12722 to 9.8.x SNAPSHOT build and re-build the
wikibigalland see if the corruption still happens. I’m not saying this is a workaround or anything but it’d make me more comfortable if we could understand why 9.9.0 written indices are not corrupt. Alternatively, we could revert #12722 in 9.9.x, rebuild wikibigall, and see if the bug then happens? I won’t be able to try this likely for a day or two so if someone who can repro the bug could test this that would be awesome 😃. Thanks!I am with @mikemccand on this @gf2121!
The optimization proved valuable in benchmarks and all testing indicated it was good to go. Mistakes happen. I think the bigger thing is that our testing around this part of the code is lacking!
It’s awesome to see that a solution was found so quickly!
OK I opened #12901 to create a test – can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won’t have much time until early tomorrow AM. If anyone else wants to crack it, feel free!
Git bisect has confirmed the read corruption occurs with: https://github.com/apache/lucene/pull/12699
OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I’m quite sure blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I’d really love to see some stronger testing – I’m hoping a randomly written large enough index might reveal the bug. I’ll open a spinoff issue so we can create that.
Details:
We write FST with <term-prefix,byte[]> pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g.
foois prefix, and the N terms would befoobar,foobaz, … The start of thatbyte[]is avLong, which is the absolute file pointer of the on-disk block shifted up by 2 bits, and 2 lower bits are two important flags:OUTPUT_FLAG_IS_FLOORandOUTPUT_FLAG_HAS_TERMS.Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or possibly, even more rarely, bytes) are not all on the last arc, and this PR loses that leading flag-containing byte/bytes in that case, and computes the wrong flag value for
OUTPUT_FLAG_IS_FLOOR, then tries to readVInt when non was written --> exceptions at read time.@gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share leading prefix byte or two, it will never share that entire
vLong: there must be at least one final (different) byte forreadVLongto read. And becausevLongis self-delineating (each byte knows whether it is the last one by checking 8th bit is clear) losing a byte or two of its leading bytes won’t break reading of thevLong, i.e.,readVLongwill indeed stop at the right byte (but of course produce the wrong value).Importantly,
IntersectTermsEnumdoes not use this encoded file pointer in thefloorData! It gets the file-pointer by traversing prior arcs in the FST and accumulating N incremental diffs contained in the on-disk block entries leading to that final floor block. SoIntersectTermsEnumjust needs these two big flags.In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why FST got so much smaller), but, we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet buggy in a harmless way (throwing away bytes it does not try to use), and also because of how
vLongis self delineating. Fun!I think if a fix for this isn’t found early next week, we should consider reverting it.
No user should upgrade to Lucene 9.9.0 with this bug.