lucene: Corruption read on term dictionaries in Lucene 9.9

Description

It seems that https://github.com/apache/lucene/pull/12699/ has inadvertantly broken reading term dictionaries created in Lucene 9.8<=.

To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & force-merge.

Then attempt to read the created index using a wildcard query:

    Path path = Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index");
    try (FSDirectory dir = FSDirectory.open(path);
        DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader); 
      searcher.count(new WildcardQuery(new Term("body", "*fo*")));
    }

This will result in a trace similar to below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
	at org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136)
	at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110)
	at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220)
	at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
	at org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
	at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
	at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
	at Corruption.main(Corruption.java:18)

We are currently not sure if this effects Lucene 9.9 created indices & reading via Lucene 9.9.

EDIT: This failure does NOT occur for indices created by 9.9 and read by 9.9.

NOTE: This also fails with just a prefix wildcard query. It seems to be all multi-term queries could be affected.

Will provide more example stack traces in issue comments.

Version and environment details

Lucene 9.9 reading Lucene 9.8 indices.

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Comments: 24 (24 by maintainers)

Commits related to this issue

Most upvoted comments

and so sorry for introducing this bug!

BTW, please do not beat yourself up over this 😉

We of course try our best not to introduce bugs when making exciting optimizations, but, we are only human, this block tree code is insanely hairy / complex, especially IntersectTermsEnum (intersecting FST index + on-disk blocks with Automaton), and we have tests (which are inadequate now) that we improve every time we hit situations like this 😃

Remember: in life and software, if you are not making mistakes, you are not trying hard enough!

I’ve managed to repro (thanks @benwtrent!) and indeed the bug is happening because this assumption is (very, very rarely?) invalid:

This works because we can not have two same fp encoded before floor data. I think this assumption works well for now, otherwise we will break this assert before meeting these exceptions.

I see a case where a single byte prefix [0xfe] appears before the final arc that has the rest of the floor/frame data. So the accumulator fails to put this 0xfe first and readVLong then fails. This is on a 9.8.0 written index where the vLong is still LSB encoded.

I don’t think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?

One thing I am curious about: it’s possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug and maybe that’s why we are not seeing it with newly created (9.9.0) indices? We could test this by backporting #12722 to 9.8.x SNAPSHOT build and re-build the wikibigall and see if the corruption still happens. I’m not saying this is a workaround or anything but it’d make me more comfortable if we could understand why 9.9.0 written indices are not corrupt. Alternatively, we could revert #12722 in 9.9.x, rebuild wikibigall, and see if the bug then happens? I won’t be able to try this likely for a day or two so if someone who can repro the bug could test this that would be awesome 😃. Thanks!

and so sorry for introducing this bug!

I am with @mikemccand on this @gf2121!

The optimization proved valuable in benchmarks and all testing indicated it was good to go. Mistakes happen. I think the bigger thing is that our testing around this part of the code is lacking!

It’s awesome to see that a solution was found so quickly!

OK I opened #12901 to create a test – can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won’t have much time until early tomorrow AM. If anyone else wants to crack it, feel free!

Git bisect has confirmed the read corruption occurs with: https://github.com/apache/lucene/pull/12699

OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I’m quite sure blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I’d really love to see some stronger testing – I’m hoping a randomly written large enough index might reveal the bug. I’ll open a spinoff issue so we can create that.

Details:

We write FST with <term-prefix,byte[]> pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g. foo is prefix, and the N terms would be foobar, foobaz, … The start of that byte[] is a vLong, which is the absolute file pointer of the on-disk block shifted up by 2 bits, and 2 lower bits are two important flags: OUTPUT_FLAG_IS_FLOOR and OUTPUT_FLAG_HAS_TERMS.

Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or possibly, even more rarely, bytes) are not all on the last arc, and this PR loses that leading flag-containing byte/bytes in that case, and computes the wrong flag value for OUTPUT_FLAG_IS_FLOOR, then tries to readVInt when non was written --> exceptions at read time.

@gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share leading prefix byte or two, it will never share that entire vLong: there must be at least one final (different) byte for readVLong to read. And because vLong is self-delineating (each byte knows whether it is the last one by checking 8th bit is clear) losing a byte or two of its leading bytes won’t break reading of the vLong, i.e., readVLong will indeed stop at the right byte (but of course produce the wrong value).

Importantly, IntersectTermsEnum does not use this encoded file pointer in the floorData! It gets the file-pointer by traversing prior arcs in the FST and accumulating N incremental diffs contained in the on-disk block entries leading to that final floor block. So IntersectTermsEnum just needs these two big flags.

In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why FST got so much smaller), but, we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet buggy in a harmless way (throwing away bytes it does not try to use), and also because of how vLong is self delineating. Fun!

I think if a fix for this isn’t found early next week, we should consider reverting it.

No user should upgrade to Lucene 9.9.0 with this bug.