dependency-track: Index commit occasionally fails with `AlreadyClosedException` in some deployments

Current Behavior

In some deployments, committing Lucene indexes may fail with an AlreadyClosedException, stating that the write.lock file of the respective index was “changed by an external force”:

ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-01-05T13:41:30.299072688Z, (lock=NativeFSLock(path=/data/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-01-05T10:41:31.278927236Z))
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
	at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
	at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
	at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
	at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
	at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

write.lock is empty, and created when an IndexWriter instance is created. DT creates an IndexWriter instance per Index on startup. Write access to indexes is serialized using the single threaded ExecutorService, so there’s no way an index could be modified by multiple threads.

Lucene uses timestamps of write.lock to determine whether it has been modified after the index was opened.

Steps to Reproduce

I am unable to reproduce locally. It seems like it’s happening predominantly on servers, but I am yet unsure what exactly is the cause.

Could be third party services modifying the file, or file systems incapable of tracking file timestamps correctly, or skewed server times.

The exception can be forced by touching (i.e. modifying access timestamps) the write.lock file after DT has started, e.g.:

touch ~/.dependency-track/index/component/write.lock

Afterwards, upload a BOM and monitor the log for AlreadyClosedExceptions.

Expected Behavior

Committing indexes should not fail.

Dependency-Track Version

4.7.0

Dependency-Track Distribution

Container Image, Executable WAR

Database Server

N/A

Database Server Version

No response

Browser

N/A

Checklist

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (13 by maintainers)

Commits related to this issue

Most upvoted comments

Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the IndexWriters when the application shuts down.

Found the culprit.

Since 4.7.0 there is an index health check on startup, but also recurring every 72h (per default, configurable in UI at Administration -> Search -> Search - Periodic consistency check -> Cadence).

image

The health check uses Lucene’s CheckIndex. CheckIndex does not only “check” the index, it actually tries to repair it if necessary. It is a writing operation that requires a write lock to be acquired. This is why DT currently deletes any existing lock files before starting CheckIndex:

https://github.com/DependencyTrack/dependency-track/blob/7cfc466baf7548ddb758f9e7b008b2d4aaedd148/src/main/java/org/dependencytrack/search/IndexManager.java#L438-L442

Write locks however are also held my the IndexWriters. There is an IndexWriter for every index. Official guidance from Lucene recommends to create IndexWriter once and then share the created instance among threads, for the entire lifetime of the application. So this is what DT does. IndexWriter creates a write lock file when it is created, and only releases that lock when it is closed. Based on Lucene guidance, and how it is implemented in DT, IndexWriters are only closed when the application shuts down.

So this is what happens:

  • DT starts
  • Event subsystem and task scheduler are started
    • Many event subscribers perform write operations on indexes, so IndexWriters may be created at this point
  • Index health check is performed, deleting any write.lock files that may have been created already, assuming they are stale
    • Not sure how common this is, most scheduled tasks have an initial delay of at least 10s
  • System runs ~72h, at this point most if not all IndexWriters and their lock files should have been created
  • Index health check is performed, again deleting lock files
  • From that point on, writing operations on indexes will fail with AlreadyClosedException, since the file has been recreated after the IndexWriter created it. It has to assume that the lock it holds is ineffective

As a short-term fix, disable the periodic consistency check:

image

As for a proper fix in the code, I see multiple options:

  • Close IndexWriters before performing the health check, thus releasing all locks. Re-open the writers after the check completed.
    • Will be hard to pull of, considering write operations to indexes are being performed in many asynchronous threads. If we close the IndexWriters, all those accesses will fail.
  • Only perform the health check on startup, when we can guarantee that no other write operations are being performed
  • Do not use CheckIndex at all

In my naive thinking

You and me both 😓. I was naively thinking that IndexWriter instances would lock the directory on a strictly need-to-lock basis so that the lock could be used as a mutex with CheckIndex. After looking at the code, it appears that the lock is obtained at the IndexWriter instance creation and held throughout its lifetime.

In that case, I vote for

Only perform the health check on startup, when we can guarantee that no other write operations are being performed

The periodic task would only compute the delta ratio between DB and Lucene and fire REINDEX events to the corresponding IndexWriter

Hello @nscuro

Thanks for the spot on analysis.

Index health check is performed, deleting any write.lock files that may have been created already, assuming they are stale

  • Not sure how common this is, most scheduled tasks have an initial delay of at least 10s

I don’t think that a race condition between the first task execution and initial health check is that common. I’ve rather implemented this check to prevent against hard process kill/crash which will leave stale lock file (too many bad experiences with flaky file based lock implementations 😓).

There’s maybe another alternative for a fix :

  • Ensuring that the Directory instance used for each indexer instance is a singleton
  • At startup, if the lock file is present and the corresponding Directory instance is null, the stale lock file is deleted and Directory instance created
  • Indexer instances will work as usual
  • Index health check is performed ~72h later. The lock file will likely be present but the Directory instance will be not null so it is fed to CheckIndex and reuse the shared lock to ensure proper synchronization with write occuring at the same time

So far it seems like the AlreadyClosedException is occurring exclusively on the servicecomponent index, can you confirm this?

The vast majority of the exceptions happens on the path=/data/.dependency-track/index/servicecomponent/write.lock, however not exclusively. It also happens on path=/data/.dependency-track/index/vulnerability/write.lock, path=/data/.dependency-track/index/component/write.lock and path=/data/.dependency-track/index/cpe/write.lock.

Some examples (with context) below:

2023-02-02 00:29:16,997 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/vulnerability/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
	at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
	at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
	at org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.<init>(Lucene80NormsConsumer.java:45)
	at org.apache.lucene.codecs.lucene80.Lucene80NormsFormat.normsConsumer(Lucene80NormsFormat.java:85)
	at org.apache.lucene.index.DefaultIndexingChain.writeNorms(DefaultIndexingChain.java:389)
	at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:221)
	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
	at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
	at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
	at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
	at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:17,948 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/component/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:04.920111164Z))
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
	at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
	at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
	at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
	at org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:48)
	at org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:55)
	at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:452)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:488)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
	at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:80)
	at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:44)
	at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:58)
	at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

We see the org.apache.lucene.store.AlreadyClosedException directly after the upload of a SBOM.

2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Identified 0 new components
2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Processing CycloneDX dependency graph for project: 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [BomUploadProcessingTask] Processed 140 components and 0 services uploaded to project 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [RepositoryMetaAnalyzerTask] Performing component repository metadata analysis against 140 components
2023-02-03 07:07:20,819 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
	at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
	at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
	at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
	at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
	at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)