dependency-track: Index commit occasionally fails with `AlreadyClosedException` in some deployments
Current Behavior
In some deployments, committing Lucene indexes may fail with an AlreadyClosedException, stating that the write.lock file of the respective index was “changed by an external force”:
ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-01-05T13:41:30.299072688Z, (lock=NativeFSLock(path=/data/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-01-05T10:41:31.278927236Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
write.lock is empty and is created when an IndexWriter instance is created. DT creates one IndexWriter instance per index on startup. Write access to indexes is serialized through a single-threaded ExecutorService, so there is no way an index could be modified by multiple threads.
Lucene uses the timestamps of write.lock (note the creationTime recorded in the lock shown in the exception above) to determine whether the file has been modified after the index was opened.
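To illustrate that mechanism, here is a simplified sketch (not Lucene's actual NativeFSLockFactory code): the lock is considered invalid as soon as the lock file's creation time no longer matches the value captured when the lock was acquired.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

// Illustrative sketch only: validate a file lock by comparing the lock file's
// creation time against the one recorded when the lock was obtained.
final class LockFreshnessCheck {

    private final Path lockFile;
    private final FileTime creationTimeAtAcquire;

    LockFreshnessCheck(Path lockFile) throws IOException {
        this.lockFile = lockFile;
        this.creationTimeAtAcquire =
                Files.readAttributes(lockFile, BasicFileAttributes.class).creationTime();
    }

    void ensureValid() throws IOException {
        FileTime currentCreationTime =
                Files.readAttributes(lockFile, BasicFileAttributes.class).creationTime();
        if (!currentCreationTime.equals(creationTimeAtAcquire)) {
            // Lucene throws AlreadyClosedException("Underlying file changed by an external force ...") here.
            throw new IllegalStateException("write.lock was changed by an external force");
        }
    }
}
```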
Steps to Reproduce
I am unable to reproduce this locally. It seems to happen predominantly on servers, but I am as yet unsure what exactly the cause is.
Possible causes include third-party services modifying the file, file systems that cannot track file timestamps correctly, or skewed server clocks.
The exception can be forced by touching the write.lock file (i.e. modifying its timestamps) after DT has started, e.g.:
touch ~/.dependency-track/index/component/write.lock
Afterwards, upload a BOM and monitor the log for AlreadyClosedExceptions.
Expected Behavior
Committing indexes should not fail.
Dependency-Track Version
4.7.0
Dependency-Track Distribution
Container Image, Executable WAR
Database Server
N/A
Database Server Version
No response
Browser
N/A
Checklist
- I have read and understand the contributing guidelines
- I have checked the existing issues for whether this defect was already reported
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (13 by maintainers)
Commits related to this issue
- Fix `AlreadyClosedException` when committing search indexes See https://github.com/DependencyTrack/dependency-track/issues/2379#issuecomment-1416800378 for background. Fixes #2379 Signed-off-by: ns... — committed to nscuro/dependency-track by nscuro a year ago
- Fix `AlreadyClosedException` when committing search indexes See https://github.com/DependencyTrack/dependency-track/issues/2379#issuecomment-1416800378 for background. Fixes #2379 Signed-off-by: ns... — committed to AISAutomation/dependency-track by nscuro a year ago
Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the IndexWriters when the application shuts down.

Found the culprit.
Since 4.7.0 there is an index health check on startup, which also recurs every 72h (by default; the cadence is configurable in the UI at Administration -> Search -> Periodic consistency check -> Cadence).
The health check uses Lucene's CheckIndex. CheckIndex does not only "check" the index, it actually tries to repair it if necessary. It is a writing operation that requires a write lock to be acquired. This is why DT currently deletes any existing lock files before starting CheckIndex:
https://github.com/DependencyTrack/dependency-track/blob/7cfc466baf7548ddb758f9e7b008b2d4aaedd148/src/main/java/org/dependencytrack/search/IndexManager.java#L438-L442
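The linked code amounts to roughly the following (a paraphrased sketch for illustration, not the exact DT source; indexDir stands in for the index directory managed by IndexManager):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

// Paraphrased sketch of the pre-check lock cleanup described above.
final class PreCheckLockCleanup {

    static void runHealthCheck(Path indexDir) throws IOException {
        Path lockFile = indexDir.resolve("write.lock");
        // Any existing lock file is assumed to be stale and removed ...
        Files.deleteIfExists(lockFile);
        // ... which is wrong while a long-lived IndexWriter still holds that lock.
        try (FSDirectory directory = FSDirectory.open(indexDir);
             CheckIndex checkIndex = new CheckIndex(directory)) { // acquires a fresh write lock
            CheckIndex.Status status = checkIndex.checkIndex();
            if (!status.clean) {
                // DT would report (and possibly repair) the index here.
            }
        }
    }
}
```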
Write locks, however, are also held by the IndexWriters. There is an IndexWriter for every index. Official guidance from Lucene recommends creating an IndexWriter once and then sharing the created instance among threads, for the entire lifetime of the application. This is what DT does. An IndexWriter creates a write lock file when it is created, and only releases that lock when it is closed. Based on the Lucene guidance, and how it is implemented in DT, IndexWriters are only closed when the application shuts down.

So this is what happens:
1. Upon application startup, the health check is kicked off; depending on timing, some or all IndexWriters may be created at this point.
2. The health check deletes write.lock files that may have been created already, assuming they are stale, and CheckIndex then creates fresh ones.
3. The check runs shortly after startup (on the order of 10s in), by which time the IndexWriters and their lock files should have been created, so the check ends up clobbering locks that are still in use.
4. The next commit on an affected index fails with AlreadyClosedException, since the file has been recreated after the IndexWriter created it. The writer has to assume that the lock it holds is ineffective.

As a short-term fix, disable the periodic consistency check (via the setting mentioned above).
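The sequence can be reproduced outside of DT with a minimal, hypothetical sketch (assuming Lucene's default NativeFSLockFactory on a Linux filesystem that reports file creation times; this is not DT code):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Standalone reproduction of the failure mode: a long-lived IndexWriter whose
// write.lock is deleted and recreated by an "external force".
public class LockClobberRepro {

    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("lucene-lock-repro");
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(new Document());
            writer.commit(); // fine: write.lock is still the file this writer created

            // Simulate the health check deleting the "stale" lock and a new lock being created.
            Path lockFile = indexDir.resolve("write.lock");
            Files.delete(lockFile);     // works on Linux even while the lock is held
            Files.createFile(lockFile); // creation time now differs from what the writer recorded

            writer.addDocument(new Document());
            writer.commit(); // throws AlreadyClosedException: "Underlying file changed by an external force"
        }
    }
}
```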
As for a proper fix in the code, I see multiple options:
1. Close all IndexWriters before performing the health check, thus releasing all locks, and re-open the writers after the check completed (a sketch follows below). Downside: while the writers are closed and other threads try to access the IndexWriters, all those accesses will fail.
2. Not use CheckIndex at all.
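A minimal sketch of option 1, assuming the writer can simply be closed and re-created around the check (the class and method names here are hypothetical, not DT's IndexManager API):

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: release the write lock around the health check by
// closing the long-lived writer and re-opening it afterwards.
final class HealthCheckWithWriterRestart {

    static IndexWriter checkAndReopen(IndexWriter writer, Path indexDir) throws IOException {
        writer.close(); // releases write.lock; concurrent index writes would fail during this window

        try (FSDirectory directory = FSDirectory.open(indexDir);
             CheckIndex checkIndex = new CheckIndex(directory)) {
            CheckIndex.Status status = checkIndex.checkIndex();
            if (!status.clean) {
                checkIndex.exorciseIndex(status); // drops unrecoverable segments
            }
        }

        // Re-open a fresh writer once the check has completed.
        return new IndexWriter(FSDirectory.open(indexDir), new IndexWriterConfig(new StandardAnalyzer()));
    }
}
```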
You and me both 😓. I was naively thinking that IndexWriter instances would lock the directory on a strictly need-to-lock basis, so that the lock could be used as a mutex with CheckIndex. After looking at the code, it appears that the lock is obtained at IndexWriter instance creation and held throughout its lifetime.

In that case, I vote for not using CheckIndex at all: the periodic task would only compute the delta ratio between DB and Lucene and fire REINDEX events to the corresponding IndexWriter.
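For illustration, such a CheckIndex-free check could look roughly like the following sketch (the dbCount value and the fireReindexEvent callback are placeholders standing in for DT's database layer and event framework, not actual DT APIs):

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: compare document counts between DB and Lucene and fire a
// reindex event when the delta ratio exceeds a threshold.
final class DeltaConsistencyCheck {

    static void check(Directory indexDirectory, long dbCount, double threshold,
                      Runnable fireReindexEvent) throws IOException {
        final long indexCount;
        try (DirectoryReader reader = DirectoryReader.open(indexDirectory)) {
            indexCount = reader.numDocs();
        }
        // Delta ratio between the database and the Lucene index.
        double delta = dbCount == 0
                ? (indexCount == 0 ? 0 : 1)
                : Math.abs(dbCount - indexCount) / (double) dbCount;
        if (delta > threshold) {
            fireReindexEvent.run(); // e.g. dispatch a REINDEX event for this index
        }
    }
}
```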
Hello @nscuro,
Thanks for the spot-on analysis.
I don't think that a race condition between the first task execution and the initial health check is that common. I rather implemented this check to protect against a hard process kill/crash, which would leave a stale lock file behind (too many bad experiences with flaky file-based lock implementations 😓).
There's maybe another alternative for a fix (sketched below):
- The Directory instance used for each indexer instance is a singleton.
- On first access, when the Directory instance is null, the stale lock file is deleted and the Directory instance is created.
- By the time the health check runs, the Directory instance will not be null, so it is fed to CheckIndex and reuses the shared lock to ensure proper synchronization with writes occurring at the same time.
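A rough sketch of that idea, assuming a lazily-initialized shared Directory per index (whether CheckIndex can in fact reuse a lock held by an open IndexWriter is not verified here; this only illustrates the proposed lifecycle):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: one shared Directory per index, created lazily after
// removing any lock file left over from a previous crash.
final class IndexDirectoryHolder {

    private static volatile Directory directory;

    static Directory get(Path indexPath) throws IOException {
        if (directory == null) {
            synchronized (IndexDirectoryHolder.class) {
                if (directory == null) {
                    // First access: a write.lock left over from a crash is stale and can be removed.
                    Files.deleteIfExists(indexPath.resolve("write.lock"));
                    directory = FSDirectory.open(indexPath);
                }
            }
        }
        // Later callers (IndexWriter creation, the health check) share the same instance.
        return directory;
    }
}
```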
The vast majority of the exceptions happens on path=/data/.dependency-track/index/servicecomponent/write.lock, however not exclusively. It also happens on path=/data/.dependency-track/index/vulnerability/write.lock, path=/data/.dependency-track/index/component/write.lock and path=/data/.dependency-track/index/cpe/write.lock.

Some examples (with context) below:
We see the org.apache.lucene.store.AlreadyClosedException directly after the upload of an SBOM.