dependency-track: Index commit occasionally fails with `AlreadyClosedException` in some deployments
Current Behavior
In some deployments, committing Lucene indexes may fail with an AlreadyClosedException
, stating that the write.lock
file of the respective index was “changed by an external force”:
ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-01-05T13:41:30.299072688Z, (lock=NativeFSLock(path=/data/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-01-05T10:41:31.278927236Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
write.lock
is empty, and created when an IndexWriter
instance is created. DT creates an IndexWriter
instance per Index on startup. Write access to indexes is serialized using the single threaded ExecutorService
, so there’s no way an index could be modified by multiple threads.
Lucene uses timestamps of write.lock
to determine whether it has been modified after the index was opened.
Steps to Reproduce
I am unable to reproduce locally. It seems like it’s happening predominantly on servers, but I am yet unsure what exactly is the cause.
Could be third party services modifying the file, or file systems incapable of tracking file timestamps correctly, or skewed server times.
The exception can be forced by touch
ing (i.e. modifying access timestamps) the write.lock
file after DT has started, e.g.:
touch ~/.dependency-track/index/component/write.lock
Afterwards, upload a BOM and monitor the log for AlreadyClosedException
s.
Expected Behavior
Committing indexes should not fail.
Dependency-Track Version
4.7.0
Dependency-Track Distribution
Container Image, Executable WAR
Database Server
N/A
Database Server Version
No response
Browser
N/A
Checklist
- I have read and understand the contributing guidelines
- I have checked the existing issues for whether this defect was already reported
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (13 by maintainers)
Commits related to this issue
- Fix `AlreadyClosedException` when committing search indexes See https://github.com/DependencyTrack/dependency-track/issues/2379#issuecomment-1416800378 for background. Fixes #2379 Signed-off-by: ns... — committed to nscuro/dependency-track by nscuro a year ago
- Fix `AlreadyClosedException` when committing search indexes See https://github.com/DependencyTrack/dependency-track/issues/2379#issuecomment-1416800378 for background. Fixes #2379 Signed-off-by: ns... — committed to AISAutomation/dependency-track by nscuro a year ago
Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the
IndexWriter
s when the application shuts down.Found the culprit.
Since 4.7.0 there is an index health check on startup, but also recurring every 72h (per default, configurable in UI at Administration -> Search -> Search - Periodic consistency check -> Cadence).
The health check uses Lucene’s
CheckIndex
.CheckIndex
does not only “check” the index, it actually tries to repair it if necessary. It is a writing operation that requires a write lock to be acquired. This is why DT currently deletes any existing lock files before startingCheckIndex
:https://github.com/DependencyTrack/dependency-track/blob/7cfc466baf7548ddb758f9e7b008b2d4aaedd148/src/main/java/org/dependencytrack/search/IndexManager.java#L438-L442
Write locks however are also held my the
IndexWriter
s. There is anIndexWriter
for every index. Official guidance from Lucene recommends to createIndexWriter
once and then share the created instance among threads, for the entire lifetime of the application. So this is what DT does.IndexWriter
creates a write lock file when it is created, and only releases that lock when it is closed. Based on Lucene guidance, and how it is implemented in DT,IndexWriter
s are only closed when the application shuts down.So this is what happens:
IndexWriter
s may be created at this pointwrite.lock
files that may have been created already, assuming they are stale10s
IndexWriter
s and their lock files should have been createdAlreadyClosedException
, since the file has been recreated after theIndexWriter
created it. It has to assume that the lock it holds is ineffectiveAs a short-term fix, disable the periodic consistency check:
As for a proper fix in the code, I see multiple options:
IndexWriter
s before performing the health check, thus releasing all locks. Re-open the writers after the check completed.IndexWriter
s, all those accesses will fail.CheckIndex
at allYou and me both 😓. I was naively thinking that
IndexWriter
instances would lock the directory on a strictly need-to-lock basis so that the lock could be used as a mutex withCheckIndex
. After looking at the code, it appears that the lock is obtained at theIndexWriter
instance creation and held throughout its lifetime.In that case, I vote for
The periodic task would only compute the delta ratio between DB and Lucene and fire
REINDEX
events to the correspondingIndexWriter
Hello @nscuro
Thanks for the spot on analysis.
I don’t think that a race condition between the first task execution and initial health check is that common. I’ve rather implemented this check to prevent against hard process kill/crash which will leave stale lock file (too many bad experiences with flaky file based lock implementations 😓).
There’s maybe another alternative for a fix :
Directory
instance used for each indexer instance is a singletonDirectory
instance is null, the stale lock file is deleted andDirectory
instance createdDirectory
instance will be not null so it is fed toCheckIndex
and reuse the shared lock to ensure proper synchronization with write occuring at the same timeThe vast majority of the exceptions happens on the
path=/data/.dependency-track/index/servicecomponent/write.lock
, however not exclusively. It also happens onpath=/data/.dependency-track/index/vulnerability/write.lock
,path=/data/.dependency-track/index/component/write.lock
andpath=/data/.dependency-track/index/cpe/write.lock
.Some examples (with context) below:
We see the
org.apache.lucene.store.AlreadyClosedException
directly after the upload of a SBOM.