chroma: [BUG] disk I/O error on Databricks
What happened?
I run the following in a Databricks notebook:
import chromadb

vector_db_path = '/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db/'
client = chromadb.PersistentClient(path=vector_db_path)
And I get back: “OperationalError: disk I/O error”
Here is a report for a related issue
Versions
Chroma v0.4.5; Python v3.10.6; Ubuntu 22.04.2 LTS; Databricks 13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)
Relevant log output
OperationalError                          Traceback (most recent call last)
File <command-3393434180226691>:1
----> 1 client = chromadb.PersistentClient(path=vector_db_path)
      3 vector_db = Chroma(
      4     client=client,
      5     collection_name="HVAC",
      6     embedding_function=embeddings,
      7 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:73, in PersistentClient(path, settings)
     70 settings.persist_directory = path
     71 settings.is_persistent = True
---> 73 return Client(settings)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:112, in Client(settings)
    109 telemetry_client = system.instance(Telemetry)
    110 api = system.instance(API)
--> 112 system.start()
    114 # Submit event for client start
    115 telemetry_client.capture(ClientStartEvent())
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/config.py:216, in System.start(self)
    214 super().start()
    215 for component in self.components():
--> 216     component.start()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:93, in SqliteDB.start(self)
     91     cur.execute("PRAGMA foreign_keys = ON")
     92     cur.execute("PRAGMA case_sensitive_like = ON")
---> 93 self.initialize_migrations()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:128, in MigratableDB.initialize_migrations(self)
    125     self.validate_migrations()
    127 if migrate == "apply":
--> 128     self.apply_migrations()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:147, in MigratableDB.apply_migrations(self)
    145 def apply_migrations(self) -> None:
    146     """Validate existing migrations, and apply all new ones."""
--> 147     self.setup_migrations()
    148     for dir in self.migration_dirs():
    149         db_migrations = self.db_migrations(dir)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:149, in SqliteDB.setup_migrations(self)
    147 @override
    148 def setup_migrations(self) -> None:
--> 149     with self.tx() as cur:
    150         cur.execute(
    151             """
    152              CREATE TABLE IF NOT EXISTS migrations (
   (...)
    160              """
    161         )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:47, in TxWrapper.__exit__(self, exc_type, exc_value, traceback)
     45 if len(self._tx_stack.stack) == 0:
     46     if exc_type is None:
---> 47         self._conn.commit()
     48     else:
     49         self._conn.rollback()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite_pool.py:31, in Connection.commit(self)
     30 def commit(self) -> None:
---> 31     self._conn.commit()
OperationalError: disk I/O error
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (1 by maintainers)
Since you are working with DBFS, instead of having “/dbfs/” at the start of your path you need to specify it using “dbfs:/”.
Your path will then be:
‘dbfs:/FileStore/HuggingFace/data/demo_langchain/test_vector_db/’
I have been able to save my Chroma vector DB in DBFS using LangChain; I haven’t tried the ‘pure’ Chroma API.
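For reference, a minimal sketch of what that looks like through the LangChain wrapper, using the path form suggested above; the docs list and embeddings object are assumptions, not taken from the report:

from langchain.vectorstores import Chroma

# Assumed inputs: docs is a list of LangChain Documents and embeddings is an
# embedding function such as HuggingFaceEmbeddings().
vector_db = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="HVAC",
    persist_directory='dbfs:/FileStore/HuggingFace/data/demo_langchain/test_vector_db/',
)
# With Chroma 0.4.x the collection is written to the persist directory automatically.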
W.r.t. Chroma specifically, as I understand it you can run Chroma as a standalone server and treat it that way. Of course you then have to run and manage that service, and think about where the storage is persisted and all that, but you get a read/write endpoint.
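A minimal sketch of that client/server mode, assuming you already have a Chroma server running somewhere you control; the host and port below are placeholders:

import chromadb

# Talk to a separately managed Chroma server instead of writing SQLite files
# onto DBFS; replace host/port with your deployment's endpoint.
client = chromadb.HttpClient(host="chroma.example.internal", port=8000)

collection = client.get_or_create_collection("HVAC")
collection.add(ids=["doc-1"], documents=["example text"])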
DBFS is just “cloud storage”.
dbfs: is not understood by Chroma, of course; it’s a Databricks-specific alias that Spark and a few other tools support. DBFS is exposed as if it were local files via a FUSE mount at /dbfs. So, using a dbfs: path with Chroma is actually just referring to a local directory you’ve created called dbfs:/. It “works” but is not what you intend; the files are not on DBFS.
You can use a /dbfs path with Chroma, as it appears to be a local file. However, Chroma wants random-write access to files, of course. Cloud storage in general does not support that: you can append, not change. So you get an I/O error. You can’t work off of cloud storage like this. You can, however, write to a local dir, copy the result to /dbfs, and read it from there. That works fine.
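A minimal sketch of that workaround, with placeholder paths; it builds the index on the driver’s local disk and only uses the /dbfs FUSE mount for copying:

import shutil
import chromadb

local_path = "/local_disk0/tmp/test_vector_db"   # local disk on the driver
dbfs_path = "/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db"

# 1. Build the index on local disk, where SQLite gets the random-write access it needs.
client = chromadb.PersistentClient(path=local_path)
collection = client.get_or_create_collection("HVAC")
collection.add(ids=["doc-1"], documents=["example text"])

# 2. Copy the finished directory to DBFS through the FUSE mount for durable storage.
shutil.copytree(local_path, dbfs_path, dirs_exist_ok=True)

# 3. On a later cluster, copy it back to local disk before opening it with Chroma.
shutil.copytree(dbfs_path, local_path, dirs_exist_ok=True)
client = chromadb.PersistentClient(path=local_path)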