chroma: [BUG] disk I/O error on Databricks
What happened?
I ran the following in a Databricks notebook:
import chromadb
vector_db_path = '/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db/'
client = chromadb.PersistentClient(path=vector_db_path)
And I get back: “OperationalError: disk I/O error”
Here is a report of a related issue
Versions
Chroma v0.4.5; Python v3.10.6; Ubuntu 22.04.2 LTS; Databricks 13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)
Relevant log output
OperationalError Traceback (most recent call last)
File <command-3393434180226691>:1
----> 1 client = chromadb.PersistentClient(path=vector_db_path)
3 vector_db = Chroma(
4 client=client,
5 collection_name="HVAC",
6 embedding_function=embeddings,
7 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:73, in PersistentClient(path, settings)
70 settings.persist_directory = path
71 settings.is_persistent = True
---> 73 return Client(settings)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:112, in Client(settings)
109 telemetry_client = system.instance(Telemetry)
110 api = system.instance(API)
--> 112 system.start()
114 # Submit event for client start
115 telemetry_client.capture(ClientStartEvent())
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/config.py:216, in System.start(self)
214 super().start()
215 for component in self.components():
--> 216 component.start()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:93, in SqliteDB.start(self)
91 cur.execute("PRAGMA foreign_keys = ON")
92 cur.execute("PRAGMA case_sensitive_like = ON")
---> 93 self.initialize_migrations()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:128, in MigratableDB.initialize_migrations(self)
125 self.validate_migrations()
127 if migrate == "apply":
--> 128 self.apply_migrations()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:147, in MigratableDB.apply_migrations(self)
145 def apply_migrations(self) -> None:
146 """Validate existing migrations, and apply all new ones."""
--> 147 self.setup_migrations()
148 for dir in self.migration_dirs():
149 db_migrations = self.db_migrations(dir)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:149, in SqliteDB.setup_migrations(self)
147 @override
148 def setup_migrations(self) -> None:
--> 149 with self.tx() as cur:
150 cur.execute(
151 """
152 CREATE TABLE IF NOT EXISTS migrations (
(...)
160 """
161 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:47, in TxWrapper.__exit__(self, exc_type, exc_value, traceback)
45 if len(self._tx_stack.stack) == 0:
46 if exc_type is None:
---> 47 self._conn.commit()
48 else:
49 self._conn.rollback()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite_pool.py:31, in Connection.commit(self)
30 def commit(self) -> None:
---> 31 self._conn.commit()
OperationalError: disk I/O error
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (1 by maintainers)
Since you are working with DBFS, instead of having “/dbfs/” as your path, you need to specify it using “dbfs:/”.
Your path will then be:
‘dbfs:/FileStore/HuggingFace/data/demo_langchain/test_vector_db/’
I have been able to save my Chroma vector db in DBFS using LangChain this way. I haven’t tried to use the ‘pure’ Chroma API.
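For illustration only (the commenter’s snippet is not shown above), a minimal sketch of such a LangChain-based save might look like the following; the texts, embedding model, and persist_directory are placeholders, not the commenter’s actual code:

# Hypothetical sketch, not the commenter's actual code: persisting a Chroma
# vector store through LangChain. Texts, embedding model, and path are placeholders.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings()  # any LangChain embedding function would do

vector_db = Chroma.from_texts(
    texts=["example document text"],
    embedding=embeddings,
    collection_name="HVAC",
    persist_directory="dbfs:/FileStore/HuggingFace/data/demo_langchain/test_vector_db/",
)
vector_db.persist()  # flush the store to the configured directory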
W.r.t. Chroma specifically, as I understand it you can run Chroma as a standalone server and treat it that way. Of course you then have to run and manage that service, and think about where its storage is persisted and all that, but you get a read/write endpoint.
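For example, a minimal sketch of that client/server approach, assuming a Chroma server is already running somewhere reachable (the host and port below are placeholders):

import chromadb

# Hypothetical sketch: talk to a separately managed Chroma server instead of
# writing SQLite files onto DBFS. Host and port are placeholders.
client = chromadb.HttpClient(host="chroma.example.internal", port=8000)
collection = client.get_or_create_collection("HVAC")
collection.add(ids=["doc-1"], documents=["example document text"])
print(collection.count())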
DBFS is just “cloud storage”.
dbfs: is not understood by Chroma, of course; it’s a Databricks-specific alias that Spark and a few other tools support. DBFS is exposed as if it were local files via a FUSE mount at /dbfs. So using a dbfs: path with Chroma actually just refers to a local directory you have created called dbfs:/. It “works”, but it is not what you intend; the files are not on DBFS.

You can use a /dbfs path with Chroma, as it appears to be a local file. However, Chroma wants random-write access to its files, and cloud storage in general does not support that: you can append, but not change. So you get an I/O error. You can’t work off of cloud storage like this. You can, however, write to a local dir, copy the result to /dbfs, and read it from there. That works fine.
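For example, a rough sketch of that write-locally-then-copy workaround (paths are placeholders):

import shutil
import chromadb

# Hypothetical sketch of the workaround described above: build the index on
# local disk, which supports the random writes SQLite needs, then copy it out.
local_path = "/local_disk0/tmp/test_vector_db"   # placeholder local directory
dbfs_path = "/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db"

# Build the index against local storage.
client = chromadb.PersistentClient(path=local_path)
collection = client.get_or_create_collection("HVAC")
collection.add(ids=["doc-1"], documents=["example document text"])

# Copy the finished store to DBFS for durability; copy it back to local disk
# before opening it again with PersistentClient.
shutil.copytree(local_path, dbfs_path, dirs_exist_ok=True)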