chroma: [BUG] disk I/O error on Databricks

What happened?

I ran the following in a Databricks notebook:

import chromadb

vector_db_path = '/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db/'
client = chromadb.PersistentClient(path=vector_db_path)

And I get back: “OperationalError: disk I/O error”

Here is a report of a related issue.

Versions

Chroma v0.4.5; Python v3.10.6; Ubuntu 22.04.2 LTS; Databricks 13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)

Relevant log output

OperationalError                          Traceback (most recent call last)
File <command-3393434180226691>:1
----> 1 client = chromadb.PersistentClient(path=vector_db_path)
      3 vector_db = Chroma(
      4     client=client,
      5     collection_name="HVAC",
      6     embedding_function=embeddings,
      7 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:73, in PersistentClient(path, settings)
     70 settings.persist_directory = path
     71 settings.is_persistent = True
---> 73 return Client(settings)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/__init__.py:112, in Client(settings)
    109 telemetry_client = system.instance(Telemetry)
    110 api = system.instance(API)
--> 112 system.start()
    114 # Submit event for client start
    115 telemetry_client.capture(ClientStartEvent())

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/config.py:216, in System.start(self)
    214 super().start()
    215 for component in self.components():
--> 216     component.start()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:93, in SqliteDB.start(self)
     91     cur.execute("PRAGMA foreign_keys = ON")
     92     cur.execute("PRAGMA case_sensitive_like = ON")
---> 93 self.initialize_migrations()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:128, in MigratableDB.initialize_migrations(self)
    125     self.validate_migrations()
    127 if migrate == "apply":
--> 128     self.apply_migrations()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/migrations.py:147, in MigratableDB.apply_migrations(self)
    145 def apply_migrations(self) -> None:
    146     """Validate existing migrations, and apply all new ones."""
--> 147     self.setup_migrations()
    148     for dir in self.migration_dirs():
    149         db_migrations = self.db_migrations(dir)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:149, in SqliteDB.setup_migrations(self)
    147 @override
    148 def setup_migrations(self) -> None:
--> 149     with self.tx() as cur:
    150         cur.execute(
    151             """
    152              CREATE TABLE IF NOT EXISTS migrations (
   (...)
    160              """
    161         )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:47, in TxWrapper.__exit__(self, exc_type, exc_value, traceback)
     45 if len(self._tx_stack.stack) == 0:
     46     if exc_type is None:
---> 47         self._conn.commit()
     48     else:
     49         self._conn.rollback()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/impl/sqlite_pool.py:31, in Connection.commit(self)
     30 def commit(self) -> None:
---> 31     self._conn.commit()

OperationalError: disk I/O error

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (1 by maintainers)

Most upvoted comments

Since you are working with DBFS, instead of using “/dbfs/” in your path you need to specify it with “dbfs:/”.

Your path will then be:

‘dbfs:/FileStore/HuggingFace/data/demo_langchain/test_vector_db/’

I have been able to save my Chroma vector DB in DBFS using LangChain:

from langchain.vectorstores import Chroma

# texts is a list of split documents and embeddings is an embedding function built earlier
db = Chroma.from_documents(texts, embeddings, persist_directory=vector_db_path)

I haven’t tried using the “pure” Chroma API.

W.r.t. Chroma specifically, as I understand it you can run Chroma as a standalone server and connect to it that way. Of course you then have to run and manage that service, and think about where its storage is persisted, but you get a read/write endpoint.
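A minimal sketch of that approach, assuming a Chroma server is already running and reachable from the cluster; the host, port, and collection contents below are placeholders, not anything from this issue:

import chromadb

# Connect to a Chroma server running elsewhere (host/port are hypothetical).
client = chromadb.HttpClient(host="chroma.internal.example.com", port=8000)

# The HTTP client exposes the same read/write API as a local PersistentClient.
collection = client.get_or_create_collection("HVAC")
collection.add(ids=["doc-1"], documents=["Example HVAC maintenance note"])
print(collection.count())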

DBFS is just “cloud storage”. dbfs: is not understood by Chroma, of course; it’s a Databricks-specific alias that Spark and a few other tools support. DBFS is exposed as if it were local files via a FUSE mount at /dbfs.

So, using a dbfs: path with Chroma is actually just referring to a local directory you’ve created called dbfs:/. It “works”, but it is not what you intend; the files are not on DBFS.

You can use a /dbfs path with Chroma because it appears to be a local file path. However, Chroma wants random-write access to its files, and cloud storage in general does not support that: you can append, but not modify in place. So you get an I/O error.

You can’t work off of cloud storage like this. You can, however, write to a local directory, copy the result to /dbfs, and read it back later; that works fine.
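A rough sketch of that workaround, assuming the directory names below (the local scratch path and the sample document are illustrative, not from the issue):

import shutil
import chromadb

# Build the index on node-local disk, where SQLite gets the random writes it needs.
local_path = "/local_disk0/tmp/chroma_hvac"   # hypothetical local scratch directory
dbfs_path = "/dbfs/FileStore/HuggingFace/data/demo_langchain/test_vector_db"

client = chromadb.PersistentClient(path=local_path)
collection = client.get_or_create_collection("HVAC")
# With only documents passed, Chroma embeds them with its default embedding function.
collection.add(ids=["doc-1"], documents=["Example HVAC maintenance note"])

# Copy the finished directory up to DBFS for durable storage.
shutil.copytree(local_path, dbfs_path, dirs_exist_ok=True)

# To use the index later, copy it back down to a local directory first and open that
# copy with chromadb.PersistentClient, rather than pointing Chroma at /dbfs directly.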