conda-store: [BUG] - Submitting multiple builds can fill queue, causing frontend to stop working

Describe the bug

I’ll describe this issue as I’ve seen it through the Nebari jlab conda-store UI, but I believe the issue is likely the backend.

Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you’d be redirected back to the “non-edit” UI screen, but now clicking the button doesn’t appear to do anything.

Now attempt to go to the /conda-store/admin page - it will not load anything, just spin.

Now shut down your server and log back in. Going to the conda-store UI and attempting to log in has no effect; you just get the spinning logo in the browser tab. The rest of Nebari remains operational, but there is no immediate way to get conda-store out of this state.

This happened to me last night and when I came back this morning it had worked itself into an operational state again. I went through the above process this morning on one nebari deployment and the same thing happened. Then I went through the above process on a different nebari deployment and the same thing happened.

I will also note that I tried to Delete environments and saw similar behavior but I’m not sure if that was because it was already in a broken state or if that also is causing something similar to happen.

Expected behavior

I expect to be able to Edit and Save environments.

How to Reproduce the problem?

Reproducer explained above.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

About this issue

  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 16 (13 by maintainers)

Most upvoted comments

We talked about this today and @nkaretnikov will be writing an integration test to hopefully reproduce this issue. I think that this issue will only surface when using non-sqlite databases as a backend.

This issue has to do with SQLAlchemy, Sessions, and FastAPI (threads/async/await). I have spent a long time trying to figure out the issues around this… I don’t understand it.

Adding a tad more context

We can make a short-term fix to unblock release by changing QueuePool to NullPool

For this item to be completed and (potentially) merged, the change is needed, along with the relevant tests.

From today’s meeting:

  • We can make a short-term fix to unblock release by changing QueuePool to NullPool
  • Long term, we need to look into the core issue here - @costrouc will open a new issue about this with historical context
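For reference, a minimal sketch of what swapping QueuePool for NullPool looks like in plain SQLAlchemy. This is not the actual conda-store change: the `make_engine` helper and the in-memory SQLite URL are assumptions made only so the snippet runs on its own (the comments above suggest the problem shows up with non-sqlite backends).

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool


def make_engine(url: str):
    """Hypothetical helper: build an engine that does not pool connections.

    With NullPool, every checkout opens a fresh DBAPI connection and closes
    it on release, so there is no fixed-size pool that can be exhausted.
    The trade-off is the cost of reconnecting on every use.
    """
    return create_engine(url, poolclass=NullPool)


# Assumption: in-memory SQLite is used here only to keep the example
# self-contained; a real deployment would pass its own database URL.
engine = make_engine("sqlite://")

with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()

print(type(engine.pool).__name__, result)
```

As noted in the meeting summary, this only works around the symptom; it does not explain why connections were not being returned to the pool in the first place.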