dagster: `dagster-webserver` memory leak

Dagster version

1.5.13

What’s the issue?

dagster-webserver 1.5.13 seems to have some kind of memory leak. Since we updated to that version, we have observed a steady increase in memory usage over the last couple of weeks.

  • The increase in memory usage correlates with the version change; no other change was introduced at the same time.
  • We observe the same behaviour on 2 different GKE clusters.
  • Reverting to 1.5.12 resolves the issue.


What did you expect to happen?

No response

How to reproduce?

No response

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

About this issue

  • Original URL
  • State: open
  • Created 6 months ago
  • Reactions: 8
  • Comments: 22 (8 by maintainers)

Most upvoted comments

We think we might have solved it on our end (not 100% sure yet): we didn’t have a strict retention policy set in our dagster.yml, and once we added the one below, our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

How did that impact your memory usage? Technically you’ll still retain ticks for up to 365 days, so you shouldn’t see a change in behaviour within just a few days. Or did I miss something?

I’ve applied a similar setting on my deployment as well (way stricter than yours, for testing) and my memory is still going up, same as before.

We are also seeing the same issue on 1.6.0, running on ECS/Fargate.

@jvyoralek: No, we found out that it’s not working for us either. The initial indication that it was working was probably just a fluke.

Has anyone had success with the solution recommended by @stasharrofi ?

We have made changes, but it appears that the memory usage is still increasing.


I see anyio 4.3.0 in the build log:

#12 1.757 Collecting dagster==1.6.6
#12 1.810   Downloading dagster-1.6.6-py3-none-any.whl (1.4 MB)
#12 1.852      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 36.1 MB/s eta 0:00:00
#12 2.037 Collecting dagster-aws==0.22.6
#12 2.042   Downloading dagster_aws-0.22.6-py3-none-any.whl (109 kB)
#12 2.048      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 109.8/109.8 kB 32.6 MB/s eta 0:00:00
#12 2.214 Collecting dagster-postgres==0.22.6
#12 2.219   Downloading dagster_postgres-0.22.6-py3-none-any.whl (20 kB)
#12 2.259 Collecting anyio==4.3.0
#12 2.263   Downloading anyio-4.3.0-py3-none-any.whl (85 kB)
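If the build log isn’t available, the same thing can be verified from inside a running webserver container by reading installed package metadata. This is only a minimal sketch; the distribution names are taken from the log above, and the relevant anyio range (introduced in 4.1.0, fixed in 4.3.0) is the one mentioned in the comment below:

# Minimal sketch: print the versions of the packages relevant to the anyio theory,
# as seen inside the running container.
import importlib.metadata as metadata

for dist in ("dagster", "dagster-webserver", "starlette", "anyio"):
    try:
        print(dist, metadata.version(dist))
    except metadata.PackageNotFoundError:
        print(dist, "not installed")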

EDIT: We found out that the following is actually not working. The initial indication might have just been a fluke.

~We were having this issue and I believe that we have found the root cause to be a bug in anyio which leaked processes. The bug was introduced in 4.1.0 and fixed in 4.3.0 (last week): https://github.com/agronholm/anyio/issues/669~

~Dagster has a dependency on anyio through the following chain: dagit --> dagster-webserver --> starlette --> anyio, and I believe this issue started to appear for people whenever they rebuilt their Dagster image while that bug was present, because a newer but buggy version of anyio would have been pulled into their Docker image.~

~So, the solution could be to either explicitly require anyio >= 4.3.0 or to wait until people rebuild their docker images and automatically get the bug-fixed version.~

@noam-jacobson what version were you upgrading from?

I was on version 1.5.10

@aaaaahaaaaa did you find any reason why memory started growing? We have a similar issue, and switching between versions hasn’t helped so far (we tried going from 1.5.14 to 1.5.12).

The memory increase is quite noticeable, showing up even at daily granularity.

This issue seems to be isolated to the webserver component. Both the daemon and code servers are exhibiting stable memory usage. We are operating these as three separate containers within AWS ECS.

We have only one scheduled job active, and no sensors or auto-materializations so far. Assets are loaded from dbt.
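For anyone trying to narrow the growth down to a single component in the same way, a per-process memory log makes the trend easy to compare between the webserver, daemon, and code server containers. A minimal sketch, assuming psutil is available in the container and the process of interest is PID 1 (both are assumptions, not something from this thread):

# Minimal sketch: log the RSS of a target process once a minute so growth can be
# compared across containers. Assumes psutil is installed and the main process is PID 1.
import time

import psutil

TARGET_PID = 1  # assumption: the container's main process

proc = psutil.Process(TARGET_PID)
while True:
    rss_mib = proc.memory_info().rss / (1024 * 1024)
    print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), f"rss={rss_mib:.1f} MiB", flush=True)
    time.sleep(60)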
