zenml: Pandas materializer is slow when using large datasets with remote artifact stores
Open Source Contributors Welcomed!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
support@zenml.io
What happened?
There is a performance issue with the Pandas materializer in ZenML when large datasets are materialized on remote artifact stores such as S3. Loading such artifacts back into a pipeline step is extremely slow, and the slowdown becomes severe as the dataset grows.
Steps to Reproduce
The issue was encountered when loading a large dataset (~30 million rows) from a Snowflake database. Initially, the pipeline was tested with a LIMIT of 100 on the SQL query, which worked fine. However, removing the limit to fetch the entire dataset leads to out-of-memory errors and very slow loading. Here is one way to reproduce it:
import pandas as pd
from zenml import step, pipeline


def create_large_dataset(num_rows: int = 30_000_000) -> pd.DataFrame:
    # Create a DataFrame with 30 million rows and a few columns
    df = pd.DataFrame({
        'id': range(num_rows),
        'value': [f'value_{i % 1000}' for i in range(num_rows)],  # Cyclic repetition of values
        'number': range(num_rows),
        'date': pd.date_range(start='2020-01-01', periods=num_rows, freq='T'),  # Minute frequency
    })
    return df


@step
def my_step() -> pd.DataFrame:
    df = create_large_dataset()
    return df


# Run this with a remote artifact store (e.g. S3) in the active stack
@pipeline
def my_pipeline():
    my_step()


my_pipeline()
This code generates a DataFrame with 30 million rows, each row containing an ID, a cyclically repeating string value, a number, and a timestamp. You can adjust the create_large_dataset function to tailor the dataset to your specific needs.
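One way to make the dataset size adjustable, so the pipeline can be smoke-tested with a small run before the full 30-million-row run, is sketched below. This is illustrative only: the step and pipeline names are hypothetical, it reuses pd, step, pipeline, and create_large_dataset from the snippet above, and it assumes a recent ZenML version (0.40+) where step and pipeline functions accept plain Python parameters.

@step
def my_sized_step(num_rows: int) -> pd.DataFrame:
    # Delegate to the generator above so only the row count changes
    return create_large_dataset(num_rows=num_rows)


@pipeline
def my_sized_pipeline(num_rows: int = 30_000_000):
    my_sized_step(num_rows=num_rows)


# Quick sanity check with a small dataset before the full-size run
my_sized_pipeline(num_rows=100_000)
my_sized_pipeline()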
Expected Behavior
The expectation is for the Pandas materializer to efficiently handle large datasets, even when they are stored on remote artifact stores. The materializer should be able to load the entire dataset without significant performance degradation or memory issues.
Potential Solutions
- Benchmark the current implementation to identify the bottlenecks.
- Investigate optimizations in data loading, for example chunking or more efficient memory management (see the sketch after this list).
- Consider alternative approaches or tools that are better suited for handling large datasets.
- Explore improving the integration with remote artifact stores to optimize data transfer and loading.
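As a rough illustration of the chunking idea, the sketch below streams a Parquet file as record batches with pyarrow instead of converting the whole table at once. This is a minimal sketch, not ZenML's actual materializer code: the helper name, file path, and batch size are made up for illustration, and a real fix would live inside the Pandas materializer's load path and read from the artifact store URI.

import pandas as pd
import pyarrow.parquet as pq


def load_parquet_in_batches(path: str, batch_size: int = 1_000_000) -> pd.DataFrame:
    # Hypothetical helper: convert record batches one at a time so the full
    # Arrow table and the full DataFrame never sit in memory at the same moment.
    parquet_file = pq.ParquetFile(path)
    frames = [
        batch.to_pandas()
        for batch in parquet_file.iter_batches(batch_size=batch_size)
    ]
    return pd.concat(frames, ignore_index=True)


# Usage (illustrative local path; a materializer would use the artifact URI):
# df = load_parquet_in_batches("/tmp/df.parquet")

For the benchmarking step, timing this against a plain pd.read_parquet on the same artifact would help show whether the bottleneck is the pandas conversion or the network transfer itself.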
Additional Context
This issue is critical for users who work with large-scale data in ZenML, as it affects the efficiency and feasibility of data pipelines.
About this issue
- State: open
- Created 6 months ago
- Comments: 25 (13 by maintainers)
I think it sounds good. Allow me a few days to play around with ZenML and Modin, as I have not tested anything so far. Then I will open a PR, and if I'm stuck I will come back to you.
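(For context: Modin is designed as a drop-in replacement for the pandas API, so the swap being discussed can be sketched as below. This is illustrative only, assumes Modin plus a supported execution engine such as Ray or Dask is installed, and ZenML would still need a materializer that handles Modin DataFrames, which is what this thread is about.)

# Illustrative only: Modin mirrors the pandas API, so existing DataFrame code
# can often be tried by swapping the import.
import modin.pandas as mpd

# Same call signature as pandas.read_parquet, executed in parallel by the
# configured engine; the path is a placeholder.
df = mpd.read_parquet("/tmp/df.parquet")
print(df.shape)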
Ah got it. I opened a PR that will fix this issue going forward.