zenml: Pandas materializer is slow when using large datasets with remote artifact stores
Open Source Contributors Welcomed!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
support@zenml.io
What happened?
There is a performance issue with the Pandas materializer in ZenML when large datasets are materialized on remote artifact stores such as S3. Loading such artifacts back into a pipeline step is extremely slow, and the slowdown becomes severe as the dataset grows.
Steps to Reproduce
The issue was encountered when loading a large dataset (~30 million rows) from a Snowflake database. Initially, the pipeline was tested with a LIMIT of 100 on the SQL query, which worked fine. However, removing the limit to fetch the entire dataset leads to out-of-memory errors and very slow loading. Here is one way to reproduce it:
import pandas as pd
from zenml import step, pipeline


def create_large_dataset(num_rows: int = 30_000_000) -> pd.DataFrame:
    # Create a DataFrame with 30 million rows and a few columns
    df = pd.DataFrame({
        'id': range(num_rows),
        'value': [f'value_{i % 1000}' for i in range(num_rows)],  # Cyclic repetition of values
        'number': range(num_rows),
        'date': pd.date_range(start='2020-01-01', periods=num_rows, freq='T'),  # Minute frequency
    })
    return df


@step
def my_step() -> pd.DataFrame:
    df = create_large_dataset()
    return df


# Run this with a remote artifact store (e.g. S3) in the active stack
@pipeline
def my_pipeline():
    my_step()


my_pipeline()
This code generates a DataFrame with 30 million rows, each row containing an ID, a cyclically repeating string value, a number, and a timestamp. You can adjust the create_large_dataset function to tailor the dataset to your specific needs.
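One way to make the dataset size adjustable, so the pipeline can be smoke-tested with a small run before the full 30-million-row run, is sketched below. This is illustrative only: the step and pipeline names are hypothetical, it reuses pd, step, pipeline, and create_large_dataset from the snippet above, and it assumes a recent ZenML version (0.40+) where step and pipeline functions accept plain Python parameters.

@step
def my_sized_step(num_rows: int) -> pd.DataFrame:
    # Delegate to the generator above so only the row count changes
    return create_large_dataset(num_rows=num_rows)


@pipeline
def my_sized_pipeline(num_rows: int = 30_000_000):
    my_sized_step(num_rows=num_rows)


# Quick sanity check with a small dataset before the full-size run
my_sized_pipeline(num_rows=100_000)
my_sized_pipeline()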
Expected Behavior
The expectation is for the Pandas materializer to efficiently handle large datasets, even when they are stored on remote artifact stores. The materializer should be able to load the entire dataset without significant performance degradation or memory issues.
Potential Solutions
- Benchmark the current implementation to identify the bottlenecks.
- Investigate optimizations in data loading, for example chunking or more efficient memory management (see the sketch after this list).
- Consider alternative approaches or tools that are better suited for handling large datasets.
- Explore improving the integration with remote artifact stores to optimize data transfer and loading.
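As a rough illustration of the chunking idea, the sketch below streams a Parquet file as record batches with pyarrow instead of converting the whole table at once. This is a minimal sketch, not ZenML's actual materializer code: the helper name, file path, and batch size are made up for illustration, and a real fix would live inside the Pandas materializer's load path and read from the artifact store URI.

import pandas as pd
import pyarrow.parquet as pq


def load_parquet_in_batches(path: str, batch_size: int = 1_000_000) -> pd.DataFrame:
    # Hypothetical helper: convert record batches one at a time so the full
    # Arrow table and the full DataFrame never sit in memory at the same moment.
    parquet_file = pq.ParquetFile(path)
    frames = [
        batch.to_pandas()
        for batch in parquet_file.iter_batches(batch_size=batch_size)
    ]
    return pd.concat(frames, ignore_index=True)


# Usage (illustrative local path; a materializer would use the artifact URI):
# df = load_parquet_in_batches("/tmp/df.parquet")

For the benchmarking step, timing this against a plain pd.read_parquet on the same artifact would help show whether the bottleneck is the pandas conversion or the network transfer itself.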
Additional Context
This issue is critical for users who work with large-scale data in ZenML, as it affects the efficiency and feasibility of data pipelines.
About this issue
- State: open
- Created 6 months ago
- Comments: 25 (13 by maintainers)
I think it sounds good. Allow me a few days to play around with ZenML and Modin, as I have not tested anything so far. Then I will open a PR, and if I'm stuck I will come back to you.
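(For context: Modin is designed as a drop-in replacement for the pandas API, so the swap being discussed can be sketched as below. This is illustrative only, assumes Modin plus a supported execution engine such as Ray or Dask is installed, and ZenML would still need a materializer that handles Modin DataFrames, which is what this thread is about.)

# Illustrative only: Modin mirrors the pandas API, so existing DataFrame code
# can often be tried by swapping the import.
import modin.pandas as mpd

# Same call signature as pandas.read_parquet, executed in parallel by the
# configured engine; the path is a placeholder.
df = mpd.read_parquet("/tmp/df.parquet")
print(df.shape)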
Ah got it. I opened a PR that will fix this issue going forward.