datasets: Very slow data loading on large dataset
I made a simple Python script to check the `nlp` library's speed by loading 1.1 TB of textual data. It has been 8 hours and it is still on the loading step. It does work when the text dataset is small, about 1 GB, but it doesn't scale. It also uses a single thread during the data loading step.
import glob
import random

import nlp

# Collect and shuffle the training text files (paths elided here).
train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)
print(train_files)

# Load all files as a single 'text' dataset.
dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")
Is there something that I am missing?
About this issue
- State: closed
- Created 4 years ago
- Comments: 28 (13 by maintainers)
Right now, caching 18 GB of data is taking 1 hour 10 minutes. Is that the expected time? @lhoestq @agemagician At this rate (assuming larger files will cache at the same rate), caching the full mC4 (27 TB) would require about a month (~26 days).
@lhoestq Yes, I understand that the first time requires more time. `concatenate_datasets` seems to be a workaround, but I believe a multi-processing method should be integrated into `load_dataset` to make it easier and more efficient for users.
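For anyone hitting the same bottleneck, here is a minimal sketch of that workaround, assuming the per-file datasets can be built in worker processes and sent back to the parent (the pool size and the "xxx" paths below are placeholders):

```python
import glob
from multiprocessing import Pool

import nlp


def load_one(path):
    # Build (and cache) an Arrow dataset for a single text file.
    return nlp.load_dataset("text", data_files=path, split="train")


if __name__ == "__main__":
    train_files = glob.glob("xxx/*.txt", recursive=True)

    # Convert the files to Arrow in parallel, then merge the shards.
    with Pool(processes=8) as pool:  # worker count is an arbitrary choice here
        shards = pool.map(load_one, train_files)

    dataset = nlp.concatenate_datasets(shards)
```

If returning the datasets from the workers turns out to be problematic, each worker could simply trigger the caching and the parent process could reload everything from the cache afterwards before concatenating.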
@thomwolf Sure, here are the statistics:
- Number of lines: 4.2 billion
- Number of files: 6K
- Number of tokens: 800 billion

The lines are distributed equally across these 6K files, and line length varies between 100 and 40k tokens.
It does spawn `num_proc` processes. Note that when you download in parallel you're often bounded by your bandwidth at some point, so 50 processes is unlikely to get you a 50x download speed-up, but a bit less.

Hi! No, but this is in our plans (probably in a few weeks).
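For context, in later releases of the library (published as `datasets`), the worker count can be passed straight to `load_dataset`; a minimal sketch, assuming a recent `datasets` version that accepts the `num_proc` argument:

```python
from datasets import load_dataset

# Download and prepare the text files with several worker processes.
# The speed-up is usually sub-linear because bandwidth and disk I/O
# remain the limiting factors.
dataset = load_dataset(
    "text",
    data_files="xxx/*.txt",  # same elided path as in the issue
    num_proc=8,
)
```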
Ok so now the text files won’t be hashed.
I also updated #548 to include this change. Let us know if it helps @agemagician 😃
I’m working on it today 😃
I believe this is a really important feature; otherwise, we will still have the problem of slow loading even if the data cache generation is fast.