datasets: Very slow data loading on large dataset

I made a simple Python script to check the nlp library's speed when loading 1.1 TB of textual data. It has been 8 hours and it is still on the loading step. It works when the text dataset is small (about 1 GB), but it doesn't scale. It also uses only a single thread during the data loading step.

import glob
import random

import nlp

train_files = glob.glob("xxx/*.txt", recursive=True)  # note: recursive=True only takes effect with a "**" pattern
random.shuffle(train_files)

print(train_files)

dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")

Is there something that I am missing?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (13 by maintainers)

Most upvoted comments

Right now, caching 18 GB of data takes 1 hour 10 minutes. Is that the expected time? @lhoestq @agemagician At this rate (assuming larger files cache at the same rate), caching the full mC4 corpus (27 TB) would take about a month (~26 days).

@lhoestq Yes, I understand that the first run takes more time. concatenate_datasets seems to be a workaround, but I believe multiprocessing should be integrated into load_dataset itself, to make it easier and more efficient for users.
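The workaround above can be sketched as follows. This is an illustration, not the library's API: load_one_shard is a hypothetical stand-in for calling nlp.load_dataset('text', data_files=shard, ...) on one slice of the file list (here it just reads lines, so the sketch runs without the nlp library or a large corpus), and the final concatenation step plays the role of nlp.concatenate_datasets. A ThreadPool is used for portability; a ProcessPoolExecutor works the same way for CPU-bound parsing.

```python
from multiprocessing.pool import ThreadPool


def chunk(files, n):
    """Split the file list into n roughly equal shards."""
    k, m = divmod(len(files), n)
    return [files[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


def load_one_shard(shard):
    """Stand-in for nlp.load_dataset on one shard: read all lines of each file."""
    lines = []
    for path in shard:
        with open(path, encoding="utf-8") as f:
            lines.extend(f.read().splitlines())
    return lines


def load_parallel(files, num_proc=4):
    """Load shards concurrently, then concatenate the per-shard results --
    the same shape as calling nlp.concatenate_datasets on per-shard datasets."""
    with ThreadPool(num_proc) as pool:
        shards = pool.map(load_one_shard, chunk(files, num_proc))
    combined = []
    for shard in shards:
        combined.extend(shard)
    return combined
```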

@thomwolf Sure, here are the statistics:

  • Number of lines: 4.2 billion
  • Number of files: 6K
  • Number of tokens: 800 billion

The lines are distributed equally across these 6K files, and line length varies between 100 and 40K tokens.

It does spawn num_proc processes. Note that parallel downloads are often bounded by your bandwidth at some point, so 50 processes is unlikely to give you a 50x download speedup, but somewhat less.
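The bandwidth bound can be made concrete with a toy model (the link and per-stream rates below are made-up numbers, not measurements): aggregate throughput grows with the number of processes only until the shared link saturates, after which adding processes buys nothing.

```python
def download_speedup(num_proc, link_gbps, per_stream_gbps):
    """Effective speedup of num_proc parallel downloads over one stream.

    Each stream runs at per_stream_gbps until the shared link of
    link_gbps saturates; beyond that point extra streams add nothing.
    """
    aggregate = min(num_proc * per_stream_gbps, link_gbps)
    return aggregate / per_stream_gbps


# e.g. 50 processes at 0.1 Gbps each on a 1 Gbps link: only ~10x, not 50x
```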

Hi ! No but this is in our plans (probably a few weeks)

Ok so now the text files won’t be hashed.
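For intuition on why skipping the hash helps: fingerprinting files by content is O(total bytes) and must re-read the whole corpus, while fingerprinting by metadata is O(1) per file. The sketch below illustrates the two strategies; it is not the library's actual hashing code.

```python
import hashlib
import os


def content_fingerprint(path):
    """Hash the full file contents -- O(bytes), slow for TB-scale corpora."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def metadata_fingerprint(path):
    """Hash only (path, size, mtime) -- O(1) per file, a cheap fingerprint
    that avoids re-reading the data at the cost of weaker change detection."""
    st = os.stat(path)
    key = f"{path}:{st.st_size}:{st.st_mtime_ns}".encode()
    return hashlib.sha256(key).hexdigest()
```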

I also updated #548 to include this change. Let us know if it helps @agemagician 😃

I’m working on it today 😃

I believe this is a really important feature; otherwise we will still have the slow-loading problem even when cache generation is fast.