datasets: Very slow data loading on large dataset
I made a simple Python script to check the `nlp` library's speed by loading 1.1 TB of textual data. It has been 8 hours and it is still on the loading step. It does work when the text dataset is small, about 1 GB, but it doesn't scale. It also uses a single thread during the data loading step.
import glob
import random

import nlp

# Collect and shuffle the training text files (paths elided here).
train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)
print(train_files)

# Load all files as a single 'text' dataset.
dataset = nlp.load_dataset('text',
                           data_files=train_files,
                           name="customDataset",
                           version="1.0.0",
                           cache_dir="xxx/nlp")
Is there something that I am missing?
About this issue
- State: closed
- Created 4 years ago
- Comments: 28 (13 by maintainers)
Right now, caching 18 GB of data is taking 1 hour 10 minutes. Is that the expected time? @lhoestq @agemagician At this rate (assuming larger files will cache at the same rate), caching the full mC4 (27 TB) would require about a month (~26 days).
@lhoestq Yes, I understand that the first time requires more time. `concatenate_datasets` seems to be a workaround, but I believe a multi-processing method should be integrated into `load_dataset` to make it easier and more efficient for users.
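For anyone hitting the same bottleneck, here is a minimal sketch of that workaround, assuming the per-file datasets can be built in worker processes and sent back to the parent (the pool size and the "xxx" paths below are placeholders):

```python
import glob
from multiprocessing import Pool

import nlp


def load_one(path):
    # Build (and cache) an Arrow dataset for a single text file.
    return nlp.load_dataset("text", data_files=path, split="train")


if __name__ == "__main__":
    train_files = glob.glob("xxx/*.txt", recursive=True)

    # Convert the files to Arrow in parallel, then merge the shards.
    with Pool(processes=8) as pool:  # worker count is an arbitrary choice here
        shards = pool.map(load_one, train_files)

    dataset = nlp.concatenate_datasets(shards)
```

If returning the datasets from the workers turns out to be problematic, each worker could simply trigger the caching and the parent process could reload everything from the cache afterwards before concatenating.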
@thomwolf Sure, here are the statistics:
- Number of lines: 4.2 billion
- Number of files: 6K
- Number of tokens: 800 billion

The lines are distributed equally across these 6K files, and line length varies between 100 and 40k tokens.
It does spawn `num_proc` processes. Note that when you download in parallel you're often bounded by your bandwidth at some point, so 50 processes is unlikely to get you a 50x download speed-up, but a bit less.

Hi! No, but this is in our plans (probably in a few weeks).
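For context, in later releases of the library (published as `datasets`), the worker count can be passed straight to `load_dataset`; a minimal sketch, assuming a recent `datasets` version that accepts the `num_proc` argument:

```python
from datasets import load_dataset

# Download and prepare the text files with several worker processes.
# The speed-up is usually sub-linear because bandwidth and disk I/O
# remain the limiting factors.
dataset = load_dataset(
    "text",
    data_files="xxx/*.txt",  # same elided path as in the issue
    num_proc=8,
)
```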
Ok so now the text files won’t be hashed.
I also updated #548 to include this change. Let us know if it helps @agemagician 😃
I’m working on it today 😃
I believe this is a really important feature; otherwise, we will still have the problem of slow loading even if the data cache generation is fast.