datasets: Slow dataloading with big datasets issue persists
Hi,
I reported slow data fetching when data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiling results:
- Running with 60GB

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|---|---|---|---|---|
| Total | - | - | 517.96 | 100 |
| model_backward | 0.26144 | 100 | 26.144 | 5.0475 |
| model_forward | 0.11123 | 100 | 11.123 | 2.1474 |
| get_train_batch | 0.097121 | 100 | 9.7121 | 1.8751 |
- Running with 600GB, datasets==1.6.0

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|---|---|---|---|---|
| Total | - | - | 4563.2 | 100 |
| get_train_batch | 5.1279 | 100 | 512.79 | 11.237 |
| model_backward | 4.8394 | 100 | 483.94 | 10.605 |
| model_forward | 0.12162 | 100 | 12.162 | 0.26652 |
I see that `get_train_batch` lags when data is large. Could this be caused by a different issue?
I would be happy to provide necessary information to investigate.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 9
- Comments: 70 (29 by maintainers)
Commits related to this issue
- add num_process to load_from_disk #2252 — committed to kkoutini/datasets by kkoutini 7 months ago
- add threadmap to load_from_disk #2252 — committed to kkoutini/datasets by kkoutini 7 months ago
- Add threadmap to arrow_reader.read_files #2252 — committed to kkoutini/datasets by kkoutini 7 months ago
- Add concurrent loading of shards to datasets.load_from_disk (#6464) * add threadmap to load_from_disk #2252 * Add threadmap to arrow_reader.read_files #2252 * remove old way of loading files ... — committed to huggingface/datasets by kkoutini 5 months ago
If this solution proves to help, we can add Arrow file sharding for all big datasets, directly integrated in `load_dataset`.

Yes, your intuition is right 😃
Unfortunately no. Thanks for running the benchmark though, it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.
I did some tests on Google Colab and see the same issue. The first time the dataset Arrow file is memory-mapped, it always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous since the Arrow file has already been memory-mapped.
I also tried using the Arrow IPC file format (see #1933) instead of the current streaming format that we use but it didn’t help.
Memory mapping is handled by the OS and depends on the disk you're using, so I'm not sure we can do much about it. I'll continue to investigate anyway, because I still don't know why in some cases it would go through the entire file (high `Blocks read` as in your tests) and in other cases it would do almost no reading.

I'm very happy that this approach accelerated my data loading time from 15 minutes to 45 seconds. My dataset is on the TB scale.

My file system is virtio-fs. I can't know the real underlying file system because my code runs in a virtual machine and I can't access the host machine. I guess it is a distributed file system similar to Lustre. I don't know why it is so slow or why multithreading accelerates it, but it really does accelerate the load time, which is vital for me. Thanks for your code.
I managed to speed up the loading time (on the Lustre file system) by mmapping the Arrow shards in parallel (`python preload_mmap.py`, see the script below) and relying on the OS to cache the mmap. Here are some results: running `python preload_mmap.py` once to cache the files takes around 90 seconds the first time with 16 processes.

It seems that preloading the files in processes (without returning the table) speeds up subsequent `load_from_disk` calls. However, the communication time to return the tables for concatenation is high (I am not sure how they are pickled). Threads are slower to mmap the table but faster to communicate. If this works on other file systems, it may be worth having the option to load the shards in parallel here.
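The `preload_mmap.py` script itself isn't reproduced in this excerpt, but a minimal sketch of the process-based warm-up idea described above could look like this (the dataset directory, shard glob, and worker count are assumptions):

```python
import glob
import os
from concurrent.futures import ProcessPoolExecutor

import pyarrow as pa
import pyarrow.ipc  # explicit import of the IPC submodule


def warm_shard(path):
    """Memory-map one Arrow shard and read it so the OS caches its pages.

    The table is discarded on purpose: nothing large has to be pickled back
    to the parent process, which was reported above to be the slow part.
    """
    with pa.memory_map(path, "r") as source:
        pa.ipc.open_stream(source).read_all()
    return path


if __name__ == "__main__":
    dataset_dir = "/path/to/saved_dataset"  # hypothetical save_to_disk directory
    shards = sorted(glob.glob(os.path.join(dataset_dir, "*.arrow")))
    with ProcessPoolExecutor(max_workers=16) as pool:
        for done in pool.map(warm_shard, shards):
            print(f"warmed {done}")
```

A subsequent `load_from_disk` on the same machine should then hit the warm page cache instead of reading everything from disk.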
I'm facing the same issue when loading a 900GB dataset (stored via `save_to_disk`): `load_from_disk(path_to_dir)` takes 1.5 hours and htop consistently shows high IO rates > 120 MB/s.

Nice to see this method validated on multiple setups!
Would be cool to integrate multithreading when memory mapping the Arrow files. I think this could be added here (for `load_dataset`):
https://github.com/huggingface/datasets/blob/796a47e388a5c5711a95bd649648608c18219ac5/src/datasets/arrow_reader.py#L199-L201
and here (for load_from_disk):
https://github.com/huggingface/datasets/blob/796a47e388a5c5711a95bd649648608c18219ac5/src/datasets/arrow_dataset.py#L1701-L1704
I can take some time next week to do it, but feel free to open a PR if you want to give it a try
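Not the actual patch, just a rough sketch of the thread-based variant being discussed (the shard paths and thread count are made up); as noted above, threads are slower to mmap but avoid pickling the tables back to a parent process:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.ipc


def read_shard(path):
    """Memory-map a single Arrow shard and return it as a pyarrow Table."""
    return pa.ipc.open_stream(pa.memory_map(path, "r")).read_all()


def read_shards_threaded(paths, num_threads=8):
    """Read all shards concurrently and concatenate them into one Table.

    Threads share memory, so the resulting tables don't have to be pickled
    the way they would be with a process pool.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        tables = list(pool.map(read_shard, paths))
    return pa.concat_tables(tables)
```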
Reproducing these issues is not easy on our side, given they depend on the setup.
For `load_dataset` it would be nice to be able to control the size of the batches written on disk. Feel free to open an issue if it's something you'd like to see, and we'll discuss there how to do it.

That's helpful information, thanks! It seems like Lustre doesn't read at full speed with the memory mapping in `datasets`. I would try increasing the stripe size in case the memory mapping does too much unnecessary readahead with the default value.
An alternative is to load the dataset as iterable but this is not implemented yet, see #5481
If you want to skip that step, next time I'd recommend you save the dataset somewhere after tokenization (e.g. using `.save_to_disk()`) and reload it from there instead of relying on the cache.

Though you could look for the cached Arrow files in your cache and reload the data from there if you're adventurous. You can use `Dataset.from_file` to reload a file, and then `concatenate_datasets` to concatenate all the chunks.
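A minimal sketch of that suggestion, assuming the cached Arrow chunk files have already been located (the paths below are placeholders):

```python
from datasets import Dataset, concatenate_datasets

# Hypothetical cache file paths; the real names under
# ~/.cache/huggingface/datasets depend on the dataset and its fingerprint.
chunk_files = [
    "/path/to/cache/chunk-0.arrow",
    "/path/to/cache/chunk-1.arrow",
]

# Reload each cached Arrow file as a Dataset, then stitch them back together.
chunks = [Dataset.from_file(path) for path in chunk_files]
dataset = concatenate_datasets(chunks)
```

Cool !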
When you process an unshuffled dataset with `map`, you iterate over contiguous chunks of data, which is very fast. You get the best speed with an iterable dataset as well, when it's based on shards of contiguous data. This is fast because internally Arrow simply iterates over the record batches.

On the other hand, if you use a map-style dataset in PyTorch, then PyTorch samples uniformly from the files on your disk. This is slower for your disk, and also requires an extra step to get the location of the examples from an index.
Please also make sure to use the latest version of pyarrow to benefit from the best speed, or at least pyarrow 8.0.0 😃
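As a rough illustration of the contrast described above, assuming a recent `datasets` version that provides `Dataset.to_iterable_dataset` (the path, shard count, and batch sizes are placeholders):

```python
from datasets import load_from_disk
from torch.utils.data import DataLoader

ds = load_from_disk("/path/to/saved_dataset")  # hypothetical path

# Map-style: the DataLoader samples indices uniformly at random, which turns
# into random reads on disk plus an index lookup per example.
map_style_loader = DataLoader(ds.with_format("torch"), batch_size=32, shuffle=True)

# Iterable-style: each DataLoader worker streams contiguous record batches
# from its own subset of shards, which is much friendlier to the disk.
iterable_ds = ds.to_iterable_dataset(num_shards=64).with_format("torch")
iterable_loader = DataLoader(iterable_ds, batch_size=32, num_workers=8)
```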
By any chance, do we have a better understanding of what’s happening?
I am encountering a similar problem: I have an Arrow file produced by HF datasets (`shard` + `save_to_disk`) and I am trying to load this dataset/arrow file with `datasets.load_from_disk(the_dataset_folder)`. I noticed that the first time I load it, it is significantly slower than the subsequent times. Two days later, I will retry loading it, and it will be slow again...

After digging in a little bit, the gap happens in the `_memory_mapped_arrow_table_from_file` function, and in particular in the call to `RecordBatchStreamReader.read_all`:
https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51

`read_all` is slow the first time (probably because of some operations that only happen once and are cached for a few hours?), but not the subsequent times.

My setup:
- `datasets` version: 2.3.3.dev0

I realize this might be an Apache Arrow question, so I will ask them as well, but I wanted to leave a message here too.
I wasn't able to reproduce this on a toy dataset of around 300GB. Could you run this on your side and tell me how much time it takes? Please run this when your machine is idle so that other processes don't interfere.

I got these results on my MacBook Pro on datasets 1.6.2.
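The benchmark script itself isn't included in this excerpt; a minimal sketch of the kind of timing it involves, using the same memory-mapped `read_all` path that `datasets` uses internally (the file path is a placeholder), might look like:

```python
import time

import pyarrow as pa
import pyarrow.ipc

path = "/path/to/dataset.arrow"  # hypothetical Arrow file written by `datasets`

# Time the first memory-mapped read of the file.
start = time.perf_counter()
table = pa.ipc.open_stream(pa.memory_map(path, "r")).read_all()
print(f"read_all on a fresh memory map took {time.perf_counter() - start:.2f}s "
      f"for {table.num_rows} rows")

# Running it again (in the same process or a new one) shows whether the OS
# page cache makes the reload near-instantaneous, as described above.
```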