dvc: 'Errno 24 - Too many open files' on dvc push

Version information

  • DVC version: 0.58.1
  • Platform: MacOS 10.14.6
  • Method of installation: pip within a conda environment

Description

When pushing to S3 a directory of ~100 files that have been added to DVC, I observe an Errno 24 error from the dvc process.

It looks like dvc is trying to open more files than the OS allows. Checking the file handles for the dvc process, I get:

$ lsof -p $DVC_PID | wc -l
412

Looking at the OS limits, a process is limited to having 256 open files.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4256
virtual memory          (kbytes, -v) unlimited

A workaround is to increase the maximum open files per process to a larger number (say 4096) by running something like ulimit -n 4096, but I wonder whether the ideal solution is for DVC to work within the OS-configured limits by default?

Edit: Updated wording of workaround

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 35 (29 by maintainers)

Most upvoted comments

"It's not a bug, it's a feature." (Everyone, at some point in their life)

So the thing is that we easily reach the open file descriptor limit on macOS. It goes unnoticed on Linux because its default open descriptor limit is four times higher than macOS's.

The reason we are reaching the limit is the default transfer configuration for s3. When we do not provide a Config here, boto3 defaults to:

multipart_threshold=8 * MB,
max_concurrency=10,
multipart_chunksize=8 * MB,
num_download_attempts=5,
max_io_queue=100,
io_chunksize=256 * KB,
use_threads=True

(defined here) s3 does not cache the transfer object in any way. Every time upload_file is called, a new transfer object is created with a default limit of 10 upload threads. So, effectively, when we run dvc push --jobs=X we indirectly allow s3 to create 10X threads. It's easy to exceed macOS's 256 file descriptor limit that way.

I think we should introduce a default upload config in the remote/s3 class: config = TransferConfig(use_threads=False).
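For illustration, a minimal sketch of that idea, assuming the config is built once and passed to every upload call (the bucket and key names here are placeholders, not dvc's actual code):

import boto3
from boto3.s3.transfer import TransferConfig

# Disable s3transfer's internal thread pool so each upload_file call
# uses a single connection instead of up to max_concurrency (10) threads.
UPLOAD_CONFIG = TransferConfig(use_threads=False)

s3 = boto3.client("s3")

def upload(path, bucket, key):
    # With use_threads=False, dvc push --jobs=X opens roughly X
    # connections instead of 10*X.
    s3.upload_file(path, bucket, key, Config=UPLOAD_CONFIG)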

I think we should decide whether we want dvc to be a little aggressive in order to be fast. If yes, then we should implement automatic degrading with a WARNING and a HINT and set a high default. If not, we might catch the error, stop, and show a hint to either reduce jobs or increase ulimit.

I am in favor of being aggressive here. Reasons:

  • faster by default
  • utilizing as much of the available resources as possible

We could catch the error and make the user retry with a smaller number of jobs.

We can adjust that (Remote.jobs or s3 max_concurrency, or both) automatically without the user needing to do anything. We can even show a WARNING that the transfer is being automatically throttled because of ulimit and suggest a HINT.

BTW, there are similar scenarios with other resources like SSH/SFTP sessions. Right now we set conservative defaults, which makes dvc slower for many users without them even knowing it, which in turn makes the perception of dvc worse.

Ok so let’s sum up what is going on:

Problem:

For different remotes, the jobs value does not necessarily correspond to the number of open file descriptors at maximum capacity. We need to somehow connect those values to prevent the user from starting an upload they will be unable to finish.

Possible solutions

  1. Just limit the default jobs number for s3 (a quick fix, but it won't solve the underlying problem).
  2. Estimate the maximum number of file descriptors needed and, if necessary, adjust the number of jobs so that we do not exceed the available file descriptor limit.

I think the proper solution would be the second one, though there are a few things to consider:

  • cache file descriptors are not the only thing open during upload (for example, in s3 the number of cache file descriptors corresponds to the number of open sockets). Also, I don't think we can just use the entire available limit of open file descriptors; we would probably need to leave some descriptors for the process itself.
  • also, in the case described above, the user might get their jobs restricted by us, which might be unpleasant for more advanced users when they do not understand why we throttle the value they chose. A proper description in the documentation might solve this, though.

Also: 3. We could catch the error and make the user retry with a smaller number of jobs, though I believe this would be frustrating if one has to restart the push 5 times to adjust the number of jobs. Handling of the error might also differ between OSes (we know that on macOS and Linux it is OSError with code 24, but we would need to find out how to handle it on Windows).
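As a rough sketch of option 3 (not how dvc actually handles it), assuming the per-file upload is wrapped and EMFILE is checked via Python's errno module; whether Windows reports the same errno would still need to be verified:

import errno

def upload_with_hint(upload_fn, path):
    try:
        upload_fn(path)
    except OSError as exc:
        # errno.EMFILE is "Too many open files" (24 on macOS/Linux).
        if exc.errno == errno.EMFILE:
            raise RuntimeError(
                "Too many open files: retry with a smaller --jobs value "
                "or raise the limit with `ulimit -n`."
            ) from exc
        raise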

Notes

  • we could use resource.getrlimit(resource.RLIMIT_NOFILE)[0] to access the file descriptor limit (see the sketch after this list)
  • For most remotes this is not a problem, though it would be good to check whether the other external packages (azure, gs, pyarrow, oss, paramiko) provide built-in optimization and parallelization for uploads.
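A sketch of what the estimate from point 2 could look like; the per-job descriptor cost and the reserve below are hypothetical numbers that would have to be measured per remote type:

import resource  # POSIX-only

FDS_PER_JOB = 10   # hypothetical: e.g. s3 with default max_concurrency=10
FDS_RESERVED = 32  # hypothetical: descriptors kept for dvc itself (cache, logs, botocore data)

def max_safe_jobs(requested_jobs):
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    allowed = max((soft_limit - FDS_RESERVED) // FDS_PER_JOB, 1)
    if allowed < requested_jobs:
        print(f"WARNING: throttling jobs from {requested_jobs} to {allowed} "
              f"because of the open file limit ({soft_limit})")
    return min(requested_jobs, allowed)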

Also, one more note: if the file descriptor limit is too small, we can hit another Too many open files error when boto is loading the JSON files defined inside botocore/data.

Great point! But anything less than what macOS has is an unlikely scenario and we shouldn't worry about it, as lots of other stuff will break anyway.

Even when setting --jobs=1, the process can have up to 14 open file descriptors. Need to investigate whether that is our fault or boto's.

@pared setting ulimit to 25 might be way too low in general, so we need to be careful about that. What I did for #2600 was monitor the number of open fds from /proc/$PID/, and it was pretty apparent that some stuff was not being released in time. Might want to look into caching the boto session as a quick experiment.
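For reference, this is roughly how the open descriptor count can be watched on Linux (macOS has no /proc, so lsof -p $PID as shown above is the equivalent there):

import os

def count_open_fds(pid):
    # Linux-only: every entry in /proc/<pid>/fd is one open descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))

# e.g. poll count_open_fds(<dvc pid>) in a loop while dvc push is running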

I can confirm the above workarounds. To summarise (for anyone else experiencing this), I was able to work around the issue with either of these commands:

  • dvc push --jobs 8 - reducing the number of jobs to 2*CPUs (8 in my case)
  • ulimit -n 4096 - increasing the OS limit on open files per process (run dvc push after setting this)

Some observations:

  • When trying to reproduce this, it seems significant that the files are of a decent size (in this case 100MB each)
  • Reproducing by creating files from /dev/zero instead of /dev/urandom doesn't seem to work (I suspect some compression or other optimisation kicks in for such low-entropy files); a sketch for generating suitable test files follows after this list
  • Some of the logs indicate errors while uploading a part of a file (search for partNumber), suggesting multipart, multi-threaded upload.
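For anyone else trying to reproduce this, a small sketch (the directory name and file sizes are just placeholders) that generates incompressible files similar to the dataset described above:

import os

def make_random_files(directory, count=100, size_mb=100):
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        with open(os.path.join(directory, f"file_{i:03d}.bin"), "wb") as f:
            for _ in range(size_mb):
                # os.urandom gives high-entropy data (like /dev/urandom),
                # so the files cannot be compressed away.
                f.write(os.urandom(1024 * 1024))

# make_random_files("data"); then: dvc add data && dvc push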

I checked again, and the number of files being pushed was actually 79 (definitely not thousands 😄 )

I don't have the full logs from the dvc push to hand right now, but I was getting the error below repeatedly for various files. This seemed to start happening as soon as the first files finished uploading.

ERROR: failed to upload '.dvc/cache/2f/25387c98c599ab0de148f437b780ad' to 's3://<s3_path>/2f/25387c98c599ab0de148f437b780ad' - [Errno 24] Too many open files: '/<local_path>/.dvc/cache/2f/25387c98c599ab0de148f437b780ad'
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

I don’t have access to the environment until Monday, but I might be able to reproduce over the weekend.