dvc: 'Errno 24 - Too many open files' on dvc push
Version information
- DVC version: 0.58.1
- Platform: MacOS 10.14.6
- Method of installation: pip within a conda environment
Description
When pushing to S3 a directory of ~100 files that have been added to DVC, I observe an Errno 24 error from the dvc process. It looks like dvc is trying to open more files than the OS allows. Checking the file handles for the dvc process, I get:
```
$ lsof -p $DVC_PID | wc -l
412
```
Looking at the OS limits, a process is limited to having 256 open files.
```
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4256
virtual memory          (kbytes, -v) unlimited
```
A workaround for this is to increase the max files per process to a larger number (say 4096) by running something like `ulimit -n 4096`, but I wonder if the ideal solution is for DVC to work within the OS-configured limits by default?
Edit: Updated wording of workaround
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 35 (29 by maintainers)
> "It's not a bug, it's a feature." (Everyone, at some point in their life)
So the thing is that we easily reach the open-descriptors limit on Mac. It is unnoticeable on Linux because its default open-descriptors limit is four times bigger than the Mac's.
The reason why we are reaching the limit is the default configuration of s3. When we do not provide a `Config` here, boto3 will fall back to its defaults (defined here). s3 is not caching the transfer object in any way: every time `upload_file` is called, a new transfer object is created, with, by default, a 10-thread limit per upload. So, effectively, when we run `dvc push --jobs=X`, we indirectly allow s3 to create 10X threads. It's easy to hit the Mac's 256-file-descriptor limit with that.

I think we should introduce a default upload config in the `remote/s3` class: `config = TransferConfig(use_threads=False)`.
I think we should decide whether we want DVC to be a little bit aggressive in order to be fast. If yes, then we should implement automatic degrading with a WARNING and a HINT, and set the default high. If no, we might catch the error, stop, and show a hint to either reduce jobs or increase ulimit.
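Catching the error could look roughly like this (a sketch; `push_fn` and the hint wording are hypothetical, and errno 24 is `EMFILE` on macOS/Linux):

```python
import errno

def push_with_hint(push_fn):
    # push_fn stands in for the actual upload routine (hypothetical).
    try:
        return push_fn()
    except OSError as exc:
        if exc.errno == errno.EMFILE:  # errno 24: 'Too many open files'
            raise RuntimeError(
                "HINT: reduce --jobs or raise the limit with `ulimit -n`"
            ) from exc
        raise
```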
I am in favor of being aggressive here. Reasons:
We can adjust that (`Remote.jobs` or s3 `max_concurrency`, or both) automatically without the user needing to do anything. We can even show a WARNING that the transfer is automatically throttled because of `ulimit`, and suggest a HINT.

BTW, there are similar scenarios with other resources, like SSH/SFTP sessions. Right now we set conservative defaults, which makes dvc slower for many users without them even knowing it, which in turn makes the perception of dvc worse.
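Such automatic throttling could be sketched like this; the constants are assumptions (boto3 defaults to 10 threads per upload, and some headroom is reserved for dvc's own descriptors):

```python
import resource

# Assumed values: boto3's default TransferConfig uses 10 threads per
# upload, and we keep spare descriptors for dvc itself.
FDS_PER_JOB = 10
HEADROOM = 32

def max_safe_jobs(requested_jobs: int) -> int:
    """Clamp the requested job count to what the fd limit can sustain."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    allowed = max(1, (soft_limit - HEADROOM) // FDS_PER_JOB)
    if requested_jobs > allowed:
        print(f"WARNING: throttling --jobs from {requested_jobs} to "
              f"{allowed} because the open-files limit is {soft_limit}. "
              f"HINT: raise it with `ulimit -n`.")
        return allowed
    return requested_jobs
```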
Ok, so let's sum up what is going on.

Problem: for different remotes, the `jobs` value does not necessarily correspond to the number of open file descriptors at maximum capacity. We need to somehow connect those values to prevent the user from starting an upload that they will be unable to finish.

Possible solution: throttle `jobs` so that we do not exceed the estimated available number of file descriptors. I think that would be the proper solution, though there are a few things to consider:
- `jobs` would be restricted by us, which might be unpleasant to more advanced users when they do not understand why we throttle the value they chose. Though a proper description in the documentation might solve this.
- Alternatively, we could catch the error and make the user retry with a smaller number of `jobs`, though I believe that would be frustrating when one has to restart a push 5 times to adjust the number of `jobs`. Also, handling of the error might differ between OSes (we know that on macOS and Linux it is an `OSError` with code 24, but we would need to find out how to handle it on Windows).

Notes:
- We can use `resource.getrlimit(resource.RLIMIT_NOFILE)[0]` to access the file descriptor limit.

Great point! But anything less than what the Mac has is an unlikely scenario and we shouldn't worry about it, as lots of other stuff will break anyway.
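For reference, the limits can be read, and the soft limit raised toward the hard ceiling (the in-process analogue of the `ulimit -n 4096` workaround), with the Unix-only `resource` module; a sketch:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Raise the soft limit toward 4096, capped at the hard ceiling
# (unprivileged processes cannot exceed the hard limit):
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
if target > soft:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```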
Even when setting `--jobs=1`, the process can have up to 14 file descriptors open. We need to investigate whether that is our fault or boto's.

@pared setting ulimit to 25 might be way too low just in general, so we need to be careful about that. What I did for #2600 is monitor the number of opened fds from `/proc/$PID/`, and it was pretty apparent that some stuff was not released in time. Might want to look into caching the boto session as a quick experiment.
I can confirm the above workarounds. To summarise (for anyone else experiencing this), I was able to work around the issue with either of these commands:
- `dvc push --jobs 8` - reducing the number of jobs to 2*CPUs (8 in my case)
- `ulimit -n 4096` - increasing the OS limit on open files per process (run `dvc push` after setting this)

Some observations:
- Using `/dev/zero` instead of `/dev/urandom` doesn't seem to work (I think there is some background compression or trick for this low-entropy file).
- The upload requests include part numbers (`partNumber`), suggesting multi-threaded upload.

I checked again, and the number of files being pushed was actually 79 (definitely not thousands 😄).
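To reproduce with data that does not compress away (per the `/dev/zero` observation above), random files can be generated like this; the file count, sizes, and directory name are arbitrary:

```shell
# Create 100 random 64 KiB files; /dev/urandom avoids the compression
# shortcut observed with /dev/zero.
mkdir -p data
for i in $(seq 1 100); do
  head -c 65536 /dev/urandom > "data/file_$i.bin"
done
```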
I don't have the full logs from the `dvc push` to hand right now, but I was getting the below repeatedly for various files. This seemed to be happening as soon as the first files finished uploading.

I don't have access to the environment until Monday, but I might be able to reproduce over the weekend.