fd: fd can get stuck when run on the whole filesystem from the root

The command I ran is just fd foobar /. It gets stuck and never ends. Sadly, I was unable to work out what’s causing it. I can investigate further if you have ideas about what else I could check.

Below is the partition setup:

# lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 223,6G  0 disk  
├─sda1           8:1    0   512M  0 part  /boot/efi
├─sda2           8:2    0   200M  0 part  
│ └─cryptboot  254:3    0   198M  0 crypt /boot
└─sda3           8:3    0 222,9G  0 part  
  └─lvm        254:0    0 222,9G  0 crypt 
    ├─vg0-swap 254:1    0     8G  0 lvm   [SWAP]
    └─vg0-root 254:2    0 214,9G  0 lvm   /

EDIT: The issue reproduces every time on this Arch Linux setup. I couldn’t reproduce it on another Arch Linux machine with an unencrypted system.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 37 (15 by maintainers)

Most upvoted comments

I can reproduce this now:

tavianator@graphene$ (sleep 1& (sleep 2 && fd . /proc/${!}/net --show-errors)& exec /bin/sleep 3)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
...

That command creates a zombie process (the sleep 1) by replacing the shell with a command that won’t wait() for its children (the exec /bin/sleep 3). Meanwhile, the subshell waits for sleep 1 to exit and become a zombie, then runs fd on its /proc/<PID>/net directory. For a zombie process, the open() succeeds but readdir() fails with EINVAL. This is key to triggering the error.
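To illustrate why a readdir() error that keeps coming back can look like a hang, here is a minimal sketch of a directory loop that stops at the first error instead of skipping it. It is not fd’s code and not the eventual fix in Rust; the helper name list_until_error is made up for this example. The assumption, based on the repeating error lines above, is that the iterator may return the same error on every call, so a loop that simply continues past errors would never finish.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: collects entry names from `dir`, stopping at the
/// first error. Returns the names read so far plus that error, if any.
fn list_until_error(dir: &Path) -> (Vec<String>, Option<io::Error>) {
    let mut names = Vec::new();

    // Opening the directory can fail on its own (e.g. ENOENT, EACCES).
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(e) => return (names, Some(e)),
    };

    for entry in entries {
        match entry {
            Ok(entry) => names.push(entry.file_name().to_string_lossy().into_owned()),
            // Stop here rather than `continue`: if the iterator yields the
            // same error on every call, skipping it would loop forever.
            Err(e) => return (names, Some(e)),
        }
    }

    (names, None)
}

fn main() {
    // Directory to list; defaults to the current directory.
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let (names, err) = list_until_error(Path::new(&dir));
    println!("read {} entries from {}", names.len(), dir);
    if let Some(e) = err {
        eprintln!("stopped early: {}", e);
    }
}
```

Pointing this at the /proc/<PID>/net directory of a zombie should, if the behaviour described above holds, report a single EINVAL and exit instead of printing the error endlessly.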

Those with a long memory might remember the bug https://github.com/rust-lang/rust/issues/50619, which @sharkdp filed and then fixed as a result of this bug.

Unfortunately, some silly programmer named @tavianator reintroduced the bug in https://github.com/rust-lang/rust/pull/92778. Or to be a little more charitable, the original fix only applied to some platforms, of which Linux used to be one. But now Linux uses a different ReadDir implementation that is better in many ways but regressed this bug. Oops!

I guess I’ll fix it in Rust, unless someone beats me to it.

Well, that’s pretty much it, except that I found a weird case in comment https://github.com/sharkdp/fd/issues/288#issuecomment-383637131. It looks like excluding /proc/[0-9]* and searching in / makes it work, but excluding the same thing and searching in /proc makes it fail… which is pure nonsense to me. I was trying to find out whether there was a culprit file and stopped when I discovered that, because it didn’t make any sense.

The exact same computer I ran the tests on yesterday morning now succeeds on fd foobar /… I’ll try to test the other one that had the issue when I get to work.