cvmfs: Bug: deadlock in cvmfs2 process
Dear all,
we have been suffering for days from what looks like a bug where the cvmfs2 process serving sft.cern.ch deadlocks. (We have only seen this for sft.cern.ch!)
With the 268 nodes we have, we see a rate of >= 1 node per hour showing this behaviour. We (Wuppertal University) are an ATLAS Tier-2, so we have a lot of ATLAS workload; our local users (other than those from the HEP group) barely use CVMFS. We have not yet been able to determine whether it is related to a particular ATLAS job type.
Symptom:
[root@wn21268 ~]# ls -l /cvmfs/sft.cern.ch
ls: cannot access /cvmfs/sft.cern.ch: No such file or directory
Killing the cvmfs2 process and restarting autofs helps.
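For reference, our recovery procedure boils down to roughly the following (the matching pattern and the probe at the end are just how we do it; details may vary per setup):

[root@wn21268 ~]# pkill -9 -f 'cvmfs2 .*sft.cern.ch'    # kill the stuck client for sft.cern.ch
[root@wn21268 ~]# systemctl restart autofs              # let autofs remount the repository
[root@wn21268 ~]# cvmfs_config probe sft.cern.ch        # verify the mount is usable again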
Most interesting observation:
This returns nothing for the sft.cern.ch cvmfs2 process:
[root@wn21159 fd]# perf trace -p 26181 --duration 10
^C[root@wn21159 fd]#
while it shows normal activity for any other cvmfs2 process.
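For completeness, this is how we identify the PID we pass to perf (pgrep on the command line; to our understanding, cvmfs_talk's pid command gives the same answer):

[root@wn21159 ~]# pgrep -f 'cvmfs2 .*sft.cern.ch'
26181
[root@wn21159 ~]# cvmfs_talk -i sft.cern.ch pid
26181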
I attached gdb to one of these deadlocked processes:
(gdb) bt
#0 0x00007f9e0428db3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x00007f9e0428dbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007f9e0428dc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x00007f9e0323e5c8 in fuse_session_loop_mt () from /lib64/libfuse.so.2
#4 0x00007f9e03473b0f in FuseMain (argc=<optimized out>, argv=<optimized out>)
at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/loader.cc:1086
#5 0x0000000000404c3f in main (argc=5, argv=0x7ffddaca7248)
at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/fuse_main.cc:144
(gdb)
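Next time we catch a stuck node we can also dump all threads instead of just the main one; a sketch of the gdb batch invocation we would use (output file name is arbitrary):

[root@wn21159 ~]# gdb -p 26181 -batch -ex 'set pagination off' -ex 'thread apply all bt' > cvmfs2-sft-all-threads.txt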
Versions are:
cvmfs-2.11.0-1.el7.x86_64
cvmfs-config-default-2.0-1.noarch
cvmfs-libs-2.11.0-1.el7.x86_64
and the kernel is:
[root@wn21186 ~]# uname -a
Linux wn21186.pleiades.uni-wuppertal.de 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
We had already collected some DEBUG logs, but as Dave suggested on cvmfs-talk, we have just reconfigured the cluster to write DEBUG logs only for sft.cern.ch. As soon as we catch a node showing this behaviour again, we will add the relevant log to this issue.
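The per-repository DEBUG setting boils down to something like the following (the file and log path are our choice; CVMFS_DEBUGLOG is, to our knowledge, the relevant client parameter), followed by a cvmfs_config reload sft.cern.ch on the nodes:

# /etc/cvmfs/config.d/sft.cern.ch.local
CVMFS_DEBUGLOG=/var/log/cvmfs-debug-sft.cern.ch.log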
We cross-checked the CVMFS client config with DESY and the Squid config with LRZ-LMU and could not find any difference other than the local cache size (we have a 12 GB CVMFS cache on our nodes).
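For reference, the cache size on our nodes is set in /etc/cvmfs/default.local roughly as follows (CVMFS_QUOTA_LIMIT is in MB; the cache location shown is the default and only included for completeness):

# /etc/cvmfs/default.local (excerpt)
CVMFS_QUOTA_LIMIT=12000
CVMFS_CACHE_BASE=/var/lib/cvmfs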
Cheers
Martin and Torsten
Hi @vvolkl, unfortunately not yet; we're still waiting for our vendor to come up with a schedule. I'll update here as soon as we've made the switch.