catkin_tools: Possible deadlock with catkin build
System Info
- Operating System: Ubuntu 14.04 LTS
- Python Version: 2.7
- Version of catkin_tools: 0.4.2
- ROS Distro: Indigo
Build / Run Issue
I apologize in advance for this very imprecise bug report.
I noticed that, starting with the new version 0.4.x, catkin build will sometimes hang. It happens while processing a seemingly random package and only very rarely (I’d say 3 out of 100 builds on our Jenkins). Also, it will run just fine if I cancel and reschedule the build.
I stumbled upon the --no-install-lock option, which I just added to our build script in the hope that it will resolve this issue. I won’t be able to tell until sufficiently many builds have run, obviously.
In case anyone has an idea where to look for this problem, our build script runs the following commands:
catkin config -w ros --init --no-blacklist --install -j4 -p4 --cmake-args -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=/usr/lib/ccache/gcc -DCMAKE_C_FLAGS_DEBUG="-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0" -DCMAKE_CXX_COMPILER=/usr/lib/ccache/g++ -DCMAKE_CXX_FLAGS_DEBUG="-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-invalid-offsetof -Wno-unused-local-typedefs -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0" -DCMAKE_SHARED_LINKER_FLAGS_DEBUG="-Wl,-z,defs" -DCMAKE_EXE_LINKER_FLAGS_DEBUG="-Wl,-z,defs"
catkin clean -w ros --all --yes
catkin build -w ros --verbose --no-status --no-notify --continue-on-failure
The last command outputs:
--------------------------------------------------------------------------------
Profile: default
Extending: [env] /opt/ros/indigo
Workspace: /home/jenkins/ros
--------------------------------------------------------------------------------
Source Space: [exists] /home/jenkins/ros/src
Log Space: [missing] /home/jenkins/ros/logs
Build Space: [exists] /home/jenkins/ros/build
Devel Space: [exists] /home/jenkins/ros/devel
Install Space: [missing] /home/jenkins/ros/install
DESTDIR: [unused] None
--------------------------------------------------------------------------------
Devel Space Layout: merged
Install Space Layout: merged
--------------------------------------------------------------------------------
Additional CMake Args: -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=/usr/lib/ccache/gcc -DCMAKE_C_FLAGS_DEBUG=-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0 -DCMAKE_CXX_COMPILER=/usr/lib/ccache/g++ -DCMAKE_CXX_FLAGS_DEBUG=-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-invalid-offsetof -Wno-unused-local-typedefs -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0 -DCMAKE_SHARED_LINKER_FLAGS_DEBUG=-Wl,-z,defs -DCMAKE_EXE_LINKER_FLAGS_DEBUG=-Wl,-z,defs
Additional Make Args: -j4
Additional catkin Make Args: None
Internal Make Job Server: True
Cache Job Environments: False
--------------------------------------------------------------------------------
Whitelisted Packages: None
Blacklisted Packages: None
--------------------------------------------------------------------------------
Workspace configuration appears valid.
About this issue
- Original URL
- State: open
- Created 8 years ago
- Comments: 21 (4 by maintainers)
@roehling, thanks for the update. We’ve found a working theory and a dirty workaround that works robustly for us:
Theory:
The catkin-internal make job server runs out of job tokens (we don’t know why or how; FDs 6 and 7 (see above) are relevant, because --jobserver-fds=6,7 is passed on to the child make processes). Since catkin also uses these tokens internally, it blocks indefinitely while trying to acquire one out of zero available. The child processes in turn get blocked on output, because catkin also stops reading from the pipes to the children and their buffers run full at some point.
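For context, here is a minimal, simplified sketch of the GNU make jobserver token protocol that such an internal job server follows (illustrative only, not the actual catkin_tools implementation):

```python
import os

JOBS = 4

# The coordinating process creates a pipe and pre-loads it with N-1 tokens;
# together with the one job each participant may always run for free, this
# limits overall concurrency to N.
read_fd, write_fd = os.pipe()
os.write(write_fd, b'+' * (JOBS - 1))

def acquire_token():
    # Blocks until a token byte can be read from the pipe. If tokens are
    # lost (taken but never written back), every participant eventually
    # blocks here forever.
    return os.read(read_fd, 1)

def release_token(token):
    # A well-behaved job writes its token back when it finishes.
    os.write(write_fd, token)

# Child make processes are told about the pipe (e.g. via --jobserver-fds=6,7
# in MAKEFLAGS) and are expected to follow the same acquire/release discipline.
```

If a token is consumed but never returned, the pipe eventually runs empty and both catkin and the child make processes block on the read side, which matches the behaviour described above.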
Workaround:
A script runs periodically; whenever it detects a catkin process ($catkin_pid) having the issue, it performs:
sudo -u jenkins bash -c "echo -n +++ > /proc/$catkin_pid/fd/7"
to inject three fresh job tokens (catkin seems to use ‘+’ only; injecting just one would typically lead to it getting stuck again later).
Given this theory I doubt that this is actually a catkin bug. Interestingly, this only happens on our Jenkins slaves running in Docker containers (all of them, both Trusty and Xenial) and only for very big jobs (running > 10 min, producing > 10 MB of verbose output).
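A rough sketch of what such a watchdog could look like (hypothetical; the heuristic for detecting a stuck process is deliberately simplified, and FD 7 is the jobserver write end observed above):

```python
import os
import time

def find_catkin_pids():
    """Find PIDs whose command line looks like a 'catkin build' invocation."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/cmdline' % entry, 'rb') as f:
                cmdline = f.read().replace(b'\0', b' ')
        except IOError:
            continue
        if b'catkin' in cmdline and b'build' in cmdline:
            pids.append(int(entry))
    return pids

def looks_stuck(pid):
    """Very rough heuristic: the process is sleeping in a pipe-related wait.
    A real watchdog should also verify that no child is making progress."""
    try:
        with open('/proc/%d/wchan' % pid) as f:
            return 'pipe' in f.read()
    except IOError:
        return False

def inject_tokens(pid, fd=7, tokens=b'+++'):
    """Write job tokens into the jobserver pipe of the stuck process,
    equivalent to: echo -n +++ > /proc/<pid>/fd/7"""
    with open('/proc/%d/fd/%d' % (pid, fd), 'wb') as pipe:
        pipe.write(tokens)

if __name__ == '__main__':
    while True:
        for pid in find_catkin_pids():
            if looks_stuck(pid):
                inject_tokens(pid)
        time.sleep(60)
```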
Unfortunately, no. The closest thing I found is a warning in the documentation of subprocess.Popen.wait that describes similar symptoms. AFAICT it does not apply here, but maybe I overlooked something.
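For reference, the pattern that the subprocess.Popen.wait warning describes looks roughly like this ('some_chatty_command' is just a placeholder):

```python
import subprocess

p = subprocess.Popen(['some_chatty_command'],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Calling p.wait() here can deadlock: once the OS pipe buffer is full the
# child blocks on write(), while the parent waits for it to exit without
# ever reading the pipe.
out, err = p.communicate()  # reads both pipes while waiting, avoiding this
```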
We are experiencing similar issues, also within Jenkins jobs on both Ubuntu Trusty and Xenial (each using catkin 0.4.4). It seems to mainly affect long-running jobs with a lot of output (several MBs).
Every time it happens, the Python process (catkin) is blocked in a read system call on either FD 6 or 7 (both pipes); the other threads are waiting for a lock (probably the Python GIL). At the same time all child processes are blocked while writing to their FD 1 or 2.
Quite often (but not all the time) one of the children has already ended (zombie state).
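A hypothetical helper to check which FD such a stuck process is currently blocked reading from, by decoding /proc/<pid>/syscall (Linux-specific; the syscall number assumes x86_64):

```python
import sys

def blocked_read_fd(pid):
    """Return the FD a process is blocked reading from, or None."""
    with open('/proc/%d/syscall' % pid) as f:
        fields = f.read().split()
    # Format: syscall number, then up to six arguments in hex.
    # 0 is read(2) on x86_64; its first argument is the file descriptor.
    if fields and fields[0] == '0':
        return int(fields[1], 16)
    return None

if __name__ == '__main__':
    print(blocked_read_fd(int(sys.argv[1])))
```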
Interestingly, the whole build can always be continued by manually writing to the pipes of the catkin process, e.g. with
echo > /proc/<CATKIN_PID>/fd/6
(a single write to the FD currently being read from is enough). After that, everything continues as if nothing had happened.
Any idea how to solve / work around / debug this further?
@roehling, did you solve it on your end?