snakemake: --google-lifesciences segmentation fault
Snakemake version
Tested on 7.0.0, 6.15.5 and 6.15.0
Describe the bug
Segmentation fault (core dumped) when executing with --google-lifesciences.
Logs
Minimal example
Snakefile:
rule all:
    input: expand("done{i}.txt", i=range(100))

rule test:
    output: "done{i}.txt"
    shell: "echo hi > {output}"
Command line:
snakemake --google-lifesciences --default-remote-prefix snake-test -j10
Additional context
Tried -j10 and -j100, to no effect; the same error occurs either way.
haha yes good observations indeed! I absolutely love using Google Cloud, but the APIs are constantly moving targets and there are many ways to do the same thing. I try to make the best decision for the time, but I suspect the right choice also changes over time.
Yea, it totally makes sense to use the storage client. So it does seem that the storage client generates its own credentials from the environment. I guess it could be crosstalk? But that seems unlikely, as @CowanCS1 mentions. Gonna push some commits and test on a larger workflow to see if I can replicate.
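For context, a minimal sketch (not Snakemake's actual code) of how the google-cloud-storage client derives its credentials from the environment when none are passed explicitly; the bucket name is just borrowed from the example above:

```python
from google.cloud import storage

# With no explicit credentials, the client falls back to Application Default
# Credentials, e.g. the service-account file pointed to by
# GOOGLE_APPLICATION_CREDENTIALS, or the ambient credentials on a GCE VM.
client = storage.Client()

# Bucket name taken from the --default-remote-prefix in the example above.
bucket = client.bucket("snake-test")
print(bucket.exists())  # makes an authenticated API call
```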
Thanks @cademirch 😃
I’ve tested this version, and can confirm that it eliminates the SIGSEGV, all of the exotic SIGABRT errors, and all of the SSL exceptions. Nice!
Unexpectedly, it also eliminates another issue I was seeing with this test script, where a subset of output files (5-10%) were either not present in cloud storage or present but not recognized by the job. I saw the latter type of MissingOutputExceptions frequently in my own pipeline. With this version, all of the outputs are present every time.
That’s actually all of the issues I was tracking, so hurrah! 👍
@vsoch Thanks for reaching out to get advice - this version is probably creating 10-20 connections for each of these quick jobs, and that count would increase linearly with time due to the status checks. I hesitate to go fully into implementing a pool of connections, since I could imagine some edge cases like stale connections which we'd have to handle, for minimal benefit compared to simpler solutions. My currently favored compromise is to create a single HTTP connection for each call to `_run` and then maintain one for `_wait_jobs`, which would reduce the initial connection count 5-10x and eliminate the scaling with time. Since this version is working fine, I'll wait to implement anything until you get feedback.

Cheers all
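A rough sketch of that compromise, under stated assumptions: the helper and resource paths are illustrative, and only the method names `_run` and `_wait_jobs` come from the discussion above; the real executor is structured differently.

```python
from googleapiclient import discovery


def build_lifesciences_client():
    # Each discovery client owns its own httplib2.Http transport, which is
    # not thread-safe, so concurrent calls should not share one instance.
    return discovery.build("lifesciences", "v2beta", cache_discovery=False)


def _run(job):
    # Fresh connection per job submission; never shared across threads.
    service = build_lifesciences_client()
    # ... submit via service.projects().locations().pipelines().run(...)


def _wait_jobs(jobs):
    # One long-lived connection reused for all status polls, so the
    # connection count no longer grows with time.
    service = build_lifesciences_client()
    # ... poll service.projects().locations().operations().get(...) per job
```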
@CowanCS1 Running the MRE with `-j 1` right now and it has not failed yet. Thanks for running the gdb btw.

@vsoch I gave it a shot earlier today by trying to get gdb to attach to the running snakemake process, but had no success unfortunately. I'll look into it more in the morning, though.
Adding to @CowanCS1’s findings. I ran the test with faulthandler enabled and got the following traceback:
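For reference, a minimal sketch of one way to enable faulthandler for a run like this (an assumption about the setup, not necessarily how it was enabled here):

```python
# Enable faulthandler so that a fatal signal (SIGSEGV, SIGABRT, ...) dumps the
# Python traceback of every thread before the process dies. Equivalently, set
# PYTHONFAULTHANDLER=1 in the environment or run python with -X faulthandler.
import faulthandler

faulthandler.enable()
```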