snakemake: default remote prefix duplicated for checkpoint output on cloud
Snakemake version 5.22.1
Describe the bug
Using the --default-remote-prefix parameter causes MissingInputException errors for the outputs of checkpoint rules.
Steps to Reproduce
Follow the Google Life Sciences Executor Tutorial, but convert the bwa_map rule into a checkpoint like this:
diff --git Snakefile Snakefile
index 68974d1..d519d4f 100644
--- Snakefile
+++ Snakefile
@@ -4,7 +4,7 @@ rule all:
     input:
         "plots/quals.svg"
 
-rule bwa_map:
+checkpoint bwa_map:
     input:
         fastq="samples/{sample}.fastq",
         idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
@@ -19,7 +19,7 @@ rule bwa_map:
 
 rule samtools_sort:
     input:
-        "mapped_reads/{sample}.bam"
+        lambda wildcards: checkpoints.bwa_map.get(sample=wildcards.sample).output[0]
     output:
         "sorted_reads/{sample}.bam"
     conda:
Then you should see the following when running the pipeline:
Building DAG of jobs...
MissingInputException in line 20 of Snakefile:
Missing input files for rule samtools_sort:
snakemake-testing-data/snakemake-testing-data/mapped_reads/A.bam
Notice how snakemake-testing-data appears prepended twice?
Bug Hypothesis
I think rules.apply_default_remote() is being applied to the output more than once. It might help to check whether incomplete is true on this line before executing rules.apply_default_remote()?
It might be possible that this is a bug that extends beyond the life sciences executor (to other cloud environments), but I haven’t tested that yet.
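To illustrate the hypothesis, here is a minimal sketch of how applying a prefix twice produces the path from the error message, and how a guard could avoid it. The functions apply_prefix and apply_prefix_once are hypothetical stand-ins written for this example; they are not Snakemake's apply_default_remote() and do not reflect how it is implemented.

```python
from pathlib import PurePosixPath


def apply_prefix(path: str, prefix: str) -> str:
    """Naively prepend the prefix (hypothetical stand-in, not Snakemake code)."""
    return str(PurePosixPath(prefix) / path)


def apply_prefix_once(path: str, prefix: str) -> str:
    """Prepend the prefix only if the path does not already start with it."""
    parts = PurePosixPath(path).parts
    prefix_parts = PurePosixPath(prefix).parts
    if parts[: len(prefix_parts)] == prefix_parts:
        return path  # prefix already present, leave the path alone
    return apply_prefix(path, prefix)


prefix = "snakemake-testing-data"
once = apply_prefix("mapped_reads/A.bam", prefix)
twice = apply_prefix(once, prefix)          # the duplicated path from the error
guarded = apply_prefix_once(once, prefix)   # the second application is a no-op

print(twice)
print(guarded)
```

A guard based on the path contents would misfire if a real directory happened to share the prefix's name, so tracking whether the prefix was already applied (for example with a flag on the file object) is probably the safer design.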
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (19 by maintainers)
Commits related to this issue
- create tests for issue #574 (checkpoints not working in cloud environments) — committed to aryarm/snakemake by aryarm 4 years ago
- do not reapply default remote prefix if checkpoint output is incomplete in rules.py (see #574) — committed to aryarm/snakemake by aryarm 4 years ago
- create tests for issue #574 (checkpoints not working in cloud environments) — committed to aryarm/snakemake by aryarm 4 years ago
- create tests for issue #574 to reproduce a problem with the use of checkpoints in cloud environments — committed to snakemake/snakemake by aryarm 3 years ago
- fix: issue with duplicated prefix for checkpoints on cloud (#1294) * create tests for issue #574 to reproduce a problem with the use of checkpoints in cloud environments * do not add default remo... — committed to snakemake/snakemake by aryarm 2 years ago
Ok, I may have found somewhat of a fix? I just deleted lines 722 and 723 out of rules.py, and everything just started working again.
I don’t really understand the underlying reason, but I’m guessing something about checkpoints causes apply_default_remote() in rules.py to be run more times than usual for a rule. So my approach was to try to figure out where that code is. And lines 722 and 723 seem like good candidates.
I’m planning on spending some more time on it tomorrow. Once I feel like I understand things better, I can try to explain here and maybe write up a PR!
@aryarm Okay, I see - thanks for the update. Unfortunately I’m not very familiar with the Snakemake code, but I’ve taken a quick look and it seems the logic for applying the default remote prefix has moved to here: https://github.com/snakemake/snakemake/blob/01d6102795c96ce695d6d7201f7e4655a1d5cac8/snakemake/path_modifier.py#L14
I’m not sure how much I can help with this, but I’ll take a deeper look sometime this week as I’m also trying to figure out what is causing #1260.
@aryarm, I tested 5.22.1 with your solution and also the 6.0.5 version (latest). I receive the following error with your fix:
snakemake/executors/__init__.py", line 1827, in handle_remote
    if isinstance(target, _IOFile) and target.remote_object.provider.is_default:
AttributeError: 'NoneType' object has no attribute 'provider'
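The shape of this error suggests that target.remote_object was None for at least one file. The sketch below reproduces that failure mode with stand-in classes and shows a defensive check. FakeIOFile, RemoteObject, Provider, and is_default_remote are all hypothetical names made up for this example; this is not Snakemake's _IOFile and not a proposed patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    is_default: bool


@dataclass
class RemoteObject:
    provider: Provider


@dataclass
class FakeIOFile:
    """Hypothetical stand-in for a file object that may lack a remote object."""
    path: str
    remote_object: Optional[RemoteObject] = None


def is_default_remote(target) -> bool:
    """Return True only if the target has a remote object with a default provider."""
    remote = getattr(target, "remote_object", None)
    return remote is not None and remote.provider.is_default


local = FakeIOFile("mapped_reads/A.bam")
remote = FakeIOFile("bucket/mapped_reads/A.bam", RemoteObject(Provider(True)))

try:
    local.remote_object.provider  # same attribute chain as in the traceback
except AttributeError as err:
    print(err)

print(is_default_remote(local), is_default_remote(remote))
```

Whether a None remote object is expected at that point, or is itself a symptom of the prefix never being applied after the lines were deleted, is not something this sketch can answer.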
I agree! @johanneskoester was on fire this morning 😃 🔥
Thanks @aryarm I really appreciate that! I’m hoping we will get our testing running again soon, and I’ll take a look at the commit to see if it can help the current test.