snakemake: Jobs updated by checkpoint are not getting run

Snakemake version

Works:

  • 5.20.0
  • 5.25.0
  • 5.26.0
  • 5.26.1

Fails:

  • HEAD (5.27.4+1.g4a63a269)
  • 5.27.4
  • 5.27.0

Describe the bug

The test run for our DTR-phage-pipeline has stopped working recently. It has a few checkpoints and 160+ total jobs, so it’s not the simplest test case, but it did work in earlier versions.

Instead of completing in one go, it stops after each checkpoint. Debugging output indicates that the downstream jobs are being updated, but the run stops anyway. Restarting snakemake continues where it left off.
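For reference, the workflow follows the standard Snakemake checkpoint/aggregation pattern, sketched below with hypothetical, simplified rule and file names (not the actual pipeline rules): a checkpoint writes a directory of bins, and a downstream rule's input function expands over whatever bins were created, so that rule can only be scheduled after the checkpoint finishes.

import os  # needed for the path joins in the input function below

# Hypothetical sketch of the pattern, not the real DTR-phage-pipeline rules.
checkpoint generate_bins:
    input: "kmer_binning/bin_membership.tsv"
    output: directory("kmer_binning/bins")
    shell: "split_into_bins.py {input} {output}"   # placeholder script

rule process_bin:
    input: "kmer_binning/bins/{bin}.fasta"
    output: "bin_info/{bin}.tsv"
    shell: "summarize_bin.py {input} > {output}"   # placeholder script

def bin_info_files(wildcards):
    # Resolved only after the checkpoint has finished.
    bin_dir = checkpoints.generate_bins.get(**wildcards).output[0]
    bins = glob_wildcards(os.path.join(bin_dir, "{bin}.fasta")).bin
    return expand("bin_info/{bin}.tsv", bin=bins)

# This is the kind of downstream job that snakemake reports as "Updating"
# once the checkpoint completes.
rule combine_bin_info:
    input: bin_info_files
    output: "bin_info/combined.tsv"
    shell: "cat {input} > {output}"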

Logs

Here is the end of the output from the first run:

checkpoint generate_bins_and_stats:
    input: test/test-output/25m/1D/v2/kmer_binning/bin_membership.tsv
    output: test/test-output/25m/1D/v2/kmer_binning/bin_stats.csv, test/test-output/25m/1D/v2/kmer_binning/bins
    jobid: 11
Downstream jobs will be updated after completion.

[Thu Nov  5 14:15:24 2020]
rule plot_umap_freq_map_bins:
    input: test/test-output/25m/1D/v2/kmer_binning/bin_membership.tsv
    output: test/test-output/25m/1D/v2/kmer_binning/kmer_comp.umap.bins.png
    jobid: 9
python /mnt/data0/jmeppley/projects/DTR/TBD/DTR-phage-pipeline/scripts/plot_umap_with_hdbscan_labels.py -o test/test-output/25m/1D/v2/kmer_binning/kmer_comp.umap.bins.png -s 7 -a 0.2 test/test-output/25m/1D/v2/kmer_binning/bin_membership.tsv
No kaiju DB found in /media/sfodata/databases/kaiju/nr_euk.
Skipping Kaiju
No kaiju DB found in /media/sfodata/databases/kaiju/nr_euk.
Skipping Kaiju
Job counts:
        count   jobs
        1       generate_bins_and_stats
        1
Activating conda environment: /mnt/data0/jmeppley/projects/DTR/TBD/DTR-phage-pipeline/.snakemake/conda/53e51836
Updating job 12 (combine_bin_cluster_read_info).
Replace combine_bin_cluster_read_info with dynamic branch combine_bin_cluster_read_info
updating depending job all
Updating job 13 (create_bin_cluster_read_lists).
Replace create_bin_cluster_read_lists with dynamic branch create_bin_cluster_read_lists
updating depending job combine_all_polished_ref_reads
updating depending job aggregate_prodigal_statistics
updating depending job combine_bin_cluster_strand_counts_into_table
updating depending job combine_dtr_aligns
updating depending job all
Updating job 21 (build_bin_cluster_summary_table).
Replace build_bin_cluster_summary_table with dynamic branch build_bin_cluster_summary_table
updating depending job combine_all_draft_stats
Updating job 0 (all).
Replace all with dynamic branch all
[Thu Nov  5 14:15:25 2020]
Finished job 11.
11 of 86 steps (13%) done
[Thu Nov  5 14:15:26 2020]
Finished job 8.
12 of 86 steps (14%) done
Mapping 869 data points...
Saving test/test-output/25m/1D/v2/kmer_binning/kmer_comp.umap.bins.notext.png
Mapping 869 data points...
[Thu Nov  5 14:15:26 2020]
Finished job 7.
13 of 86 steps (15%) done
bin -1 (66 reads)
bin 0 (315 reads)
bin 1 (79 reads)
bin 2 (45 reads)
bin 3 (29 reads)
bin 4 (12 reads)
bin 5 (27 reads)
bin 6 (15 reads)
bin 7 (42 reads)
bin 8 (42 reads)
bin 9 (20 reads)
bin 10 (13 reads)
bin 11 (36 reads)
bin 12 (28 reads)
bin 13 (25 reads)
bin 14 (16 reads)
bin 15 (18 reads)
bin 16 (29 reads)
bin 17 (12 reads)
Saving test/test-output/25m/1D/v2/kmer_binning/kmer_comp.umap.bins.png
Saving test/test-output/25m/1D/v2/kmer_binning/kmer_comp.umap.bins.nolabels.png
[Thu Nov  5 14:15:31 2020]
Finished job 9.
14 of 86 steps (16%) done
Complete log: /mnt/data0/jmeppley/projects/DTR/TBD/DTR-phage-pipeline/.snakemake/log/2020-11-05T141437.844467.snakemake.log
unlocking
removing lock
removing lock
removed all locks

Despite being only 16% (14 of 86 steps) done, snakemake just stops there.

Minimal example

I’m working on a simpler example. For now, the bug should be easy to reproduce by cloning the linked repo:

$ git clone https://github.com/nanoporetech/DTR-phage-pipeline
$ cd DTR-phage-pipeline
$ mamba env create -p ./env -f ./environment.yaml
$ conda activate ./env
$ snakemake --use-conda --conda-frontend mamba -j 20 -p

Additional context

Oddly, if the first few rules (before the checkpoint) have already been completed, then the jobs after the checkpoint get run as they should. The erroneous stopping only occurs when starting with no output files.
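For what it’s worth, the workaround is simply to invoke snakemake a second time with the same arguments (shown here with the flags from the reproduce steps above); the second run picks up the updated post-checkpoint jobs and finishes:

$ snakemake --use-conda --conda-frontend mamba -j 20 -p   # stops early after the checkpoint
$ snakemake --use-conda --conda-frontend mamba -j 20 -p   # continues and completes the remaining jobs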

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 33 (25 by maintainers)

Most upvoted comments

@johanneskoester Could you have a look at this and decide what needs to be done?

Thank you folks! This is now fixed by @jmeppley’s merged PR #801.

I can confirm seeing the same issue still on 5.30.1. On the first run snakemake completes without running the checkpoint jobs; only after restarting the pipeline does it complete them all.

@epruesse,

It seems there is more than one of these rare checkpoint bugs. We managed to fix one here, but you’re clearly seeing a different one.

I found another (#817) and submitted a PR (#818). I have no idea if it’s the same problem you’re seeing, but it’s probably worth trying out.

If that doesn’t work, open a new issue with as many details as you can provide (particularly the output logs and the snakefile), even if you’ve already entered them in this thread. I’m not an official developer, but I’m interested in getting checkpoints to work, so feel free to tag me in it.

@pb-cdunn, not quite. It can depend on the output of the checkpoint, just not on any other rules. So it is dynamic, just not very flexible.

I’m also not suggesting that’s OK, by any measure; I’m just trying to isolate the bug.
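To make that distinction concrete, here is a rough sketch (hypothetical names, not taken from either pipeline, and assuming a checkpoint split_data with a directory output plus, for the second variant, some other rule that produces processed/{n}.txt) of the two ways the aggregation rule’s input function can be written:

import os  # for the path joins below

# Variant A: the gather rule depends only on files the checkpoint itself wrote.
def checkpoint_files(wildcards):
    out_dir = checkpoints.split_data.get(**wildcards).output[0]
    names = glob_wildcards(os.path.join(out_dir, "{n}.txt")).n
    return expand(os.path.join(out_dir, "{n}.txt"), n=names)

# Variant B: each checkpoint item is routed through another rule first.
def processed_files(wildcards):
    out_dir = checkpoints.split_data.get(**wildcards).output[0]
    names = glob_wildcards(os.path.join(out_dir, "{n}.txt")).n
    return expand("processed/{n}.txt", n=names)

rule gather:
    input: checkpoint_files      # swap in processed_files for variant B
    output: "gathered.txt"
    shell: "cat {input} > {output}"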

This is well over a month old now. In my opinion, snakemake should be reverted to 5.26.1. To be completely clear, snakemake was not very useful to us until the introduction of checkpoints, which allow a dynamic workflow graph. This is not a minor regression.

@johanneskoester, if there is no response soon, I think we will be forced to fork snakemake and include the fork in our software distributions, as more and more users are complaining about this. I don’t mind bugs, and I’m not criticizing the bug itself. I only ask for some discussion of its importance and possible resolution.

Unsurprisingly this is still an issue in 5.28.

Starting a new pipeline from scratch (pbipa), there are multiple phases of sharding/gathering jobs through a checkpoint command. After each gather phase, the pipeline ends with a message similar to the one first reported by @jmeppley, e.g. 4 of 47 steps (9%) done. Just rerunning snakemake will then correctly continue sharding/analysing/gathering and then end cleanly, despite more jobs remaining to be rerun.

Not sure if the difference is starting from scratch or the lack of multiple wildcard instantiations of a checkpoint, but this behaviour is the opposite of what I reported earlier, where the output was correct but snakemake failed anyway post-checkpoint.