snakemake: Regression after 7.7.0, still present in 7.15.x: rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed

Snakemake version

7.15.1

Describe the bug

Rules downstream of a checkpoint aggregation rule are always re-executed, even when both the aggregation rule and the downstream rules have already been run and their output files exist.

Minimal example

This uses the docs example with two rules added downstream of aggregate. Snakemake keeps re-running the process and process2 rules even though their output files exist and nothing upstream has changed. It doesn’t matter whether there is one rule after aggregate or several; all of them are re-run every time. If you remove the downstream rules and go back to the exact docs code, things work as expected.

# a target rule to define the desired final output
rule all:
    input:
        "processed2.txt",


# the checkpoint that shall trigger re-evaluation of the DAG
# a number of files is created in a defined directory
checkpoint somestep:
    output:
        directory("my_directory/"),
    shell:
        """
        mkdir my_directory/
        cd my_directory
        for i in 1 2 3; do touch $i.txt; done
        """


# input function for rule aggregate, return paths to all files produced by the checkpoint 'somestep'
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
    return expand(
        "my_directory/{i}.txt",
        i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i,
    )


rule aggregate:
    input:
        aggregate_input,
    output:
        "aggregated.txt",
    shell:
        "echo AGGREGATED > {output}"


rule process:
    input:
        "aggregated.txt",
    output:
        "processed.txt",
    shell:
        "echo PROCESSED > {output}"


rule process2:
    input:
        "processed.txt",
    output:
        "processed2.txt",
    shell:
        "echo PROCESSED2 > {output}"

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 32 (13 by maintainers)

Most upvoted comments

Using the provided reprex, I confirmed that this was introduced in 7.8.0:

  • the bug exists for Snakemake 7.12.1, 7.12.0, 7.8.2, 7.8.1, 7.8.0
  • the bug does not exist for Snakemake 7.7.0

I purposefully tested the boundary of 7.8.1 and 7.8.2 because according to the changelog, the fix c634b78b4d7c4f6ef59e46c94162893e42de6f73 from PR #1704 was intended to address this issue of re-running rules downstream of a checkpoint.
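For anyone who wants to repeat such a bisection, here is a rough sketch of the procedure (the per-version virtual environments, the abbreviated version list, and the cleanup step are only illustrative):

# test a handful of Snakemake releases against the reprex above
for v in 7.7.0 7.8.0 7.8.1 7.8.2 7.12.1; do
    python -m venv "venv-$v"
    "venv-$v/bin/pip" install --quiet "snakemake==$v"
    rm -rf my_directory aggregated.txt processed*.txt .snakemake    # start from a clean slate
    "venv-$v/bin/snakemake" -c 1        # first run: builds everything
    "venv-$v/bin/snakemake" -c 1 -n     # affected versions schedule process/process2 again here
done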

Ok, problem identified. I have to catch a train now, but I’ll likely be able to fix it tomorrow.

Thanks for the minimal example and the exploration work all of you already did. This is now fixed in PR #1907!

But in this case, this is not needed as I am looking into this already.

@hermidalc My view is that most scientific software operates in a system which does not reward maintenance, especially for purely academic projects like Snakemake (cf. RStudio). Furthermore, developers of these software are often swamped with larger responsibilities such as research or teaching, which carry formal and harsher penalties if ignored. Faced with these pressures, I think it is reasonable that many bugs are left unfixed, or that easier issues are repeatedly addressed over complex ones.

Anyways, I think we are getting off-topic. We can agree to disagree, but again I feel your frustration. All my projects depend on Snakemake.

To steer us back, have you tried using 7.7.0 but specifying all the available rerun triggers? I think you wouldn’t have to give up checkpoints then, and you would still get the benefit of updated code rerun triggers, etc.
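As a concrete sketch of that suggestion: the trigger names below are the ones the 7.x --rerun-triggers option accepts, and leaving out input is what avoids the spurious re-execution reported further down in this thread:

snakemake -c 1 --rerun-triggers mtime params code software-env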

@hermidalc I’m glad you were able to find at least a partial solution.

While I agree that this bug is particularly frustrating, it may not be reasonable to expect a fix so soon. As far as I know, snakemake is an open-source project maintained by other scientists, researchers, and teachers; they have many formal responsibilities to be busy with, very few of which (if any) depend on maintaining this software. As much as you or I lack time to learn the codebase, I suspect they too lack the time to tackle this bug. So it goes, I guess.

Actually I realized this isn’t such an easy decision. Version 7.8.0 also introduced another feature that I love: Freezing environments to exactly pinned packages. So far I’ve been lucky in that the pipelines in which I’ve pinned the conda envs haven’t relied on checkpoint rules. It’s unfortunate that this new feature was introduced in the same release as this bug. I don’t want to have to choose!

@hermidalc What about pinning Snakemake to version 7.7.0? Is there some important feature in a more recent release that you need/want to use? If you’ve got 3 checkpoint rules in a pipeline, updating doesn’t seem like it would be worth it to me (I’ve only got 1 checkpoint rule, but I’m not planning on updating any time soon).
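For completeness, pinning is a one-liner with either package manager (adjust to however you normally install Snakemake):

pip install 'snakemake==7.7.0'
# or, in a conda/mamba environment
mamba install -c conda-forge -c bioconda 'snakemake=7.7.0'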

Thanks @jdblischak for providing the last working version; I was wondering which one it was and hadn’t browsed the CHANGELOG yet. I will downgrade instead of following @ning-y’s recommendation, since downgrading requires no code changes, though @ning-y it was a good workaround suggestion.

@hermidalc The workaround is to avoid using checkpoints for now. For example, checkpoints can be converted into rules that have an additional output listing the files they produced, and downstream rules can use that file list as input. It is hacky and not the best solution, but the alternative is to fix the bug ourselves (I’ve tried and failed; the source code is a little too complex for the amount of free time I have).
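To make that concrete, here is a rough sketch of such a conversion for the reprex above. The manifest file name and the fixed 1 2 3 file list are assumptions for the example; this only works when the produced files can be listed without re-evaluating the DAG:

# formerly the checkpoint: a plain rule that also writes a manifest of the files it produced
rule somestep:
    output:
        outdir=directory("my_directory/"),
        manifest="somestep_files.txt",
    shell:
        """
        mkdir -p {output.outdir}
        cd {output.outdir}
        for i in 1 2 3; do touch $i.txt; done
        ls > ../{output.manifest}
        """


# downstream rules depend on the manifest instead of on a checkpoint re-evaluation
rule aggregate:
    input:
        "somestep_files.txt",
    output:
        "aggregated.txt",
    shell:
        "echo AGGREGATED > {output}"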

I’ve had a nasty surprise from this bug, too, unfortunately. I have a very large workflow which I’ve been running piecemeal with --batch to get around the long DAG generation times. Any ideas for a workaround? At the moment I’m just using os.path.exists to establish which files don’t exist.
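As a rough illustration of that stop-gap (the TARGETS list and the helper name below are placeholders, not from the real workflow):

import os

# hypothetical list of final outputs for one batch of the workflow
TARGETS = ["processed.txt", "processed2.txt"]

def missing_targets(paths):
    # request only the outputs that do not exist yet, so finished branches
    # are not handed back to Snakemake and spuriously re-triggered
    return [p for p in paths if not os.path.exists(p)]

rule all:
    input:
        missing_targets(TARGETS),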

Sorry to complain… I hope this will get fixed at some point soon because it’s a pretty major regression. It is currently causing me problems because I have a very large Snakemake workflow in which rules downstream of a checkpoint aggregation are computationally expensive, and they keep getting re-executed even though they shouldn’t.

Hi, thanks for the reprex, I was running into the same problem in production. I confirm this bug with 7.8.2 as well as the latest 7.14.0.

On top of that, I found that Snakemake does not correctly list the files that should be updated, despite indicating the reason for rerun is a change in input files.

snakemake -c 1 -n --list-input-changes
snakemake -c 1 -n --list-code-changes
snakemake -c 1 -n --list-params-changes
# All three yield only the following preparation message
# Building DAG of jobs...

EDIT: Using --rerun-triggers input is enough to trigger the re-execution behavior, while --rerun-triggers mtime and the other triggers do not.

$ snakemake -c 1 -n --rerun-triggers input
Building DAG of jobs...
Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
all             1              1              1
process         1              1              1
process2        1              1              1
total           3              1              1


[Wed Sep 14 09:47:12 2022]
rule process:
    input: aggregated.txt
    output: processed.txt
    jobid: 2
    reason: Input files updated by another job: aggregated.txt
    resources: tmpdir=/tmp


[Wed Sep 14 09:47:12 2022]
rule process2:
    input: processed.txt
    output: processed2.txt
    jobid: 1
    reason: Input files updated by another job: processed.txt
    resources: tmpdir=/tmp


[Wed Sep 14 09:47:12 2022]
localrule all:
    input: processed2.txt
    jobid: 0
    reason: Input files updated by another job: processed2.txt
    resources: tmpdir=/tmp

Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
all             1              1              1
process         1              1              1
process2        1              1              1
total           3              1              1


This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

I believe this is connected to the comment I just made in #1694.

I suspect the new re-run behavior is being triggered inappropriately, perhaps by rules whose input/output/params are functions. I am not sure how the behavior is implemented, but I wonder: if one wrote a rule whose input function returned the current time of day (or a random number, etc.), would reruns always occur?
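Purely hypothetically, to illustrate the question (nothing below is from the Snakemake codebase):

import time

def timestamp_input(wildcards):
    # an input function whose return value changes on every evaluation
    return f"data_{int(time.time())}.txt"

rule suspicious:
    input:
        timestamp_input,
    output:
        "result.txt",
    shell:
        "touch {output}"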