snakemake: Checkpoint aggregate returns checkpoint output dir instead of files

Hi, the following has been observed in 5.7.1 (edit: also in v5.7.4 and v5.6.0)

For jobs waiting for checkpoint output, I get failed jobs with the following confusing output (simplified):

rule merge_mono_dinucleotide_fraction:
    input: <TBD>
    output: <OMITTED path to output file>
    log: <OMITTED path to log file>
    jobid: 0
   <OMITTED wildcards, resources etc...>

Error in rule merge_mono_dinucleotide_fraction:
    jobid: 0
    output: <OMITTED>
    log: <OMITTED>
    shell:
        samtools merge -@ 6 -O BAM <OMITTED: correct path to output file> input/fastq/strand-seq/HG00733_PRJEB12849/requests &> <OMITTED: path to log file>
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The <TBD> (to be determined?) probably tells me that Snakemake needs to evaluate the checkpoint in the input function - ok. The checkpoint is long-running (downloading data), and adding a pdb.set_trace() inside my input function shows that there is an IncompleteCheckpoint exception raised (as expected, I presume). Now the problematic part: the path input/fastq/strand-seq/HG00733_PRJEB12849/requests is the directory() output of the checkpoint. Checking the log file of samtools for the above failed job shows the following:

[E::hts_hopen] Failed to open file input/fastq/strand-seq/HG00733_PRJEB12849/requests
[E::hts_open_format] Failed to open file input/fastq/strand-seq/HG00733_PRJEB12849/requests
samtools merge: fail to open "input/fastq/strand-seq/HG00733_PRJEB12849/requests": Is a directory

Apparently, Snakemake detects the unfinished checkpoint, but returns the directory() of the checkpoint as input to the rule (in this case merge_mono_dinucleotide_fraction). If I wait for all jobs to fail, and for the checkpoint to finish, and restart the pipeline, the workflow continues as expected (= showing that the aggregate input function works as intended).

I have trouble coming up with a minimal reproducible example for this, maybe because it’s about timing, or the reason is actually something else - nevertheless, the log output of samtools clearly shows that Snakemake executes the rule with the checkpoint output, instead of the output collected by the aggregate input function. Thanks for looking into this.

Best, Peter

Below the code of my aggregate input function - as stated above, this works as intended after waiting for the checkpoint to complete (see my comment below):

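def collect_merge_files(wildcards):
    """
    Collect the two BAM files to merge for one library, based on the
    request files written by the checkpoint.
    """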
    individual = wildcards.individual
    bioproject = wildcards.bioproject
    platform = wildcards.platform
    project = wildcards.project
    lib_id = wildcards.lib_id

    requests_dir = checkpoints.create_bioproject_download_requests.get(individual=individual, bioproject=bioproject).output[0]

    search_pattern = '_'.join([individual, project, '{spec}', lib_id, '{run_id}', '1'])

    search_path = os.path.join(requests_dir, search_pattern + '.request')

    checkpoint_wildcards = glob_wildcards(search_path)

    bam_files = expand(
        'output/alignments/strandseq_to_reference/{reference}.{individual}.{bioproject}/{individual}_{project}_{spec}_{lib_id}_{run_id}.filt.sam.bam',
        zip,
        reference=[wildcards.reference, wildcards.reference],
        individual=[individual, individual],
        bioproject=[bioproject, bioproject],
        project=[project, project],
        spec=checkpoint_wildcards.spec,
        lib_id=[lib_id, lib_id],
        run_id=checkpoint_wildcards.run_id)

    assert len(bam_files) == 2, 'Missing merge partner: {}'.format(bam_files)

    return sorted(bam_files)
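
The pattern-matching step in this function can be tried outside Snakemake. The following is a minimal sketch that mimics what glob_wildcards does for this one pattern, using a regular expression. The helper name match_request_files, the file names, and the assumption that spec and run_id contain no underscores are all made up for illustration and are not part of the original workflow.

```python
import os
import re
import tempfile


def match_request_files(requests_dir, individual, project, lib_id):
    """Stand-in for glob_wildcards: collect (spec, run_id) pairs from file names."""
    # Assumption: spec and run_id contain no underscores. Snakemake's own
    # wildcards are greedier by default, so this is only an approximation.
    pattern = re.compile(
        "_".join([
            re.escape(individual),
            re.escape(project),
            r"(?P<spec>[^_]+)",
            re.escape(lib_id),
            r"(?P<run_id>[^_]+)",
            "1",
        ]) + r"\.request$"
    )
    pairs = []
    for name in sorted(os.listdir(requests_dir)):
        m = pattern.match(name)
        if m:
            pairs.append((m.group("spec"), m.group("run_id")))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical request files, invented for this example only.
    for name in [
        "HG00733_projA_specX_lib01_run1_1.request",
        "HG00733_projA_specX_lib01_run2_1.request",
        "HG00733_projA_specX_lib02_run3_1.request",
    ]:
        open(os.path.join(tmp, name), "w").close()

    # Only the two lib01 files should match; lib02 belongs to another library.
    pairs = match_request_files(tmp, "HG00733", "projA", "lib01")
    print(pairs)
```

With exactly two matching request files, the list has two entries, which is the case the assert on the number of BAM files in the function above expects.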

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 31 (10 by maintainers)

Most upvoted comments

beacon: same error still exists in Snakemake v5.10.0

I can (sort of) reproduce this with a small Snakefile running locally. I say “sort of” because it requires a missing input (which would normally be OK) to trigger the behavior.

The Snakefile:

"""
Dummy snakefile to reproduce issue #55

 * fragments input test
 * runs grep on each fragment separately
 * combines output

"""

input_file = config['input_file']
grep_pattern = config.get('pattern', 'rule ')
out_prefix = config.get('out_prefix', 'output')
chunk_size = config.get('chunk_size', 10)

wildcard_constraints:
    chunk=r'\w+'

rule all:
    input:
        f"{out_prefix}.matches",

checkpoint fragment_input:
    input: input_file
    output: directory(f'{out_prefix}.fragments')
    shell:
        """
            mkdir -p {output}
            split -l {chunk_size} {input} {output}/fragment.
        """

rule annotate_fragment:
    input: '{out_prefix}.fragments/fragment.{chunk}'
    output: '{out_prefix}.fragments/fragment.{chunk}.matches'
    threads: 2
    shell:
        'grep "{grep_pattern}" {input} > {output} || true'

import re

def get_fragment_outputs(wildcards):
    # .get() ties this rule to the checkpoint; the glob below only
    # sees the fragments once the checkpoint has run
    frag_dir = checkpoints.fragment_input.get().output[0]
    chunks, = glob_wildcards(f'{frag_dir}/fragment.{{chunk}}')
    # wildcard_constraints not honored, so we do it manually
    chunks = [c for c in chunks if re.match(r'^\w+$', c)]
    return expand('{out_prefix}.fragments/fragment.{chunk}.matches',
                  out_prefix=out_prefix,
                  chunk=chunks)

rule combine_outputs:
    input: get_fragment_outputs
    output:
        f"{out_prefix}.matches",
    shell:
        "cat {input} > {output}"

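The manual filter in get_fragment_outputs matters because an unconstrained {chunk} wildcard in fragment.{chunk} also matches the .matches files that annotate_fragment writes into the same directory. A small sketch in plain Python, assuming a directory listing with two fragments and their outputs (the file names are illustrative, following the two-letter suffixes that split produces):

```python
import re

# Illustrative listing of output.fragments/ after annotate_fragment has run.
names = [
    "fragment.aa",
    "fragment.aa.matches",
    "fragment.ab",
    "fragment.ab.matches",
]

prefix = "fragment."
# Everything after the prefix is what an unconstrained {chunk} would capture.
chunks = [n[len(prefix):] for n in names if n.startswith(prefix)]

# Same filter as in get_fragment_outputs: keep only plain word-character suffixes.
kept = [c for c in chunks if re.match(r"^\w+$", c)]

print(chunks)
print(kept)
```

Without the filter, the aggregate function would also request files such as fragment.aa.matches.matches.
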
If you run this with:

snakemake -p --config input_file=Snakefile

…it works fine. If you delete the final output and one of the intermediate files, then re-run with an input file that does not exist (note input_file=Snakefil below), you get the error:

(/mnt/data0/jmeppley/projects/snakemake/test-env) [jmeppley@tyrosine test-tbd]$ rm output.fragments/fragment.ac.matches output.matches 
(/mnt/data0/jmeppley/projects/snakemake/test-env) [jmeppley@tyrosine test-tbd]$ snakemake -p --config input_file=Snakefil -j 3
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 3
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       combine_outputs
        2

[Thu Jun  4 16:23:25 2020]
rule combine_outputs:
    input: <TBD>
    output: output.matches
    jobid: 1

cat output.fragments > output.matches
cat: output.fragments: Is a directory
[Thu Jun  4 16:23:25 2020]
Error in rule combine_outputs:
    jobid: 1
    output: output.matches
    shell:
        cat output.fragments > output.matches
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job combine_outputs since they might be corrupted:
output.matches
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/data0/jmeppley/projects/snakemake/test-tbd/.snakemake/log/2020-06-04T162325.566085.snakemake.log

If you build a comparable Snakefile without a checkpoint and re-run it from the same spot with a missing start file, Snakemake recognizes that the intermediate files exist and works from them. With a checkpoint rule, however, the input stays <TBD> and the checkpoint's output directory is passed to the shell command instead.
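
Until the underlying behavior is sorted out, one possible stopgap is to fail early with an explicit message whenever a resolved input turns out to be a directory. This is only a sketch of that idea in plain Python, not a fix for the scheduling problem; the helper name reject_directory_inputs is hypothetical and the paths are created in a temporary directory just for the demonstration.

```python
import os
import tempfile


def reject_directory_inputs(paths):
    """Return the paths as a list, or raise ValueError if any is a directory."""
    dirs = [p for p in paths if os.path.isdir(p)]
    if dirs:
        raise ValueError(
            "Expected files but got directories: {}".format(dirs)
        )
    return list(paths)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the situation from the log: a checkpoint output directory
    # and one regular file inside it.
    frag_dir = os.path.join(tmp, "output.fragments")
    os.mkdir(frag_dir)
    frag_file = os.path.join(frag_dir, "fragment.aa.matches")
    open(frag_file, "w").close()

    # Regular files pass through unchanged.
    ok = reject_directory_inputs([frag_file])

    # The checkpoint directory itself is rejected with a clear message.
    try:
        reject_directory_inputs([frag_dir])
        message = None
    except ValueError as err:
        message = str(err)
    print(message)
```

In a rule that uses a run: block, a helper like this could be called on the rule's input before the actual command is started, so the job stops with an explicit error instead of the "Is a directory" message from cat or samtools.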