picard: Upstream deletions and CollectVariantCallingMetrics do not play nice right now.

The current VCF spec allows for a * allele (no brackets):

“The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion.”

CollectVariantCallingMetrics treats this as a third (size 1!!) allele so that in the case of

1   10347   .   TAAACCCTA   T   100 .   AC=2    GT  0/1 0/1
1   10350   .   A           C,* 100 .   AC=3    GT  1/2 0/2

both the 0/2 and the 1/2 genotypes in the second line are counted towards TOTAL_MULTIALLELIC_SNPS (in the detailed metrics). Neither genotype is counted towards TOTAL_SNPS, as that only captures bi-allelic SNPs. So upstream deletions are “hurting” both the monomorphic samples (which get an inflated TOTAL_MULTIALLELIC_SNPS) and the polymorphic samples (which get a deflated TOTAL_SNPS count).
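To make the effect concrete, here is a toy Python model of the counting described above. It is not Picard's code: the function name `count_naive`, the sample names `s1`/`s2`, and the rule "more than one ALT means multi-allelic" are assumptions chosen only to reproduce the behavior reported in this issue.

```python
from collections import Counter

def count_naive(alts, genotypes):
    """Toy model: '*' is counted as an ordinary ALT allele of the site.

    alts      -- list of ALT alleles for one SNP record, e.g. ["C", "*"]
    genotypes -- dict of sample name -> tuple of allele indices (0 = REF)
    """
    multiallelic = len(alts) > 1  # '*' inflates the ALT count here
    metrics = {}
    for sample, gt in genotypes.items():
        counts = Counter()
        if any(idx > 0 for idx in gt):  # sample carries a non-reference allele
            key = "TOTAL_MULTIALLELIC_SNPS" if multiallelic else "TOTAL_SNPS"
            counts[key] += 1
        metrics[sample] = counts
    return metrics

# Second record from the example: ALT = C,*  with genotypes 1/2 and 0/2
naive = count_naive(["C", "*"], {"s1": (1, 2), "s2": (0, 2)})
```

Under this model both samples receive one TOTAL_MULTIALLELIC_SNPS count and no TOTAL_SNPS count, even though the site has only one real alternate base.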

I propose changing this behavior so that an upstream deletion will count as the reference allele for the purpose of metrics.

I will also add a column or two to capture the number of upstream deletions, perhaps counting the 0/2 genotypes separately from the 1/2 genotypes.
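A rough Python sketch of the proposed counting, for discussion only. It is not Picard's implementation: the function name `count_proposed` and the column names `UPSTREAM_DEL_WITH_ALT` and `UPSTREAM_DEL_WITHOUT_ALT` are hypothetical placeholders, and the sketch assumes fully called genotypes given as allele indices.

```python
from collections import Counter

SPAN_DEL = "*"  # VCF symbol for an allele missing due to an upstream deletion

def count_proposed(alts, gt):
    """Count one sample's genotype at one SNP record, treating '*' as REF.

    alts -- list of ALT alleles for the record, e.g. ["C", "*"]
    gt   -- tuple of allele indices for the sample (0 = REF)
    """
    alleles = [None] + list(alts)            # index 0 stands for REF
    called = [alleles[idx] for idx in gt]
    has_star = any(a == SPAN_DEL for a in called)
    # Real (non-'*') ALT alleles carried by the sample and present at the site
    real_called = [a for a in called if a is not None and a != SPAN_DEL]
    real_site_alts = [a for a in alts if a != SPAN_DEL]

    counts = Counter()
    if has_star:
        # Hypothetical extra columns: keep 1/2-style and 0/2-style apart
        key = "UPSTREAM_DEL_WITH_ALT" if real_called else "UPSTREAM_DEL_WITHOUT_ALT"
        counts[key] += 1
    if real_called:
        # Multi-allelic only if more than one real ALT remains at the site
        key = "TOTAL_MULTIALLELIC_SNPS" if len(real_site_alts) > 1 else "TOTAL_SNPS"
        counts[key] += 1
    return counts

het_with_del = count_proposed(["C", "*"], (1, 2))   # the 1/2 sample
ref_with_del = count_proposed(["C", "*"], (0, 2))   # the 0/2 sample
```

In this sketch the 1/2 sample contributes to TOTAL_SNPS plus the "with ALT" deletion column, while the 0/2 sample contributes only to the "without ALT" deletion column.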

Does this sound reasonable to folks?

@eitanbanks @tfenne ?

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 22 (15 by maintainers)

Most upvoted comments

still a thing! on my “todo” list too!

On Wed, Jan 18, 2017 at 8:23 PM, Geraldine Van der Auwera <notifications@github.com> wrote:

Is this still a thing or has the relevant work been done/closed?
