UMI-tools: Problems with dedup output and RSEM
Hi,
I’m using umi_tools dedup to remove PCR duplicates from an alignment to the transcriptome with STAR. After the deduplication, when I run RSEM, it seems that there are some reads from the pairs that are lost since the program exits with the following error:
Read ST-E00114:1178:HFL75CCX2:7:1101:1610:55297_TTGCCATCTC: The adjacent two lines do not represent the two mates of a paired-end read! (RSEM assumes the two mates of a paired-end read should be adjacent)
The ran the command with --paired --multimapping-detection-method=NH --unpaired-reads=discard --chimeric-pairs=discard --unmapped_reads=discard
I have seen that this problem was already discussed in #384, but there is not an option on how to solve this. Do you have any idea on how to solve this issue or a workaround that could work for this case?
Thank you very much
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 65 (29 by maintainers)
As for release schedule: I can’t see any reason not to include this script in the next release. We have a set of things we’d like to tidy up before the next release, but its probably not a million miles off.
Yeah, I thought that it might be coming from
$options.args. There is still a potential theoretical problem with situations where read1 is alignmed but read2 isn’t, so I’ll see what I can think of for that.I’m definitely open to suggestions for a more elegant fix, so let me know if you have one.
Salmon/RSEM expect that BAM files are sorted such that each read is immediately followed by its mate. This means that there must be exactly read 2 for every read 1 and they must immediately follow each other in the BAM file. STAR outputs alignments in this manner. In each pair of reads the mpos and mnext fields point to the mate alignment for that read (i.e. the one that immediately follows/precedes it).
UMI-tools requires that the BAM file is position sorted. How else can it be sure that it has seen all the reads mapping to a given location before deduplicating them. However, that means that when it marks a read for retention, it must also find the mate for that read, which it does using the mnext and mpos fields. However, the SAM format specification says that the mpos and mnext field should point to the primary alignment of the mate, not the one that might naturally be assumed to pair with it (i.e. the one that would come next in the STAR output). Different aligners also differ in how many times they will output read2 if it aligns to one location, but read1 aligns to two. STAR will output read2 twice, where as BWA will only output it once. Consider the following two situations:
STAR will output
where as BWA will output:
However, now consider:
STAR will output:
BWA will output:
This is a problem, because when UMI tools decides to output a pair (based on the information in read1), it adds read2 to a set of things to look for, once it has been seen, and output, then it removes it from the set. So each read2 is only ever output once.
It may be possible to add a counter, so that if two read1s point to the same read2, then UMI-tools knows to output it twice. Trouble is, what then happens if the input file is from BWA?
There are other problems with multimapped reads because you can get different answers to the question “is this read unique?” at different mapping locations, even though the read must either be from a unique biological molecule or be a PCR duplicate, and really, you’d like to remove a read if any of the alignments for its are non-unique, but that is not possible in the current UMI-tools framework.
It has been suggested that perhaps we should just disallow multimapping paired end sequencing, and say that it is Salmon/RSEMs job to deal with it properly (Salmon-alevin already has a very clever algorithm to do this in the single cell situation). This would mean we would effectively be preventing UMI-tools from being used with transcript based quantification tools.
Thanks! No worries! Maybe we should create a small issue on nf-core/rnaseq linking here so we are able to track it?
Hi @IanSudbery,
you can access the original and duplicated bamfiles here. They are aligned to the mm10 genome and
gencode.vM25.primary_assembly.annotation.gtfusing STAR and the nf-core RNA-Seq pipeline.There was some additional discussion about this with @drpatelh and @rob-p on the nf-core slack [1] [2].
Here’s the summary:
According to @rob-p, this issue is likely the reason why Salmon fails to produce reasonable counts on my data. Here’s the assumptions Salmon makes about the input files:
When running
samtools flagstaton the umi-deduplicated BAM file, we see that theread2s are missing.For comparison, this is the flagstat of the original BAM file as produced by STAR