UMI-tools: Problems with dedup output and RSEM

Hi,

I’m using umi_tools dedup to remove PCR duplicates from an alignment to the transcriptome with STAR. After the deduplication, when I run RSEM, it seems that there are some reads from the pairs that are lost since the program exits with the following error: Read ST-E00114:1178:HFL75CCX2:7:1101:1610:55297_TTGCCATCTC: The adjacent two lines do not represent the two mates of a paired-end read! (RSEM assumes the two mates of a paired-end read should be adjacent)

The ran the command with --paired --multimapping-detection-method=NH --unpaired-reads=discard --chimeric-pairs=discard --unmapped_reads=discard

I have seen that this problem was already discussed in #384, but there is not an option on how to solve this. Do you have any idea on how to solve this issue or a workaround that could work for this case?

Thank you very much

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 65 (29 by maintainers)

Most upvoted comments

As for release schedule: I can’t see any reason not to include this script in the next release. We have a set of things we’d like to tidy up before the next release, but its probably not a million miles off.

Yeah, I thought that it might be coming from $options.args. There is still a potential theoretical problem with situations where read1 is alignmed but read2 isn’t, so I’ll see what I can think of for that.

I’m definitely open to suggestions for a more elegant fix, so let me know if you have one.

Salmon/RSEM expect that BAM files are sorted such that each read is immediately followed by its mate. This means that there must be exactly read 2 for every read 1 and they must immediately follow each other in the BAM file. STAR outputs alignments in this manner. In each pair of reads the mpos and mnext fields point to the mate alignment for that read (i.e. the one that immediately follows/precedes it).

UMI-tools requires that the BAM file is position sorted. How else can it be sure that it has seen all the reads mapping to a given location before deduplicating them. However, that means that when it marks a read for retention, it must also find the mate for that read, which it does using the mnext and mpos fields. However, the SAM format specification says that the mpos and mnext field should point to the primary alignment of the mate, not the one that might naturally be assumed to pair with it (i.e. the one that would come next in the STAR output). Different aligners also differ in how many times they will output read2 if it aligns to one location, but read1 aligns to two. STAR will output read2 twice, where as BWA will only output it once. Consider the following two situations:

read1(primary @ pos 100)            read2 (primary @ pos 120)
read1(secondary @ pos 200)

STAR will output

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)
**read2(secondary @ pos 120, mate @ pos 200)** <--- this alignment is redundant

where as BWA will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)

However, now consider:

read1(primary @ pos 100)            read2 (primary @ pos 120)
read1(secondary @ pos 200)        read2 (secondary @ pos 220)

STAR will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos220) <--- invalid SAM, mpos points to secondary alignment
read2(secondary @ pos 220, mate @ pos 200) 

BWA will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)  <--- This read is now valid SAM, but points to the "wrong" mate
read2(secondary @ pos 220, mate @ pos 200) <--- This read is never reffered to as the mate of anything, and so will never be output by UMI-tools

This is a problem, because when UMI tools decides to output a pair (based on the information in read1), it adds read2 to a set of things to look for, once it has been seen, and output, then it removes it from the set. So each read2 is only ever output once.

It may be possible to add a counter, so that if two read1s point to the same read2, then UMI-tools knows to output it twice. Trouble is, what then happens if the input file is from BWA?

There are other problems with multimapped reads because you can get different answers to the question “is this read unique?” at different mapping locations, even though the read must either be from a unique biological molecule or be a PCR duplicate, and really, you’d like to remove a read if any of the alignments for its are non-unique, but that is not possible in the current UMI-tools framework.

It has been suggested that perhaps we should just disallow multimapping paired end sequencing, and say that it is Salmon/RSEMs job to deal with it properly (Salmon-alevin already has a very clever algorithm to do this in the single cell situation). This would mean we would effectively be preventing UMI-tools from being used with transcript based quantification tools.

Thanks! No worries! Maybe we should create a small issue on nf-core/rnaseq linking here so we are able to track it?

Hi @IanSudbery,

you can access the original and duplicated bamfiles here. They are aligned to the mm10 genome and gencode.vM25.primary_assembly.annotation.gtf using STAR and the nf-core RNA-Seq pipeline.

There was some additional discussion about this with @drpatelh and @rob-p on the nf-core slack [1] [2].

Here’s the summary:

According to @rob-p, this issue is likely the reason why Salmon fails to produce reasonable counts on my data. Here’s the assumptions Salmon makes about the input files:

Salmon expects that for a paired-end fragment, the alignment records for all alignments of this fragment are consecutive in the file, and that the alignment for end2 is directly after the corresponding alignment for end1.

[Salmon] is aware of orphan alignments (where only one end aligns), it needs to account for this because e.g. implied fragment size can only be computed when you have a proper paired-end alignment, but it then still expects the unmapped record for the second end of the pair to remain in the file (just like RSEM).

When running samtools flagstat on the umi-deduplicated BAM file, we see that the read2s are missing.

(test_salmon) sturm@zeus [SSH] test_umitools % samtools flagstat POOLAM1-40.umi_dedup.transcriptome.sorted.bam 
5768256 + 0 in total (QC-passed reads + QC-failed reads)
2660439 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
5768256 + 0 mapped (100.00% : N/A)
3107817 + 0 paired in sequencing
3107817 + 0 read1
0 + 0 read2
3107817 + 0 properly paired (100.00% : N/A)
3107817 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
For comparison, this is the flagstat of the original BAM file as produced by STAR
(test_salmon) sturm@zeus [1] [SSH] test_umitools % samtools flagstat POOLAM1-40.Aligned.toTranscriptome.out.bam 
27624680 + 0 in total (QC-passed reads + QC-failed reads)
12712284 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
27624680 + 0 mapped (100.00% : N/A)
14912396 + 0 paired in sequencing
7456198 + 0 read1
7456198 + 0 read2
14912396 + 0 properly paired (100.00% : N/A)
14912396 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)