tskit: `write_vcf` returning position 0

hey team-

working with @mufernando on some stdpopsim coding we stumbled upon a bug with write_vcf. using the default position_transform value, write_vcf can return position 0 which isn’t allowed under the vcf spec i believe. e.g., this tree sequence

print(ts.tables.sites)
╔═════╤════════╤═══════════════╤════════╗
║id   │position│ancestral_state│metadata║
╠═════╪════════╪═══════════════╪════════╣
║0    │       0│              C│        ║
║1    │     534│              C│        ║
║2    │     991│              T│        ║
║3    │    1188│              G│        ║
║4    │    1281│              C│        ║
║5    │    1301│              A│        ║

will return the following vcf

##fileformat=VCFv4.2
##source=tskit 0.5.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=5000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6	tsk_7	tsk_8	tsk_9	tsk_10	tsk_11	tsk_12	tsk_13	tsk_14	tsk_15	tsk_16	tsk_17	tsk_18	tsk_19
1	0	0	C	G	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
1	534	1	C	G	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0	0|0
1
.
.
.

i’m guessing we should not have been using the default position_transform here but instead legacy? either way i think this is a bug

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Comments: 29 (21 by maintainers)

Most upvoted comments

Yes, @mufernando and @bhaller are correct in that tskit’s interpretation of genome coordinates is zero-based (it allows a zero coordinate position). However, we do actively store coordinates in there that are 1-based, and in reality there’s nothing to be gained from forcing people to substract 1 from all their coordinates.

Adding 1 to all coords in VCF output would definitely 100% break current and expected future usage, so we’re not going to do that.

Ben’s suggestion above of erroring out when ts.sites_position[0] == 0 is good I think.

actually we found the bug because a downstream program threw an error on the VCF

The issue with it being the users’ responsibility is that the user producing the VCF and the user consuming it are often different. By doing this we force the user to acknowledge that the VCF might not play nice downstream.

I agree that breaking changes suck, but at least this one is an easy fix.

yes, that is where I was trying to get to @bhaller!

I’m pretty sure the ARGs we infer using tskit used 1-based coordinates.

Yes, all the tree sequences I’m inferring use 1-based and adding one on VCF output would be very unhelpful.

Here’s a strawman suggestion for fixing this, although it is a hard breaking change:

write_vcf errors out if a 0 position is seen with this message:

A variant position of 0 was found in the tree sequence, this is not fully compliant with the VCF spec. If you still wish to write the VCF please use the "allow_zero_position" argument to write_vcf. Alternatively, you can increment all the positions by one using "position_transform = lambda x: 1 + x" or coerce the zero to one with "position_transform = lambda x: np.fmax(1, x)".

If that’s too much breakage we could warn for a while before turning it on.

(As a follow up, we could easily add an option to msprime to disallow sites with position=0, which I’d be happy to discuss over there)

I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.

I agree with Peter here. Msprime generates sites that can have a 0 position, so you can say it uses 0-based coordinates. Sc2ts uses 1-based coordinates from real data, so it is 1-based. I’m pretty sure the ARGs we infer using tskit used 1-based coordinates. All use tskit.

Adding 1 to all coordinates is incorrect behaviour, as it changes the actual coordinates for everything. So, if we did this, then a VCF output from a sc2ts inferred ARG would have wrong coordinates.

I think this is more correctly seen as an issue with simulators generating 0 or 1 based coordinates, and tskit shouldn’t have an opinion about that. I accept that VCF doesn’t allow for a coordinate of 0 in the spec, so we should probably emit a warning if there is a site with this coord.

I’m going to echo @bhaller’s opinion.

I agree it is confusing to have the positions in the VCF be off by one when compared to positions in the tree sequence, but this is something everyone dealing with bioinformatics is aware of. I think it is a lot more confusing to have a VCF that is 0-based, specially if we want tree sequences to be adopted more widely.

I don’t fully understand the ramifications of changing the default behavior, but now we have more evidence that the current behavior is messing with other people’s downstream codes.

The more I ponder this, though, the less it seems like a doc bug. It’s saving out non-spec-compliant VCF files, and that is causing problems with at least two software packages downstream of it. Seems like maybe a fix is in order, probably doing the same thing that SLiM does (which would also be nice for interoperability, of course).

That’s a good suggestion - provide the solution in the docs.