tskit: "invalid start byte" error accessing variant.alleles in pyodide/jupyterlite

When I try to run the following in a JupyterLite notebook (using Chrome 112.0.5615.49 on OS X):

import msprime
import tskit

print("msprime v:", msprime.__version__, ", tskit v:", tskit.__version__)

ts = msprime.sim_mutations(msprime.sim_ancestry(10, sequence_length=10, random_seed=1), rate=1, random_seed=1)
next(ts.variants()).alleles

I get

msprime v: 1.2.0 , tskit v: 0.5.4
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[4], line 7
      4 print("msprime v:", msprime.__version__, ", tskit v:", tskit.__version__)
      6 ts = msprime.sim_mutations(msprime.sim_ancestry(10, sequence_length=10, random_seed=1), rate=1, random_seed=1)
----> 7 next(ts.variants()).alleles

File /lib/python3.11/site-packages/tskit/genotypes.py:151, in Variant.alleles(self)
    143 @property
    144 def alleles(self) -> tuple[str | None, ...]:
    145     """
    146     A tuple of the allelic values which samples can possess at the current
    147     site. Unless an encoding of alleles is specified when creating this
    148     variant instance, the first element of this tuple is always the site's
    149     ancestral state.
    150     """
--> 151     return self._ll_variant.alleles

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 4: invalid start byte

About this issue

  • State: closed
  • Created a year ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

Turns out the problem is that we don’t cast the length of the string to Py_ssize_t, which is an easy fix.
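To illustrate how a wrong length can produce this kind of error, here is a minimal Python sketch. It is not the tskit C code: the buffer contents and both lengths are made up for illustration, and the only assumption is that the decoder gets handed more bytes than the allele actually contains.

```python
# Hypothetical memory layout: a one-byte allele "A" followed by
# unrelated bytes that were never meant to be part of the string.
buffer = b"ACGT\xf8\x7f"

correct_length = 1  # the real allele length
wrong_length = 6    # assumed garbage length, e.g. from a mis-sized integer

# With the right length the decode is fine.
print(buffer[:correct_length].decode("utf-8"))

# With the wrong length the decoder runs into a byte that can never
# start a UTF-8 sequence and raises UnicodeDecodeError.
caught = None
try:
    buffer[:wrong_length].decode("utf-8")
except UnicodeDecodeError as err:
    caught = err
    print(err)
```

The reported position is simply the offset of the first byte that is not valid UTF-8, so it says more about what happens to sit in memory after the allele than about the allele itself.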

Yeah, if I do

>>> ts = msprime.sim_mutations(msprime.sim_ancestry(10, sequence_length=10, random_seed=1), rate=1, random_seed=1)
>>> next(ts.variants()).alleles

all is good, but put those identical lines in a for loop body and it fails:

>>> for i in range(1):
...     ts = msprime.sim_mutations(msprime.sim_ancestry(10, sequence_length=10, random_seed=1), rate=1, random_seed=1)
...     next(ts.variants()).alleles

And I get the error.

Yeah, the length was correct.

@hyanwong I can’t replicate this using the Pyodide version at https://pyodide.org/en/stable/console.html, which is 3.11.2 (main, May 3 2023, 04:00:05) (Pyodide 0.23.2), a more recent version than the 3.11.2 (main, Mar 30 2023, 21:37:59) (Pyodide 0.23.0) on the notebook demo site.

I can’t identify the upstream commit that fixed this, but it looks like it was fixed. I was already certain that it wasn’t our code, but this confirms it.

A quick fix might be to use the y# format code, which returns bytes instead of str, and then do the UTF-8 decode on the Python side.
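A rough sketch of what the Python-side decode could look like, assuming the low-level variant returned a tuple of bytes objects (with None for missing data). The helper name decode_alleles and the sample input are hypothetical and not part of tskit:

```python
from typing import Optional, Sequence, Tuple


def decode_alleles(
    raw_alleles: Sequence[Optional[bytes]],
) -> Tuple[Optional[str], ...]:
    """Decode each bytes allele as UTF-8, passing None through unchanged."""
    return tuple(
        None if allele is None else allele.decode("utf-8")
        for allele in raw_alleles
    )


# Made-up stand-in for what a bytes-returning low-level call might give.
raw = (b"A", b"T", None)
decoded = decode_alleles(raw)
print(decoded)
```

Doing the decode in Python keeps the C side to a plain byte copy, so any length or encoding problem would surface as an ordinary Python exception on a bytes object that can be inspected.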