pyranges: Range join (also overlap, intersect) fails until I recreate the PyRanges from its .df
I encountered this today, and I’m completely baffled by it.
I have an object test_gr that looks like this:
+--------------+-----------+-----------+----------------------+-------+
| Chromosome | Start | End | Value | +4 |
| (category) | (int32) | (int32) | (object) | ... |
|--------------+-----------+-----------+----------------------+-------|
| 1 | 161332203 | 161332344 | POOR_MAPPING_QUALITY | ... |
+--------------+-----------+-----------+----------------------+-------+
Unstranded PyRanges object has 1 rows and 8 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
4 hidden columns: sample_name, vendor, experiment, promised_exon
I also have an object test_exons that I want to join against it, shown below:
+--------------+-----------+-----------+
| Chromosome | Start | End |
| (category) | (int32) | (int32) |
|--------------+-----------+-----------|
| 1 | 15764898 | 15764905 |
| 1 | 15764946 | 15765013 |
| 1 | 15766782 | 15766901 |
| 1 | 15766966 | 15766973 |
| ... | ... | ... |
| X | 154002863 | 154002997 |
| X | 154003456 | 154003561 |
| X | 154004448 | 154004612 |
| X | 154005060 | 154005156 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 2,177 rows and 3 columns from 23 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
If I run test_gr.join(test_exons), I get an empty PyRanges, which is incorrect because the two objects do overlap.
If I instead run pr.PyRanges(test_gr.df).join(test_exons), which presumably uses the same data, the join finds the overlap:
+--------------+-----------+-----------+----------------------+-------+
| Chromosome | Start | End | Value | +6 |
| (category) | (int32) | (int32) | (object) | ... |
|--------------+-----------+-----------+----------------------+-------|
| 1 | 161332203 | 161332344 | POOR_MAPPING_QUALITY | ... |
+--------------+-----------+-----------+----------------------+-------+
Unstranded PyRanges object has 1 rows and 10 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
6 hidden columns: sample_name, vendor, experiment, promised_exon, Start_b, ... (+ 1 more.)
How can this be? I’ve tried versions all the way back to 0.0.79, and the problem remains.
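One possible explanation, suggested by the related commits listed further down ("Chromosome and Strand should be str"), is that the chromosome labels in the two objects have different Python types, for example the integer 1 on one side and the string "1" on the other. The sketch below is not pyranges code. It is a minimal pure-Python illustration, under that assumption, of how a join that pairs intervals per chromosome key can come back empty even though the printed tables look identical. The helper names and the exon coordinates are hypothetical.

```python
from collections import defaultdict


def group_by_chromosome(intervals):
    """Group (chromosome, start, end) tuples into a dict keyed by chromosome label."""
    groups = defaultdict(list)
    for chrom, start, end in intervals:
        groups[chrom].append((start, end))
    return dict(groups)


def naive_join(left, right):
    """Return overlapping pairs, comparing only chromosomes whose keys match exactly."""
    left_groups = group_by_chromosome(left)
    right_groups = group_by_chromosome(right)
    hits = []
    for chrom in left_groups.keys() & right_groups.keys():
        for l_start, l_end in left_groups[chrom]:
            for r_start, r_end in right_groups[chrom]:
                if l_start < r_end and r_start < l_end:  # half-open interval overlap
                    hits.append((chrom, (l_start, l_end), (r_start, r_end)))
    return hits


# Hypothetical data: the exon coordinates are invented for illustration.
query_int = [(1, 161332203, 161332344)]    # chromosome label is the integer 1
exons_str = [("1", 161332300, 161332400)]  # chromosome label is the string "1"

# 1 and "1" are different dict keys, so no chromosome is shared and nothing is joined.
mismatched = naive_join(query_int, exons_str)

# After normalizing the labels to str, the same intervals overlap.
query_str = [(str(chrom), start, end) for chrom, start, end in query_int]
matched = naive_join(query_str, exons_str)
```

If this is what happens here, the printed "1" in both tables would hide the difference, because the integer and the string render identically.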
test_gr was created as follows (filter values removed):
test_df = failed_loci_per_assay_df[
    (failed_loci_per_assay_df.vendor == "XXX")
    & (failed_loci_per_assay_df.experiment == "YYY")
    & (failed_loci_per_assay_df.promised_exon == "ZZZ")
].copy()
test_gr = pr.PyRanges(test_df.head(1))
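A possible workaround, assuming the cause is a non-string Chromosome column in the source DataFrame (an assumption based on the related commit titles, not something the output above proves), is to cast Chromosome and Strand to str before building the PyRanges. The helper name with_string_chromosomes and the one-row frame are hypothetical. The pr.PyRanges call is left as a comment so the snippet needs only pandas.

```python
import pandas as pd


def with_string_chromosomes(df):
    """Return a copy of df with Chromosome (and Strand, if present) cast to str."""
    out = df.copy()
    for col in ("Chromosome", "Strand"):
        if col in out.columns:
            out[col] = out[col].astype(str)
    return out


# Hypothetical one-row frame whose Chromosome column holds the integer 1.
test_df = pd.DataFrame(
    {"Chromosome": [1], "Start": [161332203], "End": [161332344]}
)
clean_df = with_string_chromosomes(test_df)

# With pyranges installed, the object would then be built from the cleaned frame:
# test_gr = pr.PyRanges(clean_df)
```

Checking test_df.Chromosome.dtype, or the type of its first value, before and after the cast is a quick way to see whether this assumption applies to the real data.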
Thanks very much!
About this issue
- State: open
- Created 4 years ago
- Comments: 20 (10 by maintainers)
Commits related to this issue
- #159 Chromosome and Strand should be str - ensure these fields are str before converting to category - minor black formatting - use 'not in' — committed to cfriedline/pyranges by cfriedline-sema4 4 years ago
- #159 Chromosome and Strand should be str (#162) - ensure these fields are str before converting to category - minor black formatting - use 'not in' Co-authored-by: Chris Friedline <chris.friedli... — committed to pyranges/pyranges by cfriedline 4 years ago
Thank you! I will try to look at it tomorrow!
On Wednesday, November 4, 2020, Chris Friedline notifications@github.com wrote: