ggsashimi: KeyError: 'transcript_id' with Ensembl human annotation
From @sridhar0605 originally posted in #1:
When using Homo_sapiens.GRCh37.75.gtf from Ensembl as the reference, I see this error:
Using default tag: latest
latest: Pulling from guigolab/ggsashimi
915665fee719: Pull complete
1a0814f59c8e: Pull complete
b3b71680ed5d: Pull complete
1c3c8afa6ada: Pull complete
2fbeb903a5b4: Pull complete
Digest: sha256:82590f821978568e948ad4861ce009fcb26e7543263bea9d7b78c17667f8d675
Status: Downloaded newer image for guigolab/ggsashimi:latest
Traceback (most recent call last):
  File "/sashimi-plot.py", line 592, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "/sashimi-plot.py", line 278, in read_gtf
    transcript_id = d["transcript_id"]
KeyError: 'transcript_id'
A few lines from the GTF:
#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1 processed_transcript transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
1 processed_transcript exon 12613 12721 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
1 processed_transcript exon 13221 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";
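The traceback is consistent with the first data line above: the gene row has no transcript_id attribute, so a plain dictionary lookup on it fails. The following is a minimal sketch in Python, assuming the attribute column is parsed into a dict roughly as the traceback suggests. parse_attributes is a hypothetical helper written for illustration, not the function used in sashimi-plot.py.

```python
def parse_attributes(field):
    """Parse the 9th GTF column ('key "value"; ...') into a dict."""
    attrs = {}
    for item in field.strip().strip(";").split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first space only: key, then the quoted value.
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return attrs


# Attribute columns copied (shortened) from the gene and exon rows above.
gene_attrs = parse_attributes(
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_source "ensembl_havana"; gene_biotype "pseudogene";'
)
exon_attrs = parse_attributes(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)

print("transcript_id" in gene_attrs)  # False: the gene row lacks the attribute
print(exon_attrs["transcript_id"])    # ENST00000456328
```

Indexing gene_attrs["transcript_id"] directly would raise the same KeyError reported above.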
About this issue
- State: closed
- Created 7 years ago
- Comments: 21 (1 by maintainers)
Commits related to this issue
- Get transcript_id from transcript/exon only, handle error when absent - fix #2, #46 — committed to guigolab/ggsashimi by dgarrimar 3 years ago
I got the same error when I used an Ensembl GTF file. In the end, I found that a GTF containing only transcript and exon rows works well.
awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf
I've fixed this issue with gencodeID, and with a try statement the script still works as originally intended. This is easier than editing a GTF file.
Replace:
transcript_id = d["transcript_id"]
with the try statement below.
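The try statement itself did not survive in this copy of the thread. As a rough sketch only, and not the commenter's actual patch, one way to guard the lookup is to catch the KeyError and skip records without a transcript_id. The helper name get_transcript_id and the choice to return None are assumptions made for illustration.

```python
def get_transcript_id(d):
    """Return d["transcript_id"], or None when the attribute is absent."""
    try:
        return d["transcript_id"]
    except KeyError:
        # Rows such as Ensembl gene rows carry no transcript_id;
        # returning None lets the caller skip them.
        return None


# Hypothetical parsed attribute dicts for a gene row and an exon row.
records = [
    {"gene_id": "ENSG00000223972"},
    {"gene_id": "ENSG00000223972", "transcript_id": "ENST00000456328"},
]

kept = [tid for tid in map(get_transcript_id, records) if tid is not None]
print(kept)  # ['ENST00000456328']
```

Inside a parsing loop, a None result would typically translate into a continue. Judging by its commit message, the fix later committed to the repository is related in spirit: it reads transcript_id from transcript and exon rows only and handles the case where it is absent.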
Dear Lea @bellenger-l,
You could try using GENCODE annotation files. The release corresponding to mouse Ensembl 83 is GENCODE M8. Alternatively, could you provide some lines of your GTF so we can check what the problem is? As stated in previous comments, make sure that the file follows the proper format. In particular, the transcript_id attribute should be present in every line of the GTF.
Just wanted to add that a GENCODE GTF runs into the same issue.
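Since the error shows up with annotation files from more than one source, it may help to check which feature types in a given GTF lack transcript_id before running the tool. The sketch below assumes a tab-separated, 9-column GTF. The function name missing_transcript_id and the sample rows (shortened from the lines quoted earlier in the thread) are illustrative only.

```python
from collections import Counter


def missing_transcript_id(lines):
    """Count, per feature type (column 3), rows whose attributes lack transcript_id."""
    missing = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF row
        feature, attributes = fields[2], fields[8]
        # Collect the attribute keys, ignoring their values.
        keys = {item.split()[0] for item in attributes.split(";") if item.strip()}
        if "transcript_id" not in keys:
            missing[feature] += 1
    return missing


sample = [
    "#!genome-build GRCh37.p13\n",
    '1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
    '1\tprocessed_transcript\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";\n',
]

print(missing_transcript_id(sample))  # Counter({'gene': 1})
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.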