ggsashimi: KeyError: 'transcript_id' with Ensembl human annotation

From @sridhar0605 originally posted in #1:

When using Homo_sapiens.GRCh37.75.gtf from Ensembl as the reference, I see this error:

Using default tag: latest
latest: Pulling from guigolab/ggsashimi
915665fee719: Pull complete
1a0814f59c8e: Pull complete
b3b71680ed5d: Pull complete
1c3c8afa6ada: Pull complete
2fbeb903a5b4: Pull complete
Digest: sha256:82590f821978568e948ad4861ce009fcb26e7543263bea9d7b78c17667f8d675
Status: Downloaded newer image for guigolab/ggsashimi:latest
Traceback (most recent call last):
  File "/sashimi-plot.py", line 592, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "/sashimi-plot.py", line 278, in read_gtf
    transcript_id = d["transcript_id"]
KeyError: 'transcript_id'

A few lines from the GTF:

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1	pseudogene	gene	11869	14412	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1	processed_transcript	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
1	processed_transcript	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
1	processed_transcript	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
1	processed_transcript	exon	13221	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";

About this issue

State: closed
Created 7 years ago
Comments: 21 (1 by maintainers)

Commits related to this issue

Get transcript_id from transcript/exon only, handle error when absent - fix #2, #46 — committed to guigolab/ggsashimi by dgarrimar 3 years ago
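
The behaviour described in that commit message can be sketched roughly as follows. This is a minimal illustration and not the actual ggsashimi code: `parse_attributes` and `collect_transcript_ids` are hypothetical helper names, and the parser assumes Ensembl-style `key "value";` attributes in the ninth column.

```python
import re

# Hypothetical helper, not part of ggsashimi.
# Assumes quoted attributes of the form: key "value";
# Unquoted attributes (e.g. GENCODE's "level 2;") are ignored by this sketch.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def collect_transcript_ids(lines):
    """Yield (feature, transcript_id) for transcript and exon rows only."""
    for line in lines:
        if line.startswith("#"):
            continue  # header lines such as "#!genome-build ..."
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature = cols[2]
        if feature not in ("transcript", "exon"):
            continue  # e.g. gene rows, which have no transcript_id in the lines above
        attrs = parse_attributes(cols[8])
        try:
            transcript_id = attrs["transcript_id"]
        except KeyError:
            # Report the problem clearly instead of a bare KeyError
            raise ValueError(
                "GTF %s row has no transcript_id attribute: %s" % (feature, line.strip())
            )
        yield feature, transcript_id
```

With the GTF lines quoted in the report, the `gene` row is skipped and the `transcript` and `exon` rows each yield `ENST00000456328`.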

Most upvoted comments

When I used an Ensembl GTF file, I got the same error. In the end, I found that a GTF containing only transcript and exon rows works well:

awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf
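
For anyone without awk at hand, the same filter can be sketched in Python. `filter_gtf` is a hypothetical helper name, and the file names in the usage comment are placeholders. Like the awk command, this drops header lines too, since they have no third column.

```python
def filter_gtf(lines, keep=("transcript", "exon")):
    """Yield only the lines whose third tab-separated column is in `keep`."""
    for line in lines:
        cols = line.split("\t")
        if len(cols) > 2 and cols[2] in keep:
            yield line


# Usage sketch (placeholder file names):
# with open("annotation.gtf") as src, open("annotation.filtered.gtf", "w") as dst:
#     dst.writelines(filter_gtf(src))
```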

ChaoTang-SCU on Apr 3, 2018

I’ve fixed this issue for GENCODE IDs with a try statement, and the script still works as originally intended. This is easier than editing a GTF file.

Replace:

transcript_id = d["transcript_id"]

with the try statement below:

try:
    transcript_id = d["transcript_id"]
except KeyError:
    transcript_id = d["gene_id"]

KrotosBenjamin on Sep 1, 2020

Dear Lea @bellenger-l,

You could try using GENCODE annotation files. The release corresponding to mouse Ensembl 83 is GENCODE M8. Alternatively, could you provide some lines of your GTF so we can check what the problem is? As stated in previous comments, make sure that the file follows the proper format. In particular, the transcript_id attribute should be present in every line of the GTF.

dgarrimar on Feb 21, 2019

Just wanted to add that the GENCODE GTF runs into the same issue.

ManavalanG on Sep 6, 2018