bcbio-nextgen: RefSeq GRCh38.p12 reference genome fails to build

I’m hitting some errors when attempting to index the current RefSeq Homo sapiens reference genome GCF_000001405.38_GRCh38.p12 using bcbio_setup_genome.py. Is RefSeq currently supported in bcbio?

Here’s a link to the relevant files on the NCBI FTP server: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12

  • GCF_000001405.38_GRCh38.p12_genomic.fna.gz
  • GCF_000001405.38_GRCh38.p12_genomic.gff.gz
  • GCF_000001405.38_GRCh38.p12_genomic.gtf.gz

Other genomes from Ensembl/GenBank and GENCODE (e.g. GRCh38 v29) install as expected on my virtual machine, so it appears that I’m running into some RefSeq-specific issues.

bowtie2, for example, errors out here, and I see this using either the GTF or GFF3 as input:

Total time for backward call to driver() for mirror index: 00:34:05
2019-04-05 16:14:23,509 - INFO - Committing changes: 3698000 features
2019-04-05 16:14:24,047 - INFO - Populating features table and first-order relations: 3698296 features
2019-04-05 16:14:24,047 - INFO - Creating relations(parent) index
2019-04-05 16:14:28,948 - INFO - Creating relations(child) index
2019-04-05 16:14:34,035 - INFO - Inferring transcript extents and writing to tempfile
2019-04-05 16:14:59,707 - INFO - Importing inferred features into db
2019-04-05 16:15:27,587 - INFO - Committing changes
2019-04-05 16:15:27,704 - INFO - Creating relations(parent) index
2019-04-05 16:15:32,459 - INFO - Creating relations(child) index
2019-04-05 16:15:37,302 - INFO - Creating features(featuretype) index
2019-04-05 16:15:40,852 - INFO - Creating features (seqid, start, end) index
2019-04-05 16:15:45,800 - INFO - Creating features (seqid, start, end, strand) index
2019-04-05 16:15:50,957 - INFO - Running ANALYSE features
invalid gffGroup detected on line: NC_000002.12 Curated Genomic stop_codon      88857361
        88857363        0.000000        -       0       gene_id "IGKC"; transcript_id "unknown_transcript_1";
GFF/GTF group unknown_transcript_1 on NC_000001.11-, this line is on NC_000002.12-, all group members must be on same seq and strand
1 errors
Creating gffutils database for /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf.
'transcript' or 'gene' entries not found, so inferring their extent. This can be very slow.
Writing out merged GTF file to /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf.
Traceback (most recent call last):
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 840, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 298, in main
    gtf_to_refflat(gtf_file)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 439, in gtf_to_refflat
    genepred = gtf_to_genepred(gtf)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 431, in gtf_to_genepred
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.genePred' returned non-zero exit status 255
Creating directories using /data00/bcbio/stable/genomes as the base.
Genomes will be installed into /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12.
Installed genome as /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.fa.
Installed GTF as /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/rnaseq/ref-transcripts.gtf.
Creating the seq index.
[localhost] local: samtools faidx GCF_000001405.38_GRCh38.p12.fa
[localhost] local: picard -Xms500m -Xmx3500m CreateSequenceDictionary REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.fa OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.dict
Creating the bowtie2 index.
[localhost] local: mkdir /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/../bowtie2
[localhost] local: ln -sf ../seq/GCF_000001405.38_GRCh38.p12.fa /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/bowtie2/GCF_000001405.38_GRCh38.p12.fa
Traceback (most recent call last):
  File "/data00/bcbio/tools/bin/bcbio_setup_genome.py", line 312, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/data00/bcbio/stable/anaconda/bin/python /data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py --cores 28 --genome-dir /data00/bcbio/stable/genomes --gtf /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/rnaseq/ref-transcripts.gtf Hsapiens GCF_000001405.38_GRCh38.p12' returned non-zero exit status 1

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 29 (17 by maintainers)

Most upvoted comments

For the record, I’ve also been having trouble while running bcbio_setup_genome.py on a RefSeq GTF file. I’m guessing that the general RefSeq pipeline setup renders their GTF files unusable for bcbio by default, so I’m describing below what worked for me in case it helps anyone else.

I’ve also had to remove trailing spaces and any lines containing transcript_id "unknown_transcript_1", as described above. Then, in order to fix the error The attribute string seems to contain mismatched quotes (https://github.com/bcbio/bcbio-nextgen/issues/2760#issuecomment-481375203), I used the following perl command to replace any semicolons with & signs when enclosed within quotes inside the GTF annotations (as suggested here):

perl -e 'open(FILE, "original_file.gtf"); while(<FILE>){$_ =~ s/([^\"]);/\1\&/g; print $_;}' > edited_file.gtf

This got the GTF file usable with bcbio_setup_genome.py.

Thanks, @mjsteinbaugh, no need to pass them on, I think if you just add to your GTF cleaning something to strip off trailing spaces it should fix the issue you were seeing.