deepvariant: DeepVariant fails to recognize PacBio BAM index file
**Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md**: Yes.

**Describe the issue:** I followed the PACBIO example and also added the flags from this issue.
This error indicates that DeepVariant is not able to find an index file for the BAM file, but the index file is there; see:
```
(base) ✔ /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam [master L|…5]
22:08 $ ls
GFX.bam  GFX.bam.pbi  GFX_hg19.bam  GFX_hg19.bam.pbi  readlength.txt  tmp
```
I also re-indexed the file using pbindex from pbbam. As one can see from the `ls` output, I also tried realigning the BAM file to another reference with pbmm2.
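If I read the htslib error correctly, `idx_find_and_load` only looks for a `.bai`/`.csi` companion index, not the `.pbi` that pbindex writes, so a minimal pysam sketch like the following (file name as in my setup) should show whether that is the problem:

```python
# Create an htslib-style .bai index (equivalent to `samtools index GFX.bam`)
# and verify the region query that make_examples ultimately relies on.
import pysam

bam = "GFX.bam"
pysam.index(bam)  # writes GFX.bam.bai next to the BAM

with pysam.AlignmentFile(bam, "rb") as af:
    assert af.has_index(), "htslib still cannot find a .bai/.csi index"
    n = sum(1 for _ in af.fetch(af.references[0], 0, 100_000))
    print(f"{n} reads fetched from the first 100 kb of {af.references[0]}")
```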
**Setup**
- Operating system: Linux Mint 21.1 x86_64
- Kernel: 5.15.0-69-generic
- DeepVariant version: 1.5.0
- Installation method (Docker, built from source, etc.): Docker image run through Singularity
- Singularity version: singularity-ce version 3.11.3
- Type of data (sequencing instrument, reference genome, anything special that is unlike the case studies?): PacBio CCS data aligned to the GRCh37 reference genome (hg19.fa). Nothing special about the file to report.
```
(base) ✔ /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam [master L|…5]
20:46 $ samtools flagstat GFX.bam
940551 + 0 in total (QC-passed reads + QC-failed reads)
881297 + 0 primary
0 + 0 secondary
59254 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
940551 + 0 mapped (100.00% : N/A)
881297 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```
**Steps to reproduce:**
- Command: Follow the PACBIO example and install the Singularity version mentioned above, then run:

```
singularity exec --bind /usr/lib/locale/,/media/nils/nils_ssd_01/Genomics_prac_guide/reference/unsorted/,/media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/ docker://google/deepvariant:1.5.0 /opt/deepvariant/bin/run_deepvariant --model_type PACBIO --ref /media/nils/nils_ssd_01/Genomics_prac_guide/reference/unsorted/hg19.fa --reads /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/GFX.bam --output_vcf /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/GFX.vcf.gz --num_shards 22 --dry_run=false --make_examples_extra_args='sort_by_haplotypes=false,parse_sam_aux_fields=false'
```
- Error trace: The error message, including the pipeline call, is:
```
[E::idx_find_and_load] Could not retrieve index file for '/media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam'
2023-06-20 22:48:40.272832: W third_party/nucleus/io/sam_reader.cc:131] Unknown tag pb: in header line, ignoring: @HD VN:1.6 SO:coordinate pb:5.0.0
2023-06-20 22:48:40.272881: W third_party/nucleus/io/sam_reader.cc:174] Unknown tag BC: in RG line, ignoring: @RG ID:408781da/26--26 PL:PACBIO DS:READTYPE=CCS;BINDINGKIT=101-894-200;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000;BarcodeFile=m64023e_230515_162401.barcodes.fasta;BarcodeHash=86d73e586a6d3ede0295785b51105eea;BarcodeCount=96;BarcodeMode=Symmetric;BarcodeQuality=Score LB:Pool_18_20_GFX0455704_GFX PU:m64023e_230515_162401 SM:GFX PM:SEQUELII BC:TGACTGTAGCGAGTAT CM:S/P5-C2/5.0-8M
2023-06-20 22:48:40.272889: W third_party/nucleus/io/sam_reader.cc:174] Unknown tag CM: in RG line, ignoring: @RG ID:408781da/26--26 PL:PACBIO DS:READTYPE=CCS;BINDINGKIT=101-894-200;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000;BarcodeFile=m64023e_230515_162401.barcodes.fasta;BarcodeHash=86d73e586a6d3ede0295785b51105eea;BarcodeCount=96;BarcodeMode=Symmetric;BarcodeQuality=Score LB:Pool_18_20_GFX0455704_GFX PU:m64023e_230515_162401 SM:GFX PM:SEQUELII BC:TGACTGTAGCGAGTAT CM:S/P5-C2/5.0-8M
I0620 22:48:40.273113 140449601988416 genomics_reader.py:222] Reading /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam with NativeSamReader
I0620 22:48:40.273825 140449601988416 make_examples_core.py:257] Task 8/22: Writing examples to /tmp/tmpwjk24y8t/make_examples.tfrecord-00008-of-00022.gz
I0620 22:48:40.274013 140449601988416 make_examples_core.py:257] Task 8/22: Overhead for preparing inputs: 0 seconds
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 196, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 186, in main
    make_examples_core.make_examples_runner(options)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 2183, in make_examples_runner
    runtimes) = region_processor.process(region)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1285, in process
    sample_reads = self.region_reads_norealign(
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1376, in region_reads_norealign
    reads = itertools.chain(reads, sam_reader.query(region))
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/third_party/nucleus/io/genomics_reader.py", line 247, in query
    return self._reader.query(region)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/third_party/nucleus/io/sam.py", line 250, in query
    return self._reader.query(region)
ValueError: FAILED_PRECONDITION: Cannot query without an index
parallel: This job failed:
/opt/deepvariant/bin/make_examples --mode calling --ref /media/nils/nils_ssd_01/Genomics_prac_guide/reference/GRCh37/hs37d5.fa --reads /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam --examples /tmp/tmpwjk24y8t/make_examples.tfrecord@22.gz --add_hp_channel --alt_aligned_pileup diff_channels --max_reads_per_partition 600 --min_mapping_quality 1 --noparse_sam_aux_fields --partition_size 25000 --phase_reads --pileup_image_width 199 --norealign_reads --nosort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels 0.12 --task 1
```
**Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md**: Yes, the quick start works. This issue also happens when I try to run the pipeline with Docker only.
**Any additional context:**
Hi @Npaffen,
To expand a bit on my previous explanation: one way to view a variant coming out of a neural network is through the idea of preserving $information$ $propagation$. You saw the previous description of how the different channels get encoded, but let's start with a simpler version. Say you have a one-line matrix with the following columns: [table omitted]. This could denote a simplified version of your read-base representation, where the middle column denotes the variant and the last column the channel it represents. Now I want to create new types of data from this, which will help me identify unique areas of patterns within it. This could take the form of transforming the values into ranges where differences among columns are easier to detect. One such transformation is dividing all the values by 10 and labeling that channel 2. In this transformation, 10 acts as the kernel I described previously, and the output is a new feature map (a transformed matrix that helps with detecting unique features based on the numerical representation). Now the data would look like this: [table omitted]
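As a toy Python version of that divide-by-10 transformation (all values here are invented for illustration):

```python
# One "read row" with the variant marked in the middle, plus a single fixed
# kernel (divide by 10) that produces a new feature map labeled channel 2.
import numpy as np

channel_1 = np.array([30, 30, 30, 0, 30, 30, 30], dtype=float)  # hypothetical values
kernel = 0.1                         # the divide-by-10 transformation
channel_2 = channel_1 * kernel       # new feature map (channel 2)

print(channel_1)  # [30. 30. 30.  0. 30. 30. 30.]
print(channel_2)  # [ 3.  3.  3.  0.  3.  3.  3.]
```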
Now imagine I create different transformations of these rows, to expand on specific areas among these values where intriguing patterns might emerge. Suppose I create 5 different transformations, so I then have 5 channels with multiple copies of each row, in order to have a fuller dataset that mimics the number of reads. This data is multi-dimensional, as it contains different values of X, and can be pretty hard to interpret, but we can collapse these differences into a 2D representation using t-SNE plots to visualize them. For example, if I do that to this dataset, I get the following plot: [t-SNE plot omitted]
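To make the collapse concrete, here is a rough, self-contained sketch (the channel values are invented for illustration):

```python
# Build 5 transformed copies ("channels") of the toy row, repeat each to mimic
# multiple reads, then embed the whole dataset into 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
base = np.array([30, 30, 30, 0, 30, 30, 30], dtype=float)
scales = [1.0, 0.1, 0.5, 2.0, 0.01]        # 5 hypothetical "kernels"

X, labels = [], []
for ch, s in enumerate(scales):
    for _ in range(20):                    # copies of each row, with jitter
        X.append(base * s + rng.normal(0.0, 0.1, base.shape))
        labels.append(ch)
X = np.asarray(X)

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2): one 2-D point per row, colorable by `labels`
```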
This helps me understand which channels (feature maps) might be similar or different. Now, if we look at the heatmap for the original data, it would look something like this (notice the variant in the middle): [heatmap omitted]
If I generate a t-SNE plot of this, I see something like the following, using colors to denote the different channels (feature maps): [t-SNE plot omitted]
Now comes the fun part! Let's have different matrices (like the ones that generated the new channels above) that will identify interesting features which might expose one type of genotype or another with confidence. Imagine that instead of dividing by 10, I find the values (matrices) that best help separate the data in a BAM file I know should have specific variants at specific loci. I want to maximize that precision to improve recall. Now you notice that the original data and the transformations (feature maps) are linked, preserving this $propagation$ $of$ $information$, as the flow of information is enabled through these sets of transformations. An interesting thing then begins to emerge as you move up the layers of transformation. For example, early in the neural network's set of transformations you will see patterns like this: [early-layer feature maps omitted]

You might notice an explosion of features, with no specific patterns. These early steps generate a large variety of features, giving the later layers selection power to use as input, which helps with the separation into distinct patterns that map to the different classes of genotypes confidently. For example, you can see distinct patterns forming as it reaches the later stages: [later-layer feature maps omitted]
Once a pattern like the following has been achieved, one can proceed to test each genotype's representation of the variant: [feature map omitted]

We want to see which genotype the set of patterns (the feature map above) maximizes, which will indicate the genotype present with a specific maximal probability.
First we test for $homozygous$ $reference$: [figure omitted]

Next we test for $heterozygous$: [figure omitted]

Finally we test for $homozygous$ $alternate$: [figure omitted]

Now we can see there is a significant correlation with a heterozygous variant call.
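In code, that last step is just picking the genotype class with the maximal probability; a minimal sketch with made-up logits:

```python
# Softmax over the three genotype classes; the logits are invented here, while
# in DeepVariant they come from the CNN applied to the pileup channels.
import numpy as np

genotypes = ["homozygous reference", "heterozygous", "homozygous alternate"]
logits = np.array([-2.1, 3.4, 0.2])   # hypothetical network outputs

probs = np.exp(logits - logits.max())
probs /= probs.sum()

for g, p in zip(genotypes, probs):
    print(f"{g}: {p:.3f}")
print("call:", genotypes[int(np.argmax(probs))])  # -> heterozygous
```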
So to call a variant site's genotype, you now have the power to traverse the whole neural network, through a set of transformations (which you can tweak) to identify hidden patterns in the data, as it is all linked from start (read encoding) to finish (genotype identification). This ability is enabled through the preservation of the propagation of information (which is deeply and directly linked), so that there is something to optimize upon. That's why you don't want to throw anything away: as @AndrewCarroll and @pichuan mentioned, it helps you inspect and tweak the selection power and criteria, since it is all interconnected.
Hope it helps, Paul
Yeah, thanks to all here in this thread for the productive discussion. This helped me a lot to understand how DeepVariant works! Good luck with the project!
Thank you @AndrewCarroll and @pichuan for the clarification. The calibration makes sense, and could be intriguing for inspecting DNN resiliency. Since both the PacBio and WGS models share the same underlying Inception V3 network architecture, a point of natural comparability would be the logits kernel across all three genotypes: [kernel visualizations omitted]

Given the visual similarity, these were confirmed via Euclidean distance (0.9931127, 0.8543731, and 1.052052, respectively). This indicates the feature set might exhibit strong similarity for interpretation.
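The comparison itself is a one-liner per genotype; a sketch with randomly generated stand-ins for the real checkpoint weights (the 2048-wide logits-layer shape is my assumption based on Inception V3):

```python
# Euclidean distance between the per-genotype logits-kernel columns of the
# two models. Real values would come from the WGS and PACBIO checkpoints.
import numpy as np

rng1, rng2 = np.random.default_rng(1), np.random.default_rng(2)
wgs_kernel = rng1.normal(size=(2048, 3))                 # stand-in weights
pacbio_kernel = wgs_kernel + rng2.normal(scale=0.02, size=(2048, 3))

for i, g in enumerate(["hom-ref", "het", "hom-alt"]):
    d = np.linalg.norm(wgs_kernel[:, i] - pacbio_kernel[:, i])
    print(f"{g}: {d:.4f}")
```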
Looking at one network (PacBio), it might be possible to confirm calibration by testing for network resiliency. Via perturbation analysis it should be possible to get insight into each channel's response under perturbation, and the channels' binary interactions under such conditions. Keeping the variant unchanged within a window on each side (to preserve the call), each channel's vulnerability to perturbation can be tested. This resulted in the following perturbation response, where $c_*$ denotes a channel and $i_{*,*}$ represents a binary interaction between two channels: [perturbation-response figure omitted]

The above can be mapped into a network of interactions among the channels: [interaction-network figure omitted]
Based on the above mapping, by testing well-interacting channels through probabilistic value updates (within DeepVariant-acceptable values), it might be possible to check for shifts in genotype mimicking a Mendelian violation. Selecting `base_quality` and staying within DeepVariant's minimum acceptable value, random sampling with replacement was performed in the window outside the variant region. A shift in genotype was achieved, giving a measure of network resiliency. Other, more strongly-connected channels could provide more aggressive shifts in genotype. This can be offset by restrictions in convolutional motifs within the network, or be shifted to the code via threshold limitations in `make_examples`.

Thank you and Happy 4th of July! Paul
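A rough sketch of that resampling step (the pileup shape and channel index are assumptions, and `call_genotype` is a placeholder for running the trained network, not a real DeepVariant API):

```python
# Resample one channel (with replacement) outside a protected window around
# the variant, leaving the variant region itself untouched.
import numpy as np

def perturb_channel(pileup, channel, center, halfwidth, rng):
    """Randomly resample one channel's values outside the protected window."""
    out = pileup.copy()
    n_reads, width, _ = out.shape
    cols = [c for c in range(width) if abs(c - center) > halfwidth]
    pool = out[:, cols, channel].ravel()          # empirical value pool
    out[:, cols, channel] = rng.choice(pool, size=(n_reads, len(cols)))
    return out

rng = np.random.default_rng(0)
pileup = rng.integers(0, 60, size=(100, 199, 7)).astype(float)  # reads x width x channels
perturbed = perturb_channel(pileup, channel=1, center=99, halfwidth=10, rng=rng)
# before = call_genotype(pileup); after = call_genotype(perturbed)  # compare calls
```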
Hi @Npaffen
Our typical recommendation for this would be to use the DeepVariant WGS model to call variants with gVCF output on the Illumina samples, the DeepVariant PACBIO model to call variants with gVCF output on the PacBio sample, and then use GLnexus to jointly genotype the three samples together.

The variant call probabilities are well calibrated in each model individually, so GLnexus can operate on them.
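A minimal sketch of the joint-genotyping step, assuming the per-sample gVCFs already exist (the file names are placeholders, and you should check the GLnexus docs for the preset config best suited to a mixed Illumina/PacBio cohort):

```python
# Merge and jointly genotype the per-sample gVCFs with GLnexus, which writes
# an uncompressed BCF to stdout.
import subprocess

gvcfs = ["illumina_1.g.vcf.gz", "illumina_2.g.vcf.gz", "pacbio.g.vcf.gz"]

with open("cohort.bcf", "wb") as out:
    subprocess.run(
        ["glnexus_cli", "--config", "DeepVariantWGS", *gvcfs],
        stdout=out,
        check=True,
    )
```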
Hi @pgrosu, the hybrid model you mentioned combines PacBio and Illumina reads of the same individual. It is a DeepVariant model, not a DeepTrio model.

With the hybrid model, you can achieve better accuracy. This model was first developed as part of our submission to https://precision.fda.gov/challenges/10/results.

However, it is not a DeepTrio model and unfortunately does not fit the use case that @Npaffen described.
@AndrewCarroll and @pgrosu, thanks for your views on things, and @pgrosu especially for your fantastic explanation of the model!

From a data-scientist perspective this makes total sense. I fully agree that one should always gather as much data as possible, since one can always remove data that turns out not to be useful afterwards. My intention was not to say that one does not need those data, or that you should not use the information those HomRef sites emerged from. I talked to some of my colleagues yesterday, and they agreed that they would expect a variant caller to produce variant calls and the information/statistics related to them, with all further information being opt-in, since they would have no need for it. But my team could just be the minority here, so simply offering an opt-out for this behavior would be highly recommended!
Hi @Npaffen @pgrosu
My sentence was written imprecisely. I was thinking of the last word in the hyphenated term (the "reference" in "non-reference") and trying to clarify the reference calls, but this made things more confusing rather than less.

- Reference calls: 0/0 or ./.
- Variant calls: 0/1, 1/1, 1/2, 0/2, etc.
Hi @Npaffen
Yes, to phase variants, the reads are assigned a haplotag value. Briefly, in this process a set of potential variants is scored with heuristics (no neural network) on the likelihood that they are heterozygous variants. A cluster of such variants forms a candidate seed for a haplotype. The evidence from multiple reads across multiple positions is used to identify the putative variants on that haplotype, and then reads are scored based on whether they fall into one haplotype, the other, or cannot be phased. Because this haplotagging uses information from much longer stretches and more candidate variants than the individual process of variant calling, it has the advantage of a broader set of information.
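A toy illustration of that read-scoring idea (a conceptual sketch only, not DeepVariant's actual implementation):

```python
# Score each read by how many candidate het sites it matches on each
# haplotype; tag it with the winning haplotype, or None on a tie.

# Phased alleles at candidate sites: position -> (hap1_base, hap2_base)
HAPLOTYPES = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "G")}

def haplotag(read_bases):
    """read_bases: position -> observed base for one read."""
    score = [0, 0]
    for pos, base in read_bases.items():
        h = HAPLOTYPES.get(pos)
        if h is None or h[0] == h[1]:   # only het sites are informative
            continue
        score[0] += base == h[0]
        score[1] += base == h[1]
    if score[0] == score[1]:
        return None                     # cannot be phased
    return 1 if score[0] > score[1] else 2

print(haplotag({100: "A", 250: "C"}))   # -> 1
print(haplotag({100: "G", 250: "C"}))   # -> None (tie)
print(haplotag({250: "T", 400: "G"}))   # -> 2
```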
This haplotagging is used to populate the information in the “haplotype channel” which is one of the inputs for DeepVariant long read data. We wrote a blog describing this channel and its impact.
Note that this process is only used to provide information to the neural network for consideration; the neural network is able to learn when this channel is or is not reliable based on genome context, coverage, etc. The network's call on the genotype is what finally goes into a variant.

As a result, the haplotag is not used as input to generate the non-ref blocks of the gVCF, and since the final variants called still come from the neural network, the definition of a variant remains the same: a position with an ALT allele that receives a non-reference call (i.e., not 0/0 or ./.).
We are currently working on a deeper description of the phasing logic used in DeepVariant, which may help understand or reproduce the haplotag method more easily.
Please let me know if anything in the explanation is unclear or can be elaborated further.