bcbio-nextgen: vcf2db "sqlalchemy.exc.InterfaceError (sqlite3.InterfaceError) Error binding parameter 72 - probably unsupported type." - caused by a specific variant record in the vcf file

Hello,

I was running bcbio v1.1.0 on a set of ~90 samples and I noticed 3 of those samples failed at the same step (vcf2db) with the same exact error. Upon further inspection I noticed that this error is caused by the same variant in all these 3 samples, and removing that variant from these vcf files seems to bypass the error (i.e. vcf2db finished successfully). I am attaching the error log, and a slice of the vcf file including the variant here (line 171 in the attached vcf file).

Command used: vcf2db.py C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz C3_4-gatk-haplotype-nomultiallelic.ped C3_4-gatk-haplotype.db

BCBIO LOG: bcbio-nextgen.log

CONFIG FILE: C3_4.yaml.txt

VCF FILE: (variant causing the error is on line 171) C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.txt

PED FILE: C3_4-gatk-haplotype-nomultiallelic.ped.txt

Could you please help me with this error?

Thank you, Teja.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 43 (32 by maintainers)

Commits related to this issue

Most upvoted comments

Hello everyone!

Thanks for the discussion, it helped me to learn a lot more about the issue!

  1. Gnomad 2.1 from the gnomad website: https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr10.vcf.bgz is indeed clean, sorted and normalized vcf file, see bcftools stats for chr10:
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	0
SN	0	number of records:	665745
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	619617
SN	0	number of MNPs:	0
SN	0	number of indels:	46128
SN	0	number of others:	0
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0

There is no issue with duplicated chr10 variant in it:

tabix gnomad.exomes.r2.1.sites.chr10.vcf.bgz chr10:50095648-50095648
<EMPTY OUTPUT>
  1. However, gnomad 2.1 from gnomad weights 60G, it contains many tags and VEP annotation. Ensembl provides 5G gnomad without VEP annotation. Cloudbiolinux GGD recipe pulls gnomad_exome from ensembl: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg38/gnomad_exome.yaml

Ensembl’s gnomad_exome is v2.0.1, it is not sorted, not decomposed, and not normalized, and it has an issue with duplicated variants (bcftools stats for chr10):

# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	0
SN	0	number of records:	583100
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	554086
SN	0	number of MNPs:	66
SN	0	number of indels:	40054
SN	0	number of others:	338
SN	0	number of multiallelic sites:	66250
SN	0	number of multiallelic SNP sites:	52822

Issue with duplicated variant:

tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.vcf.gz chr10:50095648-50095648
chr10	50095648	.	A	G	7379.2	RF;AC0	AC=0;AF=0;AN=105416;BaseQRankSum=0;ClippingRankSum=0.196;DP=2665618;FS=74.3;InbreedingCoeff=0.0209;MQ=23.95;MQRankSum=0;QD=6.76;ReadPosRankSum=0.358;SOR=4.733;VQSLOD=-221.9;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0;AF_AMR=0;AF_ASJ=0;AF_EAS=0;AF_FIN=0;AF_NFE=0;AF_OTH=0;AF_SAS=0;AF_Male=0;AF_Female=0;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=0.000918153;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=0.5;AS_RF=0.0766183;AS_FilterStatus=RF|AC0
chr10	50095648	rs201672329	A	G	4.59591e+06	PASS	AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS
  1. Hopefully, Ensembl will soon sync their vcf with gnomad2.1 and we will not need processing the vcf in cloudbiolinux/bcbio installation anymore. In the meanwhile, we need that, at least for users who are running bcbio for the first time. They just need a smooth pipeline run, i.e. to call, annotate and prioritize variants, not going into details about little issues of gnomad 2.0.1 reprocessed by ensembl.

  2. One of the duplicated variants has PASS in the filter tag and the other has not. That gave me an idea to prioritize PASS variants rather than first variants when selecting unique ones.

  3. The processing: chr name conversion - sort - PASS - decompose - normalize - uniq works:

bcftools stats for chr10
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	0
SN	0	number of records:	538598
SN	0	number of no-ALTs:	537
SN	0	number of SNPs:	503850
SN	0	number of MNPs:	59
SN	0	number of indels:	33917
SN	0	number of others:	235
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0

Duplicated variant in chr10 is gone:

tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.processed.vcf.gz chr10:50095648-50095648
chr10	50095648	rs201672329	A	G	4.59591e+06	PASS	AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS
  1. I’ve updated the GGD for gnomad_exome in hg38 according to this test. Please let me know if you have any other suggestions. https://github.com/chapmanb/cloudbiolinux/pull/287

Sergey