vamb: ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

(vamb_env) -bash-4.1$ vamb --fasta mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta --jgi coverage_output/coverage_metabat2.tsv  --outdir vamb_output
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Here’s what the output of jgi_summarize_bam_contig_depths looks like:

(vamb_env) -bash-4.1$ head coverage_output/coverage_metabat2.tsv
contigName	contigLen	totalAvgDepth	sorted.bam	sorted.bam-var
NODE_1_length_581408_cov_10.671907	581408	17.0497	17.0497	24.4313
NODE_2_length_212490_cov_11.140151	212490	17.7493	17.7493	26.3056
NODE_3_length_56611_cov_10.039571	56611	16.0747	16.0747	24.7309
NODE_4_length_52215_cov_10.245380	52215	16.4059	16.4059	20.9325
NODE_5_length_49788_cov_11.464963	49788	18.376	18.376	28.3959
NODE_6_length_44487_cov_9.390124	44487	15.069	15.069	20.5564
NODE_7_length_41442_cov_10.399425	41442	16.6383	16.6384	22.4833
NODE_8_length_37801_cov_9.536534	37801	15.3226	15.3226	25.3435
NODE_9_length_28654_cov_10.767824	28654	17.234	17.234	22.4427

It has the right number of rows too (25729 lines minus 1 header line gives 25728, matching the number of FASTA records):

(vamb_env) -bash-4.1$ grep -c "^>" mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta
25728
(vamb_env) -bash-4.1$ wc -l coverage_output/coverage_metabat2.tsv
25729 coverage_output/coverage_metabat2.tsv
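
Matching counts do not show that the two files list the same contigs, so a name-level comparison can help narrow things down. Below is a minimal sketch in plain Python, not part of Vamb; the function names are made up for illustration, and it assumes the contig name is the first whitespace-delimited token of each FASTA header and the first tab-separated column of the depth file.

```python
import io

def fasta_names(handle):
    """Return contig names (first token of each '>' header) in file order."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names

def depth_names(handle):
    """Return the first column of a tab-separated depth file, skipping the header row."""
    next(handle, None)  # header: contigName, contigLen, ...
    return [line.split("\t", 1)[0].rstrip("\n") for line in handle if line.strip()]

def compare_names(fasta, depth):
    """Report names unique to either file, and whether the order is identical."""
    f, d = set(fasta), set(depth)
    return {
        "only_in_fasta": sorted(f - d),
        "only_in_depth": sorted(d - f),
        "same_order": fasta == depth,  # informational only
    }

# Tiny in-memory demo; real use would pass open() file handles instead.
demo_fasta = io.StringIO(">NODE_1 extra text\nACGT\n>NODE_2\nGGCC\n")
demo_depth = io.StringIO("contigName\tcontigLen\ttotalAvgDepth\nNODE_1\t4\t1.0\n")
print(compare_names(fasta_names(demo_fasta), depth_names(demo_depth)))
```

If both lists come back empty, the names agree and the mismatch has to come from somewhere else.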

Here’s the version:

(mage_env) -bash-4.1$ conda list | grep "vamb"
vamb                      3.0.2            py36hc5360cc_1    bioconda

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

Dear @HaraldBrolin

The error occurs because each contig needs both a TNF (tetranucleotide frequency, which is obtained from the FASTA file) and an RPKM (which is obtained from the JGI input file). To fix the problem, you need to remove the sequences in the FASTA file for which you don’t have entries in the JGI depths file.
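
One way to do that filtering is sketched below. This is a plain-Python illustration, not a Vamb utility: the function names are hypothetical, and it assumes the depths file is tab-separated with a header row and the contig name in the first column, and that the contig name is the first whitespace-delimited token of each FASTA header.

```python
import csv
import io

def read_depth_names(handle):
    """Return the set of contig names found in the first column of the depths file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (contigName, contigLen, ...)
    return {row[0] for row in reader if row}

def filter_fasta(fasta_handle, keep_names, out_handle):
    """Copy only FASTA records whose name is in keep_names.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    writing = False
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep_names
            if writing:
                kept += 1
            else:
                dropped += 1
        if writing:
            out_handle.write(line)
    return kept, dropped

# In-memory demo; with real data, pass handles from open() instead.
depths = io.StringIO("contigName\tcontigLen\ttotalAvgDepth\nc1\t8\t3.5\n")
fasta = io.StringIO(">c1\nACGTACGT\n>c2\nGGGG\n")
filtered = io.StringIO()
print(filter_fasta(fasta, read_depth_names(depths), filtered))
```

The returned counts make it easy to see how many sequences were dropped before rerunning Vamb on the filtered FASTA.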

The JGI file does not seem to be correctly formatted, either. It should look like this.

Sort of. It’s not stored, but the final bin is named e.g. sample1_1, depending on the names of your contigs. That is, given the name of an output bin, you can always get the sample and the original bin.

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, their contigs are binned together. We then simply split each bin by sample: literally, we just take all the contigs in bin 1 from sample 1 and put them in bin_1_1, the contigs in bin 1 from sample 2 in bin_1_2, etc. So there is no reduction of redundancy; you get the same genomes duplicated if they are present in multiple samples.
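
The splitting step described above can be sketched in a few lines. This is an illustration of the idea only, not Vamb's actual code: the function name is hypothetical, the contig naming scheme (sample name, then a separator, here "C", then the contig identifier) is an assumption, and the output names use a sample-then-bin form like the sample1_1 example mentioned earlier.

```python
from collections import defaultdict

def split_bins_by_sample(bins, separator="C"):
    """Split each bin into one sub-bin per sample.

    bins maps a bin name to a list of contig names. Each contig name is
    assumed to start with its sample name followed by `separator`.
    """
    split = defaultdict(list)
    for bin_name, contigs in bins.items():
        for contig in contigs:
            sample, sep, _rest = contig.partition(separator)
            if not sep:
                raise ValueError(f"separator {separator!r} not found in {contig!r}")
            # No deduplication happens here: a genome present in two samples
            # ends up in two separate sub-bins.
            split[f"{sample}_{bin_name}"].append(contig)
    return dict(split)

# Hypothetical input: bin "1" holds contigs from samples S1 and S2.
example = {"1": ["S1C1", "S2C5", "S1C7"], "2": ["S2C9"]}
print(split_bins_by_sample(example))
```

Because the sample is read back out of each contig name, nothing extra needs to be stored to trace a sub-bin to its sample and original bin.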