CAMISIM: NanoSim - KeyError: sequence_id not found in mapping

Hi,

Thanks for developing CAMISIM! I am currently trying to simulate data with Illumina and Nanopore reads using the de novo community design, on the CAMISIM master branch. With the provided test data (CAMISIM/defaults/genomes/) and the provided mapping files, I got it running using ART and NanoSim (from the https://github.com/abremges fork).

Then I tried to use the genomes/, metadata.tsv and genome_to_id.tsv data from the 2nd CAMI Toy Mouse Gut Dataset as a basis for generating new data. For Illumina data this worked smoothly. For Nanopore data, however, I get the following errors, first after simulating the reads and then in the final anonymization step:

...
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] 270448.0     22
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] SysCmd: '/home-link/qeakr01/development/NanoSim/src/simulator.py linear -n 22 -r /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa -o /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads/270448.0 -c tools/nanosim_profile/ecoli --seed 2998104995'
2021-07-09 16:17:40 INFO: [GenomePreparation 89018136530] Simulating reads from 270448.0: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa'
2021-07-09 16:31:15 INFO: [GenomePreparation 89018136530] Simulating reads finished
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
...

and

...
2021-07-09 16:44:30 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2021-07-09 16:44:30 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpgY5Fcq
2021-07-09 16:44:30 INFO: [FastaAnonymizer] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads'
2021-07-09 16:44:30 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 2944938622045856594) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpLJDFBS' -s
2021-07-09 16:48:06 INFO: [MetadataReader 1434768039] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/genome_locations.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 31538633047] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/meta_data.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 14979527976] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B'
2021-07-09 16:48:08 ERROR: [Validator 31115876351] sequence_id 'NZ-JH590862.1' not found in mapping

2021-07-09 16:48:08 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 639, in _anonymize_data
    file_path_genome_locations, file_path_metadata, file_path_anonymous_mapping_tmp, stream_output
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 370, in gs_read_mapping
    stream_output, dict_anonymous_to_read_id, dict_sequence_to_genome_id, dict_genome_id_to_tax_id)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 244, in write_gs_read_mapping
    raise KeyError(msg)
KeyError: "sequence_id 'NZ-JH590862.1' not found in mapping\n"


2021-07-09 16:48:08 ERROR: [MetagenomeSimulationPipeline] "sequence_id 'NZ-JH590862.1' not found in mapping\n" in line 117
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM

Do you have any idea what could cause this issue or how I could proceed to fix this?
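
One thing that can be checked from the log alone: RefSeq accessions are normally written with an underscore after the prefix (NZ_...), while the ID in the error contains a hyphen (NZ-JH590862.1). Whether that is the actual cause is not confirmed in this thread. The following is a minimal diagnostic sketch, not part of CAMISIM; the function names and the example FASTA header are hypothetical, since the real headers of the genome file are not shown here.

```python
def fasta_ids(lines):
    """Yield the sequence ID (first token) of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def _normalize(seq_id):
    # Treat '-' and '_' as equivalent for the comparison only.
    return seq_id.replace("-", "_")


def explain_missing_id(missing_id, known_ids):
    """Describe how an ID from the error relates to the IDs in the FASTA."""
    known = set(known_ids)
    if missing_id in known:
        return "exact match"
    candidates = sorted(k for k in known if _normalize(k) == _normalize(missing_id))
    if candidates:
        return "differs only in '-' vs '_' from: " + ", ".join(candidates)
    return "no similar ID found"


# Hypothetical header; replace with the lines of the real genome FASTA.
example_fasta = [">NZ_JH590862.1 hypothetical header", "ACGT"]
print(explain_missing_id("NZ-JH590862.1", fasta_ids(example_fasta)))
```

To run it on a real file, pass an open file handle (for example `open(path)`) to `fasta_ids` instead of the example list.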

sim_nanosim.test2.log sim_config.nanosim.test2.ini.txt

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17

Most upvoted comments

Ah, yes, the size is a problem. NanoSim takes the number of reads as input, whereas CAMISIM takes the dataset size, so CAMISIM has to convert the size into a number of reads. The number of reads needed for a given size depends on the average read length, which is specific to the trained model. I updated the model in use but did not update the average read length. That this went wrong suggests the conversion should be computed automatically from the chosen model.
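
The size-to-read-count conversion described above can be sketched as follows. This is only an illustration of the arithmetic, not CAMISIM's actual code: the function names are hypothetical, sizes are given in base pairs, and the read lengths in the example are made-up numbers rather than values from any NanoSim model.

```python
import math


def mean_read_length(read_lengths):
    """Average read length (bp) from a sample of reads for one model."""
    lengths = list(read_lengths)
    if not lengths:
        raise ValueError("need at least one read length")
    return sum(lengths) / len(lengths)


def reads_for_target_size(target_size_bp, mean_length_bp):
    """Number of reads needed to reach roughly target_size_bp."""
    if mean_length_bp <= 0:
        raise ValueError("mean read length must be positive")
    if target_size_bp <= 0:
        return 0
    # Round up so the simulated data is not smaller than requested.
    return math.ceil(target_size_bp / mean_length_bp)


# Made-up lengths standing in for an old and an updated model.
old_model_mean = mean_read_length([4000, 5000, 6000])
new_model_mean = mean_read_length([7000, 8000, 9000])

print(reads_for_target_size(1_000_000, old_model_mean))
print(reads_for_target_size(1_000_000, new_model_mean))
```

With the same target size, the two models call for different read counts, so a read count computed from a stale average length yields a dataset of the wrong size.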

Also, thank you for the log (and the information about the non-anonymous gold standards). I hope to find the problems soon, but I will be on vacation from this Friday until the 16th of August.

Although I think your results are probably usable if 2.5.0 finished without errors, I would use the latest NanoSim 3.0 if it works. The model used in 1.2.0 is very old, so it probably does not reflect recent chemistry well.