chromap: Seemingly empty files trigger segmentation fault error

Using the pre-compiled version I get the following error:

./chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --SAM
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in SAM format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 4.05s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 7.85s.
Mapped all reads in 0.53s.
Number of reads: 0.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 0.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

The fastq files are not empty as the error message would suggest, nonetheless they are non-canonical HiC but 4C, actually, and they are derived from a rather unbalanced read length between mates, r1=28 vs r2=120. Could this be violating some assumptions of the software?

head r1.fq
@NS500648:585:HK55NAFX2:1:11101:9001:1025 1:N:0:ACAAGGTA
GCCACNACCTGTTCCTGTGTATCTGCTA
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE
@NS500648:585:HK55NAFX2:1:11101:13450:1033 1:N:0:ACAAGGTA
GCCACNACCTGTTCCTGTTGAGGATTTG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE
@NS500648:585:HK55NAFX2:1:11101:12961:1039 1:N:0:CGTCCCGT
GCCACNACCTGTTCCTGTTTTCTCTCTG
(metacell) [16:29:53 polivar@aquarius:~/src/chromap_l/chromap-80f75ed_x64-linux] (34512)$
head r2.fq
@NS500648:585:HK55NAFX2:1:11101:9001:1025 2:N:0:ACAAGGTA
NNNNNGCATGCAAAGTCAAAAAACACTNTCATGGTCTTATAATCTGCATTTATTTTTACCTAATNNNCCNNNNGACTCNNNTANNNNTCNTNCANTGATTCNNNTGNTTCCAANNCCNNN
+
#####EEEEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE###EE####E<EEE###A/####EE#E#AE#EEE/AA###AA#AAAAA/##<E###
@NS500648:585:HK55NAFX2:1:11101:13450:1033 2:N:0:ACAAGGTA
NNNTGTTATTGTTACATTTTTCTTACAGCCCTTTAACAATCTCATTTCATTTTCATCCGCTTTCGGGGCTGGGTCATGAAGGCCGCAGTCTTAGGTAAGAACTCCACATACATAGTCCCG
+
###AAAEEEEEEE/EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEAEEEAEEEEEEE/EEEAAEEEEAEEEEEEEEEEAEEEAEE<AEE<AAA<AAEA6<6AA<EEEAEEEE/
@NS500648:585:HK55NAFX2:1:11101:12961:1039 2:N:0:CGTCCCGT
NNNCAATGGAAGTAAGGGTCCAAACGCAGTTTATTAACAGCTAAGCCAAGCAGGCAAGGGTGAAACTTTGGCAAACGGAATTATGTGGGTAATCCAATAGCACAGTTGTAGTCACAGACT

sed -n '2~4p' r1.fq | wc -l
664948
sed -n '2~4p' r2.fq | wc -l
664948

Edit: The software runs well using the provided fastqs and test ref.

About this issue

Original URL
State: closed
Created 3 years ago
Comments: 15 (4 by maintainers)

Most upvoted comments

Thanks for the suggestion to change --min-num-seeds to 1. Unfortunately it also Segfaults. In appreciation of your time perhaps it is better to stop here for now. I will re-sequence the same libraries with longer read1 to see whether the length is the issue or rather the fact that it is not really HiC.

These data being 4C, read1 is always the same as a “one-vs-all” approach in which one profiles the contacts for a given viewpoint in the genome. This enrichment is mediated by PCR of the locus of interest (R1) and R2 provides the corresponding contact from the proximity ligation step.

I will update you when I get the new data.

I played a bit with other arguments. Here a report:

chromap --preset hic -x test.index -r ref.fa -1 r1.fq -2 r2.fq -o out.sam --min-read-length 28 --split-alignment --MAPQ-threshold 0 -e 100 --max-num-best-mappings 10 --min-num-seeds 1
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 100, min-num-seeds: 1, max-seed-frequency: 500,1000, max-num-best-mappings: 10, max-insert-size: 1000, MAPQ-threshold: 0, min-read-length: 28, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 1
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in pairs format.
Reference file: ref.fa
Index file: test.index
1th read 1 file: r1.fq
1th read 2 file: r2.fq
Output file: out.sam
Loaded all sequences successfully in 3.33s, number of sequences: 1924, number of bases: 1679214157.
Kmer size: 17, window size: 7.
Lookup table size: 177152077, occurrence table size: 309193718.
Loaded index successfully in 8.21s.
Mapped 500000 read pairs in 232.65s.
Mapped 164948 read pairs in 71.12s.
Mapped all reads in 304.79s.
Number of reads: 1329896.
Number of mapped reads: 0.
Number of uniquely mapped reads: 0.
Number of reads have multi-mappings: 0.
Number of candidates: 118718663.
Number of mappings: 0.
Number of uni-mappings: 0.
Number of multi-mappings: 0.
Segmentation fault

This is still using the conda build h9a82719_0 since I don’t think the version without the segfault bug has been release there.

tzeitim on Jun 22, 2021