tools-iuc: mothur.cluster.split hangs compute nodes due to an uncaught error
I found several compute nodes stuck, with messages like these in the syslog:
[Wed May 12 10:23:20 2021] mothur[890441]: segfault at 2047c70116 ip 0000000000806cf8 sp 00007ffed82d5870 error 4 in mothur[400000+eab000]
[Wed May 12 10:23:20 2021] mothur[890456]: segfault at 168 ip 0000000000806cf8 sp 00007ffed82d5870 error 4 in mothur[400000+eab000]
[Wed May 12 10:23:20 2021] Code: 66 90 48 85 db 74 73 48 8d 53 08 c7 43 08 00 00 00 00 48 c7 43 10 00 00 00 00 48 c7 43 28 00 00 00 00 48 89 53 18 48 89 53 20 <48> 8b 75 10 48 85 f6 74 47 48 89 df e8 c7 20 c3 ff 48 89 43 10 48
[Wed May 12 10:23:20 2021] Code: 66 90 48 85 db 74 73 48 8d 53 08 c7 43 08 00 00 00 00 48 c7 43 10 00 00 00 00 48 c7 43 28 00 00 00 00 48 89 53 18 48 89 53 20 <48> 8b 75 10 48 85 f6 74 47 48 89 df e8 c7 20 c3 ff 48 89 43 10 48
[Wed May 12 10:23:20 2021] SLUB: Unable to allocate memory on node -1, gfp=0x6000c0(GFP_KERNEL)
[Wed May 12 10:23:20 2021] cache: task_struct(18113:condor_var_lib_condor_execute_slot1_5@vgcnbwc-worker-c125m425-2069.novalocal), object size: 6080, buffer size: 6080, default order: 3, min order: 1
[Wed May 12 10:23:20 2021] node 0: slabs: 85995, objs: 429743, free: 0
All of the affected nodes were running mothur.cluster.split 1.39.5, and this is its stdout:
mothur v.1.39.5
Last updated: 3/20/2017
by
Patrick D. Schloss
Department of Microbiology & Immunology
University of Michigan
http://www.mothur.org
When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
Distributed under the GNU General Public License
Type 'help()' for information on the commands that are available
For questions and analysis support, please visit our forum at https://www.mothur.org/forum
Type 'quit()' to exit program
se,cluster=true,runsensspec=true,processors=8),iters=100,precision=100,large=fal
Using 8 processors.
Splitting the file...
/******************************************/
Running command: dist.seqs(fasta=splitby.fasta.dat.0.temp, processors=8, cutoff=0.03)
Using 8 processors.
/******************************************/
[ERROR]: your sequences are not the same length, aborting.
It took 1 seconds to split the distance file.
[ERROR]: std::bad_allocRAM used: 0.00388718Gigabytes . Total Ram: 211.266Gigabytes.
has occurred in the ClusterSplitCommand class function createProcesses. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are running our 32bit version, your memory usage is limited to 4G. If you have more than 4G of RAM and are running a 64bit OS, using our 64bit version may resolve your issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are unable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.
[ERROR]: std::bad_allocRAM used: 0.00382233Gigabytes . Total Ram: 211.266Gigabytes.
And this is an extract from the FASTA file:
>HE855366.1.<1.>570
TCTGTGCTTATTCGTATGGAATTAGCTGGTCCGGGAGTTCAAGTTTTAGGTGGAAATCAT
CAATTATATAATGTTATAGTTACAGGTCACGCTTTTATAATGATTTTTTTTATGGTTATG
CCTGTTCTAATGGGTGGTTATGGTAATTGGTTTGTTCCTATTATGATAGGAGCTCCTGAT
ATGGCTTTTCCTAGAATGAATAATATAAGTTTTTGGTTATTACCACCTTCTTTAATTTTG
TTATTGAGTTCTACATTGGTAGAAATAGGTGTTGGTACTGGTTGGACCGTGTATCCTCCG
TTAAGTAGTATCTCTGGACATCCTGGAGGCGCAGTTGATTTAGGTATATTTAGTTTGCAT
GTAGCAGGTGCTTCCAGTATCTTAGGCGCTATTAATTTTATAACAACTATTTTTAATATG
AGAGTGCCAGGCATGACAATGCACAGAATACCTCTCTTTGTTTGGGCTGTTTTAATAACT
GCTTTTTTACTTCTTTTATCATTACCTGTTTTTGCTGGTGCTATTACTATGTTATTAACT
GATAGAAATTTTAATACAAGTTTTTTTGAC
>AB009419.1.<1.>1059
GGTACATTATATATACTTTTTGGAATAATATCAGGTATTATAGGTACAACTTTGTCTGTT
CTTATAAGGATGGAATTAGCAGGCCCAGGTGTCCAAGTTTTAGGAGGTAATCACCAATTA
TATAACGTTATTATTACTGGTCATGCTTTTATAATGATATTTTTCATGGTAATGCCAGTA
TTGATTGGAGGTTACGGTAACTGGTTTATTCCTATTATGATAGGTGCACCTGATATGAGT
TTCCCAAGAATGAATAATATAAGCTTTTGGTTACTACCACCATCTTTAATTTTGTTATTA
AGCTCTACTCTTGTTGAAGTTGGTGTTGGCACTGGCTGGACAGTATACCCTCCCTTAAGC
TCTATCTCTGGTCATCCCGGTGCAGCAGTTGATTTAGGTATTTTCAGTCTTCATATTGCA
GGTGCATCTAGTATTTTAGGTTCTATTAATTTTATAACTACTATTTTTAATATGAGAGGT
CCTGGAATGACTATGCATAGAATACCTTTGTTTGTTTGGGCTGTTCTTATAACTGCTTTT
TTACTTGTGCTGTCACTTCCTGTATTTGCTGGGGCAATCACTATGCTTCTAACCGATCGT
AATTTTAATACAAGCTTTTTTGAAGCAGCAGGTGGTGGAGACCCTGTTTTATACCAACAT
TTGTTTTGGTTTTTTGGTCATCCGGAAGTTTATATCTTGATATTACCTGCTTTCGGAATT
ATTAGTCATATAACTTCTACTTTTTCTAGAAAACCAGTTTTTGGTTTTATAGGTATGGTA
TATGCCATGTTGAGTATAGGTCTATTAGGTTTTATTGTTTGGGCACATCATATGTATACT
GTCGGTATGGATATTGACACTAGAGCTTACTTTACAGCAGCTACTATGATTATAGCTGTC
CCAACTGGAATAAAAATATTTAGTTGGATAGCGACTATGTGGGGAGGCTCTATTTATCTA
AAAACCCCAATGGTTTTTGCTTTAGGCTTTATATGCTTGTTTACAATTGGAGGGTTATCT
GGTATTATGTTATCGAATGGTGCTTTGGATATAGCCTTT
>EU651892.1.933.2531
ATGTCAAATTTTTTAAATCGCTGGATTTTTTCGACAAATCATAAAGATATTGGTACATTA
TATCTAATTTTTGCAATTTTTGCGGGAGTTGTAGGTACTTTTTTATCGGTTTTAATTCGA
TTGGAATTAGCTGGGCCTGGCGTTCAAATATTAGGGGGTAACCACCAATTATATAACGTA
ATTATTACAGCTCATGCCTTTGTGATGATTTTTTTTATGGTGATGCCTGCACTAATTGGA
GGTTTCGCAAACTGGTTTGTTCCTATTATGATAGGTGCTCCAGATATGGCTTTTCCTCGT
TTAAATAACATTTCTTTCTGGTTATTAATACCTGCCTTCGTTTTATTATTAAGTTCATCA
TTCGTAGAAACTGGTGCGGGTACTGGCTGGACAGTGTACCCACCGTTAAGTAGTATAAGT
GGGCACCCTGGTGGATCTGTGGATTTAGCTATATTTAGCCTTCACGTTGCAGGGGCCTCA
AGTATTTTAGGTGCTTGTAATTTTATTACAACAATTCTTAATATGCGAGCACCAGGGATG
ACATTACACCGATTGCCACTTTTTTGCTGGGCAGTATTAATTACTGCGGTTTTATTAGTA
CTATCACTACCAGTATTTGCAGGGGCGATAACGATGTTGCTTACAGATAGAAATTTTAAT
ACGGCATTTTTTGATGCTAGTCTTGGCGGTGACCCAGTTCTTTATGAACATCTTTTTTGG
TTCTTTGGGCATCCTGAAGTTTATATATTAATTTTACCTGGATTTGGTATTATCAGTCAC
ATCGTATCTACTTTTTCAAGAAAACCTGTTTTCGGTGTAATTGGTATGATTTATGCAATG
GTTAGTATTGGTGTTCTCGGCTTTATAGTGTGGGCGCACCATATGTACACCGTTGGAATG
GACGTAACAACAAGAGCTTATTTCACAGCAACAACAATGGTAATTGCAGTACCTACCGGT
ATCAAGATTTTCTCGTGGATTGCTACAATGTGGGGGGGTTCAATTCATTTGAAAACACCA
ATGGTTTTTGCTATTGGTTTCATTTTCTTATTCACAATTGGTGGATTAACAGGAGTGGTT
CTTTCAAATGGTGGTTTAGATTTAGCGTTCCATGACAGTTATTACGTTGTGGCACATTTT
CACTATGTTCTCTCTATGGGAGCAGTATTCTCAATGTTTGCTGGTTACTATTATTGGATT
GGAAAAATGTCAGGATTTAATTATCCAGAAAATCTCGGAATTATTCACTTTTGGTGCACT
TTTGTAGGGGTTAATTGTACTTTCTTTCCACAGCACTTTTTAGGTTTAGCAGGGATGCCA
AGAAGAATACCAGATTATCCTGATGCATATGCAGGTTGGAATTATATTTCATCTTTTGGT
AGTTCAATTTCGGTTTTTGCAATTCTTTTATTTTTTGTTTTGACGTACGAAACATTTACA
AATATGGACAAATGTCCGGTAAATCCTTGGTCGTTTGCAACATCAAGTGCTGATCCAAAA
TTTGAATTTACGCTTGAATGGGTTGTAGGTTCACCACCATCTTTTCATACTTTTGAGGAA
TTACCTATTATCAAAGATACTGATATAGTAAATGTGTAG
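The records in this extract appear to differ in length (the first header ends in 570, the second in 1059), which would be consistent with the "[ERROR]: your sequences are not the same length, aborting." message in the stdout. A minimal sketch of a pre-flight check that could be run on the input before submitting the job is shown below. The function names `fasta_lengths`, `same_length`, and `check_file` are hypothetical helpers written for this example; they are not part of mothur or Galaxy.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: total_length} for FASTA text given as lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else "record_%d" % (len(lengths) + 1)
            lengths[current] = 0
        elif current is not None:
            # Sequence data may span several lines; add them up.
            lengths[current] += len(line)
    return lengths


def same_length(lengths: Dict[str, int]) -> bool:
    """True if every record has the same length (or there are 0 or 1 records)."""
    return len(set(lengths.values())) <= 1


def check_file(path: str) -> bool:
    """Hypothetical convenience wrapper: check a FASTA file on disk."""
    with open(path) as handle:
        return same_length(fasta_lengths(handle))
```

If `check_file` returns `False` for an input, the sequences are unaligned or of mixed length, and the job could be rejected up front instead of being sent to a compute node.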
From the point of view of Galaxy and Condor, the job is still running, but in fact it has crashed badly.
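One way to keep such a failure from going unnoticed is to run the tool through a small wrapper that treats an `[ERROR]` line in the output, a non-zero exit status, or a timeout as a failed job. The following is only a sketch under stated assumptions: `run_and_check` and the exit code 124 for timeouts are choices made for this example, not how the Galaxy tool wrapper currently works. The timeout only kills the direct child process, so forked workers that are themselves stuck may need separate handling.

```python
import subprocess
import sys
from typing import List, Optional

ERROR_MARKER = "[ERROR]"  # marker seen in the mothur stdout above


def run_and_check(cmd: List[str], timeout: Optional[float] = None) -> int:
    """Run cmd and return a non-zero code if it fails, times out,
    or prints ERROR_MARKER to stdout/stderr."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 124  # assumption: same convention as coreutils `timeout`
    sys.stdout.write(proc.stdout)  # pass the tool output through
    if proc.returncode != 0:
        return proc.returncode  # negative if killed by a signal
    if ERROR_MARKER in proc.stdout:
        return 1  # tool exited 0 but reported an error
    return 0
```

The return value can be passed to `sys.exit` so that the scheduler sees the job as failed rather than still running.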
About this issue
- State: open
- Created 3 years ago
- Comments: 15 (15 by maintainers)
Most upvoted comments
+1
bgruening on Nov 15, 2022