toil: Job never stop with javascript input
Hello there, we are facing an issue with a workflow that is using this type of CWL file:
The problem appears only when I’m using “InlineJavascriptRequirement” with this kind of JavaScript expression:
my_step:
  run: ./my_sub_cwl.cwl
  in:
    # Javascript sub selection
    array_input:
      source: array_input
      valueFrom: |
        ${
          for (var i = 0; i < self.length; i++) {
            if (self[i].basename.includes("<here I'm putting any string in order to filter>")) {
              return self[i]
            }
          }
          return self[0]
        }
  out: [out1, out2]
I also tried another notation, but the exact same issue appears:
valueFrom: $(inputs.array_input.filter(f => f.basename.includes("<here I'm putting any string in order to filter>"))[0])
The actual task is finishing correctly (with a return_code = 0 ), the input File generated by the Javascript expression is properly set and the file is retrieved.
BUT
The job never “finishes” properly and loops over this message indefinitely:
[2021-12-16T14:48:28+0000] [Thread-2 ] [W] [toil.batchSystems.singleMachine] Sent redundant job completion kill to surviving process group 615 known to batch system 140180515079744
Important note :
- We were not facing this issue with toil v5.3.0 but only after updating to v5.5.0 (we didn’t try with 5.4.0)
- I’m not facing the issue if I use the following expression :
valueFrom: $(self[0])
I have not found any similar issue so far, and I don’t know what to try in order to fix it.
Any help would be much appreciated.
Thanks a lot in advance 😃 , Etienne
┆Issue is synchronized with this Jira Task ┆friendlyId: TOIL-1110
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (14 by maintainers)
I did some testing and I don’t think the signal handler in Toil.__enter__ would override the handlers in the _toil_worker process. (I also started https://github.com/DataBiosphere/toil/compare/issues/3965-user-provided-exit-handler but I don’t think it’ll help much here.)
However, the problem is that a SIGTERM signal on the leader process might not propagate to its workers, so I don’t know the best way to handle interrupts like this on the worker. I can look more into this after winter break.
We use processes_to_kill:
https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/sandboxjs.py#L18
https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/utils.py#L58
Each nodejs subprocess is appended to it, both non-containerized:
https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/sandboxjs.py#L76
or containerized:
https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/sandboxjs.py#L159
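The registry pattern described here (every spawned helper process is appended to a shared collection so that it can be killed later) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual cwltool or Toil code: `spawn_tracked`, `terminate_tracked` and `install_handlers` are hypothetical names, the `processes_to_kill` deque below is a local stand-in, and a Python child that blocks on stdin stands in for the nodejs helper.

```python
import collections
import signal
import subprocess
import sys

# Local stand-in for a module-level registry of spawned helper processes.
processes_to_kill = collections.deque()


def spawn_tracked(args):
    """Start a helper process and remember it for later cleanup (hypothetical helper)."""
    proc = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.DEVNULL)
    processes_to_kill.append(proc)
    return proc


def terminate_tracked(timeout=5.0):
    """Close stdin, terminate, and reap every tracked process (hypothetical helper)."""
    while processes_to_kill:
        proc = processes_to_kill.popleft()
        if proc.stdin is not None:
            proc.stdin.close()  # send EOF so a well-behaved child can exit on its own
        if proc.poll() is None:
            proc.terminate()
        try:
            proc.wait(timeout=timeout)  # reap the child so it does not stay defunct
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def _handle_signal(signum, frame):
    """Interrupt path: clean up the helpers, then exit with a conventional status."""
    terminate_tracked()
    sys.exit(128 + signum)


def install_handlers():
    """Route SIGINT and SIGTERM through the same cleanup (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handle_signal)


if __name__ == "__main__":
    install_handlers()
    # The child blocks reading stdin, much like a long-lived expression evaluator.
    spawn_tracked([sys.executable, "-c", "import sys; sys.stdin.read()"])
    terminate_tracked()  # normal shutdown path
```

Calling the same cleanup function from both the normal shutdown path and the signal handler means the helpers are reaped whether the run finishes or is interrupted.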
So toil-cwl-runner and _toil_worker both need to call https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L103 from their shutdown routines, and also from an interrupt handler like https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L137 via https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L1460
Ah-ha, good point. We need to fix that in cwltool, yes.
BTW, you might need to uninstall the cwltool installed by Toil before re-installing a local copy. Something like:
@EtiennePer are you running this on a Toil cluster / Linux environment by any chance? I was able to reproduce this on a Toil cluster, where cwltool doesn’t seem to terminate the nodejs processes that it spawns (at least on a Toil cluster).
For example, running this test workflow with cwltool works but 2 defunct nodejs processes are left behind:
With toil-cwl-runner though, Toil wants to clean up all its children so it sits in an infinite loop waiting for the child processes to terminate, but it never gets to kill the ones spawned by cwltool.
@adamnovak and I suspected that this nodejs process is what stays behind since popen.poll() is non-blocking. So we inspected the processes during this loop, which showed that the stdin was never closed:
After installing a local copy of cwltool and making the following changes here, the nodejs processes seem to exit properly for both cwltool and toil-cwl-runner:
I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.
On a separate note, we also realized that we might not want Toil to do this check, so we might remove this loop and use a better init process so it can properly reap the zombie children.
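The stdin observation in this comment can be illustrated with a small, self-contained sketch. It is not the change that was made to cwltool (that diff is not reproduced in this thread); it only shows the general behaviour, assuming a child that reads stdin until EOF. `CHILD_CODE` and `run_demo` are hypothetical names, and a Python child is used in place of nodejs.

```python
import subprocess
import sys

# Hypothetical stand-in for a long-lived helper that reads its stdin until EOF.
CHILD_CODE = "import sys; sys.stdin.read()"


def run_demo():
    """Show that poll() does not wait, and that closing stdin lets the child exit."""
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE], stdin=subprocess.PIPE)

    # poll() returns immediately; None means the child is still running.
    status_while_open = child.poll()

    # Closing our end of the pipe delivers EOF, so the child's read() returns.
    child.stdin.close()

    # wait() blocks until the child exits and reaps it, so no defunct entry remains.
    exit_code = child.wait(timeout=30)
    return status_while_open, exit_code


if __name__ == "__main__":
    print(run_demo())
```

As long as the parent keeps the pipe open, a child like this never sees EOF, so a loop that only polls it can spin indefinitely.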
Hello @EtiennePer and thanks for your report. Can you reproduce your problem with cwltool, and if so, which version?
Does your system have nodejs installed? If so, what version? Or do you use the docker/singularity container fallback? Then please let us know your docker engine or Singularity version.