toil: Job never stops with JavaScript input

Hello there, we are facing an issue with a workflow that is using this type of CWL file:

The problem appears only when I’m using “InlineJavascriptRequirement” with this kind of JavaScript expression:

my_step:
    run: ./my_sub_cwl.cwl
    in:
      # Javascript sub selection
      array_input:
        source: array_input
        valueFrom: |
          ${
            for (var i = 0 ; i < self.length; i++) {
              if ( self[i].basename.includes("<here I'm putting any string in order to filter>")) {
                return self[i]
              }
            }
            return self[0]
          }
     
    out: [ out1, out2]

I also tried another notation, but the exact same issue appears:

valueFrom: $(inputs.array_input.filter(f => f.basename.includes("<here I'm putting any string in order to filter>"))[0])

The actual task finishes correctly (with return_code = 0), the input File generated by the JavaScript expression is properly set, and the file is retrieved.

BUT

The job never “finishes” properly and loops over this message indefinitely: [2021-12-16T14:48:28+0000] [Thread-2 ] [W] [toil.batchSystems.singleMachine] Sent redundant job completion kill to surviving process group 615 known to batch system 140180515079744

Important notes:

  1. We did not face this issue with Toil v5.3.0; it appeared only after updating to v5.5.0 (we didn’t try 5.4.0).
  2. I’m not facing the issue if I use the following expression: valueFrom: $(self[0])

I have not found any similar issue so far, and I don’t know what to try in order to fix it.

Any help would be greatly appreciated.

Thanks a lot in advance 😃 , Etienne

┆Issue is synchronized with this Jira Task ┆friendlyId: TOIL-1110

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

I did some testing and I don’t think the signal handler in Toil.__enter__ would override the handlers in the _toil_worker process. (I also started https://github.com/DataBiosphere/toil/compare/issues/3965-user-provided-exit-handler but I don’t think it’ll help much here)

However, the problem is that a SIGTERM signal sent to the leader process might not propagate to its workers, so I don’t know the best way to handle interrupts like this on the worker. I can look more into this after winter break.
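
One common way to make sure a termination request reaches child processes is to forward it explicitly. The sketch below only illustrates that idea and is not Toil's actual implementation: `start_worker`, `forward_sigterm`, and `install_sigterm_forwarding` are hypothetical names, the worker is a stand-in Python process, and a POSIX system is assumed, with the worker running in its own process group.

```python
import os
import signal
import subprocess
import sys


def start_worker():
    """Start a stand-in worker in its own session (and process group)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )


def forward_sigterm(worker):
    """Send SIGTERM to every process in the worker's process group."""
    try:
        os.killpg(os.getpgid(worker.pid), signal.SIGTERM)
    except ProcessLookupError:
        pass  # the worker has already exited and been reaped


def install_sigterm_forwarding(worker):
    """Make a SIGTERM received by this (leader) process reach the worker too.

    Must be called from the main thread, as required by signal.signal().
    """
    def handler(signum, frame):
        forward_sigterm(worker)
        # Exit with the conventional "killed by signal" status.
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGTERM, handler)
```

Signalling the whole process group rather than a single PID means that helpers spawned by the worker receive the signal too, as long as they have not moved to a different group.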

BTW, you might need to uninstall the cwltool installed by Toil before re-installing a local copy. Something like:

mkdir tests && cd tests

# set up a fresh virtual environment
virtualenv -p 3.8 venv && . venv/bin/activate

pip install -e git+https://github.com/DataBiosphere/toil.git@10c591b509d24d45cf58830e748e6a4e7f5e4f60#egg=toil[cwl]
pip uninstall -y cwltool

git clone https://github.com/common-workflow-language/cwltool && cd cwltool
git checkout 0209b0b7ce66f03c8498b5a686f8d31690a2acb3  # latest cwltool==3.1.20211107152837

# temp fix for cwltool
sed -i '294s/nodejs.poll()/nodejs.wait()/' cwltool/sandboxjs.py
sed -i '291i \ \ \ \ nodejs.stdin.close()' cwltool/sandboxjs.py

python setup.py install
cd ..

# run your cwl file with `toil-cwl-runner`

@EtiennePer are you running this on a Toil cluster / Linux environment by any chance? I was able to reproduce this on a Toil cluster, where cwltool doesn’t seem to terminate the nodejs processes that it spawns.

For example, running this test workflow with cwltool works but 2 defunct nodejs processes are left behind:

# test.cwl
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.2

inputs: []

outputs:
  output: int

expression: "$({'output': 1})"

# test.json
{}

root@ip-172-31-18-73:~/tests# ps -Al
F S   UID     PID    PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0       1       0  0  80   0 - 211726 -     ?        00:00:00 mesos-master
4 S     0      17       0  0  80   0 -  4574 -      pts/0    00:00:00 bash
0 Z     0      45       1  9  80   0 -     0 -      pts/0    00:00:00 nodejs <defunct>
0 Z     0      59       1  2  80   0 -     0 -      pts/0    00:00:00 nodejs <defunct>
0 R     0      67      17  0  80   0 -  6495 -      pts/0    00:00:00 ps

With toil-cwl-runner though, Toil wants to clean up all its children so it sits in an infinite loop waiting for the child processes to terminate, but it never gets to kill the ones spawned by cwltool.

@adamnovak and I suspected that this nodejs process is what stays behind, since popen.poll() is non-blocking and returns immediately even if the process is still running. So we inspected the processes during this loop, which showed that its stdin was never closed:

root@ip-172-31-18-73:~/tests# ps -Al
F S   UID     PID    PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
...
0 T     0     160      17 14  80   0 - 121956 -     pts/0    00:00:04 toil-cwl-runner
0 T     0     168     160  1  80   0 - 222947 -     pts/0    00:00:00 nodejs

root@ip-172-31-18-73:~/tests# ps x | less 
...
    168 pts/0    Tl     0:00 nodejs --eval "use strict"; process.stdin.setEncoding("utf8"); var incoming = ""; var firstInput = true; var context = {};  process.stdin.on("data", function(chunk) {   incoming += chunk;   var i = incoming.indexOf("\n");   while (i > -1) {     try{       var input = incoming.substr(0, i);       incoming = incoming.substr(i+1);       var fn = JSON.parse(input);       if(firstInput){         context = require("vm").runInNewContext(fn, {});       }       else{         process.stdout.write(JSON.stringify(require("vm").runInNewContext(fn, context)) + "\n");       }     }     catch(e){       console.error(e);     }     if(firstInput){       firstInput = false;     }     else{       /*strings to indicate the process has finished*/       console.log("r1cepzbhUTxtykz5XTC4");       console.error("r1cepzbhUTxtykz5XTC4");     }      i = incoming.indexOf("\n");   } }); process.stdin.on("end", process.exit); 

After installing a local copy of cwltool and making the following changes here, the nodejs processes seem to exit properly for both cwltool and toil-cwl-runner:

    stdin_buf.close()
    nodejs.stdin.close()
...

    nodejs.wait()

I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.
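
The difference between polling and closing stdin before waiting can be reproduced without cwltool. The following is a minimal sketch in which a small Python child stands in for the nodejs helper (an assumption made purely for illustration; `CHILD_CODE` and `run_helper` are hypothetical names). Like the nodejs script above, the child only exits once its stdin reaches end-of-file.

```python
import subprocess
import sys
import time

# Stand-in for the nodejs helper: it blocks until stdin reaches EOF.
CHILD_CODE = "import sys; sys.stdin.read()"


def run_helper():
    """Return (result of poll() while stdin is open, exit code after close)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
    )
    time.sleep(0.2)  # give the child time to start reading

    # poll() never blocks: it reports None because the child is still
    # alive, waiting for more input on its stdin.
    while_open = child.poll()

    # Closing our end of the pipe delivers EOF, so the child can exit;
    # wait() then blocks until it has exited and collects its status.
    child.stdin.close()
    exit_code = child.wait(timeout=10)
    return while_open, exit_code
```

In this sketch `poll()` reports `None` for as long as the pipe is open, and `wait()` returns only once stdin has been closed, which matches the behaviour that the two-line change above relies on.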

On a separate note, we also realized that we might not want Toil to do this check, so we might remove this loop and use a better init process so it can properly reap the zombie children.
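
For context, reaping a zombie means that its parent collects its exit status with one of the `wait` system calls. The sketch below shows the kind of loop an init-like parent might run. It is an illustration that assumes a POSIX system, not Toil's code, and `reap_exited_children` is a hypothetical name.

```python
import os


def reap_exited_children():
    """Collect the exit status of every child that has already exited.

    Returns a dict mapping each reaped PID to its exit code, or to the
    negative signal number if the child was killed by a signal.
    """
    reaped = {}
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call non-blocking.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)
    return reaped
```

A loop like this only sees the calling process's own children, including orphans that were re-parented to it, which is why it is normally the job of PID 1 inside a container.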

Hello @EtiennePer and thanks for your report. Can you reproduce your problem with cwltool, and if so, which version?

Does your system have nodejs installed? If so, what version? Or do you use the docker/singularity container fallback? If so, please let us know your Docker engine or Singularity version.