sos: Slowness with large input

This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this:

[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"

I run them as two separate, sequential SoS commands:

sos run step
sos run another_step

You see, the first step takes a single file, file.gz, pairs it with different chroms, and then creates many small .rds files as dynamic output. The actual number of output files at the end of the pipeline is:

>>> import glob
>>> len(glob.glob('chr*/*.rds'))
43601

Now when I run the 2nd step, it gets stuck in a single SoS process preparing for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it … not yet analyzing the data.
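
For reference, here is a rough picture of what I think that single process has to resolve before anything can run (plain Python, not SoS internals; the glob pattern and group_by = 1 are taken from the workflow above):

import glob

inputs = sorted(glob.glob('chr*/*.rds'))  # ~43,601 paths in my case
groups = [[f] for f in inputs]            # group_by = 1: one substep per input file
print(len(groups), 'substeps to prepare before any R code runs')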

~43K files does not sound like a big deal, right? But this is the first time I have used the dynamic output of a previous step as the input of the next step, in separate commands. I am wondering what might be going on in this context, and whether we can do something about it.
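
In case it helps narrow things down, here is a quick standalone timing of the filesystem alone (plain Python, run from the same working directory). If even this is slow, the bottleneck is listing and stat-ing ~43K small files rather than anything SoS-specific:

import glob
import os
import time

t0 = time.time()
files = glob.glob('chr*/*.rds')
t1 = time.time()
total = sum(os.stat(f).st_size for f in files)  # one stat call per file
t2 = time.time()
print(f'glob: {t1 - t0:.1f}s for {len(files)} files')
print(f'stat: {t2 - t1:.1f}s, total size {total / 1e6:.1f} MB')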

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 64 (60 by maintainers)

Most upvoted comments

BTW, there was a Nextflow tweet saying XXX million tasks have been completed by Nextflow. Not sure how they came up with that number, but SoS is quickly catching up with your small tasks. 😄