sos: Slowness with large input
This topic has been discussed before, but perhaps not in the same context. I’ve got a couple of workflow steps like this:
[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"
I run it in 2 separate sequential SoS commands:
sos run step
sos run another_step
As you can see, the first step takes a single file, file.gz, pairs it with different chroms, and creates many small rds files as dynamic output. The actual number of output files at the end of the pipeline is:
>>> len(glob.glob('chr*/*.rds'))
43601
Now when I run the 2nd step, it gets stuck in a single SoS process that is preparing for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it, not yet analyzing the data.
~43K files does not sound like a big deal, right? But this is indeed the first time I have used the dynamic output of a previous step as the input of the next, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.
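To tell filesystem cost apart from SoS overhead, one could time the bare glob and per-file stat calls outside SoS. This is only a rough sketch: it assumes it is run from the directory that contains the chr* folders, and time_listing is a hypothetical helper, not part of SoS.

```python
import glob
import os
import time


def time_listing(pattern):
    """Return (file_count, glob_seconds, stat_seconds) for a glob pattern.

    Hypothetical helper for illustration; it is not part of SoS.
    """
    start = time.perf_counter()
    paths = glob.glob(pattern)
    glob_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        os.stat(path)  # one metadata lookup per file
    stat_seconds = time.perf_counter() - start
    return len(paths), glob_seconds, stat_seconds


if __name__ == "__main__":
    # Assumes the current directory holds the chr*/ output folders.
    count, glob_s, stat_s = time_listing("chr*/*.rds")
    print(f"{count} files: glob {glob_s:.2f}s, stat {stat_s:.2f}s")
```

If both timings come out small for ~43K files, the delay is more likely in how the workflow prepares its inputs than in listing the files themselves.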
About this issue
- State: closed
- Created 6 years ago
- Comments: 64 (60 by maintainers)
Commits related to this issue
- Re-generate dangling target list only when the DAG has been changed. #991 — committed to vatlab/sos by deleted user 6 years ago
- Fix the consequences of sudden disappearance of .pulse files #991 — committed to vatlab/sos by deleted user 6 years ago
- Using task files for def files #991 — committed to vatlab/sos by deleted user 6 years ago
- Wait 5 seconds and try to remove the task files again #991 — committed to vatlab/sos by deleted user 6 years ago
- Fix the status report when the task file has new runtime info #991 — committed to vatlab/sos by deleted user 6 years ago
- How the monitor thread should respond to the sudden disappear of .pulse file #991 — committed to vatlab/sos by deleted user 6 years ago
- Introducing a new "new" type for the merge of def and task files #991 — committed to vatlab/sos by deleted user 6 years ago
- Fixing the consequence of merging def and task files #991 — committed to vatlab/sos by deleted user 6 years ago
- Fixing the consequence of merging def and task files #991 — committed to vatlab/sos by deleted user 6 years ago
- Merge .res to .task files #991 — committed to vatlab/sos by deleted user 6 years ago
- workflow_handler needs to handle new status "new" #991 — committed to vatlab/sos by deleted user 6 years ago
- Fix the monitoring process with .pulse now absorbed into .task #991 — committed to vatlab/sos by deleted user 6 years ago
- Properly handle sos resume by not overwriting task file #991 — committed to vatlab/sos by deleted user 6 years ago
- Fix overwriting task files, which could change the status of the task #991 — committed to vatlab/sos by deleted user 6 years ago
- Properly define and test TaskFile class #991 — committed to vatlab/sos by deleted user 6 years ago
- Start using TaskFile class #991 — committed to vatlab/sos by deleted user 6 years ago
- Prepare to add signatures to .task file #991 — committed to vatlab/sos by deleted user 6 years ago
- Working on task preparation #991 — committed to vatlab/sos by deleted user 6 years ago
- Use in-memory signature for tasks #991 — committed to vatlab/sos by deleted user 6 years ago
- Adjust header to allow more efficient writing of task status #991 — committed to vatlab/sos by deleted user 6 years ago
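The commit titles above suggest that the separate .def, .res and .pulse files were folded into a single .task file whose header can be updated cheaply. The sketch below illustrates that general idea only. The byte layout, the status codes other than "new", and all function names are assumptions made for illustration; this is not SoS's actual TaskFile format.

```python
import struct

# Hypothetical layout: 4-byte magic, 1-byte status code, then the payload.
HEADER = struct.Struct("<4sB")
MAGIC = b"TSK0"
# Only "new" appears in the commit titles; the other names are assumed.
STATUS = {"new": 0, "pending": 1, "running": 2, "completed": 3, "failed": 4}
STATUS_NAMES = {code: name for name, code in STATUS.items()}


def create(path, payload, status="new"):
    """Write a fixed-size header followed by the task payload (bytes)."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, STATUS[status]))
        f.write(payload)


def set_status(path, status):
    """Overwrite only the header in place; the payload is not rewritten."""
    with open(path, "r+b") as f:
        f.write(HEADER.pack(MAGIC, STATUS[status]))


def get_status(path):
    """Read the status from the header without loading the payload."""
    with open(path, "rb") as f:
        magic, code = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError(f"{path} is not a task file in this sketch's format")
    return STATUS_NAMES[code]


def read_payload(path):
    """Return everything after the fixed-size header."""
    with open(path, "rb") as f:
        f.seek(HEADER.size)
        return f.read()
```

With a fixed-size header, a status change touches a few bytes of one file instead of creating or deleting a small companion file per task, which matters when there are tens of thousands of tasks.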
BTW, there was a nextflow tweet saying XXX million tasks have been completed by nextflow. Not sure how it came up with this number but SoS is quickly catching up with your small tasks. 😄