snakemake: Mandatory option `--cores` is breaking people's workflows
Also see #283
Snakemake version 5.13
Describe the bug
The option --cores is mandatory, which is inconsistent with the documentation printed by --help.
Minimal example
See #283
Desired behavior
Users should not be required to set mandatory options. Please pick reasonable defaults. The default for `--cores` should be “all”, like in ninja. This makes snakemake run in parallel by default, which is what most users naively expect anyway. The need to restrict the number of cores only occurs when people run many things in parallel on the same computer, which is much less common.
About this issue
- State: open
- Created 4 years ago
- Reactions: 10
- Comments: 44 (19 by maintainers)
Commits related to this issue
- build: Default Snakemake's --cores option to "all" Applies when our own --cpus option isn't provided. This will allow upgrading of Snakemake in our Docker runtime without inflicting the addition of ... — committed to nextstrain/cli by tsibley 2 years ago
Piling on here…
A lot of people work in environments where they share servers with other people. A 1-core default is the safe behavior. If the user wants more cores than that, then they ought to make the decision consciously. It shouldn’t be a required argument, either.
I don’t see how people are confused by the 1-core default. If they bother looking at the documentation or `--help` for five minutes, they’ll see the `--cores` option.
I love Snakemake and have really enjoyed using it. It’s been a huge boost to my productivity. But, frankly, I’m flabbergasted by these design choices in the recent releases.
For a slightly different argument - I use snakemake all the time and it would just be annoying to have to type `--cores N` for every single invocation. Please do not make this required - it feels like the kind of design decision I’d have to apologize for if recommending this tool to a colleague. Or if it must be required, allow for some specification in the snakefile or config file to override this.
I understand the rationale (users were confused as to why their workflows weren’t running in parallel), but the way it has been working (default to 1 core) mirrors the usage of make and many other commands that have parallel options.
Maybe I’m being naive here but I honestly have no real understanding of why “all” would be a good default option. There is no guarantee that the machine is well configured and that using all cores won’t drown out a server and prevent others from working or that it may spawn more processes than the user is allowed to use and result in throttling or job cancelations.
Rephrasing, “all” has edge cases which could be seriously problematic. “One” in contrast is never a problem; at worst it’s just a bit slower than you might expect (which is easily resolved after looking at the manual or help page for no more than 5 minutes). A default of 1 just seems intuitive, safe, well behaved and consistent with other similar tools. Still, regardless of “one” or “all”, having no default is just crazy.
I just don’t buy the story that “there is no consensus”. I see the majority of users on one side of this discussion.
As far as I can tell, the situation looks like this:
- `--cores 1` default. This includes me, @deto, @i-Zaak, @frankier, @mbhall88, @danijoo, @tbooth, @jjarmagost. Most of us base our argument around overwhelmingly common conventions for command line tools.
- `--cores` argument. They also mention the ninja build system as an argument for default `--cores all` behavior.
Do we really want to let the design of Snakemake be driven by the experience of beginners, rather than well-established conventions? Do we really want to let the design of Snakemake be driven by people who haven’t read the (very good) online documentation, or the `--help` message?
Consensus matters, but consensus among the right group matters more.
All of that said – I like Snakemake a lot and will continue using it regardless of the decision. @johanneskoester makes the very fair point that we can define our own profiles.
From reading this, I’d agree that there is no consensus, but I’d say that disagreement has mostly been about whether the default should be 1 or all cores.
However, my sense is that most people would prefer there to be a default rather than for this option to always need to be specified.
I think there is an elegant simplicity in typing `snakemake <foo>` and having that file be built. And when extra options are required, we lose some of that. I would suggest that the appeal of this simplicity has been a large contributor to the widespread adoption of this great tool. I’ve given tutorials showing other bioinformaticians how to use snakemake and I’d prefer not to have to start having the conversation about ‘what are cores’ and parallelism when just introducing the Hello World example.
So even though I would prefer a default of 1 core, I strongly advocate for the use of any default over making this a required parameter.
As a new user, `--cores` should default to 1.
As was mentioned above, this goes against every great unix command line program, and it is particularly important for a workflow management program. It would be equivalent to `docker compose` requiring a mandatory option. The whole point of workflow management is to use text files to specify configs and rules. `snakemake` should use sensible defaults so it can be used as an entrypoint.
And from a beginner perspective, you should be able to run ‘hello world’ without diving into parallelization. There are dozens of workflow management programs out there and the best way to learn about them is to read the docs and run a ‘hello world’ example. The `--cores` mandatory option is a red flag for a newcomer.
Other than that it is an amazing tool that has greatly simplified my work so I greatly appreciate your contribution.
First and foremost: I think Snakemake is absolutely wonderful and easily the best tool of its kind that I’ve used.
I’m just so confused about that mandatory --cores design choice. I think of it this way:
One of these is less useful than the others. 😃
For what it’s worth, I’d like to say that I agree with @mbhall88 that I think the old default of j=1 was most sensible. Automatically claiming all cores on the system (Linux is a shared computing environment after all) is dangerous, and adding new mandatory arguments to stable interfaces is problematic.
One issue I see here is that you can get away with some shortcuts when j=1, like having a temporary file called `tempfile` or printing results directly in the console. Once j>1, the contents of `tempfile` and the console output can get scrambled. Now you could just wag your finger and say “write your rules properly!” “use shadow rules!” but here we’re trying to help people who haven’t even twigged that they need to use the `--cores` setting to get parallel execution. So I fear that the questions from people asking “why is my workflow not running in parallel” will now be replaced with more complex problems from people who don’t understand that parallel execution is tricky, even with Snakemake to help. I’ll bet that many people have workflows that work OK with one thread, and now they’ll see that `--cores` is mandatory so they will give a number >1 and then the workflow will, as they see it, suddenly break.
I think the issue here wasn’t technical. If a user runs a program and expects it to automatically parallelize, does not look up in the options what a possible argument could be and ignores warnings, we do have a very different kind of problem! I mean, can you imagine all the other problems (say concurrent file access) that a user with that expectation runs into? That’s not the user which should be targeted in terms of required arguments, that’s the user group which should be targeted with “read a tutorial and get some basic computing understanding first”.
We have to be very careful and sometimes say simply “no” and maybe blame the user. In fact, a warning is already quite generous to me!
Definitely undo and reinstate the default. I also think one core is the most sensible. While I understand the argument for all cores, I think the use cases are:
Actually, it gets especially ridiculous when running with `-n` (dry run) and it requires me to specify the number of cores: this is not a design choice but a design mistake, since the number of cores is ignored anyway.
Well, what we can easily see from this discussion is that there is no consensus about what behavior is best. Hence, forcing a decision from the user by default seems like a reasonable choice to me.
And then, there are profiles, which allow you to configure the default behavior of Snakemake on your system. Moreover, I am currently crafting the ability to have a default profile in PR #467. So, you could simply have a profile
`.config/snakemake/default/config.yaml` with
the behavior will be the same as GNU Make again (1 core by default). Similarly, with
in the same place, you get ninja behavior (all cores). And without a profile, Snakemake will just ask, as it is now.
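A minimal sketch of the two profiles described above, assuming that profile keys mirror Snakemake's long command-line option names (so `cores` corresponds to `--cores`). The exact file contents are an assumption, not quoted from the comment, and the location is the one named there.

```yaml
# .config/snakemake/default/config.yaml
# (location taken from the path mentioned above; contents are an assumed sketch)

# GNU Make-like behavior: one core unless --cores is given on the command line.
cores: 1

# ninja-like behavior instead: remove the line above and uncomment this one.
# cores: all
```

With either variant in place, a plain `snakemake <foo>` would presumably no longer need an explicit `--cores` argument, and passing `--cores` on the command line should still take precedence over the profile.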
@johanneskoester let’s maybe get the facts speaking: how many users did ever complain that they expected it to run “automatically” in parallel, did not read the warnings that the code printed and also did not look up the docs?
We have in this thread here a consensus among probably more experienced developers: have a default. We have an adult discussion if one or all should be the default with many (including me) leaning towards 1. But even - so I read from the posts - “all” supporters would prefer a default over no default.
So let’s get some facts straight (and I say that because I love snakemake and find the “continued degradation” through questionable design choices just difficult to watch):
`--cores`? What’s the user experience that they need this to be mandatory? I.e. that they cannot just see that it “doesn’t parallelize” and then Google it and find the answer after 2 minutes? This 2 minutes that it “saves them” (and introduces potentially more problems in case of concurrent I/O), is that what the change is for?
I just really can’t make up my mind why such a great tool with many good design decisions ends up incorporating such poor decisions too.
I still think the users should be advised to start with `-j 1` and, only after everything is working correctly, try to get some speedup with parallelization. Race conditions https://github.com/snakemake/snakemake/issues/308#issuecomment-617692653 and performance issues https://github.com/snakemake/snakemake/issues/308#issuecomment-648481872 are strong enough arguments, I think. A default value is a form of very clear advice (even for those who don’t read `--help` or the manual, which is apparently prevalent enough to warrant the change in the first place).
ahem can anyone name a (widely used) tool that implicitly takes all cores, where that’s a good design choice?
@jjarmagost - Genuine question (not trying to be antagonistic, I am just genuinely curious for my own understanding/knowledge): what would be the arguments for a default of “all”, and how do you mitigate issues relating to shared servers?
Also echo your comments that this is fundamentally an amazing tool and hope that none of my critiques of the `-j1` give any impression to the contrary.
A major version bump is maybe indeed a good idea. When implementing this change, I did not see the command line interface as an API. But I now acknowledge that Snakemake is used as a backend in some contexts, so it is indeed appropriate to make a major version bump. I will probably combine that with some other breaking changes that can be used to simplify the codebase (removing `dynamic:`, `version:`, and `singularity:` (the latter can be written as `container:` now, since the runtime is pretty arbitrary nowadays)).
The question is do you want to be more like a traditional, well behaved Unix tool, which is quiet, runs as a single process by default and trusts users to read the manual to understand the full extent of its power, or more like monoliths like Matlab, the JVM, and Apache which think of the user’s machine as more theirs than the user’s? To me it’s very obvious which school has good taste and which bad, but we are unfortunately in the territory of taste and this could be bikeshedded indefinitely.
I disagree. Can anyone name a (widely used) tool that implicitly takes all cores?
I certainly agree, if the API has changed.
Wouldn’t the same be accomplished by a warning if the number of cores is not specified? I agree wholeheartedly with all the arguments above for default -j1.