berry: [Bug] Extremely slow `yarn run xxxx` commands on large monorepo
The bug was explained there by my colleague @etiennedupont, but perhaps it should be reported here as well, since it concerns v2:
https://github.com/yarnpkg/yarn/issues/7917
Bug description
When running scripts defined in `package.json` using `yarn run xxx` on a large monorepo, there is an extremely long delay before the command is actually run. This delay is the same on every run and seems to depend on the size of the monorepo.
Command
Here is a simple example:
In package.json
{
  "scripts": {
    "echo": "echo ok"
  }
}
Then in shell run:
> yarn echo
.... WAITS FOR 265s .....
ok
Using time to confirm the duration:
> time yarn echo
ok
yarn echo 264.68s user 1.33s system 99% cpu 4:26.01 total
What is the current behavior?
Yarn does something using 100% of a CPU core for 265s (on my repo and machine) before actually running the command.
What is the expected behavior?
Yarn runs the command instantly
Steps to Reproduce
- Have a large monorepo with 176 packages in the workspace.
- yarn install
- Run any `yarn run xxxx` command in any of the package folders.
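For anyone without a suitably large repository at hand, a throwaway generator along these lines can create a workspace to reproduce against (all names and the package count here are arbitrary; `yarn install` and the timing test would still be run manually afterwards):

```typescript
import {mkdirSync, writeFileSync} from "fs";
import {join} from "path";

// Create `count` minimal workspace packages under `root`/packages, plus a
// root package.json with a "packages/*" workspaces glob, matching the
// layout described in this issue. Purely a reproduction helper.
function generateWorkspaces(root: string, count: number): void {
  mkdirSync(join(root, "packages"), {recursive: true});
  writeFileSync(join(root, "package.json"), JSON.stringify({
    name: "repro-monorepo",
    private: true,
    workspaces: ["packages/*"],
  }, null, 2));
  for (let i = 0; i < count; i++) {
    const dir = join(root, "packages", `pkg-${i}`);
    mkdirSync(dir, {recursive: true});
    writeFileSync(join(dir, "package.json"), JSON.stringify({
      name: `pkg-${i}`,
      version: "1.0.0",
      scripts: {echo: "echo ok"},
    }, null, 2));
  }
}
```

After running `generateWorkspaces(".", 176)` in an empty directory, `yarn install` followed by `time yarn echo` in one of the package folders should show whether the delay scales with the package count.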
Environment
- Node Version: 10.15.3
- Yarn Version: 2.0.0-rc.29.git.20200218.926781b7
- OS and version: macOS Catalina
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 38 (31 by maintainers)
To give you some context: the way the `resolvePeerDependency` function works, it traverses the tree to figure out, for each node N, which dependencies it needs to extract from its parent P to satisfy the peer dependencies of N. Once that's done, it recurses inside the newly created package N2 (which doesn't have peer dependencies anymore) in order to let its transitive dependencies inherit their peer dependencies as well.

I'm not entirely sure of the circumstances under which you'd end up with a huge tree - there's likely a mix of peer dependencies and regular dependencies that triggers a catastrophic expansion. My main guess is that something somewhere (possibly in multiple places) is listed as a regular dependency instead of being a peer dependency, causing Yarn to dig into it and generate those huge graphs. If you print `parentLocator.name` in your `console.group` you might get a clue by seeing which package is referenced the most.

Hey @RaynalHugo! I took a quick look at generating a huge monorepo with 200 packages, but no luck - my run time was mostly unaffected (I used the following script inside a project configured with a `packages/*` glob pattern).

I think the best would be to use `--no-minify` as you guessed, and put some timing statements around `setupWorkspaces` and `resolveEverything`. Depending on the results, other places may be good profiling targets as well, such as `initializePackageEnvironments`.

Closing with #997 and #998
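The timing suggestion above can be sketched with a small generic wrapper. Note that `setupWorkspaces` and `resolveEverything` are internal methods on Yarn's `Project` class, so the `project` object below is only a stand-in:

```typescript
// Generic helper: replaces an async method on `target` with a version that
// logs how long each call takes, preserving the original return value.
function timed<T extends object>(target: T, name: keyof T): void {
  const original = (target as any)[name] as (...args: unknown[]) => Promise<unknown>;
  (target as any)[name] = async (...args: unknown[]) => {
    const start = Date.now();
    try {
      return await original.apply(target, args);
    } finally {
      console.log(`${String(name)} took ${Date.now() - start}ms`);
    }
  };
}

// Usage sketch: wrap a hypothetical project object before the run starts.
const project = {
  async resolveEverything(): Promise<void> {
    // stand-in for the expensive resolution pass
  },
};
timed(project, "resolveEverything");
void project.resolveEverything();
```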
For internal lockfiles it's not so much required, but for open-source ones it's important to make sure the resolutions point to the expected location (otherwise I could make a PR adding lodash, except that the lockfile would actually point to `my-malicious-package`).

I think I need to work on the algorithm performance first, because it's less likely someone else will be able to efficiently navigate this part of the code. Given my other WIPs, I'd say at least a few weeks before I can implement it - so any help would be much appreciated! 😃
It would probably be better to implement this in the core - it’s unlikely someone would want a different implementation, so a hook just for that would be awkwardly specific. Plus the experience won’t be worse because of it, so better ship it by default.
I don’t think it would be very complex - basically, off the top of my head, what I think would need to be done:

- The `persistLockfile` function would have to be extended to generate the virtual state file as well as the lockfile.
- A new function (`hydrateVirtualPackages`?) would have to be added to the `Project` instance - it would load the data from the virtual state file (and potentially fall back on the regular resolution if the file isn’t found? I’m not certain about this 🤔). The implementation would look like `setupResolutions`, except that we would reuse the existing `originalPackages` as much as possible (cf the shortened format I mentioned).
- The `yarn run` function (and potentially others, you can find them by grepping `lockfileOnly: true`) would need to be modified to reference `hydrateVirtualPackages` instead of calling the resolver manually.
- The documentation would have to be updated to reference this in the section where we detail the `gitignore` patterns (the virtual state file would have to be ignored when not using zero-installs).

@arcanis regarding the `.yarn/virtual-state.yml` idea, you got us excited! So here are a few additional questions:

Best
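Since `hydrateVirtualPackages` is only a proposed name at this point, the following is strictly a sketch under assumed names: the file shape, the `VirtualEntry` fields, and the JSON encoding (the actual proposal is a YAML file) are all invented for illustration of the load-or-fall-back flow being discussed:

```typescript
import {existsSync, readFileSync} from "fs";

// Invented minimal shape for a ".yarn/virtual-state"-style file: for each
// virtual package, the locator it was forked from plus the peer
// dependencies that were filled in during the last full resolution.
interface VirtualEntry {
  original: string;                // locator of the base package
  peers: Record<string, string>;   // peer name -> resolved locator
}

// Load the pre-computed virtual packages from disk. Returning null stands
// in for the fallback to a regular resolution pass when the state file is
// missing, as discussed above.
function hydrateVirtualPackages(statePath: string): Map<string, VirtualEntry> | null {
  if (!existsSync(statePath)) return null;
  const raw = JSON.parse(readFileSync(statePath, "utf8")) as Record<string, VirtualEntry>;
  return new Map(Object.entries(raw));
}
```

The point of the design is that commands like `yarn run` only need this cheap deserialization step instead of re-running the full peer-dependency resolution on every invocation.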
I didn’t know humans were supposed to review `yarn.lock` files aha, ours is 45,000 lines. Indeed, I think Yarn v2 is actually the first-ever fully-correct package manager, and this is a big deal and much appreciated.

The solution of having `.yarn/virtual-state.yml` looks very good to me, and as I said it would totally solve our problem. Also, it might actually be the only way to fully solve this problem: I am not sure that optimizing the resolution pass algorithm would do the job. Right now it can take up to 400s; even making it 100 times faster would still be considered quite slow, taking 4s to run a script that sometimes completes in 2s.

So, in conclusion: I give a big 👍 to the `.yarn/virtual-state.yml` idea; it seems to be the last showstopper for us to adopt Yarn v2 in production, since everything now works as expected.

We did that and ran:
It gives these stats: the most imported package is `@babel/helper-plugin-utils`, which is imported… 39163 times… I guess this is a lot, even for a monorepo of 160 packages.

We did the group thing; the stacks don’t seem that deep, but the stream is never-ending - there are millions of lines…