pip: Resolver performance regression in 21.2

Description

As of 21.2, the resolver decides it needs to download every possible version of multiple dependencies to determine which version is compatible with the other requirements.

Note that this works fine for some dependencies (coverage, pytest_cov), as it only downloads a few versions. For botocore and boto3, however, it downloads every possible version.

A minimal reproducible example can be found here: https://github.com/bblommers/pipresolverbug

Not sure what the offending dependency/installation method is - I could only reproduce it for this specific workflow.

The CI for this project runs this specific example for both 21.1 and 21.2. Version 21.1 completes in 15 seconds - 21.2 is still running after 20 minutes.

The CI logs can be found here: https://github.com/bblommers/pipresolverbug/actions/runs/1064350454

Expected behavior

To only download the latest version of a dependency

pip version

21.2

Python version

3.7

OS

Linux (Ubuntu)

How to Reproduce

git clone git@github.com:bblommers/pipresolverbug.git
pip install -r req.txt

Output

Collecting botocore>=1.12.201
  Using cached botocore-1.21.2-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.1-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.1-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.0-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.0-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.17.112-py2.py3-none-any.whl (131 kB)
Collecting s3transfer<0.5.0,>=0.4.0
  Using cached s3transfer-0.4.2-py2.py3-none-any.whl (79 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.20.112-py2.py3-none-any.whl (7.7 MB)
INFO: pip is looking at multiple versions of boto3 to determine which version is compatible with other requirements. This could take a while.
Collecting boto3>=1.9.201
  Using cached boto3-1.17.111-py2.py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 8
  • Comments: 101 (77 by maintainers)

Most upvoted comments

@notatallshaw Are all (applicable) reports in this issue in your test cases already? If that’s the case, I feel we can just close this one and track those cases on their own.

Yes, of the ones I could reproduce. Some I simply could never reproduce, and some have had their dependencies fixed (in those cases I’ve done my best to keep the test cases working by adding {package} <= {last known bad version} to my test requirement files).

I agree on closing, and once #10481 lands I think all pip issues related to heavy backtracking should be closed (e.g. #10373, #9187, #9215), as any new reports should be tested against that; if they are still experiencing heavy backtracking it will take further analysis, and the reasoning will not be the same.

I have summarized my proposal to assist with dependency trees like this and linked the PRs here: https://github.com/pypa/pip/issues/10479

If anyone has any ideas of how to improve or where I need to make a stronger argument let me know.

This new resolver just destroyed our CI process, which went from 4 minutes to timing out at 6 hours.

First of all, the work you’re doing is really appreciated, so please don’t take this comment the wrong way. But have you collected any evidence as to whether your approach makes any cases that are currently fast, slower?

Not taken the wrong way; I would really like to test this and find examples. But a couple of points: 1) if the resolver never backtracks there will be no difference; 2) as my latest version only changes the preference order, the worst possible times are the same.

In theory, I think, the worst possible case for my optimization (and therefore presumably slower than the current resolver) would be: Package A version n depends on Package B and Package C, but they are both 100% incompatible; Package A version n-1 depends on two different packages, Package D and Package E, which are also incompatible; and so on, until eventually Package A version 1 installs. Packages B, C, D, E, … have similar dependency incompatibilities to Package A.

But I don’t think this represents a real-world dependency tree. The assumption of the optimization is that if package A version n depends on package B, then package A version n-1 probably also depends on package B. Where this doesn’t hold the optimization will be slower, but I think changing dependencies usually only happens once in a while, and when the dependencies do change the user presumably changes them to a valid installable set of dependencies, in which case my optimization should still be faster.

Obviously it’s almost impossible to know how to check that - maybe it would be worth looking at some of the issues that were slow before 21.2, which the changes in 21.2 improved? Confirming that we don’t make them worse again would be a helpful data point.

I have been backtracking through open and closed issues that look performance-based, trying to find real-world reproducible examples. There are shockingly few; most users just write something to the effect of “takes too long” and don’t provide an actual reproducible example.

If you have specific cases that improved from 21.1 to 21.2 I would love to test them. But I would like to point out that my optimization only adds one extra parameter to the pinning preference, so most of the improvements that came with 21.2 should also be included in my optimization (in fact, even better, I fixed a bug in one of the improvements: https://github.com/pypa/pip/issues/10415).

Actually, what would be perfect would be if we could add performance tests to the test suite, so we could capture some of these bad cases and ensure we don’t regress. But I haven’t the foggiest idea how we could do this 🙁

An idea I had was trying to build a tool that takes a real-world test case and turns it into an offline repeatable test case. I imagine this would be pretty tricky but hopefully possible.

But as of right now I cannot get pip’s test suite to even start to run on my machine; I spent an hour trying yesterday and tox/pytest just keep throwing me cryptic errors. I’ll give it another shot at the weekend when I have more time to work on it.

Instead of measuring performance in terms of, say, duration, how about doing so in terms of the expected number of backtracks/…? This would presumably be less flaky than traditional performance tests.

@edmorley I think the issue is having any kind of benchmark suite at all right now; yes, performance should probably be measured in iterations, not time.
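
As a sketch of what iteration-based measurement could look like: resolvelib (which pip vendors for the resolver) accepts a reporter object, and the BaseReporter in the versions pip shipped around this time exposes a backtracking hook (newer resolvelib releases rename it to rejecting_candidate). The counter below is a minimal illustration under those assumptions, not anything in pip today:

from resolvelib import BaseReporter

class BacktrackCounter(BaseReporter):
    """Count resolution rounds and backtracks instead of wall-clock time."""

    def __init__(self):
        self.rounds = 0
        self.backtracks = 0

    def starting_round(self, index):
        # Called at the start of each resolution round.
        self.rounds = index + 1

    def backtracking(self, candidate):
        # Called each time the resolver discards a pinned candidate.
        self.backtracks += 1

A benchmark could then assert that backtracks stays below a known bound for a fixed requirement set, which is deterministic across machines in a way that wall-clock time is not.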

The second attempt fails at both the PoC and the original project, I’m afraid - runtimes are back to 21.2.x levels with this iteration.

@bblommers Hmm, interesting - there must be something different about your exact projects vs. the way I’ve built the reproducible examples. I will take a look when I get a chance. This is a very hacky approach I am taking; once I formalize my ideas a bit and have time to better understand the pip code base I will maybe have a better test.

I don’t think it’d be a waste of time – helping you navigate pip’s codebase and work with git/tooling is exactly the sorts of things that I’d be happy to help you with. 😃

@pradyunsg Thanks, appreciated. I don’t think I’ll be able to schedule anything fixed this weekend or next, but I’ll take a look at my calendar and book/msg you.

One thing that I don’t think is commonly known¹, and which might be useful for people hitting problems with long resolution times involving boto, is that if you have a constraints file containing version limits for a project, those limits are applied when pip fetches the list of versions from the package index, i.e., before the resolution and backtracking step. Conversely, version limits that are in a requirements file and/or individual project dependencies are only applied when the dependency is first encountered (which may be after the version list is fetched).

So as a rule of thumb:

If you want to force pip to only consider versions of a package newer than a given version, add pkg>=version to a constraints file and supply that (via the --constraint <file> option) to pip. This can be in addition to any existing requirements in requirements.txt or in project dependencies.

So, for example, in the case here, try adding

boto3>=1.9.201
botocore>=1.12.201

to a constraints.txt file, and see if that helps.
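
For example, keeping the req.txt from the reproduction repo unchanged and supplying the new file through pip’s --constraint option (-c is its short form):

python -m pip install -r req.txt -c constraints.txt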

¹ By which I mean that I thought it would be a useful option to add, went and looked at the code and discovered that I’d implemented it when I first added constraint support to the new resolver, but had forgotten about it 🙂

I want to emphasize again: please post what you are installing and the output you get. Dropping a comment without any details won’t get your issue fixed. We are developers here, not psychics.

I put together another commit that focuses only on Hypothesis 2, i.e. it tries to resolve the current conflicts before anything else.

It prefers the causes, and parents of causes, of the last known conflicts. In my testing it fixes the OP’s issue and the other test cases I have to hand. The hypothesis here is that “if package A version n depends on package B, then package A version n-1 probably also depends on package B”. Therefore, “if packages A and C both depend on package B but on different versions of package B, the resolution will probably be faster if it focuses on packages A and C and not on some package X which does not depend on package B”.
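
To make this concrete, here is a minimal sketch (not pip’s actual code) of a sort key a resolvelib-style preference hook could use to try conflict-involved requirements first. resolvelib records backtrack causes as (requirement, parent) pairs; accessing .name on them assumes pip-style requirement objects, and the function name and base_preference parameter are made up for illustration:

def conflict_aware_preference(identifier, backtrack_causes, base_preference):
    """Sort requirements implicated in the latest conflicts (or their
    parents) ahead of unrelated ones, then fall back to the old order."""
    involved = any(
        identifier == cause.requirement.name
        or (cause.parent is not None and identifier == cause.parent.name)
        for cause in backtrack_causes
    )
    # False sorts before True, so conflict-involved identifiers come first.
    return (not involved, base_preference)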

Feel free to try it here; I make no promises it will improve your particular case: python -m pip install git+git://github.com/notatallshaw/pip@third_attempt_at_prefer_non_conflicts

Code diff: https://github.com/notatallshaw/pip/compare/21.2.3...notatallshaw:third_attempt_at_prefer_non_conflicts

Compared to before, it doesn’t try to hack the resolution steps of pip, so its worst case is the same as the current worst case. I think this approach is now simple enough that it’s worth trying to work up a real PR rather than just PoC commits. FYI it also includes a fix for #10415, which I think may also help some use cases.

Wait WTF. Now I see

Hi, thanks all for offering up potential additional test cases. I will take a look at them over the weekend when I have spare time.

Just a quick question though, what version of pip are you using?

If you are using the tag I created there is no guarantee it will perform as expected, it is very much doing hacky things to the internals of the pip Resolver and I have made no attempt to verify correctness.

Thanks so much! I’ll take a closer look at the logic you implemented, but my intuition is this may be worthwhile to pursue, if I understand your logic correctly from a high level. The main criterion should be whether it would be easy enough for a user to identify and avoid the worst-case scenario. The current logic does not perform very well because (using the importlib-metadata example in https://github.com/pypa/pip/issues/9215#issuecomment-897259778) the resolver is working on seemingly unrelated stuff, and the user cannot easily tell what exactly is causing the backtracks before the resolver stops to emit an error. So it’d be worthwhile to look at pip’s output on those made-up examples to see if it can point users to the issue. Pip’s test suite has some useful functions you can use to create made-up packages (create_basic_wheel something) for this purpose.

Okay I had a bit of time today and I’ve developed a proof of concept for this strategy!

Hypothesis: When resolving the requirements of a group of packages and there is a conflict, a solution is likely to be found faster by delaying new candidates of packages which are unrelated to the found conflicts.

Here is my PoC commit. It is based on pip 21.2.1 and the code is very rough and definitely has some bugs in the approach, but it is just supposed to showcase the approach on the two reproducible test cases (listed below): https://github.com/notatallshaw/pip/commit/d085824b52bdb11efd6b46be92fe60a9c4040bd4

Reproducible test case 1:

pytest-cov
coverage==4.5.4
boto3

Reproducible test case 2 (python 3.7 only):

poetry2setup
wheel
twine

For those who wish to see if it improves their performance regression, you can install it with the following command (there are zero guarantees it will not crash): python -m pip install git+git://github.com/notatallshaw/pip@first_attempt_at_prefer_non_conflicts

Constructed counterexample: I want to highlight that this hypothesis is probabilistic; you can construct cases where this approach is slower, but all the constructions I have been able to come up with seem highly esoteric.

For example, let’s say you have 4 packages: a (v1 - v3), b (v1 - v10), c (v1 - v10), d (v1 - v3); your requirements are a, b, c, and all versions of b and c require d. Now b v2 - v9 depends on d==v3, and c v2 - v9 depends on d==v2, but both b v1 and c v1 depend on d==v1; finally, a v2 and v3 don’t depend on anything, but a v1 depends on b==v1 and c==v1. Therefore the only solution to this is a v1, b v1, c v1, and d v1.

The current resolver method would backtrack twice on a to get to a v1, find the requirements for b==v1 and c==v1, and then find their requirement for d==v1 and return the solution. However, the resolver method I am proposing would not backtrack on a, as it does not present any conflicts, until b and c have been exhausted, therefore being much slower in this constructed example.

@uranusjr Given my hypothesis, PoC commit, and my admitted example where this approach would actually be slower, is it worth pursuing this further to the point where I (or someone else) can try and submit a PR? Is this something you would be interested in?

At this point I think this optimization would be a significant amount of work to implement in a way that wasn’t super hacky, so there’s no point proceeding if you don’t agree with the approach. That said, I am personally of the opinion that this would be a significant optimization for pip in many situations where it currently performs very badly. I actually also think it might negate the setuptools hack in almost all situations (though I do not have any data for this yet).

I haven’t been able to take a good look into this, but a couple of notes:

  1. I’m going to release 21.2.2 for some other bugs in 21.2.1 and this will not be fixed in it (I’ll push the milestone back).
  2. One of the advantages of the new resolver logic is that it’s possible to tweak its behaviour by reordering your dependencies, and/or adding some of the dependencies to the requirements file. A lot of the performance depends on the ordering of things, so you can work out the best ordering for your project if the resolver can’t (and it will never be able to find the best route 100% of the time; all we can do is make the worst-case scenario happen less often, and less taxing when it happens). An example reordering is sketched below.
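
For instance, with the requirement set from the first reproducible test case earlier in this thread, moving the fully pinned requirement to the top means the resolver encounters the tightest constraint first (whether this actually helps depends on the dependency tree):

coverage==4.5.4
pytest-cov
boto3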

The problem happens with any and all packages that aren’t pinned to a specific version.

I’m not part of the pip team, but I’d like to point out this isn’t true; I have lots of requirements that don’t pin a specific version of a package and I don’t have this issue.

As already stated by others, if you don’t post specifically what you’re doing (giving the command and requirements), your problem probably won’t be helped.

Thanks, I’d forgotten that. Does adding botocore/boto3 to that list help with this issue?

If no one else does by this evening (EST) I’ll attempt to reproduce the issue and then try adding those packages to the list and report the results.

We only resolve packages with “tighter constraints” first, but don’t actually count the number of releases. I think the main reason is that there’s currently no mechanism to do that; the information is of course available on the index page, but we don’t have a way to bubble it up past PackageFinder.
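
As a purely hypothetical illustration of what counting releases could buy: if that information were bubbled up, the preference logic could visit the requirement with the fewest available candidates first, so a conflict there is found before downloading dozens of boto3 wheels. Nothing here is pip’s actual API, and the numbers are made up:

# Made-up (name, number of available releases) pairs for illustration.
requirements = [("boto3", 120), ("coverage", 1), ("pytest-cov", 8)]

# Visit the requirement with the fewest candidates ("tightest") first.
for name, count in sorted(requirements, key=lambda item: item[1]):
    print(name, count)  # -> coverage, pytest-cov, boto3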