pip: PEP 518 build requirements cannot be overridden by user

Apparently not; it seems to call pip install --ignore-installed .... Because the build itself is not isolated from the environment in other respects, I’m not sure this is actually sensible behavior by pip…

If the target computer already has a satisfactory version of numpy, then the build system should use that version. Only if no satisfactory version is already installed should pip use an isolated environment.

Related: scipy/scipy#7309

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 65 (36 by maintainers)

Most upvoted comments

Stepping back from the specific request of overriding build dependencies, the problem presented in the top post can be avoided by adding additional logic to how build dependencies are chosen. When a package specifies numpy (for example) as a build dependency, pip is free to choose any version of numpy. Right now it chooses the latest simply because that’s the default logic. But the logic could instead prefer a version matching the run-time environment when possible, which would keep the spirit of build isolation while at the same time solving the build/run-time ABI mismatch problem.
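
As a rough sketch of that preference (not pip’s actual resolver code; choose_build_dependency and its inputs are made up for illustration), the selection could look something like this:

```python
from importlib import metadata

from packaging.specifiers import SpecifierSet
from packaging.version import Version


def choose_build_dependency(name, specifier, available_versions):
    """Hypothetical helper: pick a version for build requirement `name`.

    Prefer the version already installed in the target environment when it
    satisfies the declared specifier; otherwise fall back to what pip does
    today and take the newest release that matches.
    """
    spec = SpecifierSet(specifier)
    try:
        installed = Version(metadata.version(name))
        if installed in spec:
            return installed  # reuse the run-time version -> no ABI mismatch
    except metadata.PackageNotFoundError:
        pass  # nothing installed; fall through to the default behaviour
    matching = [v for v in map(Version, available_versions) if v in spec]
    if not matching:
        raise RuntimeError(f"no version of {name} satisfies {specifier!r}")
    return max(matching)


# With numpy 1.21.6 installed and a declared build requirement of "numpy>=1.17",
# this would pin the isolated build environment to 1.21.6 rather than the
# newest numpy on PyPI.
```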

+1 this is a healthy idea in general, and I don’t see serious downsides.

Note that for numpy specifically, we try to teach people good habits, and there’s a package oldest-supported-numpy that people can depend on in pyproject.toml. But many people new to shipping a package on PyPI won’t be aware of that.

@rgommers:

We’d love to have metadata that’s understood for SIMD extensions, GPU support, etc.

I think this is relevant as we (well, mostly @alalazo and @becker33) wrote a library and factored it out of Spack – initially for CPU micro-architectures (and their features/extensions), but we’re hoping GPU ISA’s (compute capabilities, whatever) can also be encoded.

The library is archspec. You can already pip install it. It does a few things that might be interesting for package management and binary distribution. It’s basically designed for labeling binaries with uarch ISA information and deciding whether you can build or run that binary (a short usage sketch follows the list). Specifically, it:

  1. Defines a compatibility graph and names for CPU microarchitectures (defined in microarchitectures.json)
  2. It’ll detect the host microarchitecture (on macOS and Linux so far)
  3. You can ask things like “is a zen2 binary compatible with cascadelake?”, or “will an x86_64_v4 binary run on haswell?” (we support generic x86_64 levels, which are also very helpful for binary distribution)
  4. You can query microarchitectures for feature support (does the host arch support avx512?)
  5. You can ask, given a compiler version and a microarchitecture, what flags are needed for that compiler to emit that uarch’s ISA. For things like generic x86-64 levels we try to emulate that (with complicated flags) for older compilers that do not support those names directly.
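
A minimal usage sketch based on archspec’s README; treat the exact attribute and operator names as approximate and check the current docs before relying on them:

```python
import archspec.cpu

# (1) the microarchitecture database is exposed as a dict of named targets
haswell = archspec.cpu.TARGETS["haswell"]
cascadelake = archspec.cpu.TARGETS["cascadelake"]

# (2) detect the host microarchitecture
host = archspec.cpu.host()
print(host.name)                                      # e.g. "zen2" or "skylake"

# (3) compatibility questions map onto the partial order between targets:
#     a binary built for X runs on Y when X <= Y in the compatibility graph
print(archspec.cpu.TARGETS["x86_64_v4"] <= haswell)   # False: haswell has no AVX-512

# (4) feature queries against the database
print("avx512f" in cascadelake.features)

# (5) compiler flags needed to emit a given uarch's ISA
print(haswell.optimization_flags("gcc", "12.2.0"))    # e.g. "-march=haswell -mtune=haswell"
```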

We have gotten some vendor contributions to archspec (e.g., from AMD and some others), but if it were adopted by pip, I think we’d get more, so maybe a win-win? It would be awesome to expand the project because I think we are trying to solve the same problem, at least in this domain (ISA compatibility).

More here if you want the gory details: archspec paper

I think being able to provide users with a way to say “I want all my builds to happen with setuptools == 56.0.1” is worthwhile, even if we don’t end up tackling the binary compatibility story. That’s useful for bug-for-bug compatibility, for ensuring deterministic builds, and more.


I think the “fix” for the binary compatibility problem is a complete rethink of how we handle binary compatibility (which is a lot of deeply technical work), and that needs to pass through our standardisation process (which is a mix of technical and social work). And I’m not sure there’s either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.

Even if there is interest and we think the value is sufficient, I’m afraid I’m still not quite sure how tractable the problem even is, or where we’d want to draw the line on what we want to bother with.

I’m sure @rgommers, @njs, @tgamblin and many other folks will have thoughts on this as well. They’re a lot more familiar with this stuff than I am.

As for the pip caching issue, I wonder if there’s some sort of cache busting that can be done with build tags in the wheel filename (generated by the package). It won’t work for PyPI wheels, but it should be feasible to encode build-related information in the build tag, for the packages that people build themselves locally. This might even be the right mechanism to try, since it reuses existing semantics, toward solving some of these issues.
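
For reference, the build tag already round-trips through standard tooling; here is a quick check with the packaging library (the filename below is invented):

```python
from packaging.utils import parse_wheel_filename

# The optional build tag is the segment after the version; per the wheel spec
# it must start with a digit. A locally built wheel could stash
# build-environment info (e.g. the numpy it was compiled against) there.
filename = "scipy-1.7.3-1numpy1213-cp39-cp39-manylinux_2_17_x86_64.whl"

name, version, build_tag, tags = parse_wheel_filename(filename)
print(name, version)   # scipy 1.7.3
print(build_tag)       # (1, 'numpy1213') -- numeric part first, then the label
print(sorted(str(t) for t in tags))
```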

Regardless, I do think that’s related but somewhat independent of this issue.

@pradyunsg:

I think being able to provide users with a way to say “I want all my builds to happen with setuptools == 56.0.1” is worthwhile, even if we don’t end up tackling the binary compatibility story.

Happy to talk about how we’ve implemented “solving around” already-installed stuff and how that might translate to the pip solver. The gist of that is in the PackagingCon talk – we’re working on a paper on that stuff as well and I could send it along when it’s a little more done if you think it would help.

I think pinning to a particular package version isn’t actually all that hard – I suspect you could implement that feature mostly with what you’ve got. The place where things get nasty for us is binary compatibility constraints – at the moment, we model the following on nodes and can enforce requirements between them (a toy sketch follows the list):

  • compiler used to build, and its version
  • variants (e.g. is a particular build option enabled)
  • target uarch (modeled by archspec, mentioned above)
  • transitive dependencies: if you say you want a particular numpy, we also make sure you use its transitive dependencies. We’re working on a model where we could loosen that as long as things are binary compatible (and we have a notion of “splicing” a node or sub-dag into a graph and preserving build provenance that we’re experimenting with).
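
As a toy version of that node model (this is not Spack’s actual data structure; it just writes down the attributes listed above):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One build of a package in the dependency DAG (illustration only)."""
    name: str
    version: str
    compiler: tuple                               # ("gcc", "12.2.0"): compiler used to build
    variants: dict = field(default_factory=dict)  # build options, e.g. {"openmp": True}
    target: str = "x86_64"                        # uarch label, e.g. an archspec name
    deps: list = field(default_factory=list)      # transitive dependencies (Nodes)


def link_compatible(a, b):
    # Crude stand-in for the real constraints: same compiler family and the
    # same target, so the two builds' binaries can plausibly be linked together.
    return a.compiler[0] == b.compiler[0] and a.target == b.target
```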

The big thing we are working on right now w.r.t. compatibility is compiler runtime libraries for mixed-compiler (or mixed compiler version) builds (e.g., making sure libstdc++, OpenMP libraries, etc. are compatible). We don’t currently model compilers or their implicit libs as proper dependencies, and that’s something we’re finally getting to. I am a little embarrassed that I gave this talk on compiler dependencies in 2018 and it took a whole new solver and too many years to handle it.

The other thing we are trying to model is actual symbols in binaries – we have a research project on the side right now to look at verifying the compatibility of entry/exit calls and types between libraries (à la libabigail or other binary analysis tools). We want to integrate that kind of checking into the solve. I consider this part pretty far off, at least in production settings, but it might help to inform discussions on binary metadata for pip.

Anyway, yes we’ve thought about a lot of aspects of binary compatibility, versioning, and what’s needed as far as metadata quite a bit. Happy to talk about how we could work together/help/etc.

I agree that being able to override build dependencies is worthwhile; I just don’t think it’ll necessarily address all of the problems in this space (e.g., I expect we’ll still get a certain level of support questions from people about this, and “you can override the build dependencies” won’t be seen as an ideal solution - see https://github.com/pypa/pip/issues/10731#issuecomment-995544692 for an example of the sort of reaction I mean).

To be clear, build tags are a thing in the existing wheel file format.

Hmm, yes, we might be able to use them somehow. Good thought.

And I’m not sure there’s either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.

I think it’s a significant issue for some of our users, who would consider it justified. The problem for the pip project is how we spend our limited resources - even if the packaging community[^1] develops such a standard, should pip spend time implementing it, or should we work on something like lockfiles, or should we focus on critically-needed UI/UX rationalisation and improvement - or something else entirely?

I see no real reason to treat build and runtime dependencies in such an asymmetric way as is done now.

Agreed. This is something I alluded to in my comment above about “UI/UX rationalisation”. I think that pip really needs to take a breather from implementing new functionality at this point, and tidy up the UI. And one of the things I’d include in that would be looking at how we do or don’t share options between the install process and the isolated build environment setup. Sharing requirement overrides between build and install might just naturally fall out of something like that.

But 🤷, any of this needs someone who can put in the work, and that’s the key bottleneck at the moment.

[^1]: And the same problem applies for the packaging community, in that we only have a certain amount of bandwidth for the PEP process, and we don’t have a process for judging how universal the benefit of a given PEP is. Maybe that’s something the packaging manager would cover, but there’s been little sign of interaction with the PyPA from them yet, so it’s hard to be sure.

@pfmoore those are valid questions/observations I think - and a lot broader than just this build reqs issue. We’d love to have metadata that’s understood for SIMD extensions, GPU support, etc. - encoding everything in filenames only is very limiting.

(and honestly, expecting the end user to know how to specify the right overrides is probably optimistic).

This is true, but it’s also true for runtime dependencies - most users won’t know how that works or if/when to override them. I see no real reason to treat build and runtime dependencies in such an asymmetric way as is done now.

If we want to properly address this issue, we probably need an extension to the metadata standards. And that’s going to be a pretty big, complicated discussion (general dependency management for binaries is way beyond the current scope of Python packaging).

Agreed. It’s not about dependency management of binaries though. There are, I think, 3 main functions of PyPI:

  1. Be the authoritative index of Python packages and the channel through which open source code flows from authors to redistributors (Linux distros, Homebrew, conda-forge, etc.)
  2. Let end users install binaries (wheels)
  3. Let end users install from source (sdists)

This mix of binaries and from-source builds is the problem, and in particular - also for this issue - (3) is what causes most problems. It’s naive to expect that from-source builds of packages with complicated dependencies will work for end users. This is obviously never going to work reliably when builds are complex and have non-Python dependencies. An extension of metadata alone is definitely not enough to solve this problem. And I can’t think of anything that will really solve it, because even much more advanced package managers (with their associated package repositories), where complete metadata is enforced, don’t do both binary and from-source installs in a mixed fashion.

And I’m not sure there’s either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.

I have an interest, and some budget, for thoroughly documenting all the key problems that we see for scientific & data-science/ML/AI packages in the first half of next year, so that we’re at least on the same page about what the problems are and can discuss which ones may be solvable and which ones are going to be out of scope.

Regardless, I do think that’s related but somewhat independent of this issue.

agreed

@rbtcollins Don’t take offense, but I suspect you are not familiar with the subject matter. There is no consideration of installed requirements when resolving build dependencies. In addition, resolution of build dependencies is currently a complete hack.

I don’t think I proposed any new solution; I’m just trying to pick the right version specifiers for projects depending on numpy, given the current PEP 518 spec.

The specification requires the project to declare its build requirements. In other words, the project should specify the range of versions it could build against, rather than just a particular version. I understand why SciPy’s pyproject.toml is the way it is, but the correct fix is to modify the behavior of pip.

This thread is large and confusing, so it would be useful to be more explicit.

In layman’s terms, pip should build against the version already installed (or about to be installed) on the user’s computer, to mirror the behavior without PEP 518.

After some experience using pip 10.0, I do not think that the solution proposed by @rgommers is acceptable. What has essentially happened is that I’ve made changes to numpy.distutils to allow building dependent projects on Windows; these changes are included in the latest release. However, pip 10.0 downloads the pinned version of NumPy that @rgommers has chosen for me, which does not have these improvements, leading to a build failure.

What this effectively means is that I will not be able to use pip 10.0 unless I manually edit pyproject.toml before installing the project. From this perspective, the logic proposed by @dstufft seems much more appealing.

@njsmith That’s interesting, and it almost makes me wonder if our build requirements logic should be a bit more… complex? Although this gets complicated fast, so I’m not sure it’s even possible. My immediate thought is:

If we need to install X to build Y:

  • … and X is not already installed in the environment and is not in our requirements set, then install the latest version of X that matches the version specifier in the build requirements.
  • … and X is already installed in the environment and is not in our requirements set and that already installed version of X matches the version specifier in the build requirements, then treat it as if the build-requires is X==$INSTALLED_VERSION.
  • … and X is already installed in the environment and is not in our requirements set and that already installed version of X does not match the version specifier in the build requirements, then install the latest version of X that matches the build requirements.
  • … and X is in our requirements set and it matches the version specifier in our build-requires, then treat it as if the build requirements contain X==$REQUIREMENT_SET_VERSION.
  • … and X is in our requirements set and it doesn’t match the version specifier in our build-requires, then install the latest version of X that matches the version specifier in build requirements.

I’m REALLY not sure how I feel about that; it feels super magical and I feel like the edge cases are going to be gnarly. But on a quick five-minute think, it feels like it might also do the right thing more often and require some sort of override less often… but I dunno, it feels kinda icky.
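
For concreteness, that decision tree might look roughly like this in code (everything here is hypothetical pseudologic, not pip internals):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version


def resolve_build_requirement(name, build_spec, installed, requirement_set):
    """Sketch of the branches above.

    `installed` maps already-installed project names to Versions;
    `requirement_set` maps names pip is going to install to Versions;
    `build_spec` is the SpecifierSet declared in the build requirements.
    Returns the specifier to use for the isolated build environment.
    """
    if name in requirement_set:
        pinned = requirement_set[name]
        if pinned in build_spec:
            return SpecifierSet(f"=={pinned}")   # build against what we will install
        return build_spec                        # mismatch: latest matching version
    if name in installed:
        present = installed[name]
        if present in build_spec:
            return SpecifierSet(f"=={present}")  # build against what is already there
        return build_spec                        # installed version doesn't match
    return build_spec                            # nothing relevant: latest matching


# Example: numpy 1.13.3 installed, not in the requirement set, unconstrained build spec
print(resolve_build_requirement(
    "numpy", SpecifierSet(""), {"numpy": Version("1.13.3")}, {}
))  # ==1.13.3
```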

I don’t think we need a new issue; I think this issue is fine. I’ll just update the title, since the current one isn’t really meaningful.

Yeah, scipy and other packages using the numpy C API ought to couple their numpy install-requires to whichever version of numpy they’re built against. (In fact numpy should probably export some API saying “if you build against me, then here’s what you should put in your install-requires”.) But that’s a separate issue.
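
The API hinted at there could be as small as the following (entirely hypothetical; numpy exposes no such function, and it simply leans on the convention, which oldest-supported-numpy also relies on, that a wheel compiled against numpy X.Y is expected to keep working with later numpy releases but not earlier ones):

```python
import numpy


def numpy_install_requires():
    """Hypothetical helper a build backend could call at wheel-build time.

    Because a wheel compiled against numpy X.Y is expected to keep working
    with later numpy releases (but not earlier ones), the matching run-time
    pin is a lower bound at the build-time version.
    """
    major, minor = numpy.__version__.split(".")[:2]
    return f"numpy>={major}.{minor}"


# e.g. building against numpy 1.21.6 would yield "numpy>=1.21"
print(numpy_install_requires())
```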

The pyproject.toml thing is probably clearer with some examples though. Let’s assume we’re on a platform where no scipy wheel is available (e.g. a Raspberry Pi).

Scenario 1

pip install scipy into a fresh empty virtualenv

Before pyproject.toml: this fails with an error, “You need to install numpy first”. The user has to manually install numpy, and then scipy. Not so great.

After pyproject.toml: scipy has a build-requires on numpy, so this automatically works, hooray

Scenario 2

pip install scipy into a virtualenv that has an old version of numpy installed

Before pyproject.toml: scipy is automatically built against the installed version of numpy, all is good

After pyproject.toml: scipy is automatically built against whatever version of numpy is declared in pyproject.toml. If this is just requires = ["numpy"] with no version constraint, then it’s automatically built against the newest version of numpy. This gives a version of scipy that requires the latest numpy. We can/should fix scipy’s build system so that at least it knows it requires the version of numpy it was built against, but doing this for all projects downstream of numpy will take a little while. And even after that fix, this is still problematic if you don’t want to upgrade numpy in this venv; and if the wheel goes into the wheel cache, it’s problematic if you ever want to create a venv on this machine that uses an older version of numpy + this version of scipy. For example, you might want to test that the library you’re writing works on an old version of numpy, or switch to an old version of numpy to reproduce some old results. (Like, imagine a tox configuration that tries to test against old-numpy + old-scipy, numpy == 1.10.1, scipy == 0.17.1, but silently ends up actually testing against numpy-latest + scipy == 0.17.1 instead.) Not so great

OTOH, you can configure pyproject.toml like requires = ["numpy == $SPECIFICOLDVERSION"]. Then scipy is automatically built against an old version of numpy, the wheel in the cache works with any supported version of numpy, and all is good.

Scenario 3

pip install scipy into a python 3.7 virtualenv that has numpy 1.13 installed

Before pyproject.toml: You have to manually install numpy, and you might have problems if you ever try to downgrade numpy, but at least in this simple case all is good

After pyproject.toml: If scipy uses requires = ["numpy"], then you get a forced upgrade of numpy and all the other issues described above, but it does work. Not so great

OTOH, if scipy uses requires = ["numpy == $SPECIFICVERSION"], and it turns out that they guessed wrong about whether $SPECIFICVERSION works on python 3.7, then this is totally broken and they have to roll a new release to support 3.7.

Summary

Scipy and similar projects have to pick how to do version pinning in their pyproject.toml, and all of the options cause some regression in some edge cases. My current feeling is that the numpy == $SPECIFICVERSION approach is probably the best option, and overall it’s great that we’re moving to a more structured/reliable/predictable way of handling all this stuff, but it does still have some downsides. And unfortunately it’s a bit difficult to tell end-users “oh, right, you’re using a new version of python, so what you need to do first of all is make a list of all the packages you use that link against numpy, and then write a custom build frontend…”