setuptools: package data in subdirectory causes warning

setuptools version

62.3.2

Python version

3.10

OS

Debian with conda

Additional environment information

No response

Description

pyopencl has OpenCL files and some headers in a subdirectory pyopencl/cl and they are included as package_data so that the python module can find them.

package_data={
    "pyopencl": [
        "cl/*.cl",
        "cl/*.h",
        "cl/pyopencl-random123/*.cl",
        "cl/pyopencl-random123/*.h",
    ]
},

With new setuptools, there is a warning saying


    ############################
    # Package would be ignored #
    ############################
    Python recognizes 'pyopencl.cl' as an importable package, however it is
    included in the distribution as "data".
    This behavior is likely to change in future versions of setuptools (and
    therefore is considered deprecated).

    Please make sure that 'pyopencl.cl' is included as a package by using
    setuptools' `packages` configuration field or the proper discovery methods
    (for example by using `find_namespace_packages(...)`/`find_namespace:`
    instead of `find_packages(...)`/`find:`).

    You can read more about "package discovery" and "data files" on setuptools
    documentation page.

cc @inducer
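
For readers wondering why only 'pyopencl.cl' is named in the warning, the sketch below is a rough, stdlib-only approximation of how a data pattern can imply an importable subpackage. `implied_subpackages` is a hypothetical helper, not part of setuptools, and the rule it applies (keep the leading path components that are valid Python identifiers) is an assumption about the check rather than a copy of its source.

```python
import itertools
from pathlib import PurePosixPath


def implied_subpackages(package_data):
    """Hypothetical helper: dotted names that the data patterns would imply.

    Assumption: a directory counts as importable only while every leading
    path component is a valid Python identifier.
    """
    names = set()
    for parent, patterns in package_data.items():
        for pattern in patterns:
            # Directory part of the pattern, e.g. "cl/*.h" -> ("cl",)
            parts = PurePosixPath(pattern).parent.parts
            valid = list(itertools.takewhile(str.isidentifier, parts))
            if valid:
                names.add(".".join([parent, *valid]))
    return sorted(names)


PACKAGE_DATA = {
    "pyopencl": [
        "cl/*.cl",
        "cl/*.h",
        "cl/pyopencl-random123/*.cl",
        "cl/pyopencl-random123/*.h",
    ]
}

print(implied_subpackages(PACKAGE_DATA))
```

Under that assumption, `pyopencl-random123` contributes no name of its own because it contains a dash, which would be consistent with the warning mentioning only `pyopencl.cl`.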

Expected behavior

No warning

How to Reproduce

  1. clone https://github.com/inducer/pyopencl
  2. install numpy
  3. Run python setup.py install

Output

$ python setup.py install
running install
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing pyopencl.egg-info/PKG-INFO
writing dependency_links to pyopencl.egg-info/dependency_links.txt
writing requirements to pyopencl.egg-info/requires.txt
writing top-level names to pyopencl.egg-info/top_level.txt
reading manifest file 'pyopencl.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'pyopencl.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/build_py.py:153: SetuptoolsDeprecationWarning:     Installing 'pyopencl.cl' as data is deprecated, please list it in `packages`.
    !!


    ############################
    # Package would be ignored #
    ############################
    Python recognizes 'pyopencl.cl' as an importable package, however it is
    included in the distribution as "data".
    This behavior is likely to change in future versions of setuptools (and
    therefore is considered deprecated).

    Please make sure that 'pyopencl.cl' is included as a package by using
    setuptools' `packages` configuration field or the proper discovery methods
    (for example by using `find_namespace_packages(...)`/`find_namespace:`
    instead of `find_packages(...)`/`find:`).

    You can read more about "package discovery" and "data files" on setuptools
    documentation page.


!!

  check.warn(importable)
running build_ext

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 54 (32 by maintainers)

Most upvoted comments

Hi @leycec, I understand you’re frustrated with the changes here, but please try to be more respectful and considerate in your communication in the future.

As a project of the PyPA, everyone in this issue tracker is expected to follow the Python Community Code of Conduct. The expectation is that everyone interacting here should be courteous when raising issues and disagreements.

Specifically, calling the setuptools maintainers “insane lunatics” is unacceptable: it’s not constructive, it’s not welcoming or inclusive, and it easily qualifies as harassment.

Additionally, your previous comment in this issue tracker (“Heads need rolling (especially those currently attached to the still-functioning torsos of managerial project leads)”) is a clear example of violent language directed against another person, and is also unacceptable.

In short: we don’t do this here. You are welcome to continue participating in this project, but if you continue to violate the code of conduct here, you will no longer be permitted to participate.

Insane lunatics who are publicly disembowelling setuptools while distraught children are franticly screaming, please stop publicly disembowelling setuptools while distraught children are franticly screaming.

This means you, @abravalheri – and everyone else with setuptools push authority who mistakenly believed that Masaki Kobayashi’s seminal masterpiece “Harakiri” was not, in fact, a brutal dissection of authoritarian militancy under entrenched hierarchy but instead an exemplary paragon of software best practices in the enlightened post-Python 2.7 era.

So, @leycec. Bro. What’s The Big Deal, Yo?

Saliently, replacing find_packages() with find_namespace_packages() is not a valid solution for most projects. Why? Because find_namespace_packages() erroneously matches all repository subdirectories[^1] in the root repository as importable packages.

[^1]: …possibly excluding dot directories, not like it particularly matters. </shrug>

Clearly, most repository subdirectories in the root repository are not importable packages; they’re superfluous workflow subdirectories like {package_name}.egg-info/, build/, dist/, docs/, pip-wheel-metadata/, and the list just goes on and on. Moreover, the set of these subdirectories significantly changes over time and is almost entirely outside the control of downstream developers. Explicitly listing these subdirectories in the exclude parameter to find_namespace_packages() is thus infeasible. Therefore, replacing find_packages() with find_namespace_packages() is not a valid solution for most projects.
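
The concern can be illustrated without setuptools. The walker below is a hypothetical, stdlib-only stand-in for namespace-style discovery (any directory whose name contains no dot is treated as a package). It is a sketch of the idea, not setuptools' implementation, and the directory names are made up. It also shows one possible mitigation: restricting the result with include patterns instead of enumerating directories to exclude.

```python
import fnmatch
import os
import tempfile


def namespace_style_packages(root, include=("*",)):
    """Hypothetical stand-in for namespace-aware package discovery."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        # Prune directories whose names contain a dot (".git", "x.egg-info").
        dirnames[:] = sorted(d for d in dirnames if "." not in d)
        for d in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, d), root)
            name = rel.replace(os.sep, ".")
            # Filtering only affects what is reported, not how deep we walk.
            if any(fnmatch.fnmatchcase(name, pat) for pat in include):
                found.append(name)
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    for rel in ("my_package/data", "build/lib", "docs", "dist",
                "my_package.egg-info"):
        os.makedirs(os.path.join(root, *rel.split("/")))
    everything = namespace_style_packages(root)
    scoped = namespace_style_packages(
        root, include=("my_package", "my_package.*"))

print(everything)
print(scoped)
```

In this toy tree the unrestricted walk reports `build`, `dist` and `docs` alongside the real package, while the include patterns confine the result to `my_package` and its subdirectories.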

So, @leycec. Bro. How Did You Fix This?

Thanks. I’m so glad you asked. The obvious solution is just to abandon setuptools.

That’s what everybody else has done. But I’m curmudgeonly. I have a grubby beard and live in a mildew-infested cabin in the Canadian wilderness. People like me are disinclined to do what we should. Instead, I did what I shouldn’t.

I continue using setuptools despite its repeated outbursts of insanity. In this case, I compelled setuptools to obey my perfidious will via ludicrous boilerplate which I will now copy-and-paste into every Python project I maintain – much to the shared agony of junior developers and my wife, who must now suffer my delusions in silence. This is that boilerplate:

# In "setup.py":
PACKAGE_NAME = 'your_package_name_here'  # <-- edit this, you who are sadly reading this and now contemplating watching "Harakiri" on a Friday at 2:32AM despite knowing that to be a very bad idea

# Ludicrous boilerplate: I summon you!
import os, setuptools

# Do you remember when setuptools just worked? Because @leycec remembers.
_PACKAGE_NONDATA_NAMES = setuptools.find_packages(exclude=(
    'test',
    'test.*',
))
'''
List of the fully-qualified names of all **non-data Python packages** (i.e.,
directories containing the standard ``"__init__.py"`` file and zero or more
Python modules) to be installed, including the top-level application package and
all subpackages of this package but excluding the top-level test package and
all subpackages of that package.
'''

# This assumes your package data lives in a "data/" subdirectory of your package.
# If your package data lives elsewhere, it probably shouldn't.
_PACKAGE_DATA_NAMES = [
    f'{PACKAGE_NAME}.data.{package_data_name}'
    for package_data_name in setuptools.find_namespace_packages(
        where=os.path.join(PACKAGE_NAME, 'data'))
]
'''
List of the fully-qualified names of all **data Python pseudo-packages** (i.e.,
directories containing *no* standard ``"__init__.py"`` file and zero or more
data paths) to be installed.

Note that this is largely nonsensical. Ideally, a project subdirectory
containing *no* standard ``"__init__.py"`` file would be transparently treated
by both Python itself and :mod:`setuptools` as an unimportable non-package
directory rather than an importable package. Ideally, *only* directories
containing ``"__init__.py"`` files would be treated as importable packages.
Sadly, :pep:`420` (i.e., "Implicit Namespace Packages") fundamentally broke this
reasonable expectation by unconditionally forcing *all* project subdirectories
to be importable packages regardless of developer wants, needs, or expectations.

:mod:`setuptools` now "complies" with this nonsense by requiring that data
directories be explicitly listed as namespace packages. Of course, data
directories are *not* namespace packages -- but nobody in either the official
PyPA or CPython communities appears to care. If this is *not* done,
:mod:`setuptools` now emits one deprecation warning for each data subdirectory
and file resembling:

    Installing '{data_path}' as data is deprecated, please list it in `packages`.

Lastly, note that we could also avoid this unctuous list comprehension
altogether by simply replacing the above call to
:func:`setuptools.find_packages` with
:func:`setuptools.find_namespace_packages`. Then why do we not do so? Because
doing so would make things even worse. Why? Because then :mod:`setuptools` would
erroneously match *all* subdirectories of this root repository directory as
importable packages to be installed -- including obviously irrelevant root
subdirectories like ``"{package_name}.egg-info"``, ``".github"``,
``"doc"``, and ``"pip-wheel-metadata"``. Since the set of all such
subdirectories frequently changes with upstream revisions beyond our control,
explicitly specifying this set by listing these ignorable subdirectories in an
``exclude`` parameter is infeasible. In short, this is the least bad thing.

See Also
----------
https://github.com/pypa/setuptools/issues/3340
    Upstream :mod:`setuptools` issue where :mod:`setuptools` casually admit to
    breaking their entire toolchain for no demonstrably good reason.
'''

# You're welcome.
setuptools.setup(
    ...
    packages=_PACKAGE_DATA_NAMES + _PACKAGE_NONDATA_NAMES,
)

I didn’t make insanity. I only break it over my arthritic knee.

So, @leycec. Bro. Could You Like Stop Talking?

The party ends abruptly when @leycec walks through the door. The silence is deafening. I’m pretty sure the silence gave me tinnitus. Since everyone fled, I’ll say one last thing to the empty room:

Continually waving your hands about while screeching “PEP 420 made us stab ourselves in our eyeballs!!!” is no valid justification for stabbing everyone else in their eyeballs, too.

From my perspective, I have specified package_data={'my_package': ['data_folder/*.*']}, so the intention is clear: I consider it a folder of data files of my_package, and I don’t really care that Python considers it a namespace package. It would be really nice if setuptools added the namespace packages for me instead of giving a warning.

The above discussion covers how to go about building a correct distribution without using deprecated setuptools functionality. Thanks all for that!

My question here is more philosophical than practical. It seems like setup(packages=...) and MANIFEST.in are at least partially overlapping in functionality. In our MANIFEST.in we specify exactly what files and directories to include in distribution bundles. It seems redundant to have to specify similar information (the directories part) via the packages= configuration. Couldn’t that list of packages theoretically be inferred from the MANIFEST.in contents?

I’m faced with the inverse situation: my packages= and MANIFEST.in are well defined, with exactly the files I want included in my wheel. There are files within the package that I don’t want to see included (e.g. tests, sass files, non minified js, etc.). Now, setuptools adds them to the wheel, with no “Stop doing this, I know what I’m doing” flag that I can see. Am I missing something?

I was following this issue because I was very confused (as a newbie) by the warning that came up. Can I suggest a further edit to the warning message: Currently {importable!r} is only added to the distribution because it may contain data files… –> Currently {importable!r} has been automatically added to the distribution because it may contain data files… This is my understanding of what is actually happening (i.e. automatic inclusion). If this suggested change is wrong, then I’m still not following how this works…

Hi @shakfu, thank you very much for sharing your thoughts. Please see my comments below:

[…] quite challenging situation due to recent changes (PEP 420) […] But this problematic new state is not one that developers should learn to live with […]

Please note that the “directory is a package” behaviour introduced in PEP 420 is actually quite old. The PEP was approved on 19/Apr/2012 and the specified behaviour was implemented in Python 3.3, over 10 years ago, which is a lot in “software development years”. It is safe to assume that this behaviour is stable and that at some point in their careers Python developers will indeed learn what a Python package means, and how adding a directory nested somewhere under one of the entries of sys.path effectively creates a Python package that can be imported like any other package via an import statement.
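
A minimal sketch of that behaviour, using only the standard library: a directory without `__init__.py`, nested under a regular package that sits on sys.path, can be imported. The names `demo_pkg_420` and `values.txt` are invented for the example.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "demo_pkg_420")
    data_dir = os.path.join(pkg_dir, "data")
    os.makedirs(data_dir)
    # Regular package: has an __init__.py.
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    # "data" holds only a non-Python file and has no __init__.py.
    with open(os.path.join(data_dir, "values.txt"), "w") as fh:
        fh.write("42\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_pkg_420.data")
        module_file = getattr(module, "__file__", None)
        search_path = list(module.__path__)
    finally:
        # Leave the interpreter state as we found it.
        sys.path.remove(root)
        sys.modules.pop("demo_pkg_420.data", None)
        sys.modules.pop("demo_pkg_420", None)

print("imported:", module.__name__)
print("__file__:", module_file)
print("__path__:", search_path)
```

The import succeeds even though `data` contains no Python code; the resulting module has no source file of its own, only a search path pointing at the directory.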

[…] they should not be compelled to adopt insecure quick fixes.

Could you please clarify in what sense the solutions presented in the warning message are insecure? The warning message presents the user with a couple of suggestions (to be chosen according to what best fits their use case), which include manually adding the missing entries to the packages configuration option, or using a convenience function provided by setuptools. In both cases, the procedure is very mature and stable.

The proposed fix which resolves the deprecation is apparently to use find_namespace_packages. But this is just wrong. Since namespace packages are meant to contain importable executable code.

I understand that this is a popular interpretation of what the concept of packages (and/or namespace packages) might mean for Python and I see where it comes from. But I don’t think this interpretation is backed by the Python implementation and the way it works…

You can import directories that don’t contain .py files, and having packages for holding non-Python files is actually a very useful feature[^2]! I never found official documentation saying that packages/namespace packages are meant to contain importable executable code and cannot be used to contain only non-Python files; I don’t think there is an official stance on that.

[^2]: They make it really easy to find non-Python files at runtime using importlib.resources. You can also implement extensions/plugin systems on top of it, etc.
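
As an illustration of the footnote, here is a small sketch that reads a non-Python file through importlib.resources. It assumes Python 3.9 or newer (for `importlib.resources.files`) and uses an invented throwaway package, `demo_res_pkg`, created in a temporary directory. The lookup is anchored on the regular parent package and then descends into the data directory.

```python
import importlib
import importlib.resources
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "demo_res_pkg", "data")
    os.makedirs(data_dir)
    open(os.path.join(root, "demo_res_pkg", "__init__.py"), "w").close()
    with open(os.path.join(data_dir, "values.txt"), "w", encoding="utf-8") as fh:
        fh.write("42\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        # Locate the file relative to the package, not the working directory.
        resource = importlib.resources.files("demo_res_pkg") / "data" / "values.txt"
        content = resource.read_text(encoding="utf-8")
    finally:
        sys.path.remove(root)
        sys.modules.pop("demo_res_pkg", None)

print(content.strip())
```

The calling code never builds a filesystem path by hand, which is the convenience the footnote refers to.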

Within the resources folder structure (which is specified by the library I’m wrapping) there are 775 files, 10.4 MB in total. Most of these are templated source files and all are part of the thirdparty library itself. But all are precisely data, not to be imported in the project package, only to be used for the purposes of code generation.

If these non-Python files are not meant to be installed on the end user’s machine, I believe it is a matter of properly configuring packages/package_data/include_package_data/exclude_package_data/MANIFEST.in so that they are part of the sdist but not part of the wheel. Otherwise, if they end up nested somewhere under an entry of sys.path, they will effectively be import packages.

If you really don’t like the idea of having these directories as importable packages, then the alternative is to use data-files, which will translate into a special directory in the wheel file ({name}-{version}.data/data/). In turn, pip will install them in a different location that will not be nested anywhere in sys.path. It is a lot of effort to align expectation and implementation; for most people it might just be worth adapting their expectations.
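
To see where such files would end up, the standard library's sysconfig module reports the installation paths of the running interpreter. This sketch assumes the usual mapping in which the `data` category of a wheel is installed relative to the `data` path and ordinary packages go to `purelib`; the exact locations depend on the environment.

```python
import sys
import sysconfig

paths = sysconfig.get_paths()
data_root = paths["data"]      # base directory for the wheel "data" category
purelib = paths["purelib"]     # where pure-Python packages are installed

print("data files root:      ", data_root)
print("pure-Python packages: ", purelib)
print("data root on sys.path:", data_root in sys.path)
```

Files installed this way have to be located by path at runtime, since they are not reachable through the import system.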

May I suggest that the deprecation warning be dropped or even abbreviated somewhat until the data / code packaging distinction can be restored and preserved by some new PEP or other. It makes no sense to promote solutions, in the interim, which incorrectly conflate the two.

I think that simply dropping the warning would be unwise. The purpose of the warning is for developers to align their configuration with their expectations. If they want certain directories to be installed somewhere under sys.path, this effectively means that they are asking setuptools to include certain packages/subpackages in the wheel. Within setuptools, that desire is captured by the packages configuration option.

We need users to start clarifying their configuration, because the next step is to fix other related bugs (see https://github.com/pypa/setuptools/issues/3340#issuecomment-1219321087), and it would be bad if suddenly some folders are missing from packages.

I don’t know of anyone currently attempting to introduce a data / code distinction (and what that will mean for Python packages and the import system) via a new PEP. Personally, I like the status quo and I think it works quite well. Since there are no concrete plans for such a change in the ecosystem, there is no foreseeable risk of conflation, and we don’t need to treat this situation as “interim”. As far as we know this is the stable behaviour that we should be targeting after 10 years of transition.


There is another approach[^4] that I have absolutely no problem considering and would actually welcome with open arms: if a member of the community is willing to contribute (i.e. design, discuss, find consensus, implement, document, fix, support …) a different way of configuring setuptools that is more conceptually self-evident and less prone to ambiguity than packages/package_data/include_package_data/exclude_package_data/MANIFEST.in[^3]. Extra requirements for such a solution are backward compatibility and easy maintenance.

It is a tough challenge which I don’t have the resources to tackle myself, but I would be very grateful if someone else can.

[^3]: In some sense, the automatic discovery (when the user does not specify packages) that was introduced a couple of years ago is meant to be easier and less confusing. But automatic discovery is not a fit for all and edge cases still require playing with packages/package_data/include_package_data/exclude_package_data.

[^4]: But that is orthogonal to the warning and next steps discussed here.

hey there -

why does this warning refer to find_namespace when the current documentation indicates that find_namespace is only for namespace packages ? can this documentation please be updated to indicate it now has another use case for packages that are explicitly not namespace packages also ?

setuptools provides find_namespace: (find_namespace_packages()) which behaves similarly to find: but works with namespace packages.

should read something like:

setuptools provides find_namespace: (find_namespace_packages()) which behaves similarly to find: but works with namespace packages; additionally, **it allows one to indicate specific file paths to be included such as XYZ when including datafiles etc etc **

or otherwise can some document please be added that exactly explains the situation the warning is detecting and how we are to treat this situation as a “namespace package”. Projects with datafiles are ubiquitous and it’s not reasonable to launch a new warning that refers to off-label use of some obscure feature of setuptools in a vague way as how to resolve.

I have been stumped for hours by this issue. I just want to add data folders to my package, and there are no clear instructions for doing this. How can we specify in a pyproject.toml which folders we want to recursively include (and especially how to specify a nested folder from where to start looking)? The documentation does not provide any example for nested structures.

EDIT: Also this issue would be a potential solution if implemented: https://github.com/pypa/setuptools/issues/3341

EDIT2: For example, here is my repository producing (lots of) warnings: https://github.com/lrq3000/pyFileFixity/tree/ea447c548c9b736ea3c2c76bafa61bf1b51af4ca

And YES I can suppress the warnings with auto discovery by removing all the content of tool.setuptools.packages.find, but I do not want to rely on a beta feature! I want to manually specify my project’s structure, I prefer to know what I am doing and to be explicit, especially for something as crucial as packaging, it needs to be very deterministic and future-proof.
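
For the nested-folder question above, one explicit approach is to generate a non-recursive pattern for every directory that holds files and list those under package_data. `data_patterns` below is a hypothetical helper, and the tree it runs on is invented. Whether a given setuptools version also accepts recursive globs is not something this sketch relies on.

```python
import os
import tempfile


def data_patterns(package_dir, data_subdir):
    """Hypothetical helper: one '<dir>/*' pattern per directory with files.

    Patterns are relative to package_dir and use forward slashes.
    """
    patterns = []
    start = os.path.join(package_dir, data_subdir)
    for dirpath, dirnames, filenames in os.walk(start):
        dirnames.sort()  # deterministic order
        if filenames:
            rel = os.path.relpath(dirpath, package_dir).replace(os.sep, "/")
            patterns.append(rel + "/*")
    return patterns


with tempfile.TemporaryDirectory() as root:
    package_dir = os.path.join(root, "my_package")
    for rel in ("data/a.yaml", "data/nested/deep/b.yaml"):
        path = os.path.join(package_dir, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    patterns = data_patterns(package_dir, "data")

print(patterns)
```

The resulting list could be passed as `package_data={"my_package": patterns}` in a setup.py, or copied by hand into the equivalent pyproject.toml table for a fully static configuration.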

I have spent a few minutes trying a src layout but I can’t get an editable pip install to work.

edit:

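import setuptools

pkg_name = "my-package"

# Obviously in "flat" layout you place files under the package's slug if its name includes a dash.
pkg_slug = pkg_name.replace("-", "_")

setuptools.setup(
    # ... (other arguments elided)
    # Match only the package itself and its subpackages, including data
    # directories that have no __init__.py.
    packages=setuptools.find_namespace_packages(
        include=[pkg_slug, pkg_slug + ".*"],
        exclude=["tests", "tests.*"],
    ),
    # ...
)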

Something like this appears to work for me. Where a typical project is structured like:

README.md
setup.py
my_package/
    __init__.py
    __main__.py
    data/
        data1.yaml
        data2.yaml
    ...
tests/
    __init__.py
    test_entry.py
    ...

It’s not really clear to me from the documentation what setuptools actually wants. I think where this software fails is in making it clearer what the nominal package structure is. It’s nice that you can kind of do whatever you want and make it work but most of us shipping software out into the wild would rather just conform to something that “just works” and move on. I’m not really seeing that solution emerge out of the discussion or the documentation.

@milesgranger you are correct. This is a bug (https://github.com/pypa/setuptools/issues/3260).

The problem is that we cannot resolve this bug without first deprecating and removing the behaviour described in this issue (you can see that there are still a lot of people depending on it…).

I suppose you can work around it in one of the following ways:

  • Set exclude_package_data to remove all files in the tests folder or

  • Set include_package_data=False and add package_data with more specific file patterns.
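
The two workarounds could be sketched as keyword arguments for setup(). The package name and patterns are hypothetical, and the exact pattern semantics should be checked against the setuptools documentation for the version in use.

```python
# Hypothetical package name and patterns, for illustration only.
PACKAGE = "my_package"

# Option 1: keep include_package_data, but filter out the tests folder.
OPTION_EXCLUDE = {
    "include_package_data": True,
    "exclude_package_data": {PACKAGE: ["tests/*"]},
}

# Option 2: turn off include_package_data and list the wanted files explicitly.
OPTION_EXPLICIT = {
    "include_package_data": False,
    "package_data": {PACKAGE: ["data/*.yaml"]},
}

# Either dict would be unpacked into the call, e.g. setup(..., **OPTION_EXPLICIT).
print(OPTION_EXCLUDE)
print(OPTION_EXPLICIT)
```

The second option trades convenience for control: nothing is shipped unless a pattern names it.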

Sorry for the trouble, if we change things right now, several projects in the ecosystem might break (so we have to go through the deprecation period).

@abravalheri I stumbled on a bunch of build warnings buried in CI logs that I was reviewing randomly, and I am puzzled… can you articulate what benefit you expect this change to bring to end users (e.g. package maintainers that rely on setuptools)?

Personally I do not think such a warning can be easily seen. My wheels contain thousands of files and the warnings are just drowned in CI log files that are never looked at unless the build fails. So I am not convinced that this warning would have much effect.

You wrote:

Right now there is no concept of a “data directory” for the package ecosystem.

IMHO the current behaviour is the de-facto way that package maintainers understand and have grown to rely on. e.g. when you “include_package_data” anything (file or dir) in the tree of included packages is included.

Since PEP 420, effectively all directories are packages regardless of containing a __init__.py file or not. With this warning, my intention is to align the expectations of the users with the behaviour we observe in Python.

What is the Python behaviour there beyond the fact that files in the package tree are accessible? I could not find anything about data files or data directories mentioned in PEP 420.

Now, the proposed future behaviour does not seem entirely consistent: when there are data files in a directory with Python code (either a legacy init-style or “namespace” package) these are included, but data files in a subdirectory of the same would not be included, e.g., some data files would need an intervention and some data files would not? Unless a subdir of a package dir is not a Python identifier (e.g. with a dash as in “foo-bar”), and then this is included without warning.

So if I understand the to-be behaviour correctly based on deprecation messages this would mean this (assuming in all cases that include_package_data is True):

  • plain data files under a legacy or namespace package directory are included
  • directories with a non-valid Python identifier name under a legacy or namespace package directory are included
  • directories with a valid Python name under a legacy or namespace package directory will not be included and would require special treatment (e.g. adding an __init__.py) or a declaration such that they are treated as namespace packages.

I am not sure that this would contribute to a better and consistent user experience.

Hi @mhkline. The warning is not about the files themselves, but about the directory. Right now there is no concept of a “data directory” for the package ecosystem.

Since PEP 420, effectively all directories are packages regardless of containing a __init__.py file or not. With this warning, my intention is to align the expectations of the users with the behaviour we observe in Python.

If you want the directory to be included in the distribution, you can include it via the packages= configuration. find_namespace_packages() in setup.py or find_namespace: in setup.cfg will do that for you, and probably make the warning go away.

I’m facing the same problem, and I’m not clear on the nature of the change suggested by @abravalheri. Are you saying that directories in the package hierarchy that contain only data files and no Python code should be included in the project’s list of packages despite not actually being Python packages?