pandas: BUILD: missing test data in 2.1.0 sdist/install

Installation check

Platform

Linux-6.4.7-gentoo-dist-x86_64-AMD_Ryzen_5_3600_6-Core_Processor-with-glibc2.38

Installation Method

pip install

pandas Version

2.1.0

Python Version

3.11.5

Installation Logs

$ pip install pandas
Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/d9/26/895a49ebddb4211f2d777150f38ef9e538deff6df7e179a3624c663efc98/pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy>=1.23.2 (from pandas)
  Obtaining dependency information for numpy>=1.23.2 from https://files.pythonhosted.org/packages/32/6a/65dbc57a89078af9ff8bfcd4c0761a50172d90192eaeb1b6f56e5fbf1c3d/numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 63.1 MB/s eta 0:00:00
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: pytz, tzdata, six, numpy, python-dateutil, pandas
Successfully installed numpy-1.25.2 pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3 six-1.16.0 tzdata-2023.3
$ cd ${VIRTUAL_ENV}/lib/python3.11/site-packages
# e.g.:
$ python -m pytest pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager -x
========================================================= test session starts =========================================================
platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.3.0
rootdir: /tmp/.venv/lib/python3.11/site-packages/pandas
configfile: pyproject.toml
plugins: asyncio-0.21.1, hypothesis-6.82.7
asyncio: mode=Mode.STRICT
collected 4 items                                                                                                                     

pandas/tests/io/parser/common/test_file_buffer_url.py F

============================================================== FAILURES ===============================================================
____________________________________________________ test_context_manager[c_high] _____________________________________________________

all_parsers = <pandas.tests.io.parser.conftest.CParserHighMemory object at 0x7ff0df5f2fd0>
datapath = <function datapath.<locals>.deco at 0x7ff0df775440>

    def test_context_manager(all_parsers, datapath):
        # make sure that opened files are closed
        parser = all_parsers
    
>       path = datapath("io", "data", "csv", "iris.csv")

pandas/tests/io/parser/common/test_file_buffer_url.py:372: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = ('io', 'data', 'csv', 'iris.csv'), path = '/tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv'

    def deco(*args):
        path = os.path.join(BASE_PATH, *args)
        if not os.path.exists(path):
            if strict_data_files:
>               raise ValueError(
                    f"Could not find file {path} and --no-strict-data-files is not set."
                )
E               ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-files is not set.

pandas/conftest.py:1201: ValueError
------------------------------ generated xml file: /tmp/.venv/lib/python3.11/site-packages/test-data.xml ------------------------------
======================================================== slowest 30 durations =========================================================

(3 durations < 0.005s hidden.  Use -vv to show these durations.)
======================================================= short test summary info =======================================================
FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager[c_high] - ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-fil...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed in 0.12s ==========================================================
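
The missing file can also be confirmed without running pytest. A minimal sketch, assuming nothing beyond the standard library; `missing_data_files` is a hypothetical helper name, not part of pandas:

```python
import os


def missing_data_files(base_path, relative_paths):
    """Return the relative paths that do not exist under base_path."""
    return [
        rel
        for rel in relative_paths
        if not os.path.exists(os.path.join(base_path, rel))
    ]
```

Called with the installed `pandas/tests` directory as `base_path` and `["io/data/csv/iris.csv"]` as the paths, it should return the iris path in the environment above, given the traceback.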

<del>The file’s in source directory, so I guess it isn’t installed by meson.</del>

About this issue

  • State: open
  • Created 10 months ago
  • Reactions: 2
  • Comments: 24 (23 by maintainers)

Most upvoted comments

The version from https://github.com/gentoo/gentoo/blob/1761e8fcdfda09370046cdd0e382c3aa206d3f61/dev-python/pandas/pandas-2.1.0.ebuild is more up-to-date.

I’m using --no-strict-data-files for now, but 1) as @bnavigator points out, it’s far from optimal, and 2) it doesn’t seem to cover the lxml tests.
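For context, the effect of the flag can be sketched roughly as below. The strict branch mirrors the `deco` helper shown in the traceback; the non-strict branch is my assumption that the test gets skipped instead of failing, and `DataFileSkipped` is a hypothetical stand-in for pytest's skip outcome so the sketch runs without pytest.

```python
import os


class DataFileSkipped(Exception):
    """Hypothetical stand-in for a pytest skip, for illustration only."""


def resolve_data_path(base_path, *args, strict=True):
    """Join base_path and args; react to a missing file depending on strict."""
    path = os.path.join(base_path, *args)
    if not os.path.exists(path):
        if strict:
            # Same message as in the traceback above.
            raise ValueError(
                f"Could not find file {path} and --no-strict-data-files is not set."
            )
        # Assumption: without strict mode the test is skipped, not failed.
        raise DataFileSkipped(f"Could not find {path}.")
    return path
```

If that assumption holds, every test that needs a stripped file silently drops out of the run, which is why the flag is only a stopgap for packagers.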

I’ve learned that the .gitattributes entries are apparently necessary because of bad design in meson(-python), which doesn’t allow controlling sdist contents (sigh).
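To make the mechanism concrete: `git archive` leaves out any path that carries the `export-ignore` attribute, and as far as I understand the sdist is produced from such an archive. The sketch below lists the patterns marked that way in a .gitattributes text. The sample entries are hypothetical, not copied from the pandas repository, and the parser ignores quoted patterns and attribute negation.

```python
def export_ignored_patterns(gitattributes_text):
    """Return patterns that have the export-ignore attribute set."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        if "export-ignore" in attrs:
            patterns.append(pattern)
    return patterns


# Hypothetical sample content, for illustration only.
sample = """\
# test data and docs kept out of archives
pandas/tests/io/data/** export-ignore
doc export-ignore
*.py text
"""

print(export_ignored_patterns(sample))
```

Running the same function over the real .gitattributes from a git checkout would show which paths are expected to be absent from the tag archives.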

My only idea so far would be to move all the undesirable test data from subdirectories into one git submodule. That should prevent it from being included in sdist, and make it easy for us to fetch it independently and merge with the rest.

> (You might want to consider building from the git tag of the release then, if you need all the files.)

That’s the problem: The direct download link is the same as the “Source code” on the release page: https://github.com/pandas-dev/pandas/archive/refs/tags/v2.1.0.zip does not contain the data. Only a proper git clone will have it.

> I checked the Source code (tar.gz) file, and it has the pandas/tests/io/data directory.

The directories in it are empty.
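This is easy to verify on a downloaded tarball. A minimal sketch using only the standard library; `data_files_in_archive` is a hypothetical helper, and it assumes the archive wraps everything in a single top-level directory, as the release tarballs do:

```python
import io
import tarfile


def data_files_in_archive(fileobj, data_dir="pandas/tests/io/data"):
    """Return regular files below data_dir in a (possibly compressed) tar."""
    found = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            # Drop the top-level directory component before comparing.
            _, _, inner = member.name.partition("/")
            if member.isfile() and inner.startswith(data_dir + "/"):
                found.append(inner)
    return found
```

Opening the tarball with `open(path, "rb")` and passing the file object in should give an empty list when the directory entries exist but the files themselves were stripped.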

> _version_meson.py is generated by the build system.

Only if there is a setup.py for versioneer to use. But that one is also missing.

> Also, you mention the Github archive is a fallback, is there something wrong with the sdist on PyPI?

Yes, it lacks the test data. We distribution packagers need to run the test suite as completely as possible in order to ensure package integrity.

I can understand shrinking wheel sizes, but what I don’t really understand is why you’re also stripping the test data from the GitHub archives, which are our last fallback when sdists are unsuitable for testing.

It’s not just the test data - the documentation, and possibly also some smaller items, have also been removed.

In Debian, I’ve switched to using the git repository itself, so this isn’t blocking for my packaging. I don’t know whether the Gentoo or openSUSE build tools have an equivalent option.

> move all the undesirable test data from subdirectories into one git submodule

If you do that, you might find my patch for loading test data from a different path useful. (Debian prefers to (also) run tests against the as-installed package, and we don’t want test data taking up space in the user package either.)
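The general idea can be sketched as follows. This is not the patch itself, only an illustration: the environment variable name `PANDAS_TEST_DATA_DIR` and the helper `data_base_path` are hypothetical and chosen for this example.

```python
import os

# Hypothetical variable name, chosen for this example only.
ENV_VAR = "PANDAS_TEST_DATA_DIR"


def data_base_path(default_base, environ=None):
    """Return the directory to load test data from.

    Uses the override from the environment when it is set and non-empty,
    otherwise falls back to the tests directory of the installed package.
    """
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR) or default_base
```

A fixture that builds paths from this base instead of the package directory would let the data live in a separate checkout or package while the tests run against the installed pandas.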