pandas: BUILD: missing test data in 2.1.0 sdist/install
Installation check
- I have read the installation guide.
Platform
Linux-6.4.7-gentoo-dist-x86_64-AMD_Ryzen_5_3600_6-Core_Processor-with-glibc2.38
Installation Method
pip install
pandas Version
2.1.0
Python Version
3.11.5
Installation Logs
$ pip install pandas
Collecting pandas
Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/d9/26/895a49ebddb4211f2d777150f38ef9e538deff6df7e179a3624c663efc98/pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy>=1.23.2 (from pandas)
Obtaining dependency information for numpy>=1.23.2 from https://files.pythonhosted.org/packages/32/6a/65dbc57a89078af9ff8bfcd4c0761a50172d90192eaeb1b6f56e5fbf1c3d/numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas)
Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 63.1 MB/s eta 0:00:00
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: pytz, tzdata, six, numpy, python-dateutil, pandas
Successfully installed numpy-1.25.2 pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3 six-1.16.0 tzdata-2023.3
$ cd ${VIRTUAL_ENV}/lib/python3.11/site-packages
# e.g.:
$ python -m pytest pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager -x
========================================================= test session starts =========================================================
platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.3.0
rootdir: /tmp/.venv/lib/python3.11/site-packages/pandas
configfile: pyproject.toml
plugins: asyncio-0.21.1, hypothesis-6.82.7
asyncio: mode=Mode.STRICT
collected 4 items
pandas/tests/io/parser/common/test_file_buffer_url.py F
============================================================== FAILURES ===============================================================
____________________________________________________ test_context_manager[c_high] _____________________________________________________
all_parsers = <pandas.tests.io.parser.conftest.CParserHighMemory object at 0x7ff0df5f2fd0>
datapath = <function datapath.<locals>.deco at 0x7ff0df775440>
def test_context_manager(all_parsers, datapath):
# make sure that opened files are closed
parser = all_parsers
> path = datapath("io", "data", "csv", "iris.csv")
pandas/tests/io/parser/common/test_file_buffer_url.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ('io', 'data', 'csv', 'iris.csv'), path = '/tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv'
def deco(*args):
path = os.path.join(BASE_PATH, *args)
if not os.path.exists(path):
if strict_data_files:
> raise ValueError(
f"Could not find file {path} and --no-strict-data-files is not set."
)
E ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-files is not set.
pandas/conftest.py:1201: ValueError
------------------------------ generated xml file: /tmp/.venv/lib/python3.11/site-packages/test-data.xml ------------------------------
======================================================== slowest 30 durations =========================================================
(3 durations < 0.005s hidden. Use -vv to show these durations.)
======================================================= short test summary info =======================================================
FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager[c_high] - ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-fil...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed in 0.12s ==========================================================
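For readers who want to reproduce the failure mode without a full pandas checkout, the check visible in the traceback can be sketched roughly as follows. This is a minimal stand-in, not the actual pandas/conftest.py: `make_datapath` and `MissingDataFile` are hypothetical names, and the non-strict branch is assumed to skip the test, modelled here with a plain exception so the sketch runs without pytest.

```python
import os


class MissingDataFile(Exception):
    """Hypothetical stand-in for a test skip, so the sketch needs no pytest."""


def make_datapath(base_path, strict_data_files=True):
    """Return a resolver that follows the logic shown in the traceback."""

    def deco(*args):
        path = os.path.join(base_path, *args)
        if not os.path.exists(path):
            if strict_data_files:
                # Same message as in the log above.
                raise ValueError(
                    f"Could not find file {path} and --no-strict-data-files is not set."
                )
            # Assumption: without strict mode the test is skipped, not failed.
            raise MissingDataFile(f"Could not find {path}.")
        return path

    return deco
```

With `strict_data_files=True`, any test whose data file was not installed fails with the `ValueError` seen in the log, which is why a wheel without the data directory cannot pass the suite unmodified.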
<del>The file’s in source directory, so I guess it isn’t installed by meson.</del>
About this issue
- State: open
- Created 10 months ago
- Reactions: 2
- Comments: 24 (23 by maintainers)
Commits related to this issue
- Allow tests to use the data files in the source tree. We don't ship these in the package, but do want to run the tests that use them. tests_path() is removed completely because it is unclear whether i... (committed to raspbian-packages/pandas by deleted user 5 months ago)
- Allow tests to use the data files in the source tree. We don't ship these in the package, but do want to run the tests that use them. tests_path() is removed completely because it is unclear whether i... (committed to raspbian-packages/pandas by deleted user 4 months ago)
The version from https://github.com/gentoo/gentoo/blob/1761e8fcdfda09370046cdd0e382c3aa206d3f61/dev-python/pandas/pandas-2.1.0.ebuild is more up-to-date.
I’ve used --no-strict-data-files for now, but 1) as @bnavigator points out, it’s far from optimal, and 2) it doesn’t seem to cover the lxml tests.
I’ve learned that the .gitattributes entries are apparently necessary because of bad design in meson(-python), which doesn’t allow controlling sdist contents (sigh).
My only idea so far would be to move all the undesirable test data from subdirectories into one git submodule. That should prevent it from being included in the sdist, and make it easy for us to fetch it independently and merge it with the rest.
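To see which paths a .gitattributes file excludes from archives, one can look for the `export-ignore` attribute, which `git archive` honours. The following is a rough sketch, assuming simple whitespace-separated lines; `export_ignored_patterns` is a hypothetical helper, the sample entries are made up for illustration, and quoted patterns containing spaces are not handled.

```python
def export_ignored_patterns(gitattributes_text):
    """List the patterns that carry the export-ignore attribute."""
    patterns = []
    for raw in gitattributes_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "export-ignore" in attrs:
            patterns.append(pattern)
    return patterns


# Illustrative content only, not copied from the pandas repository.
sample = """
# exclude bulky data from archives
pandas/tests/io/data/** export-ignore
doc/** export-ignore
*.py text
"""
print(export_ignored_patterns(sample))
```

Anything matched by such a pattern is left out of archives produced by `git archive`, so a listing like this gives a quick idea of what a packager would have to fetch separately.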
That’s the problem: The direct download link is the same as the “Source code” on the release page: https://github.com/pandas-dev/pandas/archive/refs/tags/v2.1.0.zip does not contain the data. Only a proper git clone will have it.
The directories in it are empty.
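A quick way to confirm whether a given archive actually ships a data file, rather than only the empty directory, is to inspect its member list. This is a small sketch using the standard-library `zipfile` module; `archive_has_file` is a hypothetical helper, and the suffix match assumes the archive wraps everything in a single top-level directory.

```python
import zipfile


def archive_has_file(archive, suffix):
    """Return True if a regular file whose name ends with `suffix` is in the zip.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return any(
            name.endswith(suffix) and not name.endswith("/")
            for name in zf.namelist()
        )
```

For example, `archive_has_file("v2.1.0.zip", "pandas/tests/io/data/csv/iris.csv")` can be run against a locally downloaded copy of the archive to check for the file the failing test asks for.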
Only if there is a setup.py for versioneer to use. But that one is also missing.
Yes, it lacks the test data. We distribution packagers need to run the test suite as completely as possible in order to ensure package integrity.
I can understand shrinking wheel sizes but what I don’t really understand is why you’re also stripping it from GitHub archives that are our last fallback for when sdists are unsuitable for testing.
It’s not just the test data - the documentation, and possibly also some smaller items, have also been removed.
In Debian, I’ve switched to using the git repository itself, so this isn’t blocking for my packaging. I don’t know whether the Gentoo or openSUSE build tools have an equivalent option.
If you do that, you might find my patch for loading test data from a different path useful. (Debian prefers to (also) run tests against the as-installed package, and we don’t want test data taking up space in the user package either.)
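The general idea of loading test data from a different path can be sketched like this. It is not the Debian patch itself: `resolve_test_data_base` and the `PANDAS_TEST_DATA_DIR` environment variable are hypothetical names chosen for illustration, and pandas does not read such a variable.

```python
import os


def resolve_test_data_base(default_base, env=None):
    """Pick the directory that test data paths are resolved against.

    Falls back to `default_base` (the installed package's test directory)
    unless the hypothetical PANDAS_TEST_DATA_DIR variable points elsewhere,
    for instance at a source checkout that still contains the data files.
    """
    env = os.environ if env is None else env
    override = env.get("PANDAS_TEST_DATA_DIR")
    return override if override else default_base
```

With an override in place, the tests run against the as-installed package while the data files are read from the source tree, so the user package does not have to carry them.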