arrow: [Python] BUG: Reading ORC segfaults on windows (if TZDIR isn't set)

Describe the bug, including details regarding any error messages, version, and platform.

ORC support on Windows has not been available in conda-forge for long, and we’ve only recently enabled it for the C++ portion of Arrow. I have now tried to switch it on for pyarrow as well in https://github.com/conda-forge/arrow-cpp-feedstock/pull/1086, and the test suite segfaults as soon as it reaches test_dataset.py::test_orc_format.

stacktrace
[...]
test_dataset.py::test_ipc_format[threaded] PASSED                        [ 20%]
test_dataset.py::test_ipc_format[serial] PASSED                          [ 20%]
Fatal Python error: Aborted

Thread 0x00000ea0 (most recent call first):
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pyarrow\tests\test_dataset.py", line 261 in to_table
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pyarrow\tests\test_dataset.py", line 2991 in test_orc_format
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\python.py", line 194 in pytest_pyfunc_call
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\python.py", line 1799 in runtest
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 169 in pytest_runtest_call
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 262 in <lambda>
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 341 in from_call
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 261 in call_runtest_hook
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 222 in call_and_report
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 133 in runtestprotocol
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\runner.py", line 114 in pytest_runtest_protocol
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\main.py", line 348 in pytest_runtestloop
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\main.py", line 323 in _main
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\main.py", line 269 in wrap_session
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\main.py", line 316 in pytest_cmdline_main
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 166 in main
  File "D:\bld\apache-arrow_1686428319811\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 189 in console_main
  File "D:\bld\apache-arrow_1686428319811\_test_env\Scripts\pytest-script.py", line 9 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, fastparquet.cencoding, fastparquet.speedups, pyarrow.gandiva, pyarrow._acero, pyarrow._csv, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet, pyarrow._dataset_parquet, pyarrow._orc, pyarrow._parquet_encryption, pyarrow._flight, pyarrow._substrait, _cffi_backend, pyarrow._pyarrow_cpp_tests, pyarrow._feather, pyarrow._json, numpy.linalg.lapack_lite, scipy._lib._ccallback_c, scipy.sparse._sparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, 
scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, pyarrow_cython_example, bound_function_visit_strings (total: 104)
Tests failed for pyarrow-tests-12.0.0-py311h385a57a_8_cpu.conda - moving package to D:\bld\broken
WARNING:conda_build.build:Tests failed for pyarrow-tests-12.0.0-py311h385a57a_8_cpu.conda - moving package to D:\bld\broken
TESTS FAILED: pyarrow-tests-12.0.0-py311h385a57a_8_cpu.conda

Component(s)

Format, Packaging, Python

About this issue

  • State: closed
  • Created a year ago
  • Comments: 56 (56 by maintainers)

Most upvoted comments

set TZDIR=%CONDA_PREFIX%\share\zoneinfo

I can confirm that setting TZDIR makes pyarrow built with PYARROW_WITH_ORC=1 pass the test suite also on windows. 🥳
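
For anyone who would rather apply the workaround from inside Python than from the shell, here is a minimal sketch. The helper name `ensure_tzdir` is made up for illustration and is not a pyarrow API. It assumes the tzdb lives under `<prefix>/share/zoneinfo`, as in the `set TZDIR=...` line above. Whether the ORC C++ library picks up a value set this late depends on how it reads the environment, so setting TZDIR in the shell before starting Python remains the form confirmed in this thread.

```python
import os
import sys
from pathlib import Path


def ensure_tzdir(prefix=None):
    """Set TZDIR to <prefix>/share/zoneinfo if it is unset and that directory exists.

    Hypothetical helper, not part of pyarrow. Returns the value of TZDIR
    after the call, or None if it is still unset.
    """
    # Respect a value the user has already configured.
    if os.environ.get("TZDIR"):
        return os.environ["TZDIR"]

    # Assumption: CONDA_PREFIX points at the environment root when a conda
    # environment is active; otherwise fall back to sys.prefix.
    root = Path(prefix or os.environ.get("CONDA_PREFIX") or sys.prefix)
    candidate = root / "share" / "zoneinfo"

    if candidate.is_dir():
        os.environ["TZDIR"] = str(candidate)
        return os.environ["TZDIR"]
    return None
```

Call it before the first ORC read, for example at the top of a conftest.py, so the variable is in place before any timezone lookup happens.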

Thanks a lot for debugging this, great to have this finally sorted out!

How should we fix this though? I see that in #40609 you’re downloading the tzdb, but that’s not really viable for us in conda-forge. It would be good if pyarrow could automatically check %CONDA_PREFIX%\share\zoneinfo when looking for the tzdb (relative to the site-packages directory on windows, the path would be ../../share/zoneinfo).
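
To make the proposed lookup concrete, here is a small sketch of the path arithmetic only. The function name `tzdb_candidate` and the example prefix `D:\env` are made up for illustration. Nothing in this thread indicates that pyarrow performs this lookup today.

```python
from pathlib import PureWindowsPath


def tzdb_candidate(package_dir):
    """Return <site-packages>/../../share/zoneinfo for an installed package directory.

    Hypothetical helper illustrating the proposal, not an existing pyarrow function.
    """
    site_packages = package_dir.parent  # ...\Lib\site-packages
    env_root = site_packages.parent.parent  # two levels up: the environment prefix
    return env_root / "share" / "zoneinfo"


# Windows conda layout: <prefix>\Lib\site-packages\<package>
pkg = PureWindowsPath(r"D:\env\Lib\site-packages\pyarrow")
print(tzdb_candidate(pkg))  # D:\env\share\zoneinfo
```

In a real environment the argument would be something like Path(pyarrow.__file__).parent, and the result would only be used if the directory actually exists.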

Thanks for the debugging!

I started a CI run that sets TZDIR, see https://github.com/conda-forge/arrow-cpp-feedstock/pull/1086/commits/fd6a4f60ba8b78c1696bef2df6e789e82af6b4e2. Given that the build phase passed without issue, I did not set TZDIR in the build scripts, but I can do that as well if it helps.

Hmm. I had hoped that Findlz4Alt.cmake would be found via set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_LIST_DIR}") in ArrowConfig.cmake, but that doesn’t seem to work…

Anyway, this is not related to ORC. We can ignore this by removing lz4Alt from ARROW_SYSTEM_DEPENDENCIES. ARROW_SYSTEM_DEPENDENCIES is only needed for static linking and PyArrow uses shared linking.

The cpp/ build uses its own Findlz4Alt.cmake (no search path needs to be set), but the python/ build uses the Findlz4Alt.cmake installed by cmake --install of cpp/ (so the search path does need to be set). (If my assumption is correct. 😃)

@wgtmac could this be related to the C++ ORC release?

I don’t think so. It seems that this issue exists in the old 1.8.x releases as well.