pandas: Pandas doesn't recognize Pyarrow as a Parquet engine even though it's installed

Code Sample, a copy-pastable example if possible

In [6]: pd.io.parquet.get_engine('auto')
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-77cb1d6c8933> in <module>
----> 1 pd.io.parquet.get_engine('auto')

~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in get_engine(engine)
     30             pass
     31
---> 32         raise ImportError("Unable to find a usable engine; "
     33                           "tried using: 'pyarrow', 'fastparquet'.\n"
     34                           "pyarrow or fastparquet is required for parquet "

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support

Problem description

pandas doesn’t recognize pyarrow as a Parquet engine even though it’s installed: the output of pd.show_versions() below reports pyarrow 0.12.0.

Expected Output

In [2]: pd.io.parquet.get_engine('auto')
Out[2]: <pandas.io.parquet.PyArrowImpl at 0x119c78f28>

Output of pd.show_versions()

In [5]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 3.9.3
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 17 (3 by maintainers)

Most upvoted comments

TLDR: I got it working by uninstalling the conda package and installing pyarrow with pip, so it appears that something is off with that specific conda build. Sorry for the noise.

Details below for others.

I didn’t have multiple versions of pyarrow installed.

I uninstalled via conda, verified I didn’t have pyarrow from pip, reinstalled via conda, and got the same error:

brandon@datascience-dev:~
$ conda uninstall pyarrow parquet-cpp
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/brandon/miniconda3

  removed specs:
    - parquet-cpp
    - pyarrow


The following packages will be REMOVED:

  parquet-cpp-1.5.1-1
  pyarrow-0.11.1-py36hbbcf98d_1002


Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
brandon@datascience-dev:~
$ pip uninstall pyarrow
Skipping pyarrow as it is not installed.
brandon@datascience-dev:~
$ python -c 'import pyarrow'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'
brandon@datascience-dev:~
$ pip show pyarrow
brandon@datascience-dev:~
$ conda list pyarrow
# packages in environment at /home/brandon/miniconda3:
#
# Name                    Version                   Build  Channel
brandon@datascience-dev:~
$ conda install -y pyarrow
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/brandon/miniconda3

  added / updated specs:
    - pyarrow


The following NEW packages will be INSTALLED:

  parquet-cpp        conda-forge/noarch::parquet-cpp-1.5.1-2
  pyarrow            conda-forge/linux-64::pyarrow-0.12.0-py36hbbcf98d_2

The following packages will be UPDATED:

  arrow-cpp                        0.11.1-py36h0e61e49_1004 --> 0.12.0-py36h0e61e49_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
brandon@datascience-dev:~
$ python -c 'import pyarrow'

And then

brandon@datascience-dev:~
$ jupyter console
Jupyter console 6.0.0

Python 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.



In [1]: import pandas as pd

In [2]: pd.io.parquet.PyArrowImpl()
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in __init__(self)
     82             import pyarrow
---> 83             import pyarrow.parquet
     84         except ImportError:

~/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py in <module>
     29 import pyarrow.lib as lib
---> 30 import pyarrow._parquet as _parquet
     31

ImportError: /home/brandon/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.12: undefined symbol: _ZNK5boost16re_detail_10680031cpp_regex_traits_implementationIcE17transform_primaryB5cxx11EPKcS4_

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-2-9b1c87fa6892> in <module>
----> 1 pd.io.parquet.PyArrowImpl()

~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in __init__(self)
     84         except ImportError:
     85             raise ImportError(
---> 86                 "pyarrow is required for parquet support\n\n"
     87                 "you can install via conda\n"
     88                 "conda install pyarrow -c conda-forge\n"

ImportError: pyarrow is required for parquet support

you can install via conda
conda install pyarrow -c conda-forge

or via pip
pip install -U pyarrow

I got it working by uninstalling via conda and installing with pip:

$ pip install -U pyarrow
Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/42/be/8682e7f6c12dd42f31f32b18d5d61b4f578f12aaa9058081aef954a481d3/pyarrow-0.12.0-cp36-cp36m-manylinux1_x86_64.whl (12.5MB)
    100% |████████████████████████████████| 12.5MB 6.0MB/s
Requirement already satisfied, skipping upgrade: six>=1.0.0 in /home/brandon/miniconda3/lib/python3.6/site-packages (from pyarrow) (1.11.0)
Requirement already satisfied, skipping upgrade: numpy>=1.14 in /home/brandon/miniconda3/lib/python3.6/site-packages (from pyarrow) (1.15.4)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.12.0
brandon@datascience-dev:~/miniconda3/lib/python3.6/site-packages
$ python
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.io.parquet.PyArrowImpl()
<pandas.io.parquet.PyArrowImpl object at 0x7fc8ab9ee0f0>

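So it appears that there’s something off about that specific conda build: the undefined Boost symbol in libparquet.so.12 in the traceback above points to a binary mismatch between the conda libparquet and the Boost library in the environment, which the pip wheel avoids.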

Can you debug this any further?

Can you try running

pd.io.parquet.PyArrowImpl()

and posting the traceback?
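pandas catches the ImportError raised while importing pyarrow and replaces it with a generic "pyarrow is required" message, so importing the pieces directly is often more informative. Below is a minimal sketch of that check using only the standard library. The helper name try_import is hypothetical and is not part of pandas or pyarrow.

```python
import importlib
import traceback


def try_import(module_name):
    """Try to import a module by name and return (ok, details).

    On success, details is the file the module was loaded from.
    On failure, details is the full traceback text, which includes
    the underlying error (for example an undefined symbol in a
    shared library) that a wrapper may otherwise hide.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:  # ImportError is the usual case
        return False, traceback.format_exc()
    return True, getattr(module, "__file__", None) or "<no file>"


if __name__ == "__main__":
    # Import the package first, then the submodule pandas needs.
    for name in ("pyarrow", "pyarrow.parquet"):
        ok, details = try_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'}")
        print(details)
```

If pyarrow imports but pyarrow.parquet fails, the problem is likely in the Parquet shared library rather than in pandas, as turned out to be the case in this thread.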

Same issue here: ImportError: Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.

pandas             2.0.3
pyarrow            14.0.1

pd.read_parquet(file_name, engine="pyarrow")

Python 3.11.6

Linux Mint 21.2 Cinnamon
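The thread does not say what caused this later report. One possible cause (an assumption, not confirmed here) is that the code runs under a different interpreter or environment than the one where pip installed pyarrow. The sketch below prints the running interpreter and where each module would be imported from. The helper name locate_module is hypothetical.

```python
import importlib.util
import sys


def locate_module(module_name):
    """Return the path a top-level module would be imported from.

    Returns None if the module cannot be found on sys.path
    (or if it is a namespace package with no single origin).
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name in ("pandas", "pyarrow"):
        print(f"{name}: {locate_module(name)}")
```

If pandas and pyarrow resolve to different site-packages directories, or pyarrow resolves to None, the interpreter in use is not the one pyarrow was installed into.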

Do you have multiple versions of pyarrow installed (perhaps one from pip)?

From your traceback, it seems like the issue is specifically with pyarrow.parquet. I’m not sure that site-packages/pyarrow/../../../libparquet.so.12 is the expected path for libparquet… I’d recommend running conda uninstall pyarrow parquet-cpp and pip uninstall pyarrow, repeating each until it reports that nothing is left to remove.
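To check the multiple-versions question from Python rather than by reading pip and conda output, the installed package metadata can be listed directly. This is a rough sketch assuming Python 3.8 or later for importlib.metadata. The helper name find_distributions is hypothetical. A conda package may not always ship Python metadata, so conda list pyarrow is still worth running as well.

```python
from importlib import metadata  # available in Python 3.8+


def find_distributions(package_name):
    """List (version, location) for each installed distribution
    whose name matches package_name on the current sys.path."""
    wanted = package_name.lower().replace("_", "-")
    found = []
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
            if name.lower().replace("_", "-") != wanted:
                continue
            found.append((dist.version, str(dist.locate_file(""))))
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
    return found


if __name__ == "__main__":
    copies = find_distributions("pyarrow")
    if not copies:
        print("no pyarrow metadata found on sys.path")
    for version, location in copies:
        print(f"pyarrow {version} at {location}")
```

More than one line of output would suggest overlapping installs, which is the situation the repeated uninstall advice is meant to clear up.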

I’m going to close this, since it seems to be an issue with your environment, but please keep posting here in case others run into the same issue.