cudf: [FEA] Readers report which specified types are unsupported
Is your feature request related to a problem? Please describe.
Sometimes cudf.read_csv fails with
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1581433420693/work/cpp/src/io/csv/legacy/csv_reader_impl.cu
when given the dtype=MY_TYPES argument. For example,
from io import StringIO
import cudf
import numpy as np
my_types = {
'frame_time': str,
'frame_number': int,
'ip_src': str,
'tcp_srcport': np.int32,
'ip_dst': str,
'tcp_dstport': np.int32,
'frame_len': int,
'tcp_flags_syn': bool,
'tcp_flags_fin': bool,
}
s = StringIO("""
"Jul 3, 2017 11:55:58.598308000 UTC","1","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598312000 UTC","2","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598313000 UTC","3","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598314000 UTC","4","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598315000 UTC","5","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598316000 UTC","6","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598317000 UTC","7","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598318000 UTC","8","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:56:22.331018000 UTC","20","8.253.185.121","80","192.168.10.14","49486","60","0","1"
"Jul 3, 2017 11:56:22.331021000 UTC","21","8.253.185.121","80","192.168.10.14","49486","60","0","1"
""")
print(cudf.read_csv(s, header=None, names=list(my_types.keys()), dtype=my_types).dtypes)
gives
Traceback (most recent call last):
File "test.py", line 31, in <module>
print(cudf.read_csv(s, header=None, names=list(dtypes.keys()), dtype=dtypes).dtypes)
File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py", line 84, in read_csv
index_col=index_col,
File "cudf/_lib/legacy/csv.pyx", line 37, in cudf._lib.legacy.csv.read_csv
File "cudf/_lib/legacy/csv.pyx", line 227, in cudf._lib.legacy.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:663: Unsupported data type
While swapping in pandas gives:
frame_time object
frame_number int64
ip_src object
tcp_srcport int32
ip_dst object
tcp_dstport int32
frame_len int64
tcp_flags_syn object
tcp_flags_fin object
dtype: object
(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)
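One way to narrow down which entry triggers the failure is to retry the read with one requested dtype at a time. The sketch below is only an illustration: the helper name find_failing_dtypes, its read callback, and the "str" fallback are hypothetical and not part of cuDF. It also assumes the reader accepts the fallback for every column and signals failure with RuntimeError, as in the traceback above.

```python
def find_failing_dtypes(read, dtypes, fallback="str"):
    """Return {column: (requested dtype, error text)} for each dtype that fails alone.

    `read` is any callable that takes a {column: dtype} mapping and raises
    RuntimeError when the reader rejects it (hypothetical interface).
    `fallback` is ASSUMED to be accepted by the reader for every column.
    """
    failures = {}
    for column, requested in dtypes.items():
        # Keep every other column on the fallback so only one dtype is under test.
        trial = {name: fallback for name in dtypes}
        trial[column] = requested
        try:
            read(trial)
        except RuntimeError as err:
            failures[column] = (requested, str(err))
    return failures


# Possible use with cuDF (untested here). "capture.csv" is a hypothetical file
# path; a path is used because a StringIO would be exhausted after the first read.
#
#   import cudf
#   report = find_failing_dtypes(
#       lambda d: cudf.read_csv("capture.csv", header=None, names=list(d), dtype=d),
#       my_types,
#   )
#   print(report)
```

Each trial costs a full read, so this is only practical on a small sample of the data.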
Describe the solution you’d like
If it’s possible, it would be nice to know which type in MY_TYPES is unsupported. Can
and
be extended to support this?
(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
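To illustrate the kind of message being requested, here is a minimal Python sketch of a check that names each offending column instead of raising a bare "Unsupported data type". Both check_dtypes and the supported collection are hypothetical: the real check would live inside the reader, and the allow-list in the example is made up rather than taken from cuDF.

```python
def check_dtypes(dtypes, supported):
    """Raise TypeError naming every column whose requested dtype is not in `supported`."""
    bad = {
        column: requested
        for column, requested in dtypes.items()
        if requested not in supported
    }
    if bad:
        details = ", ".join(
            f"{column!r}: {requested!r}" for column, requested in bad.items()
        )
        raise TypeError(f"Unsupported dtype(s) requested: {details}")


# Made-up allow-list for illustration only; NOT cuDF's real set of supported dtypes.
example_supported = {"int32", "int64", "str"}

try:
    check_dtypes({"frame_number": "int64", "tcp_flags_syn": "bool"}, example_supported)
except TypeError as err:
    message = str(err)
    print(message)
```

An error of this shape would point straight at the column to fix, which is the information missing from the RuntimeError above.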
Describe alternatives you’ve considered
One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).
Additional context
`conda env export` for the above example
name: rapids14
channels:
- rapidsai-nightly
- nvidia
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_llvm
- aiohttp=3.6.2=py37h516909a_0
- appdirs=1.4.3=py_1
- arrow-cpp=0.15.0=py37h090bef1_2
- async-timeout=3.0.1=py_1000
- attrs=19.3.0=py_0
- backcall=0.1.0=py_0
- bleach=3.1.4=pyh9f0ad1d_0
- bokeh=1.4.0=py37hc8dfbb8_1
- boost=1.70.0=py37h9de70de_1
- boost-cpp=1.70.0=h8e57a91_2
- brotli=1.0.7=he1b5a44_1001
- brotlipy=0.7.0=py37h8f50634_1000
- bzip2=1.0.8=h516909a_2
- c-ares=1.15.0=h516909a_1001
- ca-certificates=2020.4.5.1=hecc5488_0
- cairo=1.16.0=hcf35c78_1003
- certifi=2020.4.5.1=py37hc8dfbb8_0
- cffi=1.14.0=py37hd463f26_0
- cfitsio=3.470=hb60a0a2_2
- chardet=3.0.4=py37hc8dfbb8_1006
- click=7.1.1=pyh8c360ce_0
- click-plugins=1.1.1=py_0
- cligj=0.5.0=py_0
- cloudpickle=1.3.0=py_0
- colorcet=2.0.1=py_0
- cryptography=2.8=py37hb09aad4_2
- cudatoolkit=10.1.243=h6bb024c_0
- cudf=0.14.0a200418=py37_3339
- cudnn=7.6.0=cuda10.1_0
- cugraph=0.14.0a200418=py37_299
- cuml=0.14.0a200418=cuda10.1_py37_1429
- cupy=7.3.0=py37h0632833_0
- curl=7.69.1=h33f0ec9_0
- cusignal=0.14.0a200418=py37_179
- cuspatial=0.14.0a200418=py37_169
- cuxfilter=0.14.0a200418=py37_54
- cycler=0.10.0=py_2
- cytoolz=0.10.1=py37h516909a_0
- dask=2.14.0=py_0
- dask-core=2.14.0=py_0
- dask-cuda=0.14.0a200418=py37_43
- dask-cudf=0.14.0a200418=py37_3339
- dask-xgboost=0.2.0.dev28=cuda10.1py36_0
- datashader=0.10.0=py_0
- datashape=0.5.4=py_1
- decorator=4.4.2=py_0
- defusedxml=0.6.0=py_0
- distributed=2.14.0=py37hc8dfbb8_0
- dlpack=0.2=he1b5a44_1
- double-conversion=3.1.5=he1b5a44_2
- entrypoints=0.3=py37hc8dfbb8_1001
- expat=2.2.9=he1b5a44_2
- fastavro=0.23.1=py37h8f50634_0
- fastrlock=0.4=py37h3340039_1001
- fiona=1.8.9.post2=py37hdff7cfa_0
- fontconfig=2.13.1=h86ecdb6_1001
- freetype=2.10.1=he06d7ca_0
- freexl=1.0.5=h14c3975_1002
- fsspec=0.7.2=py_0
- gdal=2.4.4=py37h5f563d9_0
- geopandas=0.7.0=py_1
- geos=3.8.0=he1b5a44_1
- geotiff=1.5.1=h38872f0_8
- gettext=0.19.8.1=hc5be6a0_1002
- gflags=2.2.2=he1b5a44_1002
- giflib=5.1.7=h516909a_1
- glib=2.64.2=h6f030ca_0
- glog=0.4.0=h49b9bf7_3
- grpc-cpp=1.23.0=h18db393_0
- hdf4=4.2.13=hf30be14_1003
- hdf5=1.10.5=nompi_h3c11f04_1104
- heapdict=1.0.1=py_0
- icu=64.2=he1b5a44_1
- idna=2.9=py_1
- imageio=2.8.0=py_0
- importlib-metadata=1.6.0=py37hc8dfbb8_0
- importlib_metadata=1.6.0=0
- ipykernel=5.2.0=py37h43977f1_1
- ipython=7.13.0=py37hc8dfbb8_2
- ipython_genutils=0.2.0=py_1
- jedi=0.17.0=py37hc8dfbb8_0
- jinja2=2.11.2=pyh9f0ad1d_0
- joblib=0.14.1=py_0
- jpeg=9c=h14c3975_1001
- json-c=0.13.1=h14c3975_1001
- jsonschema=3.2.0=py37hc8dfbb8_1
- jupyter-server-proxy=1.3.2=py_0
- jupyter_client=6.1.3=py_0
- jupyter_core=4.6.3=py37hc8dfbb8_1
- kealib=1.4.13=hec59c27_0
- kiwisolver=1.2.0=py37h99015e2_0
- krb5=1.17.1=h2fd8d38_0
- ld_impl_linux-64=2.34=h53a641e_0
- libblas=3.8.0=16_openblas
- libcblas=3.8.0=16_openblas
- libcudf=0.14.0a200418=cuda10.1_3339
- libcugraph=0.14.0a200418=cuda10.1_299
- libcuml=0.14.0a200418=cuda10.1_1429
- libcumlprims=0.14.0a200417=cuda10.1_22
- libcurl=7.69.1=hf7181ac_0
- libcuspatial=0.14.0a200418=cuda10.1_169
- libdap4=3.20.4=hd3bb157_0
- libedit=3.1.20170329=hf8c457e_1001
- libevent=2.1.10=h72c5cf5_0
- libffi=3.2.1=he1b5a44_1007
- libgcc-ng=9.2.0=h24d8f2e_2
- libgdal=2.4.4=h2b6fda6_0
- libgfortran-ng=7.3.0=hdf63c60_5
- libhwloc=2.1.0=h3c4fd83_0
- libiconv=1.15=h516909a_1006
- libkml=1.3.0=h4fcabce_1010
- liblapack=3.8.0=16_openblas
- libllvm8=8.0.1=hc9558a2_0
- libnetcdf=4.7.3=nompi_h9f9fd6a_101
- libnvstrings=0.14.0a200418=cuda10.1_3339
- libopenblas=0.3.9=h5ec1e0e_0
- libpng=1.6.37=hed695b0_1
- libpq=12.2=h5513abc_1
- libprotobuf=3.8.0=h8b12597_0
- librmm=0.14.0a200418=cuda10.1_258
- libsodium=1.0.17=h516909a_0
- libspatialindex=1.9.3=he1b5a44_3
- libspatialite=4.3.0a=ha48a99a_1034
- libssh2=1.8.2=h22169c7_2
- libstdcxx-ng=9.2.0=hdf63c60_2
- libtiff=4.1.0=hfc65ed5_0
- libuuid=2.32.1=h14c3975_1000
- libxcb=1.13=h14c3975_1002
- libxgboost=1.0.2dev.rapidsai0.13=cuda10.1_6
- libxml2=2.9.10=hee79883_0
- llvm-openmp=10.0.0=hc9558a2_0
- llvmlite=0.31.0=py37h5202443_1
- locket=0.2.0=py_2
- lz4-c=1.8.3=he1b5a44_1001
- markdown=3.2.1=py_0
- markupsafe=1.1.1=py37h8f50634_1
- matplotlib-base=3.2.1=py37h30547a4_0
- mistune=0.8.4=py37h8f50634_1001
- msgpack-python=1.0.0=py37h99015e2_1
- multidict=4.7.5=py37h516909a_0
- multipledispatch=0.6.0=py_0
- munch=2.5.0=py_0
- nbconvert=5.6.1=py37hc8dfbb8_1
- nbformat=5.0.4=py_0
- nccl=2.5.7.1=h51cf6c1_0
- ncurses=6.1=hf484d3e_1002
- networkx=2.4=py_1
- notebook=6.0.3=py37_0
- numba=0.48.0=py37hb3f55d8_0
- numpy=1.18.1=py37h8960a57_1
- nvstrings=0.14.0a200418=py37_3339
- olefile=0.46=py_0
- openjpeg=2.3.1=h981e76c_3
- openssl=1.1.1f=h516909a_0
- packaging=20.1=py_0
- pandas=0.25.3=py37hb3f55d8_0
- pandoc=2.9.2.1=0
- pandocfilters=1.4.2=py_1
- panel=0.6.4=0
- param=1.9.3=py_0
- parquet-cpp=1.5.1=2
- parso=0.7.0=pyh9f0ad1d_0
- partd=1.1.0=py_0
- pcre=8.44=he1b5a44_0
- pexpect=4.8.0=py37hc8dfbb8_1
- pickleshare=0.7.5=py37hc8dfbb8_1001
- pillow=7.1.1=py37h718be6c_0
- pip=20.0.2=py_2
- pixman=0.38.0=h516909a_1003
- poppler=0.67.0=h14e79db_8
- poppler-data=0.4.9=1
- postgresql=12.2=h8573dbc_1
- proj=6.3.0=hc80f0dc_0
- prometheus_client=0.7.1=py_0
- prompt-toolkit=3.0.5=py_0
- psutil=5.7.0=py37h8f50634_1
- pthread-stubs=0.4=h14c3975_1001
- ptyprocess=0.6.0=py_1001
- py-xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
- pyarrow=0.15.0=py37h8b68381_1
- pycparser=2.20=py_0
- pyct=0.4.6=py_0
- pyct-core=0.4.6=py_0
- pyee=7.0.1=py_0
- pygments=2.6.1=py_0
- pynvml=8.0.4=py_0
- pyopenssl=19.1.0=py_1
- pyparsing=2.4.7=pyh9f0ad1d_0
- pyppeteer=0.0.25=py_1
- pyproj=2.5.0=py37h8ff28aa_0
- pyrsistent=0.16.0=py37h8f50634_0
- pysocks=1.7.1=py37hc8dfbb8_1
- python=3.7.6=h8356626_5_cpython
- python-dateutil=2.8.1=py_0
- python_abi=3.7=1_cp37m
- pytz=2019.3=py_0
- pyviz_comms=0.7.4=pyh8c360ce_0
- pywavelets=1.1.1=py37h03ebfcd_1
- pyyaml=5.3.1=py37h8f50634_0
- pyzmq=19.0.0=py37hac76be4_1
- rapids=0.14.0=cuda10.1_py37_150
- rapids-xgboost=0.14.0=cuda10.1_py37_150
- re2=2020.04.01=he1b5a44_0
- readline=8.0=hf8c457e_0
- requests=2.23.0=pyh8c360ce_2
- rmm=0.14.0a200418=py37_258
- rtree=0.9.4=py37h8526d28_1
- scikit-image=0.16.2=py37hb3f55d8_0
- scikit-learn=0.22.2.post1=py37hcdab131_0
- scipy=1.4.1=py37ha3d9a3c_3
- send2trash=1.5.0=py_0
- setuptools=46.1.3=py37hc8dfbb8_0
- shapely=1.7.0=py37hb106bac_1
- simpervisor=0.3=py_1
- six=1.14.0=py_1
- snappy=1.1.8=he1b5a44_1
- sortedcontainers=2.1.0=py_0
- sqlite=3.30.1=hcee41ef_0
- tblib=1.6.0=py_0
- terminado=0.8.3=py37hc8dfbb8_1
- testpath=0.4.4=py_0
- thrift-cpp=0.12.0=hf3afdfd_1004
- tk=8.6.10=hed695b0_0
- toolz=0.10.0=py_0
- tornado=6.0.4=py37h8f50634_1
- tqdm=4.45.0=pyh9f0ad1d_0
- traitlets=4.3.3=py37hc8dfbb8_1
- tzcode=2019a=h516909a_1002
- ucx=1.7.0+g9d06c3a=cuda10.1_0
- uriparser=0.9.3=he1b5a44_1
- urllib3=1.25.9=py_0
- wcwidth=0.1.9=pyh9f0ad1d_0
- webencodings=0.5.1=py_1
- websockets=8.1=py37h8f50634_1
- wheel=0.34.2=py_1
- xarray=0.15.1=py_0
- xerces-c=3.2.2=h8412b87_1004
- xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
- xorg-kbproto=1.0.7=h14c3975_1002
- xorg-libice=1.0.10=h516909a_0
- xorg-libsm=1.2.3=h84519dc_1000
- xorg-libx11=1.6.9=h516909a_0
- xorg-libxau=1.0.9=h14c3975_0
- xorg-libxdmcp=1.1.3=h516909a_0
- xorg-libxext=1.3.4=h516909a_0
- xorg-libxrender=0.9.10=h516909a_1002
- xorg-renderproto=0.11.1=h14c3975_1002
- xorg-xextproto=7.3.0=h14c3975_1002
- xorg-xproto=7.0.31=h14c3975_1007
- xz=5.2.5=h516909a_0
- yaml=0.2.3=h516909a_0
- yarl=1.3.0=py37h516909a_1000
- zeromq=4.3.2=he1b5a44_2
- zict=2.0.0=py_0
- zipp=3.1.0=py_0
- zlib=1.2.11=h516909a_1006
- zstd=1.4.3=h3b9ef0a_0
- pip:
- ucx-py==0.14.0a0+133.ge9a2c92
prefix: /home/wbadar/workspace/.miniconda3/envs/rapids14
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (11 by maintainers)
Boom:
Thanks for the suggestion @OlivierNV. If you think it’d be appropriate, I’d be happy to contribute some documentation to clarify the expected use of read_csv’s dtype parameter, to reflect our discussion. Let me know!
Hey @kkraus14, thanks for the tip; just wanted to report that it worked the first time I tried it. It doesn’t seem to be anywhere in the documentation that most people will consult when they’re stuck on this, but I appreciate that you all are refactoring and cleaning things up. Clearing up the documentation seems like it might be a quick fix in the meantime, or maybe a blog post that will show up in search results.
I’ll draft something up!
Also, here’s our call to the legacy reader, since that came up:
https://github.com/rapidsai/cudf/blob/fff2bedc878f37b54fae61c7b4511f1a023556c3/python/cudf/cudf/io/csv.py#L52