cudf: [FEA] Readers report which specified types are unsupported

Is your feature request related to a problem? Please describe.

Sometimes cudf.read_csv fails with

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1581433420693/work/cpp/src/io/csv/legacy/csv_reader_impl.cu

when given the dtype=MY_TYPES argument. For example,

from io import StringIO
import cudf
import numpy as np

my_types = {
   'frame_time': str,
   'frame_number': int,
   'ip_src': str,
   'tcp_srcport': np.int32,
   'ip_dst': str,
   'tcp_dstport': np.int32,
   'frame_len': int,
   'tcp_flags_syn': bool,
   'tcp_flags_fin': bool,
}

s = StringIO("""
    "Jul  3, 2017 11:55:58.598308000 UTC","1","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598312000 UTC","2","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598313000 UTC","3","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598314000 UTC","4","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598315000 UTC","5","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598316000 UTC","6","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598317000 UTC","7","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598318000 UTC","8","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:56:22.331018000 UTC","20","8.253.185.121","80","192.168.10.14","49486","60","0","1"
    "Jul  3, 2017 11:56:22.331021000 UTC","21","8.253.185.121","80","192.168.10.14","49486","60","0","1"
""")

print(cudf.read_csv(s, header=None, names=list(my_types.keys()), dtype=my_types).dtypes)

gives

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    print(cudf.read_csv(s, header=None, names=list(dtypes.keys()), dtype=dtypes).dtypes)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py", line 84, in read_csv
    index_col=index_col,
  File "cudf/_lib/legacy/csv.pyx", line 37, in cudf._lib.legacy.csv.read_csv
  File "cudf/_lib/legacy/csv.pyx", line 227, in cudf._lib.legacy.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:663: Unsupported data type

While swapping in pandas gives:

frame_time       object
frame_number      int64
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int64
tcp_flags_syn    object
tcp_flags_fin    object
dtype: object

(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)

Describe the solution you’d like

If it’s possible, it would be nice to know which type in MY_TYPES is unsupported. Can

https://github.com/rapidsai/cudf/blob/8e90792e58e6dc24dcae78d3806c0536003fd2bb/cpp/src/io/csv/reader_impl.cu#L627-L628

and

https://github.com/rapidsai/cudf/blob/8e90792e58e6dc24dcae78d3806c0536003fd2bb/cpp/src/io/csv/reader_impl.cu#L641-L642

be extended to support this?

(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
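To make the request concrete, here is a minimal Python-side sketch of the kind of message that would help: it names each column whose dtype is not recognized. The `ASSUMED_SUPPORTED` set and the `unsupported_dtypes` helper are hypothetical names used only for illustration; they are not part of cudf, and the real list of supported names depends on the cudf version.

```python
# Hypothetical set of accepted dtype names; NOT taken from cudf itself.
ASSUMED_SUPPORTED = {"str", "int", "int32", "int64", "float32", "float64", "bool"}


def unsupported_dtypes(dtypes, supported=ASSUMED_SUPPORTED):
    """Return {column: dtype} for every entry that is not a supported name."""
    return {
        col: dt
        for col, dt in dtypes.items()
        if not (isinstance(dt, str) and dt in supported)
    }


# Type objects (str, bool) are flagged; the string name "int" is accepted.
my_types = {"frame_time": str, "frame_number": "int", "tcp_flags_syn": bool}

bad = unsupported_dtypes(my_types)
if bad:
    details = ", ".join(f"{col}={dt!r}" for col, dt in bad.items())
    message = f"Unsupported data type for column(s): {details}"
    print(message)
```

An error raised with a message like this would point straight at the offending entries instead of only at a source line in the reader.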

Describe alternatives you’ve considered

One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).

Additional context

`conda env export` for the above example
name: rapids14
channels:
  - rapidsai-nightly
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_llvm
  - aiohttp=3.6.2=py37h516909a_0
  - appdirs=1.4.3=py_1
  - arrow-cpp=0.15.0=py37h090bef1_2
  - async-timeout=3.0.1=py_1000
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py_0
  - bleach=3.1.4=pyh9f0ad1d_0
  - bokeh=1.4.0=py37hc8dfbb8_1
  - boost=1.70.0=py37h9de70de_1
  - boost-cpp=1.70.0=h8e57a91_2
  - brotli=1.0.7=he1b5a44_1001
  - brotlipy=0.7.0=py37h8f50634_1000
  - bzip2=1.0.8=h516909a_2
  - c-ares=1.15.0=h516909a_1001
  - ca-certificates=2020.4.5.1=hecc5488_0
  - cairo=1.16.0=hcf35c78_1003
  - certifi=2020.4.5.1=py37hc8dfbb8_0
  - cffi=1.14.0=py37hd463f26_0
  - cfitsio=3.470=hb60a0a2_2
  - chardet=3.0.4=py37hc8dfbb8_1006
  - click=7.1.1=pyh8c360ce_0
  - click-plugins=1.1.1=py_0
  - cligj=0.5.0=py_0
  - cloudpickle=1.3.0=py_0
  - colorcet=2.0.1=py_0
  - cryptography=2.8=py37hb09aad4_2
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudf=0.14.0a200418=py37_3339
  - cudnn=7.6.0=cuda10.1_0
  - cugraph=0.14.0a200418=py37_299
  - cuml=0.14.0a200418=cuda10.1_py37_1429
  - cupy=7.3.0=py37h0632833_0
  - curl=7.69.1=h33f0ec9_0
  - cusignal=0.14.0a200418=py37_179
  - cuspatial=0.14.0a200418=py37_169
  - cuxfilter=0.14.0a200418=py37_54
  - cycler=0.10.0=py_2
  - cytoolz=0.10.1=py37h516909a_0
  - dask=2.14.0=py_0
  - dask-core=2.14.0=py_0
  - dask-cuda=0.14.0a200418=py37_43
  - dask-cudf=0.14.0a200418=py37_3339
  - dask-xgboost=0.2.0.dev28=cuda10.1py36_0
  - datashader=0.10.0=py_0
  - datashape=0.5.4=py_1
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - distributed=2.14.0=py37hc8dfbb8_0
  - dlpack=0.2=he1b5a44_1
  - double-conversion=3.1.5=he1b5a44_2
  - entrypoints=0.3=py37hc8dfbb8_1001
  - expat=2.2.9=he1b5a44_2
  - fastavro=0.23.1=py37h8f50634_0
  - fastrlock=0.4=py37h3340039_1001
  - fiona=1.8.9.post2=py37hdff7cfa_0
  - fontconfig=2.13.1=h86ecdb6_1001
  - freetype=2.10.1=he06d7ca_0
  - freexl=1.0.5=h14c3975_1002
  - fsspec=0.7.2=py_0
  - gdal=2.4.4=py37h5f563d9_0
  - geopandas=0.7.0=py_1
  - geos=3.8.0=he1b5a44_1
  - geotiff=1.5.1=h38872f0_8
  - gettext=0.19.8.1=hc5be6a0_1002
  - gflags=2.2.2=he1b5a44_1002
  - giflib=5.1.7=h516909a_1
  - glib=2.64.2=h6f030ca_0
  - glog=0.4.0=h49b9bf7_3
  - grpc-cpp=1.23.0=h18db393_0
  - hdf4=4.2.13=hf30be14_1003
  - hdf5=1.10.5=nompi_h3c11f04_1104
  - heapdict=1.0.1=py_0
  - icu=64.2=he1b5a44_1
  - idna=2.9=py_1
  - imageio=2.8.0=py_0
  - importlib-metadata=1.6.0=py37hc8dfbb8_0
  - importlib_metadata=1.6.0=0
  - ipykernel=5.2.0=py37h43977f1_1
  - ipython=7.13.0=py37hc8dfbb8_2
  - ipython_genutils=0.2.0=py_1
  - jedi=0.17.0=py37hc8dfbb8_0
  - jinja2=2.11.2=pyh9f0ad1d_0
  - joblib=0.14.1=py_0
  - jpeg=9c=h14c3975_1001
  - json-c=0.13.1=h14c3975_1001
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter-server-proxy=1.3.2=py_0
  - jupyter_client=6.1.3=py_0
  - jupyter_core=4.6.3=py37hc8dfbb8_1
  - kealib=1.4.13=hec59c27_0
  - kiwisolver=1.2.0=py37h99015e2_0
  - krb5=1.17.1=h2fd8d38_0
  - ld_impl_linux-64=2.34=h53a641e_0
  - libblas=3.8.0=16_openblas
  - libcblas=3.8.0=16_openblas
  - libcudf=0.14.0a200418=cuda10.1_3339
  - libcugraph=0.14.0a200418=cuda10.1_299
  - libcuml=0.14.0a200418=cuda10.1_1429
  - libcumlprims=0.14.0a200417=cuda10.1_22
  - libcurl=7.69.1=hf7181ac_0
  - libcuspatial=0.14.0a200418=cuda10.1_169
  - libdap4=3.20.4=hd3bb157_0
  - libedit=3.1.20170329=hf8c457e_1001
  - libevent=2.1.10=h72c5cf5_0
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgdal=2.4.4=h2b6fda6_0
  - libgfortran-ng=7.3.0=hdf63c60_5
  - libhwloc=2.1.0=h3c4fd83_0
  - libiconv=1.15=h516909a_1006
  - libkml=1.3.0=h4fcabce_1010
  - liblapack=3.8.0=16_openblas
  - libllvm8=8.0.1=hc9558a2_0
  - libnetcdf=4.7.3=nompi_h9f9fd6a_101
  - libnvstrings=0.14.0a200418=cuda10.1_3339
  - libopenblas=0.3.9=h5ec1e0e_0
  - libpng=1.6.37=hed695b0_1
  - libpq=12.2=h5513abc_1
  - libprotobuf=3.8.0=h8b12597_0
  - librmm=0.14.0a200418=cuda10.1_258
  - libsodium=1.0.17=h516909a_0
  - libspatialindex=1.9.3=he1b5a44_3
  - libspatialite=4.3.0a=ha48a99a_1034
  - libssh2=1.8.2=h22169c7_2
  - libstdcxx-ng=9.2.0=hdf63c60_2
  - libtiff=4.1.0=hfc65ed5_0
  - libuuid=2.32.1=h14c3975_1000
  - libxcb=1.13=h14c3975_1002
  - libxgboost=1.0.2dev.rapidsai0.13=cuda10.1_6
  - libxml2=2.9.10=hee79883_0
  - llvm-openmp=10.0.0=hc9558a2_0
  - llvmlite=0.31.0=py37h5202443_1
  - locket=0.2.0=py_2
  - lz4-c=1.8.3=he1b5a44_1001
  - markdown=3.2.1=py_0
  - markupsafe=1.1.1=py37h8f50634_1
  - matplotlib-base=3.2.1=py37h30547a4_0
  - mistune=0.8.4=py37h8f50634_1001
  - msgpack-python=1.0.0=py37h99015e2_1
  - multidict=4.7.5=py37h516909a_0
  - multipledispatch=0.6.0=py_0
  - munch=2.5.0=py_0
  - nbconvert=5.6.1=py37hc8dfbb8_1
  - nbformat=5.0.4=py_0
  - nccl=2.5.7.1=h51cf6c1_0
  - ncurses=6.1=hf484d3e_1002
  - networkx=2.4=py_1
  - notebook=6.0.3=py37_0
  - numba=0.48.0=py37hb3f55d8_0
  - numpy=1.18.1=py37h8960a57_1
  - nvstrings=0.14.0a200418=py37_3339
  - olefile=0.46=py_0
  - openjpeg=2.3.1=h981e76c_3
  - openssl=1.1.1f=h516909a_0
  - packaging=20.1=py_0
  - pandas=0.25.3=py37hb3f55d8_0
  - pandoc=2.9.2.1=0
  - pandocfilters=1.4.2=py_1
  - panel=0.6.4=0
  - param=1.9.3=py_0
  - parquet-cpp=1.5.1=2
  - parso=0.7.0=pyh9f0ad1d_0
  - partd=1.1.0=py_0
  - pcre=8.44=he1b5a44_0
  - pexpect=4.8.0=py37hc8dfbb8_1
  - pickleshare=0.7.5=py37hc8dfbb8_1001
  - pillow=7.1.1=py37h718be6c_0
  - pip=20.0.2=py_2
  - pixman=0.38.0=h516909a_1003
  - poppler=0.67.0=h14e79db_8
  - poppler-data=0.4.9=1
  - postgresql=12.2=h8573dbc_1
  - proj=6.3.0=hc80f0dc_0
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - psutil=5.7.0=py37h8f50634_1
  - pthread-stubs=0.4=h14c3975_1001
  - ptyprocess=0.6.0=py_1001
  - py-xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - pyarrow=0.15.0=py37h8b68381_1
  - pycparser=2.20=py_0
  - pyct=0.4.6=py_0
  - pyct-core=0.4.6=py_0
  - pyee=7.0.1=py_0
  - pygments=2.6.1=py_0
  - pynvml=8.0.4=py_0
  - pyopenssl=19.1.0=py_1
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyppeteer=0.0.25=py_1
  - pyproj=2.5.0=py37h8ff28aa_0
  - pyrsistent=0.16.0=py37h8f50634_0
  - pysocks=1.7.1=py37hc8dfbb8_1
  - python=3.7.6=h8356626_5_cpython
  - python-dateutil=2.8.1=py_0
  - python_abi=3.7=1_cp37m
  - pytz=2019.3=py_0
  - pyviz_comms=0.7.4=pyh8c360ce_0
  - pywavelets=1.1.1=py37h03ebfcd_1
  - pyyaml=5.3.1=py37h8f50634_0
  - pyzmq=19.0.0=py37hac76be4_1
  - rapids=0.14.0=cuda10.1_py37_150
  - rapids-xgboost=0.14.0=cuda10.1_py37_150
  - re2=2020.04.01=he1b5a44_0
  - readline=8.0=hf8c457e_0
  - requests=2.23.0=pyh8c360ce_2
  - rmm=0.14.0a200418=py37_258
  - rtree=0.9.4=py37h8526d28_1
  - scikit-image=0.16.2=py37hb3f55d8_0
  - scikit-learn=0.22.2.post1=py37hcdab131_0
  - scipy=1.4.1=py37ha3d9a3c_3
  - send2trash=1.5.0=py_0
  - setuptools=46.1.3=py37hc8dfbb8_0
  - shapely=1.7.0=py37hb106bac_1
  - simpervisor=0.3=py_1
  - six=1.14.0=py_1
  - snappy=1.1.8=he1b5a44_1
  - sortedcontainers=2.1.0=py_0
  - sqlite=3.30.1=hcee41ef_0
  - tblib=1.6.0=py_0
  - terminado=0.8.3=py37hc8dfbb8_1
  - testpath=0.4.4=py_0
  - thrift-cpp=0.12.0=hf3afdfd_1004
  - tk=8.6.10=hed695b0_0
  - toolz=0.10.0=py_0
  - tornado=6.0.4=py37h8f50634_1
  - tqdm=4.45.0=pyh9f0ad1d_0
  - traitlets=4.3.3=py37hc8dfbb8_1
  - tzcode=2019a=h516909a_1002
  - ucx=1.7.0+g9d06c3a=cuda10.1_0
  - uriparser=0.9.3=he1b5a44_1
  - urllib3=1.25.9=py_0
  - wcwidth=0.1.9=pyh9f0ad1d_0
  - webencodings=0.5.1=py_1
  - websockets=8.1=py37h8f50634_1
  - wheel=0.34.2=py_1
  - xarray=0.15.1=py_0
  - xerces-c=3.2.2=h8412b87_1004
  - xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - xorg-kbproto=1.0.7=h14c3975_1002
  - xorg-libice=1.0.10=h516909a_0
  - xorg-libsm=1.2.3=h84519dc_1000
  - xorg-libx11=1.6.9=h516909a_0
  - xorg-libxau=1.0.9=h14c3975_0
  - xorg-libxdmcp=1.1.3=h516909a_0
  - xorg-libxext=1.3.4=h516909a_0
  - xorg-libxrender=0.9.10=h516909a_1002
  - xorg-renderproto=0.11.1=h14c3975_1002
  - xorg-xextproto=7.3.0=h14c3975_1002
  - xorg-xproto=7.0.31=h14c3975_1007
  - xz=5.2.5=h516909a_0
  - yaml=0.2.3=h516909a_0
  - yarl=1.3.0=py37h516909a_1000
  - zeromq=4.3.2=he1b5a44_2
  - zict=2.0.0=py_0
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h516909a_1006
  - zstd=1.4.3=h3b9ef0a_0
  - pip:
    - ucx-py==0.14.0a0+133.ge9a2c92
prefix: /home/wbadar/workspace/.miniconda3/envs/rapids14

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Boom:

In [2]: t = {'frame_time': 'str', 'frame_numer': 'int', 'ip_src': 'str', 'tcp_srcport': 'int', 'ip_dst': 'str', 'tcp_dstport': 'int', 'frame_len': 'int', 'tcp_flags_syn': 'bool', 'tcp_flags_fin': 'bool'}

In [3]: cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
Out[3]:
                                  frame_time  frame_numer         ip_src  tcp_srcport         ip_dst  tcp_dstport  frame_len  tcp_flags_syn  tcp_flags_fin
0      "Jul  3, 2017 11:55:58.598308000 UTC"            1  8.254.250.126           80   192.168.10.5        49188         60          False           True
1      "Jul  3, 2017 11:55:58.598312000 UTC"            2  8.254.250.126           80   192.168.10.5        49188         60          False           True
2      "Jul  3, 2017 11:55:58.598313000 UTC"            3  8.254.250.126           80   192.168.10.5        49188         60          False           True
3      "Jul  3, 2017 11:55:58.598314000 UTC"            4  8.254.250.126           80   192.168.10.5        49188         60          False           True
4      "Jul  3, 2017 11:55:58.598315000 UTC"            5  8.254.250.126           80   192.168.10.5        49188         60          False           True
5      "Jul  3, 2017 11:55:58.598316000 UTC"            6  8.254.250.126           80   192.168.10.5        49188         60          False           True
6      "Jul  3, 2017 11:55:58.598317000 UTC"            7  8.254.250.126           80   192.168.10.5        49188         60          False           True
7      "Jul  3, 2017 11:55:58.598318000 UTC"            8  8.254.250.126           80   192.168.10.5        49188         60          False           True
8      "Jul  3, 2017 11:56:22.331018000 UTC"           20  8.253.185.121           80  192.168.10.14        49486         60          False           True
9      "Jul  3, 2017 11:56:22.331021000 UTC"           21  8.253.185.121           80  192.168.10.14        49486         60          False           True

In [4]: _.dtypes
Out[4]:
frame_time       object
frame_numer       int32
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int32
tcp_flags_syn      bool
tcp_flags_fin      bool
dtype: object
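The session above works because the dtypes are passed as string names rather than Python type objects. A small helper along these lines could convert an existing dict of type objects into names. This is only a sketch: `dtype_names` is a hypothetical helper, and it assumes the reader accepts each type's `__name__` (the session above shows 'str', 'int' and 'bool' being accepted; other names such as 'int32' are an assumption here).

```python
def dtype_names(dtypes):
    """Convert {column: type-or-name} into {column: name}.

    Strings pass through unchanged; type objects are replaced by their
    __name__ (str -> 'str', int -> 'int', bool -> 'bool').
    """
    names = {}
    for col, dt in dtypes.items():
        names[col] = dt if isinstance(dt, str) else dt.__name__
    return names


my_types = {"frame_time": str, "frame_number": int, "tcp_flags_syn": bool}
t = dtype_names(my_types)
print(t)

# The converted names can then be passed as in the session above:
# cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
```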

Thanks for the suggestion @OlivierNV. If you think it’d be appropriate, I’d be happy to contribute some documentation to clarify the expected use of read_csv’s dtype parameter, to reflect our discussion. Let me know!

Hey @kkraus14, thanks for the tip – just wanted to report that it worked the first time I tried it. It doesn’t seem to be anywhere in the documentation, which is where most people will look when they’re stuck on this, but I appreciate that you are all refactoring and cleaning things up. Clearing up the documentation seems like it might be a quick fix in the meantime, or maybe a blog post that will show up in search results.

I’ll draft something up!

Also, here’s our call to the legacy reader, since that came up:

https://github.com/rapidsai/cudf/blob/fff2bedc878f37b54fae61c7b4511f1a023556c3/python/cudf/cudf/io/csv.py#L52