cudf: [BUG] can't DMatrix cuDF in xgboost 0.90.rapidsdev1

I’m working with a cuDF DataFrame called df2:

type(df2)
cudf.core.dataframe.DataFrame

X, y = df2.drop('16', axis=1), df2['16']

type(X)
cudf.core.dataframe.DataFrame

type(y)
cudf.core.series.Series

param = {'objective': 'binary:logistic', 
         'tree_method': 'gpu_hist',
         #'tree_method': 'hist',
         'eval_metric': 'logloss',
         }

train=xgboost.DMatrix(X, label=y)

I got the following errors:

ValueError: cannot copy sequence with size 629470 to array axis with dimension 70
ValueError: unrecognized csr_matrix constructor usage
TypeError: can not initialize DMatrix from DataFrame

However, if I convert X and y to pandas, everything works:

type(df2)
cudf.core.dataframe.DataFrame

X, y = df2.drop('16', axis=1).to_pandas(), df2['16'].to_pandas()

type(X)
pandas.core.frame.DataFrame

type(y)
pandas.core.series.Series

param = {'objective': 'binary:logistic', 
         'tree_method': 'gpu_hist',
         #'tree_method': 'hist',
         'eval_metric': 'logloss',
         }

%%time
train=xgboost.DMatrix(X, label=y)
model=xgboost.train(param,train)
CPU times: user 1.5 s, sys: 834 ms, total: 2.34 s
Wall time: 2.14 s

Am I missing something?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

@ivenzor, got the dataset. Thanks for sharing. The issue seems to be in creating a DMatrix from int and float columns that contain Nones.

You don’t see this error with pandas because pandas upcasts int columns containing Nones to a float dtype.
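The upcasting behavior described above can be checked on the pandas side with a small sketch. It only illustrates what pandas does with integer data that contains None; the variable names are illustrative, and it does not reproduce the cuDF code path that fails.

```python
import numpy as np
import pandas as pd

# Integer values plus a None: pandas cannot store the missing value in an
# integer column, so it infers float64 and represents the gap as NaN.
s = pd.Series([0, 1, 2, None])
print(s.dtype)          # dtype chosen by pandas for int data with a None
print(s.isna().sum())   # number of missing entries

# Casting to float32 keeps the NaN, which is the representation the
# suggested workaround aims for on the cuDF side.
X32 = pd.DataFrame({"x": s}).astype(np.float32)
print(X32["x"].dtype)
```

Because the missing value is already a floating-point NaN by the time the pandas frame reaches DMatrix, no separate validity mask is involved in that path.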

Currently suggested workaround: you can upcast to floats as below to match the pandas behavior.

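import numpy as np
# Upcast every column to float32 so that missing values become NaN, as in pandas
for col in X.columns:
    X[col] = X[col].astype(np.float32).fillna(np.nan)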

In the meantime, I will look into resolving this.

Minimal example for issue:

import cudf
import xgboost
import numpy as np

X = cudf.DataFrame({'x':cudf.Series([0,1,2,None],dtype=np.int32)})
y = cudf.Series([0,1,0,1])

train=xgboost.DMatrix(X, label=y)

Error Trace:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-3-ed0be75e1d09> in <module>
----> 1 train=xgboost.DMatrix(X, label=y)

/opt/conda/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in __init__(self, data, label, missing, weight, silent, feature_names, feature_types, nthread)
    505             self._init_from_dt(data, nthread)
    506         elif _use_columnar_initializer(data):
--> 507             self._init_from_columnar(data, missing)
    508         else:
    509             try:

/opt/conda/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _init_from_columnar(self, df, missing)
    644             _LIB.XGDMatrixCreateFromArrayInterfaces(
    645                 interfaces, ctypes.c_int32(has_missing),
--> 646                 ctypes.c_float(missing), ctypes.byref(handle)))
    647         self.handle = handle
    648 

/opt/conda/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
    198     """
    199     if ret != 0:
--> 200         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    201 
    202 

XGBoostError: [20:50:46] /workspace/src/data/columnar.h:145: Check failed: get<Integer>(j_shape.front()) % 8 == 0 (4 vs. 0) : Length of validity mask must be a multiple of 8 bytes.
Stack trace:
  [bt] (0) /opt/conda/envs/rapids/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f8bf39aec64]
  [bt] (1) /opt/conda/envs/rapids/xgboost/libxgboost.so(xgboost::ArrayInterfaceHandler::ExtractMask(std::map<std::string, xgboost::Json, std::less<std::string>, std::allocator<std::pair<std::string const, xgboost::Json> > > const&, xgboost::common::Span<unsigned char, -1l>*)+0x2bd) [0x7f8bf3bb253d]
  [bt] (2) /opt/conda/envs/rapids/xgboost/libxgboost.so(void xgboost::data::CountValid<int>(std::vector<xgboost::Json, std::allocator<xgboost::Json> > const&, unsigned int, bool, float, xgboost::HostDeviceVector<unsigned long>*, thrust::device_vector<int, dh::detail::XGBCachingDeviceAllocatorImpl<int> >*, unsigned int*)+0x97) [0x7f8bf3bb4b17]
  [bt] (3) /opt/conda/envs/rapids/xgboost/libxgboost.so(xgboost::data::SimpleCSRSource::FromDeviceColumnar(std::vector<xgboost::Json, std::allocator<xgboost::Json> > const&, bool, float)+0xc3e) [0x7f8bf3baeade]
  [bt] (4) /opt/conda/envs/rapids/xgboost/libxgboost.so(xgboost::data::SimpleCSRSource::CopyFrom(std::string const&, bool, float)+0x1213) [0x7f8bf3a030b3]
  [bt] (5) /opt/conda/envs/rapids/xgboost/libxgboost.so(XGDMatrixCreateFromArrayInterfaces+0x6b) [0x7f8bf39a703b]
  [bt] (6) /opt/conda/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f8c38d82630]
  [bt] (7) /opt/conda/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f8c38d81fed]
  [bt] (8) /opt/conda/envs/rapids/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f8c37de9fce]
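The failed check in the error message is a divisibility test on the validity mask length, and `(4 vs. 0)` is the pair of values compared: the remainder was 4 where 0 was expected. The sketch below reproduces only that arithmetic. It is not XGBoost's code, and `mask_length_ok` and `padded_length` are hypothetical helper names introduced for illustration.

```python
def mask_length_ok(length: int) -> bool:
    """Mirror the condition in the message: length must be a multiple of 8."""
    return length % 8 == 0


def padded_length(length: int) -> int:
    """Round a length up to the next multiple of 8 (ceiling division)."""
    return -(-length // 8) * 8


# Any length with remainder 4 (4, 12, 20, ...) would fail in the way the
# message reports; lengths such as 8 or 64 would pass.
for n in (4, 8, 12, 64):
    print(n, n % 8, mask_length_ok(n), padded_length(n))
```

This shows why a small nullable column can trip the check while the float upcast avoids it: once missing values are stored as NaN, there is no mask whose length needs validating.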