cudf: [DISCUSSION] libcudf should not introspect input data to perform error checking

Is your feature request related to a problem? Please describe.

A few functions in libcudf optionally validate that input data is valid before performing the operation. For example, https://github.com/rapidsai/cudf/blob/31d746605fb4a3fd4b24a5732aed16ce7944adf7/cpp/include/cudf/copying.hpp#L62-L66

cudf::gather has a check_bounds bool parameter that enables verifying if the values in the gather_map are within bounds. This requires launching a kernel to introspect the gather map data: https://github.com/rapidsai/cudf/blob/31d746605fb4a3fd4b24a5732aed16ce7944adf7/cpp/src/copying/gather.cu#L30-L39

The reason for this verification is that cuDF Python expects to throw an exception if any of the values are out of bounds.

However, there is no reason for libcudf to be performing this verification directly inside of the gather implementation. The cuDF Python bindings for gather can easily add this bounds checking as a pre-processing step.

cudf::repeat is a similar example of this behavior:

https://github.com/rapidsai/cudf/blob/b72e6476325532ee0ea112f604e23db928a4b24d/cpp/include/cudf/filling.hpp#L118-L122

Having this bounds checking inside the libcudf function is detrimental for a number of reasons:

It complicates the libcudf implementation. The gather implementation is quite complicated because of all the branches inside, including whether or not we need to check bounds. It’s not as simple as just a single if/else because the check_bounds flag interacts with other internal flags such as allow_negative_indices. It would pretty significantly simplify gathers implementation to move this check outside of gather’s implementation.
It violates the single responsibility principle. Verifying the validity of input data should be independent of performing the actual operation.
It is not in line with the semantics of other C++ libraries. For example, thrust::gather does not perform any validation of the map data.

Describe the solution you’d like

Optional input data validation like in gather or repeat should be eliminated from libcudf features.

Python features that rely on this bounds checking should implement it as a pre-processing step using existing libcudf primitives, or identify new primitives that would enable the necessary validation.

Additional context

To be clear, I am not suggesting libcudf functions do not perform any error checking whatsoever. I am suggesting we remove any error checking that requires introspection of input data (i.e., launching a kernel). Performing checks for data to be the right data type, size, etc. must still be preserved. An obvious indication of functions doing this today are optional boolean flags like in gather or repeat.

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 23 (20 by maintainers)

Commits related to this issue

Initial draft of policies and guidelines for libcudf usage. (#11853) This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are... — committed to rapidsai/cudf by vyasr 2 years ago
Remove validation that requires introspection (#11938) This PR removes optional validation for some APIs. Performing these validations requires data introspection, which we do not want. This PR resol... — committed to rapidsai/cudf by vyasr 2 years ago

Most upvoted comments

Sounds good – I’ll benchmark and report here, and we can decide based on that?

shwina on Jul 22, 2020

We could even add a minmax reduction to do this in a single operation.

jrhemstad on Jul 23, 2020