cudf: [BUG] strings::concatenate can overflow and cause data corruption

Describe the bug

This is actually a generic issue with a lot of string operations. If I call strings::concatenate on string columns that would produce more data than can fit in a string column (2 GiB), I get an overflow where cuDF tries to allocate a negative amount of memory. If I go even larger, past 4 GiB of data in the final output, the size overflows twice and we end up allocating a too-small buffer and walking off the end of the data.

I know we don't want to introspect the data for error checks. I have tried to prototype this myself, and I am no C++ expert, but I have heard that @davidwendt did something similar in another place. The idea is to do the exclusive scan as an int64_t instead of an int32_t and, when writing each result to d_offsets, cast it back to an int32_t. The last offset, the one we actually care about, would also be saved to another place as an int64_t; that is the length of the character buffer. We could then do size checks on that value and verify it will fit before we try to allocate it.
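Something along these lines is what I have in mind. This is only a rough sketch, not libcudf code: the helper name is made up, and it uses a temporary 64-bit offsets buffer for brevity, whereas the proposal above would cast to int32_t inside the scan's output path so no extra buffer (and no extra pass) is needed.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <cstdint>
#include <limits>
#include <stdexcept>

struct narrow_to_int32 {
  __host__ __device__ int32_t operator()(int64_t v) const { return static_cast<int32_t>(v); }
};

// Scan the per-row sizes in 64-bit so the running total cannot wrap, check the
// total against the column size limit, and only then produce int32_t offsets.
thrust::device_vector<int32_t> sizes_to_offsets_checked(
  thrust::device_vector<int32_t> const& d_sizes)
{
  auto const num_rows = d_sizes.size();

  // Exclusive scan with an int64_t accumulator (the init value sets the type).
  thrust::device_vector<int64_t> d_offsets64(num_rows + 1, 0);
  thrust::exclusive_scan(d_sizes.begin(), d_sizes.end(), d_offsets64.begin(), int64_t{0});

  // Total output bytes = last offset + last row size, still in 64 bits.
  int64_t const total_bytes =
    num_rows == 0 ? 0
                  : static_cast<int64_t>(d_offsets64[num_rows - 1]) +
                      static_cast<int64_t>(d_sizes[num_rows - 1]);

  // The check this issue is asking for: fail loudly instead of letting the
  // size wrap into a negative (or small positive) allocation.
  if (total_bytes > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("output size exceeds the string column limit");
  }
  d_offsets64[num_rows] = total_bytes;

  // The narrowing cast back to int32_t offsets is now known to be safe.
  thrust::device_vector<int32_t> d_offsets(num_rows + 1);
  thrust::transform(d_offsets64.begin(), d_offsets64.end(), d_offsets.begin(), narrow_to_int32{});
  return d_offsets;
}
```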

Even if the fix did require an extra kernel call in the common path, there is no way for a user to detect this type of overflow ahead of time, so doing the check outside the library would mean everyone makes two new kernel calls: one to compute the lengths just like today, and another to do a SUM over all of those values as a long to see if we would overflow. That is a huge overhead compared to making a hopefully small modification to the existing code. I also don't see how the change would slow things down, because the int64_t values would only exist within the GPU kernel; they would not be read or written to memory, except for the very last offset.
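For comparison, this is roughly what every caller would have to bolt on today (the names are made up): the first extra kernel has to reproduce the per-row size computation the library already performs, and the second is the 64-bit SUM shown here.

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cstdint>
#include <limits>

// d_row_sizes: per-row output byte counts, produced by an extra kernel that
// duplicates the size computation the string operation will do anyway.
bool would_overflow(thrust::device_vector<int32_t> const& d_row_sizes)
{
  // Second extra kernel: SUM the 32-bit sizes with an int64_t accumulator so
  // the check itself cannot wrap.
  int64_t const total = thrust::reduce(
    d_row_sizes.begin(), d_row_sizes.end(), int64_t{0}, thrust::plus<int64_t>{});
  return total > std::numeric_limits<int32_t>::max();
}
```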

Steps/Code to reproduce bug

Make a byte column where every value is 10 and the length is max_value<int32_t> / 2 + 1. Convert that byte column to strings; the size will overflow and cuDF will try to allocate a negative value. If you want it to walk off the end of the buffer instead, use a short column with the value 1010 in it. This time, when converting to strings, it writes off the end of memory, and I often see "an illegal memory access was encountered".
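A sketch of the first case in libcudf terms. The factory calls here (make_column_from_scalar, strings::from_integers) are just my guess at the shortest way to build the described input; any int8 column of that length with the value 10 will do. Note the expected output alone needs over 2 GiB of GPU memory.

```cpp
#include <cudf/column/column_factories.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/convert/convert_integers.hpp>
#include <limits>

int main()
{
  // max_value<int32_t> / 2 + 1 rows, each holding the int8 value 10.
  auto const num_rows = std::numeric_limits<int32_t>::max() / 2 + 1;
  cudf::numeric_scalar<int8_t> ten(10);
  auto input = cudf::make_column_from_scalar(ten, num_rows);

  // Each row renders as "10" (2 bytes), so the output needs just over 2^31
  // bytes: the offsets overflow and a negative allocation is attempted.
  // Using an int16 column holding 1010 ("1010", 4 bytes per row) pushes the
  // total past 4 GiB, the size wraps twice, and the kernel writes past the
  // end of the too-small chars buffer.
  auto result = cudf::strings::from_integers(input->view());
  return 0;
}
```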

Expected behavior

We get an exception back saying the input is too large, instead of cuDF trying to allocate a negative size or, even worse, a really small positive one.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

The first bug I would fix is that we should raise an error when the concatenated size will exceed the limit. This is what cudf::concatenate already does.
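Roughly the kind of guard being referred to (a paraphrase for illustration, not the actual libcudf source):

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
#include <cstddef>
#include <limits>

// Once the combined character count is known, compare it against the
// size_type limit before any allocation happens.
void check_output_size(std::size_t total_bytes)
{
  CUDF_EXPECTS(
    total_bytes <= static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max()),
    "total size of output exceeds the column size limit");
}
```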

This is not the same concatenate. This one is a horizontal concatenate of row-wise elements; there is no computation of output_size today, so adding that check would require an extra kernel call.

char_size() would give a good estimate, though it may not be accurate for sliced columns. It is still a good maximum, though.
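For reference, a sketch of that kind of pre-check, assuming the strings_column_view::chars_size() accessor (spelled char_size() above) and a made-up helper name; separator bytes, if any, would still need to be added on top.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cstdint>
#include <limits>
#include <vector>

// Sum the input columns' character bytes as an upper bound on the output.
// For a sliced column this counts the whole underlying chars buffer, so it
// over-counts, which is why it still works as a conservative maximum: if even
// the over-count fits, the real output fits.
bool outputs_may_overflow(std::vector<cudf::column_view> const& inputs)
{
  int64_t upper_bound = 0;
  for (auto const& col : inputs) {
    upper_bound += static_cast<int64_t>(cudf::strings_column_view{col}.chars_size());
  }
  return upper_bound > std::numeric_limits<int32_t>::max();
}
```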