cudf: Use "ranger" to prevent grid stride loop overflow
(updated Aug 2023)
## Background
We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (https://github.com/rapidsai/cudf/issues/10333). The same issue is present in a number of our CUDA kernels that take the following form:
```cpp
size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;
while (output_index < output_size) {
  // ... process the element at output_index ...
  output_index += blockDim.x * gridDim.x;
}
```
If we have an output_size of, say, 1.2 billion and a grid sized to match, the following happens: some late thread index, say 1.19 billion, attempts to add the grid stride of 1.2 billion (blockDim.x * gridDim.x). The sum, 2.39 billion, exceeds the maximum of size_type (signed 32 bits, INT32_MAX ≈ 2.147 billion), so the increment overflows and the behavior is undefined.
We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.
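For illustration, here is a minimal sketch of the overflow-safe form of the loop above, assuming a 64-bit index alias like cudf's `thread_index_type`. The kernel name and body are made up for this example; the `grid_1d::global_thread_id()`/`grid_1d::grid_stride()` helpers mentioned in the commits below wrap the same casts:

```cpp
#include <cudf/types.hpp>  // cudf::size_type (int32_t), cudf::thread_index_type (int64_t)

// Hypothetical example kernel: copies input to output with a grid stride loop.
__global__ void example_copy_kernel(int const* input,
                                    int* output,
                                    cudf::size_type output_size)
{
  // Widen to 64 bits *before* the multiply/add so neither the initial index
  // nor the stride increment can wrap a signed 32-bit value.
  auto output_index =
    static_cast<cudf::thread_index_type>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const stride = static_cast<cudf::thread_index_type>(blockDim.x) * gridDim.x;
  while (output_index < output_size) {
    output[output_index] = input[output_index];
    output_index += stride;  // 64-bit addition: cannot overflow at these sizes
  }
}
```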
## Part 1 - First pass: fix kernels with this issue
| Source file | Kernels | Status |
|---|---|---|
| `copying/concatenate.cu` | `fused_concatenate_kernel` | #10448 |
| `valid_if.cuh` | `valid_if_kernel` | #10448 |
| `scatter.cu` | `marking_bitmask_kernel` | #10448 |
| `replace/nulls.cu` | `replace_nulls_strings` | #10448 |
| `replace/nulls.cu` | `replace_nulls` | #10448 |
| `rolling/rolling_detail.cuh` | `gpu_rolling` | #10448 |
| `rolling/jit/kernel.cu` | `gpu_rolling_new` | #10448 |
| `transform/compute_column.cu` | `compute_column_kernel` | #10448 |
| `copying/concatenate.cu` | `fused_concatenate_string_offset_kernel` | #13838 |
| `replace/replace.cu` | `replace_strings_first_pass`, `replace_strings_second_pass`, `replace_kernel` | #13905 |
| `copying/concatenate.cu` | `concatenate_masks_kernel`, `fused_concatenate_string_offset_kernel`, `fused_concatenate_string_chars_kernel`, `fused_concatenate_kernel` (int64) | #13906 |
| `hash/helper_functions.cuh` | `init_hashtbl` | #13895 |
| `null_mask.cu` | `set_null_mask_kernel`, `copy_offset_bitmask`, `count_set_bits_kernel` | #13895 |
| `transform/row_bit_count.cu` | `compute_row_sizes` | #13895 |
| `multibyte_split.cu` | `multibyte_split_init_kernel`, `multibyte_split_seed_kernel` (auto??), `multibyte_split_kernel` | #13910 |
| IO modules: parquet, orc, json | | #13910 |
| `io/utilities/parsing_utils.cu` | `count_and_set_positions` (uint64_t) | #13910 |
| `conditional_join_kernels.cuh` | `compute_conditional_join_output_size`, `conditional_join` | #13971 |
| `merge.cu` | `materialize_merged_bitmask_kernel` | #13972 |
| `partitioning.cu` | `compute_row_partition_numbers`, `compute_row_output_locations`, `copy_block_partitions` | #13973 |
| `json_path.cu` | `get_json_object_kernel` | #13962 |
| tdigest | `compute_percentiles_kernel` (int) | #13962 |
| `strings/attributes.cu` | `count_characters_parallel_fn` | #13968 |
| `strings/convert/convert_urls.cu` | `url_decode_char_counter` (int), `url_decode_char_replacer` (int) | #13968 |
| `text/subword/data_normalizer.cu` | `kernel_data_normalizer` (uint32_t) | #13915 |
| `text/subword/subword_tokenize.cu` | `kernel_compute_tensor_metadata` (uint32_t) | #13915 |
| `text/subword/wordpiece_tokenizer.cu` | `init_data_and_mark_word_start_and_ends` (uint32_t), `mark_string_start_and_ends` (uint32_t), `kernel_wordpiece_tokenizer` (uint32_t) | #13915 |
## Part 2 - Take another pass over more challenging kernels
| Source file | Kernels | Status |
|---|---|---|
| `null_mask.cuh` | `subtract_set_bits_range_boundaries_kernel` | |
| `valid_if.cuh` | `valid_if_n_kernel` | |
| `copy_if_else.cuh` | `copy_if_else_kernel` | |
| `gather.cuh` | `gather_chars_fn_string_parallel` | |
| more? | Search for `gridDim.x` or `blockDim.x` to find more examples. | |
## Part 3 - Use ranger to prevent grid stride loop overflow
- Incorporate the ranger header as a libcudf utility.
- Use ranger instead of manual indexing in libcudf kernels (see the sketch below).
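A hedged sketch of what that could look like, assuming ranger's `grid_stride_range(begin, end)` helper as shown in its README; the include path is hypothetical, pending how the header gets vendored into libcudf:

```cpp
#include <ranger.hpp>     // hypothetical include path once vendored as a libcudf utility

#include <cudf/types.hpp>

// The same copy kernel as above, rewritten with a range-based for loop.
// The index/stride arithmetic (and its overflow hazard) lives inside the
// range helper instead of being repeated by hand in every kernel.
__global__ void example_copy_kernel(int const* input,
                                    int* output,
                                    cudf::size_type output_size)
{
  for (auto output_index : grid_stride_range(
         cudf::thread_index_type{0},
         static_cast<cudf::thread_index_type>(output_size))) {
    output[output_index] = input[output_index];
  }
}
```

The design win is that the choice of a wide index type is made once, inside the helper, rather than at every call site.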
## Additional information
There are also a number of kernels that have this pattern but probably never overflow because they index by bitmask words rather than by element. (Example)
Additionally, in this kernel, `source_idx` probably overflows, but harmlessly.
A snippet of code to see this in action:

```cpp
// Build a ~1.2 billion row column, then trigger the overflow via rolling_window.
size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);
auto x   = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{});
```
Note: rmm may mask out-of-bounds accesses in some cases (e.g., when a pooled suballocation is surrounded by valid memory), so it's helpful to run with the plain CUDA allocator.
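For reference, a minimal sketch of selecting the plain CUDA allocator through RMM's device-resource API (the `main` scaffolding is illustrative):

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // A pass-through resource that calls cudaMalloc/cudaFree directly, so
  // tools like compute-sanitizer can observe true out-of-bounds accesses.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::set_current_device_resource(&cuda_mr);

  // ... run the repro snippet above here ...
  return 0;
}
```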
## Commits related to this issue
- Batch of fixes for index overflows in grid stride loops. (#10448) Partially addresses https://github.com/rapidsai/cudf/issues/10368 Specifically: - `valid_if` - `scatter` - `rolling_window` -... — committed to rapidsai/cudf by nvdbaranec 2 years ago
- Use grid_stride for stride computations. (#13996) This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll ne... — committed to rapidsai/cudf by bdice 10 months ago
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) See #10368 (and more recently #13771). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Br... — committed to rapidsai/cudf by vyasr 10 months ago
Regarding attempting to find locations where this might be happening: in host code, clang and gcc will warn under some circumstances if you add `-Wsign-conversion` (not covered by `-Wall -Wextra`). Unfortunately, there is no such option for nvcc.
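As an illustration of that host-side warning, a hedged sketch (the variable names are made up, and the exact diagnostic text varies by compiler):

```cpp
// Compile with: g++ -c -Wsign-conversion sketch.cpp
int main()
{
  int output_index   = 1'190'000'000;  // signed 32-bit, like cudf::size_type
  unsigned block_dim = 256;
  unsigned grid_dim  = 4'687'500;      // block_dim * grid_dim = 1.2 billion

  // gcc/clang warn here: the signed accumulator is mixed with the unsigned
  // product, and the result is converted back to 'int', which may change
  // its value -- the same wraparound the device loop suffers silently.
  output_index += block_dim * grid_dim;
  return 0;
}
```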