cudf: Use "ranger" to prevent grid stride loop overflow
(updated Aug 2023)
## Background
We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (https://github.com/rapidsai/cudf/issues/10333). The same issue is present in a number of our CUDA kernels that take the following form:
```cpp
size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;
while (output_index < output_size) {
  // ... process the element at output_index ...
  output_index += blockDim.x * gridDim.x;
}
```
If we have an output_size of, say, 1.2 billion and a grid sized to match, the following happens: some late thread index, say 1.19 billion, attempts to add the grid stride of 1.2 billion (blockDim.x * gridDim.x). The sum, 2.39 billion, exceeds the maximum of size_type (signed 32 bits, INT32_MAX ≈ 2.147 billion), so the increment overflows and the behavior is undefined.
We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.
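For illustration, here is a minimal sketch of the overflow-safe form of the loop above, assuming a 64-bit index alias like cudf's `thread_index_type`. The kernel name and body are made up for this example; the `grid_1d::global_thread_id()`/`grid_1d::grid_stride()` helpers mentioned in the commits below wrap the same casts:

```cpp
#include <cudf/types.hpp>  // cudf::size_type (int32_t), cudf::thread_index_type (int64_t)

// Hypothetical example kernel: copies input to output with a grid stride loop.
__global__ void example_copy_kernel(int const* input,
                                    int* output,
                                    cudf::size_type output_size)
{
  // Widen to 64 bits *before* the multiply/add so neither the initial index
  // nor the stride increment can wrap a signed 32-bit value.
  auto output_index =
    static_cast<cudf::thread_index_type>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const stride = static_cast<cudf::thread_index_type>(blockDim.x) * gridDim.x;
  while (output_index < output_size) {
    output[output_index] = input[output_index];
    output_index += stride;  // 64-bit addition: cannot overflow at these sizes
  }
}
```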
## Part 1 - First pass: fix kernels with this issue
| Source file | Kernels | Status |
|---|---|---|
| `copying/concatenate.cu` | `fused_concatenate_kernel` | #10448 |
| `valid_if.cuh` | `valid_if_kernel` | #10448 |
| `scatter.cu` | `marking_bitmask_kernel` | #10448 |
| `replace/nulls.cu` | `replace_nulls_strings` | #10448 |
| `replace/nulls.cu` | `replace_nulls` | #10448 |
| `rolling/rolling_detail.cuh` | `gpu_rolling` | #10448 |
| `rolling/jit/kernel.cu` | `gpu_rolling_new` | #10448 |
| `transform/compute_column.cu` | `compute_column_kernel` | #10448 |
| `copying/concatenate.cu` | `fused_concatenate_string_offset_kernel` | #13838 |
| `replace/replace.cu` | `replace_strings_first_pass`, `replace_strings_second_pass`, `replace_kernel` | #13905 |
| `copying/concatenate.cu` | `concatenate_masks_kernel`, `fused_concatenate_string_offset_kernel`, `fused_concatenate_string_chars_kernel`, `fused_concatenate_kernel` (int64) | #13906 |
| `hash/helper_functions.cuh` | `init_hashtbl` | #13895 |
| `null_mask.cu` | `set_null_mask_kernel`, `copy_offset_bitmask`, `count_set_bits_kernel` | #13895 |
| `transform/row_bit_count.cu` | `compute_row_sizes` | #13895 |
| `multibyte_split.cu` | `multibyte_split_init_kernel`, `multibyte_split_seed_kernel` (auto??), `multibyte_split_kernel` | #13910 |
| IO modules: parquet, orc, json | | #13910 |
| `io/utilities/parsing_utils.cu` | `count_and_set_positions` (uint64_t) | #13910 |
| `conditional_join_kernels.cuh` | `compute_conditional_join_output_size`, `conditional_join` | #13971 |
| `merge.cu` | `materialize_merged_bitmask_kernel` | #13972 |
| `partitioning.cu` | `compute_row_partition_numbers`, `compute_row_output_locations`, `copy_block_partitions` | #13973 |
| `json_path.cu` | `get_json_object_kernel` | #13962 |
| tdigest | `compute_percentiles_kernel` (int) | #13962 |
| `strings/attributes.cu` | `count_characters_parallel_fn` | #13968 |
| `strings/convert/convert_urls.cu` | `url_decode_char_counter` (int), `url_decode_char_replacer` (int) | #13968 |
| `text/subword/data_normalizer.cu` | `kernel_data_normalizer` (uint32_t) | #13915 |
| `text/subword/subword_tokenize.cu` | `kernel_compute_tensor_metadata` (uint32_t) | #13915 |
| `text/subword/wordpiece_tokenizer.cu` | `init_data_and_mark_word_start_and_ends` (uint32_t), `mark_string_start_and_ends` (uint32_t), `kernel_wordpiece_tokenizer` (uint32_t) | #13915 |
## Part 2 - Take another pass over more challenging kernels
| Source file | Kernels | Status |
|---|---|---|
| `null_mask.cuh` | `subtract_set_bits_range_boundaries_kernel` | |
| `valid_if.cuh` | `valid_if_n_kernel` | |
| `copy_if_else.cuh` | `copy_if_else_kernel` | |
| `gather.cuh` | `gather_chars_fn_string_parallel` | |
| more? | Search for `gridDim.x` or `blockDim.x` to find more examples. | |
## Part 3 - Use ranger to prevent grid stride loop overflow
- Incorporate the ranger header as a libcudf utility.
- Use ranger instead of manual indexing in libcudf kernels (see the sketch below).
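A hedged sketch of what that could look like, assuming ranger's `grid_stride_range(begin, end)` helper as shown in its README; the include path is hypothetical, pending how the header gets vendored into libcudf:

```cpp
#include <ranger.hpp>     // hypothetical include path once vendored as a libcudf utility

#include <cudf/types.hpp>

// The same copy kernel as above, rewritten with a range-based for loop.
// The index/stride arithmetic (and its overflow hazard) lives inside the
// range helper instead of being repeated by hand in every kernel.
__global__ void example_copy_kernel(int const* input,
                                    int* output,
                                    cudf::size_type output_size)
{
  for (auto output_index : grid_stride_range(
         cudf::thread_index_type{0},
         static_cast<cudf::thread_index_type>(output_size))) {
    output[output_index] = input[output_index];
  }
}
```

The design win is that the choice of a wide index type is made once, inside the helper, rather than at every call site.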
## Additional information
There are also a number of kernels that have this pattern but probably never overflow because they index by bitmask words rather than by element. (Example)
Additionally, in this kernel, `source_idx` probably overflows, but harmlessly.
A snippet of code to see this in action:

```cpp
// Build a ~1.2 billion row column, then trigger the overflow via rolling_window.
size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);
auto x   = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{});
```
Note: rmm may mask out-of-bounds accesses in some cases (e.g., when a pooled suballocation is surrounded by valid memory), so it's helpful to run with the plain CUDA allocator.
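For reference, a minimal sketch of selecting the plain CUDA allocator through RMM's device-resource API (the `main` scaffolding is illustrative):

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // A pass-through resource that calls cudaMalloc/cudaFree directly, so
  // tools like compute-sanitizer can observe true out-of-bounds accesses.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::set_current_device_resource(&cuda_mr);

  // ... run the repro snippet above here ...
  return 0;
}
```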
## Commits related to this issue
- Batch of fixes for index overflows in grid stride loops. (#10448) Partially addresses https://github.com/rapidsai/cudf/issues/10368 Specifically: - `valid_if` - `scatter` - `rolling_window` -... — committed to rapidsai/cudf by nvdbaranec 2 years ago
- Use grid_stride for stride computations. (#13996) This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll ne... — committed to rapidsai/cudf by bdice 10 months ago
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) See #10368 (and more recently #13771). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Br... — committed to rapidsai/cudf by vyasr 10 months ago
Regarding attempting to find locations where this might be happening: in host code, clang and gcc will warn under some circumstances if you add `-Wsign-conversion` (not covered by `-Wall -Wextra`). Unfortunately, there is no such option for nvcc.
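As an illustration of that host-side warning, a hedged sketch (the variable names are made up, and the exact diagnostic text varies by compiler):

```cpp
// Compile with: g++ -c -Wsign-conversion sketch.cpp
int main()
{
  int output_index   = 1'190'000'000;  // signed 32-bit, like cudf::size_type
  unsigned block_dim = 256;
  unsigned grid_dim  = 4'687'500;      // block_dim * grid_dim = 1.2 billion

  // gcc/clang warn here: the signed accumulator is mixed with the unsigned
  // product, and the result is converted back to 'int', which may change
  // its value -- the same wraparound the device loop suffers silently.
  output_index += block_dim * grid_dim;
  return 0;
}
```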