cudf: [FEA] Abstract away block reduce and block scan from cuIO kernels

The ORC and Parquet writer kernels use block reductions and block scans, but each kernel writes out the entire logic inline.

Here’s a block reduction:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L431-L434

Here’s a block scan:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L266-L284

This code performs a block scan on dupes_before, but the scan is interspersed with other logic.

Proposed fix

  • Implement a block scan and a block reduction and add them to block_utils.cuh. It would be helpful if they could be called by just the active threads rather than all threads (perhaps with creative use of coalesced-thread masks), but this is not too important, as every current call site already ensures that __syncthreads() can be called.
  • Alternatively, use an off-the-shelf CUB implementation.
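
As a rough illustration of the first option, a hand-written block scan could be factored out along the following lines. This is a minimal sketch, not code from cuDF: the names block_exclusive_sum, scan_kernel and exclusive_sum_128 are hypothetical, it assumes a 1-D block in which every thread calls the helper, and it uses a simple shared-memory Hillis-Steele scan rather than anything tuned.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not part of cuDF's block_utils.cuh).
// Must be called by every thread of a 1-D block of exactly `block_size` threads.
// `scratch` must point to `block_size` ints of shared memory.
// Returns the exclusive prefix sum for the calling thread and stores the
// sum over the whole block in `block_total`.
template <int block_size>
__device__ int block_exclusive_sum(int value, int* scratch, int& block_total)
{
  int const t = threadIdx.x;
  scratch[t]  = value;
  __syncthreads();
  // Hillis-Steele inclusive scan over shared memory.
  for (int offset = 1; offset < block_size; offset *= 2) {
    int const addend = (t >= offset) ? scratch[t - offset] : 0;
    __syncthreads();  // every read of this round finishes before any write
    scratch[t] += addend;
    __syncthreads();  // every write finishes before the next round's reads
  }
  int const inclusive = scratch[t];
  block_total         = scratch[block_size - 1];
  __syncthreads();  // lets the caller reuse `scratch` afterwards
  return inclusive - value;  // convert inclusive to exclusive
}

// The kernel body now only states *that* a scan happens, not how.
template <int block_size>
__global__ void scan_kernel(int const* in, int* out, int* total)
{
  __shared__ int scratch[block_size];
  int block_total;
  out[threadIdx.x] = block_exclusive_sum<block_size>(in[threadIdx.x], scratch, block_total);
  if (threadIdx.x == 0) { *total = block_total; }
}

// Host wrapper for experimentation. Assumes `input` holds exactly 128 values.
// CUDA error checking is omitted for brevity.
std::vector<int> exclusive_sum_128(std::vector<int> const& input, int& total)
{
  constexpr int n = 128;
  int *d_in, *d_out, *d_total;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMalloc(&d_total, sizeof(int));
  cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
  scan_kernel<n><<<1, n>>>(d_in, d_out, d_total);
  std::vector<int> output(n);
  cudaMemcpy(output.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFree(d_total);
  return output;
}
```

Because the helper synchronizes with __syncthreads(), it cannot be called from a divergent branch; supporting only the active threads would need a different design.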

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Personally, I always pass the -v option to ptxas at build time during development (I can’t imagine doing any serious CUDA work without it). I also consider any non-zero local memory (lmem) usage a bug, with very rare exceptions.

Just trying to help. All I’m saying is make sure to check such assumptions against reality.

“Use an off-the-shelf CUB implementation”

This is almost certainly the way to go. It’s likely that the CUB implementations will be faster as well.
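
For readers unfamiliar with CUB, the route suggested here might look roughly like the sketch below. It is an illustration only and is not taken from the cuDF sources: the kernel and wrapper names (cub_reduce_and_scan, reduce_and_scan_128) are hypothetical, while cub::BlockReduce, cub::BlockScan, Sum and ExclusiveSum are the existing CUB block-level primitives. Like the hand-written code it would replace, it assumes every thread in the block participates.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cub/block/block_scan.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one block reduces and scans `block_size` ints.
template <int block_size>
__global__ void cub_reduce_and_scan(int const* in, int* scan_out, int* sum_out)
{
  using BlockReduce = cub::BlockReduce<int, block_size>;
  using BlockScan   = cub::BlockScan<int, block_size>;
  // Each primitive gets its own scratch space, so no extra barrier is
  // needed between the two calls.
  __shared__ typename BlockReduce::TempStorage reduce_storage;
  __shared__ typename BlockScan::TempStorage scan_storage;

  int const value = in[threadIdx.x];

  // Block-wide sum; the return value is only valid in thread 0.
  int const sum = BlockReduce(reduce_storage).Sum(value);
  if (threadIdx.x == 0) { *sum_out = sum; }

  // Block-wide exclusive prefix sum; every thread receives its own prefix.
  int prefix = 0;
  BlockScan(scan_storage).ExclusiveSum(value, prefix);
  scan_out[threadIdx.x] = prefix;
}

// Host wrapper for experimentation. Assumes `input` holds exactly 128 values.
// CUDA error checking is omitted for brevity.
std::vector<int> reduce_and_scan_128(std::vector<int> const& input, int& sum)
{
  constexpr int n = 128;
  int *d_in, *d_scan, *d_sum;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_scan, n * sizeof(int));
  cudaMalloc(&d_sum, sizeof(int));
  cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
  cub_reduce_and_scan<n><<<1, n>>>(d_in, d_scan, d_sum);
  std::vector<int> scan(n);
  cudaMemcpy(scan.data(), d_scan, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_scan);
  cudaFree(d_sum);
  return scan;
}
```

Whether this is actually faster than the existing kernels, or changes register and local memory usage, would need to be measured rather than assumed.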