cudf: [FEA] Abstract away block reduce and block scan from cuIO kernels

The ORC and Parquet writer kernels use block reductions and block scans, but each one writes out the entire logic inline inside the kernel.

Here’s a block reduction:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L431-L434

Here’s a block scan:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L266-L284

This code performs a block scan on `dupes_before`, but the code for the scan is interspersed with other logic.

Proposed fix

Either:

- Implement a block scan and a block reduction and add them to block_utils.cuh. It would be helpful if these could be called by just the active threads rather than all threads (perhaps with creative use of coalesced thread masks), but this is not too important, as all the places they would be called from already ensure that __syncthreads() can be called.
- Use an off-the-shelf CUB implementation.
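For illustration, here is a minimal sketch of what the CUB option could look like, using `cub::BlockReduce` and `cub::BlockScan`. The kernel name `reduce_and_scan`, the host helper `launch_reduce_and_scan`, and the fixed `block_size` of 256 are assumptions made up for this example; none of them come from the cuIO code, and the real kernels would scan values such as `dupes_before` rather than a plain input array.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int block_size = 256;  // hypothetical block size for this sketch

// Each thread contributes one value. The block-wide sum is written to
// *block_sum and each thread's exclusive prefix sum to prefix[threadIdx.x].
__global__ void reduce_and_scan(int const* in, int* block_sum, int* prefix)
{
  using BlockReduce = cub::BlockReduce<int, block_size>;
  using BlockScan   = cub::BlockScan<int, block_size>;

  // The two collectives run one after the other, so they can share scratch space.
  __shared__ union {
    typename BlockReduce::TempStorage reduce;
    typename BlockScan::TempStorage scan;
  } temp;

  int const value = in[threadIdx.x];

  // Block reduction: the returned sum is only valid in thread 0.
  int const sum = BlockReduce(temp.reduce).Sum(value);
  if (threadIdx.x == 0) { *block_sum = sum; }

  // Barrier required before the shared scratch space is reused.
  __syncthreads();

  // Block scan: every thread receives the sum of the values before it.
  int exclusive = 0;
  BlockScan(temp.scan).ExclusiveSum(value, exclusive);
  prefix[threadIdx.x] = exclusive;
}

// Hypothetical host helper: `in` must hold exactly block_size elements.
// Error checking is omitted to keep the sketch short.
void launch_reduce_and_scan(std::vector<int> const& in, int& sum, std::vector<int>& prefix)
{
  size_t const bytes = block_size * sizeof(int);
  int *d_in = nullptr, *d_sum = nullptr, *d_prefix = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_sum, sizeof(int));
  cudaMalloc(&d_prefix, bytes);
  cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

  reduce_and_scan<<<1, block_size>>>(d_in, d_sum, d_prefix);

  prefix.resize(block_size);
  cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(prefix.data(), d_prefix, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_sum);
  cudaFree(d_prefix);
}

int main()
{
  std::vector<int> in(block_size, 1);  // every thread contributes 1
  std::vector<int> prefix;
  int sum = 0;
  launch_reduce_and_scan(in, sum, prefix);
  std::printf("sum = %d, prefix[255] = %d\n", sum, prefix[255]);
  return 0;
}
```

Both CUB collectives must be called by every thread in the block, so in a sketch like this a thread with nothing to contribute would pass the identity value (0 for a sum) rather than skip the call.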

About this issue

State: closed
Created 4 years ago
Comments: 16 (16 by maintainers)

Commits related to this issue

Abstracting block reduce and block scan from cuIO kernels with `cub` apis (#7278) closes #6238 This PR replaces existing usages of `warp_reduce` or `warp_scans` which were used for block reductio... — committed to rapidsai/cudf by rgsl888prabhu 3 years ago

Most upvoted comments

Personally, I always use the ptxas -v option at build time during development (can’t imagine doing any serious CUDA work without it). I also consider any non-zero lmem usage a bug, with very rare exceptions.

OlivierNV on Sep 22, 2020

Just trying to help. All I’m saying is make sure to check such assumptions against reality.

OlivierNV on Sep 21, 2020

> Use an off the shelf cub implementation

This is almost certainly the way to go. It’s likely that the CUB implementations will be faster as well.

jrhemstad on Sep 15, 2020