datafusion: Optimizer is slow: Avoid too many string cloning in the optimizer
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I’m not sure this is a feature request, but at least this is not a bug (albeit it’s a performance problem), so I write this issue as a feature request.
I’m benchmarking the optimizers of DataFusion and Calcite.
I intended to compare the quality of the optimized plans between them, assuming that DataFusion’s optimizing speed would be faster (since it’s written in Rust).
But to my surprise, I found that Calcite’s optimizer is way faster (~ 20x) in some cases.
The case is as follows:
- Query statement is very simple: “select column_1 from table_1”
- table_1 has about 700 columns (Yes, it has so many columns, that’s the problem).
While Calcite finished the optimization in about 7 msec, DataFusion’s optimizer took about 120 msec.
At first, the number was worse, but it settled to about 120 msec when I set the global allocator to mimalloc. (I’ve tried snmalloc and it was somewhat faster - about 100 msec. But somehow snmalloc did not play well with valgrind, I chose mimalloc at least temporarily)
I ran the test program with valgrind / callgrind and drew the call graph. The graph showed that about half of the execution time is being spent on <alloc::string::String as core::clone::Clone>::clone
. The call count was 3,930,814.
I ran the optimizer for another table with fewer columns (about 200 columns), and it took much less time - about 12msec.
So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.
Describe the solution you’d like
Perhaps removing unnecessary cloning may help. Or, make the fields immutable and manage them with reference counted smart pointers.
Describe alternatives you’ve considered
No alternatives.
Additional context
The following attachment is the call graph in SVG format. It was created by gprof2dot.py and dot with callgrind’s output data. ‘batch_test’ is the name of my test program. Somewhat contrary to the name, The program only tests one query statement.
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 3
- Comments: 18 (16 by maintainers)
I forgot out great that description was. FYI @matthewmturner for your inspiration
@alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues. Thanks for your great work. By the way, at the moment, I’m generally satisfied with the performance of logical planning and optimization routines after applying some optimizations here and there, as I explained in https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1815885644. It would be nice if these optimizations could be incorporated into the official source code, perhaps with some adjustments if needed.
Seems reasonable to me. Maybe as some intermediate state we could have a
StringInterner
structure that computed the hash +Arc<str>
and tried to reuse any previously known about.Then we could thread through a StringInterner when possible (e.g. on the optimizer or the sql planner) but also be able to make these strings easily without one (so we could incrementally update the code)
Any such solution, I think, should strive very hard to keep the burden of working in the DataFusion codebase low (e.g. should be easy to work with for anyone used to String, be well documented, etc)
Agreed though then we have to thread the HashMap through
Another possibility would be to use
Arc<str>
(aka refcounted strings) – not quite as cheap to clone / create but also don’t need any external context