datafusion: Incorrect results for join condition against current master branch
Describe the bug
Join condition(on
with between
) works incorrectly.
Looks like ignored and returned cartesian product.
It used to work for latest stable release (15.0.0
from crates.io
)
But I tested it against current master branch, hash 3cc607de4ce6e9e1fd537091e471858c62f58653
.
To Reproduce
Steps to reproduce the behavior:
students.csv
:
name,mark
Stuart,28
Amina,89
Christen,50
Salma,77
Samantha,21
grades.csv
:
grade,min,max
1,0,14
2,15,35
3,36,55
4,56,79
5,80,100
MRE:
use datafusion::prelude::{CsvReadOptions, SessionContext};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let students_path = "../students.csv";
let grades_path = "../grades.csv";
// Datafusion execution
let ctx = SessionContext::new();
ctx.register_csv("students", students_path, CsvReadOptions::default())
.await?;
ctx.register_csv("grades", grades_path, CsvReadOptions::default())
.await?;
let data_frame = ctx.sql("SELECT s.*, g.grade FROM students s join grades g on s.mark between g.min and g.max WHERE grade > 2 ORDER BY s.mark DESC").await?;
data_frame.show().await?;
Ok(())
}
It will return:
+----------+------+-------+
| name | mark | grade |
+----------+------+-------+
| Amina | 89 | 3 |
| Amina | 89 | 4 |
| Amina | 89 | 5 |
| Salma | 77 | 3 |
| Salma | 77 | 4 |
| Salma | 77 | 5 |
| Christen | 50 | 3 |
| Christen | 50 | 4 |
| Christen | 50 | 5 |
| Stuart | 28 | 3 |
| Stuart | 28 | 4 |
| Stuart | 28 | 5 |
| Samantha | 21 | 3 |
| Samantha | 21 | 4 |
| Samantha | 21 | 5 |
+----------+------+-------+
Expected behavior
It should be the same as for datafusion = "15.0.0"
:
+----------+------+-------+
| name | mark | grade |
+----------+------+-------+
| Amina | 89 | 5 |
| Salma | 77 | 4 |
| Christen | 50 | 3 |
+----------+------+-------+
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 18 (18 by maintainers)
It seems the
EliminateCrossJoin
rule only collect and consider the equijoin predicates when finding an appropriate right input of join. We can extend equijoin to all join predicate(equijoin and non-equi join), this issue will be fixed I think.I will submit a pr later if others do not fix it.
Looks to be regression introduced by fddb3d3651041f41d66a801f10e27387e84374f7 (https://github.com/apache/arrow-datafusion/pull/4562)
On the commit prior to it (27921135e4ff4b644251db6ab42f1a25bd6523cb), I get this explain plan:
And on commit fddb3d3651041f41d66a801f10e27387e84374f7 I get this plan instead:
The actual regression seems to be caused by the SQL planner generating the initial logical plan with an Inner Join instead of a Cross Join, and this propagates down to cause the bug.
However it seems to highlight the actual flaw which is in
eliminate_cross_join
optimizer rule which converts the plan from:to
Where it completely discards the Filter on the Inner Join when converting it to a Cross Join, causing the buggy behaviour
I think the plan should be the
inner join
not thecross join
This problem remind me the importance of https://github.com/apache/arrow-datafusion/issues/4615.
I meet a strange problem when I do the pr #4866.
After I fix the test, the tcph-q11 will panic when run sqllogictests(the plan of q11 is modified), the direct panic message:
thread 'tokio-runtime-worker' panicked at 'partition not used yet', datafusion/core/src/physical_plan/repartition.rs:405:14 ... at tests/sqllogictests/test_files/tpch.slt:456
.I have no idea about it now, need to investigate more. More info: https://github.com/apache/arrow-datafusion/actions/runs/3884323174/jobs/6626733651
To not block the release, I think maybe we can skip
EliminateCrossJoin
rule when it meet any inner-join input who has non-empty filter(this is consistent with datafusion 15).With the patch, the sql of this issue will get the correct result. For the feature of #4866, some query in datafusion 15 will get wrong result, and if we skip
EliminateCrossJoin
, datafusion 16(release soon) will get correct result, although the plan is not better optimized.cc @alamb @liukun4515 @jackwener
Of course @alamb , I will submit a pr today or tomorrow.
I agree – I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?
IMO it would be a good idea to fix this before releasing, this seems like a regression on fundamental functionality. Happy to help with reviewing the fix PR.
Wow, let’s make sure to add some tests so regressions like this do not stealthily go through in the future 🤔