datafusion: Incorrect results in datafusion
Describe the bug
I came upon a bug while querying my custom Parquet dataset, which causes DataFusion to produce incoherent and incorrect results.
I tested my dataset in various ways, all of which produced the desired results:
- reading parquet files using python pandas, then merging and filtering the data there
- encoding into CSV, and reading the data with DataFusion
- creating an SQLite database using the provided CSV files, and using the same queries there
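For illustration, the SQLite cross-check can be reproduced on toy data with Python's stdlib sqlite3 module. The rows below are hypothetical stand-ins, not the reporter's dataset, and `NULLS LAST` requires SQLite >= 3.30:

```python
import sqlite3

# Toy stand-ins for the real stop/trip tables (hypothetical data,
# not the reporter's dataset) to illustrate the cross-check query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stop (tid INTEGER, stop_name TEXT);
CREATE TABLE trip (trip_tid INTEGER, line TEXT, trip_line TEXT);
INSERT INTO stop VALUES (1, 'Bartnicza'), (1, NULL), (2, 'Smugowa');
INSERT INTO trip VALUES (1, '176', '176'), (2, '500', '500');
""")

# Same shape as query 1 from the report: DISTINCT + INNER JOIN +
# NULLS LAST; the NULL stop_name must sort after all real names.
rows = conn.execute("""
SELECT DISTINCT stop_name FROM stop
INNER JOIN trip ON tid = trip_tid
WHERE line = '176'
ORDER BY stop_name NULLS LAST
""").fetchall()
print(rows)
```

On a correct engine this prints the one matching name followed by the NULL row, mirroring the 27-row expected output ending in None.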
To Reproduce
Steps to reproduce the behavior:
- Download all the code and data I used for testing:
Inside there are the Parquet files and CSVs with exactly the same data (also, there's an SQLite database created from the provided CSV files).
- Use the instructions included in README.md to reproduce the issue:
The query that fails when querying Parquet files with datafusion-cli:
-- 1. Distinct stop names
SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
Changing only the WHERE condition from line to trip_line produces the desired results.
Expected behavior
Should produce these 27 rows:
Bartnicza
Bazyliańska
Bolesławicka
Brzezińska
Budowlana
Choszczówka
Chłodnia
Daniszewska
Fabryka Pomp
Insurekcji
Marcelin
Marywilska-Las
Ołówkowa
PKP Płudy
PKP Żerań
Parowozowa
Pelcowizna
Polnych Kwiatów
Raciborska
Rembielińska
Sadkowska
Smugowa
Starego Dębu
Zyndrama z Maszkowic
os.Marywilska
Śpiewaków
None
Query 1 from README.md (mentioned above) produces this incorrect set of 33 rows:
+----------------------+
| stop_name |
+----------------------+
| Bartnicza |
| Bazyliańska |
| Bolesławicka |
| Brzezińska |
| Budowlana |
| Choszczówka |
| Chłodnia |
| Cygańska |
| Czołgistów |
| Daniszewska |
| Fabryka Pomp |
| Insurekcji |
| Majerankowa |
| Marcelin |
| Marywilska-Las |
| Ołówkowa |
| PKP Falenica |
| PKP Płudy |
| PKP Żerań |
| Parowozowa |
| Pelcowizna |
| Polnych Kwiatów |
| Raciborska |
| Rembielińska |
| Rokosowska |
| Sadkowska |
| Smugowa |
| Starego Dębu |
| Zbójna Góra |
| Zyndrama z Maszkowic |
| os.Marywilska |
| Śpiewaków |
| |
+----------------------+
Additional context
DataFusion version:
$ datafusion-cli --version
DataFusion 5.1.0
My guess
Since the Parquet files have encoded NULLs, and reading the CSV files with datafusion-cli gets rid of those, my best bet is on the usage of NULLs and some weird behavior when joining.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 27 (24 by maintainers)
@alamb this reads like a detective 🕵️♂️
The plot thickens! 🕵️
Regarding parquet predicate pruning, amusingly in this case, I think row group pruning actually helps avoid the problem. As you may recall, when a filter is applied like this
The answer is correct (stop_name is "RONDO ZESŁAŃCÓW SYBERYJSKICH"):
However, when I disable pruning then the wrong answer comes out!
I added some debugging, and verified that the query does in fact skip several row groups:
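Row group pruning works by comparing the filter value against each group's min/max column statistics. A minimal sketch of the idea, with hypothetical statistics (this is illustrative only, not DataFusion's actual pruning code):

```python
# Hypothetical per-row-group min/max statistics for stop_name.
row_groups = [
    {"min": "AAA", "max": "MMM"},
    {"min": "NNN", "max": "ZZZ"},
]

def may_contain(group, value):
    # A group can only match `col = value` if value lies in [min, max];
    # otherwise the whole group is skipped without decoding it.
    return group["min"] <= value <= group["max"]

kept = [g for g in row_groups if may_contain(g, "RONDO ZESŁAŃCÓW SYBERYJSKICH")]
print(kept)
```

Skipped groups never reach the (buggy) value decoder, which would explain why pruning accidentally masks the problem here.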
I reran the queries from @franeklubi at 2e918184c502a40a041fb0163702cb6ab8de0af9:
Setup
Now the results are consistent when using csv and parquet:
Parquet
Csv
Interestingly, the CSV results don't seem to apply the NULLS LAST ordering 🤔
I tested the fix from @yordan-pavlov in https://github.com/apache/arrow-rs/pull/1130 against my reproducer and with #1130 it now gets the correct answer ❤️ so that seems like progress
FYI https://github.com/apache/arrow-rs/pull/1110 runs into a similar issue, that appears to be fixed by switching to ComplexObjectArrayReader instead of ArrowArrayReader. I will have a poke around tomorrow if I have time, and see if I can spot what is going wrong.
Oh weird, but then a direct query
Seems to get the correct answer 🤔
Hi @franeklubi, thanks for the detailed sharing.
To simplify bug reproduction, can you help me understand the difference between the Parquet and CSV data? Specifically:
They seem to have a different number of rows.
The same applies to the trips data.
You could try using ComplexObjectArrayReader to decode the column instead of ArrowArrayReader. This might help narrow down if the bug lies in the RLE decoding or something higher up in ArrowArrayReader. Alternatively you could see if the issue occurs with https://github.com/apache/arrow-rs/pull/1082 which replaces ArrowArrayReader with an alternative that shares more with the PrimitiveArrayReader/ComplexObjectArrayReader implementations, again this might help narrow down the origin.
Not at a computer at the moment, otherwise would try myself
I will probably file a ticket in arrow-rs shortly with the slimmed down reproducer
This discrepancy looks like it comes out of VariableLenDictionaryDecoder, which seems to have been introduced by @yordan-pavlov in https://github.com/apache/arrow-rs/pull/384
I have not studied the code enough yet to fully understand what it is doing, and I need to attend to some other items now. If anyone has ideas (cc @tustvold) on where to look next I would appreciate it
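The suspect decoder performs dictionary plus run-length decoding of variable-length values. A simplified sketch of that decoding shape (hypothetical values; not the actual arrow-rs implementation) shows how a small bookkeeping slip in run lengths or indices would silently substitute wrong strings:

```python
# Dictionary of distinct string values, as stored in a Parquet
# dictionary page (hypothetical contents).
dictionary = ["Bartnicza", "Odkryta", "Smugowa"]

# RLE-encoded data page: (dictionary index, run length) pairs.
runs = [(0, 2), (2, 1), (1, 3)]

decoded = []
for index, length in runs:
    # An off-by-one in `length` or a stale `index` here would shift
    # every subsequent value, producing plausible-looking wrong rows.
    decoded.extend([dictionary[index]] * length)

print(decoded)
```

A misalignment of this kind would match the symptom: values that exist elsewhere in the file showing up on the wrong rows.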
I have a smaller reproducer and am trying to narrow down where the problem is but probably won’t be able to work on this until next week sometime at the earliest
In case anyone else is interested, here is the repo: repro.zip
It should print out
But actually prints out
I spent some time looking at the parquet data and the csv data, and it looks to me like there may be something wrong with the parquet reader.
Specifically, I just ran a query that did a SELECT *.
dump_parquet.sql:
and dump_csv.sql:
Like this:
The results are here: dump_csv.txt dump_parquet.txt
And a quick visual diff shows they aren’t the same
The first few lines of dump_csv.txt look like:
While the first few lines of dump_parquet.txt look like:
(note that the stop name is different)
However, when I look for that mismatched line trip_tid=54788307 in the data using pandas, it does match with the csv.
Here is the raw data in csv:
Here is what comes out when using pandas:
I’ll try and check this out in more detail tomorrow. Thanks for the investigation @Jimexist
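For reference, the pandas-style cross-check above can be sketched with only the stdlib; the CSV contents here are hypothetical stand-ins for the real stop file:

```python
import csv
import io

# Hypothetical stand-in for the reporter's CSV; only trip_tid 54788307
# is taken from the report, the stop names are made up.
stop_csv = io.StringIO(
    "trip_tid,stop_name\n"
    "54788307,StopA\n"
    "54788308,StopB\n"
)

# Filter down to the mismatched row, like df[df.trip_tid == 54788307].
matches = [row for row in csv.DictReader(stop_csv)
           if row["trip_tid"] == "54788307"]
print(matches)
```

If the CSV row agrees with pandas but disagrees with the Parquet read, the reader (not the data) is the likely culprit.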
Seems like these values are absent in the parquet readings:
and the scanning process was invalid:
Notice the Odkryta row would be missing in the statement below.
Given the investigation above I believe this might be related to the single-distinct-to-group-by optimization or the hash aggregation steps.
cc @Dandandan @houqp
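For context on that suspicion: SELECT DISTINCT is typically rewritten as a group-by on the selected column, so all NULLs must hash into one group. A sketch of that step (illustrative only, not DataFusion's implementation):

```python
# Rows of stop_name as a scan might yield them (hypothetical data);
# None stands for SQL NULL.
rows = ["Bartnicza", None, "Smugowa", None, "Bartnicza"]

# Hash aggregation for DISTINCT: every NULL must land in one group.
# A reader bug that replaces NULLs with stale values would instead
# create spurious extra groups -- i.e. extra DISTINCT rows.
groups = {}
for name in rows:
    groups[name] = groups.get(name, 0) + 1

# NULLS LAST ordering of the distinct keys.
ordered = sorted(groups, key=lambda v: (v is None, v))
print(ordered)
```

This matches the observed failure mode: the incorrect 33-row result contains extra names that a NULL-mishandling step could have introduced.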