presto: TPC-DS SF 1K Data mismatch (velox vs java) -> Query 95 (in both original and ahana rewritten queries)
TPC-DS SF 1K Data mismatch while comparing Velox vs Java results for query 95. Attached screenshot.
Original TPC-DS 95 query :
Rewitten TPC-DS 95 query:
Your Environment
Presto version used: 0.284 Deployment (Cloud or On-prem): Cloud
Expected Behavior
Data should match.
Current Behavior
Data mismatch in all columns.
Possible Solution
Steps to Reproduce
Run query 95.
with ws_wh as
(select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
from web_sales ws1,web_sales ws2
where ws1.ws_order_number = ws2.ws_order_number
and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
select
count(distinct ws_order_number) as "order count"
,sum(ws_ext_ship_cost) as "total shipping cost"
,sum(ws_net_profit) as "total net profit"
from
web_sales ws1
,date_dim
,customer_address
,web_site
where
d_date between (cast('1999-2-01' as date)) and
(cast('1999-2-01' as date) + INTERVAL '60' DAY)
and ws1.ws_ship_date_sk = d_date_sk
and ws1.ws_ship_addr_sk = ca_address_sk
and ca_state = 'IL'
and ws1.ws_web_site_sk = web_site_sk
and web_company_name = 'pri'
and ws1.ws_order_number in (select ws_order_number
from ws_wh)
and ws1.ws_order_number in (select wr_order_number
from web_returns,ws_wh
where wr_order_number = ws_wh.ws_order_number)
order by count(distinct ws_order_number)
fetch first 100 rows only;
Screenshots (if appropriate)
Original TPC-DS 95 query :
Rewitten TPC-DS 95 query:
Context
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 27 (21 by maintainers)
@karteekmurthys, @yzhang1991, @mohsaka : https://github.com/facebookincubator/velox/pull/6680 should be included in our testing. That has an important fix for HashJoin that would show up at this scale factor.
I checked the query plan and many HashJoins are present in it.
While trying to repro this on smaller dataset with query:
select count(ws_order_number) as "order count" from web_sales where ws_order_number in (select ws_order_number from web_sales);
, I found thatdynamicFilters
get applied. Dynamic filters are enabled to filter the probe input when there are upstream operators that accept filters. So, for smaller datasets, for this specific query input is copied as is.An even simpler reproducer is
The output is non-deterministic as well.