presto: TPC-DS SF 1K Data mismatch (velox vs java) -> Query 95 (in both original and ahana rewritten queries)

TPC-DS SF 1K Data mismatch while comparing Velox vs Java results for query 95. Attached screenshot. Original TPC-DS 95 query : image Rewitten TPC-DS 95 query: image

Your Environment

Presto version used: 0.284 Deployment (Cloud or On-prem): Cloud

Expected Behavior

Data should match.

Current Behavior

Data mismatch in all columns.

Possible Solution

Steps to Reproduce

Run query 95.

with ws_wh as
(select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_sales ws1,web_sales ws2
 where ws1.ws_order_number = ws2.ws_order_number
   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
 select  
   count(distinct ws_order_number) as "order count"
  ,sum(ws_ext_ship_cost) as "total shipping cost"
  ,sum(ws_net_profit) as "total net profit"
from
   web_sales ws1
  ,date_dim
  ,customer_address
  ,web_site
where
    d_date between (cast('1999-2-01' as date)) and 
           (cast('1999-2-01' as date) + INTERVAL '60' DAY)
and ws1.ws_ship_date_sk = d_date_sk
and ws1.ws_ship_addr_sk = ca_address_sk
and ca_state = 'IL'
and ws1.ws_web_site_sk = web_site_sk
and web_company_name = 'pri'
and ws1.ws_order_number in (select ws_order_number
                            from ws_wh)
and ws1.ws_order_number in (select wr_order_number
                            from web_returns,ws_wh
                            where wr_order_number = ws_wh.ws_order_number)
order by count(distinct ws_order_number)
 fetch first 100 rows only;

Screenshots (if appropriate)

Original TPC-DS 95 query : image Rewitten TPC-DS 95 query: image

Context

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 27 (21 by maintainers)

Most upvoted comments

@karteekmurthys, @yzhang1991, @mohsaka : https://github.com/facebookincubator/velox/pull/6680 should be included in our testing. That has an important fix for HashJoin that would show up at this scale factor.

I checked the query plan and many HashJoins are present in it.

While trying to repro this on smaller dataset with query: select count(ws_order_number) as "order count" from web_sales where ws_order_number in (select ws_order_number from web_sales); , I found that dynamicFilters get applied. Dynamic filters are enabled to filter the probe input when there are upstream operators that accept filters. So, for smaller datasets, for this specific query input is copied as is.

An even simpler reproducer is

select count(ws_order_number) as "order count" from web_sales where ws_order_number in (select ws_order_number from web_sales);
 order count 
-------------
   289805994 
(1 row)

The output is non-deterministic as well.