presto: Cassandra connector IN query very slow planning on large list
A query like -
select col1
from table
where col2 in (<long list of integers>)
and col3 in (<long list of string>)
and col4 in (<another long list of integers>)
and col1 is not null
group by col1;
takes more than 5 minutes just planning. My cassandra table being queried has a lot of partitions and list length for IN
query I was experimenting with was anywhere between 50 to 200.
<col2, col3, col4>
together form the partition keys so I don’t imagine a full table scan to take place during planning or execution.
Any ideas?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 34 (32 by maintainers)
oh, btw, @aandis thank you for your in-depth analysis!
I am do not think running concurrent queries is a good idea, let’s keep this option as a last resort.
If you have a tuple domain with multiple columns with values set, then the query which is send to Cassandra should use
IN
predicate as your original Presto query.Alright, it took some doing but I ended up setting presto up from source and adding debug statements for profiling this.
Problem
Turns out, for
IN
query over multiple columns, cassandra connector executesselect distinct
for each combination of values for the columns, which takes a long time to finish because the number of such queries is huge. Following log statements should make it clearThis goes on for close to a minute.
Possible fixes
select distinct
queries - I am not sure I understand the reason forSELECT DISTINCT col2,col3,col4 FROM keyspace.table WHERE col2=? AND col3=? AND col4=?
queries when the columns being queried form the complete partition key set.<col2, col3, col4>
in this case form the complete partition keyset, soSELECT DISTINCT
will always return atmost 1 record.I am open to sending a patch to fix this based on what’s decided here.