presto: Worker instances randomly exit during queries

Hi there. After updating to 0.150, I've been hitting a strange issue where workers randomly exit during queries (especially when multiple queries are running concurrently).

The following are the last log lines from a worker that exited:

Jul 14, 2016 5:36:13 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 32068 records.
Jul 14, 2016 5:36:13 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jul 14, 2016 5:36:13 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 11 ms. row count = 32068
Jul 14, 2016 5:36:27 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAtt
2016-07-14T17:36:41.603+0800    ERROR   exchange-client-22      com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.6.128.138:8088/v1/task/20160714_093641_00028_e4m62.2.19/results/20160714_093641_00028_e4m62.1.41 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:36:41.612+0800    ERROR   exchange-client-0       com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.6.194.14:8088/v1/task/20160714_093641_00028_e4m62.2.28/results/20160714_093641_00028_e4m62.1.41 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:36:41.612+0800    ERROR   exchange-client-0       com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.6.130.223:8088/v1/task/20160714_093641_00028_e4m62.2.70/results/20160714_093641_00028_e4m62.1.41 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:36:41.613+0800    ERROR   exchange-client-0       com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.6.131.39:8088/v1/task/20160714_093641_00028_e4m62.2.21/results/20160714_093641_00028_e4m62.1.41 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:36:41.927+0800    ERROR   exchange-client-4       com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.6.130.163:8088/v1/task/20160714_093641_00028_e4m62.3.9/results/20160714_093641_00028_e4m62.2.41 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:36:42.126+0800    ERROR   SplitRunner-7-43        com.facebook.presto.execution.TaskExecutor      Error processing Split 20160714_093641_00028_e4m62.3.43-1 {path=hdfs://hive_table0/date=20160608/hour=00/part-r-00006-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet, start=1073741824, length=67108864, hosts=[n6-194-147.byted.org, n6-128-139.byted.org, n6-128-131.byted.org], database=default, table=fact_snssdk_nginx_log_daily, forceLocalScheduling=false, partitionName=date=20160608/hour=00, effectivePredicate=com.facebook.presto.spi.predicate.TupleDomain@41c917b9} (start = 1468489001523, wall = 603 ms, cpu = 0 ms, calls = 1): HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split hdfs://hive_table0/date=20160608/hour=00/part-r-00006-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet (offset=1073741824, length=67108864): null
2016-07-14T17:36:42.152+0800    ERROR   SplitRunner-1-37        com.facebook.presto.execution.TaskExecutor      Error processing Split 20160714_093641_00028_e4m62.3.43-2 {path=hdfs://hive_table0/date=20160608/hour=00/part-r-00022-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet, start=1946157056, length=67108864, hosts=[n6-195-018.byted.org, n6-130-142.byted.org, n6-130-149.byted.org], database=default, table=fact_snssdk_nginx_log_daily, forceLocalScheduling=false, partitionName=date=20160608/hour=00, effectivePredicate=com.facebook.presto.spi.predicate.TupleDomain@41c917b9} (start = 1468489001523, wall = 629 ms, cpu = 0 ms, calls = 1): HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split hdfs://hive_table0/date=20160608/hour=00/part-r-00022-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet (offset=1946157056, length=67108864): null
2016-07-14T17:36:42.161+0800    ERROR   SplitRunner-3-39        com.facebook.presto.execution.TaskExecutor      Error processing Split 20160714_093641_00028_e4m62.3.43-3 {path=hdfs://hive_table0/date=20160608/hour=00/part-r-00005-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet, start=939524096, length=67108864, hosts=[n6-128-212.byted.org, n6-128-205.byted.org, n6-194-229.byted.org], database=default, table=fact_snssdk_nginx_log_daily, forceLocalScheduling=false, partitionName=date=20160608/hour=00, effectivePredicate=com.facebook.presto.spi.predicate.TupleDomain@41c917b9} (start = 1468489001523, wall = 638 ms, cpu = 0 ms, calls = 1): HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split hdfs://hive_table0/date=20160608/hour=00/part-r-00005-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet (offset=939524096, length=67108864): null
2016-07-14T17:36:42.164+0800    ERROR   SplitRunner-6-42        com.facebook.presto.execution.TaskExecutor      Error processing Split 20160714_093641_00028_e4m62.3.43-6 {path=hdfs://hive_table0/date=20160608/hour=00/part-r-00001-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet, start=1308622848, length=33554432, hosts=[n6-130-203.byted.org, n6-194-029.byted.org, n6-130-195.byted.org], database=default, table=fact_snssdk_nginx_log_daily, forceLocalScheduling=false, partitionName=date=20160608/hour=00, effectivePredicate=com.facebook.presto.spi.predicate.TupleDomain@41c917b9} (start = 1468489001538, wall = 626 ms, cpu = 0 ms, calls = 1): HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split hdfs://hive_table0/date=20160608/hour=00/part-r-00001-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet (offset=1308622848, length=33554432): Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: "n6-194-081/10.6.194.81"; destination host is: "n6-195-031.byted.org":65212;
2016-07-14T17:36:42.165+0800    ERROR   SplitRunner-5-41        com.facebook.presto.execution.TaskExecutor      Error processing Split 20160714_093641_00028_e4m62.3.43-5 {path=hdfs://hive_table0/date=20160608/hour=00/part-r-00002-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet, start=1476395008, length=33554432, hosts=[n6-130-143.byted.org, n6-131-050.byted.org, n6-130-144.byted.org], database=default, table=fact_snssdk_nginx_log_daily, forceLocalScheduling=false, partitionName=date=20160608/hour=00, effectivePredicate=com.facebook.presto.spi.predicate.TupleDomain@41c917b9} (start = 1468489001538, wall = 627 ms, cpu = 0 ms, calls = 1): HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split hdfs://hive_table0/date=20160608/hour=00/part-r-00002-2f743039-717f-4798-8ef3-27a85a54323b.snappy.parquet (offset=1476395008, length=33554432): Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: "n6-194-081/10.6.194.81"; destination host is: "n6-195-031.byted.org":65212;
2016-07-14T17:37:42.899+0800    ERROR   exchange-client-7       com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.3.128.162:8088/v1/task/20160714_093742_00031_e4m62.2.30/results/20160714_093742_00031_e4m62.1.58 failed java.util.concurrent.CancellationException: Task was cancelled.
2016-07-14T17:40:19.472+0800    ERROR   exchange-client-12      com.facebook.presto.operator.HttpPageBufferClient       Request to delete http://10.3.132.152:8088/v1/task/20160714_093917_00035_e4m62.2.74/results/20160714_093917_00035_e4m62.1.51 failed io.airlift.http.client.RuntimeIOException: java.net.SocketTimeoutException: Connect Timeout
Already running as 183625
2016-07-14T17:51:49.854+0800    INFO    Thread-22       io.airlift.bootstrap.LifeCycleManager   Life cycle stopping...

I’ve also examined the other logs (http-request.log, for example) and found nothing relevant.

Any thoughts? Thanks.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 20 (10 by maintainers)

Most upvoted comments

I don’t think there is much public info about the internals of how/when clients are throttled; the only documented behavior is that clients may get a 503 when they are throttled.
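
To make that concrete, this is roughly the pattern a throttled caller ends up implementing on its side. A bare-JDK sketch, nothing Presto-specific; the endpoint and retry limits are made up:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ThrottleAwareGet {
    // Issue a GET and back off whenever the service answers 503 (throttled).
    static int getWithBackoff(String url, int maxRetries) throws IOException, InterruptedException {
        long backoffMillis = 500;
        for (int attempt = 0; ; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(30_000);
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status != 503 || attempt >= maxRetries) {
                return status;
            }
            Thread.sleep(backoffMillis);                          // wait before retrying
            backoffMillis = Math.min(backoffMillis * 2, 30_000);  // exponential backoff, capped
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; substitute whichever service is throttling you.
        System.out.println(getWithBackoff("http://example.com/v1/data", 5));
    }
}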

Btw, it is not only about throttling: throttling may cause stragglers, but I have also seen clients time out on read/connect as well (see the stack traces in your worker logs).
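
If you want to rule Presto itself out, one option is to exercise the same connect/read path with a bare Hadoop client against one of the splits that failed. A rough sketch, assuming a stock Hadoop client on the classpath; the namenode address, file path, and timeout values are placeholders, and the property names are standard Hadoop client settings rather than anything Presto-specific:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitOpenCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Stock Hadoop client settings for connect/read behavior; the values here are arbitrary.
        conf.setInt("ipc.client.connect.timeout", 20_000);                 // connect timeout, ms
        conf.setInt("dfs.client.socket-timeout", 60_000);                  // read timeout, ms
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);

        // Placeholder namenode address and file; point these at one of the splits from the log above.
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
                FSDataInputStream in = fs.open(new Path("/warehouse/some_table/part-r-00001.snappy.parquet"))) {
            byte[] buffer = new byte[8192];
            in.readFully(0, buffer);   // a positioned read exercises the same DataNode connect/read path
        }
    }
}

If that also times out under the same concurrent load, the stragglers are coming from the storage side rather than from Presto.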