risingwave: Failed in get: Hummock error: ObjectStore failed with IO error

Describe the bug

Slack link: https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1671603133726859
Namespace: rwc-3-longevity-20221220-180642
Pod: risingwave-compute-2

2022-12-20T18:11:34.875003Z ERROR risingwave_storage::monitor::monitored_store: Failed in get: Hummock error: ObjectStore failed with IO error Internal error: read "rls-apse1-eks-a-rwc-3-longevity-20221220-180642/255/1.data" in block Some(BlockLocation { offset: 11008083, size: 37158 }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  backtrace of `ObjectError`:
   0: <risingwave_object_store::object::error::ObjectError as core::convert::From<risingwave_object_store::object::error::ObjectErrorInner>>::from
             at ./risingwave/src/object_store/src/object/error.rs:38:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/convert/mod.rs:726:9
   2: <risingwave_object_store::object::error::ObjectError as core::convert::From<aws_smithy_http::result::SdkError<E>>>::from
             at ./risingwave/src/object_store/src/object/error.rs:81:9
   3: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/result.rs:2108:27
   4: <risingwave_object_store::object::s3::S3ObjectStore as risingwave_object_store::object::ObjectStore>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/s3.rs:351:20
   5: <async_stack_trace::StackTraced<F,_> as core::future::future::Future>::poll
   6: risingwave_object_store::object::MonitoredObjectStore<OS>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:643:13
   7: risingwave_object_store::object::ObjectStoreImpl::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:334:9
   8: risingwave_storage::hummock::sstable_store::SstableStore::sstable::{{closure}}::{{closure}}::{{closure}}
             at ./risingwave/src/storage/src/hummock/sstable_store.rs:346:25
   9: risingwave_common::cache::LruCache<K,T>::lookup_with_request_dedup::{{closure}}::{{closure}}
             at ./risingwave/src/common/src/cache.rs:818:58
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9

To Reproduce

No response

Expected behavior

No response

Additional context

Or is this expected?

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Most upvoted comments

In terms of network bandwidth, neither of these two failed cases reaches the 10 Gbps limit.

rwc-3-longevity-20230131-171156

rwc-3-longevity-20230201-170952

Will look into the S3 client SDK’s connection cache/pool implementation first, if any. Update: the SDK doesn’t implement its own; it uses hyper.
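For reference, connection pooling in this stack lives in hyper’s client builder rather than in the AWS SDK itself. A minimal sketch of the relevant knobs (hyper 0.14 API; the pool sizes and timeouts are placeholder values, not what RisingWave actually configures):

  use std::time::Duration;
  use hyper::Client;

  fn main() {
      // hyper 0.14 (with the "client", "http1", "tcp" features): the connection
      // pool is configured on the client builder; the AWS SDK reuses whatever
      // hyper provides underneath.
      let _client = Client::builder()
          .pool_idle_timeout(Duration::from_secs(90)) // placeholder value
          .pool_max_idle_per_host(32)                 // placeholder value
          .build_http::<hyper::Body>();
  }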

rwc-3-longevity-20230131-171156 does not seem to be caused by bandwidth this time.

In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes. The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps. We should apply more fixes rather than merely increasing the node’s network capacity.

Testing larger retry max attempts.
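A minimal sketch of what raising the retry max attempts could look like on the AWS Rust SDK side (names follow the public aws-config API; the attempt count and the way RisingWave actually wires this in are assumptions):

  use aws_config::retry::RetryConfig;

  #[tokio::main]
  async fn main() {
      // Standard retry mode with a higher max-attempts cap (5 is a placeholder).
      let sdk_config = aws_config::from_env()
          .retry_config(RetryConfig::standard().with_max_attempts(5))
          .load()
          .await;
      let _s3 = aws_sdk_s3::Client::new(&sdk_config);
  }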

Some info:

  • We cannot tell the EC2 instance type because the test env is lost. Based on the instance having 32 GB of memory, if it’s m5.2xlarge, it has a baseline bandwidth of 2.5 Gbps and a burst bandwidth of 10 Gbps.
  • When the error occurs, network IO bytes do appear close to the 2.5 Gbps baseline bandwidth, as shown in the out-bytes and in-bytes charts.
  • But why are two network devices loaded? Is there more than one RisingWave node running on this instance?

Let’s wait and see the result of another test run with higher network capacity.

DNS quotas

Update: DNS cache is already enabled in our cloud env. So we won’t hit DNS quotas even when there is a burst of S3 requests during MV creation.

But I’m not sure it’s a good idea to increase it.

Agreed, not a good idea.

Can this kind of IO error be more of an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on the current worker nodes?

The kernel manages CPU resources by setting parallelism to no more than the number of CPUs, and manages memory resources by having GlobalMemoryManager evict states from time to time, both in a proactive way.

It feels somewhat strange for it to manage network resources in a reactive way, although being proactive does seem to be a more difficult task.

made-up cases:

  1. It’s a temporary increase in network usage, and we can probably just slow down processing for a short period of time and then get back to normal, e.g. another form of backpressure. Rescheduling or scaling in/out may cost too much for this kind of network usage fluctuation, or it may not be allowed because of (2); or
  2. users only want to allocate a certain amount of resources to some jobs and are willing to tolerate lower throughput and higher latency.

Increasing the connect_timeout does work around this issue (3.1s by default; I used 60s, which is large enough but not a practical value). But I’m not sure it’s a good idea to increase it. Can this kind of IO error be more of an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?
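For illustration, a sketch of overriding the default 3.1s connect timeout on the AWS Rust SDK (module paths follow recent aws-config releases and may differ in the SDK version RisingWave pins; 60s mirrors the experiment above, not a recommendation):

  use std::time::Duration;
  use aws_config::timeout::TimeoutConfig;

  #[tokio::main]
  async fn main() {
      // Override only the connect timeout; other timeouts keep their defaults.
      let timeouts = TimeoutConfig::builder()
          .connect_timeout(Duration::from_secs(60))
          .build();
      let sdk_config = aws_config::from_env()
          .timeout_config(timeouts)
          .load()
          .await;
      let _s3 = aws_sdk_s3::Client::new(&sdk_config);
  }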