risingwave: Failed in get: Hummock error: ObjectStore failed with IO error

Describe the bug

Slack link: https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1671603133726859
Namespace: rwc-3-longevity-20221220-180642
Pod: risingwave-compute-2

2022-12-20T18:11:34.875003Z ERROR risingwave_storage::monitor::monitored_store: Failed in get: Hummock error: ObjectStore failed with IO error Internal error: read "rls-apse1-eks-a-rwc-3-longevity-20221220-180642/255/1.data" in block Some(BlockLocation { offset: 11008083, size: 37158 }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  backtrace of `ObjectError`:
   0: <risingwave_object_store::object::error::ObjectError as core::convert::From<risingwave_object_store::object::error::ObjectErrorInner>>::from
             at ./risingwave/src/object_store/src/object/error.rs:38:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/convert/mod.rs:726:9
   2: <risingwave_object_store::object::error::ObjectError as core::convert::From<aws_smithy_http::result::SdkError<E>>>::from
             at ./risingwave/src/object_store/src/object/error.rs:81:9
   3: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/result.rs:2108:27
   4: <risingwave_object_store::object::s3::S3ObjectStore as risingwave_object_store::object::ObjectStore>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/s3.rs:351:20
   5: <async_stack_trace::StackTraced<F,_> as core::future::future::Future>::poll
   6: risingwave_object_store::object::MonitoredObjectStore<OS>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:643:13
   7: risingwave_object_store::object::ObjectStoreImpl::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:334:9
   8: risingwave_storage::hummock::sstable_store::SstableStore::sstable::{{closure}}::{{closure}}::{{closure}}
             at ./risingwave/src/storage/src/hummock/sstable_store.rs:346:25
   9: risingwave_common::cache::LruCache<K,T>::lookup_with_request_dedup::{{closure}}::{{closure}}
             at ./risingwave/src/common/src/cache.rs:818:58
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9

To Reproduce

No response

Expected behavior

No response

Additional context

Or is this expected?

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Most upvoted comments

In terms of network bandwidth, neither of these two failed cases reaches the 10 Gbps limit.

rwc-3-longevity-20230131-171156

rwc-3-longevity-20230201-170952

Will look into the S3 client SDK’s connection cache/pool implementation first, if any. Update: the SDK doesn’t implement its own; it uses hyper.
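For reference, connection pooling in this stack lives in hyper’s client builder rather than in the AWS SDK itself. A minimal sketch of the relevant knobs (hyper 0.14 API; the pool sizes and timeouts are placeholder values, not what RisingWave actually configures):

  use std::time::Duration;
  use hyper::Client;

  fn main() {
      // hyper 0.14 (with the "client", "http1", "tcp" features): the connection
      // pool is configured on the client builder; the AWS SDK reuses whatever
      // hyper provides underneath.
      let _client = Client::builder()
          .pool_idle_timeout(Duration::from_secs(90)) // placeholder value
          .pool_max_idle_per_host(32)                 // placeholder value
          .build_http::<hyper::Body>();
  }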

rwc-3-longevity-20230131-171156 does not seem to be caused by bandwidth this time.

In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes. The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps. We should apply more fixes rather than merely increasing the node’s network capacity.

Testing larger retry max attempts.
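A minimal sketch of what raising the retry max attempts could look like on the AWS Rust SDK side (names follow the public aws-config API; the attempt count and the way RisingWave actually wires this in are assumptions):

  use aws_config::retry::RetryConfig;

  #[tokio::main]
  async fn main() {
      // Standard retry mode with a higher max-attempts cap (5 is a placeholder).
      let sdk_config = aws_config::from_env()
          .retry_config(RetryConfig::standard().with_max_attempts(5))
          .load()
          .await;
      let _s3 = aws_sdk_s3::Client::new(&sdk_config);
  }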

Some info:

  • We cannot tell the EC2 instance type because the test env is lost. Based on the instance having 32 GB of memory, if it’s m5.2xlarge, it has a baseline bandwidth of 2.5 Gbps and a burst bandwidth of 10 Gbps.
  • When the error occurs, network IO bytes do appear close to the 2.5 Gbps baseline bandwidth, as shown in the out-bytes and in-bytes charts.
  • But why are two network devices loaded? Is there more than one RisingWave node running on this instance?

Let’s wait and see the result of another test run with higher network capacity.

DNS quotas

Update: DNS cache is already enabled in our cloud env. So we won’t hit DNS quotas even when there is a burst of S3 requests during MV creation.

But I’m not sure it’s a good idea to increase it.

Agreed, not a good idea.

Can this kind of IO error be more of an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on the current worker nodes?

The kernel manages CPU resources by setting parallelism to no more than the number of CPUs, and manages memory resources by having GlobalMemoryManager evict states from time to time, both in a proactive way.

It feels somewhat strange for it to manage network resources in a reactive way, although being proactive does seem to be a more difficult task.

made-up cases:

  1. It’s a temporary increase in network usage, and we can probably just slow down processing for a short period of time and then get back to normal, e.g. another form of backpressure. Rescheduling or scaling in/out may cost too much for this kind of network usage fluctuation, or it may not be allowed because of (2); or
  2. users only want to allocate a certain amount of resources to some jobs and are willing to tolerate lower throughput and higher latency.

Increasing the connect_timeout does work around this issue (3.1s by default; I used 60s, which is large enough but not a practical value). But I’m not sure it’s a good idea to increase it. Can this kind of IO error be more of an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?
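For illustration, a sketch of overriding the default 3.1s connect timeout on the AWS Rust SDK (module paths follow recent aws-config releases and may differ in the SDK version RisingWave pins; 60s mirrors the experiment above, not a recommendation):

  use std::time::Duration;
  use aws_config::timeout::TimeoutConfig;

  #[tokio::main]
  async fn main() {
      // Override only the connect timeout; other timeouts keep their defaults.
      let timeouts = TimeoutConfig::builder()
          .connect_timeout(Duration::from_secs(60))
          .build();
      let sdk_config = aws_config::from_env()
          .timeout_config(timeouts)
          .load()
          .await;
      let _s3 = aws_sdk_s3::Client::new(&sdk_config);
  }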