ray: [AIR] Significant data reading regression in Ray cluster from xgboost 100GB test
What happened + What you expected to happen
Issues observed
- We’re not able to evenly distribute read tasks evenly across a cluster anymore with significant skew that lead to imbalanced memory pattern
- Significantly increased memory cost on headnode that easily lead to OOM
Regression happend between ray commit 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413
and aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0
, roughly in past 3 days from now.
Good nightly release run
Link to good nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_r2ZVKt7AzSsY1seXN5hqene4?command-history-section=command_history
Logs: https://gist.github.com/jiaodong/6fd5728e35b23e6f78c4c6049b754d09
Pip freeze: https://gist.github.com/jiaodong/7d84cc4a73eb80d2a7b40dc31b507138
Bad nightly release run
Link to bad nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_egWTLrS2PYDUJsQYzeKjiqzP?command-history-section=head_start_up_log
Logs: https://gist.github.com/jiaodong/73d716ea47a9319aa11e629719a1d735
Pip freeze: https://gist.github.com/jiaodong/8ea91f79b7af6602479a970b751a8679
Good run memory usage metrics

Bad run memory usage metrics

Corresponding Ray dashboard stats
Good run ray dashboard

Bad run ray dashboard

Versions / Dependencies
On master
Reproduction script
Re-run xgboost_benchmark.py
with 100GB data on 10 node ray cluster
max_workers: 9
--
|
| head_node_type:
| name: head_node
| instance_type: m5.4xlarge
|
| worker_node_types:
| - name: worker_node
| instance_type: m5.4xlarge
| max_workers: 9
| min_workers: 9
| use_spot: false
|
| aws:
| BlockDeviceMappings:
| - DeviceName: /dev/sda1
| Ebs:
| Iops: 5000
| Throughput: 1000
| VolumeSize: 1000
| VolumeType: gp3
Issue Severity
High: It blocks me from completing my task.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (15 by maintainers)
Commits related to this issue
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #2699... — committed to ray-project/ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #2699... — committed to Rohan138/ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #2699... — committed to franklsf95/ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #2699... — committed to gramhagen/ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #2699... — committed to gramhagen/ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #26995... — committed to alipay/ant-ray by c21 2 years ago
- [Datasets] Use SPREAD scheduling strategy and turn off Parquet sampling by default (#27034) Why are these changes needed? The Parquet file sampling PR (#26868) caused nightly test regression - #26995... — committed to Stefan-1313/ray_mod by c21 2 years ago
After applying the corresponding fix below, I am able to get nightly test passed with sampling enabled:
Root cause analysis:
After some digging, I found there are two issues:
Here’s an example to read one Parquet file, and reader allocated ~400MB memory, but the actual file size in-memory is just 90MB.
Above test passes, which indicates that this indeed being caused by https://github.com/ray-project/ray/pull/26868 (https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83)
@c21 could you take a deeper look at what’s happening here?
That’s interesting. Did the number of blocks increase? That could hit the known data imbalance issue from https://github.com/ray-project/ray/issues/26878
still having the issue. trying cut-off at https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 (exclusive) now, as Cheng suggested.
Link: https://buildkite.com/ray-project/release-tests-branch/builds/825
@xwjiang2010 looks like you ended up running it inclusive 😄
Kicking off one with the commit before (8fe439998ecd48b1da5216d882123e6bde3b8fb7): https://buildkite.com/ray-project/release-tests-branch/builds/826
git log --oneline da9581b7465c5e1e4903595b4107ed1fa601920f…8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 8ecd928c34 [Serve] Make the checkpoint and recover only from GCS (#26753) 193e824bc1 [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937) 1b06e7a83a [tune] Only sync down from cloud if needed (#26725) 4cc1ef1557 [Core] Refactoring Ray DAG object scanner (#26917) df217d15e0 [air] Raise error on path-like access for Checkpoints (#26970) 5315f1e643 [AIR] Enable other notebooks previously marked with # REGRESSION (#26896) 5030a4c1d3 [RLlib] Simplify agent collector (#26803) df638b3f0f [Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924) 0e1b77d52a [Workflow] Fix flaky example(#26960) e8222ff600 [dashboard] Update cluster_activities endpoint to use pydantic. (#26609) aae0aaedbd [air] Un-revert “[air] remove unnecessary logs + improve repr for result” (#26942) bf1d9971f1 [setup-dev] Add flag to skip symlink certain folders (#26899) ec1995a662 [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (#26959) de7bd015a4 [air/tune/docs] Change Tuner() occurences in rest of ray/tune (#26961) 3ea80f6aa1 [data] set iter_batches default batch_size (#26955) b1594260ba [RLlib] Small SlateQ example fix. (#26948) 41c9ef709a [RLlib] Using PG when not doing microbatching kills A2C performance. (#26844) 794a81028b [ci] add repro-ci-requirements.txt (#26951) bf97a6944b [Dashboard] Actor Table UI Optimize (#26785) 4d6cbb0fd4 [Java]More efficient getAllNodeInfo() (#26872) abde2a5f97 [tune] Fix current best trial progress string for metric=0 (#26943) 4a1ad3e87a [Workflow] Support “retry_exceptions” of Ray tasks (#26913) a012033033 [ci] pin werkzeug (#26950) 15b711ae6a [State Observability] Warn if callsite is disabled when
ray list objects
+ raise exception on missing output (#26880) 1ac2a872e7 [docs] Editing pass over Dataset docs (#26935) d01a80eb11 [core] runtime context resource ids getter (#26907) acbab51d3e [Nightly] fix microbenchmark scripts (#26947) 0c16619475 [core] Make ray able to connect to redis without pip redis. (#25875) 8d7b865614 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931) 803c094534 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884) 008eecfbff [docs] Update the AIR data ingest guide (#26909) e19cf164fd [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868) 8fe439998e [air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930) c01bb831d4 [hotfix/data] Fix linter for test_split (#26944) e9503dbe2b [RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793) e9a8f7d9ae [RLlib] Unify gnorm mixin for tf and torch policies. (#26102) c44d9ff397 [core] Fix the deadlock in submit task when actor failed. (#26898) 90cea203be Ray 2.0 API deprecation (#26116) aaab4abad5 [Data][Split] stable version of split with hints (#26778) 37f4692aa8 [State Observability] Fix “No result for get crashing the formatting” and “Filtering not handled properly when key missing in the datum” #26881 d692a55018 [data] Make lazy mode non-experimental (#26934) b32c784c7f [RLLib] RE3 exploration algorithm TF2 framework support (#25221) bcec60d898 Revert "[data] set iter_batches default batch_size #26869 " (#26938)The last working run is last Saturday at 3pm The next commit after last working commit: https://github.com/ray-project/ray/commit/da9581b7465c5e1e4903595b4107ed1fa601920f
range: from da9581b7465c5e1e4903595b4107ed1fa601920f to 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 both inclusive