ray: [AIR] Significant data reading regression in Ray cluster from xgboost 100GB test

What happened + What you expected to happen

Issues observed

  1. Read tasks are no longer evenly distributed across the cluster; significant skew leads to an imbalanced memory pattern.
  2. Significantly increased memory cost on the head node, which easily leads to OOM.

The regression happened between Ray commits 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 and aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0, roughly within the past 3 days.

Good nightly release run

Link to good nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_r2ZVKt7AzSsY1seXN5hqene4?command-history-section=command_history

Logs: https://gist.github.com/jiaodong/6fd5728e35b23e6f78c4c6049b754d09

Pip freeze: https://gist.github.com/jiaodong/7d84cc4a73eb80d2a7b40dc31b507138

Bad nightly release run

Link to bad nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_egWTLrS2PYDUJsQYzeKjiqzP?command-history-section=head_start_up_log

Logs: https://gist.github.com/jiaodong/73d716ea47a9319aa11e629719a1d735

Pip freeze: https://gist.github.com/jiaodong/8ea91f79b7af6602479a970b751a8679

Good run memory usage metrics

[Screenshot: memory usage metrics, good run, 2022-07-25]

Bad run memory usage metrics

[Screenshot: memory usage metrics, bad run, 2022-07-25]

Corresponding Ray dashboard stats

Good run ray dashboard

[Screenshot: Ray dashboard, good run, 2022-07-25]

Bad run ray dashboard

[Screenshot: Ray dashboard, bad run, 2022-07-25]

Versions / Dependencies

On master

Reproduction script

Re-run xgboost_benchmark.py

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py

with 100GB of data on a 10-node Ray cluster, using this cluster config:

max_workers: 9

head_node_type:
  name: head_node
  instance_type: m5.4xlarge

worker_node_types:
- name: worker_node
  instance_type: m5.4xlarge
  max_workers: 9
  min_workers: 9
  use_spot: false

aws:
  BlockDeviceMappings:
  - DeviceName: /dev/sda1
    Ebs:
      Iops: 5000
      Throughput: 1000
      VolumeSize: 1000
      VolumeType: gp3

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

After applying the corresponding fix below, I am able to get the nightly test to pass with sampling enabled:

[Screenshot: passing nightly test with sampling enabled, 2022-07-26]

Root cause analysis:

After some digging, I found there are two issues:

  • Sampling tasks use the DEFAULT scheduling strategy, not SPREAD. So during execution, all sampling tasks are packed onto the driver node instead of being spread across the worker nodes (see the sketch after this list).
  • The PyArrow Parquet reader allocates far more memory than needed when reading just a small chunk of data.
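
For illustration, a minimal sketch of the scheduling difference for the first issue. The task body is a hypothetical stand-in; scheduling_strategy="SPREAD" is the actual Ray remote-task option at issue:

import ray

ray.init()

# Hypothetical stand-in for a per-file metadata-sampling task. With the
# DEFAULT strategy these tend to pack onto the driver node while it has
# resources; "SPREAD" asks the scheduler to spread them across the cluster.
@ray.remote(scheduling_strategy="SPREAD")
def sample_file(path):
    return len(path)  # placeholder work

refs = [sample_file.remote(f"file_{i}.parquet") for i in range(100)]
ray.get(refs)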

Here’s an example of reading one Parquet file: the reader allocated ~400MB of memory, while the actual in-memory size of the file is just 90MB.

>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> 
>>> pq_ds = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False)
>>> piece = pq_ds.pieces[0]
<stdin>:1: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Use the '.fragments' attribute instead
>>> piece.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8b488ec130>
  created_by: parquet-cpp-arrow version 6.0.1
  num_columns: 43
  num_rows: 260000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 21785
>>> num_rows = 5
>>> pyarrow.default_memory_pool().max_memory()
65536
>>> piece.head(num_rows, batch_size=num_rows).nbytes
1763
>>> pyarrow.default_memory_pool().max_memory()
390484736
>>> pyarrow.default_memory_pool().bytes_allocated()
0
>>> import ray
>>> ds = ray.data.read_parquet("/Users/chengsu/try/parquet/xgboost_0.parquet")
>>> ds.fully_executed().size_bytes()
90837500
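
The same measurement as a standalone script (a minimal sketch assuming a local xgboost_0.parquet file; it uses pq.ParquetFile.iter_batches rather than the deprecated .pieces accessor):

import pyarrow
import pyarrow.parquet as pq

pq_file = pq.ParquetFile("xgboost_0.parquet")  # assumed local test file
num_rows = 5

# Read only the first few rows.
batch = next(pq_file.iter_batches(batch_size=num_rows))
print("rows read:", batch.num_rows, "batch nbytes:", batch.nbytes)

# Peak allocation is dominated by decoding the underlying row group,
# not by the tiny batch that was actually requested.
print("peak pool memory:", pyarrow.default_memory_pool().max_memory())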

The above test passes, which indicates this is indeed caused by https://github.com/ray-project/ray/pull/26868 (https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83).
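
For context, that PR estimates the in-memory size of a Parquet dataset by sampling. A simplified sketch of the idea (not Ray's actual implementation): read a few rows from a fragment, derive bytes per row, and extrapolate using the row counts from the Parquet metadata. Running one such read per file is what exposes both the scheduling issue and the PyArrow over-allocation above.

import pyarrow.parquet as pq

def estimate_in_memory_size(path, sample_rows=5):
    # Simplified sketch of sampling-based size estimation.
    pq_ds = pq.ParquetDataset(path, use_legacy_dataset=False)
    sample = pq_ds.fragments[0].head(sample_rows, batch_size=sample_rows)
    bytes_per_row = sample.nbytes / sample.num_rows
    total_rows = sum(f.metadata.num_rows for f in pq_ds.fragments)
    return int(bytes_per_row * total_rows)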

@c21 could you take a deeper look at what’s happening here?

That’s interesting. Did the number of blocks increase? That could hit the known data imbalance issue from https://github.com/ray-project/ray/issues/26878
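
For anyone checking that hypothesis, block counts are visible on the dataset itself (a quick sketch; the input path is hypothetical):

import ray

ds = ray.data.read_parquet("s3://bucket/path")  # hypothetical input path
print(ds.num_blocks())  # compare between the good and the bad run
print(ds.stats())       # per-stage breakdown, including block sizes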

@xwjiang2010 looks like you ended up running it inclusive 😄

Kicking off one with the commit before (8fe439998ecd48b1da5216d882123e6bde3b8fb7): https://buildkite.com/ray-project/release-tests-branch/builds/826

git log --oneline da9581b7465c5e1e4903595b4107ed1fa601920f..8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413

8ecd928c34 [Serve] Make the checkpoint and recover only from GCS (#26753)
193e824bc1 [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937)
1b06e7a83a [tune] Only sync down from cloud if needed (#26725)
4cc1ef1557 [Core] Refactoring Ray DAG object scanner (#26917)
df217d15e0 [air] Raise error on path-like access for Checkpoints (#26970)
5315f1e643 [AIR] Enable other notebooks previously marked with # REGRESSION (#26896)
5030a4c1d3 [RLlib] Simplify agent collector (#26803)
df638b3f0f [Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924)
0e1b77d52a [Workflow] Fix flaky example(#26960)
e8222ff600 [dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
aae0aaedbd [air] Un-revert “[air] remove unnecessary logs + improve repr for result” (#26942)
bf1d9971f1 [setup-dev] Add flag to skip symlink certain folders (#26899)
ec1995a662 [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (#26959)
de7bd015a4 [air/tune/docs] Change Tuner() occurences in rest of ray/tune (#26961)
3ea80f6aa1 [data] set iter_batches default batch_size (#26955)
b1594260ba [RLlib] Small SlateQ example fix. (#26948)
41c9ef709a [RLlib] Using PG when not doing microbatching kills A2C performance. (#26844)
794a81028b [ci] add repro-ci-requirements.txt (#26951)
bf97a6944b [Dashboard] Actor Table UI Optimize (#26785)
4d6cbb0fd4 [Java]More efficient getAllNodeInfo() (#26872)
abde2a5f97 [tune] Fix current best trial progress string for metric=0 (#26943)
4a1ad3e87a [Workflow] Support “retry_exceptions” of Ray tasks (#26913)
a012033033 [ci] pin werkzeug (#26950)
15b711ae6a [State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
1ac2a872e7 [docs] Editing pass over Dataset docs (#26935)
d01a80eb11 [core] runtime context resource ids getter (#26907)
acbab51d3e [Nightly] fix microbenchmark scripts (#26947)
0c16619475 [core] Make ray able to connect to redis without pip redis. (#25875)
8d7b865614 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931)
803c094534 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
008eecfbff [docs] Update the AIR data ingest guide (#26909)
e19cf164fd [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868)
8fe439998e [air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
c01bb831d4 [hotfix/data] Fix linter for test_split (#26944)
e9503dbe2b [RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793)
e9a8f7d9ae [RLlib] Unify gnorm mixin for tf and torch policies. (#26102)
c44d9ff397 [core] Fix the deadlock in submit task when actor failed. (#26898)
90cea203be Ray 2.0 API deprecation (#26116)
aaab4abad5 [Data][Split] stable version of split with hints (#26778)
37f4692aa8 [State Observability] Fix “No result for get crashing the formatting” and “Filtering not handled properly when key missing in the datum” #26881
d692a55018 [data] Make lazy mode non-experimental (#26934)
b32c784c7f [RLLib] RE3 exploration algorithm TF2 framework support (#25221)
bcec60d898 Revert "[data] set iter_batches default batch_size #26869 " (#26938)

The last working run was last Saturday at 3pm. The next commit after the last working commit: https://github.com/ray-project/ray/commit/da9581b7465c5e1e4903595b4107ed1fa601920f

Range: from da9581b7465c5e1e4903595b4107ed1fa601920f to 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413, both inclusive.