ray: [tune] A worker dies after 1000 samples
What is the problem?
Workers crash while testing the Hyperopt example script (https://docs.ray.io/en/latest/tune/examples/hyperopt_example.html) with a larger number of samples and fewer steps. The modified version of the script is attached below. The only changes are reducing the number of steps from 100 to 10 (to reach the issue faster) and increasing the number of samples from 1000 to 10000. The worker crashes happen just beyond 1000 samples, but at no fixed point.
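For quick reference, a minimal sketch of the only two settings that differ from the documented example (the full script is attached under Reproduction below):
tune_kwargs = {
    "num_samples": 10000,  # the documented example uses 1000
    "config": {
        "steps": 10,  # the documented example uses 100
        # width / height / activation search space unchanged; see the full script below.
    },
}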
The following error message appears when a worker crashes:
(pid=136534) F1009 07:02:26.344231 136534 136534 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was 192.168.1.98:39953
(pid=136534) *** Check failure stack trace: ***
(pid=136534) @ 0x7f56afdb7f5d google::LogMessage::Fail()
(pid=136534) @ 0x7f56afdb90bc google::LogMessage::SendToLog()
(pid=136534) @ 0x7f56afdb7c39 google::LogMessage::Flush()
(pid=136534) @ 0x7f56afdb7e51 google::LogMessage::~LogMessage()
(pid=136534) @ 0x7f56afd6eff9 ray::RayLog::~RayLog()
(pid=136534) @ 0x7f56afab2a5a ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(pid=136534) @ 0x7f56afab2b5f ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(pid=136534) @ 0x7f56afab2d01 ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(pid=136534) @ 0x7f56afab5071 ray::gcs::ServiceBasedGcsClient::Connect()
(pid=136534) @ 0x7f56afa36746 ray::CoreWorker::CoreWorker()
(pid=136534) @ 0x7f56afa3a484 ray::CoreWorkerProcess::CreateWorker()
(pid=136534) @ 0x7f56afa3b6f2 ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=136534) @ 0x7f56afa3c0bb ray::CoreWorkerProcess::Initialize()
(pid=136534) @ 0x7f56af975dce __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
(pid=136534) @ 0x7f56af9775b5 __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=136534) @ 0x556d5bd4e909 _PyObject_FastCallKeywords
(pid=136534) @ 0x556d5bdb439e _PyEval_EvalFrameDefault
(pid=136534) @ 0x556d5bcf7160 _PyEval_EvalCodeWithName
(pid=136534) @ 0x556d5bd47107 _PyFunction_FastCallKeywords
(pid=136534) @ 0x556d5bdb0585 _PyEval_EvalFrameDefault
(pid=136534) @ 0x556d5bcf6829 _PyEval_EvalCodeWithName
(pid=136534) @ 0x556d5bcf7714 PyEval_EvalCodeEx
(pid=136534) @ 0x556d5bcf773c PyEval_EvalCode
(pid=136534) @ 0x556d5be0ef14 run_mod
(pid=136534) @ 0x556d5be19331 PyRun_FileExFlags
(pid=136534) @ 0x556d5be19523 PyRun_SimpleFileExFlags
(pid=136534) @ 0x556d5be1a655 pymain_main
(pid=136534) @ 0x556d5be1a77c _Py_UnixMain
(pid=136534) @ 0x7f56b0fb90b3 __libc_start_main
(pid=136534) @ 0x556d5bdbeff0 (unknown)
A worker died or was killed while executing task ffffffffffffffff92c1835601000000.
The main job does not crash but hangs once all workers are dead. Killing the main job from the keyboard (Ctrl-C) results in:
^CTraceback (most recent call last):
  File "test.py", line 72, in <module>
    **tune_kwargs)
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/tune.py", line 405, in run
    runner.step()
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 375, in step
    self._process_events()  # blocking
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 475, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 463, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/worker.py", line 1558, in wait
    worker.current_task_id,
  File "python/ray/_raylet.pyx", line 939, in ray._raylet.CoreWorker.wait
  File "python/ray/_raylet.pyx", line 144, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/worker.py", line 784, in shutdown
    time.sleep(0.5)
KeyboardInterrupt
The file /tmp/ray/gcs_server.out is very large, so I only provide snippets in case they are helpful. The following snippet is representative of the log while the workers are still alive:
I1009 07:02:24.829182 112317 112317 gcs_worker_manager.cc:31] Reporting worker failure, worker id = e9f747a7936a76818495506b3f11c238bda56539, node id = 2041b2a79d53a7b4776d1553e99f625ccd26c15b, address = 192.168.1.98
W1009 07:02:24.829221 112317 112317 gcs_actor_manager.cc:544] Actor 35557ddd01000000 is out of scope,, destroying actor child.
I1009 07:02:24.829226 112317 112317 gcs_actor_manager.cc:558] Destroying actor, actor id = 35557ddd01000000
I1009 07:02:24.829228 112317 112317 gcs_actor_manager.cc:1065] Erasing actor 35557ddd01000000 owned by 01000000ffffffffffffffffffffffffffffffff
W1009 07:02:24.829443 112317 112317 gcs_worker_manager.cc:67] Failed to report worker failure, the worker doesn't exist, worker id = e9f747a7936a76818495506b3f11c238bda56539, node id = 2041b2a79d53a7b4776d1553e99f625ccd26c15b, address = 192.168.1.98
After the workers crash and the main job is stuck, the content becomes:
I1009 07:03:00.041805 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 92c1835601000000 on worker 7bb2509f320bd6f298c0f4a49bb9debecd2349e2 at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.041824 112317 112317 gcs_actor_scheduler.cc:322] Start creating actor 92c1835601000000 on worker 7bb2509f320bd6f298c0f4a49bb9debecd2349e2 at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.111589 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 7bb0b57001000000 on worker 3a3c473491f3af24c7eb1796c06b1ed67a13690e at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.111609 112317 112317 gcs_actor_scheduler.cc:322] Start creating actor 7bb0b57001000000 on worker 3a3c473491f3af24c7eb1796c06b1ed67a13690e at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.128782 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 4d95c66301000000 on worker af65091de52553f02a58a926af9ce296da7b45ac at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
Ray version and other system information (Python version, TensorFlow version, OS):
- OS: Ubuntu 20.04.1 LTS (Focal Fossa)
- Ray version: 1.0.0
- Python version: 3.7.9
- Hyperopt version: 0.2.5
Reproduction (REQUIRED)
Set up the conda environment:
conda create --name test_tune python=3.7
pip install 'ray[tune]' torch torchvision
(taken from the quickstart: https://docs.ray.io/en/latest/tune/index.html)
pip install hyperopt
Full pip freeze:
aiohttp==3.6.2
aiohttp-cors==0.7.0
aioredis==1.3.1
async-timeout==3.0.1
attrs==20.2.0
beautifulsoup4==4.9.3
blessings==1.7
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
cloudpickle==1.6.0
colorama==0.4.3
colorful==0.5.4
decorator==4.4.2
filelock==3.0.12
flake8==3.8.4
future==0.18.2
google==3.0.0
google-api-core==1.22.4
google-auth==1.22.1
googleapis-common-protos==1.52.0
gpustat==0.6.0
grpcio==1.32.0
hiredis==1.1.0
hyperopt==0.2.5
idna==2.10
importlib-metadata==2.0.0
jsonschema==3.2.0
mccabe==0.6.1
msgpack==1.0.0
multidict==4.7.6
networkx==2.5
numpy==1.19.2
nvidia-ml-py3==7.352.0
opencensus==0.7.10
opencensus-context==0.1.1
pandas==1.1.3
Pillow==7.2.0
prometheus-client==0.8.0
protobuf==3.13.0
psutil==5.7.2
py-spy==0.3.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pyflakes==2.2.0
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
ray==1.0.0
redis==3.4.1
requests==2.24.0
rsa==4.6
scipy==1.5.2
six==1.15.0
soupsieve==2.0.1
tabulate==0.8.7
tensorboardX==2.1
torch==1.6.0
torchvision==0.7.0
tqdm==4.50.2
typing-extensions==3.7.4.3
urllib3==1.25.10
yarl==1.6.0
zipp==3.3.0
Script to reproduce the issue:
"""This test checks that HyperOpt is functional.
It also checks that it is usable with a separate scheduler.
"""
import time
import ray
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch
def evaluation_fn(step, width, height):
return (0.1 + width * step / 100)**(-1) + height * 0.1
def easy_objective(config):
# Hyperparameters
width, height = config["width"], config["height"]
for step in range(config["steps"]):
# Iterative training function - can be any arbitrary training procedure
intermediate_score = evaluation_fn(step, width, height)
# Feed the score back back to Tune.
tune.report(iterations=step, mean_loss=intermediate_score)
time.sleep(0.1)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"--smoke-test", action="store_true", help="Finish quickly for testing")
args, _ = parser.parse_known_args()
ray.init(configure_logging=False)
current_best_params = [
{
"width": 1,
"height": 2,
"activation": 0 # Activation will be relu
},
{
"width": 4,
"height": 2,
"activation": 1 # Activation will be tanh
}
]
tune_kwargs = {
"num_samples": 10 if args.smoke_test else 10000,
"config": {
"steps": 10,
"width": tune.uniform(0, 20),
"height": tune.uniform(-100, 100),
# This is an ignored parameter.
"activation": tune.choice(["relu", "tanh"])
}
}
algo = HyperOptSearch(points_to_evaluate=current_best_params)
algo = ConcurrencyLimiter(algo, max_concurrent=4)
scheduler = AsyncHyperBandScheduler()
tune.run(
easy_objective,
search_alg=algo,
scheduler=scheduler,
metric="mean_loss",
mode="min",
**tune_kwargs)
The script is a small variation of the example at https://docs.ray.io/en/latest/tune/examples/hyperopt_example.html, with the only changes being the num_samples increase to 10000 and the steps decrease to 10.
Save the script as test.py and run python test.py. Expect the workers to start crashing after about 1000 samples. The exact point at which they crash is not reproducible, but the fact that they crash is. I was never able to complete a full run over 10000 samples; on my setup the workers tend to crash between 1000 and 2000 samples (trials). For example, in the last attempt the workers started crashing just after I saw the message .. 1091 more trials not shown (1091 TERMINATED) on screen.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Comments: 28 (14 by maintainers)
Yep, the Tune slowdown is tracked in https://github.com/ray-project/ray/issues/11233
After more investigation, I don’t believe this is really a leak of gRPC clients (I added constructor/destructor logging hooks and the number of gRPC clients remains bounded, and we seem to be cleaning up properly in all cases).
I believe what’s going on is that a large number of connections are stuck in the TIME_WAIT state while closing: https://github.com/grpc/grpc/issues/3855. This means that after an actor is destroyed, the sockets used for communicating with it may remain active for up to 120s: https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux. There are some kernel options to recycle sockets, but I didn’t have much success with them.
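For anyone reproducing this, here is a minimal diagnostic sketch (not part of Ray; it only uses psutil, which is already in the pip freeze above) to watch how many sockets sit in TIME_WAIT while the trials churn:
from collections import Counter

import psutil


def tcp_state_counts():
    # System-wide TCP connections; on Linux this works without root,
    # though the owning PID may not be resolvable for other users' sockets.
    return Counter(c.status for c in psutil.net_connections(kind="tcp"))


if __name__ == "__main__":
    counts = tcp_state_counts()
    print("TIME_WAIT sockets:", counts.get(psutil.CONN_TIME_WAIT, 0))
    print("all TCP states:", dict(counts))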
Since 1.0 manages actors from the GCS server, this at least doubles the number of connections per actor: you need one from the driver to the actor, and another from the GCS to the actor. We also seem to be opening more than one connection in the GCS, so the overall number of sockets is now 2-3x what it was before, bumping into FD limits.
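A similarly rough sketch for checking how close the Ray processes are to their descriptor limits; the process names ("gcs_server", "raylet") are an assumption based on what Ray 1.0 starts locally:
import psutil

RAY_PROC_NAMES = {"gcs_server", "raylet"}  # assumed names of the Ray backend processes

for proc in psutil.process_iter(["name"]):
    try:
        if proc.info["name"] in RAY_PROC_NAMES:
            # num_fds() is POSIX-only and counts all open descriptors, sockets included.
            print(proc.info["name"], proc.pid, "open fds:", proc.num_fds())
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass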
I’m not sure what the best solution for this is, but increasing the ulimit is probably the way to go in the short term.
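A minimal sketch of that short-term workaround from Python, assuming the driver is the process running out of descriptors. Note it only raises the limit for the process that runs it; raising ulimit -n in the shell before starting Ray is the more general knob, since it also covers the raylet/GCS processes:
import resource

# The soft limit can only be raised up to the hard limit without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current RLIMIT_NOFILE:", soft, hard)

# Pick a target comfortably above the expected socket count (65536 is an arbitrary example).
target = 65536 if hard == resource.RLIM_INFINITY else min(hard, 65536)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))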
This seems like a P0 since it’s a Ray core crash. cc @rkooo567 @stephanie-wang
(assuming it’s a new issue in 1.0)