ray: [Ray Tune, Core] Ray Tune self-terminates at ~100 trials on Windows

What happened + What you expected to happen

Running a Ray Tune experiment with PyTorch and num_samples set to more than 100 trials results in Ray self-terminating at around 100 trials due to a SIGINT. On resuming the experiment, it will self-terminate again after roughly another 100 trials. This only happens on Windows; the same script runs to completion on Linux.

Expected behaviour: Ray Tune completes all trials without a SIGINT in between.

Useful information: The error message can vary depending on where the run terminates; sometimes it just prints the stack trace at the point where it stopped.

However, sometimes it returns with:

2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

Versions / Dependencies

Ray Tune 1.13.0, PyTorch 1.11.0, Windows 10/11

Reproduction script

This can be reproduced with the example PyTorch script from the Ray Tune documentation:

from __future__ import annotations

from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/.data.lock")):
        trainset = torchvision.datasets.CIFAR10(
            root=data_dir, train=True, download=True, transform=transform)

        testset = torchvision.datasets.CIFAR10(
            root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
def train_cifar(config, checkpoint_dir=None):
    project_dir = os.getenv('TUNE_ORIG_WORKING_DIR')

    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    # The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
    # should be restored.
    if checkpoint_dir:
        checkpoint = os.path.join(checkpoint_dir, "checkpoint")
        model_state, optimizer_state = torch.load(checkpoint)
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    data_dir = os.path.join(project_dir, "data")
    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and will potentially be passed as the `checkpoint_dir`
        # parameter in future iterations.
        with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save(
                (net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")
    
    

def test_best_model(best_trial):
    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    best_trained_model.to(device)

    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")

    model_state, optimizer_state = torch.load(checkpoint_path)
    best_trained_model.load_state_dict(model_state)

    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = best_trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()


    print("Best trial test set accuracy: {}".format(correct / total))


def main(num_samples=10, max_num_epochs=10, gpus_per_trial: int|float=2):
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    result = tune.run(
        tune.with_parameters(train_cifar),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        metric="loss",
        mode="min",
        num_samples=num_samples,
        scheduler=scheduler,
        verbose=2,
        local_dir='/e/GRF_outputs2',
        name='train_cifar_2022-08-08_14-32-23',
        resume=True
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    if ray.util.client.ray.is_connected():
        # If using Ray Client, we want to make sure checkpoint access
        # happens on the server. So we wrap `test_best_model` in a Ray task.
        # We have to make sure it gets executed on the same node that
        # ``tune.run`` is called on.
        from ray.util.ml_utils.node import force_on_current_node
        remote_fn = force_on_current_node(ray.remote(test_best_model))
        ray.get(remote_fn.remote(best_trial))
    else:
        test_best_model(best_trial)


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=250, max_num_epochs=2, gpus_per_trial=0.5)

The most stripped-down script I’ve attempted without losing the machine-learning part is:

import logging
import os
import ray
import torch
from ray import tune
from ray.tune import CLIReporter
from torch.utils.data import TensorDataset, DataLoader

from torch import nn


class DummyModel(nn.Module):
    def __init__(self, a, output_size, channel_n=9):
        super().__init__()
        self.output_size=output_size

        self.layers = nn.Sequential(nn.Linear(1,a),
                                    nn.ReLU(inplace=True),
                                    nn.Linear(a,output_size))

        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = x.view(-1, 1)
        out = self.layers(x)
        return self.softmax(out)

os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"

logger = logging.getLogger(__name__)


def get_dataloader(batch_size=50, num_workers=4):
    class1 = [2, 4, 6, 8, 10, 12, 14, 16]
    class2 = [3, 6, 9, 12, 15, 18, 21, 24, 27]
    class3 = [1, 5, 7, 11, 13, 17, 19]

    data_x = torch.tensor(class1 + class2 + class3, dtype=torch.float)
    data_y = torch.tensor(len(class1) * [0] + len(class2) * [1] + len(class3) * [2])
    dataset = TensorDataset(data_x, data_y)

    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)


def train_func(config, epochs=2):
    # cwd is changed to the trial folder
    batch_size = config["batch_size"]
    weight_decay = config['weight_decay']
    a = config["a"]
    lr = config['lr']

    # setup code
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # setup dataloaders
    data_loader = get_dataloader(batch_size, num_workers=4)

    output_size = 3
    channel_n = 9
    # setup model
    model = DummyModel(a=a, output_size=output_size, channel_n=channel_n)

    model = model.to(device)

    criterion = torch.nn.NLLLoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay, amsgrad=True)

    for epoch in range(0, epochs):  # loop over the dataset multiple times
        log = {}
        model.train()
        for i, data in enumerate(data_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, targets = data
            inputs, targets = inputs.to(device), targets.to(device)

            optimizer.zero_grad()

            # forward + backward
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            loss.backward()
            optimizer.step()
            log = {"loss": loss.item()}

        # log metrics, log in checkpoint in case actor dies half way
        tune.report(**log)


def main():
    tune.run(
        train_func,
        num_samples=250,
        metric='loss',
        mode="min",
        resources_per_trial={"cpu": 4, "gpu": 0.15},
        local_dir="outputs/tune-dummy",
        verbose=2,
        config={
            "lr": ray.tune.loguniform(0.0001, 0.01),
            "weight_decay": ray.tune.loguniform(0.001, 0.01),
            "batch_size": ray.tune.choice([128, 256]),
            "a": ray.tune.randint(1, 30)
        },
        progress_reporter=CLIReporter(metric_columns=["training_iteration", "loss"])
    )

    print("I finished!")


if __name__ == '__main__':
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 40 (27 by maintainers)

Most upvoted comments

@stephano41 could you try with a nightly?

@mattip I had the same error using Ray’s latest release, but it stopped appearing on the nightly build. Just FYI 😃

I’m encountering this issue when using ray-1.13.0 through autogluon-0.5.2 on Windows / Python 3.9.3, everything installed with pip (no conda usage):

WARNING worker.py:1404 -- The log monitor on node DESKTOP-X4SMI53 failed with the following error:
OSError: [WinError 87] The parameter is incorrect

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

Traceback (most recent call last):
  ...
  File "X:\gt\.pyenv\lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 421, in after_all_folds_scheduled
    finished, unfinished = self.ray.wait(unfinished, num_returns=1)
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "X:\gt\.pyenv\lib\site-packages\ray\worker.py", line 2015, in wait
    ready_ids, remaining_ids = worker.core_worker.wait(
  File "python\ray\_raylet.pyx", line 1403, in ray._raylet.CoreWorker.wait
  File "python\ray\_raylet.pyx", line 169, in ray._raylet.check_status
KeyboardInterrupt
^C
(.pyenv) X:\gt>

I did not press Ctrl+C, even though the stack trace above suggests that I did.

It only seems to happen if there are other Python processes running that were not started by, and are unrelated to, Ray. Windows is known to reuse PIDs, so could it be some sort of race condition? A Ray process exits, an unrelated Python process (for example a short-lived VS Code pylint helper process) starts and gets assigned the same PID as the former Ray process, and the Ray orchestrator somehow gets confused.

If I make sure no other Python processes are running except the AutoGluon one that uses Ray, the issue does not seem to happen (or it happens much less frequently).

There is also a Python bug which sounds like it might be related, about os.kill failing with the same WinError 87 in relation to Ctrl+C: https://bugs.python.org/issue42962
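
For what it’s worth, here is a minimal sketch of a PID-reuse-tolerant liveness probe (not Ray’s actual fix, and it assumes psutil is available on Windows). The point is that the POSIX os.kill(pid, 0) idiom the log monitor uses does not translate to Windows, where signal number 0 is a console control event rather than a harmless existence check:

import errno
import os
import sys


def pid_exists(pid: int) -> bool:
    """Illustrative 'is this PID alive?' probe, portable across platforms.

    On POSIX, os.kill(pid, 0) sends no signal and merely raises if the PID
    is gone. On Windows, 0 is signal.CTRL_C_EVENT, so the same call is
    treated as a console control event and can fail with WinError 87 when
    the PID is stale or has been reused by an unrelated process.
    """
    if sys.platform == "win32":
        # psutil consults the process table directly and never sends a signal.
        import psutil  # assumption: psutil is installed
        return psutil.pid_exists(pid)
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # ESRCH: no such process. EPERM: it exists but is owned by another user.
        return exc.errno == errno.EPERM
    return True

Note that even a probe like this cannot distinguish a reused PID from the original worker; only something like a (PID, process start time) pair can do that reliably.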

Updating PyTorch to 1.13 did not solve the problem.

conda install -c pytorch pytorch==1.12 torchvision=0.13
REM fix the "paging file too small problem"
del  d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll_bak
python d:\temp\fixNvPe.py --input  d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll

Carefully watching Task Manager at the time of the failure, I see a spike in GPU utilization. It is pretty steady at 20% but then hits 100%. I don’t know if that is the cause, or a side effect, of the failure.

In the first script, after setting gpus_per_trial=0, it runs all 250 trials without failing in 5 hr 40 min. I wonder if there is a problem with the torchvision GPU code causing a failure.
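
(For reference, that CPU-only run is just the first script’s entry point invoked with gpus_per_trial=0, so every trial is placed with {"cpu": 2, "gpu": 0}:)

if __name__ == "__main__":
    # CPU-only variant of the first reproduction script; no trial touches the GPU.
    main(num_samples=250, max_num_epochs=2, gpus_per_trial=0)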

So I see this in the log that you posted, @mattip:

2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
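
That warning and the WinError 87 traceback may well be the same event seen from two sides. On Windows, os.kill() treats signal numbers 0 and 1 as console control events (per the os.kill documentation), so the log monitor’s os.kill(worker_pid, 0) liveness check is effectively a request to deliver Ctrl+C to console process group worker_pid; depending on what that PID currently refers to, it either fails with WinError 87 or could surface as the stop signal Tune reports. A quick sketch of the constant overlap (Windows only):

import signal
import sys

if sys.platform == "win32":
    # Signal numbers 0 and 1 are the two console control events on Windows,
    # so os.kill(pid, 0) is routed through GenerateConsoleCtrlEvent rather
    # than acting as a POSIX-style existence probe.
    print(signal.CTRL_C_EVENT)      # 0
    print(signal.CTRL_BREAK_EVENT)  # 1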