ray: [Ray Tune, Core] Ray Tune self-terminates at ~100 trials on Windows
What happened + What you expected to happen
Running a Ray Tune experiment with PyTorch and num_samples set to more than 100 trials results in Ray self-terminating at around 100 trials due to a SIGINT. On resuming the experiment, it self-terminates again after roughly another 100 trials. This only happens on Windows; the same script runs to completion on Linux.
Expected behaviour: Ray Tune completes all trials without a SIGINT in between.
Useful information: the error message can vary depending on where it terminates; sometimes it just prints the stack trace at the point where it terminated.
However, sometimes it returns with:
2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set
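For context on the traceback above: close_all_files probes whether a worker is still alive with os.kill(file_info.worker_pid, 0). On POSIX, signal 0 performs error checking only, but on Windows the value 0 coincides with signal.CTRL_C_EVENT, so CPython routes the call through GenerateConsoleCtrlEvent rather than treating it as a pure existence check; that call can fail with WinError 87 and may also be how a stray Ctrl+C reaches the Tune driver. The sketch below is only an illustration of a probe that avoids os.kill on Windows; the use of psutil is an assumption for illustration, not necessarily how Ray should fix it:

import os
import psutil  # assumption: used here only to illustrate an alternative probe


def worker_is_alive(worker_pid: int) -> bool:
    """Sketch of a liveness check that avoids os.kill(pid, 0) on Windows."""
    if os.name == "nt":
        # Checks process existence without generating any console control
        # event, so it does not run into the WinError 87 failure shown above.
        return psutil.pid_exists(worker_pid)
    try:
        # POSIX: signal 0 performs error checking only; nothing is delivered.
        os.kill(worker_pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True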
Versions / Dependencies
Ray Tune 1.13.0, PyTorch 1.11.0, Windows 10/11
Reproduction script
This can be reproduced with the example PyTorch script from the Ray Tune documentation:
from __future__ import annotations
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler


def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/.data.lock")):
        trainset = torchvision.datasets.CIFAR10(
            root=data_dir, train=True, download=True, transform=transform)
        testset = torchvision.datasets.CIFAR10(
            root=data_dir, train=False, download=True, transform=transform)
    return trainset, testset


class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def train_cifar(config, checkpoint_dir=None):
    project_dir = os.getenv('TUNE_ORIG_WORKING_DIR')
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    # The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
    # should be restored.
    if checkpoint_dir:
        checkpoint = os.path.join(checkpoint_dir, "checkpoint")
        model_state, optimizer_state = torch.load(checkpoint)
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    data_dir = os.path.join(project_dir, "data")
    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and will potentially be passed as the `checkpoint_dir`
        # parameter in future iterations.
        with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save(
                (net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")


def test_best_model(best_trial):
    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    best_trained_model.to(device)

    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")

    model_state, optimizer_state = torch.load(checkpoint_path)
    best_trained_model.load_state_dict(model_state)

    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)

            outputs = best_trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print("Best trial test set accuracy: {}".format(correct / total))


def main(num_samples=10, max_num_epochs=10, gpus_per_trial: int | float = 2):
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    result = tune.run(
        tune.with_parameters(train_cifar),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        metric="loss",
        mode="min",
        num_samples=num_samples,
        scheduler=scheduler,
        verbose=2,
        local_dir='/e/GRF_outputs2',
        name='train_cifar_2022-08-08_14-32-23',
        resume=True
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    if ray.util.client.ray.is_connected():
        # If using Ray Client, we want to make sure checkpoint access
        # happens on the server. So we wrap `test_best_model` in a Ray task.
        # We have to make sure it gets executed on the same node that
        # ``tune.run`` is called on.
        from ray.util.ml_utils.node import force_on_current_node
        remote_fn = force_on_current_node(ray.remote(test_best_model))
        ray.get(remote_fn.remote(best_trial))
    else:
        test_best_model(best_trial)


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=250, max_num_epochs=2, gpus_per_trial=0.5)
The most stripped-down script I have attempted without losing the machine learning component is:
import logging
import os

import ray
import torch
from ray import tune
from ray.tune import CLIReporter
from torch.utils.data import TensorDataset, DataLoader
from torch import nn


class DummyModel(nn.Module):
    def __init__(self, a, output_size, channel_n=9):
        super().__init__()
        self.output_size = output_size
        self.layers = nn.Sequential(nn.Linear(1, a),
                                    nn.ReLU(inplace=True),
                                    nn.Linear(a, output_size))
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = x.view(-1, 1)
        out = self.layers(x)
        return self.softmax(out)


os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"
logger = logging.getLogger(__name__)


def get_dataloader(batch_size=50, num_workers=4):
    class1 = [2, 4, 6, 8, 10, 12, 14, 16]
    class2 = [3, 6, 9, 12, 15, 18, 21, 24, 27]
    class3 = [1, 5, 7, 11, 13, 17, 19]
    data_x = torch.tensor(class1 + class2 + class3, dtype=torch.float)
    data_y = torch.tensor(len(class1) * [0] + len(class2) * [1] + len(class3) * [2])
    dataset = TensorDataset(data_x, data_y)
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)


def train_func(config, epochs=2):
    # cwd is changed to the trial folder
    batch_size = config["batch_size"]
    weight_decay = config['weight_decay']
    a = config["a"]
    lr = config['lr']

    # setup code
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # setup dataloaders
    data_loader = get_dataloader(batch_size, num_workers=4)
    output_size = 3
    channel_n = 9

    # setup model
    model = DummyModel(a=a, output_size=output_size, channel_n=channel_n)
    model = model.to(device)
    criterion = torch.nn.NLLLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay, amsgrad=True)

    for epoch in range(0, epochs):  # loop over the dataset multiple times
        log = {}
        model.train()
        for i, data in enumerate(data_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, targets = data
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()

            # forward + backward
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            log = {"loss": loss.item()}

        # log metrics, log in checkpoint in case actor dies half way
        tune.report(**log)


def main():
    tune.run(
        train_func,
        num_samples=250,
        metric='loss',
        mode="min",
        resources_per_trial={"cpu": 4, "gpu": 0.15},
        local_dir="outputs/tune-dummy",
        verbose=2,
        config={
            "lr": ray.tune.loguniform(0.0001, 0.01),
            "weight_decay": ray.tune.loguniform(0.001, 0.01),
            "batch_size": ray.tune.choice([128, 256]),
            "a": ray.tune.randint(1, 30)
        },
        progress_reporter=CLIReporter(metric_columns=["training_iteration", "loss"])
    )
    print("I finished!")


if __name__ == '__main__':
    main()
Issue Severity
Medium: It is a significant difficulty but I can work around it.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 40 (27 by maintainers)
@mattip I had the same error using Ray’s latest release, but it stopped appearing on the nightly build. Just FYI 😃
I’m encountering this issue when using ray-1.13.0 through autogluon-0.5.2 on Windows / Python 3.9.3, everything installed with pip (no conda usage). I did not press Ctrl+C, as the above stack trace seems to suggest.
It only seems to happen if there are other Python processes running that were not started by or related to ray. Windows is known to reuse pids, so could it be some sort of race condition: a ray process exits, an unrelated Python process starts (for example a short-lived VS Code pylint helper process) and gets assigned the same pid as the former ray process, and the ray orchestrator somehow gets confused? If I make sure no other Python processes are running except the autogluon one which uses ray, the issue does not seem to happen (or maybe it happens much less frequently). There is also a Python bug which sounds like it might be related, about os.kill failing with the same WinError 87 in relation to Ctrl+C: https://bugs.python.org/issue42962
Updating PyTorch to 1.13 did not solve the problem.
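To illustrate the pid-reuse concern above: a bare pid check cannot tell the original ray worker apart from an unrelated process that later receives the same recycled pid, but pairing the pid with the process creation time can. This is only a hypothetical sketch; the helper names and the use of psutil are assumptions, not Ray's implementation:

import psutil


def remember_worker(pid: int) -> tuple[int, float]:
    # Record (pid, create_time) when the worker is first seen.
    return pid, psutil.Process(pid).create_time()


def same_worker_still_alive(pid: int, create_time: float) -> bool:
    # True only if the pid exists AND still refers to the original process;
    # a recycled pid will have a different creation time.
    try:
        return psutil.Process(pid).create_time() == create_time
    except psutil.NoSuchProcess:
        return False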
Carefully watching Task Manager at the time of the failure, I see a spike in GPU % utilization. It is pretty steady at 20% but then hits 100%. I don’t know if that is the cause, or if it is a side effect of the failure.
In the first script, after setting gpus_per_trial=0, it runs all 250 trials without failing in 5 hr 40 min. I wonder if there is a problem with the torchvision GPU code causing the failure.
So I see this
2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
in the log that you posted @mattip …
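Until the root cause is fixed, one crude driver-side workaround is to keep resuming until every trial reaches a terminal state, since each spurious stop checkpoints the experiment before exiting. This is only a sketch under assumptions: run_tune is a hypothetical wrapper around the tune.run(...) call from the first script (same fixed name and local_dir, with resume passed through), and the status strings follow Ray Tune's trial statuses:

def all_trials_done(analysis) -> bool:
    # A trial can no longer make progress once it is TERMINATED or ERROR.
    return all(t.status in ("TERMINATED", "ERROR") for t in analysis.trials)


def run_until_complete():
    resume = False
    while True:
        # run_tune is assumed to wrap the tune.run(...) call from the first
        # script, forwarding `resume` and keeping name/local_dir fixed.
        analysis = run_tune(resume=resume)
        if all_trials_done(analysis):
            break
        resume = True  # pick the checkpointed experiment state back up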