gluonts: When I use the #898 code, there are some problems
Description
Previously, with gluon-ts 0.5.0, I could not set num_workers > 1. The other day I saw the code lostella released in #898 and pulled it. With that code, num_workers can be set to a value greater than 1, but I ran into two problems. Problem 1: training ends when the epoch count reaches about 200, even though epochs is set to 10000. Problem 2: a CUDA initialization error occurs when training starts on the next target.
I wrote a demo that reproduces the problem and will upload some of the data. Thanks, @lostella!
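One possible explanation for Problem 1, offered as an assumption and not confirmed anywhere in this report: the GluonTS `Trainer` accepts `patience`, `learning_rate_decay_factor`, and `minimum_learning_rate` arguments, and if training stops once the learning rate has decayed to its floor and the loss keeps stalling, a run can end well before `epochs`. The sketch below only does the arithmetic. The decay factor 0.5 and the floor 5e-5 are assumed values that should be checked against the installed version; `reductions_until_floor` is a hypothetical helper, not a GluonTS function.

```python
def reductions_until_floor(lr, decay_factor, min_lr):
    """Count the multiplicative decays needed for lr to reach min_lr."""
    steps = 0
    while lr > min_lr:
        # Clamp at the floor, as a decay-on-plateau schedule typically does.
        lr = max(min_lr, lr * decay_factor)
        steps += 1
    return steps


# learning_rate=1e-2 comes from the demo; 0.5 and 5e-5 are assumptions.
n_reductions = reductions_until_floor(1e-2, 0.5, 5e-5)
print(n_reductions)
```

If each reduction requires at least `patience` epochs without improvement, a handful of reductions would be compatible with a stop somewhere in the low hundreds of epochs. Passing a smaller `minimum_learning_rate` or a larger `patience` to `Trainer` would be one way to test this hypothesis.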
To Reproduce
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import os

import mxnet as mx
import numpy as np
import pandas as pd
from gluonts.dataset import common
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model import deepar
from gluonts.trainer import Trainer


def model_train(df):
    param_list = {
        'epochs': [10000],
        'num_layers': [4],
        'learning_rate': [1e-2],
        'mini_batch_size': [32],
        'num_cells': [40],
        'cell_type': ['lstm'],
    }
    prediction_length = 12
    freq = '2H'
    re_day = 7
    train_time = df.iloc[-1 + (-24) * re_day].monitor_time
    end_time = df.iloc[-1].monitor_time
    test_time = pd.date_range(start=train_time, end=end_time, freq='H')[1:]
    df = df.set_index('monitor_time')
    model_i = 0
    a = []
    for i, _ in enumerate(test_time):
        a.append({"start": df.index[0], "target": df.Measured[:str(test_time[i])]})
    data = common.ListDataset([{"start": df.index[0],
                                "target": df.Measured[:train_time]}],
                              freq=freq)
    val_data = common.ListDataset(a, freq=freq)
    for epochs_i in param_list['epochs']:
        for batch_i in param_list['mini_batch_size']:
            for lr_i in param_list['learning_rate']:
                for cells_i in param_list['num_cells']:
                    for layers_i in param_list['num_layers']:
                        for type_i in param_list['cell_type']:
                            estimator = deepar.DeepAREstimator(
                                prediction_length=prediction_length,
                                context_length=prediction_length,
                                freq=freq,
                                num_layers=layers_i,
                                num_cells=cells_i,
                                cell_type=type_i,
                                trainer=Trainer(
                                    ctx=mx.gpu(),
                                    epochs=epochs_i,
                                    learning_rate=lr_i,
                                    hybridize=True,
                                    batch_size=batch_i,
                                ),
                            )
                            predictor = estimator.train(training_data=data, num_workers=2, num_prefetch=96)
                            forecast_it, ts_it = make_evaluation_predictions(val_data, predictor=predictor,
                                                                             num_samples=100)
                            forecasts = list(forecast_it)
                            tss = list(ts_it)
                            evaluator = Evaluator(quantiles=[0.5], seasonality=2016)
                            agg_metrics, item_metrics = evaluator(iter(tss), iter(forecasts), num_series=len(val_data))
                            if model_i == 0:
                                df_metrics = pd.DataFrame(columns=list(agg_metrics))
                            values_metrics = []
                            for k in agg_metrics:
                                values_metrics.append(agg_metrics[k])
                            df_metrics.loc[model_i, :] = values_metrics
                            model_i = model_i + 1
    best_model_ind = np.argmin(df_metrics['RMSE'].values)
    print('The best model index is {}, mae {}, rmese {}'.format(
        best_model_ind, df_metrics.loc[best_model_ind, 'abs_error'] / prediction_length,
        df_metrics.loc[best_model_ind, 'RMSE']))
    return df_metrics, best_model_ind
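

def file_name_get(item, spe_file):
    # Return the matching file names from the top-level directory only.
    for root, dirs, files in os.walk(spe_file):
        file = []
        for i in files:
            if item in i:
                file.append(i)
        return file


if __name__ == '__main__':
    data_file = 'data'
    files = file_name_get('data', data_file)
    for file in files:
        # os.walk yields bare file names, so join them with the directory.
        df = pd.read_csv(os.path.join(data_file, file))
        df_metrics, best_model_ind = model_train(df)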
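As a side note on the script above, the six nested loops over `param_list` can be flattened with `itertools.product`. This is only a minimal sketch: `iter_param_grid` is a hypothetical helper that is not part of the original demo or of GluonTS, and it visits the same combinations in dictionary key order, not in the nesting order used above.

```python
from itertools import product

param_list = {
    'epochs': [10000],
    'num_layers': [4],
    'learning_rate': [1e-2],
    'mini_batch_size': [32],
    'num_cells': [40],
    'cell_type': ['lstm'],
}


def iter_param_grid(grid):
    """Yield one dict per combination of the value lists in grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_param_grid(param_list))
for cfg in configs:
    # Each cfg holds one value per hyperparameter, e.g. cfg['num_cells'].
    print(cfg)
```

Each `cfg` could then feed the estimator and trainer arguments in a single loop body, which keeps the training and evaluation code at one indentation level.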
Error message or code output
100%|███| 50/50 [00:01<00:00, 31.63it/s, epoch=194/10000, avg_epoch_loss=-.0183]
100%|███| 50/50 [00:01<00:00, 31.55it/s, epoch=195/10000, avg_epoch_loss=0.0884]
WARNING:root:Serializing RepresentableBlockPredictor instances does not save the prediction network structure in a backwards-compatible manner. Be careful not to use this method in production.
Running evaluation: 100%|████████████████████| 336/336 [00:02<00:00, 112.13it/s]
Train process 0 ,Epochs 10000, Batch_size: 32, Learning_rate: 0.01, Num_cells: 40, Num_layers: 4, cell_type: lstm
0%| | 0/50 [00:00<?, ?it/s][06:25:22] src/engine/threaded_engine_perdevice.cc:101: Ignore CUDA Error [06:25:22] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess: CUDA: initialization error
Stack trace:
[bt] (0) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6b8b5b) [0x7ff3de97fb5b]
[bt] (1) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37ab842) [0x7ff3e1a72842]
[bt] (2) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37ceece) [0x7ff3e1a95ece]
[bt] (3) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c19d1) [0x7ff3e1a889d1]
[bt] (4) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b74a1) [0x7ff3e1a7e4a1]
[bt] (5) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b83f4) [0x7ff3e1a7f3f4]
[bt] (6) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x3c2) [0x7ff3e1cada42]
[bt] (7) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6bc30a) [0x7ff3de98330a]
[bt] (8) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArrayFree+0x54) [0x7ff3e19e89c4]
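The trace above appears right after the second training run starts with `num_workers=2`. One common cause of "CUDA: initialization error" in general is forking worker processes from a parent that already holds a CUDA context; whether that is what happens here is an assumption and is not confirmed in this report. Under that assumption, a possible workaround is to run each dataset in a fresh Python interpreter so that no GPU state carries over between runs. In the sketch below, `run_isolated` is a hypothetical helper and `train_one.py` is a hypothetical script name.

```python
import subprocess
import sys


def run_isolated(argv):
    """Run one job in a fresh Python interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, *argv], check=False)
    return completed.returncode


# Hypothetical usage, one interpreter per CSV file:
#     for path in csv_paths:
#         run_isolated(["train_one.py", path])
```

The trade-off is that each run pays the interpreter and library start-up cost again, in exchange for a clean process per target.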
Environment
- Operating system: Ubuntu 18.04.2
- Python version: Python 3.7.4
- GluonTS version: the code released in #898
- MXNet version: mxnet-cu101 1.6.0
- CPU cores: 14
- GPU information: 3b:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 PCIe] (rev a1) b1:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 PCIe] (rev a1)
About this issue
- State: closed
- Created 4 years ago
- Comments: 22 (11 by maintainers)
That's helpful. Thanks!
But another problem came up:
When I add this code, the Evaluator raises an error:
OK, I will try it and report the result here. Thanks!
I wrote a small demo to test it, but it still gave me an error. Could it be that there is not enough shared memory, because the shared-memory allocation is too large each time it is created? I don't know.
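To check the shared-memory hypothesis, one option is to look at how much space is free on the shared-memory filesystem before and during training. This is a minimal sketch that assumes a Linux system where shared memory is mounted at `/dev/shm`; `disk_usage_gib` is a hypothetical helper, not part of GluonTS or MXNet.

```python
import shutil


def disk_usage_gib(path="/dev/shm"):
    """Return total, used, and free space of the filesystem at path, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }
```

Calling `disk_usage_gib()` once before training and again while the data-loader workers are running would show whether free space drops toward zero between runs.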
This is data:
deepar_test.zip
This is my code:
Error log: