gluonts: Problems when using the code from #898

Description

Previously, with gluon-ts 0.5.0, I could not set num_workers > 1. I saw the code that lostella released in #898 the other day and pulled it to try it out. num_workers can now be set to a value greater than 1, but there are two problems. Problem 1: training ends when the epoch count reaches about 200, even though epochs is set to 10000. Problem 2: a CUDA initialization error occurs when training starts for the next target.

I made a demo to reproduce the problem and attached some of the data. Thanks! @lostella

data.zip

To Reproduce

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import os
import mxnet as mx
import numpy as np
import pandas as pd
from gluonts.dataset import common
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model import deepar
from gluonts.trainer import Trainer



def model_train(df):
    param_list = {
            'epochs': [10000],
            'num_layers': [4],
            'learning_rate': [1e-2],
            'mini_batch_size': [32],
            'num_cells': [40],
            'cell_type': ['lstm'],
        }
    prediction_length = 12
    freq = '2H'
    re_day = 7
    train_time = df.iloc[-1 + (-24) * re_day].monitor_time
    end_time = df.iloc[-1].monitor_time
    test_time = pd.date_range(start=train_time, end=end_time, freq='H')[1:]
    df = df.set_index('monitor_time')
    model_i = 0
    a = []
    for i, _ in enumerate(test_time):
        a.append({"start": df.index[0], "target": df.Measured[:str(test_time[i])]})
    data = common.ListDataset([{"start": df.index[0],
                                    "target": df.Measured[:train_time]}],
                                  freq=freq)

    val_data = common.ListDataset(a, freq=freq)
    for epochs_i in param_list['epochs']:
        for batch_i in param_list['mini_batch_size']:
            for lr_i in param_list['learning_rate']:
                for cells_i in param_list['num_cells']:
                    for layers_i in param_list['num_layers']:
                        for type_i in param_list['cell_type']:
                            estimator = deepar.DeepAREstimator(
                                prediction_length=prediction_length,
                                context_length=prediction_length,
                                freq=freq,
                                num_layers=layers_i,
                                num_cells=cells_i,
                                cell_type=type_i,

                                trainer=Trainer(
                                    ctx=mx.gpu(),
                                    epochs=epochs_i,
                                    learning_rate=lr_i,
                                    hybridize=True,
                                    batch_size=batch_i,
                                ),
                            )

                            predictor = estimator.train(training_data=data, num_workers=2, num_prefetch=96)
                            forecast_it, ts_it = make_evaluation_predictions(val_data, predictor=predictor,
                                                                             num_samples=100)
                            forecasts = list(forecast_it)
                            tss = list(ts_it)

                            evaluator = Evaluator(quantiles=[0.5], seasonality=2016)
                            agg_metrics, item_metrics = evaluator(iter(tss), iter(forecasts), num_series=len(val_data))

                            if model_i == 0:
                                df_metrics = pd.DataFrame(columns=list(agg_metrics))

                            values_metrics = []
                            for k in agg_metrics:
                                values_metrics.append(agg_metrics[k])

                            df_metrics.loc[model_i, :] = values_metrics
                            model_i = model_i + 1


    best_model_ind = np.argmin(df_metrics['RMSE'].values)
    print('The best model index is {}, mae {}, rmse {}'.format(
        best_model_ind, df_metrics.loc[best_model_ind, 'abs_error'] / prediction_length,
        df_metrics.loc[best_model_ind, 'RMSE']))
    return df_metrics, best_model_ind

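def file_name_get(item, spe_file):
    # Collect files in the top level of `spe_file` whose names contain `item`.
    for root, dirs, files in os.walk(spe_file):
        # Join with `root` so pd.read_csv works from any working directory.
        return [os.path.join(root, name) for name in files if item in name]
    # `spe_file` does not exist, so os.walk yielded nothing.
    return []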


if __name__=='__main__':
    data_file = 'data'
    files = file_name_get('data', data_file)
    for file in files:
        df = pd.read_csv(file)
        df_metrics, best_model_ind = model_train(df)

Error message or code output

100%|███| 50/50 [00:01<00:00, 31.63it/s, epoch=194/10000, avg_epoch_loss=-.0183]
100%|███| 50/50 [00:01<00:00, 31.55it/s, epoch=195/10000, avg_epoch_loss=0.0884]
WARNING:root:Serializing RepresentableBlockPredictor instances does not save the prediction network structure in a backwards-compatible manner. Be careful not to use this method in production.
Running evaluation: 100%|████████████████████| 336/336 [00:02<00:00, 112.13it/s]

Train process 0 ,Epochs 10000, Batch_size: 32, Learning_rate: 0.01, Num_cells: 40, Num_layers: 4, cell_type: lstm
0%| | 0/50 [00:00<?, ?it/s][06:25:22] src/engine/threaded_engine_perdevice.cc:101: Ignore CUDA Error [06:25:22] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess: CUDA: initialization error
Stack trace:
[bt] (0) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6b8b5b) [0x7ff3de97fb5b]
[bt] (1) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37ab842) [0x7ff3e1a72842]
[bt] (2) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37ceece) [0x7ff3e1a95ece]
[bt] (3) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c19d1) [0x7ff3e1a889d1]
[bt] (4) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b74a1) [0x7ff3e1a7e4a1]
[bt] (5) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b83f4) [0x7ff3e1a7f3f4]
[bt] (6) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x3c2) [0x7ff3e1cada42]
[bt] (7) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6bc30a) [0x7ff3de98330a]
[bt] (8) /home/cjk/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArrayFree+0x54) [0x7ff3e19e89c4]

Environment

Operating system: Ubuntu 18.04.2
Python version: Python 3.7.4
GluonTS version: the code released in #898
MXNet version: mxnet-cu101 1.6.0
CPU cores: 14
GPU information: 3b:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 PCIe] (rev a1) b1:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 PCIe] (rev a1)

About this issue

State: closed
Created 4 years ago
Comments: 22 (11 by maintainers)

Most upvoted comments

@k-user as a side note, maybe unrelated: it seems like you’re training a model on a single time series, in which case multiprocessing won’t help you, so you might want to turn that off for that specific example.

You can check whether that’s what’s causing the problem by doubling the size of the dataset that you’re using:
data = common.ListDataset(
    [{"start": df.index[0], "target": df.monitor_value[:]}] * 2,
    freq='H'
)
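
A minimal sketch of the side note above: deciding whether to request worker processes at all, based on how many series the dataset holds. The helper `choose_num_workers` is hypothetical and not part of GluonTS. It assumes that `estimator.train` accepts `num_workers=None` to mean "no worker processes" in the version being used.

```python
from typing import Optional, Sequence


def choose_num_workers(entries: Sequence[dict], requested: int) -> Optional[int]:
    """Hypothetical helper (not a GluonTS API).

    Returns None when multiprocessing is unlikely to help, i.e. when fewer
    than two workers are requested or the dataset has fewer than two series.
    Otherwise caps the worker count at the number of series.
    """
    if requested < 2 or len(entries) < 2:
        return None
    return min(requested, len(entries))


# A dataset with a single series: skip worker processes.
single = [{"start": "2020-01-01", "target": [1.0, 2.0, 3.0]}]
print(choose_num_workers(single, requested=2))

# The same series duplicated, as in the check suggested above.
print(choose_num_workers(single * 2, requested=2))
```

The result could then be passed as `estimator.train(training_data=data, num_workers=choose_num_workers(entries, 2))`, where `entries` is the list handed to `common.ListDataset`.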

That was helpful. Thanks!

But another issue came up:

import multiprocessing
multiprocessing.set_start_method("spawn", force=True)

When I add this code, the Evaluator raises an error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/cjk/anaconda3/envs/gluonts/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/cjk/anaconda3/envs/gluonts/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/cjk/anaconda3/envs/gluonts/lib/python3.6/site-packages/gluonts/evaluation/_base.py", line 833, in _worker_fun
    ), "Something went wrong with the worker initialization."
AssertionError: Something went wrong with the worker initialization.
"""

k-user on Nov 12, 2020

Hey @k-user, thank you for the report. When I run your code, I get the CUDA initialization error that we discuss in #1054.

Could you try putting:
import multiprocessing
multiprocessing.set_start_method("spawn", force=True)
at the beginning of your script? For me, your code works fine using this.

Let me know if that resolves your issue.
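
For reference, a minimal sketch of where that call could sit in a script. The function names `use_spawn_start_method` and `main` are illustrative only, and the training step is left as a placeholder. The `if __name__ == "__main__":` guard matters here because processes started with "spawn" re-import the main module.

```python
import multiprocessing


def use_spawn_start_method() -> str:
    """Switch worker creation to "spawn" and return the active start method."""
    # force=True overrides a start method that was already set earlier.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


def main() -> None:
    method = use_spawn_start_method()
    print(f"start method: {method}")
    # Placeholder: build the dataset and call estimator.train(...) here,
    # after the start method has been set.


if __name__ == "__main__":
    main()
```

Setting the start method before any GPU context or worker pool is created keeps child processes from inheriting state from the parent process.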

OK, I will try it and report the result here. Thanks!

k-user on Nov 9, 2020

I wrote a small demo to test it, but it still gave me an error. Could it be that there is not enough shared memory, because the shared memory allocated each time is too large? I don’t know.

This is the data:
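
One way to test the shared-memory hypothesis is to look at how much space is free before and during training. This is a minimal sketch that assumes a Linux system where shared memory is backed by the filesystem mounted at /dev/shm. Whether the data loader workers actually use that mount in this setup is also an assumption. The helper name `shared_memory_free_bytes` is hypothetical.

```python
import os
import shutil
from typing import Optional


def shared_memory_free_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        # No such mount on this system (for example, not Linux).
        return None
    return shutil.disk_usage(path).free


free = shared_memory_free_bytes()
if free is None:
    print("no /dev/shm on this system")
else:
    print(f"free shared memory: {free / 2**20:.1f} MiB")
```

If the free space drops toward zero while the workers are running, that would support the hypothesis. If it stays roughly constant, the cause is probably elsewhere.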

deepar_test.zip

This is my code:

import pandas as pd
from gluonts.dataset import common
from gluonts.model import deepar
from gluonts.trainer import Trainer

df = pd.read_csv('deepar_test.csv')
df['monitor_time'] = pd.to_datetime(df['monitor_time'])
df = df.set_index('monitor_time')
data = common.ListDataset([{"start": df.index[0],
                                "target": df.Measured[:]}],
                              freq='H')

estimator = deepar.DeepAREstimator( prediction_length=24,
                                                                num_parallel_samples=100,
                                                                freq="H",
                                                                trainer=Trainer(ctx="gpu", epochs=200, learning_rate=1e-3))
predictor = estimator.train(training_data=data, num_workers=2)

Error log:

k-user on Nov 3, 2020

Have you tried using MXNet version: mxnet-cu92==1.6.0, or do you have CUDA 10.1 installed?

Thanks @AaronSpieler. My CUDA version is 10.1; I paid attention to that when installing MXNet.

k-user on Jul 28, 2020