neuralhydrology: Missing Data entering the training process

Hi guys!

I seem to be getting missing data in the .h5 files.

It might be because of an older version of the CAMELS GB data, but I seem to remember I had to add some extra checks to the multiple_forcing directory to find NaN data. I will dig into it.

# The problem

I ran:

ipython --pdb neuralhydrology/nh_run.py train -- --config-file configs/aws_lstm_all.yml

With the config file defined here.

I got the following error:

RuntimeError: Loss was NaN for 1 times in a row. Stopped training.

ipdb> torch.isnan(data["y"]).sum()
tensor(2994, device='cuda:0')

The NoTrainDataError seems to only be defined in the tester now - how are you catching NaNs instead?

Thanks!


Most upvoted comments

@tommylees112 Check the current master version please

That is nothing you can find on Google. The message stated that the dataloader tried to stack arrays of shape [1,1] with an array of shape [1,8]. Since it is “my” code, I know the only place where this can happen.

The problem is:

  • CAMELS GB has basins that start with the same numbers (e.g. ‘4001’, ‘40010’, ‘40011’ etc.)
  • We do allow for split training periods (e.g. you can define a list of start and end training dates), where we internally add something to the basin name.
  • When computing the per-basin std of the target (which is needed for the NSE loss), we search for all basins that start with the same string (which works in CAMELS US) and compute the target std across the split periods together.
  • There was an issue with the case of multiple basins that start with the same name (see the sketch after this list).
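To make the collision concrete, here is a minimal, self-contained sketch. The basin IDs and the period-suffix naming are purely illustrative, not the actual neuralhydrology internals:

```python
# Minimal illustration of the prefix collision described above. The
# "_period0"/"_period1" suffix convention is an assumption for this example.
import numpy as np

per_basin_targets = {
    "4001_period0":  np.array([1.0, 2.0]),
    "4001_period1":  np.array([3.0, 4.0]),
    "40010_period0": np.array([10.0, 20.0]),  # different basin, same prefix
}

def target_std_buggy(basin: str) -> float:
    # Prefix match: fine for CAMELS US, but in CAMELS GB '4001' also
    # matches '40010', '40011', ... and contaminates the std.
    values = np.concatenate(
        [v for k, v in per_basin_targets.items() if k.startswith(basin)]
    )
    return float(values.std())

def target_std_fixed(basin: str) -> float:
    # Match the basin id exactly, allowing only a period suffix after it.
    values = np.concatenate(
        [v for k, v in per_basin_targets.items()
         if k == basin or k.startswith(basin + "_")]
    )
    return float(values.std())

print(target_std_buggy("4001"))  # mixes in basin 40010
print(target_std_fixed("4001"))  # only the split periods of basin 4001
```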

Fix will be pushed in a couple of minutes. Just doing some additional checks.

Okay I think I found the issue. I am working on it, give me a sec

I am guessing that I should remove those basins: ['107001', '25029', '40022', '42026', '42027', '46014'] ?

No, these basins are ignored (they were ignored before as well); this is just a notification so that you are aware that these basins do not contain training data.

Okay, with your config and basin text file I get the same error. I had no problem with the full basin list (excluding the two basins from above) and the config from yesterday. I will have a look at it.

How did you speed up the preprocessing of the training data so much? That used to take significantly longer in the old repo!

We made a couple of changes. Most importantly, we no longer reshape to [number of samples, sequence length, input features] in the preprocessing but store the raw 2D array per basin (shape [time steps, features]). During training we slice single input sequences from those 2D arrays. The details are a bit tough to explain in an answer here, but if you are really interested, all of this happens in the BaseDataset. Storing the raw 2D arrays per basin instead of the 3D arrays of preprocessed input samples is also what resulted in the huge reduction in memory requirements.
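Very roughly, the idea looks like this stripped-down sketch (not the actual BaseDataset code; the class and variable names are made up for illustration):

```python
# Sketch of "store 2D arrays per basin, slice sequences on the fly".
import numpy as np
import torch
from torch.utils.data import Dataset

class SlicingDataset(Dataset):
    def __init__(self, basin_arrays: dict, seq_length: int):
        # basin_arrays: {basin_id: np.ndarray of shape [time steps, features]}
        self.seq_length = seq_length
        self.data = basin_arrays
        # One lookup entry per valid end-of-sequence time step.
        self.lookup = [
            (basin, t)
            for basin, arr in basin_arrays.items()
            for t in range(seq_length - 1, arr.shape[0])
        ]

    def __len__(self):
        return len(self.lookup)

    def __getitem__(self, idx):
        basin, t = self.lookup[idx]
        window = self.data[basin][t - self.seq_length + 1 : t + 1]
        return torch.from_numpy(window).float()

ds = SlicingDataset({"4001": np.random.rand(365, 5)}, seq_length=30)
print(len(ds), ds[0].shape)  # 336 torch.Size([30, 5])
```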

The other option is the same behaviour as before (or as when there are some missing targets, as above), where you ‘auto-magically’ exclude those basins and tell the user in a message that you are excluding them. I understand if you would rather not do this magically and instead have the user remove those basins manually from the list of basins.

No, we won’t magically exclude problematic features and let the model run on inputs other than the ones the user defined. That would be rather bad practice in my opinion, and those “warnings” are easy to overlook in the console log.

A lot of the code seems to be focused on the different time frequencies. I guess that this must be the direction of travel for the repository; I’m guessing that it’s the most useful feature in an operational setting? Very exciting!

Multi-temporal modeling is certainly one of the more recent and exciting changes we added. Having a single data loader that works on any (and multiple) frequencies at once does result in a lot of additional (more complex) code in the BaseDataset. However, for the normal user this is not really important, since this class does not have to be touched to e.g. add new data sets. We tried to do all the complicated stuff in the background and keep the interface to the user as simple as possible.

Hi Tommy,

The reason you get this error is that you are including attributes that are not defined for every basin. That is, basins 18011 and 26006 do not have an elev_mean or dpsbar, both of which you include in the camels_attributes list. Excluding those two basins from your basin file (I assume it is this one) or removing the two attributes will resolve your problem.

We do perform some checks on the input data but not on the CAMELS attributes. We check for NaNs here https://github.com/neuralhydrology/neuralhydrology/blob/master/neuralhydrology/datasetzoo/basedataset.py#L389 but during training we only exclude input sequences where a) a single NaN is somewhere in the dynamic inputs, b) a NaN is in the static features (not the attributes), or c) all targets of the predict_last_n time steps are NaN. If just some of the target time steps are NaN, we filter them in the loss function.
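Conceptually, the filtering in the loss is just a NaN mask on the targets, something like this sketch (illustrative only, not the exact loss code in the repository):

```python
# Sketch of masking NaN targets inside a loss; sequences whose targets are
# all NaN are excluded earlier (case c above), so the mask is never empty here.
import torch

def masked_mse(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    mask = ~torch.isnan(y)
    return torch.mean((y_hat[mask] - y[mask]) ** 2)
```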

But I agree that we should add checks for NaN in the attributes and throw a verbose error pointing at the basins and attributes that are NaN.
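Such a check could look roughly like the following sketch (the function name and the error message format are placeholders, not an existing neuralhydrology function):

```python
# Illustrative NaN check for a basin-attributes DataFrame
# (index = basin id, columns = attribute names).
import pandas as pd

def validate_attributes(df: pd.DataFrame) -> None:
    nan_mask = df.isna()
    if nan_mask.values.any():
        offenders = [
            f"basin {basin}: {', '.join(nan_mask.columns[row])}"
            for basin, row in zip(nan_mask.index, nan_mask.values)
            if row.any()
        ]
        raise ValueError("NaN in static attributes:\n" + "\n".join(offenders))
```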

Other points, since you already worked with older versions of this code:

  • .h5 files do not exist anymore
  • the config argument zero_center_target does not exist anymore. We now have a config argument custom_normalization, and zero mean, unit variance is the default.
  • cache_validation_data: I saw that you set it to False, maybe because of an older version where the data required a lot of CPU RAM. This has changed, and the entire CAMELS data set is now just a few GB in memory. I would try setting it to True (or deleting it, since True is the default), which would speed up validation quite a lot.
  • I recommend using the NSE loss, not the MSE loss.
  • Not sure if this is your final setting, but you are setting the learning_rate to a new value at epoch 20 while only training for 15 epochs.