neuralhydrology: Missing Data entering the training process
Hi guys!
I seem to be getting missing data in the .h5 files.
It might be because of an older version of the CAMELS GB data, but I seem to remember I had to add some extra checks to find NaN data in the multiple_forcing directory. I will dig into it.
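For reference, this is a minimal sketch of the kind of pre-training check I mean — scan the per-basin forcing files for NaNs before training. The directory name and CSV layout here are just placeholders, not the actual data layout:

```python
from pathlib import Path

import pandas as pd

# assumed location of per-basin forcing files; adjust to your local CAMELS GB copy
forcing_dir = Path("data/CAMELS_GB/timeseries")

for csv_file in sorted(forcing_dir.glob("*.csv")):
    df = pd.read_csv(csv_file)
    # count NaNs per column and report only columns that actually contain NaNs
    nan_counts = df.isna().sum()
    nan_counts = nan_counts[nan_counts > 0]
    if not nan_counts.empty:
        print(f"{csv_file.name}: {nan_counts.to_dict()}")
```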
# The problem
I ran:
ipython --pdb neuralhydrology/nh_run.py train -- --config-file configs/aws_lstm_all.yml
With the config file defined here.
I got the following error:
RuntimeError: Loss was NaN for 1 times in a row. Stopped training.
ipdb> torch.isnan(data["y"]).sum()
tensor(2994, device='cuda:0')
The `NoTrainDataError` seems to only be defined in the tester now - how are you catching NaNs instead?
Thanks!
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (9 by maintainers)
@tommylees112 Check the current master version please
That is nothing you can find on Google. The message stated that the dataloader tried to stack arrays of shape [1,1] with an array of shape [1,8]. Since it is "my" code, I know the only place where this can happen.
The problem is:
A fix will be pushed in a couple of minutes. Just doing some additional checks.
Okay I think I found the issue. I am working on it, give me a sec
No, these basins are ignored (they were ignored before as well; this is just a notification so that you are aware that these basins do not contain training data).
Okay, with your config and basin text file I get the same error. I had no problem with the full basin list (excluding the two basins from above) and the config from yesterday. I will have a look at it.
We made a couple of changes. Most importantly, we no longer reshape to [number of samples, sequence length, input features] during preprocessing, but instead store the raw 2D array per basin (shape [time steps, features]). During training we slice single input sequences from those 2D arrays. The details are a bit tough to explain in an answer here, but if you are really interested, all of this happens in the BaseDataset.
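To make the idea concrete, here is a rough illustration of slicing a single training sample from a per-basin 2D array. This is only a sketch with toy shapes and made-up names (`x_basin`, `seq_length`, `get_sample`), not the actual BaseDataset code:

```python
import numpy as np

# per-basin 2D arrays: [time steps, features] for the inputs, [time steps, 1] for the target
x_basin = np.random.rand(1000, 8)   # e.g. 1000 time steps, 8 dynamic input features
y_basin = np.random.rand(1000, 1)

seq_length = 365  # length of one input sequence

def get_sample(end_idx: int):
    """Slice one [seq_length, features] input sequence ending at end_idx (inclusive)."""
    x = x_basin[end_idx - seq_length + 1 : end_idx + 1]
    y = y_basin[end_idx - seq_length + 1 : end_idx + 1]
    return x, y

# a training sample is created on the fly, so only the 2D arrays have to be kept in memory
x, y = get_sample(end_idx=500)
print(x.shape, y.shape)  # (365, 8) (365, 1)
```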
Storing the raw 2D arrays per basin, rather than the 3D arrays of preprocessed input samples, is also what resulted in the huge savings in memory requirements.

No, we won't magically exclude problematic features and let the model run on inputs other than the ones the user defined. That would be rather bad practice in my opinion, and such "warnings" are easy to overlook in the console log.
Multi-temporal modeling is certainly one of the more recent and exciting changes we added. Having a single data loader that works on any (and multiple) frequencies at once does result in a lot of additional (more complex) code in the BaseDataset. However, for the normal user this is not really important, since this class does not have to be touched to e.g. add new data sets. We tried to do all the complicated stuff in the background and keep the interface to the user as simple as possible.

Hi Tommy,
the reason you get this error is that you are including attributes that are not defined for every basin. Basins `18011` and `26006` do not have an `elev_mean` and `dpsbar`, both of which you include in the `camels_attributes` list. Excluding those two basins from your basin file (I assume it is this one) or removing the two attributes will resolve your problem.

We do perform some checks on the input data, but not on the CAMELS attributes. We check for NaNs here: https://github.com/neuralhydrology/neuralhydrology/blob/master/neuralhydrology/datasetzoo/basedataset.py#L389. During training, however, we only exclude input sequences where a) a single NaN is somewhere in the dynamic inputs, or b) a NaN is in the static features (not the attributes), or c) all targets of the `predict_last_n` time steps are NaN. If just some of the target time steps are NaN, we filter them in the loss function.

But I agree that we should add checks for NaN in the attributes and throw a verbose error pointing at the basins and attributes that are NaN.
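For illustration, such a check could look roughly like the sketch below, assuming the attributes are in a pandas DataFrame indexed by basin id. The function name, the error type, and the example data are assumptions for this sketch, not the actual neuralhydrology implementation:

```python
import pandas as pd

def check_attributes_for_nans(df: pd.DataFrame) -> None:
    """Raise a verbose error if any static attribute is NaN for any basin.

    `df` is assumed to be indexed by basin id, with one column per attribute.
    """
    nan_mask = df.isna()
    if nan_mask.values.any():
        msg_lines = []
        for basin, row in nan_mask.iterrows():
            missing = row[row].index.tolist()  # attribute names that are NaN for this basin
            if missing:
                msg_lines.append(f"Basin {basin}: missing attributes {missing}")
        raise ValueError(
            "NaN values found in static attributes:\n" + "\n".join(msg_lines)
        )

# example: basin 18011 is missing elev_mean and dpsbar
attributes = pd.DataFrame(
    {"elev_mean": [120.3, None], "dpsbar": [55.1, None], "area": [42.0, 13.7]},
    index=["12345", "18011"],
)
check_attributes_for_nans(attributes)  # raises ValueError listing basin 18011
```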
Other points, since you already worked with older versions of this code:
- `zero_center_target` does not exist anymore. We now have a config argument for `custom_normalization`, and zero mean, unit variance is the default.
- `cache_validation_data`: I saw that you set it to False, maybe because of an older version where the data required a lot of CPU RAM. This has changed, and the entire CAMELS data set now only takes a few GB in memory. I would try to set it to True (or delete it, since True is the default), which would speed up validation quite a lot. A possible config snippet is sketched below.
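A minimal sketch of the relevant lines in the .yml config, using only the key names mentioned above (check the documentation for the exact structure of the `custom_normalization` block):

```yaml
# keep validation data in memory (True is the default and speeds up validation)
cache_validation_data: True

# optional per-feature normalization overrides; without this block,
# zero mean / unit variance is used
# custom_normalization:
#   <feature name>: ...
```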