galaxytools: Failure running my ML workflows

I have 3 workflows that use Galaxy’s ML tools (namely Keras for neural networks). They all worked fine last time I ran them (maybe a month ago?).

These 3 workflows are used in 3 neural network tutorials that I am presenting at GCC 2021. I decided to re-run them to make sure all is good. All 3 workflows fail now. Here is the error message for the first 2 workflows:

Traceback (most recent call last):
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 491, in <module>
    targets=args.targets, fasta_path=args.fasta_path)
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 405, in main
    estimator.fit(X_train, y_train)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 911, in fit
    return super(KerasGRegressor, self)._fit(X, y, **kwargs)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 644, in _fit
    validation_data = self.validation_data

Here are the histories:

Per @anuprulez’s suggestion, I downgraded the tool versions, and the first and second workflows work now. These are the downgrades:

Create a deep learning model architecture: downgraded to 0.4.2
Create a deep learning model with an optimizer, loss function and fit parameters: downgraded to 0.4.2
Deep learning training and evaluation conduct deep training and evaluation either implicitly or explicitly: downgraded to 1.0.8.2

The third workflow still fails. BTW, it requires the most recent version of the third tool.

I started writing unit tests in galaxytools (https://github.com/kxk302/galaxytools/tree/nn_tests) so that these workflows run as part of the unit tests. They would serve as regression tests and help ensure that future changes do not break existing workflows. However, I ran into another issue: models saved to file cannot be loaded; loading them errors out. I am not sure whether this is related to the workflow error above. Here is the error message:

unzip cnn.zip
Archive: cnn.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of cnn.zip or
cnn.zip.zip, and cannot find cnn.zip.ZIP, period.
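
To narrow down the unzip error above, it may help to check the file itself before blaming the tool. The following is only a minimal sketch, assuming the saved model is expected to be an ordinary zip archive (as the name cnn.zip suggests); `describe_archive` is a hypothetical helper, not part of galaxytools or Galaxy-ML.

```python
import zipfile
from pathlib import Path


def describe_archive(path):
    """Report basic facts about a file that is expected to be a zip archive.

    Hypothetical helper for debugging only.
    """
    p = Path(path)
    with p.open("rb") as fh:
        head = fh.read(4)
    return {
        # A very small size often means the file was truncated or replaced.
        "size_bytes": p.stat().st_size,
        # Zip archives normally start with the bytes "PK".
        "starts_with_pk": head[:2] == b"PK",
        # is_zipfile() looks for the end-of-central-directory record,
        # the structure that unzip reports as missing.
        "is_zipfile": zipfile.is_zipfile(str(p)),
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_zip.py cnn.zip
    for name in sys.argv[1:]:
        print(name, describe_archive(name))
```

If `is_zipfile` comes back false, the size and the first bytes usually indicate whether the file was cut short or was never a zip archive in the first place.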

About this issue

State: open
Created 3 years ago
Comments: 22 (18 by maintainers)

Most upvoted comments

I only see “Failed to communicate with remote job server.”

That’s a job running error, you’ll want to check this with Nate, that is not a tool error.

mvdbeek on May 13, 2021

Sorry, I was only describing a general debugging process, not anything specific to the issues mentioned in this thread. From the stderr report @anuprulez provided, I think the errors could be cleared by re-cleaning the input TSVs.

qiagu on May 13, 2021
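
The thread does not say what "re-cleaning the input TSVs" involves. One possible first step, sketched below, is to scan each TSV for rows with the wrong number of columns or with non-numeric values. This assumes the inputs are tab-separated with an optional header row and that every data column should be numeric; both are assumptions, and `find_tsv_problems` is a hypothetical helper rather than anything shipped with the tools.

```python
import csv


def find_tsv_problems(path, has_header=True):
    """Return (row_number, message) pairs for rows that look malformed.

    Assumes every data column should parse as a number (an assumption
    made for this sketch, not something stated in the thread).
    """
    problems = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        expected = None
        for row_no, row in enumerate(reader, start=1):
            if expected is None:
                # The first row fixes the expected column count.
                expected = len(row)
                if has_header:
                    continue
            if len(row) != expected:
                problems.append(
                    (row_no, f"expected {expected} columns, found {len(row)}")
                )
                continue
            for col, value in enumerate(row, start=1):
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        (row_no, f"column {col} is not numeric: {value!r}")
                    )
    return problems
```

An empty result only means the file passed these two checks; it does not rule out other problems in the data.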