katib: Experiment stuck due to hitting `Suggestion` custom resource size limits
/kind bug
What steps did you take and what happened:
Submitting a large experiment (one that produces a large number of trials; in this case about 14,500, from 4 hyperparameters with 10 or 11 values each) causes the `Suggestion` custom resource to reach the size limit that Kubernetes imposes on custom resources, because all suggestions are stored in this single resource. As a result, the Katib controller outputs the following error when it tries to update the `Suggestion` custom resource: `Request entity too large`, and the experiment cannot progress. This issue seems to describe the exact problem.
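To give a rough sense of scale, here is a minimal sketch that serializes one entry per grid point and compares the result to an assumed limit. The 1.5 MiB figure (etcd's default request size limit), the parameter names, and the entry layout are assumptions for illustration; the layout is only loosely modeled on a `Suggestion` status and is not the exact Katib schema.

```python
import itertools
import json

# Assumption: the effective limit is about 1.5 MiB (etcd's default request
# size limit); the real limit depends on how the cluster is configured.
ASSUMED_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def build_suggestions(grid):
    """Build one suggestion entry per point of the grid.

    The entry layout is only loosely modeled on a Suggestion status and is
    not the exact Katib schema.
    """
    names = list(grid)
    suggestions = []
    for index, combo in enumerate(itertools.product(*grid.values())):
        suggestions.append({
            "name": f"trial-{index}",
            "parameterAssignments": [
                {"name": name, "value": str(value)}
                for name, value in zip(names, combo)
            ],
        })
    return suggestions


def serialized_size(suggestions):
    """Return the size in bytes of the compact JSON encoding."""
    body = {"status": {"suggestions": suggestions}}
    return len(json.dumps(body, separators=(",", ":")).encode("utf-8"))


# Hypothetical search space: 4 hyperparameters with 11 values each.
grid = {f"param{i}": list(range(11)) for i in range(4)}
suggestions = build_suggestions(grid)
size = serialized_size(suggestions)

print(f"{len(suggestions)} suggestions, {size} bytes")
print("exceeds assumed limit:", size > ASSUMED_LIMIT_BYTES)
```

Even with short, made-up parameter names, a grid of this size serializes to well over the assumed limit, which is consistent with the error above.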
Argo Workflows seems to have encountered the same problem (described here) and solved it by allowing 1) compression of the data stored in the status field of the custom resource and 2) storage of the information under the status field in a relational database, as described here.
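A minimal sketch of what option 1 could look like, assuming gzip followed by base64 so that the result can be stored as a single string field. This is not Katib or Argo code; the function names and the sample status layout are hypothetical.

```python
import base64
import gzip
import json


def compress_status(status):
    """Encode a status dict as a base64 string of its gzipped JSON."""
    raw = json.dumps(status, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_status(blob):
    """Reverse compress_status and return the original dict."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))


# Hypothetical status with many similar entries, as in a large grid search.
status = {
    "suggestions": [
        {
            "name": f"trial-{i}",
            "parameterAssignments": [{"name": "lr", "value": str(i % 11)}],
        }
        for i in range(5000)
    ]
}

raw_size = len(json.dumps(status, separators=(",", ":")).encode("utf-8"))
blob = compress_status(status)
print(f"raw: {raw_size} bytes, compressed and encoded: {len(blob)} bytes")
```

Suggestion entries are highly repetitive, so they tend to compress well, but compression only raises the ceiling. A large enough search space would still need something like option 2.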
What did you expect to happen: I expected Katib to be able to handle search spaces of arbitrary size.
Anything else you would like to add: A workaround would be to manually split the experiment into smaller subexperiments to circumvent the size limits of custom resources. Ideally, this is solved by following a similar approach as Argo does for their Workflow custom resources.
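The manual workaround can be sketched as follows: partition the values of one hyperparameter into chunks and create one sub-experiment per chunk, so that each `Suggestion` resource only holds a fraction of the trials. The grid, the parameter names, and the chunk size are hypothetical, and creating the actual Experiment resources is left out.

```python
import math


def split_grid(grid, split_on, chunk_size):
    """Yield sub-grids that together cover the full grid exactly once."""
    values = grid[split_on]
    for start in range(0, len(values), chunk_size):
        sub_grid = dict(grid)
        sub_grid[split_on] = values[start:start + chunk_size]
        yield sub_grid


def trial_count(grid):
    """Number of grid points, i.e. trials, in a grid."""
    return math.prod(len(values) for values in grid.values())


# Hypothetical search space: 4 hyperparameters with 11 values each.
grid = {f"param{i}": list(range(11)) for i in range(4)}
sub_grids = list(split_grid(grid, split_on="param0", chunk_size=2))

for index, sub_grid in enumerate(sub_grids):
    print(f"sub-experiment {index}: {trial_count(sub_grid)} trials")
```

Splitting on a single parameter keeps every sub-experiment a plain grid search, at the cost of having to merge the results across sub-experiments afterwards.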
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 4
- Comments: 35 (20 by maintainers)
Yes, please see below for the experiment YAML as well as the other dependencies needed to reproduce the experiment. I create a configmap with `k create configmap script --from-file=run.py=run.py`, based on a `run.py` file that mocks my actual implementation (this also results in the same issues). The configmap then gets mounted in the main container of the experiment, which takes in the parameters and produces a value for the objective of interest. I have also attached the `run.py` file below.
Experiment
The mock implementation
run.py
EDIT: and to give you an idea of the error messages arising from the `katib-controller` (this is a slightly different experiment, but the errors are identical)
@robertzsun-dev Thanks for sharing.
@andreyvelich Agree.
I'm happy to participate in the discussion on Google Docs, although my bandwidth for Katib is limited since I'm focusing on distributed training and job scheduling this quarter.
Good point. We may need to consider a cleaner architecture for the stable Katib version (v1).