katib: Experiment stuck due to hitting `Suggestion` custom resource size limits

/kind bug

What steps did you take and what happened: Submitting a large experiment (i.e. one resulting in a large number of trials, in this case ~14,500 trials from 4 hyperparameters with 10/11 values each) causes the Suggestion custom resource to hit the size limit Kubernetes imposes on custom resources, because every suggestion is stored in this single resource. The Katib controller then logs the error Request entity too large when trying to update the Suggestion custom resource, and the experiment cannot progress. This issue seems to describe the exact problem.
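
For a rough sense of scale (the per-trial byte count below is an assumption, not a number measured from Katib): every suggested trial is kept in the Suggestion status as a list of parameter assignments, so even a modest footprint per trial multiplied by ~14,500 trials approaches etcd's default request limit of roughly 1.5 MiB, which is what typically surfaces as Request entity too large.

# Back-of-the-envelope estimate only; bytes_per_trial is a guess.
trials = 14641                     # number of trials in this experiment
bytes_per_trial = 100              # assumed size of one serialized assignment list
total_bytes = trials * bytes_per_trial
print(f"{total_bytes / 2**20:.2f} MiB")   # ~1.40 MiB from assignments alone, before names and metadata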

Argo Workflows seems to have encountered the same problem, described here, and solved it by allowing 1) compression of the data stored in the status field of the custom resource and 2) offloading of the status field to a relational database, as described here.
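
To illustrate the general idea (a sketch only, not Argo's or Katib's actual implementation): the status payload can be gzip-compressed and base64-encoded before being written back to the resource, and offloaded to a database once it is too large even after compression.

import base64, gzip, json

# Sketch: build a status-like payload and compress it before storage.
suggestions = [
    {"a": f"{1.30 + 0.01 * (i % 12):.2f}", "b": str(i % 11), "c": "1.30", "d": "0.0010"}
    for i in range(14641)
]
raw = json.dumps({"suggestions": suggestions}).encode()
packed = base64.b64encode(gzip.compress(raw))
# Compression buys headroom, but only a database offload removes the limit entirely.
print(len(raw), len(packed))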

What did you expect to happen: I expected Katib to be able to handle search spaces of arbitrary size.

Anything else you would like to add: A workaround is to manually split the experiment into smaller sub-experiments so that each one stays under the custom resource size limit. Ideally, this would be solved by following an approach similar to the one Argo uses for its Workflow custom resources.
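
A minimal sketch of that manual workaround, assuming we shard along a single categorical parameter (the chunk size and the debug-part-<n> naming are arbitrary choices for illustration, not Katib features):

# Split one parameter's feasible list into chunks; each chunk becomes its own
# smaller Experiment, so every Suggestion resource stays well under the limit.
b_values = ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000",
            "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
chunk_size = 3
for idx in range(0, len(b_values), chunk_size):
    chunk = b_values[idx:idx + chunk_size]
    # each chunk would be written into a copy of the Experiment manifest,
    # e.g. metadata.name: debug-part-<n>, with the reduced feasibleSpace list
    print(f"debug-part-{idx // chunk_size}: {chunk}")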


Impacted by this bug? Give it a šŸ‘ We prioritize the issues with the most šŸ‘

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 4
  • Comments: 35 (20 by maintainers)

Most upvoted comments

Yes, please see below for the experiment yaml as well as the other dependencies needed to reproduce it. I create a ConfigMap (k create configmap script --from-file=run.py=run.py) from a run.py file that mocks my actual implementation (and reproduces the same issue). The ConfigMap is then mounted into the main container of the experiment, which takes in the parameters and produces a value for the objective of interest. The run.py file is also attached below.

Experiment

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: debug
spec:
  objective:
    type: maximize
    goal: 500
    objectiveMetricName: cost
  algorithm:
    algorithmName: grid
  parallelTrialCount: 20
  maxTrialCount: 14641
  maxFailedTrialCount: 2000
  parameters:
    - name: a
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: b
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
    - name: c
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: d
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
  trialTemplate:
    retain: false
    primaryContainerName: training-container
    trialParameters:
      - name: a
        reference: a
        description: ""
      - name: b
        reference: b
        description: ""
      - name: c 
        reference: c
        description: ""
      - name: d 
        reference: d
        description: ""
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/python:alpine3.15
                volumeMounts:
                - name: script
                  mountPath: /app/run.py
                  subPath: run.py
                command:
                  - "python3"
                  - "/app/run.py"
                  - "${trialParameters.a}"
                  - "${trialParameters.b}"
                  - "${trialParameters.c}"
                  - "${trialParameters.d}"
            restartPolicy: Never
            volumes:
              - name: script
                configMap:
                  name: script
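
Note that the maxTrialCount of 14641 equals 11^4, i.e. roughly the full Cartesian grid over the four parameters at about 11 values each, all of which the grid algorithm eventually has to record as assignments in the single Suggestion resource.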

The mock implementation run.py

import sys
import time

# Mock objective: pretend to do some work, then report the sum of the
# hyperparameter values as the cost metric that Katib parses from stdout.
time.sleep(4)
cost = sum(float(x) for x in sys.argv[1:])
print(f"cost={cost}")

EDIT: to give you an idea of the error messages coming from the katib-controller, see below (this is a slightly different experiment, but the errors are identical):

{"level":"info","ts":1649925772.8911624,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"simulation/simulation-nr-fb","Suggestion Requests":8165,"Suggestion Count":8143}
{"level":"info","ts":1649925774.5568578,"logger":"suggestion-client","msg":"Getting suggestions","Suggestion":"simulation/simulation-nr-fb","endpoint":"simulation-nr-fb-grid.simulation:6789","Number of current request parameters":22,"Number of response parameters":22}
{"level":"info","ts":1649925775.6414711,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"simulation/simulation-nr-fb","err":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2100613 vs. 2097152)"}

@robertzsun-dev Thanks for sharing.

I think we can start a Google doc to collaborate on whether we should choose the DB approach or ConfigMaps. After that, we can convert it into one of the Katib Proposals. WDYT @tenzen-y @robertzsun-dev @johnugeorge @nielsmeima ?

@andreyvelich Agree.

I’m happy to participate in the discussion on Google Docs, although my bandwidth for Katib is limited since I’m focusing on distributed training and job scheduling this quarter.

IMO, if we use the ConfigMap approach, do we really need the Katib DB to store metrics?

Good point. We may need to consider a cleaner architecture for the stable Katib version (v1).