pipelines: External volume deletion stuck in infinite loop

What steps did you take:

I ran the MNIST example to train a model and predict results. Here's the code:

import kfp
import sys
import kfp.dsl as dsl
import kfp.components as comp


def train(data_path, model_file):
    '''
        This definition contains MNIST training steps:
            * Data Import
            * Data Preprocessing
            * Keras model creation
            * Model optimizer : adam
            * Training with specified epoch
            * Print test accuracy
            * Save the model
    '''
    import pickle
    import tensorflow as tf
    from tensorflow.python import keras

    # Load dataset
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

    # Normalize dataset
    train_images = train_images / 255.0
    test_images = test_images / 255.0

    # Define the model using keras
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10)

    ])

    model.compile(optimizer='adam', loss=keras.losses.SparseCategoricalCrossentropy(
        from_logits=True), metrics=['accuracy'])
    model.fit(train_images, train_labels, epochs=10)
    test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
    print('Test accuracy', test_acc)

    # Important: this specifies the path inside the Docker container where our model will be saved
    model.save(f'{data_path}/{model_file}')

    # Save test data pickle file
    with open(f'{data_path}/test_data', 'wb') as f:
        pickle.dump((test_images, test_labels), f)


def predict(data_path, model_file, image_number):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import pickle

    import tensorflow as tf
    from tensorflow import keras

    import numpy as np
    
    # Load the saved Keras model
    model = keras.models.load_model(f'{data_path}/{model_file}')

    # Load and unpack the test_data
    with open(f'{data_path}/test_data','rb') as f:
        test_data = pickle.load(f)
    # Separate the test_images from the test_labels.
    test_images, test_labels = test_data
    # Define the class names.
    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

    # Add a Softmax layer so the outputs are probabilities
    probability_model = tf.keras.Sequential([model, 
                                            tf.keras.layers.Softmax()])

    # See https://github.com/kubeflow/pipelines/issues/2320 for explanation on this line.
    image_number = int(image_number)

    # Grab an image from the test dataset.
    img = test_images[image_number]

    # Add the image to a batch where it is the only member.
    img = (np.expand_dims(img,0))

    # Predict the label of the image.
    predictions = probability_model.predict(img)

    # Take the prediction with the highest probability
    prediction = np.argmax(predictions[0])

    # Retrieve the true label of the image from the test labels.
    true_label = test_labels[image_number]
    
    class_prediction = class_names[prediction]
    confidence = 100*np.max(predictions)
    actual = class_names[true_label]
    
    
    with open(f'{data_path}/result.txt', 'w') as result:
        result.write(" Prediction: {} | Confidence: {:2.0f}% | Actual: {}".format(class_prediction,
                                                                        confidence,
                                                                        actual))
    
    print('Prediction has been saved successfully!')

# Glue the training and prediction functions to the Docker base image (create container ops)
train_op = comp.func_to_container_op(train, base_image='tensorflow/tensorflow:latest-gpu-py3')
predict_op = comp.func_to_container_op(predict, base_image='tensorflow/tensorflow:latest-gpu-py3')


# define pipeline metadata like name, description etc.
@dsl.pipeline(
    name='MNIST Pipeline',
    description='Pipeline that trains MNIST models on GPU'
)
# define the pipeline; it creates a PersistentVolumeClaim that the steps will share
def mnist_container_pipeline(data_path='/mnt', model_file='mnist_model.h5'):
    vop = dsl.VolumeOp(
        name='create_volume',
        resource_name='data-volume',
        size='1Gi',
        modes=["ReadWriteOnce"]
        #modes=dsl.VOLUME_MODE_RWM
    )

# We have already created a volume and a glued component (Docker image + Python script). The glued
# component needs to read from and write to the volume, so attach the volume to it.
    mnist_training_container = train_op(data_path, model_file) \
        .add_pvolumes({data_path: vop.volume})

# Create MNIST prediction component.
    mnist_predict_container = predict_op(data_path, model_file, image_number=0) \
                                    .add_pvolumes({data_path: mnist_training_container.pvolume})

    

# Print the result of the prediction
    mnist_result_container = dsl.ContainerOp(
        name="print_prediction",
        image='library/bash:4.4.23',
        pvolumes={data_path: mnist_predict_container.pvolume},
        arguments=['cat', f'{data_path}/result.txt']
    )

    delete_vop = dsl.ResourceOp(
        name="delete-volume",
        k8s_resource=vop.k8s_resource,
        action="delete"
    ).after(mnist_result_container)

# Names used when compiling and running the pipeline
pipeline_func = mnist_container_pipeline
#experiment_name = 'fashion_mnist_kubeflow_training2'
experiment_name = sys.argv[0]
run_name = pipeline_func.__name__ + ' run'


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(pipeline_func, __file__ + '.yaml')

What happened:

The experiment ran, but the volume attached to it does not get deleted.


What did you expect to happen:

The attached volume should get deleted.

Environment:

How did you deploy Kubeflow Pipelines (KFP)? By compiling the above Python script to a YAML file and uploading that YAML file through the Kubeflow dashboard.
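
For reference, the run can also be submitted programmatically instead of uploading the YAML by hand. A minimal sketch, assuming the SDK can reach the KFP API server (the host URL below is a placeholder):

import kfp

# Placeholder endpoint; replace with the actual KFP API server address, or omit the
# host when running from an in-cluster notebook with a preconfigured client.
client = kfp.Client(host='http://localhost:8080')
client.create_run_from_pipeline_func(
    mnist_container_pipeline,
    arguments={'data_path': '/mnt', 'model_file': 'mnist_model.h5'},
    run_name=run_name,
    experiment_name=experiment_name,
)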

KFP version: 1.2

KFP SDK version:

kfp                                1.1.2   
kfp-pipeline-spec                  0.1.2   
kfp-server-api                     1.1.2rc1

Anything else you would like to add:


How do I delete this volume?

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

The bug is still present with KF 1.3 and minikube. Even deleting the PVC manually from the command line with kubectl delete PersistentVolumeClaim myvolume -n myuser --wait=false --output name causes it to get stuck in the Terminating state without actually being deleted, e.g. Status: Terminating (lasts 23m).
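
A PVC usually stays in Terminating while its kubernetes.io/pvc-protection finalizer is still present, i.e. while some pod still references the volume. As a last-resort workaround (not a fix for the pipeline itself), the finalizers can be cleared once no pod uses the claim. A rough sketch with the Kubernetes Python client, where the PVC name and namespace are placeholders:

from kubernetes import client, config

# Assumes a local kubeconfig pointing at the cluster.
config.load_kube_config()
core = client.CoreV1Api()

# The pipeline's PVC name is the VolumeOp's resource_name plus a generated workflow
# prefix, so list the claims first to find the exact name and its current phase.
for pvc in core.list_namespaced_persistent_volume_claim('myuser').items:
    print(pvc.metadata.name, pvc.status.phase)

# Clearing the finalizers lets the stuck claim be removed.
core.patch_namespaced_persistent_volume_claim(
    name='myvolume',
    namespace='myuser',
    body={'metadata': {'finalizers': None}},
)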