pipelines: [BUG] Downloading archived artifacts through UI truncates output file

What steps did you take:

Hi everyone. We’ve run into a strange problem where artifacts sometimes cannot be opened after being downloaded through the Pipelines UI. One step of our pipeline produces a parquet file, which opens properly when we download the tgz directly from the Minio UI and extract it manually with tar -xzvf <artifact.tar.gz>.

What happened:

However, when we click the artifact link in the Pipelines UI, the browser “successfully” (no exception or bad response) downloads a file that is significantly smaller than the original and cannot be read by pandas.

What did you expect to happen:

Downloading artifacts from the Pipelines UI should yield the correct file, with its original size and content.

How to reproduce:

Put any tgz artifact file into Minio, then try to download it via the backend’s /artifacts/get endpoint. Please let me know if you can’t reproduce it. I’m concerned that I couldn’t find any similar issue filed since version 0.5.0, so it may be some outlandish bug in our environment.

How did you deploy Kubeflow Pipelines (KFP)?

Private AWS cluster

KFP version: 0.5.0

KFP SDK version: 0.5.1

Anything else you would like to add:

I’ve been chasing this bug for hours and it’s still quite unclear to me. Maybe this information can help figure out the root of the problem:

  0. parquet files (and the other testing file mentioned in point 5) are written inside the component with (in case the engine and compression matter)
data_frame.to_parquet(args.output_path, engine='fastparquet', compression='gzip')
  1. I could see this problem for several tgz files, but for some reason it seems hard to reproduce when the archive is small (in my experience, less than 20 MB).

  2. The problem does not occur for the same files un-archived; I can download everything, even though those files are larger than their archives.

  3. The size of the downloaded file is almost random, but when I download it a few times in a row, two downloads can end up the same size.

  4. I meticulously tried to reproduce the problem for one of our artifacts and found that every incompletely downloaded (and backend-extracted) file is a prefix of the original file. Why do I think so? If the downloaded file is K bytes long, I copy the first K bytes of the original into a third file (head -c K original_file > third_file), and the diff CLI tool considers the two binary files (downloaded and “third”, of equal size) identical in content.

  5. Since I can’t share our artifact file in case you need it for investigation, and in order to be sure it’s not simply our mistake, I took a random parquet file from Kaggle, read it in IPython, and re-wrote it to another file with the options from point 0. The problem persists; however, for some reason point 4 does not (maybe it depends on some content format, I’m not familiar with pandas and parquet files).

  6. I guess there might be some implicit problem with the streaming un-archive modules in Node.js. My main argument here is that non-archived files seem to download smoothly. I looked into the pod’s logs, but there are only the messages below

GET /pipeline/artifacts/get?source=minio&bucket=mlpipeline&key=artifacts ...
Getting storage artifact at: minio: mlpipeline/artifacts/...
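
The head -c / diff check from point 4 can also be scripted. A minimal Python sketch (the function name and file paths are hypothetical, just for illustration):

```python
# Scripted version of the point-4 check: verify that a truncated
# download is a byte-for-byte prefix of the original artifact.
# Equivalent to: head -c K original > third && diff downloaded third

def is_prefix(downloaded_path: str, original_path: str) -> bool:
    """True iff the downloaded file equals the first K bytes of the
    original, where K is the downloaded file's size."""
    with open(downloaded_path, "rb") as f:
        downloaded = f.read()
    with open(original_path, "rb") as f:
        # Read only as many bytes of the original as were downloaded.
        original_head = f.read(len(downloaded))
    return downloaded == original_head
```

If this returns True for every truncated download, the bytes that do arrive are correct and only the tail is missing, which points at the streaming path rather than at data corruption.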

/kind bug /area frontend

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (18 by maintainers)

Most upvoted comments

I can’t figure out the issue yet; I might need to do even more testing:

  • purely functional test - I used the function to retrieve the tarball from Minio - no issue
  • lightweight service - I constructed a simple Express server over the function (very similar to the actual server) - no issue
  • actual local test (no containerization, no Istio, no k8s) - I built and ran the actual Node server on my local machine (configured to read from the Minio service) - no issue
  • full deployment in k8s - early termination of the stream (but I checked that the data is right, just terminated early)
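
The “terminated early but data is right” symptom can be confirmed from the client side by comparing bytes received against the expected length (e.g. from a Content-Length header) while draining the stream. A sketch in Python; `drain_stream` is a hypothetical helper, and any file-like object can stand in for the HTTP response body:

```python
import io

def drain_stream(stream, expected_len, chunk_size=64 * 1024):
    """Read `stream` to EOF in chunks; return (data, truncated).

    `truncated` is True when the stream ended before `expected_len`
    bytes arrived -- the symptom seen in the k8s deployment, where
    the prefix is correct but the tail is missing.
    """
    chunks = []
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks), received < expected_len
```

Running this against the /artifacts/get endpoint a few times would distinguish a short read (truncated=True, prefix intact) from genuine corruption.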

I managed to reproduce this. But I am still investigating the cause.

But I was unable to figure out where. I suspect there might be some “special” char in the binary, or that it is an Express or middleware issue.

This is what I did so far:

  • uploaded the tar to minio
  • retrieved it with the API and got the same bug as you described. Incomplete file download.

Then I

  • did everything exactly the same, except with no Express and no middleware
    • i.e. I used port-forwarding and called the getObjectStream function directly (this function streams from Minio, deflates, and untars the artifact), piping the result to a file. I do not encounter the issue this way; I get the full file content.
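
For reference, the stream-from-Minio → deflate → untar path that getObjectStream implements in Node can be mirrored in Python with tarfile’s streaming mode. This is an analogue for experimentation, not the actual frontend code, and it assumes the artifact tarball contains a single regular file:

```python
import tarfile

def extract_first_member(tgz_stream):
    """Stream-decompress a .tar.gz from a non-seekable file-like
    object and return the bytes of its first regular-file member --
    roughly what the frontend does with the Minio object stream."""
    # "r|gz" = streaming read (no seeking) with gzip decompression.
    with tarfile.open(fileobj=tgz_stream, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode, only the member currently being
                # iterated can be extracted.
                return tar.extractfile(member).read()
    raise ValueError("no regular-file member found in archive")
```

Comparing the output of such a direct pipe against the bytes served through Express would isolate whether the truncation happens in the untar layer or in the HTTP response path.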

So I am suspecting the issue is at the Express server layer: either the server terminates prematurely because of an EOF or some null characters, or there is some other issue. Needs more investigation.

I somehow missed “parquet file, which can be opened properly when we download tgz directly from Minio UI and extract it manually by tar -xzvf <artifact.tar.gz>.”.

I’ve assigned the people who know the most about that part of code.