pachyderm: pachd OOM killed when submitting multiple files in parallel

Pachd is being OOM killed when submitting multiple files in parallel, even though the size of the files is much less than the memory requested.

Here is the pachd pod description:

Containers:
  pachd:
    Container ID:   docker://b2f29d5d9f12fb6d481f1e0a8056804aad9a90012037074c6b12b7fde42a2715
    Image:          pachyderm/pachd:1.9.11
    Image ID:       docker-pullable://pachyderm/pachd@sha256:6f6766741d88d9c57701e7df75f1eeaed0f3802f12e2fbe3c643127efd9eceea
    Ports:          650/TCP, 651/TCP, 652/TCP, 600/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 06 Feb 2020 10:05:06 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 06 Feb 2020 09:56:00 +0100
      Finished:     Thu, 06 Feb 2020 10:02:15 +0100
    Ready:          True
    Restart Count:  6
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      TIMEOUT:                                        1m
      EXPOSE_OBJECT_API:                              false
      PACH_ROOT:                                      /pach
      NUM_SHARDS:                                     16
      STORAGE_UPLOAD_CONCURRENCY_LIMIT:               100
      STORAGE_BACKEND:                                MINIO
      STORAGE_HOST_PATH:                              
      GOOGLE_BUCKET:                                  <set to the key 'google-bucket' in secret 'pachyderm-storage-secret'>        Optional: true
      GOOGLE_CRED:                                    <set to the key 'google-cred' in secret 'pachyderm-storage-secret'>          Optional: true
      AMAZON_BUCKET:                                  <set to the key 'amazon-bucket' in secret 'pachyderm-storage-secret'>        Optional: true
      AMAZON_DISTRIBUTION:                            <set to the key 'amazon-distribution' in secret 'pachyderm-storage-secret'>  Optional: true
      AMAZON_ID:                                      <set to the key 'amazon-id' in secret 'pachyderm-storage-secret'>            Optional: true
      AMAZON_SECRET:                                  <set to the key 'amazon-secret' in secret 'pachyderm-storage-secret'>        Optional: true
      AMAZON_REGION:                                  <set to the key 'amazon-region' in secret 'pachyderm-storage-secret'>        Optional: true
      AMAZON_TOKEN:                                   <set to the key 'amazon-token' in secret 'pachyderm-storage-secret'>         Optional: true
      MICROSOFT_CONTAINER:                            <set to the key 'microsoft-container' in secret 'pachyderm-storage-secret'>  Optional: true
      MICROSOFT_ID:                                   <set to the key 'microsoft-id' in secret 'pachyderm-storage-secret'>         Optional: true
      MICROSOFT_SECRET:                               <set to the key 'microsoft-secret' in secret 'pachyderm-storage-secret'>     Optional: true
      MINIO_ENDPOINT:                                 <set to the key 'minio-endpoint' in secret 'pachyderm-storage-secret'>       Optional: true
      MINIO_BUCKET:                                   <set to the key 'minio-bucket' in secret 'pachyderm-storage-secret'>         Optional: true
      MINIO_SECURE:                                   <set to the key 'minio-secure' in secret 'pachyderm-storage-secret'>         Optional: true
      MINIO_ID:                                       <set to the key 'minio-id' in secret 'pachyderm-storage-secret'>             Optional: true
      MINIO_SECRET:                                   <set to the key 'minio-secret' in secret 'pachyderm-storage-secret'>         Optional: true
      MINIO_SIGNATURE:                                <set to the key 'minio-signature' in secret 'pachyderm-storage-secret'>      Optional: true
      PACHD_POD_NAMESPACE:                            metakube (v1:metadata.namespace)
      WORKER_IMAGE:                                   pachyderm/worker:1.9.11
      WORKER_SIDECAR_IMAGE:                           pachyderm/pachd:1.9.11
      WORKER_IMAGE_PULL_POLICY:                       IfNotPresent
      PACHD_VERSION:                                  1.9.11
      METRICS:                                        true
      LOG_LEVEL:                                      error
      BLOCK_CACHE_BYTES:                              100Mi
      PACHYDERM_AUTHENTICATION_DISABLED_FOR_TESTING:  false
    Mounts:
      /pach from pachdvol (rw)
      /pachyderm-storage-secret from pachyderm-storage-secret (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from metakube-pachyderm-token-ddz8q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  pachdvol:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  pachyderm-storage-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pachyderm-storage-secret
    Optional:    false
  metakube-pachyderm-token-ddz8q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metakube-pachyderm-token-ddz8q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
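
The "Last State" block above is the part that confirms the OOM kill. As a rough sketch, the same field can be read directly instead of scanning the full description. The `KUBECTL` variable and the `last_termination_reason` function are hypothetical names introduced here, the pod name has to be supplied by the caller, and the container name `pachd` is taken from the description above.

```shell
# KUBECTL is a hypothetical override (not a kubectl feature) so the
# function can be dry-run against a stub; it defaults to the real CLI.
KUBECTL="${KUBECTL:-kubectl}"

# last_termination_reason NAMESPACE POD
# Prints why the pachd container in POD last terminated, e.g. OOMKilled.
last_termination_reason() {
  namespace="$1"
  pod="$2"
  # The JSONPath filter selects the container named "pachd" by name
  # rather than by position in the status list.
  "$KUBECTL" get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="pachd")].lastState.terminated.reason}'
}
```

With the namespace shown above, this would be called as `last_termination_reason metakube <pachd-pod-name>`, where the pod name is whatever `kubectl get pods -n metakube` lists for pachd.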

Then, when running

# smallFile.txt size is 1743 B (far below the 1Gi memory limit)
pachctl start commit input@master
for I in {0..5}; do pachctl put file input@master:/$I -f smallFile.txt & done
pachctl finish commit input@master

or even "pachctl put file input@master:/ -i list.txt", I get a "transport is closing" error because pachd has been OOM killed.
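
One detail of the loop above: each put is backgrounded with `&`, so `pachctl finish commit` can run while uploads are still in flight. A minimal sketch that waits for the background jobs first is below. It assumes the same repo, branch, and file as above; `PACHCTL` and `put_files_in_parallel` are hypothetical names introduced here. It does not address the OOM itself, it only removes that race from the reproduction.

```shell
# PACHCTL is a hypothetical override (not a pachctl feature) so the
# function can be dry-run against a stub; it defaults to the real CLI.
PACHCTL="${PACHCTL:-pachctl}"

# put_files_in_parallel REPO@BRANCH SOURCE_FILE COUNT
# Uploads SOURCE_FILE COUNT times as /0, /1, ... inside one open commit.
put_files_in_parallel() {
  target="$1"
  src="$2"
  count="$3"

  "$PACHCTL" start commit "$target" || return 1

  i=0
  while [ "$i" -lt "$count" ]; do
    # Each upload runs in the background, as in the original loop.
    "$PACHCTL" put file "$target:/$i" -f "$src" &
    i=$((i + 1))
  done

  # Block until every background upload has exited before finishing.
  wait

  "$PACHCTL" finish commit "$target"
}
```

Calling `put_files_in_parallel input@master smallFile.txt 6` mirrors the `{0..5}` loop.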

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Most upvoted comments

@sinonkt I just reproduced this with the minio storage backend, which looks like it may be the problem. Can you please change your storage backend to use the Amazon driver instead and see if you still get the issue? The Amazon driver should work fine with minio; its only limitation is that it doesn’t support S3V2 signatures. You can even replace the storage backend on your running version of Pachyderm. I can guide you through this; open a ticket at pachyderm.zendesk.com or get on the #users channel.

@marcadella Can you also change to using the Amazon storage backend? I can, likewise, guide you through converting if you open a ticket or get on the channel.

@brycemcanally @jdoliner Steps to reproduce.

  1. Deploy minio
  2. Deploy pachyderm 1.10.1 using the minio driver, not the Amazon one. I didn’t test 1.10.5, but the bug might be present there, as well.
  3. Run this script
# Build a directory of 130 random 1 MiB files plus one empty file.
rm -rf dir-with-empty-file
mkdir dir-with-empty-file
for file in $(seq 1 130); do dd if=/dev/urandom bs=1048576 count=1 > dir-with-empty-file/$file.bin; done
touch dir-with-empty-file/empty1.bin

# Recreate the repo and upload the directory recursively.
pachctl delete repo test
pachctl create repo test
pachctl put file test@master -r -f "$(pwd)/dir-with-empty-file/"
  4. You’ll see it fail with an EOF and pachd will restart with an oomkill.
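
The directory name hints that the empty file may matter, but the steps above do not confirm it. One way to check is to generate the same fixture with and without the empty file and upload each. A sketch, where `make_fixture` and its arguments are hypothetical names introduced here:

```shell
# make_fixture DIR COUNT WITH_EMPTY
# Creates DIR holding COUNT files of 1 MiB random data (1.bin .. COUNT.bin)
# and adds a zero-byte empty1.bin when WITH_EMPTY is "yes".
make_fixture() {
  dir="$1"
  count="$2"
  with_empty="$3"

  # Start from a clean directory, as the original script does.
  rm -rf "$dir"
  mkdir -p "$dir" || return 1

  i=1
  while [ "$i" -le "$count" ]; do
    dd if=/dev/urandom of="$dir/$i.bin" bs=1048576 count=1 2>/dev/null || return 1
    i=$((i + 1))
  done

  if [ "$with_empty" = "yes" ]; then
    # Truncating redirect creates the zero-byte file.
    : > "$dir/empty1.bin"
  fi
}
```

For example, `make_fixture dir-with-empty-file 130 yes` and `make_fixture dir-without-empty-file 130 no` produce the two variants; running the same `pachctl put file test@master -r -f` command against each would show whether the empty file changes the outcome.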

I’ve put pachd logs, heap profile, and debug dump in gdrive since it’s pretty big,