pachyderm: pachd OOM killed when submitting multiple files in parallel
Pachd is being OOM killed when submitting multiple files in parallel, even though the size of the files is much less than the memory requested.
Here is the pachd pod description:
Containers:
pachd:
Container ID: docker://b2f29d5d9f12fb6d481f1e0a8056804aad9a90012037074c6b12b7fde42a2715
Image: pachyderm/pachd:1.9.11
Image ID: docker-pullable://pachyderm/pachd@sha256:6f6766741d88d9c57701e7df75f1eeaed0f3802f12e2fbe3c643127efd9eceea
Ports: 650/TCP, 651/TCP, 652/TCP, 600/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP
State: Running
Started: Thu, 06 Feb 2020 10:05:06 +0100
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 06 Feb 2020 09:56:00 +0100
Finished: Thu, 06 Feb 2020 10:02:15 +0100
Ready: True
Restart Count: 6
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 1
memory: 1Gi
Environment:
TIMEOUT: 1m
EXPOSE_OBJECT_API: false
PACH_ROOT: /pach
NUM_SHARDS: 16
STORAGE_UPLOAD_CONCURRENCY_LIMIT: 100
STORAGE_BACKEND: MINIO
STORAGE_HOST_PATH:
GOOGLE_BUCKET: <set to the key 'google-bucket' in secret 'pachyderm-storage-secret'> Optional: true
GOOGLE_CRED: <set to the key 'google-cred' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_BUCKET: <set to the key 'amazon-bucket' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_DISTRIBUTION: <set to the key 'amazon-distribution' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_ID: <set to the key 'amazon-id' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_SECRET: <set to the key 'amazon-secret' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_REGION: <set to the key 'amazon-region' in secret 'pachyderm-storage-secret'> Optional: true
AMAZON_TOKEN: <set to the key 'amazon-token' in secret 'pachyderm-storage-secret'> Optional: true
MICROSOFT_CONTAINER: <set to the key 'microsoft-container' in secret 'pachyderm-storage-secret'> Optional: true
MICROSOFT_ID: <set to the key 'microsoft-id' in secret 'pachyderm-storage-secret'> Optional: true
MICROSOFT_SECRET: <set to the key 'microsoft-secret' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_ENDPOINT: <set to the key 'minio-endpoint' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_BUCKET: <set to the key 'minio-bucket' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_SECURE: <set to the key 'minio-secure' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_ID: <set to the key 'minio-id' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_SECRET: <set to the key 'minio-secret' in secret 'pachyderm-storage-secret'> Optional: true
MINIO_SIGNATURE: <set to the key 'minio-signature' in secret 'pachyderm-storage-secret'> Optional: true
PACHD_POD_NAMESPACE: metakube (v1:metadata.namespace)
WORKER_IMAGE: pachyderm/worker:1.9.11
WORKER_SIDECAR_IMAGE: pachyderm/pachd:1.9.11
WORKER_IMAGE_PULL_POLICY: IfNotPresent
PACHD_VERSION: 1.9.11
METRICS: true
LOG_LEVEL: error
BLOCK_CACHE_BYTES: 100Mi
PACHYDERM_AUTHENTICATION_DISABLED_FOR_TESTING: false
Mounts:
/pach from pachdvol (rw)
/pachyderm-storage-secret from pachyderm-storage-secret (rw)
/var/run/secrets/kubernetes.io/serviceaccount from metakube-pachyderm-token-ddz8q (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
pachdvol:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
pachyderm-storage-secret:
Type: Secret (a volume populated by a Secret)
SecretName: pachyderm-storage-secret
Optional: false
metakube-pachyderm-token-ddz8q:
Type: Secret (a volume populated by a Secret)
SecretName: metakube-pachyderm-token-ddz8q
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
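The last termination reason shown in the description above can also be read directly from the pod status. This is a minimal sketch, assuming kubectl is configured against the cluster; the function name last_termination_reason and the pod name in the usage comment are hypothetical, while the namespace metakube and the container name pachd are taken from the description.

```shell
#!/bin/sh
# Print why the pachd container in a pod was last terminated
# (for example "OOMKilled"). Prints nothing if it has never been terminated.
# Arguments: $1 = pod name, $2 = namespace.
last_termination_reason() {
  pod="$1"
  namespace="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="pachd")].lastState.terminated.reason}'
}

# Usage (the pod name is a placeholder, substitute the real one):
#   last_termination_reason pachd-xxxxx metakube
```

Filtering on the container name rather than on index 0 keeps the query correct if the pod ever runs more than one container.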
Then, when running
# smallFile.txt is 1743 B
pachctl start commit input@master
for I in {0..5}; do pachctl put file input@master:/$I -f smallFile.txt & done
pachctl finish commit input@master
or even pachctl put file input@master:/ -i list.txt,
I get a "transport is closing" error because pachd has been OOM killed.
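One detail of the loop above: the put file commands are started in the background, so finish commit can run while some uploads are still in flight. The sketch below wraps the same commands in a function and adds a wait before finishing the commit. The function name repro_parallel_put is hypothetical and this is not a confirmed fix for the OOM, only a way to make the reproduction deterministic.

```shell
#!/bin/sh
# Reproduce the parallel upload, but wait for every background
# "pachctl put file" to exit before finishing the commit.
# Arguments: $1 = repo@branch, $2 = local file, $3 = number of uploads.
repro_parallel_put() {
  repo_branch="$1"
  file="$2"
  count="$3"

  pachctl start commit "$repo_branch" || return 1

  i=0
  while [ "$i" -lt "$count" ]; do
    # Same command as in the report, one background job per target path.
    pachctl put file "$repo_branch:/$i" -f "$file" &
    i=$((i + 1))
  done

  # Block until all background uploads have exited.
  wait

  pachctl finish commit "$repo_branch"
}

# Usage, matching the six uploads in the report:
#   repro_parallel_put input@master smallFile.txt 6
```

If pachd is still OOM killed with the wait in place, the early finish commit can be ruled out as a contributing factor.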
Environment?:
- Pachyderm 1.9.11
(related to Zendesk ticket #52)
gz#52
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (8 by maintainers)
@sinonkt I just reproduced this with the minio storage backend, which looks like it may be the problem. Can you please change your storage backend to the Amazon driver instead and see if you still get the issue? The Amazon driver should work fine with minio; its only limitation is that it doesn’t support S3V2 signatures. You can even replace the storage backend on your running version of Pachyderm. I can guide you through this; open a ticket at pachyderm.zendesk.com or get on the #users channel.
@marcadella Can you also switch to the Amazon storage backend? I can likewise guide you through the conversion if you open a ticket or get on the channel.
@brycemcanally @jdoliner Steps to reproduce.
I’ve put pachd logs, heap profile, and debug dump in gdrive since it’s pretty big,