wandb: Cannot create an Artifact with 7 files of size 975244241 bytes
I have the following script:
import wandb
if __name__ == "__main__":
with wandb.init(
project="private",
entity="private",
tags=["DELETEME"],
) as wandb_run:
artifact = wandb.Artifact("deleteme", type="deleteme")
for epoch in range(100):
print(f"Epoch {epoch}")
with artifact.new_file(f"checkpoint{epoch}", mode="wb") as f:
contents = open("/dev/urandom", "rb").read(975244241)
print(len(contents))
f.write(contents)
Running this script I see the following error:
[nix-shell:/efs/research/lottery]$ python deleteme.py
wandb: Currently logged in as: skainswo. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.12.18 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.16
wandb: Run data is saved locally in /tmp/wandb/run-20220621_073539-11aknk72
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run silvery-microwave-547
wandb: ⭐️ View project at https://wandb.ai/skainswo/playing-the-lottery
wandb: 🚀 View run at https://wandb.ai/skainswo/playing-the-lottery/runs/11aknk72
Epoch 0
975244241
Epoch 1
975244241
Epoch 2
975244241
Epoch 3
975244241
Epoch 4
975244241
Epoch 5
975244241
Epoch 6
975244241
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb: Synced silvery-microwave-547: https://wandb.ai/skainswo/playing-the-lottery/runs/11aknk72
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20220621_073539-11aknk72/logs
Traceback (most recent call last):
File "/efs/research/lottery/deleteme.py", line 16, in <module>
f.write(contents)
OSError: [Errno 28] No space left on device
[nix-shell:/efs/research/lottery]$ df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.3G 0 3.3G 0% /dev
tmpfs 33G 0 33G 0% /dev/shm
tmpfs 17G 7.6M 17G 1% /run
tmpfs 33G 345k 33G 1% /run/wrappers
/dev/disk/by-label/nixos 136G 118G 11G 92% /
<censored IP>:/ 9.3E 2.7G 9.3E 1% /efs
tmpfs 6.5G 263k 6.5G 1% /run/user/1000
[nix-shell:/efs/research/lottery]$
which is incorrect, considering I have more than enough free space left on disk to fit the files. I’ve also replicated this on machine with terabytes of disk space left, and still see the same error.
For the sake of exact reproducibility, I’m running with the following shell.nix:
# Run with nixGL, eg `nixGLNvidia-510.47.03 python cifar10_convnet_run.py --test`
# To prevent JAX from allocating all GPU memory: XLA_PYTHON_CLIENT_PREALLOCATE=false
# To push build to cachix: nix-store -qR --include-outputs $(nix-instantiate shell.nix) | cachix push ploop
let
# pkgs = import (/home/skainswo/dev/nixpkgs) { };
# Last updated: 2022-05-16. Check for new commits at status.nixos.org.
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/556ce9a40abde33738e6c9eac65f965a8be3b623.tar.gz") {
config.allowUnfree = true;
# These actually cause problems for some reason. bug report?
# config.cudaSupport = true;
# config.cudnnSupport = true;
};
in
pkgs.mkShell {
buildInputs = with pkgs; [
ffmpeg
python3
python3Packages.augmax
python3Packages.einops
python3Packages.flax
python3Packages.ipython
python3Packages.jax
# See https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440/14
# as to why we don't use the source builds of jaxlib/tensorflow.
(python3Packages.jaxlib-bin.override {
cudaSupport = true;
})
python3Packages.matplotlib
# python3Packages.pandas
python3Packages.plotly
# python3Packages.scikit-learn
(python3Packages.tensorflow-bin.override {
cudaSupport = false;
})
# Thankfully tensorflow-datasets does not have tensorflow as a propagatedBuildInput. If that were the case for any
# of these dependencies, we'd be in trouble since Python does not like multiple versions of the same package in
# PYTHONPATH.
python3Packages.tensorflow-datasets
python3Packages.tqdm
python3Packages.wandb
yapf
];
# Don't clog EFS with wandb results. Wandb will create and use /tmp/wandb.
WANDB_DIR = "/tmp";
# Don't check this into version control!
WANDB_API_KEY = "... secret ...";
}
In short: I’m running on NixOS 22.05, wandb client version 0.12.16, Python 3.9.12 on a p3.2xlarge EC2 instance. I’ve also replicated the issue on Ubuntu, and with and without a custom WANDB_DIR setting.
Why can’t I save 7 files in an Artifact?
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 19 (3 by maintainers)
I take that back, it writes to
tempfile.TemporaryDirectory()from the python stdlib. You can set _artifact_dir on an instance of artifact to an empty directory on a partition that has sufficient space…